Create KGTK augment command #638

Open
wants to merge 34 commits into
base: dev
Commits
ecd2cd7
Merge pull request #16 from usc-isi-i2/dev
GrantXie Feb 3, 2022
b6cc57c
Add files via upload
GrantXie Feb 3, 2022
5f0f65e
Add files via upload
GrantXie Feb 3, 2022
c4c3428
Add files via upload
GrantXie Feb 3, 2022
1039555
Create augment.md
GrantXie Feb 10, 2022
13a3fd2
Update augment.md
GrantXie Feb 10, 2022
243806c
Update augment.md
GrantXie Feb 10, 2022
27244cd
Update augment files
GrantXie Feb 10, 2022
92cc3cc
Update augment command
GrantXie Feb 10, 2022
50dc913
Delete augment.md
GrantXie Feb 10, 2022
1901ef7
Create augment.md
GrantXie Feb 10, 2022
37f6670
Update mkdocs.yml
GrantXie Feb 10, 2022
e24897a
Update augment.md
GrantXie Feb 10, 2022
1e7bce1
Add files via upload
GrantXie Feb 10, 2022
81806aa
Delete augment-FB15K-samele.tsv
GrantXie Feb 10, 2022
616169d
Add files via upload
GrantXie Feb 10, 2022
897b205
pep8
GrantXie Feb 10, 2022
26b26c8
pep8
GrantXie Feb 10, 2022
76d945f
pep8
GrantXie Feb 10, 2022
f0ed0ab
pep8
GrantXie Feb 10, 2022
608d2af
pep8
GrantXie Feb 10, 2022
dc2c73a
pep8
GrantXie Feb 10, 2022
8107acd
Update augment.md
GrantXie Feb 10, 2022
98dcdea
Update augment.md
GrantXie Feb 10, 2022
6521edd
Update augment.md
GrantXie Feb 10, 2022
729c6ed
Add files via upload
GrantXie Feb 17, 2022
0ac73db
pep8
GrantXie Feb 17, 2022
fa33ea0
Update augment.md
GrantXie Feb 17, 2022
1a3307b
Add files via upload
GrantXie Feb 17, 2022
b954642
Update augment.md
GrantXie Feb 17, 2022
2765d1c
Update test_augment.py
GrantXie Feb 17, 2022
eb78b3b
Update test_augment.py
GrantXie Feb 22, 2022
291b688
Update test_augment.py
GrantXie Feb 22, 2022
2427413
Update test_augment.py
GrantXie Feb 22, 2022
177 changes: 177 additions & 0 deletions docs/transform/augment.md
@@ -0,0 +1,177 @@
## Summary

This command will augmented graph from a KGTK Edge file with numeric value in float (or date) on node2. This command will automatically detect date in wikidata format and transform it to float in year
Review comment (Member):
Please fix the grammar. Also I can't understand what this command will do from this description. Please update
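
As a rough illustration of the date handling the summary refers to, the sketch below converts a Wikidata-style date string into a fractional year. The function name `date_to_float_year` and the `year + (month - 1) / 12` rule are assumptions made for this example and are not taken from the command's implementation; the rule merely agrees with sample values such as `1912.66666667`.

```python
import re

# Matches dates such as "^1912-09-01T00:00:00Z/11". The leading "^" and the
# "/precision" suffix are treated as optional in this sketch.
_DATE_RE = re.compile(r"^\^?([+-]?\d+)-(\d{2})-(\d{2})T")


def date_to_float_year(value: str) -> float:
    """Convert a Wikidata-style date string to a fractional year (assumed rule)."""
    match = _DATE_RE.match(value)
    if match is None:
        raise ValueError(f"not a recognised date: {value!r}")
    year = int(match.group(1))
    # Wikidata uses month 00 when the month is unknown; treat it as January.
    month = max(int(match.group(2)), 1)
    return year + (month - 1) / 12


print(date_to_float_year("^1912-09-01T00:00:00Z/11"))
```

Day-of-month is ignored here to keep the sketch short; a real conversion may well use it.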


### Input File

The input file should be a KGTK Edge file with the following columns or their aliases:

- `node1`: the subject column (source node)
- `label`: the predicate column (property name)
- `node2`: the object column (target node)
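
The column requirement above can be checked before running the command. The following is a minimal sketch using only the standard library; the alias sets and the helper name `resolve_columns` are assumptions for illustration and should be checked against the KGTK file format specification.

```python
import csv
import io

# Alias sets are an assumption for this sketch; verify them against the
# KGTK file format specification before relying on them.
ALIASES = {
    "node1": {"node1", "from", "subject"},
    "label": {"label", "predicate", "relation", "relationship"},
    "node2": {"node2", "to", "object"},
}


def resolve_columns(header):
    """Map each required column name to its index in the header row."""
    resolved = {}
    for canonical, names in ALIASES.items():
        matches = [i for i, column in enumerate(header) if column in names]
        if len(matches) != 1:
            raise ValueError(
                f"expected exactly one {canonical} column, found {len(matches)}"
            )
        resolved[canonical] = matches[0]
    return resolved


# A tiny in-memory edge file that uses aliases instead of the canonical names.
sample = "subject\tpredicate\tobject\n/m/06rf7\tlongitude\t9.70404945\n"
reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
columns = resolve_columns(next(reader))
print(columns)
```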



### The Output File

The output file is an edge file for each mode that contains the following columns:
Review comment (Member):
what extra edges will be added? Example?


- `node1`: the node that the numeric value belongs to
- `label`: the name of the interval
- `node2`: the range (bin) that contains the numeric value
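
To make the three output columns concrete, here is a hedged sketch of how numeric edges could be turned into interval edges. It is not the command's implementation: the function name `bin_edges`, the use of quantile cut points, and the exact label and range strings are assumptions, and the `_right` suffix seen in the real output is not reproduced.

```python
from bisect import bisect_right
from collections import defaultdict
from statistics import quantiles


def bin_edges(edges, bins=4):
    """Turn (node1, label, numeric value) edges into interval edges.

    Cut points are per-label quantiles, which is an assumption of this sketch.
    """
    by_label = defaultdict(list)
    for node1, label, value in edges:
        by_label[label].append((node1, float(value)))

    out = []
    for label, pairs in by_label.items():
        values = [v for _, v in pairs]
        # quantiles() needs at least two data points.
        cuts = quantiles(values, n=bins) if len(values) > 1 else []
        bounds = [float("-inf"), *cuts, float("inf")]
        for node1, value in pairs:
            i = bisect_right(cuts, value)  # index of the bin holding value
            out.append((
                node1,
                f"Interval-{label}",
                f"Interval-{label}({bounds[i]}_{bounds[i + 1]})",
            ))
    return out


edges = [("a", "p", 1), ("b", "p", 2), ("c", "p", 3), ("d", "p", 4)]
interval_edges = bin_edges(edges, bins=2)
for edge in interval_edges:
    print("\t".join(edge))
```

Each input edge yields one interval edge whose `node2` names the bin the value falls into, which is the shape of output described above.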


## Usage
```
usage: kgtk augment [-h] [-i INPUT_FILE] [-o OUTPUT_FILE] [--dataset DATASET]
[--train-file-name TRAIN_FILE_NAME]
[--numerical-literal-name NUM_LITERAL_NAME]
[--valid-file-name VALID_FILE_NAME]
[--test-file-name TEST_FILE_NAME] [--bins BINS]
[--aug_mode AUG_MODE] [--prediction-type PREDICTION_TYPE]
[--reverse REVERSE] [--output-path OUTPUT_PATH]
[--train-literal-name TRAIN_LITERAL_NAME]
[--entity-triple-name ENTITY_TRIPLE_NAME]
[--valid-literal-name VALID_LITERAL_NAME]
[--test-literal-name TEST_LITERAL_NAME]
[--include-original INCLUDE_ORIGINAL]
[--old-id-column-name COLUMN_NAME]
[--new-id-column-name COLUMN_NAME]
[--overwrite-id [optional true|false]]
[--verify-id-unique [optional true|false]]
[--id-style {compact-prefix,empty,node1-label-node2,node1-label-num,node1-label-node2-num,node1-label-node2-id,prefix###,wikidata,wikidata-with-claim-id}]
[--id-prefix PREFIX] [--initial-id INTEGER]
[--id-prefix-num-width INTEGER]
[--id-concat-num-width INTEGER]
[--value-hash-width VALUE_HASH_WIDTH]
[--claim-id-hash-width CLAIM_ID_HASH_WIDTH]
[--claim-id-column-name CLAIM_ID_COLUMN_NAME]
[--id-separator ID_SEPARATOR] [-v [optional True|False]]

Augment Graph File

optional arguments:
-h, --help show this help message and exit
-i INPUT_FILE, --input-file INPUT_FILE
The KGTK input file. (May be omitted or '-' for
stdin.)
-o OUTPUT_FILE, --output-file OUTPUT_FILE
The KGTK output file. (May be omitted or '-' for
stdout.)
--dataset DATASET Specify the location of dataset.
Review comment (Member):
what exactly is the location of dataset?

--train-file-name TRAIN_FILE_NAME
Review comment (Member):
All these new parameters need to have longer descriptions

Specify name for training file
--numerical-literal-name NUM_LITERAL_NAME
Specify name for numerical literal file
--valid-file-name VALID_FILE_NAME
Review comment (Member):
?? I will only add this comment here, add longer description help messages

Specify name for valid file
--test-file-name TEST_FILE_NAME
Specify name for test file
--bins BINS Specify number of bins to use
--aug_mode AUG_MODE Specify name for test file, seperated by comma, or All
Review comment (Member):
what are the options?

for using all modes
--prediction-type PREDICTION_TYPE
Review comment (Member):
what does lp, np mean? any defaults?

Specify prediction type to use (lp, np)
--reverse REVERSE Specify whether to include reverse links
--output-path OUTPUT_PATH
Specify path to store output files
--train-literal-name TRAIN_LITERAL_NAME
Specify name for training file
--entity-triple-name ENTITY_TRIPLE_NAME
Specify name for entity triple file
--valid-literal-name VALID_LITERAL_NAME
Specify name for valid file
--test-literal-name TEST_LITERAL_NAME
Specify name for test file
--include-original INCLUDE_ORIGINAL
Specify whether to include original edges
--old-id-column-name COLUMN_NAME
The name of the old ID column. (default=id).
--new-id-column-name COLUMN_NAME
The name of the new ID column. (default=id).
--overwrite-id [optional true|false]
When true, replace existing ID values. When false,
copy existing ID values. When --overwrite-id is
omitted, it defaults to False. When --overwrite-id is
supplied without an argument, it is True.
--verify-id-unique [optional true|false]
When true, verify ID uniqueness using an in-memory set
of IDs. When --verify-id-unique is omitted, it
defaults to False. When --verify-id-unique is supplied
without an argument, it is True.
--id-style {compact-prefix,empty,node1-label-node2,node1-label-num,node1-label-node2-num,node1-label-node2-id,prefix###,wikidata,wikidata-with-claim-id}
The ID generation style. (default=prefix###).
--id-prefix PREFIX The prefix for a prefix### ID. (default=E).
--initial-id INTEGER The initial numeric value for a prefix### ID.
(default=1).
--id-prefix-num-width INTEGER
The width of the numeric value for a prefix### ID.
(default=1).
--id-concat-num-width INTEGER
The width of the numeric value for a concatenated ID.
(default=4).
--value-hash-width VALUE_HASH_WIDTH
How many characters should be used in a value hash?
(default=6)
--claim-id-hash-width CLAIM_ID_HASH_WIDTH
How many characters should be used to hash the claim
ID? 0 means do not hash the claim ID. (default=8)
--claim-id-column-name CLAIM_ID_COLUMN_NAME
The name of the claim_id column. (default=claim_id)
--id-separator ID_SEPARATOR
The separator user between ID subfields. (default=-)

-v [optional True|False], --verbose [optional True|False]
Print additional progress messages (default=False).
```

## Examples


### Default (augment only, without prediction)

The following file will be used to illustrate some of the capabilities of `kgtk augment`.

```bash
head -5 examples/docs/augment-FB15K-sample.tsv
```

| node1 | label | node2 |
| -- | -- | -- |
|/m/06rf7 |<http://rdf.freebase.com/ns/location.geocode.longitude>| 9.70404945|
|/m/06rf7 |<http://rdf.freebase.com/ns/location.geocode.latitude>| 54.20867775|
|/m/04258w| <http://rdf.freebase.com/ns/people.person.date_of_birth>| 1912.66666667|
|/m/04258w| <http://rdf.freebase.com/ns/people.deceased_person.date_of_death>| 1997.83333333|


```bash
kgtk augment --dataset augment-FB15K-sample --output-path fb_augment
```

An example result file

```bash
head -6 fb_augment/augment_output_QOC_8/output.tsv
```

|node1 |label|node2 |
|---------------------------------------------|-----|---------|
|/m/06rf7| Interval-<http://rdf.freebase.com/ns/location.geocode.longitude>_right| Interval-<http://rdf.freebase.com/ns/location.geocode.longitude>(11.866667999999999_inf)|
|/m/04p_hy| Interval-<http://rdf.freebase.com/ns/location.geocode.longitude>_right| Interval-<http://rdf.freebase.com/ns/location.geocode.longitude>(-0.00537_-3.603)|
|/m/0c1xm |Interval-<http://rdf.freebase.com/ns/location.geocode.longitude>_right| Interval-<http://rdf.freebase.com/ns/location.geocode.longitude>(-0.00537_-3.603)|
|/m/0c5x_ |Interval-<http://rdf.freebase.com/ns/location.geocode.longitude>_right |Interval-<http://rdf.freebase.com/ns/location.geocode.longitude>(-0.00537_-3.603)|
|/m/0j_1v |Interval-<http://rdf.freebase.com/ns/location.geocode.longitude>_right| Interval-<http://rdf.freebase.com/ns/location.geocode.longitude>(-7.999999_11.866667)|
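
The `node2` values in the table above appear to follow the pattern `Interval-<property>(<bound>_<bound>)`. Assuming that pattern (it is inferred from these sample rows only, not from a specification), a downstream script could recover the property and the two bounds as sketched below; `parse_interval` is a hypothetical helper name.

```python
import re

# Pattern inferred from the sample rows; the bounds themselves never contain
# "_" or parentheses, so the property may contain underscores safely.
_INTERVAL_RE = re.compile(r"^Interval-(.+)\(([^_()]+)_([^_()]+)\)$")


def parse_interval(node2):
    """Split an interval node into (property, first bound, second bound)."""
    match = _INTERVAL_RE.match(node2.strip())
    if match is None:
        raise ValueError(f"not an interval node: {node2!r}")
    return match.group(1), float(match.group(2)), float(match.group(3))


prop, first, second = parse_interval(
    "Interval-<http://rdf.freebase.com/ns/location.geocode.longitude>"
    "(11.866667999999999_inf)"
)
print(prop, first, second)
```

The bounds are returned in the order they appear in the string; the sketch does not assume which of the two is larger.
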

### Link / Numeric Prediction

1. To augment the dataset for link prediction, make sure the directory `data/{dataset}` contains at least four files:
    1. `train.txt`: The training entity triples.
    2. `valid.txt`: The validation entity triples.
    3. `test.txt`: The testing entity triples.
    4. `numerical_literals.txt`: The literal triples.

    Once you have the above files, augment the graph with `kgtk augment --dataset {dataset} --bins {bins} --prediction-type lp`.
2. To augment the dataset for numeric prediction, make sure the directory `data/{dataset}` contains at least four files:
    1. `train_kge`: The entity triples.
    2. `train_100`: The training literal triples.
    3. `dev`: The validation literal triples.
    4. `test`: The test literal triples.

    Once you have the above files, augment the graph with `kgtk augment --dataset {dataset} --bins {bins} --prediction-type np`.
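
Before running either variant, it may help to confirm that the expected files are present. The sketch below encodes the two file lists from the steps above; the helper name `missing_files` and the `root` parameter are assumptions for illustration and are not part of `kgtk augment`.

```python
import tempfile
from pathlib import Path

# File names taken from the lists above, keyed by --prediction-type value.
REQUIRED_FILES = {
    "lp": ["train.txt", "valid.txt", "test.txt", "numerical_literals.txt"],
    "np": ["train_kge", "train_100", "dev", "test"],
}


def missing_files(dataset, prediction_type, root="data"):
    """Return the required file names that are absent from root/dataset."""
    directory = Path(root) / dataset
    return [
        name
        for name in REQUIRED_FILES[prediction_type]
        if not (directory / name).is_file()
    ]


# Demonstration against a throwaway directory holding only train.txt.
with tempfile.TemporaryDirectory() as root:
    dataset_dir = Path(root) / "demo"
    dataset_dir.mkdir()
    (dataset_dir / "train.txt").touch()
    missing = missing_files("demo", "lp", root=root)
    print(missing)
```

An empty list means the directory is ready for the corresponding `--prediction-type`.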