Create KGTK augment command #638
base: dev
@@ -0,0 +1,177 @@
## Summary

This command augments a graph from a KGTK edge file whose `node2` column holds numeric values (floats or dates). Dates in Wikidata format are detected automatically and converted to floats expressed in years.

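To make the date handling concrete, here is a minimal sketch of one way a Wikidata-style date literal could be turned into a fractional year. The function name `to_year_float`, the regular expression, and the day-of-year interpolation are assumptions made for illustration; the conversion actually used by `kgtk augment` may differ.

```python
import calendar
import re

# KGTK/Wikidata-style date literal, e.g. ^1912-09-01T00:00:00Z/11 (assumed format).
_DATE_RE = re.compile(r"^\^(\d{4})-(\d{2})-(\d{2})T")

# Days elapsed before the first day of each month in a non-leap year.
_DAYS_BEFORE_MONTH = [0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334]


def to_year_float(value: str) -> float:
    """Return a plain number unchanged, or a date literal as a fractional year."""
    match = _DATE_RE.match(value)
    if match is None:
        return float(value)  # already numeric
    year, month, day = (int(part) for part in match.groups())
    # Low-precision Wikidata dates may use 00 for month or day; treat those as 1.
    month = max(month, 1)
    day = max(day, 1)
    leap = calendar.isleap(year)
    day_of_year = _DAYS_BEFORE_MONTH[month - 1] + (day - 1)
    if leap and month > 2:
        day_of_year += 1  # account for February 29
    return year + day_of_year / (366 if leap else 365)
```

Under these assumptions a plain float string passes through unchanged, while a date is mapped to its year plus the elapsed fraction of that year.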
### Input File

The input file should be a KGTK Edge file with the following columns or their aliases:

- `node1`: the subject column (source node)
- `label`: the predicate column (property name)
- `node2`: the object column (target node)

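As a rough illustration of what "columns or their aliases" could mean in practice, the sketch below reads a tab-separated edge file and resolves each required column through an alias table. The alias sets and the names `ALIASES` and `read_edges` are assumptions for this example; the KGTK file format specification is the authoritative source for the accepted aliases.

```python
import csv
import io

# Assumed alias table; consult the KGTK file format specification for the real list.
ALIASES = {
    "node1": {"node1", "from", "subject"},
    "label": {"label", "predicate", "relation", "relationship"},
    "node2": {"node2", "to", "object"},
}


def read_edges(handle):
    """Yield (node1, label, node2) tuples from a tab-separated edge file."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    index = {}
    for canonical, names in ALIASES.items():
        matches = [i for i, name in enumerate(header) if name in names]
        if len(matches) != 1:
            raise ValueError(
                f"expected exactly one {canonical} column, found {len(matches)}"
            )
        index[canonical] = matches[0]
    for row in reader:
        yield row[index["node1"]], row[index["label"]], row[index["node2"]]


# Usage: a header that uses aliases instead of the canonical names.
sample = io.StringIO("subject\tpredicate\tobject\n/m/06rf7\tlongitude\t9.70404945\n")
edges = list(read_edges(sample))
```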
### The Output File

For each augmentation mode, the command writes an edge file with the following columns:

Review comment: what extra edges will be added? Example?

- `node1`: the node that the numeric value belongs to
- `label`: the name of the interval property
- `node2`: the range (bin) that the numeric value falls into

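The mapping from a numeric value to an interval edge can be sketched as follows. This is only an illustration under stated assumptions: it uses quantile (equal-frequency) binning from the standard library, and the function name `interval_edges` and the exact string layout are hypothetical. The naming loosely mirrors the example output in the Examples section, but the binning strategies that `kgtk augment` actually implements are not described in this document.

```python
import bisect
import statistics


def interval_edges(values, label, bins):
    """Map {node: number} to (node1, label, node2) interval edges using quantile bins."""
    cuts = statistics.quantiles(values.values(), n=bins)  # bins - 1 cut points
    bounds = [float("-inf"), *cuts, float("inf")]
    edges = []
    for node, number in values.items():
        i = bisect.bisect_right(cuts, number)  # index of the bin containing number
        interval = f"Interval-{label}({bounds[i]}_{bounds[i + 1]})"
        edges.append((node, f"Interval-{label}", interval))
    return edges


# Usage: eight nodes split into four equal-frequency bins.
longitudes = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5, "f": 6, "g": 7, "h": 8}
edges = interval_edges(longitudes, "longitude", bins=4)
```

Each input node yields one extra edge whose `node2` names the bin that its value falls into, so nodes with similar values end up sharing a neighbor in the augmented graph.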
## Usage
```
usage: kgtk augment [-h] [-i INPUT_FILE] [-o OUTPUT_FILE] [--dataset DATASET]
                    [--train-file-name TRAIN_FILE_NAME]
                    [--numerical-literal-name NUM_LITERAL_NAME]
                    [--valid-file-name VALID_FILE_NAME]
                    [--test-file-name TEST_FILE_NAME] [--bins BINS]
                    [--aug_mode AUG_MODE] [--prediction-type PREDICTION_TYPE]
                    [--reverse REVERSE] [--output-path OUTPUT_PATH]
                    [--train-literal-name TRAIN_LITERAL_NAME]
                    [--entity-triple-name ENTITY_TRIPLE_NAME]
                    [--valid-literal-name VALID_LITERAL_NAME]
                    [--test-literal-name TEST_LITERAL_NAME]
                    [--include-original INCLUDE_ORIGINAL]
                    [--old-id-column-name COLUMN_NAME]
                    [--new-id-column-name COLUMN_NAME]
                    [--overwrite-id [optional true|false]]
                    [--verify-id-unique [optional true|false]]
                    [--id-style {compact-prefix,empty,node1-label-node2,node1-label-num,node1-label-node2-num,node1-label-node2-id,prefix###,wikidata,wikidata-with-claim-id}]
                    [--id-prefix PREFIX] [--initial-id INTEGER]
                    [--id-prefix-num-width INTEGER]
                    [--id-concat-num-width INTEGER]
                    [--value-hash-width VALUE_HASH_WIDTH]
                    [--claim-id-hash-width CLAIM_ID_HASH_WIDTH]
                    [--claim-id-column-name CLAIM_ID_COLUMN_NAME]
                    [--id-separator ID_SEPARATOR] [-v [optional True|False]]

Augment Graph File

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT_FILE, --input-file INPUT_FILE
                        The KGTK input file. (May be omitted or '-' for
                        stdin.)
  -o OUTPUT_FILE, --output-file OUTPUT_FILE
                        The KGTK output file. (May be omitted or '-' for
                        stdout.)
  --dataset DATASET     Specify the location of dataset.
  --train-file-name TRAIN_FILE_NAME
                        Specify name for training file
  --numerical-literal-name NUM_LITERAL_NAME
                        Specify name for numerical literal file
  --valid-file-name VALID_FILE_NAME
                        Specify name for valid file
  --test-file-name TEST_FILE_NAME
                        Specify name for test file
  --bins BINS           Specify number of bins to use
  --aug_mode AUG_MODE   Specify name for test file, seperated by comma, or All
                        for using all modes
  --prediction-type PREDICTION_TYPE
                        Specify prediction type to use (lp, np)
  --reverse REVERSE     Specify whether to include reverse links
  --output-path OUTPUT_PATH
                        Specify path to store output files
  --train-literal-name TRAIN_LITERAL_NAME
                        Specify name for training file
  --entity-triple-name ENTITY_TRIPLE_NAME
                        Specify name for entity triple file
  --valid-literal-name VALID_LITERAL_NAME
                        Specify name for valid file
  --test-literal-name TEST_LITERAL_NAME
                        Specify name for test file
  --include-original INCLUDE_ORIGINAL
                        Specify whether to include original edges
  --old-id-column-name COLUMN_NAME
                        The name of the old ID column. (default=id).
  --new-id-column-name COLUMN_NAME
                        The name of the new ID column. (default=id).
  --overwrite-id [optional true|false]
                        When true, replace existing ID values. When false,
                        copy existing ID values. When --overwrite-id is
                        omitted, it defaults to False. When --overwrite-id is
                        supplied without an argument, it is True.
  --verify-id-unique [optional true|false]
                        When true, verify ID uniqueness using an in-memory set
                        of IDs. When --verify-id-unique is omitted, it
                        defaults to False. When --verify-id-unique is supplied
                        without an argument, it is True.
  --id-style {compact-prefix,empty,node1-label-node2,node1-label-num,node1-label-node2-num,node1-label-node2-id,prefix###,wikidata,wikidata-with-claim-id}
                        The ID generation style. (default=prefix###).
  --id-prefix PREFIX    The prefix for a prefix### ID. (default=E).
  --initial-id INTEGER  The initial numeric value for a prefix### ID.
                        (default=1).
  --id-prefix-num-width INTEGER
                        The width of the numeric value for a prefix### ID.
                        (default=1).
  --id-concat-num-width INTEGER
                        The width of the numeric value for a concatenated ID.
                        (default=4).
  --value-hash-width VALUE_HASH_WIDTH
                        How many characters should be used in a value hash?
                        (default=6)
  --claim-id-hash-width CLAIM_ID_HASH_WIDTH
                        How many characters should be used to hash the claim
                        ID? 0 means do not hash the claim ID. (default=8)
  --claim-id-column-name CLAIM_ID_COLUMN_NAME
                        The name of the claim_id column. (default=claim_id)
  --id-separator ID_SEPARATOR
                        The separator user between ID subfields. (default=-)

  -v [optional True|False], --verbose [optional True|False]
                        Print additional progress messages (default=False).
```

Review comments on the usage text:

- `--dataset`: what exactly is the location of dataset?
- `--train-file-name`: All these new parameters need to have longer descriptions
- `--valid-file-name`: ?? I will only add this comment here, add longer description help messages
- `--aug_mode`: what are the options?
- `--prediction-type`: what does lp, np mean? any defaults?

## Examples

### Default (augment only, without prediction)

The following file will be used to illustrate some of the capabilities of `kgtk augment`.

```bash
head -5 examples/docs/augment-FB15K-sample.tsv
```

| node1 | label | node2 |
| -- | -- | -- |
| /m/06rf7 | <http://rdf.freebase.com/ns/location.geocode.longitude> | 9.70404945 |
| /m/06rf7 | <http://rdf.freebase.com/ns/location.geocode.latitude> | 54.20867775 |
| /m/04258w | <http://rdf.freebase.com/ns/people.person.date_of_birth> | 1912.66666667 |
| /m/04258w | <http://rdf.freebase.com/ns/people.deceased_person.date_of_death> | 1997.83333333 |

```bash
kgtk augment --dataset augment-FB15K-sample --output-path fb_augment
```

An example result file:

```bash
head -6 fb_augment/augment_output_QOC_8/output.tsv
```

| node1 | label | node2 |
| -- | -- | -- |
| /m/06rf7 | Interval-<http://rdf.freebase.com/ns/location.geocode.longitude>_right | Interval-<http://rdf.freebase.com/ns/location.geocode.longitude>(11.866667999999999_inf) |
| /m/04p_hy | Interval-<http://rdf.freebase.com/ns/location.geocode.longitude>_right | Interval-<http://rdf.freebase.com/ns/location.geocode.longitude>(-0.00537_-3.603) |
| /m/0c1xm | Interval-<http://rdf.freebase.com/ns/location.geocode.longitude>_right | Interval-<http://rdf.freebase.com/ns/location.geocode.longitude>(-0.00537_-3.603) |
| /m/0c5x_ | Interval-<http://rdf.freebase.com/ns/location.geocode.longitude>_right | Interval-<http://rdf.freebase.com/ns/location.geocode.longitude>(-0.00537_-3.603) |
| /m/0j_1v | Interval-<http://rdf.freebase.com/ns/location.geocode.longitude>_right | Interval-<http://rdf.freebase.com/ns/location.geocode.longitude>(-7.999999_11.866667) |

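
Downstream code may need the numeric bounds back out of the `node2` strings shown in the result table. The following is a small sketch of how that could be done; the helper name `parse_interval` and the assumption that every interval node ends in `(lower_upper)` are inferred from the sample rows only, not from a documented format.

```python
import re

# Matches the trailing "(lower_upper)" part of an interval node.
# Underscores inside the property name are ignored because the match is
# restricted to the final parenthesized group.
_BOUNDS_RE = re.compile(r"\(([^()_]+)_([^()_]+)\)$")


def parse_interval(node2: str):
    """Return the two bounds of an interval node as a tuple of floats."""
    match = _BOUNDS_RE.search(node2)
    if match is None:
        raise ValueError(f"not an interval node: {node2!r}")
    return float(match.group(1)), float(match.group(2))
```

The bounds are returned in the order in which they appear in the string, and `inf` is parsed as positive infinity by `float`.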
### Link / numeric prediction

1. To augment the dataset for link prediction, make sure the directory `data/{dataset}` contains at least the following four files:
   1. `train.txt`: the training entity triples.
   2. `valid.txt`: the validation entity triples.
   3. `test.txt`: the testing entity triples.
   4. `numerical_literals.txt`: the literal triples.

   Once these files are in place, augment the graph with `kgtk augment --dataset {dataset} --bins {bins} --prediction-type lp`.
2. To augment the dataset for numerical prediction, make sure the directory `data/{dataset}` contains at least the following four files:
   1. `train_kge`: the entity triples.
   2. `train_100`: the training literal triples.
   3. `dev`: the validation literal triples.
   4. `test`: the test literal triples.

   Once these files are in place, augment the graph with `kgtk augment --dataset {dataset} --bins {bins} --prediction-type np`.

Review comment: Please fix the grammar. Also I can't understand what this command will do from this description. Please update
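
The file requirements listed for the two prediction types can be checked before running the command. The sketch below is a hypothetical pre-flight helper, not part of KGTK: the names `REQUIRED_FILES` and `missing_files` are made up for illustration, and the file names are copied from the two lists in this section.

```python
from pathlib import Path

# File names taken from this section; "lp" is used with link prediction
# and "np" with numerical prediction.
REQUIRED_FILES = {
    "lp": ["train.txt", "valid.txt", "test.txt", "numerical_literals.txt"],
    "np": ["train_kge", "train_100", "dev", "test"],
}


def missing_files(dataset_dir, prediction_type):
    """Return the required file names that are absent from dataset_dir."""
    directory = Path(dataset_dir)
    return [
        name
        for name in REQUIRED_FILES[prediction_type]
        if not (directory / name).is_file()
    ]
```

An empty list means every file expected for the chosen `--prediction-type` is present in `data/{dataset}`.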