Dataset Submission Guideline
This wiki page provides a guideline for contributing a dataset. It first introduces the GLI file structure required for each dataset, then gives step-by-step instructions for dataset submission.
The GLI files consist of files that the contributor creates manually and files that are automatically generated by calling GLI helper functions. The list of GLI files is illustrated in the figure below, with more detailed explanations available in the following sub-sections.
The data conversion script is a piece of code in which the contributor preprocesses the raw data and calls the GLI helper functions to convert the dataset into the GLI format. Here are some examples of the GLI helper functions provided in `gli.io`:
`gli.io.save_graph()`
- This function takes arguments such as the name of the dataset, the edges, and the node/edge/graph attributes, and automatically saves the graph data into files in GLI format.
`gli.io.save_task_node_classification()`
- This function takes arguments such as the name of the dataset, the target node attribute used as the prediction label, and the train/valid/test splits, and automatically saves the node classification task information into files in GLI format.
`gli.io.save_task_time_dependent_link_prediction()`
- This function takes arguments such as the name of the dataset, the edge attribute that stores the formation time, and the time cutoffs for the train/valid/test splits, and automatically saves the task information for link prediction using time cutoffs into files in GLI format.
And more ...
- As you may have noticed, a notable feature of GLI is that we store the graph data and the task information separately.
- The `gli.io.save_graph()` function can accommodate a wide range of graphs, including heterogeneous graphs and dynamic graphs.
- We have also included individual helper functions to accommodate different graph learning tasks.
Please see `cora.ipynb` as an example of the conversion script. The conversion script should be named `<dataset name>.ipynb` or `<dataset name>.py`.
The conversion script, by calling the GLI helper functions, will generate two groups of files in GLI format: GLI Data Storage Files and GLI Task Configuration Files.
GLI Data Storage Files
- Any information in the dataset that is NOT specific to a task should be stored in these files. Examples include the graph structure and node/edge/graph attributes.
- This group of files consists of a `metadata.json` file that serves as an index, and one or more `.npz` files that actually store the graph data.
- These files are generated by calling `gli.io.save_graph()`.
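To make this concrete, here is a minimal sketch of preparing graph data for `gli.io.save_graph()`. The argument names (`name`, `edge`, the attribute dictionary) and the attribute keys are illustrative assumptions, not the verified signature; consult the `gli.io` docstrings for the actual arguments.

```python
import numpy as np

# A toy graph: 4 nodes, 3 edges, an 8-dim feature and a class label per node.
# The array shapes below are assumptions for illustration.
edge = np.array([[0, 1], [1, 2], [2, 3]])   # one (src, dst) pair per row
node_feats = np.random.rand(4, 8)           # 4 nodes x 8-dim features
node_label = np.array([0, 1, 1, 0])         # a per-node label attribute

try:
    # Available only when working inside the GLI repository.
    from gli.io import save_graph

    # Hypothetical call; argument names are assumptions for illustration.
    save_graph(
        name="toy-dataset",
        edge=edge,
        node_attrs={"NodeFeature": node_feats, "NodeLabel": node_label},
    )
except ImportError:
    pass  # gli is not installed in this environment
```

Running the script would produce the `metadata.json` index and the `.npz` data files described above.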
GLI Task Configuration Files
- Any information that is specific to a task should be stored in these files. Examples include the task type, the data splits, which node attribute serves as the node label in a node classification task, and the number of classes.
- For each graph learning task defined on the dataset, this group of files consists of a `task_<task type>_<task id>.json` file that serves as an index, and one or more `.npz` files that store the task-specific data.
- These files are generated by calling `gli.io.save_task_<task type>()`.
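As a sketch of the task side, the snippet below prepares node splits and passes them to `gli.io.save_task_node_classification()`. All argument names and the `Node/...` attribute paths here are assumptions for illustration, not the verified signature; check the `gli.io` docstrings before use.

```python
import numpy as np

# Illustrative train/valid/test splits over the 4 node IDs of the toy graph.
train_set = np.array([0, 1])
val_set = np.array([2])
test_set = np.array([3])

try:
    from gli.io import save_task_node_classification

    # Hypothetical call; argument names and attribute paths are assumptions.
    save_task_node_classification(
        name="toy-dataset",
        feature=["Node/NodeFeature"],  # which attribute(s) to use as input
        target="Node/NodeLabel",       # which attribute is the prediction label
        num_classes=2,
        train_set=train_set,
        val_set=val_set,
        test_set=test_set,
    )
except ImportError:
    pass  # gli is not installed in this environment
```

Running the script would produce the `task_<task type>_<task id>.json` index and the split `.npz` files described above.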
In addition to the data conversion script, contributors should also manually create a `README.md` and a `LICENSE` for the dataset.
`README.md`
- A document that contains the necessary information about the dataset and task(s), including the description, citation(s), available task(s), and any extra packages required by `<dataset name>.ipynb` or `<dataset name>.py`. See this template for details.
`LICENSE`
- The license associated with the dataset (rather than with the data conversion code).
Notes:
- Please find the link to the GLI cloud storage in Step 4 here.
- You can run `make pytest DATASET=<dataset name>` at the root of the repository to test your dataset implementation locally.
- We also have a more detailed dataset submission workflow for your reference.