This repo provides the source code & data of our paper "DRAGON: Deep Bidirectional Language-Knowledge Graph Pretraining" (NeurIPS 2022).
DRAGON is a new foundation model (an improvement over BERT) that is pretrained jointly on text and knowledge graphs for improved language, knowledge, and reasoning capabilities. Specifically, it is trained with two simultaneous self-supervised objectives, language modeling and link prediction, that encourage deep bidirectional reasoning over text and knowledge graphs.
DRAGON can be used as a drop-in replacement for BERT. It achieves better performance on various NLP tasks, and is particularly effective for knowledge- and reasoning-intensive tasks such as multi-step reasoning and low-resource QA.
Run the following commands to create a conda environment:
conda create -y -n dragon python=3.8
conda activate dragon
pip install torch==1.10.1+cu113 torchvision -f https://download.pytorch.org/whl/cu113/torch_stable.html
pip install transformers==4.9.1 wandb nltk spacy==2.1.6
python -m spacy download en
pip install scispacy==0.3.0
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.3.0/en_core_sci_sm-0.3.0.tar.gz
pip install torch-scatter==2.0.9 torch-sparse==0.6.12 torch-geometric==2.0.0 -f https://pytorch-geometric.com/whl/torch-1.10.1+cu113.html
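After installation, it can help to confirm that the pinned versions were actually installed. The following is a minimal sketch, not part of the repo: it assumes the distribution names match the pip package names used above, and `check_versions` is a hypothetical helper.

```python
from importlib import metadata

# Versions pinned by the install commands above.
PINNED = {
    "torch": "1.10.1+cu113",
    "transformers": "4.9.1",
    "spacy": "2.1.6",
    "scispacy": "0.3.0",
    "torch-scatter": "2.0.9",
    "torch-sparse": "0.6.12",
    "torch-geometric": "2.0.0",
}


def check_versions(pinned, get_version=metadata.version):
    """Return {package: (expected, found)} for each mismatched or missing package."""
    problems = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed in this environment
        if found != expected:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    for name, (expected, found) in check_versions(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

Run it inside the activated `dragon` environment. No output means every pinned package matches.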
You can download pretrained DRAGON models below. Place the downloaded model files under ./models
Model | Domain | Size | Pretraining Text | Pretraining Knowledge Graph | Download Link |
---|---|---|---|---|---|
DRAGON | General | 360M parameters | BookCorpus | ConceptNet | general_model |
DRAGON | Biomedicine | 360M parameters | PubMed | UMLS | biomed_model |
You can download all the preprocessed data from [here]. This includes the ConceptNet knowledge graph as well as CommonsenseQA, OpenBookQA and RiddleSense datasets. Specifically, run:
wget https://nlp.stanford.edu/projects/myasu/DRAGON/data_preprocessed.zip
unzip data_preprocessed.zip
mv data_preprocessed data
(Optional) If you would like to preprocess the raw data from scratch, you can download the raw data (ConceptNet knowledge graph, CommonsenseQA, OpenBookQA) by running:
./download_raw_data.sh
To preprocess the raw data, run:
CUDA_VISIBLE_DEVICES=0 python preprocess.py -p <num_processes> --run common csqa obqa
You can select the GPU to use by setting CUDA_VISIBLE_DEVICES=... at the beginning of the command. The script will:
- Set up ConceptNet (e.g., extract English relations from ConceptNet, merge the original 42 relation types into 17 types)
- Convert the QA datasets into .jsonl files (e.g., stored in data/csqa/statement/)
- Identify all mentioned concepts in the questions and answers
- Extract subgraphs for each q-a pair
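If you prefer to launch preprocessing from Python rather than the shell, the same command can be assembled as follows. This is a sketch under assumptions: `build_preprocess_call` is a hypothetical helper, and the value `num_processes=4` is only an example. The `-p` and `--run` flags and the target names are the ones shown in the command above.

```python
import os
import subprocess
import sys


def build_preprocess_call(gpu_id, num_processes, targets):
    """Return (command, environment) for one preprocess.py run."""
    env = dict(os.environ)
    # Restrict the child process to one GPU, as the shell prefix does.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    cmd = [sys.executable, "preprocess.py", "-p", str(num_processes), "--run", *targets]
    return cmd, env


# Only launch when executed from the repository root, where preprocess.py lives.
if __name__ == "__main__" and os.path.exists("preprocess.py"):
    cmd, env = build_preprocess_call(
        gpu_id=0, num_processes=4, targets=["common", "csqa", "obqa"]
    )
    subprocess.run(cmd, env=env, check=True)
```

Setting the variable only in the child's environment leaves the GPU visibility of the calling process unchanged.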
You can download all the preprocessed data from [here]. This includes the UMLS biomedical knowledge graph and MedQA dataset.
(Optional) If you would like to preprocess MedQA from scratch, follow utils_biomed/preprocess_medqa.ipynb and then run:
CUDA_VISIBLE_DEVICES=0 python preprocess.py -p <num_processes> --run medqa
The resulting file structure should look like this:
.
├── README.md
├── models/
│   ├── general_model.pt
│   └── biomed_model.pt
└── data/
    ├── cpnet/                 (preprocessed ConceptNet KG)
    ├── csqa/
    │   ├── train_rand_split.jsonl
    │   ├── dev_rand_split.jsonl
    │   ├── test_rand_split_no_answers.jsonl
    │   ├── statement/             (converted statements)
    │   ├── grounded/              (grounded entities)
    │   ├── graphs/                (extracted subgraphs)
    │   └── ...
    ├── obqa/
    ├── umls/                  (preprocessed UMLS KG)
    └── medqa/
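To check a local checkout against this layout before training, a short script along these lines may help. It is a sketch, not part of the repo: `missing_paths` is a hypothetical helper, and the list assumes that `statement/`, `grounded/`, and `graphs/` sit under `data/csqa/`. Trim the list if you only downloaded one model or one domain.

```python
from pathlib import Path

# Paths taken from the file structure listed above.
EXPECTED_PATHS = [
    "models/general_model.pt",
    "models/biomed_model.pt",
    "data/cpnet",
    "data/csqa/statement",
    "data/csqa/grounded",
    "data/csqa/graphs",
    "data/obqa",
    "data/umls",
    "data/medqa",
]


def missing_paths(root, expected=EXPECTED_PATHS):
    """Return the expected paths that do not exist under root."""
    root = Path(root)
    return [p for p in expected if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_paths(".")
    if missing:
        print("Missing:", *missing, sep="\n  ")
    else:
        print("All expected files and directories are present.")
```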
To train DRAGON on CommonsenseQA, OpenBookQA, RiddleSense, or MedQA, run the corresponding script:
scripts/run_train__csqa.sh
scripts/run_train__obqa.sh
scripts/run_train__riddle.sh
scripts/run_train__medqa.sh
(Optional) If you would like to pretrain DRAGON (i.e., self-supervised pretraining), run:
scripts/run_pretrain.sh
As a quick demo, this script uses sentences from CommonsenseQA as training data. If you wish to use a larger, general corpus like BookCorpus, follow Section 5 (Use your own dataset) to prepare the training data.
To evaluate a trained model on CommonsenseQA, OpenBookQA, RiddleSense, or MedQA, run the corresponding script:
scripts/run_eval__csqa.sh
scripts/run_eval__obqa.sh
scripts/run_eval__riddle.sh
scripts/run_eval__medqa.sh
You can download trained model checkpoints in the next section.
CommonsenseQA
Trained model | In-house Dev acc. | In-house Test acc. |
---|---|---|
DRAGON [link] | 0.7928 | 0.7615 |
OpenBookQA
Trained model | Dev acc. | Test acc. |
---|---|---|
DRAGON [link] | 0.7080 | 0.7280 |
RiddleSense
Trained model | In-house Dev acc. | In-house Test acc. |
---|---|---|
DRAGON [link] | 0.6869 | 0.7157 |
MedQA
Trained model | Dev acc. | Test acc. |
---|---|---|
BioLinkBERT + DRAGON [link] | 0.4308 | 0.4768 |
Note: The models were trained and tested with HuggingFace transformers==4.9.1.
- Convert your dataset to {train,dev,test}.statement.jsonl in .jsonl format (see data/csqa/statement/train.statement.jsonl)
- Create a directory in data/{yourdataset}/ to store the .jsonl files
- Modify preprocess.py and perform subgraph extraction for your data
- Modify utils/parser_utils.py to support your own dataset
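The first step above can be sketched as follows for a multiple-choice dataset. The record field names (`id`, `answerKey`, `question`, `statements`) and the way each statement is formed are assumptions for illustration only. Compare them against data/csqa/statement/train.statement.jsonl and adjust before use. `to_statement_record` and `write_jsonl` are hypothetical helpers.

```python
import json


def to_statement_record(qid, stem, choices, answer_index):
    """Build one record for a multiple-choice question.

    The field names are assumptions; match them to the reference file.
    """
    labels = [chr(ord("A") + i) for i in range(len(choices))]
    return {
        "id": qid,
        "answerKey": labels[answer_index],
        "question": {
            "stem": stem,
            "choices": [
                {"label": label, "text": text}
                for label, text in zip(labels, choices)
            ],
        },
        # One statement per answer choice; only the correct one is labeled True.
        "statements": [
            {"label": i == answer_index, "statement": f"{stem} {text}"}
            for i, text in enumerate(choices)
        ],
    }


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
```

Call `write_jsonl` once per split to produce the train, dev, and test files in your dataset directory.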
If you find our work helpful, please cite the following:
@InProceedings{yasunaga2022dragon,
author = {Michihiro Yasunaga and Antoine Bosselut and Hongyu Ren and Xikun Zhang and Christopher D. Manning and Percy Liang and Jure Leskovec},
title = {Deep Bidirectional Language-Knowledge Graph Pretraining},
year = {2022},
booktitle = {Neural Information Processing Systems (NeurIPS)},
}
This repo is built upon the following works:
GreaseLM: Graph REASoning Enhanced Language Models for Question Answering
https://github.com/snap-stanford/GreaseLM
QA-GNN: Question Answering using Language Models and Knowledge Graphs
https://github.com/michiyasunaga/qagnn