This is the official repository implementing BioBLP, presented in "BioBLP: A Modular Framework for Learning on Multimodal Biomedical Knowledge Graphs", published in the Journal of Biomedical Semantics ([link](https://doi.org/10.1186/s13326-023-00301-y)).
BioBLP is a framework for encoding the diverse multimodal data that can appear in biomedical knowledge graphs. It is based on the idea of learning embeddings for each modality separately, and then combining them into a single multimodal embedding space. The framework is modular and allows for easy integration of new modalities.
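Conceptually, each entity with attribute data is embedded by an encoder for its modality, while entities without attributes fall back to a standard lookup embedding; all encoders map into the same embedding space. Below is a minimal sketch of this dispatching idea (the class and argument names are illustrative, not the actual BioBLP API):

```python
import torch
from torch import nn


class ModularEntityEncoder(nn.Module):
    """Illustrative sketch: route each entity to a modality-specific encoder,
    falling back to a plain lookup embedding for entities without attributes.
    Not the actual BioBLP implementation."""

    def __init__(self, num_entities: int, dim: int, modality_encoders: dict):
        super().__init__()
        # Fallback lookup table for entities without attribute data.
        self.lookup = nn.Embedding(num_entities, dim)
        # e.g. {"disease": text_encoder, "protein": sequence_encoder, ...}
        self.encoders = nn.ModuleDict(modality_encoders)

    def forward(self, entity_ids: torch.Tensor, modality: str = None,
                attributes: torch.Tensor = None) -> torch.Tensor:
        if modality is None:
            # Entities without attributes use the learned lookup embedding.
            return self.lookup(entity_ids)
        # Entities with attributes are embedded by their modality's encoder,
        # so all modalities share a single embedding space of size `dim`.
        return self.encoders[modality](attributes)
```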
To cite our work, please use the following:
```bibtex
@article{bioblp,
  author  = {Daniel Daza and
             Dimitrios Alivanistos and
             Payal Mitra and
             Thom Pijnenburg and
             Michael Cochez and
             Paul Groth},
  title   = {BioBLP: a modular framework for learning on multimodal biomedical
             knowledge graphs},
  journal = {J. Biomed. Semant.},
  volume  = {14},
  number  = {1},
  pages   = {20},
  year    = {2023},
  url     = {https://doi.org/10.1186/s13326-023-00301-y},
  doi     = {10.1186/S13326-023-00301-Y}
}
```
We recommend using Anaconda to manage the dependencies. The following command creates and activates a new conda environment with all the required dependencies:

```sh
conda env create -f environment.yml && conda activate bioblp
```
The data can be downloaded from here as a tar.gz file. This corresponds to our version of BioKG, which has been decoupled from the benchmarks (see the paper for more details), and it also includes the necessary attribute data for proteins, molecules, and diseases.
The file should be placed inside the `data` folder and decompressed:

```sh
tar xzf biokgb.tar.gz
```
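After extraction, the paths referenced by the commands below should be available. Based on the training command in the next section, the layout looks roughly like this (the archive may contain additional files):

```
data/biokgb/
├── graph/
│   ├── biokg.links-train.csv
│   ├── biokg.links-valid.csv
│   └── biokg.links-test.csv
└── properties/
    └── biokg_meshid_to_descr_name.tsv
```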
Use the `bioblp.train` module to train a link prediction model. For example, to train a BioBLP-D model (which encodes disease descriptions) using the RotatE scoring function, run:
```sh
python -m bioblp.train \
    --train_triples=data/biokgb/graph/biokg.links-train.csv \
    --valid_triples=data/biokgb/graph/biokg.links-valid.csv \
    --test_triples=data/biokgb/graph/biokg.links-test.csv \
    --text_data=data/biokgb/properties/biokg_meshid_to_descr_name.tsv \
    --model=rotate --dimension=256 --loss_fn=crossentropy --optimizer=adam \
    --learning_rate=2e-5 --warmup_fraction=0.05 --num_epochs=100 \
    --batch_size=1024 --eval_batch_size=64 --num_negatives=512 --in_batch_negatives=True
```
On an NVIDIA A100 40GB GPU, the above command takes about 9 hours to train.
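For reference, the RotatE scoring function selected with `--model=rotate` represents each relation as a rotation in complex space and scores a triple by how close the rotated head embedding lands to the tail embedding. A minimal sketch of the scoring function (not this repository's actual implementation):

```python
import torch


def rotate_score(head: torch.Tensor, relation: torch.Tensor,
                 tail: torch.Tensor) -> torch.Tensor:
    """RotatE scoring sketch.

    head, tail: complex entity embeddings, shape (batch, dim), dtype=torch.cfloat.
    relation:   rotation phases in radians, shape (batch, dim), real-valued.
    Returns higher scores for more plausible triples."""
    # Turn phases into unit complex numbers e^{i * phase}.
    rotation = torch.polar(torch.ones_like(relation), relation)
    # Score = negative distance between the rotated head and the tail.
    return -torch.linalg.norm(head * rotation - tail, dim=-1)
```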
We use Weights and Biases to log the experiments; logging is disabled by default. To enable it, add `--log_wandb=True` to the command above.
More examples will be added soon.
- Pre-generate the input dataset with flags indicating whether links are known or novel.
- Run `bioblp.benchmarking.preprocess.py` to prepare the benchmark dataset for machine learning (shuffling, creating splits, etc.). `bioblp.benchmarking.featurize.py` can be used to featurize a list of entity pairs into vectors composed from the individual entity vectors, as sketched after the usage example below.
Custom usage:

```sh
python -m bioblp.benchmarking.featurize -i data/benchmarks/processed/dpi_benchmark_p2n-1-10.tsv -o data/features -t kgem -f models/1baon0eg/ -j concatenate
```
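With `-j concatenate`, the feature vector for each (head, tail) pair is built by concatenating the two entities' embedding vectors. A minimal sketch of that idea (the function name is illustrative, not the module's actual API):

```python
import numpy as np


def concatenate_pair_features(pairs, entity_embeddings):
    """Build one feature vector per (head, tail) pair by concatenating the
    two entities' embedding vectors. Illustrative only."""
    return np.stack([
        np.concatenate([entity_embeddings[head], entity_embeddings[tail]])
        for head, tail in pairs
    ])


# Example: two pairs with 4-dimensional entity embeddings
# yield a (2, 8) feature matrix.
emb = {e: np.random.rand(4) for e in ["P12345", "D000123", "Q67890"]}
features = concatenate_pair_features(
    [("P12345", "D000123"), ("Q67890", "D000123")], emb)
print(features.shape)  # (2, 8)
```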