The Biomedical Entity Linking Benchmark (BELB) is a collection of datasets and knowledge bases to train and evaluate biomedical entity linking models.
If you use BELB in your work, please cite:

```bibtex
@article{10.1093/bioinformatics/btad698,
    author = {Garda, Samuele and Weber-Genzel, Leon and Martin, Robert and Leser, Ulf},
    title = {{BELB}: a {B}iomedical {E}ntity {L}inking {B}enchmark},
    journal = {Bioinformatics},
    pages = {btad698},
    year = {2023},
    month = {11},
    issn = {1367-4811},
    doi = {10.1093/bioinformatics/btad698},
    url = {https://doi.org/10.1093/bioinformatics/btad698},
    eprint = {https://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btad698/53483107/btad698.pdf},
}
```
Knowledge base | Entity | Public | Versioned | Website | Download |
---|---|---|---|---|---|
NCBI Gene | Gene | ✅ | ❌ | homepage | kb, history |
NCBI Taxonomy | Species | ✅ | ❌ | homepage | kb, history |
CTD Diseases (MEDIC) | Disease | ✅ | ❌ | homepage | kb |
CTD Chemicals | Chemical | ✅ | ❌ | homepage | kb |
dbSNP | Variant | ✅ | ✅ | homepage | kb, history |
Cellosaurus | Cell line | ✅ | ❌ | homepage | kb, history |
UMLS | General | ❌ | ✅ | homepage | - |
Corpus | Entity | Public | Website | Download |
---|---|---|---|---|
GNormPlus (improved BC2) | Gene | ✅ | homepage | link |
NLM-Gene | Gene | ✅ | homepage | link |
NCBI-Disease | Disease | ✅ | homepage | link |
BC5CDR | Disease, Chemical | ✅ | homepage | link |
NLM-Chem | Chemical | ✅ | homepage | link |
Linnaeus | Species | ✅ | homepage | link |
S800 | Species | ✅ | homepage | link |
BioID | Cell, Species, Gene | ✅ | homepage | link |
Osiris | Gene, Variant | ✅ | homepage | link |
Thomas2011 | Variant | ✅ | homepage | link |
tmVar (v3) | Gene, Species, Variant | ✅ | homepage | link |
MedMentions | UMLS | ✅ | homepage | link |
We assume that all data is stored in a single directory.
This reduces flexibility, but given how inter-connected the data is (corpora and KBs), it is a trade-off that eases accessibility.
Download the PubTator raw data (compressed: ~19GB) and the PMCID->PMID mapping (compressed: ~155MB).
These are needed to add annotations to certain corpora and to add text to those corpora which provide only annotations.
```bash
mkdir -p <PUBTATOR>
cd <PUBTATOR>
wget https://ftp.ncbi.nlm.nih.gov/pub/lu/PubTatorCentral/bioconcepts2pubtatorcentral.offset.gz
wget https://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz
python -m scripts.build_pubtator \
    --pubtator <PUBTATOR>/bioconcepts2pubtatorcentral.offset.gz \
    --pmicid_pmid <PUBTATOR>/PMC-ids.csv.gz \
    --output pubtator.db \
    --overwrite
```
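For reference, the PubTator offset format ingested above stores, per document, a title line (`PMID|t|…`), an abstract line (`PMID|a|…`), and tab-separated annotation lines (PMID, start, end, mention, type, identifier). A minimal standalone parser sketch of this format (not BELB's own loader, which lives in `scripts.build_pubtator`):

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    pmid: str
    start: int
    end: int
    text: str
    type: str
    identifier: str


def parse_pubtator(lines):
    """Parse documents from PubTator offset format into a dict keyed by PMID."""
    docs = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if "|t|" in line or "|a|" in line:
            # title/abstract lines: PMID|t|text or PMID|a|text
            pmid, field, text = line.split("|", 2)
            doc = docs.setdefault(pmid, {"title": "", "abstract": "", "annotations": []})
            doc["title" if field == "t" else "abstract"] = text
        else:
            # annotation lines: pmid, start, end, mention, type, identifier (tab-separated)
            parts = line.split("\t")
            if len(parts) >= 6:
                pmid, start, end, text, atype, ident = parts[:6]
                docs[pmid]["annotations"].append(
                    Annotation(pmid, int(start), int(end), text, atype, ident)
                )
    return docs
```

Offsets are character positions into the concatenated title and abstract, so `("A title")[2:7]` recovers the mention `"title"` in the example below.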
All knowledge bases will be downloaded automatically for you, with two exceptions: dbSNP and UMLS.
As dbSNP is a large resource (>100GB), it is best to fetch it in a separate process.
Essentially it boils down to:
```bash
mkdir -p <DBSNP>
cd <DBSNP>
echo "Fetch dbSNP latest release..."
wget --continue "ftp://ftp.ncbi.nlm.nih.gov/snp/redesign/latest_release/JSON/refsnp-chr*.bz2"
wget --continue "ftp://ftp.ncbi.nlm.nih.gov/snp/redesign/latest_release/JSON/refsnp-unsupported.json.bz2"
wget --continue "ftp://ftp.ncbi.nlm.nih.gov/snp/redesign/latest_release/JSON/refsnp-withdrawn.json.bz2"
echo "Identify corrupted files: please delete and re-download every file that fails the test..."
find . -name "*.bz2" -exec bunzip2 --test {} \;
```
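Each `refsnp-chr*.bz2` file contains one JSON object per line. As a quick sanity check on a downloaded file, you can stream-decompress it and pull out the rsIDs without unpacking the whole dump; a sketch, assuming the `refsnp_id` field name from the dbSNP JSON schema:

```python
import bz2
import json


def iter_rsids(path):
    """Stream a dbSNP JSON dump (one JSON object per line) and yield rsIDs,
    without loading the whole file into memory."""
    with bz2.open(path, "rt") as handle:
        for line in handle:
            record = json.loads(line)
            # "refsnp_id" is the identifier field in dbSNP's JSON schema
            yield record["refsnp_id"]
```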
See here for more details.
See here for more details on how to request a license.
You need to download the 2017AA full version, as this is the one used by the MedMentions corpus.
In principle the parser should work with later versions too: it expects as input a folder (usually called `META`) containing the files `MRCONSO.RRF` and `MRCUI.RRF`.
The 2017AA release is the last one that does not provide direct access to the UMLS raw data ("Metathesaurus Files").
To access the data without setting up a MySQL database you can do the following:
```bash
unzip umls-2017AA-full.zip
cd 2017AA-full
# poorly disguised zip files...
unzip 2017aa-1-meta.nlm
unzip 2017aa-2-meta.nlm
cd 2017AA/META
gunzip MRCONSO.RRF.aa.gz MRCONSO.RRF.ab.gz MRCUI.RRF.gz
cat MRCONSO.RRF.aa MRCONSO.RRF.ab > MRCONSO.RRF
```
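`MRCONSO.RRF` is pipe-delimited ("Rich Release Format"). To spot-check the reconstructed file, a small reader that extracts concept ID, language, source vocabulary, and term string can be sketched as follows (column positions follow the UMLS MRCONSO layout: CUI at column 0, LAT at 1, SAB at 11, STR at 14):

```python
import csv


def iter_mrconso(path):
    """Yield (cui, lat, sab, string) tuples from an MRCONSO.RRF file.

    RRF files are pipe-delimited with no quoting; per the UMLS MRCONSO
    column layout, CUI is column 0, LAT (language) column 1,
    SAB (source vocabulary) column 11, and STR (term string) column 14.
    """
    with open(path, encoding="utf-8") as handle:
        for row in csv.reader(handle, delimiter="|", quoting=csv.QUOTE_NONE):
            yield row[0], row[1], row[11], row[14]
```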
Once you have downloaded these two resources you can launch the script:

```bash
python -m belb.scripts.build_kbs --dir <BELB> --cores 20 --umls <path/to/umls/META> --dbsnp <path/to/dbsnp>
```
This will fetch the data for all other KBs, convert it to a unified schema, and store it as TSV files.
Each KB can be processed individually with its corresponding module, e.g.:

```bash
python -m belb.kbs.umls \
    --dir /belb/directory \
    --data_dir /path/to/umls/data \
    --db ./db.yaml
```
By default all KBs are stored as sqlite databases.
The `db.yaml` file can be edited to your liking if you wish to store the data in a database server instead.
This feature is only partially tested and supports only Postgres.
Once all KBs are ready you can create all benchmark corpora via:

```bash
python -m belb.scripts.build_corpora --dir <BELB> --pubtator <BELB>/pubtator/pubtator.db
```
Similarly to KBs, you can also create a single corpus:

```bash
python -m belb.corpora.ncbi_disease --dir /belb/directory --sentences
```

This will fetch the NCBI Disease corpus, preprocess it, split the text into sentences (`--sentences`), and store it in the BELB directory.
Every resource (corpus, KB) is represented by a module which acts as a standalone script as well. This means you can programmatically access a resource:

```python
from belb.kbs.kb import BelbKb
from belb.kbs.ncbi_gene import NcbiGeneKbConfig
from belb.corpora.nlm_gene import NlmGeneCorpusParser
```
For ease of access we provide two classes to instantiate corpora and KBs, respectively, simply by providing an identifying name (a poor reproduction of the `Auto*` classes in the transformers library).
```python
from belb import AutoBelbCorpus, AutoBelbKb
from belb.resources import Corpora, Kbs

corpus = AutoBelbCorpus.from_name(directory="path_to_belb", name=Corpora.NCBI_DISEASE.name)
kb = AutoBelbKb.from_name(directory="path_to_belb", name=Kbs.CTD_DISEASES.name)
```
- BioRED - data
- CRAFT (v4.0)
- BC5-CHEMDNER-patents-GPRO
- AskAPatient - data
- TwADR-L - data
- COMETA - data: "COMETA is available by contacting the last author via e-mail or following the instructions on https://www.siphs.org/."
- ShARe
- 2019 n2c2/UMass Lowell shared task
- TAC2017ADR - data
Create snapshots regularly for ease of reproducibility. This would require contacting the resource providers to verify that it is feasible, as redistribution issues may arise.