BioNLP-Corpus

A repository that collects biomedical corpora available on the Internet.

BC2GM

https://github.com/spyysalo/bc2gm-corpus

A corpus of scientific text used in BioCreative, a competition in which participants are given well-defined text-mining or information-extraction tasks in the biological domain. The BC2GM corpus consists mainly of the training and testing corpora from BioCreative I; the testing corpus for the current task consists of an additional 5,000 sentences that were held "in reserve".

BC4CHEMD

https://biocreative.bioinformatics.udel.edu/tasks/biocreative-v/track-3-cdr/

https://github.com/cambridgeltl/MTL-Bioinformatics-2016/tree/master/data/BC4CHEMD-IOBES

https://github.com/cambridgeltl/MTL-Bioinformatics-2016

The 2015 CDR challenge has been completed. The overview paper is:

Wei CH, Peng Y, Leaman R, Davis AP, Mattingly CJ, Li J, Wiegers TC, and Lu Z. Overview of the BioCreative V Chemical Disease Relation (CDR) Task. Proceedings of the Fifth BioCreative Challenge Evaluation Workshop, pp. 154-166

BC5CDR-chem, BC5CDR-disease

https://github.com/cambridgeltl/MTL-Bioinformatics-2016/tree/master/data/BC5CDR-chem-IOB

https://github.com/wonjininfo/CollaboNet

The BC5CDR corpus consists of 1,500 PubMed articles with 4,409 annotated chemicals, 5,818 diseases, and 3,116 chemical-disease interactions.
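The IOB-tagged copy of BC5CDR linked above can be loaded with a few lines of Python. The sketch below is a minimal reader, assuming a CoNLL-style layout: one token per line, the tag in the last tab-separated column, and a blank line between sentences. That layout, the tag names `B-Chemical`/`I-Chemical`/`O`, and the helper names `read_iob` and `extract_mentions` are assumptions made for illustration; check the actual files before relying on them. The sample text is made up and is not taken from the corpus.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, tag) pairs


def read_iob(lines: Iterable[str]) -> List[Sentence]:
    """Group tab-separated token/tag lines into sentences (blank line = boundary)."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split("\t")
        # Assumption: token is the first column, tag is the last column.
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


def extract_mentions(sentence: Sentence) -> List[str]:
    """Return entity mentions as strings, joining B/I-tagged tokens with spaces."""
    mentions: List[str] = []
    buffer: List[str] = []
    for token, tag in sentence:
        if tag.startswith("B"):
            if buffer:
                mentions.append(" ".join(buffer))
            buffer = [token]
        elif tag.startswith("I"):
            buffer.append(token)  # tolerate an I tag without a preceding B
        else:
            if buffer:
                mentions.append(" ".join(buffer))
            buffer = []
    if buffer:
        mentions.append(" ".join(buffer))
    return mentions


# Made-up sample in the assumed format, for illustration only.
SAMPLE = (
    "Naloxone\tB-Chemical\n"
    "reverses\tO\n"
    "the\tO\n"
    "effect\tO\n"
    "of\tO\n"
    "clonidine\tB-Chemical\n"
    ".\tO\n"
    "\n"
    "Alpha\tB-Chemical\n"
    "methyldopa\tI-Chemical\n"
    "was\tO\n"
    "tested\tO\n"
    ".\tO\n"
)

if __name__ == "__main__":
    for sent in read_iob(SAMPLE.splitlines()):
        print(extract_mentions(sent))
```

To read a real file, pass an open file handle to `read_iob` instead of `SAMPLE.splitlines()`.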

GENIA

https://github.com/spyysalo/genia-pos

s800

https://github.com/spyysalo/s800

SPECIES is a standalone command-line application that identifies taxonomic mentions in documents and maps them to the corresponding NCBI Taxonomy database entries.

Given a folder of plain-text files, SPECIES uses its dictionary of taxonomic names and synonyms to report each taxonomic mention (start and end position in the document), the detected term, and the corresponding NCBI Taxonomy database record identifier.

Besides binomials following the Linnaean naming convention, recognised taxonomic mentions include acronyms, common names and abbreviations, as well as misspellings and the rest of the naming types supported by the NCBI Taxonomy.
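To make the kind of output described above concrete, here is a toy dictionary-based tagger. It is not the SPECIES implementation and does not reproduce its output format; the function name `tag_species`, the five-entry dictionary, and the choice of an exclusive end offset are all assumptions made for this sketch. It only illustrates the idea of matching names and synonyms against text and reporting start, end, term, and NCBI Taxonomy identifier.

```python
import re
from typing import Dict, List, Tuple

# Toy name/synonym dictionary (lower-cased name -> NCBI Taxonomy identifier).
# A real tagger would load a far larger dictionary.
TOY_DICTIONARY: Dict[str, int] = {
    "homo sapiens": 9606,
    "human": 9606,
    "escherichia coli": 562,
    "e. coli": 562,
    "mouse": 10090,
}

Mention = Tuple[int, int, str, int]  # (start, end_exclusive, term, taxonomy_id)


def tag_species(text: str, dictionary: Dict[str, int]) -> List[Mention]:
    """Find dictionary names in text, case-insensitively, on word boundaries."""
    # Longest names first so the alternation prefers the longest match.
    names = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(name) for name in names) + r")(?!\w)",
        re.IGNORECASE,
    )
    return [
        (m.start(), m.end(), m.group(0), dictionary[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


if __name__ == "__main__":
    example = "E. coli was isolated from human and mouse samples."
    for start, end, term, tax_id in tag_species(example, TOY_DICTIONARY):
        print(start, end, term, tax_id)
```

Exact string matching like this does not handle misspellings or ambiguous acronyms, which the description above says SPECIES also covers.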

Revised JNLPBA

https://arxiv.org/abs/1901.10219

https://github.com/spyysalo/jnlpba

NCBI-disease

https://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/

The NCBI disease corpus is fully annotated at the mention and concept level to serve as a research resource for the biomedical natural language processing community.

linnaeus

https://github.com/spyysalo/linnaeus-corpus

https://github.com/wonjininfo/CollaboNet