This repository contains an end-to-end NLP pipeline for enzyme function extraction that uses the EnzChemRED dataset to fine-tune NLP methods. The pipeline consists of four main steps:
- Literature triage
- Named entity recognition (NER)
- Named entity normalization (NEN)
- Relation extraction (RE)
Literature triage: Retrieves papers relevant to enzyme functions using LitSuggest. Located in the literature_triage folder.
Named entity recognition (NER): Tags chemical and protein mentions in the text using AIONER fine-tuned on EnzChemRED. Located in the aioner folder.
Named entity normalization (NEN): Links chemical and protein mentions to stable, unique database identifiers using MTCR and UniProt matching.
- Chemical linking to ChEBI: Located in the mtcr folder
- Protein linking to UniProt: Located in the uniprot_norm folder
Relation extraction (RE): Extracts information about enzymes and the chemical conversions they catalyze using a BioREx model fine-tuned on EnzChemRED. Located in the biorex folder.
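To show how the four steps feed into one another, here is a minimal Python sketch of the data flow. It is an illustration under stated assumptions, not the code in this repository: `Mention`, `Document`, `run_pipeline`, and every `toy_*` function are hypothetical names, the toy stand-ins use a tiny keyword lexicon in place of LitSuggest, AIONER, MTCR/UniProt matching, and BioREx, and the identifiers and relation label are placeholders rather than real ChEBI or UniProt entries.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Mention:
    """One tagged span; `identifier` stays None until normalization."""
    start: int
    end: int
    text: str
    kind: str  # "Chemical" or "Protein"
    identifier: Optional[str] = None


# (subject identifier, relation label, object identifier)
Relation = Tuple[str, str, str]


@dataclass
class Document:
    doc_id: str
    text: str
    mentions: List[Mention] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)


def run_pipeline(
    docs: List[Document],
    is_relevant: Callable[[str], bool],
    tag: Callable[[str], List[Mention]],
    normalize: Callable[[Mention], Optional[str]],
    extract: Callable[[Document], List[Relation]],
) -> List[Document]:
    """Chain the four steps; each callable is a placeholder for one stage."""
    processed = []
    for doc in docs:
        if not is_relevant(doc.text):      # 1. literature triage
            continue
        doc.mentions = tag(doc.text)       # 2. named entity recognition
        for mention in doc.mentions:       # 3. named entity normalization
            mention.identifier = normalize(mention)
        doc.relations = extract(doc)       # 4. relation extraction
        processed.append(doc)
    return processed


# Toy stand-ins (assumptions for illustration only, not the real tools).
TOY_LEXICON = {
    "hexokinase": ("Protein", "UNIPROT:placeholder"),
    "glucose": ("Chemical", "CHEBI:placeholder"),
}


def toy_is_relevant(text: str) -> bool:
    return any(term in text.lower() for term in TOY_LEXICON)


def toy_tag(text: str) -> List[Mention]:
    lowered = text.lower()
    found = []
    for term, (kind, _) in TOY_LEXICON.items():
        start = lowered.find(term)  # first occurrence only
        if start >= 0:
            end = start + len(term)
            found.append(Mention(start, end, text[start:end], kind))
    return sorted(found, key=lambda m: m.start)


def toy_normalize(mention: Mention) -> Optional[str]:
    entry = TOY_LEXICON.get(mention.text.lower())
    return entry[1] if entry else None


def toy_extract(doc: Document) -> List[Relation]:
    # Pair every normalized protein with every normalized chemical.
    proteins = [m for m in doc.mentions if m.kind == "Protein" and m.identifier]
    chemicals = [m for m in doc.mentions if m.kind == "Chemical" and m.identifier]
    return [(p.identifier, "hypothetical_relation", c.identifier)
            for p in proteins for c in chemicals]


def demo() -> List[Document]:
    docs = [
        Document("doc1", "Hexokinase phosphorylates glucose."),
        Document("doc2", "The weather was mild."),
    ]
    return run_pipeline(docs, toy_is_relevant, toy_tag, toy_normalize, toy_extract)
```

Calling `demo()` keeps only `doc1`, since triage drops the unrelated text, and attaches its mentions, placeholder identifiers, and one relation triple. Passing each stage in as a callable keeps the sketch independent of any particular tool's interface; consult each folder for the actual entry points.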
If you use EnzChemRED in your research, please cite:
- Lai, PT., Coudert, E., Aimo, L. et al. EnzChemRED, a rich enzyme chemistry relation extraction dataset. Sci Data 11, 982 (2024). https://doi.org/10.1038/s41597-024-03835-7
@article{lai2024enzchemred,
title = {EnzChemRED, a rich enzyme chemistry relation extraction dataset},
author = {Po-Ting Lai and Elisabeth Coudert and Lucila Aimo and Kristian Axelsen and Lionel Breuza and Edouard de Castro and Marc Feuermann and Anne Morgat and Lucille Pourcel and Ivo Pedruzzi and Sylvain Poux and Nicole Redaschi and Catherine Rivoire and Anastasia Sveshnikova and Chih-Hsuan Wei and Robert Leaman and Ling Luo and Zhiyong Lu and Alan Bridge},
journal = {Scientific Data},
volume = {11},
number = {1},
pages = {982},
year = {2024},
publisher = {Nature Publishing Group UK London},
doi = {10.1038/s41597-024-03835-7}
}