Tokenization, sentence segmentation, POS tagging and dependency parsing for biomedical texts (BMC Bioinformatics 2019)

datquocnguyen/BioPosDep

Biomedical POS tagging and dependency parsing models

The biomedical POS tagging and dependency parsing models are trained on GENIA and CRAFT. See the following paper for more details:

@Article{NguyenK2019,
author="Nguyen, Dat Quoc and Verspoor, Karin",
title="From POS tagging to dependency parsing for biomedical event extraction",
journal="BMC Bioinformatics",
year="2019",
month="Feb",
day="12",
volume="20",
number="1",
pages="72",
doi="10.1186/s12859-019-2604-0",
url="https://doi.org/10.1186/s12859-019-2604-0"
}

Our models are free for non-commercial use and distributed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA) License.

Usage

The first step is to perform POS tagging and dependency parsing with the NLP4J models. If the input files are raw text, NLP4J also performs tokenization and sentence segmentation. The NLP4J output then serves as input to the other dependency parsing models.

Perform biomedical POS tagging and dependency parsing using retrained NLP4J models

Installation

Download NLP4J models from https://github.com/datquocnguyen/BioPosDep/archive/master.zip (70MB) or clone these models using git:

$ git clone https://github.com/datquocnguyen/BioPosDep.git

Running the models requires Java to be installed and available from the command line or terminal.

Command line

# Using models trained on GENIA
BioPosDep/NLP4J$ bin/nlpdecode -c config-GENIA.xml -i <filepath> -format <string> [-ie <string> -oe <string>]

# Using models trained on CRAFT
BioPosDep/NLP4J$ bin/nlpdecode -c config-CRAFT.xml -i <filepath> -format <string> [-ie <string> -oe <string>]

-i       <filepath> : input path (required)
-format  <string>   : format of the input data (raw|line|tsv; default: raw)
-ie      <string>   : input file extension (default: *)
-oe      <string>   : output file extension (default: nlp)
  • -i specifies the input path pointing to either a file or a directory. When the path points to a file, only the specific file is processed. When the path points to a directory, all files with the file extension -ie under the specific directory are processed.
  • -format specifies the format of the input file: raw, line, or tsv
    • raw accepts texts in any format
    • line expects a sentence per line
    • tsv expects columns delimited by \t and sentences separated by \n
  • -ie specifies the input file extension. The default value * implies files with any extension. This option is used only when the input path -i points to a directory.
  • -oe specifies the output file extension appended to each input filename. The corresponding output file, consisting of the NLP output, will be generated.
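
When many files or both configurations are processed in a batch, the options above can be assembled programmatically. The following is a minimal sketch and not part of the repository: the helper name `build_nlpdecode_cmd` is hypothetical, it uses only the flags documented above, and it assumes the command is run from `BioPosDep/NLP4J` with Python 3.

```python
# Hypothetical helper (not shipped with BioPosDep): builds the argument
# list for bin/nlpdecode from the options documented above.
VALID_FORMATS = ("raw", "line", "tsv")


def build_nlpdecode_cmd(config, input_path, fmt="raw",
                        input_ext=None, output_ext=None):
    """Return the nlpdecode command as a list of arguments."""
    if fmt not in VALID_FORMATS:
        raise ValueError("format must be one of %s" % (VALID_FORMATS,))
    cmd = ["bin/nlpdecode", "-c", config, "-i", input_path, "-format", fmt]
    # -ie and -oe are optional; omit them to keep the tool's defaults.
    if input_ext is not None:
        cmd += ["-ie", input_ext]
    if output_ext is not None:
        cmd += ["-oe", output_ext]
    return cmd


if __name__ == "__main__":
    cmd = build_nlpdecode_cmd("config-GENIA.xml", "../data/raw.txt",
                              fmt="raw", output_ext="genia")
    print(" ".join(cmd))
    # To execute it (assumption: current directory is the repository root):
    #   import subprocess
    #   subprocess.run(cmd, check=True, cwd="BioPosDep/NLP4J")
```

Returning a list rather than a single string avoids shell quoting problems when paths contain spaces.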

Examples

# For a raw corpus input
BioPosDep/NLP4J$ bin/nlpdecode -c config-GENIA.xml -i ../data/raw.txt -format raw -oe genia
BioPosDep/NLP4J$ bin/nlpdecode -c config-CRAFT.xml -i ../data/raw.txt -format raw -oe craft

# For a sentence-segmented corpus input (without tokenization!)
BioPosDep/NLP4J$ bin/nlpdecode -c config-GENIA.xml -i ../data/sentence_segmented.txt -format line -oe genia
BioPosDep/NLP4J$ bin/nlpdecode -c config-CRAFT.xml -i ../data/sentence_segmented.txt -format line -oe craft

# For a "pre-processed" tokenized and sentence-segmented corpus
	# Convert into a column-based format
BioPosDep/NLP4J$ python ../get_ColumnFormat.py ../data/tokenized_sentence_segmented.txt
	# Apply models using "tsv". Word forms are expected in the second column (i.e. column index 1).
	# If your column-formatted corpus has word forms in a different column, adjust <column index="1" field="form"/> in config-GENIA.xml and config-CRAFT.xml.
BioPosDep/NLP4J$ bin/nlpdecode -c config-GENIA.xml -i ../data/tokenized_sentence_segmented.txt.column -format tsv -oe genia
BioPosDep/NLP4J$ bin/nlpdecode -c config-CRAFT.xml -i ../data/tokenized_sentence_segmented.txt.column -format tsv -oe craft

The examples above generate output files with extensions .genia and .craft in the data folder, containing POS tags and dependency annotations.
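
For readers who want to see what a column-based input with word forms at column index 1 might look like, here is a minimal sketch. It is not the repository's get_ColumnFormat.py, whose exact output layout is not shown here. The function name `to_column_format` is hypothetical, and putting a 1-based token index in the first column is an assumption made only so that the word form lands in the second column. Python 3 is assumed.

```python
# Illustrative sketch only: NOT the repository's get_ColumnFormat.py.
# Assumption: column 0 holds a 1-based token index, column 1 the word form,
# and sentences are separated by an empty line.
def to_column_format(lines):
    """Convert whitespace-tokenized, one-sentence-per-line text to columns."""
    out = []
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip empty input lines
        for index, token in enumerate(tokens, start=1):
            out.append("%d\t%s" % (index, token))
        out.append("")  # empty line marks the sentence boundary
    return "\n".join(out)


if __name__ == "__main__":
    print(to_column_format(["IL-2 gene expression", "It binds ."]))
```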

NOTE

These NLP4J output files are in a 9-column format. Before other dependency parsing models can be applied, the files must be converted to a 10-column format:

# Command line
BioPosDep$ python convert_NLP4J_to_CoNLL.py <NLP4J_output_filepath>

# Examples
BioPosDep$ python convert_NLP4J_to_CoNLL.py data/raw.txt.genia
BioPosDep$ python convert_NLP4J_to_CoNLL.py data/raw.txt.craft
This generates two 10-column files, raw.txt.genia.conll and raw.txt.craft.conll, in the data folder. They serve as inputs for the other models.
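
Before passing a converted file to another parser, it can help to confirm that every token line really has 10 tab-separated columns. The following is a small sanity check, sketched under the assumption that sentences are separated by empty lines; the function name `check_column_count` is hypothetical and the check is not part of convert_NLP4J_to_CoNLL.py.

```python
# Hypothetical sanity check (not part of the repository): reports token
# lines whose number of tab-separated columns differs from the expected one.
def check_column_count(lines, expected=10):
    """Return a list of (line_number, column_count) for malformed lines."""
    bad = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue  # empty line: assumed sentence boundary
        count = len(line.split("\t"))
        if count != expected:
            bad.append((number, count))
    return bad


if __name__ == "__main__":
    import sys
    # Usage (assumed invocation): python check_columns.py data/raw.txt.genia.conll
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8") as handle:
            for number, count in check_column_count(handle):
                print("line %d has %d columns" % (number, count))
```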

Using retrained Stanford Biaffine parsing models

Installation

# Install prerequisite packages  
BioPosDep/StanfordBiaffineParser-v2$ virtualenv .TF1_0
BioPosDep/StanfordBiaffineParser-v2$ source .TF1_0/bin/activate
BioPosDep/StanfordBiaffineParser-v2$ pip install tensorflow==1.0
BioPosDep/StanfordBiaffineParser-v2$ pip install numpy==1.11.0
BioPosDep/StanfordBiaffineParser-v2$ pip install scipy==1.0.0
BioPosDep/StanfordBiaffineParser-v2$ pip install matplotlib==2.1.2
BioPosDep/StanfordBiaffineParser-v2$ pip install backports.lzma
  • Download file Pre-trained-Biaffine-v2.zip from HERE.
  • Unzip the file, then copy/move folder models and file PubMed-shuffle-win2-500Kwords.txt into folder BioPosDep/StanfordBiaffineParser-v2.

Command line

# Using model trained on GENIA
BioPosDep/StanfordBiaffineParser-v2$ python main.py --save_dir models/GENIA parse <input_file_path>

# Using model trained on CRAFT
BioPosDep/StanfordBiaffineParser-v2$ python main.py --save_dir models/CRAFT parse <input_file_path>

# Output parsed files are by default saved in the model directory with the same name as the input file.
# NOTE: We can also specify the output directory with the --output_dir flag and/or the output file name with the --output_file flag.

Examples

# Activate TensorFlow 1.0 before running models:
BioPosDep/StanfordBiaffineParser-v2$ source .TF1_0/bin/activate
BioPosDep/StanfordBiaffineParser-v2$ python main.py --save_dir models/GENIA parse ../data/raw.txt.genia.conll
BioPosDep/StanfordBiaffineParser-v2$ python main.py --save_dir models/CRAFT parse ../data/raw.txt.craft.conll

Two parsed output files, raw.txt.genia.conll and raw.txt.craft.conll, are generated in folders models/GENIA and models/CRAFT, respectively.
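
To inspect the parsed files, the dependency arcs can be read back with a few lines of Python. This is a minimal sketch that assumes the 10-column files follow the usual CoNLL-X layout, with the word form in column 2, the head index in column 7 and the dependency label in column 8 (1-based). That layout is an assumption, not something the parser's output has been checked against here, and the function name `read_dependencies` is hypothetical.

```python
# Illustrative reader, assuming a CoNLL-X style layout (0-based indices):
#   cols[1] = word form, cols[6] = head index, cols[7] = dependency label.
def read_dependencies(lines):
    """Return a list of sentences, each a list of (form, head, label)."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # Empty line: assumed sentence boundary.
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split("\t")
        current.append((cols[1], int(cols[6]), cols[7]))
    if current:
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    sample = [
        "1\tIL-2\t_\tNN\tNN\t_\t2\tnn\t_\t_",
        "2\tgene\t_\tNN\tNN\t_\t0\troot\t_\t_",
        "",
    ]
    for sentence in read_dependencies(sample):
        for form, head, label in sentence:
            print(form, head, label)
```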

Using retrained jPTDP models

See https://github.com/datquocnguyen/jPTDP for details.
