BacReader

For BioNLP-OST 2019 task Bacteria Biotope

Table of contents

Overview
Pipeline
Data
- Dictionaries
- Free text and entities
Other resources
- Ab3P
- Word Vector Space Model
Dependencies
Run instructions
Results

Overview

The project aims to provide solutions for normalizing/linking microorganism-related free-text entities, as defined by the task.

More specifically, the task provides pre-annotated microorganism-related entities together with several types of dictionaries. The system links the pre-annotated entities to standard concepts in the dictionaries with a two-step method that combines a perfect-match module with an ensemble of shallow CNNs operating on entities converted to pre-trained word vectors.

NOTE: species names were normalized with an ensemble edit-distance method, which is not included here.

Pipeline

generate_table.py

Converts the input to a dataframe for easier downstream processing. Labels are added for the training and dev datasets.

Abbreviations were extracted with Ab3P.

generate_vsm.py

Converts entities to word vectors with the pre-trained word vector space model, for use by the downstream modules.

Different conversion rules were tested. Most of them are commented out, except for the one used for the ensemble method described in the paper: generate_five_fold_dataset(prediction=True).
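To make the conversion step concrete, here is a minimal sketch of one plausible rule for turning an entity into a fixed-size matrix (8 tokens by 200 dimensions, matching the CNN input size stated in this README). It is not the rule used in generate_vsm.py: the function names, the whitespace tokenizer, the zero padding, and the zero vector for out-of-vocabulary tokens are all assumptions, and a plain dict stands in for the loaded word vector model.

```python
def tokenize(entity):
    # Assumption: lowercase and split on whitespace.
    return entity.lower().split()


def entity_to_matrix(entity, vectors, max_tokens=8, dim=200):
    """Map an entity string to a max_tokens x dim list of lists.

    `vectors` is any dict-like mapping from token to a sequence of
    `dim` floats (a stand-in for the pre-trained model).
    """
    zero = [0.0] * dim
    # Truncate long entities; unknown tokens become zero vectors (assumption).
    rows = [list(vectors.get(tok, zero)) for tok in tokenize(entity)[:max_tokens]]
    # Pad short entities with zero rows up to max_tokens (assumption).
    rows += [list(zero) for _ in range(max_tokens - len(rows))]
    return rows
```

With the defaults, every entity becomes an 8x200 matrix regardless of how many tokens it has.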

main.py

Perfect-match module

Link free-text entities to standard concepts by exact-matching rules.

The rules included:
- Hyphens were replaced with spaces.
- Characters except alphabetic letters and spaces were removed.
- Case-insensitive string matching was performed between the free text entities and standard entities.
- All types of 'xx cheese' were linked to the corresponding cheese category by string matching.
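The first three rules above can be sketched as follows. This is an illustrative sketch rather than the code in main.py: the function names and the concept IDs in the usage note are hypothetical, collapsing repeated spaces is an added assumption, and the cheese rule is left out because its exact form is not described here.

```python
import re


def normalize(text):
    """Apply the string-cleaning rules before exact matching."""
    text = text.replace("-", " ")            # hyphens become spaces
    text = re.sub(r"[^A-Za-z ]", "", text)   # keep only letters and spaces
    # Lowercase for case-insensitive matching; collapsing runs of
    # spaces is an assumption, not a stated rule.
    return " ".join(text.lower().split())


def build_index(concepts):
    """Map normalized concept name -> concept ID.

    `concepts` is a dict of {concept_id: concept_name}.
    """
    return {normalize(name): cid for cid, name in concepts.items()}


def perfect_match(entity, index):
    """Return the matching concept ID, or None if there is no exact match."""
    return index.get(normalize(entity))
```

For example, with a hypothetical dictionary {"ID_A": "Gram-negative"}, the entity "gram negative" resolves to "ID_A", while an entity with no exact match returns None and would be left to the CNN module.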

CNN module

Input: 8x200, output: 1x139

 self.model.add(Conv1D(filters=arg, kernel_size=4, padding='same',
                       input_shape=(entity_embedding_size, vector_len)))
 self.model.add(MaxPooling1D(entity_embedding_size))
 self.model.add(Dense(139))
 self.model.compile(loss='cosine_proximity', optimizer=SGD())

Ensemble voting module
- Identify the most similar standard concepts via cosine similarity
- Return the standard concepts with the majority votes
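The two voting steps above can be sketched as follows, assuming each ensemble member outputs one vector per entity and each standard concept has a reference vector of the same length. The function names are hypothetical, and the tie-breaking behaviour (the concept that received its first vote earliest wins) is an assumption rather than something stated in this README.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_concept(prediction, concept_vectors):
    """Return the ID of the concept whose vector is most similar to `prediction`."""
    return max(concept_vectors,
               key=lambda cid: cosine(prediction, concept_vectors[cid]))


def ensemble_vote(predictions, concept_vectors):
    """Majority vote over the nearest concept chosen for each model's prediction."""
    votes = Counter(nearest_concept(p, concept_vectors) for p in predictions)
    return votes.most_common(1)[0][0]
```

Each model's output is first mapped to its nearest concept, so the vote is taken over concept IDs rather than over averaged vectors.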

Evaluation

The prediction result (OntoBiotope_result.tsv) was converted to the format of the given label files (.a2 files) with [online_test_output.py](https://github.com/OXPHOS/BioNLP/blob/master/src/online_test_output.py).

The test suite can be found here.

Data

Dictionaries

Two types of entities were involved in the task: phenotype, which describes microbial characteristics, and habitat, which describes physical places where microorganisms could be observed. A dictionary with 3602 standard concepts was also provided by the task.

In the original dictionary, each concept is assigned a unique ID, and its direct parents in the hierarchy are also listed. Our model omits this hierarchical information.

Free text and entities

The statistics of all available entities are listed below.

Datasets	Article#	Habitat_total	Phenotype_total	H+P_total	Habitat_de-duplicated	Phenotype_de-duplicated	H+P_de-duplicated
Train	133	1118	369	1487	627	176	803
Dev	66	610	161	771	348	97	445
Test	97	924	252	1176	596	148	744

.a1 files correspond to individual pieces of literature. Each row lists an identifier, the row type, the start and end positions, and the corresponding text.

.a2 files contain the label for each annotated text in the corresponding .a1 file: the entity type and the unique code of the concept in the corresponding standard dictionary.
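As an illustration of the two file types, here is a minimal parsing sketch. It assumes the usual BioNLP standoff layout, which this README does not spell out: tab-separated .a1 rows of the form "T1<TAB>Habitat 5 10<TAB>text", and .a2 rows of the form "N1<TAB>OntoBiotope Annotation:T1 Referent:<concept ID>". The function names are hypothetical and this is not the code in generate_table.py.

```python
def parse_a1(text):
    """Return {entity_id: (row_type, offsets, surface_text)} for T-rows.

    Offsets are kept as a string because discontinuous spans may
    contain several start/end pairs.
    """
    entities = {}
    for line in text.splitlines():
        if not line.startswith("T"):
            continue
        ent_id, type_and_span, surface = line.split("\t", 2)
        row_type, offsets = type_and_span.split(" ", 1)
        entities[ent_id] = (row_type, offsets, surface)
    return entities


def parse_a2(text):
    """Return {entity_id: [concept_id, ...]} for N-rows."""
    labels = {}
    for line in text.splitlines():
        if not line.startswith("N"):
            continue
        _, body = line.split("\t", 1)
        # Skip the leading dictionary name, then read key:value fields.
        fields = dict(part.split(":", 1) for part in body.split()[1:])
        labels.setdefault(fields["Annotation"], []).append(fields["Referent"])
    return labels
```

Joining the two results on the entity ID gives one row per annotated entity with its label, which is the kind of table the downstream steps work on. Rows of other types in the .a1 file (for example titles) are returned as well and can be filtered by row type.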

Other resources

Ab3P

Ab3P is an abbreviation detection tool developed specifically for biomedical concepts. It reached 96.5% precision and 83.2% recall on 1250 randomly selected MEDLINE records, as reported by Sohn et al. (Sohn S, Comeau DC, Kim W, Wilbur WJ. (2008) Abbreviation definition identification based on automatic precision estimates. BMC Bioinformatics 9:402. PubMed ID: 18817555).

Ab3P-detected abbreviations were provided as separate input files by the task organizers.

Word Vector Space Model

A set of word vectors induced on a combination of PubMed and PMC texts with texts extracted from a recent English Wikipedia dump. The 4GB vectors can be downloaded here as wikipedia-pubmed-and-PMC-w2v.bin (Sampo Pyysalo, Filip Ginter, Hans Moen, Tapio Salakoski and Sophia Ananiadou. Distributional Semantics Resources for Biomedical Text Processing. LBM 2013).

Dependencies

keras==2.2.4
scikit-learn==0.21.2
gensim==3.4.0
pandas==0.25.1
numpy==1.16.5

A detailed description of the dependencies can be found here.

Run instructions

Rename /sample_data to /input_data and change the corresponding paths if necessary.

Datasets can be downloaded from the Bacteria Biotope website.

Then run the scripts in the order given in the Pipeline section.

Results

Team	Habitat	Phenotype
PADIA_BacReader	0.684	0.758
Challenge-provided baseline	0.559	0.581
AmritaCen_healthcare	0.522	0.646
BLARI_GMU	0.615	0.646
BOUN-ISIK	0.687	0.566
