variant2literature (v2l)

Extract and normalize variants from academic papers in xml, pdf, doc, docx, xlsx, csv formats.

Required Data and Packages

CRF++:

download CRF++.0.58.tar.gz
put CRF++.0.58.tar.gz in variant2literature/

download following files and put them in `variant2literature/models/`

FasterRCNN model:

https://www.dropbox.com/s/g980k8hpqj1q8cn/faster_rcnn.pth

UCSC tables (hg19):

NCBI gene_info

ucsc.hg19.fasta

download from ucsc and convert it to fasta format or download from GATK bundle and decompress it.
rename it to ucsc.hg19.fasta (if the filename is not ucsc.hg19.fasta).

tmVar

download tmVar 2.0
copy tmVarJava/CRF/MentionExtractionUB.Model to variant2literature/models/

GNormPlus

download GNormPlus
copy GNormPlusJava/Dictionary/GNR.Model to variant2literature/models/ and
copy GNormPlusJava/Dictionary/PT_CTDGene.txt to variant2literature/models/

Usage

Run variant2literature in Docker

Prerequisites

Linux OS
Docker 18.09.0 or higher
CUDA 8.0 or higher
nvidia-docker

Setup

build docker image by make build
compile fasterRCNN by make compile
start docker container by make run
start mysql docker container by make run-db
load data into database by make load-db (run only once unless MYSQL_VOLUME is changed)

Index Papers

put paper directories in input/
run make index
query by make query or make query OUTPUT_FILE=output.txt

Delete Indexes

run make truncate

Stop and Remove Docker Container

run make rm
run make rm-db

Directly run in Linux

Setup

apt-get install -y software-properties-common
add-apt-repository ppa:deadsnakes/ppa
apt-get update
apt-get install -y \
        build-essential cmake \
        python3.6-dev python3-pip python3-tk \
        libpoppler-cpp-dev libmagic-dev libxrender-dev \
        libsm6 libxext6 libglib2.0-0 \
        libreoffice poppler-utils

Install python and required package

ln -s /usr/bin/python3.6 /usr/local/bin/python
python -m pip install -U pip==18.1
pip install torch==0.4.1
# If you have CUDA 9.2, please use the following command to install pytorch instead
# pip install http://download.pytorch.org/whl/cu92/torch-0.4.1-cp36-cp36m-linux_x86_64.whl

pip install -r requirements.txt \
    && python -c "import nltk; nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')"

Initialize CRF++

cp CRF++-0.58.tar.gz /opt/
cd /opt && tar zxvf CRF++-0.58.tar.gz
cd /opt/CRF++-0.58 \
    && ./configure && make && make install && cd python \
    && cp /opt/CRF++-0.58/crfpp.h . \
    && python setup.py build && ldconfig \
    && python setup.py install

Initialize table detector

cd table_detector/lib && bash make.sh

Install mysql

apt-get install mariadb-server
service mysql start

Change mysql root password

mysql_secure_installation
# then enter default password `s8fjYJd92oP`

If you get en error like 1698, "Access denied for user 'root'@'localhost'", please set the root user to use the mysql_native_password plugin.

mysql> USE mysql;
mysql> UPDATE user SET plugin='mysql_native_password' WHERE User='root';
mysql> FLUSH PRIVILEGES;
mysql> exit;

then restart mysql

service mysql restart

Load mysql data

ln -s ./ /app

export MYSQL_HOST=127.0.0.1
export MYSQL_PORT=3306
export MYSQL_ROOT_PASSWORD=s8fjYJd92oP

cd mysqldb
python models.py

Run table detector

export CUDA_VISIBLE_DEVICES=0
export NUM_TABLE_DETECTORS=1
export LOAD_BALANCER_HOST='localhost'

cd table_detector && python table_detector.py

Index papers

Put paper directories in input/, then execute

python main.py --n-process 1 --input input/

If your input files are plain text, or you're running on a device without GPU, please add --no-table-detect to disable the table detector.
The results will be saved in mysql database, please use query.py to query or use SQL command directly. For example:

mysql> USE gene;
mysql> SELECT * FROM var_pmid WHERE _id='<paper_directory_name>';

Query

python query.py

License

This project is licensed under the GPLv3 License - see the LICENSE file for details.

Acknowledgments

The fasterRCNN implementation used here is written by Jianwei Yang and Jiasen Lu.

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
assign_gene		assign_gene
gene_ner		gene_ner
input		input
models		models
mysqldb		mysqldb
normalize_var		normalize_var
parse_data		parse_data
table_detector		table_detector
var_ner		var_ner
var_utils		var_utils
.dockerignore		.dockerignore
.gitignore		.gitignore
.pylintrc		.pylintrc
Dockerfile		Dockerfile
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
main.py		main.py
query.py		query.py
requirements.txt		requirements.txt
utils.py		utils.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

variant2literature (v2l)

Required Data and Packages

CRF++:

download following files and put them in `variant2literature/models/`

FasterRCNN model:

UCSC tables (hg19):

NCBI gene_info

ucsc.hg19.fasta

tmVar

GNormPlus

Usage

Run variant2literature in Docker

Prerequisites

Setup

Index Papers

Delete Indexes

Stop and Remove Docker Container

Directly run in Linux

Setup

Install python and required package

Initialize CRF++

Initialize table detector

Install mysql

Change mysql root password

Load mysql data

Run table detector

Index papers

Query

License

Acknowledgments

About

Releases

Packages

Contributors 3

Languages

License

ailabstw/variant2literature

Folders and files

Latest commit

History

Repository files navigation

variant2literature (v2l)

Required Data and Packages

CRF++:

download following files and put them in variant2literature/models/

FasterRCNN model:

UCSC tables (hg19):

NCBI gene_info

ucsc.hg19.fasta

tmVar

GNormPlus

Usage

Run variant2literature in Docker

Prerequisites

Setup

Index Papers

Delete Indexes

Stop and Remove Docker Container

Directly run in Linux

Setup

Install python and required package

Initialize CRF++

Initialize table detector

Install mysql

Change mysql root password

Load mysql data

Run table detector

Index papers

Query

License

Acknowledgments

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

download following files and put them in `variant2literature/models/`

Packages