Extract and normalize variants from academic papers in XML, PDF, DOC, DOCX, XLSX, and CSV formats.
- download CRF++-0.58.tar.gz
  put CRF++-0.58.tar.gz in variant2literature/
- http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/ncbiRefSeq.txt.gz
- http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/ensGene.txt.gz
- http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/knownGene.txt.gz
- http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/kgAlias.txt.gz
- http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/ensemblToGeneName.txt.gz
- http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/snp150.txt.gz
- https://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz
- https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2refseq.gz
- download the hg19 reference genome from UCSC and convert it to FASTA format, or download it from the GATK bundle and decompress it.
  rename it to ucsc.hg19.fasta (if the filename is not already ucsc.hg19.fasta).
- download tmVar 2.0
  copy tmVarJava/CRF/MentionExtractionUB.Model to variant2literature/models/
- download GNormPlus
  copy GNormPlusJava/Dictionary/GNR.Model to variant2literature/models/
  and copy GNormPlusJava/Dictionary/PT_CTDGene.txt to variant2literature/models/
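Before building, it can help to confirm that the model files named above were copied to the right place. The following is a minimal sketch, not part of the project: the helper `missing_files` is hypothetical, and the script assumes it is run from the directory that contains variant2literature/.

```python
from pathlib import Path

# File names taken from the tmVar 2.0 and GNormPlus steps above.
REQUIRED_MODEL_FILES = [
    "MentionExtractionUB.Model",
    "GNR.Model",
    "PT_CTDGene.txt",
]


def missing_files(directory, names):
    """Return the entries of `names` that do not exist as files in `directory`."""
    base = Path(directory)
    return [name for name in names if not (base / name).is_file()]


if __name__ == "__main__":
    # Assumption: the current directory is the parent of variant2literature/.
    missing = missing_files("variant2literature/models", REQUIRED_MODEL_FILES)
    if missing:
        print("Missing model files:", ", ".join(missing))
    else:
        print("All model files are in place.")
```

The same helper can be pointed at any other directory and file list, for example the downloaded annotation files.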
- Linux OS
- Docker 18.09.0 or higher
- CUDA 8.0 or higher
- nvidia-docker
- build docker image by
make build
- compile fasterRCNN by
make compile
- start docker container by
make run
- start mysql docker container by
make run-db
- load data into database by
make load-db
(run only once unless MYSQL_VOLUME is changed)
- put paper directories in input/
- run
make index
- query by
make query
or make query OUTPUT_FILE=output.txt
- run
make truncate
- run
make rm
- run
make rm-db
apt-get install -y software-properties-common
add-apt-repository ppa:deadsnakes/ppa
apt-get update
apt-get install -y \
    build-essential cmake \
    python3.6-dev python3-pip python3-tk \
    libpoppler-cpp-dev libmagic-dev libxrender-dev \
    libsm6 libxext6 libglib2.0-0 \
    libreoffice poppler-utils
ln -s /usr/bin/python3.6 /usr/local/bin/python
python -m pip install -U pip==18.1
pip install torch==0.4.1
# If you have CUDA 9.2, please use the following command to install pytorch instead
# pip install http://download.pytorch.org/whl/cu92/torch-0.4.1-cp36-cp36m-linux_x86_64.whl
pip install -r requirements.txt \
&& python -c "import nltk; nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')"
cp CRF++-0.58.tar.gz /opt/
cd /opt && tar zxvf CRF++-0.58.tar.gz
cd /opt/CRF++-0.58 \
&& ./configure && make && make install && cd python \
&& cp /opt/CRF++-0.58/crfpp.h . \
&& python setup.py build && ldconfig \
&& python setup.py install
cd table_detector/lib && bash make.sh
apt-get install mariadb-server
service mysql start
mysql_secure_installation
# then enter default password `s8fjYJd92oP`
If you get an error like 1698, "Access denied for user 'root'@'localhost'", set the root user to use the mysql_native_password plugin:
mysql> USE mysql;
mysql> UPDATE user SET plugin='mysql_native_password' WHERE User='root';
mysql> FLUSH PRIVILEGES;
mysql> exit;
Then restart MySQL:
service mysql restart
ln -s "$(pwd)" /app
export MYSQL_HOST=127.0.0.1
export MYSQL_PORT=3306
export MYSQL_ROOT_PASSWORD=s8fjYJd92oP
cd mysqldb
python models.py
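The three exported variables above are how the database settings reach the Python code. A minimal sketch of reading them is shown below; the helper `mysql_settings_from_env`, the `root` user, and the fallback defaults are assumptions for illustration and may differ from what mysqldb/models.py actually does.

```python
import os


def mysql_settings_from_env(environ=None):
    """Collect MySQL connection settings from environment variables.

    The variable names match the exports above. The fallback values
    (127.0.0.1, 3306) and the "root" user are assumptions of this sketch.
    """
    env = os.environ if environ is None else environ
    password = env.get("MYSQL_ROOT_PASSWORD")
    if not password:
        raise RuntimeError("MYSQL_ROOT_PASSWORD is not set")
    return {
        "host": env.get("MYSQL_HOST", "127.0.0.1"),
        "port": int(env.get("MYSQL_PORT", "3306")),  # ports are exported as strings
        "user": "root",
        "password": password,
    }
```

Failing early on a missing password gives a clearer message than a later connection error.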
export CUDA_VISIBLE_DEVICES=0
export NUM_TABLE_DETECTORS=1
export LOAD_BALANCER_HOST='localhost'
cd table_detector && python table_detector.py
Put paper directories in input/, then execute:
python main.py --n-process 1 --input input/
If your input files are plain text, or you are running on a device without a GPU, add --no-table-detect to disable the table detector.
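When the indexing step is driven from a script, the choice between the two invocations can be made explicit. The sketch below only assembles the command line from the flags shown above; `build_index_command` is a hypothetical helper, not something the project provides.

```python
import shlex


def build_index_command(input_dir="input/", n_process=1, table_detect=True):
    """Build the argument list for main.py using the flags shown above."""
    if n_process < 1:
        raise ValueError("n_process must be at least 1")
    cmd = ["python", "main.py", "--n-process", str(n_process), "--input", input_dir]
    if not table_detect:
        # Plain-text input or no GPU: skip the table detector.
        cmd.append("--no-table-detect")
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(" ".join(shlex.quote(arg) for arg in build_index_command(table_detect=False)))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)` to start indexing.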
The results are saved in the MySQL database. Use query.py to query them, or run SQL commands directly. For example:
mysql> USE gene;
mysql> SELECT * FROM var_pmid WHERE _id='<paper_directory_name>';
python query.py
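For scripted access, the SELECT statement above can be issued with a bound parameter rather than by pasting the directory name into the SQL string. The sketch below only builds the statement and its parameters; `var_pmid_query` is a hypothetical helper, and the `%s` placeholder assumes a MySQL driver that uses that style, as PyMySQL and mysqlclient do.

```python
def var_pmid_query(paper_directory_name):
    """Return (sql, params) selecting all rows for one paper directory.

    The table and column names come from the SQL example above.
    """
    if not paper_directory_name:
        raise ValueError("paper_directory_name must be a non-empty string")
    sql = "SELECT * FROM var_pmid WHERE _id = %s"
    return sql, (paper_directory_name,)


if __name__ == "__main__":
    sql, params = var_pmid_query("example_paper")  # hypothetical directory name
    print(sql, params)
    # With an open DB-API cursor on the `gene` database:
    #     cursor.execute(sql, params)
    #     rows = cursor.fetchall()
```

Letting the driver bind the value avoids quoting problems when a directory name contains quotes or other special characters.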
This project is licensed under the GPLv3 License - see the LICENSE file for details.
The fasterRCNN implementation used here is written by Jianwei Yang and Jiasen Lu.