This repo contains the scripts and models to search CORD-19 using S-BERT embeddings via nmslib or faiss.
Sentence embeddings are not perfect for search (see this issue); however, they can provide insight into the data that basic search functionality cannot. There is still room to improve the retrieval of relevant documents.
We're not versed in the medical field, so we deeply encourage any feedback or improvements in the form of issues/PRs!
We've included pre-trained models and FAISS indexes so you can start your own server using the instructions below.
Finally, we provide a front-end that can be used to search through the dataset and extract information via a UI. Installation instructions for the front-end can be found here.
We are currently hosting the server on a GCP instance; if anyone can contribute a more permanent hosting solution, it would be appreciated.
We assume you have installed PyTorch and the necessary CUDA packages from here. We suggest using Conda to make installation easier.
# Install FAISS
conda install faiss-cpu -c pytorch # Other instructions can be found at https://github.com/facebookresearch/faiss/blob/master/INSTALL.md
git clone https://github.com/SeanNaren/CORD-19-ANN.git --recursive
cd CORD-19-ANN/
pip install -r requirements.txt
pip install .
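To sanity check the installation (optional; assumes both FAISS and PyTorch installed successfully), the following should print whether CUDA is visible:
python -c "import faiss, torch; print('CUDA available:', torch.cuda.is_available())"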
We also provide a Docker container:
docker pull seannaren/cord-19-ann
sudo docker run -it --net=host --ipc=host --entrypoint=/bin/bash --rm seannaren/cord-19-ann
We currently offer sentence models fine-tuned from BlueBERT (base uncased) and BioBERT (base cased), each with the corresponding metadata/index. Our hosted server uses S-BlueBERT, however the models are interchangeable.
Download the corresponding model and index files. We suggest using S-BioBERT and assume you have done so for the subsequent commands.
Model | Index | Test MedNLI Accuracy | Test STS Benchmark Cosine Pearson |
---|---|---|---|
S-BioBERT Base Cased | BioBERT_faiss_PCAR128_SQ8 | 0.7482 | 0.7122 |
S-BlueBERT Base Uncased | BlueBERT_faiss_PCAR128_SQ8 | 0.7525 | 0.6923 |
S-BERT Base Cased | N/A | 0.5689 | 0.7265 |
wget https://github.com/SeanNaren/CORD-19-ANN/releases/download/V1.0/cord_19_dataset_formatted_2020_03_27.tar.gz
tar -xzvf cord_19_dataset_formatted_2020_03_27.tar.gz cord_19_dataset_formatted/
We assume you've chosen the s-biobert model; it should be straightforward to swap in any of the other pre-trained models offered in this repo by modifying the paths below.
We recommend using the server but we do offer a simple script to search given a text file of sentences:
echo "These RNA transcripts may be spliced to give rise to mRNAs encoding the envelope (Env) glycoproteins (Fig. 1a)" > sentences.txt
python search_index.py --index_path biobert_mli_faiss_PCAR128_SQ8 --index_type faiss --model_name_or_path s-biobert_base_cased_mli/ --dataset_path cord_19_dataset_formatted/ --input_path sentences.txt --output_path output.json
To start the server:
YOUR_IP=0.0.0.0
YOUR_PORT=1337
python index_server.py --index_path biobert_mli_faiss_PCAR128_SQ8 --index_type faiss --model_name_or_path s-biobert_base_cased_mli/ --dataset_path cord_19_dataset_formatted/ --address $YOUR_IP --port $YOUR_PORT --silent
To test the server:
curl --header "Content-Type: application/json" \
--request POST \
--data '["These RNA transcripts may be spliced to give rise to mRNAs encoding the envelope (Env) glycoproteins (Fig. 1a)"]' \
http://$YOUR_IP:$YOUR_PORT/query
The response is a JSON array with one object per query, each containing the top K hits from the index; an example is shown below:
[
{
"query": "These RNA transcripts may be spliced to give rise to mRNAs encoding the envelope (Env) glycoproteins (Fig. 1a)",
"hits": [
{
"title": "Title",
"authors": [
"..."
],
"abstract": [
"..."
],
"paragraph": "Paragraph that included the hit",
"sentence": "The semantically similar sentence",
"distance": 42,
}
]
}
]
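The same endpoint can also be queried from Python; a minimal sketch using the requests library, assuming the server address above and the response structure shown in the example:
import requests

# POST a list of query sentences to the running index server.
response = requests.post(
    "http://0.0.0.0:1337/query",
    json=["These RNA transcripts may be spliced to give rise to mRNAs "
          "encoding the envelope (Env) glycoproteins (Fig. 1a)"],
)

# One result object per query sentence, each with its top K hits.
for result in response.json():
    for hit in result["hits"]:
        print(hit["distance"], hit["title"], hit["sentence"])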
Generating the embeddings requires a GPU-enabled node, such as a GCP n8 node with an NVIDIA Tesla V100, and at least 20GB of RAM.
Currently we tokenize at the sentence level using SciSpacy; future work may look into paragraph-level tokenization. A sketch of the sentence splitting is shown after the commands below.
mkdir datasets/
python download_data.py
python extract_sentences.py --num_workers 16
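For reference, a minimal sketch of the sentence splitting (assuming the SciSpacy en_core_sci_sm model is installed; the paragraph text is illustrative and extract_sentences.py remains the supported entry point):
import spacy  # SciSpacy models are loaded through the spaCy API

# A small SciSpacy model suited to biomedical text.
nlp = spacy.load("en_core_sci_sm")

paragraph = ("These RNA transcripts may be spliced to give rise to mRNAs "
             "encoding the envelope (Env) glycoproteins. The spliced mRNAs "
             "are then exported from the nucleus for translation.")

# spaCy exposes sentence boundaries on the parsed document.
sentences = [sentence.text for sentence in nlp(paragraph).sents]
print(sentences)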
Using sentence-transformers we can fine-tune either model. BlueBERT offers only uncased models, whereas BioBERT offers a cased model. We've converted them into PyTorch format and included them in the releases; to download:
wget https://github.com/SeanNaren/CORD-19-ANN/releases/download/V1.0/s-biobert_base_cased_mli.tar.gz
wget https://github.com/SeanNaren/CORD-19-ANN/releases/download/V1.0/s-bluebert_base_uncased_mli.tar.gz
tar -xzvf s-biobert_base_cased_mli.tar.gz
tar -xzvf s-bluebert_base_uncased_mli.tar.gz
python generate_embeddings.py --model_name_or_path s-biobert_base_cased_mli/ --embedding_path biobert_embeddings.npy --device cuda --batch_size 256 # If you want to use biobert
python generate_embeddings.py --model_name_or_path s-bluebert_base_uncased_mli/ --embedding_path bluebert_embeddings.npy --device cuda --batch_size 256 # If you want to use bluebert
You can also use the standard pre-trained model from the S-BERT repo as shown below; however, we suggest using the fine-tuned models offered in this repo.
python generate_embeddings.py --model_name_or_path bert-base-nli-mean-tokens --embedding_path pretrained_embeddings.npy --device cuda --batch_size 256
This takes a few hours on a V100 GPU.
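As a quick sanity check, the generated .npy file can be loaded with NumPy; it should contain one embedding per extracted sentence (the file name assumes the BioBERT run above):
import numpy as np

embeddings = np.load("biobert_embeddings.npy")
# Expect a (num_sentences, embedding_dim) array; BERT-base models produce 768-dim vectors.
print(embeddings.shape)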
If you'd like to include the MedNLI dataset during training, you'll need to download it from here. Access is credentialed, which takes some effort and a waiting period of up to two weeks.
Once training has finished, the model is saved to the output/ folder by default, where you'll find checkpoints such as output/training_nli/biobert-2020-03-30_10-51-49/. Use this as the model path when generating your embeddings.
wget https://github.com/SeanNaren/CORD-19-ANN/releases/download/V1.0/biobert_cased_v1.1.tar.gz
wget https://github.com/SeanNaren/CORD-19-ANN/releases/download/V1.0/bluebert_base_uncased.tar.gz
tar -xzvf biobert_cased_v1.1.tar.gz
tar -xzvf bluebert_base_uncased.tar.gz
mkdir datasets/
python sentence-transformers/examples/datasets/get_data.py --output_path datasets/
python sentence-transformers/examples/training_nli_transformers.py --model_name_or_path biobert_cased_v1.1/
python sentence-transformers/examples/training_nli_transformers.py --model_name_or_path bluebert_base_uncased/ --do_lower_case
# Training with medNLI
python sentence-transformers/examples/training_nli_transformers.py --model_name_or_path biobert_cased_v1.1/ --mli_dataset_path path/to/mednli/
python sentence-transformers/examples/training_nli_transformers.py --model_name_or_path bluebert_base_uncased/ --mli_dataset_path path/to/mednli/ --do_lower_case
To exclude MedNLI from training but still evaluate on it (this still requires the MedNLI dataset), use the --exclude_mli flag.
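For orientation, the sketch below shows roughly what NLI fine-tuning with sentence-transformers involves. It is a minimal illustration rather than the bundled script: the module names assume a recent sentence-transformers release, and the two training pairs stand in for the real NLI data.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses, InputExample

# Wrap the BERT checkpoint with mean pooling to produce sentence embeddings.
word_embedding = models.Transformer("biobert_cased_v1.1/")
pooling = models.Pooling(word_embedding.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding, pooling])

# NLI training pairs: (premise, hypothesis) with a 3-way label.
train_examples = [
    InputExample(texts=["A virus infects a cell.", "The cell is infected."], label=0),
    InputExample(texts=["A virus infects a cell.", "The cell is healthy."], label=2),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.SoftmaxLoss(
    model,
    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
    num_labels=3,
)

# Checkpoints land under output/ as described above; set output_path explicitly here.
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1,
          output_path="output/training_nli_sketch/")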
Either FAISS or nmslib can be used, selected by the --index_type parameter below. We've exposed the FAISS config string for modifying the index; more details about selecting an index can be found here.
python create_index.py --output_path index --embedding_path pretrained_embeddings.npy --index_type faiss # Swap to biobert_embeddings.npy or bluebert_embeddings.npy if using the fine-tuned embeddings
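The config string maps onto FAISS's index factory; as an illustration (not the create_index.py code path), a PCAR128,SQ8 index over the generated embeddings can be built directly like this:
import faiss
import numpy as np

embeddings = np.load("biobert_embeddings.npy").astype(np.float32)

# "PCAR128,SQ8": PCA with a random rotation down to 128 dimensions,
# followed by an 8-bit scalar quantizer over the reduced vectors.
index = faiss.index_factory(embeddings.shape[1], "PCAR128,SQ8")
index.train(embeddings)
index.add(embeddings)

# Nearest neighbours of the first sentence embedding (returns distances and row ids).
distances, ids = index.search(embeddings[:1], 5)
print(ids, distances)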
We also took the example clustering script from sentence-transformers and added it to this repository for use with the pre-trained models. An example:
python cluster_sentences.py --input_path sentences.txt --model_name_or_path biobert_cased_v1.1/ --device cpu
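The underlying idea is to embed the sentences and group the embeddings; a rough sketch with sentence-transformers and scikit-learn k-means (the sentence model and cluster count are illustrative, and you need at least as many sentences as clusters):
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# One sentence per line, as in the search example above.
with open("sentences.txt") as f:
    sentences = [line.strip() for line in f if line.strip()]

# Embed with one of the fine-tuned sentence models, then cluster the embeddings.
model = SentenceTransformer("s-biobert_base_cased_mli/")
embeddings = model.encode(sentences)

kmeans = KMeans(n_clusters=5, random_state=0).fit(embeddings)
for label, sentence in zip(kmeans.labels_, sentences):
    print(label, sentence)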
There is also a more interactive version available using the Google Colab demo:
Thanks to the authors of the various libraries that made this possible!