The bachelor's thesis is available via this link.
The models here are trained to detect uncertain words and sentences in text, as well as the class of uncertainty (see the annotation sketch after the list):
- Dynamic. Indicates necessity, dispositions, external circumstances, wishes, intents, plans, and desires. Example: I have to go.
- Doxastic. Expresses the speaker’s beliefs. Example: He believes that the Earth is flat.
- Investigation. Propositions whose truth value cannot be stated until further analysis is done. Example: We examined the role of NF-kappa B in protein activation.
- Conditional. Used for conditionals. Example: If it rains, we’ll stay in.
- Epistemic. Cases where, based on the available knowledge, the proposition can be judged neither true nor false. Example: It may be raining.
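For illustration, here is a hypothetical token-level annotation in the style the models predict. The BIO-style tag scheme and class names shown are assumptions for illustration; the actual tag set is defined by `labels.txt` in this repository:

```python
# Hypothetical example of cue-level uncertainty tags (BIO scheme assumed;
# the real tag set comes from labels.txt in this repository).
sentence = ["It", "may", "be", "raining", "."]
labels = ["O", "B-Epistemic", "O", "O", "O"]  # "may" marks epistemic uncertainty

for token, label in zip(sentence, labels):
    print(f"{token}\t{label}")
```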
The dataset used here is the re-annotated CoNLL-2010 shared task dataset for uncertainty detection (Szeged Uncertainty Corpus). It's available for download here. The dataset consists of two main parts:
- Wikipedia (WikiWeasel).
- Biological (BioScope).
The goal of this project was to compare the performance of two different BERT training procedures:
- Train domain-specific models (SciBERT and BioBERT) on the biological part of the dataset.
- Train the general-domain BERT on the Wikipedia part and then transfer it to the biological part.
For each approach, 20 models with different seeds are trained, and their F1 scores are compared with statistical tests (sketched below).
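A minimal sketch of such a seed-wise comparison. The scores are placeholders and Welch's t-test is an illustrative choice, not necessarily the test used in the thesis:

```python
# Compare per-seed F1 scores of two training procedures.
# The scores below are randomly generated placeholders, not thesis results.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
f1_domain_specific = rng.normal(0.88, 0.01, size=20)  # e.g. SciBERT on BioScope
f1_transfer = rng.normal(0.87, 0.01, size=20)         # e.g. BERT + Wikipedia transfer

# Welch's t-test: does not assume equal variances across the two samples
t_stat, p_value = stats.ttest_ind(f1_domain_specific, f1_transfer, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")  # p >= 0.05: no significant difference
```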
I showed that on this dataset it's not possible to conclude which approach yields better results. More importantly, SciBERT almost always outperforms BioBERT and doesn't benefit as much from the additional Wikipedia data. If you decide to train a domain-specific language model, train it from a random initialization with a domain-specific vocabulary rather than starting from the BERT initialization.
The SciBERT model for uncertainty detection on biological texts is available on my Google Drive.
The demo allows you to experiment with the model and annotate arbitrary text for uncertainty.
Instructions to run the demo:
- Clone the repository:

```bash
git clone https://github.com/PeterZhizhin/BERTUncertaintyDetection
```
- Create a virtualenv and install dependencies:

```bash
python -m venv .env
source .env/bin/activate
pip install -U spacy
python -m spacy download en_core_web_sm
pip install aiohttp jinja2 aiohttp-jinja2 transformers torch torchvision
```
- Go to the `demo` folder:

```bash
cd demo
```
- Download the model, extract the archive, and remember the path to the model folder.
- Run the server:

```bash
python demo_server.py --model_path [PATH TO FOLDER WITH THE MODEL] --labels_path ../labels.txt
```
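If you prefer to query the model without the demo server, here is a minimal sketch using the Hugging Face `transformers` API. It assumes the downloaded folder is a standard token-classification checkpoint and that `labels.txt` contains one label per line; both are assumptions, so adjust to your setup:

```python
# Sketch: run the downloaded model on a sentence directly.
# Paths and the labels.txt format are assumptions, not guarantees.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_dir = "[PATH TO FOLDER WITH THE MODEL]"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForTokenClassification.from_pretrained(model_dir)
model.eval()

with open("../labels.txt") as f:
    labels = [line.strip() for line in f if line.strip()]

text = "We examined the role of NF-kappa B in protein activation."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Map each wordpiece to its predicted uncertainty label
token_ids = inputs["input_ids"][0].tolist()
preds = logits.argmax(dim=-1)[0].tolist()
for token_id, pred in zip(token_ids, preds):
    token = tokenizer.convert_ids_to_tokens(token_id)
    print(token, labels[pred] if pred < len(labels) else pred)
```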
All model training was done on a Slurm cluster at the National Research University Higher School of Economics, so by default all training scripts require a Slurm cluster with GPUs. If you wish to train the models without Slurm, adapt the training scripts, e.g. replace `sbatch --wait` with `bash`, provided the scripts don't rely on `srun`.
- Clone the repo:

```bash
git clone https://github.com/PeterZhizhin/BERTUncertaintyDetection
```
- Install dependencies:

```bash
python -m venv .env
source .env/bin/activate
pip install -U spacy
python -m spacy download en_core_web_sm
pip install aiohttp jinja2 aiohttp-jinja2 transformers torch torchvision lxml
```
- Download the dataset, extract it, and place it in the `uncertainty_dataset` folder.
- Make all shell scripts executable:
```bash
chmod +x *.sh
chmod +x huggingface_models/*.sh
```
- Create the datasets for training and evaluation:

```bash
./create_biomedical_ner_dataset_train_test.sh
./create_biomedical_classification_dataset_train_test.sh
./create_wiki_classification_dataset_train_test.sh
./create_wiki_ner_dataset_train_test.sh
```
- Train all models:

```bash
cd huggingface_models
sbatch --wait ./train_all_models_on_wiki_and_bio_slurm.sh
sbatch --wait ./transfer_all_models_on_wiki_and_bio_slurm.sh
sbatch --wait ./train_all_classification_models_on_wiki_and_bio_slurm.sh
sbatch --wait ./transfer_all_classification_models_on_wiki_and_bio_slurm.sh
```
- All models are now available in the `ner_experiments` and `classification_experiment` folders.