Infernal (INFERence in NAtural Language) is a model for natural language inference (also known as recognizing textual entailment) based on handcrafted features. It was implemented primarily for Portuguese, but most of it can be reused for other languages.
If you publish research using or expanding on Infernal, please cite:

- Erick Fonseca and Sandra M. Aluísio. Syntactic Knowledge for Natural Language Inference in Portuguese. In: Proceedings of the 13th International Conference on the Computational Processing of Portuguese (PROPOR 2018). 2018. (accepted for publication)
```
@inproceedings{infernal,
  author = {Erick Fonseca and Sandra M. Alu\'isio},
  title = {{Syntactic Knowledge for Natural Language Inference in Portuguese}},
  year = {2018},
  booktitle = {Proceedings of the 13th International Conference on the Computational Processing of Portuguese (PROPOR 2018)}
}
```
In order to run the full preprocessing pipeline, you will need, besides the libraries in `requirements.txt`, the following:
- A working installation of Stanford CoreNLP, not necessarily on the same machine running Infernal. You will need trained models for POS tagging and parsing in Portuguese; these are available here.
- The DELAF dictionary. By default, Infernal expects the dictionary file `Delaf2015v04.dic` to be in a `data` directory under the Infernal root.
- The spaCy Portuguese model. Download it with `python -m spacy download pt`.
- OpenWordNet-PT. Copy the `own-pt.nt` file (originally gzipped) to the `data` directory under the Infernal root.
The `config.py` file (under the `infernal` directory) has some file names and endpoints to be configured. Change the file names to match where you saved the required files above, and set the CoreNLP URL and port. The paths to the POS tagger and dependency parser models are directories inside the CoreNLP root folder (which, again, may or may not be the machine running Infernal).
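For reference, here is a minimal sketch of the kind of settings involved. The variable names and values below are illustrative assumptions, not the actual names defined in `config.py`; check the file itself before editing.

```python
# Illustrative sketch only: variable names and values here are assumptions;
# use the names actually defined in config.py.
corenlp_url = 'http://localhost'
corenlp_port = 9000

# Model paths are directories inside the CoreNLP root folder
pos_tagger_path = 'models/pos-tagger/portuguese'
dependency_parser_path = 'models/dep-parser/portuguese'

# Resource files saved in the steps above
delaf_path = 'data/Delaf2015v04.dic'
ownpt_path = 'data/own-pt.nt'
```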
Then, install the module (maybe with `--user` if you're in a shared environment and not using virtualenv):

```
python setup.py install
```
First, it is a good idea to convert the original OpenWordNet-PT file from NT to a pre-processed pickle, which is much faster to read. NT is a generic data format, while the pickle has everything in the format used by Infernal.
```
python scripts/serialize-wordnet.py data/own-pt.nt data/own-pt.pickle
```
Next, take the raw pairs and tokenize and parse them, finding lemmas and named entities. Make sure the CoreNLP endpoint is running, then run:

```
python scripts/preprocess.py pairs.xml pairs.pickle
```
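If you do not have a CoreNLP endpoint running yet, the standard CoreNLP server can be started from the CoreNLP root folder, for example as below; the port must match the one configured in `config.py`:

```
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000
```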
You should repeat this process for training, validation and test data.
Now that you have nicely parsed pairs, you can compute the features for them. Feature extraction takes some time, but after computing and saving the features, you can try different classifiers without repeating it.

```
python scripts/extract-features.py pairs.pickle word-embeddings.npy features.npy [--load-label-dict DICT] [--save-label-dict DICT]
```
The label dict is a simple dictionary converting labels (entailment, neutral, paraphrase) to number codes. The first time you run `extract-features.py`, save the label dictionary to your data directory. Then, when you run it again for the validation and test data, load the dict so the labels get the same codes.
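For illustration, the saved dictionary is just a plain label-to-code mapping along these lines (the actual codes are whatever `extract-features.py` assigned on the first run):

```python
# Illustrative only: the real codes depend on the first extract-features.py run
label_dict = {'entailment': 0, 'neutral': 1, 'paraphrase': 2}
```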
The word embeddings should be a 2-d array with shape `(vocabulary_size, embedding_size)`, saved in the numpy `.npy` format. Additionally, a `.txt` file in the same directory must contain the embedding vocabulary.
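As a minimal sketch, these two files could be produced as below, with random vectors standing in for real pre-trained embeddings; the one-word-per-line vocabulary format is an assumption, so check what the feature extraction code expects:

```python
import numpy as np

# Toy vocabulary and random vectors; real embeddings would come from a
# pre-trained model such as word2vec, GloVe or fastText.
vocabulary = ['casa', 'gato', 'correr', 'feliz']
embeddings = np.random.uniform(-0.1, 0.1, (len(vocabulary), 300))

# 2-d array with shape (vocabulary_size, embedding_size)
np.save('word-embeddings.npy', embeddings)

# Vocabulary file in the same directory (one word per line is an assumption)
with open('word-embeddings.txt', 'w', encoding='utf8') as f:
    f.write('\n'.join(vocabulary))
```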
Finally, train a model:
```
python scripts/train-infernal.py features.npy log-regression model/ -s
```
Run `python scripts/train-infernal.py` to see the available options. Since choosing the right algorithm and tuning the model can be quite complex, you may want to change something in the training code.
To evaluate model performance:
```
python scripts/evaluate.py features.npy model/
```
The `features.npy` file (or whatever you generated with `extract-features.py`) has both features and classes, and is faster to read than parsing an XML file.
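As a rough illustration, the file can be inspected with numpy; the exact internal layout of features and class labels is defined by `extract-features.py`:

```python
import numpy as np

# allow_pickle=True in case the file stores Python objects rather than a
# plain array; the internal layout is up to extract-features.py.
data = np.load('features.npy', allow_pickle=True)
print(type(data))
```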