Original Repo: https://github.com/INK-USC/NExT
Installation Instructions
1. `sh create_data_dirs.sh`
2. `pip install -r rqs.txt`
   - If on Linux, you must install the PyTorch build with CUDA 10.1 compatibility (see the example after this list).
   - For more instructions, check here.
3. `python -m spacy download en_core_web_sm`
4. Modify the nltk source code: in `nltk/parse/chart.py`, line 680, in the function `parse`, change
   `for edge in self.select(start=0, end=self._num_leaves, lhs=root):`
   to
   `for edge in self.select(start=0, end=self._num_leaves):`
   (a one-liner to locate the file follows this list).
5. Place TACRED's `train.json`, `dev.json`, and `test.json` into the `data` folder (see the layout example after this list).
6. Run through `Prepare Tacred Data.ipynb` to prepare the TACRED data.
7. `cd training`
8. `python pre_train_find_module.py --build_data` (defaults achieve 90% F1)
9. `python train_next_classifier.py --build_data --experiment_name="insertAnyTextHere"` (defaults achieve 42.4% F1)
   - There are several available params you can set from the command line.
   - Builds data for both strict and soft labeling, but only uses the strict data.
   - Data pre-processing will take some time, due to tuning of the parser.
   - Note: `match_batch_size` == `batch_size` in the bilstm case (see the example after this list).
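For step 2, one way to install a CUDA 10.1 build of PyTorch on Linux is via pip's `+cu101` wheels. The version numbers below are assumptions, not taken from `rqs.txt`; use whichever CUDA 10.1 build matches the PyTorch version this repo expects (see pytorch.org for the full list).

```bash
# Example only: the torch/torchvision versions here are assumptions; pick the
# CUDA 10.1 ("+cu101") builds that match the version pinned in rqs.txt.
pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 -f https://download.pytorch.org/whl/torch_stable.html
```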
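For step 4, the file to edit is the `chart.py` inside your installed nltk package; this one-liner prints its exact path:

```bash
# Print where the installed nltk chart parser lives, so you edit the right copy.
python -c "import nltk.parse.chart; print(nltk.parse.chart.__file__)"
```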
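For step 5, a minimal sketch of copying the splits into place; the source path is a placeholder for wherever your TACRED download lives:

```bash
# Placeholder source path: point it at your local copy of TACRED.
cp /path/to/tacred/train.json /path/to/tacred/dev.json /path/to/tacred/test.json data/
ls data/   # should now include train.json, dev.json, test.json
```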
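For the step 9 note on batch sizes, a hypothetical invocation when training the bilstm; the flag spellings `--match_batch_size` and `--batch_size` are assumptions based on the parameter names above, so check the argument parsing in `training/train_next_classifier.py` for the real ones:

```bash
# Hypothetical flag names; the point is that the two batch sizes are kept equal
# when the bilstm is used.
python train_next_classifier.py --experiment_name="bilstm_run" --match_batch_size=64 --batch_size=64
```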
For both steps 8 and 9, subsequent trials on the same dataset don't need the `--build_data` flag; data that has already been computed is stored to disk and does not need to be computed again.
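For example, for step 9 (other flags omitted):

```bash
# First trial on a dataset: build and cache the preprocessed data, then train.
python train_next_classifier.py --build_data --experiment_name="insertAnyTextHere"
# Subsequent trials on the same dataset: reuse the cached data from disk.
python train_next_classifier.py --experiment_name="insertAnyTextHere"
```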
Directory Descriptions:
CCG_new : everything to do with parsing of explanations and creation of strict and soft labeling functions
- main file: CCG_new/parser.py
models : model files
tests : test files. The tests are a good place to understand many of the functions, though train_util_functions.py doesn't have tests around it yet. To run a test, run `pytest test_file_name.py`, e.g. `pytest ccg_util_test.py` (see the example after this list). Some tests might not pass because the needed data does not exist; read the comments in those tests to understand how to build it.
training : all code to do with training of models
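To run the tests described above, for example (the original instructions don't say which directory to run from; running from inside `tests` is an assumption, and some tests may still fail until the needed data has been built):

```bash
cd tests
# Run a single test module, as in the README example.
pytest ccg_util_test.py
# Or let pytest discover and run every test module in the directory.
pytest
```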