Mini-project for the deep learning course based on A Structured Self-attentive Sentence Embedding by Lin et al.
The code is adapted from Freda Shi's repo.
To generate the dataset, you will need to install spaCy and run:
python tokenizer-yelp.py --input [Yelp dataset] --output [output path, will be a json file] --dict [output dictionary path, will be a json file]
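For reference, here is a minimal sketch of what such a tokenization step could look like. It is hypothetical: the real tokenizer-yelp.py may use different field names, options, and output formats. It assumes the Yelp dataset is a JSON-lines file with "text" and "stars" fields.

```python
# Hypothetical sketch of tokenizer-yelp.py; the real script may differ.
import argparse
import json

import spacy

parser = argparse.ArgumentParser()
parser.add_argument("--input", required=True)   # raw Yelp reviews (JSON lines, assumed)
parser.add_argument("--output", required=True)  # tokenized output
parser.add_argument("--dict", required=True)    # word -> index dictionary
args = parser.parse_args()

nlp = spacy.blank("en")  # tokenizer only, no full pipeline needed
dictionary = {}

with open(args.input) as fin, open(args.output, "w") as fout:
    for line in fin:
        review = json.loads(line)
        tokens = [t.text.lower() for t in nlp(review["text"])]
        for tok in tokens:
            dictionary.setdefault(tok, len(dictionary))
        fout.write(json.dumps({"text": tokens, "label": review["stars"]}) + "\n")

with open(args.dict, "w") as f:
    json.dump(dictionary, f)
```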
A small version of the tokenized dataset is available here.
To get the GloVe vectors as PyTorch tensors, you can use torchtext (see here). For convenience, I have already done it for glove.6B.200d.txt.pt.
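A minimal sketch of that conversion with torchtext is below. The saved structure (a word-to-vector dictionary) is an assumption; adjust it to whatever format train.py actually expects.

```python
# Sketch: build glove.6B.200d.txt.pt from the pretrained GloVe vectors.
import torch
from torchtext.vocab import GloVe

glove = GloVe(name="6B", dim=200)  # downloads and caches glove.6B.200d.txt

# Save as a word -> vector dictionary (assumed on-disk format).
word_vectors = {word: glove.vectors[idx] for word, idx in glove.stoi.items()}
torch.save(word_vectors, "glove.6B.200d.txt.pt")
```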
Now, provided you have downloaded everything to Colab, training can be launched with:
python train.py data.train_data="/content/small/train_tok.json" data.val_data="/content/small/val_tok.json" data.test_data="/content/small/test_tok.json" data.dictionary="/content/small/dict_review_short.json" data.word_vector="/content/glove.6B.200d.txt.pt" data.save="/content/self-attentive-sentence-embedding/models/model-small-6B.pt"
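Before launching, you can optionally run a quick sanity check that all the files referenced in the command above are in place (paths copied from the command; adapt if you stored the data elsewhere):

```python
# Verify that the data, dictionary, and word-vector files exist on Colab.
from pathlib import Path

paths = [
    "/content/small/train_tok.json",
    "/content/small/val_tok.json",
    "/content/small/test_tok.json",
    "/content/small/dict_review_short.json",
    "/content/glove.6B.200d.txt.pt",
]
for p in paths:
    print(p, "OK" if Path(p).exists() else "MISSING")
```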