
# A Structured Self-attentive Sentence Embedding

Mini-project for the deep learning course, based on [A Structured Self-attentive Sentence Embedding](https://arxiv.org/abs/1703.03130) by Lin et al.

The code has been adapted from Freda Shi's repository.

## Preprocessing

To generate the dataset, you will need to install spaCy and run:

```bash
python tokenizer-yelp.py --input [Yelp dataset] --output [output path, will be a json file] --dict [output dictionary path, will be a json file]
```
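For instance, a concrete invocation might look like this (the file names below are purely illustrative):

```bash
python tokenizer-yelp.py \
    --input yelp_academic_dataset_review.json \
    --output train_tok.json \
    --dict dict_review.json
```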

A small version of the tokenized dataset is available here.

In order to get the GloVe vectors as PyTorch tensors, you can use torchtext, see here. For convenience, I did it already and provide the result as `glove.6B.200d.txt.pt`.
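If you want to reproduce that file yourself, here is a minimal sketch, assuming a torchtext version that ships `torchtext.vocab.GloVe`; loading the vectors once triggers torchtext's caching, which writes the `.pt` file:

```python
from torchtext.vocab import GloVe

# Parses glove.6B.200d.txt (downloaded on first use) and caches the
# result as glove.6B.200d.txt.pt inside the cache directory.
glove = GloVe(name="6B", dim=200, cache=".vector_cache")

print(glove.vectors.shape)  # expected: torch.Size([400000, 200])
```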

## Running on Colab

Now, provided you have downloaded everything to Colab, training can be run with:

```bash
python train.py \
    data.train_data="/content/small/train_tok.json" \
    data.val_data="/content/small/val_tok.json" \
    data.test_data="/content/small/test_tok.json" \
    data.dictionary="/content/small/dict_review_short.json" \
    data.word_vector="/content/glove.6B.200d.txt.pt" \
    data.save="/content/self-attentive-sentence-embedding/models/model-small-6B.pt"
```
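For reference, a hypothetical Colab setup cell could look as follows; the clone target matches the `data.save` path above, but the two download lines are placeholders you must fill in with wherever you host the tokenized dataset and the GloVe tensor file:

```bash
# Hypothetical Colab setup cell.
!git clone https://github.com/dataflowr/Project-self-attentive-sentence-embedding.git /content/self-attentive-sentence-embedding
%cd /content/self-attentive-sentence-embedding
# Placeholders: fetch the small tokenized dataset and the GloVe tensors
# from wherever you stored them.
# !wget <URL-to-small-dataset-archive> -O small.zip && unzip small.zip -d /content
# !wget <URL-to-glove.6B.200d.txt.pt> -O /content/glove.6B.200d.txt.pt
```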