To reproduce the results, please run the following steps:
1. Run utils/make_vocab.py to parse the labels.
2. Run python hw4.py --train --clip until epoch 53.
3. Run python hw4.py --train --clip --load --forcing (0.6,0.6,10) --epoch 53 --load Model1_b32lr0.0005s100decay0Adamdrop0.4le3he256hd512emb256att128forcing(0.9,0.8,20)clip until epoch 65.
4. Run python hw4.py --train --clip --load --forcing (0.5,0.5,10) --epoch 65 --load Model1_b32lr0.0005s100decay0Adamdrop0.4le3he256hd512emb256att128forcing(0.6,0.6,10)clip until epoch 76.
All other hyperparameters are set to their defaults in the scripts.
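The --load / --epoch flags resume training from a previously saved model. For reference only, resuming from a saved checkpoint in PyTorch generally looks like the minimal sketch below; the function name, dictionary keys, and file layout here are hypothetical and not the actual hw4.py code:

```python
# Hypothetical sketch of checkpoint resuming, NOT the actual hw4.py logic.
# Assumes the model and optimizer states were saved with torch.save as a dict.
import torch

def resume_training(model, optimizer, checkpoint_path, device="cuda"):
    """Load a saved checkpoint and return the epoch to resume from."""
    checkpoint = torch.load(checkpoint_path, map_location=device)
    model.load_state_dict(checkpoint["model_state"])
    optimizer.load_state_dict(checkpoint["optimizer_state"])
    return checkpoint["epoch"] + 1  # continue from the next epoch
```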
The best performance is achieved with the following architecture (the key components are sketched in the code after this list):
- The Listen, Attend and Spell (LAS) model with key-value dot-product attention (template provided by the course bootcamp).
- Locked dropout applied between consecutive layers in the encoder.
- Weight tying applied in the decoder.
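A minimal PyTorch sketch of these components follows; it is illustrative only, and the class names and dimensions are assumptions rather than the exact code in models.py:

```python
import torch
import torch.nn as nn

class LockedDropout(nn.Module):
    """Variational (locked) dropout: one mask per sequence, reused at every time step."""
    def __init__(self, p=0.4):
        super().__init__()
        self.p = p

    def forward(self, x):  # x: (batch, time, features)
        if not self.training or self.p == 0:
            return x
        # Sample the mask once per sequence and broadcast it over the time dimension.
        mask = x.new_empty(x.size(0), 1, x.size(2)).bernoulli_(1 - self.p) / (1 - self.p)
        return x * mask

class DotAttention(nn.Module):
    """Key-value dot-product attention over the encoder outputs."""
    def forward(self, query, keys, values, pad_mask):
        # query: (batch, d); keys, values: (batch, time, d); pad_mask: (batch, time), True at padding.
        energy = torch.bmm(keys, query.unsqueeze(2)).squeeze(2)          # (batch, time)
        energy = energy.masked_fill(pad_mask, float("-inf"))             # ignore padded frames
        attention = torch.softmax(energy, dim=1)                         # (batch, time)
        context = torch.bmm(attention.unsqueeze(1), values).squeeze(1)   # (batch, d)
        return context, attention

class TiedDecoderHead(nn.Module):
    """Decoder fragment showing weight tying between the character embedding and the output layer."""
    def __init__(self, vocab_size, embed_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.char_prob = nn.Linear(embed_dim, vocab_size)
        self.char_prob.weight = self.embedding.weight  # weight tying: the two layers share parameters
```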
The loss is masked cross-entropy and the optimizer is Adam (the masked loss is sketched below).
Decoding is greedy (also sketched below).
The learning rate is 5e-4 with no weight decay, and the batch size is 32. The teacher forcing schedule is described above. Neither Gumbel noise nor beam search is used.
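For reference, masking the cross-entropy so that padded target positions do not contribute can be done as in the sketch below; this is an illustrative version, not necessarily the exact implementation in hw4.py:

```python
import torch
import torch.nn as nn

def masked_cross_entropy(logits, targets, target_lengths):
    """Cross-entropy averaged only over non-padded target positions.

    logits: (batch, max_len, vocab_size); targets: (batch, max_len);
    target_lengths: (batch,) true lengths before padding.
    """
    batch, max_len, vocab_size = logits.shape
    criterion = nn.CrossEntropyLoss(reduction="none")
    loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))  # (batch * max_len,)
    # Mask is 1 for real tokens and 0 for padding.
    mask = (torch.arange(max_len, device=targets.device)[None, :] < target_lengths[:, None]).float()
    loss = loss.reshape(batch, max_len) * mask
    return loss.sum() / mask.sum()
```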
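Greedy decoding simply feeds the argmax prediction back as the next decoder input at every step. A minimal sketch follows, where decoder_step is a hypothetical callable standing in for one step of the decoder:

```python
import torch

def greedy_decode(decoder_step, state, sos_idx, batch_size, device, max_len=250):
    """Greedy decoding: at every step, feed the argmax prediction back as the next input.

    decoder_step(prev_tokens, state) is a hypothetical callable returning
    (logits, new_state), with logits of shape (batch_size, vocab_size).
    """
    prev_tokens = torch.full((batch_size,), sos_idx, dtype=torch.long, device=device)
    predictions = []
    for _ in range(max_len):
        logits, state = decoder_step(prev_tokens, state)
        prev_tokens = logits.argmax(dim=-1)    # greedy choice, no beam search
        predictions.append(prev_tokens)
    # (batch_size, max_len); sequences are truncated at the first <eos> in post-processing.
    return torch.stack(predictions, dim=1)
```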
I tested different model architectures, some additional techniques introduced in HW4P1 (e.g., DropConnect), and various hyperparameters. The models I tested can be found in models.py. Furthermore, I found that removing all packing, i.e., using padded sequences throughout the network, actually runs faster on my GPUs (see the sketch below).
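To illustrate the difference, the sketch below contrasts the usual pack/unpack pattern with running the LSTM directly on the padded batch; the layer sizes and shapes are arbitrary and only meant as an example:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

lstm = nn.LSTM(input_size=40, hidden_size=256, batch_first=True)
x = torch.randn(32, 100, 40)             # padded batch: (batch, max_time, features)
lengths = torch.randint(50, 101, (32,))  # true sequence lengths

# Standard approach: pack, run the LSTM, unpack.
packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
packed_out, _ = lstm(packed)
out_packed, _ = pad_packed_sequence(packed_out, batch_first=True)

# Padded-only approach: run the LSTM directly on the padded batch.
# Outputs at padded time steps are meaningless and must be masked out later
# (e.g., in attention and in the loss), but there is no pack/unpack overhead.
out_padded, _ = lstm(x)
```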