forked from rossgoodwin/neural-language-model
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME.txt
64 lines (48 loc) · 2.5 KB
/
README.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
Approach based upon language model in Bengio et al ICML 09 "Curriculum Learning".
You will need my common python library:
http://github.com/turian/common
and my textSNE wrapper for t-SNE:
http://github.com:turian/textSNE
You will need Murmur for hashing.
easy_install Murmur
To train a monolingual language model, probably you should run:
[edit hyperparameters.language-model.yaml]
./build-vocabulary.py
./train.py
To train word-to-word multilingual model, probably you should run:
cd scripts; ln -s hyperparameters.language-model.sample.yaml s hyperparameters.language-model.yaml
# Create validation data:
./preprocess-validation.pl > ~/data/SemEval-2-2010/Task\ 3\ -\ Cross-Lingual\ Word\ Sense\ Disambiguation/validation.txt Tokenizer v3
# [optional: Lemmatize]
Tadpole --skip=tmp -t ~/dev/python/mt-language-model/neural-language-model/data/filtered-full-bilingual/en-nl/filtered-training.nl | perl -ne 's/\t/ /g; print lc($_);' | chop 3 | from-one-line-per-word-to-one-line-per-sentence.py > ~/dev/python/mt-language-model/neural-language-model/data/filtered-full-bilingual-lemmas/en-nl/filtered-training-lemmas.nl
#
[TODO:
* Initialize using monolingual language model in source language.
* Loss = logistic, not margin.
]
# [optional: Run the following if your alignment for language pair l1-l2
# is in form l2-l1]
./scripts/preprocess/reverse-alignment.pl
./w2w/build-vocabulary.py
# Then see the output with ./w2w/dump-vocabulary.py, to see if you want
# to adjust the w2w minfreq hyperparameter
./w2w/build-target-vocabulary.py
# Then see the output with ./w2w/dump-target-vocabulary.py
./w2w/build-initial-embeddings.py
# [optional: Filter the corpora only to include sentences with certain
# focus words.]
# You want to make sure this happens AFTER
# ./w2w/build-initial-embeddings.py, so you have good embeddings for words
# that aren't as common in the filtered corpora.
./scripts/preprocess/filter-sentences-by-lemma.py
# You should then move the filtered corpora to a new data directory.]
#[optional: This will cache all the training examples onto disk. This will
# happen automatically during training anyhow.]
./scripts/w2w/build-example-cache.py
./w2w/train.py
TODO:
* sqrt scaling of SGD updates
* Use normalization of embeddings?
* How do we initialize embeddings?
* Use tanh, not softsign?
* When doing SGD on embeddings, use sqrt scaling of embedding size?