
Support for training word embeddings in Mallet is included in the current development release on Github. It is not available in the 2.0.8 release.

Importing data

Unlike word2vec or fasttext, Mallet separates data import from model training. Word embeddings can be trained from data files in the same format used for topic models. The main difference is that for embeddings we typically do not remove high-frequency words, as these can provide information about the syntactic function of words.

bin/mallet import-file --input history.txt --keep-sequence --output history.seq

Training embeddings

To train embeddings with default parameters and save vectors to a file called vectors.txt:

bin/mallet run cc.mallet.topics.WordEmbeddings --input history.seq --output vectors.txt

You will first see a few descriptive statistics of the collection and then, as the algorithm proceeds, information about progress. The progress line prints about every five seconds and shows:

  • the number of word tokens processed
  • the number of milliseconds run so far
  • the ratio of these two values (tokens per second)
  • the average value of the vector elements, which are initialized to small values and get bigger as we train
  • a "step" value that roughly indicates how much we are changing the vectors, which decreases as the learning rate decreases and the quality of the vectors increases

Options

There are several options, most of which are inherited from other implementations. We don't always have good reasons for the values of these options, so I've tried to indicate which ones have meaningful effects. A fuller example command and a few other sketches appear at the end of this page.

  • input A Mallet token sequence file.

  • output A space-delimited text file with one line per word: the word string comes first, followed by the vector element values. (A short sketch showing one way to read this format appears at the end of this page.)

  • output-context The word2vec algorithm actually trains two sets of embeddings against each other, of which only one is normally kept; this option writes the second (context) set to a separate file. The two sets are functionally identical mirror images of each other. Unlike GloVe, where the two sets of vectors are almost identical, here averaging the two sets produces vectors quite different from either of the originals.

  • output-stats-prefix For compatibility with other embedding systems, you can optionally include a line at the top of the file that lists the number of words in the vocabulary and the number of elements in each vector.

  • num-dimensions This option controls the number of dimensions of the latent vectors. Since people don't usually look at the vectors directly, this value tends to be set to a large-ish round number like 100 or 300. The default here is 50, mostly because that's often good enough and fast to train. With topic models this parameter is all anyone wants to talk about, but for these models no one seems to care.

  • window-size This option controls the width of the sliding context window around each word. The default is 5, which means look five tokens to the left and five tokens to the right of the current word. Closer words have more weight. Using a smaller value like 2 will focus the algorithm on the immediate context of each word, so vectors will tend to encode information about the syntactic function of words in their context. For example, a noun might be close to another noun that occurs with similar determiners and prepositions. Using a larger value like 10 will make less distinction between words that are near a word and words that are immediately adjacent to a word, so vectors will tend to encode more semantic information.

  • num-threads If you increase this to, say, 4, the collection will be partitioned into four equal sections and each thread will work on its own section. This can make things faster, of course, but it can also improve the vectors by adding some randomness.

  • num-iters This sets how many times the algorithm sweeps through the data. Using more iterations causes the learning rate to decrease more slowly. Values between 3 and 5 seem good enough.

  • frequency-factor In natural language, frequent words ("the") occur exponentially more often than less frequent words ("word"). It appears to be useful to downsample the top words in order to focus the algorithm on more content-bearing words. Values between 0.001 and 0.00001 seem to be good.

  • num-samples The SGNS objective wants two words that occur in close proximity to have a word vector and a context vector that are close to each other, but random pairs of words to have vectors that are far apart. This option sets how many random ("negative") pairs are sampled for each observed pair, which changes the relative strength of the attraction and repulsion forces. It has a direct effect on running time: more samples, longer running. The default is 5, which seems to be a good number. (The objective is written out after this list.)

  • example-word To get a sense of how the algorithm is proceeding, you can specify a query word. Each time the algorithm reports progress it will print the ten words with the closest (in cosine similarity) vectors to your query word. For example if the query is london I get:

      1.000000	3365	london
      0.877789	6412	paris
      0.864460	8981	boston
      0.860364	9143	chicago
      0.858702	7044	philadelphia
      0.843321	3584	york
      0.840233	6377	jersey
      0.834708	37473	macmillan
      0.829747	15344	angeles
      0.821647	9142	hall
    

    In this case the closest word to the query is the query itself, with 1.0 cosine similarity. I'm including this because it's a good check -- if the query isn't the only word with 1.0 similarity, something is wrong. The others seem good: major cities with strong connections to London. Note that the query is case sensitive. If I ask for London instead of london it can't find the word and silently ignores it. Also be on the lookout for similarities that are "too high". Values of 0.98 or higher usually indicate something is wrong. Good query words tend to be well represented in the corpus and have several similar words.

  • ordering Embeddings can be fussy. Especially for words that don't occur very often, or that occur a lot in a few specific parts of the collection, seemingly small changes to the input corpus can have big effects on vector similarity. Artificially adding some randomness to the collection can help to surface this variability. (See Antoniak and Mimno, "On the stability...", NAACL 2018 for details.) With the default value linear the algorithm reads all the documents in the order they were originally presented, which tends to amplify the impact of early documents. The value shuffled selects a random order, so that early documents have less weight. The value random implements a "bootstrap" sample: documents are sampled with replacement from the original collection, so a document may occur multiple times or not at all. Especially if your collection is smaller than about 10 million word tokens, you should consider running about 20 bootstrap samples (an example command appears below).
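
To produce a single bootstrap sample as described under ordering, a command along the following lines can be run repeatedly, changing the output file name each time; the file name here is only illustrative:

bin/mallet run cc.mallet.topics.WordEmbeddings --input history.seq --output vectors-bootstrap-01.txt --ordering random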
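
Putting several of the options above together, a fuller training command might look like this; the specific values are only illustrative, not recommendations:

bin/mallet run cc.mallet.topics.WordEmbeddings --input history.seq --output vectors.txt --num-dimensions 100 --window-size 5 --num-iters 3 --frequency-factor 0.0001 --num-threads 4 --example-word london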
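
For reference, the objective sketched under num-samples is the standard skip-gram with negative sampling (SGNS) objective (Mikolov et al., 2013), usually written for a single observed word-context pair as

    \log \sigma(\mathbf{w} \cdot \mathbf{c}) + \sum_{i=1}^{k} \mathbb{E}_{c_i \sim P_n} \left[ \log \sigma(-\mathbf{w} \cdot \mathbf{c}_i) \right]

where w is the vector of the current word, c is the context vector of an observed neighbor, the c_i are context vectors of words drawn from a noise distribution P_n, and k is the value of num-samples.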
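
The output file format described above is simple enough to read directly. Below is a minimal sketch of loading the vectors and finding nearest neighbors by cosine similarity, mirroring what --example-word prints; it assumes numpy is available and that vectors.txt was written without the optional stats header line:

    import numpy as np

    # Read the space-delimited vector file: one word per line, the word string
    # first, then the vector elements. Assumes no --output-stats-prefix header.
    words, rows = [], []
    with open("vectors.txt", encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) < 2:
                continue
            words.append(parts[0])
            rows.append([float(x) for x in parts[1:]])

    vectors = np.array(rows)
    # Normalize rows so that dot products are cosine similarities.
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

    def nearest(query, n=10):
        # Print the n words closest to the query by cosine similarity.
        if query not in words:
            print("not in vocabulary:", query)  # queries are case sensitive
            return
        similarities = vectors @ vectors[words.index(query)]
        for i in np.argsort(-similarities)[:n]:
            print("%f\t%s" % (similarities[i], words[i]))

    nearest("london")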