This is a C++ implementation of supervised latent Dirichlet allocation (sLDA) for classification.
Note that this code depends on the GNU Scientific Library.
git clone https://github.com/chbrown/slda
cd slda
make
You may need to install GSL first. E.g., on a Mac:
brew install gsl
Estimate the model by executing:
slda est <data path> <label path> <settings path> <alpha> <k> <initialization> <output directory>
- <data path> should point to a single file containing your training data.
  - Each line of this file should be of the form:
    <M> <term_1>:<count> <term_2>:<count> ... <term_N>:<count>
  - where <M> is the number of unique terms in the document (i.e., the number of <term>:<count> pairs on the line), and the <count> associated with each term is how many times that term appeared in the document. (For an example, see test/images/train-data.dat.)
- <label path> points to a file of labels.
  - Each line should consist of a single integer label in the range 0 to C-1, where C is the number of classes.
  - This file should have the same number of lines as the file specified by <data path>.
- <settings path> should point to a file with various settings, e.g., settings.txt
- <alpha> is a floating point hyperparameter (a prior)
- <k> is the number of topics
- <initialization> specifies the initialization method. There are three options:
  - "seeded"
  - "random"
  - <model path> (a path to some pre-existing model)
- <output directory> should point to a directory where the estimator's output will be stored. This directory will be created if it does not already exist.
The estimator outputs models in two types of files:
- <iteration>.model is the model saved in binary format, which is easy and fast to use for inference.
- <iteration>.model.text is the model saved in text format, which is convenient for printing topics or further analysis using a scripting language.
It also produces variational posterior Dirichlets in a file called:
<iteration>.gamma
Running the estimator on the 8-class image dataset produces the output:
010.gamma 010.model 010.model.text 020.gamma 020.model 020.model.text final.gamma final.model final.model.text likelihood.dat word-assignments.dat
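The data file format described above can be produced from per-document word counts. The following is a minimal sketch, not part of this repository: `format_document` is a hypothetical helper, and it assumes terms are identified by integer ids and that every count is positive.

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper (not part of slda): format one document as
//   <M> <term_1>:<count> ... <term_M>:<count>
// `counts` maps a term id to the number of times it occurs in the document.
// Assumption: term ids are integers and every count is positive.
std::string format_document(const std::map<int, int>& counts) {
    std::ostringstream line;
    line << counts.size();  // <M>: number of unique terms
    for (const auto& entry : counts) {
        line << ' ' << entry.first << ':' << entry.second;
    }
    return line.str();
}
```

For a document with term 0 twice, term 5 once, and term 12 three times, this yields the line `3 0:2 5:1 12:3`.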
Example usage:
./slda est test/images/train-data.dat test/images/train-label.dat \
settings.txt 1.0 10 random tmp/
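Before running the estimator, it can be useful to check that the label file is consistent with the data file: one integer label in the range 0 to C-1 per line, and the same number of lines in both files. The sketch below is one way to do that; `labels_match_data` is a hypothetical helper, not something slda provides, and slda's own input handling may behave differently on malformed files.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical check (not part of slda): returns true if `labels` holds exactly
// one integer in [0, num_classes) per line and has as many lines as `data`.
bool labels_match_data(std::istream& data, std::istream& labels, int num_classes) {
    std::string data_line, label_line;
    while (true) {
        const bool has_data = static_cast<bool>(std::getline(data, data_line));
        const bool has_label = static_cast<bool>(std::getline(labels, label_line));
        if (!has_data && !has_label) return true;  // both files ended together
        if (has_data != has_label) return false;   // line counts differ

        std::istringstream fields(label_line);
        int label;
        std::string extra;
        // The line must contain one integer and nothing else.
        if (!(fields >> label) || (fields >> extra)) return false;
        if (label < 0 || label >= num_classes) return false;
    }
}
```

To use it on real files, open them with `std::ifstream` and pass the streams in, with `num_classes` set to the number of classes in your data (8 for the image dataset in test/images).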
To perform inference on a different set of data (in the same format as for estimation), execute:
slda inf <data path> <label path> <settings path> <model path> <output directory>
- <data path>, <label path>, and <settings path> are all the same as in the estimation step.
- <model path> is the binary final.model file from the estimation step.
- <output directory> is the output directory, where the predicted labels will be stored. Each output file has one line per input document.
  - inf-gamma.dat contains the variational posterior Dirichlets
  - inf-labels.dat contains the predicted labels
  - inf-likelihood.dat contains each document's likelihood
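A variational posterior Dirichlet with parameters gamma_1, ..., gamma_k has mean gamma_i / (gamma_1 + ... + gamma_k), which can be read as the document's estimated topic proportions. The sketch below normalizes one line of parameters this way. It assumes that each line of inf-gamma.dat holds k whitespace-separated numbers; that layout is an assumption, not something this README specifies, and `topic_proportions` is a hypothetical helper.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of slda): parse one line of Dirichlet
// parameters and return the posterior mean, gamma_i / sum_j gamma_j.
// Assumption: the line holds whitespace-separated numeric values.
std::vector<double> topic_proportions(const std::string& gamma_line) {
    std::istringstream fields(gamma_line);
    std::vector<double> gamma;
    double value;
    double total = 0.0;
    while (fields >> value) {
        gamma.push_back(value);
        total += value;
    }
    if (total > 0.0) {
        for (double& g : gamma) g /= total;  // normalize so the entries sum to 1
    }
    return gamma;
}
```

For the line `1.0 3.0`, the result is the proportions 0.25 and 0.75.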
Example usage:
./slda inf test/images/test-data.dat test/images/test-label.dat \
settings.txt tmp/final.model tmp/
This will also print a final line of output, evaluating the predictions against the labels specified in the <label path> argument:
average accuracy: 0.679
The sample data in test/images was downloaded from http://www.cs.cmu.edu/~chongw/data/images.tgz on July 12, 2013.
Description of data from original site:
A preprocessed 8-class image dataset from Labelme.
UIUC Sports annotation files: annotations and meta information. The source image files can be found here. (Note: there might be some discrepancies and I don't seem to know why...)
Copyright © 2009, Chong Wang, David Blei and Li Fei-Fei
Licensed under both the GPL v2 and GPL v3, as well as any future version of the GNU General Public License.