Topic Modeling with MALLET

These instructions describe how to build a topic model from web archive data extracted by Warcbase using a Pig script like one of those listed here. Those scripts produce part-files in which each line has the tab-separated format

YYYYMM   domain.ext  Text of a single web page.

The datescrapes directory named below holds the files generated by the script break-into-date-scrapes.py, which takes a set of part-files and rearranges their contents so that all the archived pages from a given web scrape are contained in a single file named for the scrape date (e.g., 200509.txt).
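The regrouping step can be sketched roughly as follows. This is not the actual break-into-date-scrapes.py; the function name group_by_scrape_date and its arguments are hypothetical, and it assumes the first tab-separated field of every line is the YYYYMM scrape date.

```python
import os
from contextlib import ExitStack


def group_by_scrape_date(part_paths, out_dir):
    """Copy each part-file line into <out_dir>/<YYYYMM>.txt, keyed on its first field.

    Hypothetical helper: assumes tab-separated lines whose first field is the date.
    Returns a dict mapping each date to the number of lines written for it.
    """
    os.makedirs(out_dir, exist_ok=True)
    counts = {}
    handles = {}
    with ExitStack() as stack:  # closes every opened file on exit
        for path in part_paths:
            src = stack.enter_context(
                open(path, encoding="utf-8", errors="replace"))
            for line in src:
                date, sep, _rest = line.partition("\t")
                if not sep:
                    continue  # no tab, so no date field: skip the line
                if date not in handles:
                    out_path = os.path.join(out_dir, date + ".txt")
                    handles[date] = stack.enter_context(
                        open(out_path, "w", encoding="utf-8"))
                    counts[date] = 0
                # Keep all three columns; they are stripped in a later step.
                handles[date].write(line.rstrip("\n") + "\n")
                counts[date] += 1
    return counts
```

Calling group_by_scrape_date(["part-00000", "part-00001"], "datescrapes") would leave one file per scrape date in datescrapes, with the date and domain columns still in place.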

We discovered that treating all the web pages of a domain from a given date as a single document probably isn't a good idea for the purposes of topic-modeling. It's probably better to treat every web page as a separate document.
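One way to get one document per web page, sketched under the assumption that the data will be imported with import-dir (which treats each file as one document), is to write every page to its own file. The function split_into_pages and the file-naming scheme are hypothetical and not part of the original scripts; the sketch keeps only the text column and drops the date and domain.

```python
import os


def split_into_pages(scrape_path, out_dir):
    """Write the text of each page in a date-scrape file to its own file.

    Hypothetical helper: assumes lines of the form date<TAB>domain<TAB>text.
    Returns the list of file names written, in input order.
    """
    os.makedirs(out_dir, exist_ok=True)
    stem = os.path.splitext(os.path.basename(scrape_path))[0]
    written = []
    with open(scrape_path, encoding="utf-8", errors="replace") as src:
        for n, line in enumerate(src, start=1):
            # Split off only the first two columns; the text may contain tabs.
            fields = line.rstrip("\n").split("\t", 2)
            if len(fields) < 3 or not fields[2].strip():
                continue  # malformed line or empty page text
            name = "{}-{:06d}.txt".format(stem, n)
            with open(os.path.join(out_dir, name), "w", encoding="utf-8") as dst:
                dst.write(fields[2] + "\n")
            written.append(name)
    return written
```

File names carry the scrape date and the line number of the page (for example 200509-000001.txt), so the source of each document stays traceable in MALLET's output.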

The first two columns (date and domain) in the part-files produced by the Pig scripts linked above should be removed before importing your data into MALLET. This can be accomplished with these shell commands:

# Go into the document directory and create a new directory for the output
$ cd datescrapes
$ mkdir justtext
# Keep fields three and onward using `cut` (tab is its default delimiter)
$ for i in *.txt; do cut -f 3- "$i" > "justtext/$i"; done

Import Data into MALLET

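Change paths as appropriate. The --input directory should be the one that holds the text-only files (the justtext directory, if you used the shell commands above).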

./bin/mallet import-dir \
  --input /cliphomes/jrwiebe/cpp.text-greenparty/datescrapes/ \
  --output /cliphomes/jrwiebe/mallet-data/greenparty.mallet \
  --keep-sequence \
  --remove-stopwords

Building Topic Models

./bin/mallet train-topics \
  --input /cliphomes/jrwiebe/mallet-data/greenparty.mallet \
  --num-topics 20 \
  --optimize-interval 20 \
  --num-threads 16 \
  --output-state /cliphomes/jrwiebe/mallet-data/greenparty-topic-state.gz \
  --output-topic-keys /cliphomes/jrwiebe/mallet-data/mallet-data/greenparty_keys.txt \
  --output-doc-topics /cliphomes/jrwiebe/mallet-data/mallet-data/greenparty_composition.txt
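To inspect the resulting topics, the keys file can be loaded with a few lines of Python. This is a minimal sketch that assumes each line of the --output-topic-keys file holds a topic number, a weight, and a space-separated list of top words, separated by tabs; the function name read_topic_keys is hypothetical.

```python
def read_topic_keys(path):
    """Parse a topic-keys file into (topic_number, weight, words) tuples.

    Assumed line format: topic<TAB>weight<TAB>word word word ...
    Lines with fewer than three tab-separated fields are skipped.
    """
    topics = []
    with open(path, encoding="utf-8") as src:
        for line in src:
            parts = line.rstrip("\n").split("\t")
            if len(parts) < 3:
                continue
            topics.append((int(parts[0]), float(parts[1]), parts[2].split()))
    return topics
```

For example, printing the first five words of each tuple returned by read_topic_keys("greenparty_keys.txt") gives a quick overview of the 20 topics.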

Transform Topic Composition Document

MALLET's --output-doc-topics option produces a tab-separated file with columns for document number, source (document filename), and a series of topic-number/proportion column pairs that follow no fixed topic order. The script doctopic-matrix.py produces a CSV file with the columns docname and topic1, ..., topicN, one row per document. That matrix is easier to analyze and visualize.
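The transformation can be sketched as follows. This is not the actual doctopic-matrix.py; the function name doctopics_to_matrix is hypothetical, and the sketch makes three assumptions: the input has the paired layout described above, any header line starts with "#", and MALLET's topic 0 maps to the column topic1.

```python
import csv


def doctopics_to_matrix(doc_topics_path, csv_path):
    """Rewrite paired topic/proportion columns as a docname x topic matrix.

    Assumed input row: docnum<TAB>source<TAB>topic<TAB>proportion<TAB>...
    Returns the number of topic columns written.
    """
    rows = []
    max_topic = -1
    with open(doc_topics_path, encoding="utf-8") as src:
        for line in src:
            if line.startswith("#") or not line.strip():
                continue  # header or blank line
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue
            # Drop empty fields left by a trailing tab, then pair them up.
            pairs = [f for f in fields[2:] if f != ""]
            props = {int(t): float(p)
                     for t, p in zip(pairs[0::2], pairs[1::2])}
            if props:
                max_topic = max(max_topic, max(props))
            rows.append((fields[1], props))
    n_topics = max_topic + 1
    with open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        # Assumption: topic k in MALLET becomes column topic(k+1).
        writer.writerow(["docname"] +
                        ["topic{}".format(k + 1) for k in range(n_topics)])
        for name, props in rows:
            # Topics missing from a row are written as 0.0.
            writer.writerow([name] +
                            [props.get(k, 0.0) for k in range(n_topics)])
    return n_topics
```

Because every row ends up with the same columns in the same order, the CSV can be loaded directly into a spreadsheet or a data-frame library.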
