Topic Modeling with MALLET

These instructions describe how to build a topic model from web archive data extracted by Warcbase with a Pig script like one of those listed here. These scripts produce part-files in which each line has the format

YYYYMM   domain.ext  Text of a single web page.

The datescrapes directory named below refers to files generated by the script break-into-date-scrapes.py, which takes a set of part-files and rearranges their contents so that all the archived pages from a given web scrape are contained in a single file, named for the scrape date (e.g., 200509.txt).
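The regrouping step described above can be sketched in Python as follows. This is not the code of break-into-date-scrapes.py, only a minimal illustration of the idea. It assumes the three columns are tab-separated, and the function names `group_lines_by_date` and `write_date_scrapes` are hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def group_lines_by_date(lines):
    """Group part-file lines by their first tab-separated field (YYYYMM)."""
    by_date = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        date = line.split("\t", 1)[0]
        by_date[date].append(line)
    return dict(by_date)


def write_date_scrapes(part_files, out_dir):
    """Write one file per scrape date (e.g., 200509.txt) into out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for part_file in part_files:
        with open(part_file, encoding="utf-8") as handle:
            grouped = group_lines_by_date(handle)
        for date, lines in grouped.items():
            # Append, so that several part-files can contribute to one date.
            with open(out_dir / f"{date}.txt", "a", encoding="utf-8") as out:
                out.write("\n".join(lines) + "\n")
```

Because the output files are opened in append mode, the output directory should be empty before the first run; otherwise pages would be duplicated.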

We discovered that treating all the web pages of a domain from a given date as a single document is probably not a good idea for topic modeling. It is likely better to treat every web page as a separate document.
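One way to get one document per web page is to write each page to its own file, since MALLET's import-dir command treats every file as a single document. The following is a rough sketch under that assumption, not part of the original workflow. It assumes tab-separated columns, and the function name `split_pages` and the output naming scheme are hypothetical.

```python
from pathlib import Path


def split_pages(scrape_file, out_dir):
    """Write the text of each page in a date-scrape file to its own file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with open(scrape_file, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            fields = line.rstrip("\n").split("\t", 2)
            if len(fields) < 3:
                continue  # skip lines that lack one of the three columns
            date, domain, text = fields
            # Hypothetical naming scheme: date, domain, input line number.
            target = out_dir / f"{date}-{domain}-{number:06d}.txt"
            target.write_text(text + "\n", encoding="utf-8")
            written.append(target)
    return written
```

Only the page text is written, so files produced this way no longer contain the date and domain columns.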

The first two columns (date and domain) in the part-files produced by the Pig scripts linked above should be removed before importing your data into MALLET. This can be accomplished with these shell commands:

# Go into document directory, create new dir
$ cd datescrapes
$ mkdir justtext
# Select fields three and on using `cut`
$ for i in *.txt; do cut -f 3- "$i" > "justtext/$i"; done

Import Data into MALLET

Change paths as appropriate.

./bin/mallet import-dir --input /cliphomes/jrwiebe/cpp.text-greenparty/datescrapes/ --output /cliphomes/jrwiebe/mallet-data/greenparty.mallet --keep-sequence --remove-stopwords

Building Topic Models

./bin/mallet train-topics \
  --input /cliphomes/jrwiebe/mallet-data/greenparty.mallet \
  --num-topics 20 \
  --optimize-interval 20 \
  --num-threads 16 \
  --output-state /cliphomes/jrwiebe/mallet-data/greenparty-topic-state.gz \
  --output-topic-keys /cliphomes/jrwiebe/mallet-data/mallet-data/greenparty_keys.txt \
  --output-doc-topics /cliphomes/jrwiebe/mallet-data/mallet-data/greenparty_composition.txt
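To inspect the resulting topics, the file written by --output-topic-keys can be read with a few lines of Python. This sketch assumes each line holds a topic number, a weight, and a space-separated list of top words, separated by tabs. The function name `read_topic_keys` is hypothetical.

```python
def read_topic_keys(path):
    """Return {topic number: (weight, [top words])} from a topic-keys file."""
    topics = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip("\n").split("\t")
            if len(parts) < 3:
                continue  # skip lines that do not match the assumed layout
            topic_id, weight, words = parts[0], parts[1], parts[2]
            topics[int(topic_id)] = (float(weight), words.split())
    return topics
```

For example, `read_topic_keys("greenparty_keys.txt")[0]` would give the weight and top words of topic 0, if the file follows the assumed layout.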

Transform Topic Composition Document

MALLET's --output-doc-topics option produces a tab-separated file with columns for document number, source (document filename), and a series of topic number/proportion column pairs in no logical order. The script doctopic-matrix.py converts this into a CSV file with the columns docname and topic1, ..., topicN, so that every topic always appears in the same column. This form is easier to analyze and visualize.
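The transformation can be sketched as follows. This is not the code of doctopic-matrix.py, only an illustration of the reshaping it is described as performing. It assumes the input layout described above, a header line starting with `#`, and that MALLET's zero-based topic k belongs in column topic(k+1). The function names `doc_topics_to_rows` and `write_matrix` are hypothetical.

```python
import csv


def doc_topics_to_rows(lines, num_topics):
    """Turn doc-topics lines into [docname, p_topic0, ..., p_topicN-1] rows."""
    rows = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip the header and blank lines
        fields = line.rstrip().split("\t")
        docname = fields[1]
        proportions = [0.0] * num_topics
        pairs = fields[2:]
        # Walk the (topic number, proportion) pairs two columns at a time.
        for topic, proportion in zip(pairs[0::2], pairs[1::2]):
            proportions[int(topic)] = float(proportion)
        rows.append([docname] + proportions)
    return rows


def write_matrix(rows, num_topics, out_path):
    """Write the rows as CSV with the header docname, topic1, ..., topicN."""
    header = ["docname"] + [f"topic{k}" for k in range(1, num_topics + 1)]
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
```

Topics that are missing from a row keep a proportion of 0.0, so every output row has the same number of columns.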