These instructions describe how to build a topic model from web archive data extracted by Warcbase using a Pig script like one of those listed here. These scripts produce part-files consisting of multiple lines in the format
YYYYMM domain.ext Text of a single web page.
The datescrapes directory named below refers to files generated by the script break-into-date-scrapes.py, which takes a set of part-files and rearranges their contents so that all the archived pages from a given web scrape are contained in a single file, named for the scrape date (e.g., 200509.txt).
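The regrouping described above can be sketched roughly as follows. This is not break-into-date-scrapes.py itself, only a minimal illustration of the idea. It assumes the three fields in each part-file line are tab-separated (which the `cut -f 3-` step further down also relies on), and the function name split_by_scrape_date is hypothetical.

```python
import os
from collections import defaultdict


def split_by_scrape_date(part_lines, out_dir):
    """Group part-file lines by their leading YYYYMM field.

    Assumes each line looks like "YYYYMM<TAB>domain.ext<TAB>page text".
    Writes one file per date (e.g. 200509.txt) into out_dir and returns
    the sorted list of dates that were seen.
    """
    os.makedirs(out_dir, exist_ok=True)
    by_date = defaultdict(list)
    for line in part_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        date = line.split("\t", 1)[0]
        by_date[date].append(line)
    for date, lines in by_date.items():
        path = os.path.join(out_dir, date + ".txt")
        with open(path, "w", encoding="utf-8") as fh:
            fh.write("\n".join(lines) + "\n")
    return sorted(by_date)
```

To use it, read the lines of every part-file into one iterable and pass it in together with the output directory (here, datescrapes).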
We discovered that treating all the web pages of a domain from a given date as a single document probably isn't a good idea for the purposes of topic-modeling. It's probably better to treat every web page as a separate document.
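One way to get one document per web page is to write each line's text field to its own file, so that a directory import sees every page as a separate document. The sketch below assumes tab-separated fields. The helper name write_pages_as_documents and the file-naming scheme are illustrative choices, not part of the original workflow.

```python
import os


def write_pages_as_documents(scrape_lines, out_dir, prefix):
    """Write the text field of each line to its own file in out_dir.

    Assumes lines of the form "YYYYMM<TAB>domain.ext<TAB>page text".
    Files are named "<prefix>-<line number>.txt" (a hypothetical scheme).
    Returns the list of paths written.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for n, line in enumerate(scrape_lines):
        fields = line.rstrip("\n").split("\t", 2)
        if len(fields) < 3 or not fields[2].strip():
            continue  # skip malformed lines and empty pages
        path = os.path.join(out_dir, "%s-%06d.txt" % (prefix, n))
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(fields[2] + "\n")
        written.append(path)
    return written
```

Because only the third field is written, files produced this way already lack the date and domain columns.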
The first two columns (date and domain) in the part-files produced by the Pig scripts linked above should be removed before importing your data into MALLET. This can be accomplished with these shell commands:
# Go into document directory, create new dir
$ cd datescrapes
$ mkdir justtext
# Select fields three and on using `cut`
$ for i in *.txt; do cut -f 3- "$i" > "justtext/$i"; done
Then import the documents into MALLET and train the topic model, changing paths as appropriate:
./bin/mallet import-dir --input /cliphomes/jrwiebe/cpp.text-greenparty/datescrapes/ --output /cliphomes/jrwiebe/mallet-data/greenparty.mallet --keep-sequence --remove-stopwords
./bin/mallet train-topics --input /cliphomes/jrwiebe/mallet-data/greenparty.mallet --num-topics 20 --optimize-interval 20 --num-threads 16 --output-state /cliphomes/jrwiebe/mallet-data/greenparty-topic-state.gz --output-topic-keys /cliphomes/jrwiebe/mallet-data/mallet-data/greenparty_keys.txt --output-doc-topics /cliphomes/jrwiebe/mallet-data/mallet-data/greenparty_composition.txt
MALLET's --output-doc-topics option produces a tab-separated file with columns for document number, source (document filename), and a series of topic number/proportion column-pairs in no logical order. The script doctopic-matrix.py produces a CSV file with columns docname and topic1, ..., topicN. This can then be more easily analyzed and visualized.
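The conversion can be sketched roughly as below. This is not doctopic-matrix.py itself. It assumes the pair-based layout described above (document number, source, then topic/proportion pairs), that comment lines start with `#`, and that MALLET numbers topics from 0, so topic 0 lands in column topic1. The function name doctopics_to_matrix is hypothetical. Some MALLET versions write the proportions in topic order without the topic numbers, and this sketch would not apply to that layout.

```python
import csv


def doctopics_to_matrix(in_path, out_path, num_topics):
    """Convert a pair-based MALLET doc-topics file into a docname x topic CSV."""
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        header = ["docname"] + ["topic%d" % (t + 1) for t in range(num_topics)]
        writer.writerow(header)
        for line in src:
            if line.startswith("#") or not line.strip():
                continue  # skip the header comment and blank lines
            fields = line.rstrip("\n").split("\t")
            docname, pairs = fields[1], fields[2:]
            row = [0.0] * num_topics
            # Pairs alternate topic number, proportion; a trailing empty
            # field (from a trailing tab) is dropped by zip.
            for topic, proportion in zip(pairs[0::2], pairs[1::2]):
                row[int(topic)] = float(proportion)
            writer.writerow([docname] + row)
```

For the commands above, this would be called with greenparty_composition.txt as the input and num_topics set to 20.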