HTRC-FeatureExtractor

Extracts a set of features (such as n-gram counts and part-of-speech tags) from the HathiTrust corpus to support 'distant-reading' (also known as non-consumptive) research.

Build

To generate a package that can be invoked via a shell script, run:
sbt stage
then find the result in the target/universal/stage/ folder.
To generate a distributable ZIP package, run:
sbt dist
then find the result in the target/universal/ folder.

Run

extract-features
  -l, --log-level  <LEVEL>    (Optional) The application log level; one of INFO,
                              DEBUG, OFF (default = INFO)
  -c, --num-cores  <N>        (Optional) The number of CPU cores to use (if not
                              specified, uses all available cores)
  -n, --num-partitions  <N>   (Optional) The number of partitions to split the
                              input set of HT IDs into, for increased
                              parallelism
  -o, --output  <DIR>         Write the output to DIR (should not exist, or be
                              empty)
  -p, --pairtree  <DIR>       The path to the pairtree root hierarchy to process
  -s, --save-as-seq           (Optional) Saves the EF files as Hadoop sequence
                              files
      --spark-log  <FILE>     (Optional) Where to write logging output from
                              Spark to
  -h, --help                  Show help message
  -v, --version               Show version of this program

 trailing arguments:
  htids (not required)   The file containing the HT IDs to be searched (if not
                         provided, will read from stdin)
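
As an illustration of the options above, here is a minimal sketch of one invocation. It assumes the launcher produced by sbt stage is at target/universal/stage/bin/extract-features (the usual layout for staged packages, not confirmed by this README) and that the HT ID file holds one ID per line. The pairtree path, output path, and IDs are hypothetical placeholders.

```shell
#!/bin/sh
# Assumed location of the staged launcher script (not confirmed by this README).
EF_BIN="target/universal/stage/bin/extract-features"
# Hypothetical paths; replace with your own.
PAIRTREE_ROOT="/data/pairtree"
OUTPUT_DIR="/tmp/ef-output"   # should not exist, or be empty

# Write a small list of HT IDs, assumed to be one per line.
# These IDs are placeholders, not real volumes.
HTIDS_FILE="$(mktemp)"
printf '%s\n' 'mdp.00000000000001' 'mdp.00000000000002' > "$HTIDS_FILE"

if [ -x "$EF_BIN" ]; then
  # Process the listed volumes using 4 CPU cores.
  "$EF_BIN" --pairtree "$PAIRTREE_ROOT" --output "$OUTPUT_DIR" \
            --num-cores 4 "$HTIDS_FILE"
else
  echo "extract-features not found at $EF_BIN; run 'sbt stage' first" >&2
fi
```

Because the htids argument is optional, the same IDs could instead be piped to the command on stdin by omitting the trailing file name.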
