Extracts features (token counts, POS tags, etc.) from a list of HT volumes, to aid in non-consumptive research.

htrc/HTRC-FeatureExtractor


HTRC-FeatureExtractor

Extracts a set of features (such as n-gram counts and POS tags) from the HathiTrust corpus to support 'distant-reading' (also known as non-consumptive) research.

Build

  • To generate a package that can be invoked via a shell script, run:
    sbt stage
    then find the result in the target/universal/stage/ folder.
  • To generate a distributable ZIP package, run:
    sbt dist
    then find the result in the target/universal/ folder.
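
After staging, the launcher script still has to be located before it can be run. The helper below is a minimal sketch of that step. It assumes the staged launcher lives at bin/extract-features under the stage folder, which is the usual sbt-native-packager layout but is not stated in this README; the function name find_launcher is hypothetical.

```shell
#!/bin/sh
# Sketch only. Assumption: the staged launcher is <stage>/bin/extract-features
# (typical sbt-native-packager layout; verify against your own build output).
find_launcher() {
  stage_dir="${1:-target/universal/stage}"
  launcher="$stage_dir/bin/extract-features"
  if [ -x "$launcher" ]; then
    # Print the path so callers can capture it with $(find_launcher).
    printf '%s\n' "$launcher"
  else
    printf 'launcher not found under %s; run "sbt stage" first\n' "$stage_dir" >&2
    return 1
  fi
}

# Example use after "sbt stage":
#   "$(find_launcher)" --help
```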

Run

extract-features
  -l, --log-level  <LEVEL>    (Optional) The application log level; one of INFO,
                              DEBUG, OFF (default = INFO)
  -c, --num-cores  <N>        (Optional) The number of CPU cores to use (if not
                              specified, uses all available cores)
  -n, --num-partitions  <N>   (Optional) The number of partitions to split the
                              input set of HT IDs into, for increased
                              parallelism
  -o, --output  <DIR>         Write the output to DIR (should not exist, or be
                              empty)
  -p, --pairtree  <DIR>       The path to the pairtree root hierarchy to process
  -s, --save-as-seq           (Optional) Saves the EF files as Hadoop sequence
                              files
      --spark-log  <FILE>     (Optional) Where to write logging output from
                              Spark to
  -h, --help                  Show help message
  -v, --version               Show version of this program

 trailing arguments:
  htids (not required)   The file containing the HT IDs to be searched (if not
                         provided, will read from stdin)
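
Because --output expects a directory that does not exist or is empty, it can help to check this before starting a long run. The following is a minimal sketch of such a check. The function name output_dir_ok, the paths, and the file htids.txt are hypothetical; only the flags shown come from the usage text above.

```shell
#!/bin/sh
# Sketch only. Succeeds (exit 0) when DIR is absent or is an empty directory,
# mirroring the "should not exist, or be empty" requirement of --output.
output_dir_ok() {
  dir="$1"
  # A path that does not exist is acceptable.
  [ ! -e "$dir" ] && return 0
  # An existing path must be a directory...
  [ -d "$dir" ] || return 1
  # ...and must contain no entries (including hidden ones).
  [ -z "$(ls -A "$dir")" ]
}

# Example use (hypothetical paths and ID file):
#   output_dir_ok /data/ef-out &&
#     extract-features -p /data/pairtree -o /data/ef-out -c 8 htids.txt
#
# With no htids argument, the IDs are read from stdin instead:
#   extract-features -p /data/pairtree -o /data/ef-out < htids.txt
```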
