Utility library for identifying the header, body, and footer of each page in a volume. The algorithm uses fuzzy string matching (based on the Levenshtein distance metric) to cluster similar lines across pages within a configurable window. It is designed to copy as little memory as possible, both for performance and so that it can process large amounts of text.
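The idea described above can be illustrated with a minimal sketch. It is not this library's API: the object name `RunningHeaderSketch`, the function names, the digit masking, and the default `window`, `maxLines`, and `threshold` values are all assumptions made for illustration. It only handles headers, and it makes no attempt at the memory optimizations mentioned above.

```scala
// Hypothetical sketch, not the library's actual implementation or API.
object RunningHeaderSketch {

  /** Levenshtein distance using two rolling rows of the DP table. */
  def levenshtein(a: String, b: String): Int = {
    var prev = Array.range(0, b.length + 1) // distances from "" to each prefix of b
    var curr = new Array[Int](b.length + 1)
    for (i <- 1 to a.length) {
      curr(0) = i
      for (j <- 1 to b.length) {
        val cost = if (a.charAt(i - 1) == b.charAt(j - 1)) 0 else 1
        curr(j) = math.min(
          math.min(prev(j) + 1, curr(j - 1) + 1), // deletion, insertion
          prev(j - 1) + cost                      // substitution or match
        )
      }
      val tmp = prev; prev = curr; curr = tmp     // reuse the rows instead of reallocating
    }
    prev(b.length)
  }

  /** Similarity in [0, 1]; 1.0 means the strings are identical. */
  def similarity(a: String, b: String): Double = {
    val maxLen = math.max(a.length, b.length)
    if (maxLen == 0) 1.0 else 1.0 - levenshtein(a, b).toDouble / maxLen
  }

  /** Assumption: mask digit runs so that changing page numbers do not hurt the match. */
  def normalize(line: String): String =
    line.trim.toLowerCase.replaceAll("\\d+", "#")

  /**
   * For each page, counts how many of its leading lines look like a running header:
   * a line qualifies if a similar line appears among the first `maxLines` lines of
   * some other page at most `window` pages away.
   */
  def headerLineCounts(
      pages: IndexedSeq[IndexedSeq[String]],
      window: Int = 2,
      maxLines: Int = 3,
      threshold: Double = 0.8
  ): IndexedSeq[Int] =
    pages.indices.map { p =>
      val neighbours =
        (math.max(0, p - window) to math.min(pages.length - 1, p + window)).filter(_ != p)
      val candidates = pages(p).take(maxLines).map(normalize)
      candidates.indices.takeWhile { i =>
        candidates(i).nonEmpty && neighbours.exists { q =>
          pages(q).take(maxLines).map(normalize)
            .exists(other => similarity(candidates(i), other) >= threshold)
        }
      }.length
    }
}
```

Calling `RunningHeaderSketch.headerLineCounts(pages)` with one `IndexedSeq[String]` of lines per page returns, for each page, the number of leading lines to treat as header. A footer variant would apply the same comparison to the last lines of each page.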
- To generate a package that can be referenced from other projects:
sbt test package
then find the result in the target/scala-2.13/ (or similar) folder.
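- To reference the library from an sbt build:
libraryDependencies += "org.hathitrust.htrc" %% "running-headers" % VERSION
- To reference the library from Maven, use the artifact that matches your Scala version:
Scala 2.12.x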
<dependency>
<groupId>org.hathitrust.htrc</groupId>
<artifactId>running-headers_2.12</artifactId>
<version>VERSION</version>
</dependency>
Scala 2.13.x
<dependency>
<groupId>org.hathitrust.htrc</groupId>
<artifactId>running-headers_2.13</artifactId>
<version>VERSION</version>
</dependency>