The llamapun library contains language and mathematics processing algorithms, used by the KWARC research group.
As of 2022, this repository can be considered in maintenance mode, as no further development is planned.
At its core, llamapun is a Rust implementation that aims for a minimal footprint and optimal runtime, in order to safely scale to corpora of millions of documents and tens of billions of tokens.
Requires stable Rust, starting from rustc 1.34.0 (91856ed52 2019-04-10).
- Source Data
  - Built-in support for STEM documents in (LaTeXML-flavoured) HTML5.
- Preprocessing
  - Unicode normalization,
  - Stopwords - based on widely accepted lists, enhanced for STEM texts,
  - Semi-structured to plain text normalization (math, citations, tables, etc.),
  - [TODO #3] Purification of text and math modality (e.g. move trailing dots left in math back into the sentence text),
  - Stemming - adaptation of the Morpha stemmer,
  - Tokenization - rule-based sentence segmentation, and SENNA word tokenization
- Shallow Analysis
- Representation Toolkit
  - Document Narrative Model (DNM) addition to the XML DOM
  - XPointer and string offset annotation support
  - [TOPORT] Shared Packed parse forests for mathematical formulas (aka "disjunctive logical forms")
- Programming API
  - High-level iterators over the narrative elements of scientific documents
  - Zero-cost abstractions over the source data, as well as over linguistic annotations of various granularity.
  - High-throughput parallel processing via rayon, since 0.3.0.
- Additional included examples
  - math-aware corpus token models, via DNM plain text normalization
  - math-aware dataset extraction for "statement classification" of paragraphs
  - "node footprint" statistics for corpora, e.g. informing the MathML4 effort
  - track sibling words to inline references in scientific articles, informing LaTeXML development
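
To make the "semi-structured to plain text normalization" and stopword items above more concrete, here is a minimal, standard-library-only sketch of the general idea: formulas and citations are collapsed into single placeholder words, so that downstream token models see one token per formula. This is not llamapun's API. The `Fragment` type, the `normalize` and `tokens` functions, the placeholder strings and the tiny stopword list are all hypothetical names chosen for illustration, and the whitespace/punctuation splitting is a simple stand-in for the rule-based and SENNA tokenization the library lists.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified stand-in for one piece of a paragraph.
/// (llamapun itself works on the XML DOM via its DNM; this enum is an assumption.)
#[derive(Debug, Clone, PartialEq)]
enum Fragment {
    Text(String),
    Math(String),     // formula source, dropped during normalization
    Citation(String), // citation key, dropped during normalization
}

// Placeholder words are assumptions made for this sketch.
const MATH_TOKEN: &str = "mathformula";
const CITE_TOKEN: &str = "citationelement";

/// Flatten fragments into plain text, replacing math and citations
/// with one placeholder word each.
fn normalize(fragments: &[Fragment]) -> String {
    fragments
        .iter()
        .map(|fragment| match fragment {
            Fragment::Text(text) => text.trim(),
            Fragment::Math(_) => MATH_TOKEN,
            Fragment::Citation(_) => CITE_TOKEN,
        })
        .filter(|piece| !piece.is_empty())
        .collect::<Vec<&str>>()
        .join(" ")
}

/// Lowercase, split on non-alphanumeric characters, and drop stopwords.
fn tokens(text: &str, stopwords: &HashSet<&str>) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|word| !word.is_empty())
        .map(|word| word.to_lowercase())
        .filter(|word| !stopwords.contains(word.as_str()))
        .collect()
}

fn main() {
    let paragraph = vec![
        Fragment::Text("We show that".to_string()),
        Fragment::Math("x^2 + y^2 = z^2".to_string()),
        Fragment::Text("holds, as in".to_string()),
        Fragment::Citation("Euclid".to_string()),
    ];
    // Illustrative stopword list only; real lists are much larger.
    let stopwords: HashSet<&str> = ["we", "that", "as", "in"].iter().copied().collect();

    let plain = normalize(&paragraph);
    // prints: We show that mathformula holds, as in citationelement
    println!("{}", plain);
    // prints: ["show", "mathformula", "holds", "citationelement"]
    println!("{:?}", tokens(&plain, &stopwords));
}
```

Replacing each formula with a single word keeps the sentence grammatical for later steps such as sentence segmentation, while the original formula can still be recovered through offset annotations if they are recorded separately.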
Disclaimers:
- Please remember that all third-party tools (such as the SENNA NLP toolkit) enforce their own licensing constraints.
- This GitHub repository is a successor to the now deprecated C+Perl LLaMaPUn implementation.