Utilities for processing various types of data for machine learning models.
It helps build and manage vocabularies from corpora. It includes the following functionality:
- tokenise (Useful regex: r"([\w]+(?:(?!\s)\W?[\w]+)*)" )
- stem or unstem
- filters: ability to define filters that will accept or reject vocabulary entries (e.g. stopwords)
- token-level cleanups
- merging of multiple vocabularies
- replace character(s) in all tokens
- save vocabulary
- load vocabulary
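
As a rough illustration of the tokenise, filter and merge steps listed above, here is a minimal sketch using only the Python standard library. The function names (`tokenise`, `build_vocabulary`, `not_stopword`, `merge_vocabularies`) and the tiny stopword set are hypothetical and are not this package's API; only the regex is taken from the list above.

```python
import re
from collections import Counter

# Regex quoted in the list above: word characters joined by a single
# non-space, non-word character (apostrophe, hyphen, dot) stay one token.
TOKEN_RE = re.compile(r"([\w]+(?:(?!\s)\W?[\w]+)*)")

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "of"}


def tokenise(text):
    """Return the tokens of `text` in order of appearance."""
    return TOKEN_RE.findall(text)


def not_stopword(token):
    """Example filter: accept a token unless it is a stopword."""
    return token not in STOPWORDS


def build_vocabulary(documents, filters=()):
    """Count lower-cased tokens, keeping only those every filter accepts."""
    counts = Counter()
    for doc in documents:
        for token in tokenise(doc.lower()):
            if all(accept(token) for accept in filters):
                counts[token] += 1
    return counts


def merge_vocabularies(*vocabs):
    """Merge several vocabularies by summing their token counts."""
    merged = Counter()
    for vocab in vocabs:
        merged.update(vocab)
    return merged
```

With this sketch, `tokenise("Don't panic: state-of-the-art tools, e.g. v2.0!")` yields `["Don't", "panic", "state-of-the-art", "tools", "e.g", "v2.0"]`: inner punctuation is kept, trailing punctuation is dropped.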
Utility to build and manage frequency matrices from corpora, with the following functionality:
- turn a corpus into a frequency matrix (corpus, vocab)
- merge multiple vocabularies and frequency matrices together
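
The two frequency-matrix operations above could be sketched as follows. This is an assumption-laden illustration, not the package's implementation: `frequency_matrix` and `merge` are hypothetical names, the matrix is a plain list of rows (one row per document, one column per vocabulary entry), and tokens outside the vocabulary are simply ignored.

```python
import re

# Same tokenising regex as quoted in the feature list.
TOKEN_RE = re.compile(r"([\w]+(?:(?!\s)\W?[\w]+)*)")


def frequency_matrix(corpus, vocab):
    """Build a document-by-token count matrix as a list of rows."""
    index = {token: col for col, token in enumerate(vocab)}
    matrix = []
    for doc in corpus:
        row = [0] * len(vocab)
        for token in TOKEN_RE.findall(doc.lower()):
            col = index.get(token)
            if col is not None:  # out-of-vocabulary tokens are ignored
                row[col] += 1
        matrix.append(row)
    return matrix


def merge(vocab_a, matrix_a, vocab_b, matrix_b):
    """Union two vocabularies and stack both matrices over the union."""
    seen = set(vocab_a)
    merged_vocab = list(vocab_a) + [t for t in vocab_b if t not in seen]
    index = {token: col for col, token in enumerate(merged_vocab)}
    merged_matrix = []
    for vocab, matrix in ((vocab_a, matrix_a), (vocab_b, matrix_b)):
        for row in matrix:
            new_row = [0] * len(merged_vocab)
            # Move each count to its column in the merged vocabulary.
            for token, count in zip(vocab, row):
                new_row[index[token]] += count
            merged_matrix.append(new_row)
    return merged_vocab, merged_matrix
```

Stacking rows assumes the two matrices describe different documents. If they covered the same documents, the rows would need to be summed instead.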