standardizing, streamlining, and snuggling up to spaCy

@bdewilde bdewilde released this 13 May 14:31

New and Changed:

  • Removed textacy.Doc, and split its functionality into two parts

    • New: Added textacy.make_spacy_doc() as a convenient and flexible entry point
      for making spaCy Docs from text or (text, metadata) pairs, with optional
      spaCy language pipeline specification. It's similar to textacy.Doc.__init__,
      except that text and metadata are passed in together as a 2-tuple.
    • New: Added a variety of custom doc property and method extensions to
      the global spacy.tokens.Doc class, accessible via its Doc._ "underscore"
      property. These are similar to the properties/methods on textacy.Doc;
      they just require an interstitial underscore. For example,
      textacy.Doc.to_bag_of_words() => spacy.tokens.Doc._.to_bag_of_words().
    • New: Added functions for setting, getting, and removing these extensions.
      Note that they are set automatically when textacy is imported.
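The new entry point accepts either a bare text or a (text, metadata) 2-tuple. A minimal stdlib-only sketch of that input contract (the `coerce_doc_input` helper is hypothetical, for illustration only; it is not part of textacy's API):

```python
def coerce_doc_input(data):
    """Normalize input into a (text, metadata) pair.

    Mirrors the two input forms textacy.make_spacy_doc() accepts:
    a bare text, or a (text, metadata) 2-tuple.
    """
    if isinstance(data, str):
        return data, {}  # no metadata supplied
    if isinstance(data, tuple) and len(data) == 2:
        text, metadata = data
        return text, metadata
    raise TypeError(f"expected str or (text, metadata) 2-tuple, got {type(data)}")
```

For example, `coerce_doc_input(("Hello world", {"title": "greeting"}))` returns the pair unchanged, while a bare string is paired with empty metadata.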
  • Simplified and improved performance of textacy.Corpus

    • Documents are now added through a simpler API, either in Corpus.__init__
      or Corpus.add(); they may be one or a stream of texts, (text, metadata)
      pairs, or existing spaCy Docs. When adding many documents, the spaCy
      language processing pipeline is used in a faster and more efficient way.
    • Saving / loading corpus data to disk is now more efficient and robust.
    • Note: Corpus is now a collection of spaCy Docs rather than textacy.Docs.
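The simplified add API dispatches on input type: a single item or a stream of texts, (text, metadata) pairs, or existing Docs. A stdlib-only sketch of that dispatch logic (the `add_to_corpus` helper is hypothetical; the real Corpus.add() also batches many documents through the spaCy pipeline for speed):

```python
def add_to_corpus(corpus, data):
    """Append one item or a stream of items to a corpus (a plain list here).

    Mimics the input forms of the simplified Corpus API: a single text,
    a single (text, metadata) pair, or an iterable of either.
    """
    if isinstance(data, (str, tuple)):
        data = [data]  # wrap a single item into a one-element stream
    for item in data:
        corpus.append(item)  # stand-in for the real per-document processing
    return corpus
```

Usage: `add_to_corpus(corpus, "one text")` and `add_to_corpus(corpus, [("text", {"id": 1}), "more text"])` both work, which is the convenience the new API provides.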
  • Simplified, standardized, and added Dataset functionality

    • New: Added an IMDB dataset, built on the classic 2011 dataset
      commonly used to train sentiment analysis models.
    • New: Added a base Wikimedia dataset, from which a reworked
      Wikipedia dataset and a separate Wikinews dataset inherit.
      The underlying data source has changed, from XML db dumps of raw wiki markup
      to JSON db dumps of (relatively) clean text and metadata; now, the code is
      simpler, faster, and totally language-agnostic.
    • Dataset.records() now streams (text, metadata) pairs rather than a dict
      containing both text and metadata, so users no longer need to know field
      names and split records into separate streams before creating Doc or
      Corpus objects from the data.
    • Filtering and limiting the number of texts/records produced is now clearer
      and more consistent between .texts() and .records() methods on
      a given Dataset --- and more performant!
    • Downloading datasets now always shows progress bars and saves to the same
      file names. When appropriate, downloaded archive files' contents are
      automatically extracted for easy inspection.
    • Common functionality (such as validating filter values) is now standardized
      and consolidated in the datasets.utils module.
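The streaming-pairs pattern behind Dataset.records() can be sketched with a stdlib generator. This is an illustrative sketch only: the "text" field name, the `min_len` filter, and the `limit` parameter here are hypothetical stand-ins for a given Dataset's actual fields and filters:

```python
def iter_records(raw_records, limit=None, min_len=0):
    """Yield (text, metadata) pairs from raw dict records, applying a
    simple length filter and an optional limit on the number produced."""
    count = 0
    for record in raw_records:
        text = record.pop("text")
        if len(text) < min_len:
            continue  # filtered records don't count against the limit
        yield text, record  # remaining fields serve as metadata
        count += 1
        if limit is not None and count >= limit:
            return
```

Because each yielded item is already a (text, metadata) pair, it can be passed straight into document- or corpus-construction code without any field-splitting step.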
  • Quality of life improvements

    • Reduced load time for import textacy from ~2-3 seconds to ~1 second,
      by lazy-loading expensive variables, deferring a couple of heavy imports,
      and dropping a couple of dependencies. Specifically:

      • ftfy was dropped, and a NotImplementedError is now raised
        in textacy's wrapper function, textacy.preprocess.fix_bad_unicode().
        Users with bad unicode should now directly call ftfy.fix_text().
      • ijson was dropped, and the behavior of textacy.read_json()
        is now simpler and consistent with other functions for line-delimited data.
      • mwparserfromhell was dropped, since the reworked Wikipedia dataset
        no longer requires complicated and slow parsing of wiki markup.
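Reading line-delimited data without ijson needs only the stdlib json module. A minimal sketch of the pattern (not textacy.read_json()'s actual implementation):

```python
import json

def read_json_lines(lines):
    """Yield one decoded object per line of line-delimited JSON,
    skipping blank lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)
```

This works the same whether `lines` is a list of strings or an open file handle, since iterating a file yields its lines.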
    • Renamed certain functions and variables for clarity, and for consistency with
      existing conventions:

      • textacy.load_spacy() => textacy.load_spacy_lang()
      • textacy.extract.named_entities() => textacy.extract.entities()
      • textacy.data_dir => textacy.DEFAULT_DATA_DIR
      • filename => filepath and dirname => dirpath when specifying
        full paths to files/dirs on disk, and textacy.io.utils.get_filenames()
        => textacy.io.utils.get_filepaths() accordingly
      • SpacyDoc => Doc, SpacySpan => Span, SpacyToken => Token,
        SpacyLang => Language as variables and in docs
      • compiled regular expressions now consistently start with RE_
    • Removed deprecated functionality

      • top-level spacy_utils.py and spacy_pipelines.py are gone;
        use equivalent functionality in the spacier subpackage instead
      • math_utils.py is gone; it was long neglected, and never actually used
    • Replaced textacy.compat.bytes_to_unicode() and textacy.compat.unicode_to_bytes()
      with textacy.compat.to_unicode() and textacy.compat.to_bytes(), which
      are safer and accept either binary or text strings as input.
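A sketch of functions with the described accept-either-type behavior; textacy's actual implementations may differ in detail (e.g., in their handling of Python 2 types):

```python
def to_unicode(s, encoding="utf-8", errors="strict"):
    """Coerce binary or text input to a text (unicode) string."""
    if isinstance(s, str):
        return s
    if isinstance(s, bytes):
        return s.decode(encoding, errors)
    raise TypeError(f"expected str or bytes, got {type(s)}")

def to_bytes(s, encoding="utf-8", errors="strict"):
    """Coerce binary or text input to a binary (bytes) string."""
    if isinstance(s, bytes):
        return s
    if isinstance(s, str):
        return s.encode(encoding, errors)
    raise TypeError(f"expected str or bytes, got {type(s)}")
```

The safety win over the old one-directional converters is that callers no longer need to check the input type themselves: passing a value that is already the target type is a no-op rather than an error.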

    • Moved and renamed language detection functionality,
      textacy.text_utils.detect_language() => textacy.lang_utils.detect_lang().
      The idea is to add more/better lang-related functionality here in the future.

    • Updated and cleaned up documentation throughout the code base.

    • Added and refactored many tests, for both new and old functionality,
      significantly increasing test coverage while significantly reducing run-time.
      Also, added a proper coverage report to CI builds. This should help prevent
      future errors and inspire better test-writing.

    • Bumped the minimum required spaCy version: v2.0.0 => v2.0.12,
      for access to their full set of custom extension functionality.

Fixed:

  • The progress bar during an HTTP download now always closes, preventing weird
    nesting issues if another bar is subsequently displayed.
  • Filtering datasets by multiple values performed either a logical AND or OR
    over the values, which was confusing; now, a logical OR is always performed.
  • The existence of files/directories on disk is now checked properly via
    os.path.isfile() or os.path.isdir(), rather than os.path.exists().
  • Fixed a variety of formatting errors raised by sphinx when generating HTML docs.