Changes

Most recent releases are shown at the top. Each release shows:

  • Added: New classes, methods, functions, etc
  • Changed: Additional parameters, changes to inputs or outputs, etc
  • Fixed: Bug fixes that don't change documented behaviour

Note that the top-most release lists changes in the unreleased master branch on GitHub. Parentheses after an item show the name or GitHub id of the contributor of that change.

Keep a Changelog, Semantic Versioning.

[0.3.7dev] (Work In Progress)

Added

  • Anomalies to see significant deviations in fields coverage across multiple jobs, #138
  • Support for the Bitbucket API, in order to access files from private repositories, #71
  • Extend inferred schema with additionalProperties: False and uniqueItems: True, #21
  • Fields Difference rule to find the difference between field values of two jobs. Supports normalization, nested fields, full access to the data, #167
  • Added outcome property on Result, in order to define a rule outcome based on message levels, #173
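
To illustrate what the two schema keywords from #21 mean, here is a minimal sketch in plain Python. The `tighten` helper is hypothetical and is not part of arche; it only shows where `additionalProperties: False` and `uniqueItems: True` would sit in a JSON schema dict, assuming they are applied to every object and array node.

```python
import copy


def tighten(schema):
    """Return a copy of `schema` with stricter object and array nodes.

    Hypothetical helper for illustration only, not arche's implementation.
    """
    out = copy.deepcopy(schema)
    if out.get("type") == "object":
        # Reject keys that are not declared under "properties".
        out.setdefault("additionalProperties", False)
        if "properties" in out:
            out["properties"] = {
                name: tighten(sub) for name, sub in out["properties"].items()
            }
    elif out.get("type") == "array":
        # Reject arrays that contain the same element twice.
        out.setdefault("uniqueItems", True)
        if isinstance(out.get("items"), dict):
            out["items"] = tighten(out["items"])
    return out


inferred = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
}
strict = tighten(inferred)
```

With such a schema, an item carrying an undeclared key or a `tags` list with repeated values would fail validation.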

Changed

  • Report rendering. Reports are now generated as HTML with a jinja2 template. Arche.report_all() displays the rule results grouped by outcome, and the plots are displayed on the "plots" tab, #168
  • report_all() accepts uniques arg to find duplicates among columns/rows, #171

[0.3.6] (2019-07-12)

Added

Changed

  • Arche.report_all() does not shorten report by default, added short parameter.
  • Data is consistent with Dash and Spidermon: _type, _key fields are dropped from dataframe, raw data, basic schema, #104, #106
  • df.index now stores _key instead
  • basic_json_schema() works with deleted jobs
  • start is supported for Collections, #112
  • enum is counted as a category tag, #18
  • Garbage Symbols searches in str representation of nested fields instead of expanded df, #130
  • Show real coverage difference (negative/positive) instead of absolute, #114
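
The distinction between a signed and an absolute coverage difference from #114 can be sketched as follows. This is not arche's code: the `coverage` helper, the sample items, and the choice of subtracting target from source are assumptions made for the example.

```python
def coverage(items, fields):
    """Share of items (0..1) in which each field is present and not None.

    Assumes `items` is a non-empty list of dicts.
    """
    total = len(items)
    return {
        field: sum(1 for item in items if item.get(field) is not None) / total
        for field in fields
    }


source = [
    {"title": "a", "price": 1},
    {"title": "b", "price": None},
    {"title": "c"},
    {"title": "d", "price": 4},
]
target = [
    {"title": "a", "price": 1},
    {"title": None, "price": 2},
    {"title": "c", "price": 3},
    {"title": "d", "price": 4},
]

fields = ["title", "price"]
src_cov = coverage(source, fields)
tgt_cov = coverage(target, fields)

# Signed difference keeps the direction of the change.
signed = {f: src_cov[f] - tgt_cov[f] for f in fields}
# Absolute difference loses it.
absolute = {f: abs(d) for f, d in signed.items()}
```

Here `signed` shows that `title` coverage is higher in the source while `price` coverage is lower, which the absolute values alone cannot tell apart.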

Fixed

  • Arche.glance(), #88
  • Item links in Schema validation errors, #89
  • Empty NAN bars on category graphs, #93
  • data_quality_report(), #95
  • Wrong number of Collection Items if it contains item 0, #112

Removed

  • Responses Per Item Ratio rule
  • Deprecated expand parameter and removed flat_df, since the Garbage Symbols rule deals with nested data itself, #133

[0.3.5] (2019-05-14)

Added

  • Arche() supports any iterables with item dicts, fixing jsonschema consistency, #83
  • Items.from_array to read raw data from iterables, #83

Changed

Fixed

Removed

[0.3.4] (2019-05-06)

Fixed

  • basic_json_schema() fails with long 1.0 types, #80

[0.3.3] (2019-05-03)

Added

  • Accept dataframes as source or target, #69

Changed

  • data_quality_report plots the same "Fields Coverage" instead of green "Scraped Fields Coverage"
  • Plot theme changed from ggplot2 to seaborn, #62
  • Same target and source raise an error, was a warning before
  • Passed rules marked with green PASSED.

Fixed

Removed

  • Deprecated Arche.basic_json_schema(), use basic_json_schema()
  • Removed Quickstart.md as redundant - documentation lives in notebooks

[0.3.2] (2019-04-18)

Added

  • Allow reading private raw schemas directly from bitbucket, #58

Changed

  • Progress widgets are removed before printing graphs
  • New plotly v4 API

Fixed

  • Failing Compare Prices For Same Urls when url is nan, #67
  • Empty graphs in Jupyter Notebook, #63

Removed

  • Scraped Items History graphs

[0.3.1] (2019-04-12)

Fixed

  • Empty graphs due to lack of plotlyjs, #61

[0.3.0] (2019-04-12)

Fixed

  • Big notebook size, replaced cufflinks with plotly and ipython, #39

Changed

  • Fields Coverage is now printed as a bar plot, #9
  • Fields Counts renamed to Coverage Difference and results in 2 bar plots, #9, #51:
    • Coverage from job stats fields counts which reflects coverage for each field for both jobs
    • Coverage difference more than 5% which prints >5% difference between the coverages (was ratio difference before)
  • Compare Scraped Categories renamed to Category Coverage Difference and results in 2 bar plots for each category, #52:
    • Coverage for field which reflects value counts (categories) coverage for the field for both jobs
    • Coverage difference more than 10% for field which shows >10% differences between the category coverages
  • Boolean Fields plots Coverage for boolean fields graph which reflects normalized value counts for boolean fields for both jobs, #53
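
The category comparison described above (normalized value counts per job, then differences above 10%) could look roughly like the sketch below. The function names, the sample data, and the direction of the subtraction are assumptions for illustration; the actual rule may compute it differently.

```python
from collections import Counter


def normalized_counts(values):
    """Map each distinct value to its share of all values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def large_differences(source_values, target_values, threshold=0.10):
    """Return categories whose share differs by more than `threshold`."""
    src = normalized_counts(source_values)
    tgt = normalized_counts(target_values)
    diffs = {
        key: src.get(key, 0.0) - tgt.get(key, 0.0)
        for key in set(src) | set(tgt)
    }
    return {key: diff for key, diff in diffs.items() if abs(diff) > threshold}


# Shares: red 0.5, blue 0.3, green 0.2
source_colors = ["red"] * 5 + ["blue"] * 3 + ["green"] * 2
# Shares: red 0.2, blue 0.3, green 0.5
target_colors = ["red"] * 2 + ["blue"] * 3 + ["green"] * 5

result = large_differences(source_colors, target_colors)
```

Only `red` and `green` are reported, since the share of `blue` is the same in both jobs.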

Removed

  • cufflinks dependency
  • Deprecated category_field tag

[2019.03.25]

Added

  • CHANGES.md
  • new arche.rules.duplicates.find_by() to find duplicates by chosen columns
    ```python
    import arche
    from arche.readers.items import JobItems

    df = JobItems(0, "235801/1/15").df
    arche.rules.duplicates.find_by(df, ["title", "category"]).show()
    ```
  • basic_json_schema().json() prints a schema in JSON format
  • Result.show() to print a rule result, e.g.
    ```python
    from arche.rules.garbage_symbols import garbage_symbols
    from arche.readers.items import JobItems

    items = JobItems(0, "235801/1/15")
    garbage_symbols(items).show()
    ```
  • notebooks to documentation
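
For readers unfamiliar with the idea behind finding duplicates by chosen columns, here is a library-free sketch. It is not how `find_by()` is implemented; `find_duplicates_by` and the sample rows are hypothetical and only show the grouping logic on plain dicts.

```python
from collections import defaultdict


def find_duplicates_by(rows, columns):
    """Group row indexes by the values of `columns`; keep groups with repeats."""
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        key = tuple(row.get(column) for column in columns)
        groups[key].append(index)
    return {key: indexes for key, indexes in groups.items() if len(indexes) > 1}


rows = [
    {"title": "Lamp", "category": "home", "price": 10},
    {"title": "Lamp", "category": "home", "price": 12},
    {"title": "Lamp", "category": "garden", "price": 10},
]
dupes = find_duplicates_by(rows, ["title", "category"])
```

Rows 0 and 1 share the same title and category, so they form the only duplicate group; the price column is ignored because it was not chosen.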

Changed

  • Tags rule returns unused tags, #2
  • basic_json_schema() prints a schema as a python dict

Deprecated

  • Arche().basic_json_schema() deprecated in favor of arche.basic_json_schema()

Removed

Fixed

  • Arche().basic_json_schema() not using items_numbers argument

2019.03.18

  • Last release without CHANGES updates