Most recent releases are shown at the top. Each release shows:
- Added: New classes, methods, functions, etc
- Changed: Additional parameters, changes to inputs or outputs, etc
- Fixed: Bug fixes that don't change documented behaviour
Note that the top-most release contains the changes in the unreleased master branch on GitHub. Parentheses after an item show the name or GitHub id of the contributor of that change.
Keep a Changelog, Semantic Versioning.
- Anomalies to see significant deviations in fields coverage across multiple jobs, #138
- Support for the Bitbucket API, in order to access files from private repositories, #71
- Extend inferred schema with `additionalProperties: False` and `uniqueItems: True`, #21
- Fields Difference rule to find the difference between field values of two jobs. Supports normalization, nested fields, full access to the data, #167
- Added `outcome` property on Result, in order to define a rule outcome based on message levels, #173
- Reports rendering. Reports are generated as HTML with a jinja2 template. `Arche.report_all()` displays the rules results grouped by outcome. The plots are displayed on the "plots" tab, #168
- `report_all()` accepts `uniques` arg to find duplicates among columns/rows, #171
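
To illustrate what the two keywords added to the inferred schema mean, here is a minimal standard-library sketch. The schema fragment, the sample items and the `violations()` helper are hypothetical and written only for this illustration; they are not Arche's inference output or its validation code, neither of which this changelog shows.

```python
# Hypothetical schema fragment using the two keywords named in the changelog.
# In a JSON document they are written as `false` and `true`.
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "tags": {"type": "array", "uniqueItems": True},
    },
    "additionalProperties": False,
}


def violations(item, schema):
    """Check only the two keywords above; this is not a full JSON Schema validator."""
    found = []
    allowed = schema.get("properties", {})
    if schema.get("additionalProperties") is False:
        # Any key that the schema does not declare is rejected.
        for name in sorted(set(item) - set(allowed)):
            found.append(f"additional property: {name}")
    for name, rules in allowed.items():
        value = item.get(name)
        if rules.get("uniqueItems") and isinstance(value, list):
            # Simplification: compare by repr so unhashable elements also work.
            seen = [repr(v) for v in value]
            if len(seen) != len(set(seen)):
                found.append(f"non-unique items: {name}")
    return found


good = {"title": "Book", "tags": ["a", "b"]}
bad = {"title": "Book", "tags": ["a", "a"], "price": 10}

print(violations(good, schema))
print(violations(bad, schema))
```

With `additionalProperties: False` the undeclared `price` key is reported, and with `uniqueItems: True` the repeated tag is reported.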
- Categories rule with a plot showing unique values and count per field. By default, `report_all()` only includes fields which have at most 10 unique values. See https://arche.readthedocs.io/en/latest/nbs/Rules.html#Category-fields, #100
- Category documentation
- `Arche.report_all()` does not shorten report by default, added `short` parameter.
- Data is consistent with Dash and Spidermon: `_type` and `_key` fields are dropped from dataframe, raw data, basic schema, #104, #106
- `df.index` now stores `_key` instead
- `basic_json_schema()` works with `deleted` jobs
- `start` is supported for Collections, #112
- `enum` is counted as a `category` tag, #18
- `Garbage Symbols` searches in str representation of nested fields instead of expanded df, #130
- Show real coverage difference (negative/positive) instead of absolute, #114
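
The difference between a signed and an absolute coverage difference can be sketched in plain Python. This assumes that coverage means the share of items in which a field is present and that the difference is taken as source minus target; both are assumptions made for illustration, and the helper names and sample items are hypothetical, not Arche's API.

```python
def coverage(items, fields):
    """Share of items in which each field is present (assumed definition).

    Assumes `items` is not empty.
    """
    total = len(items)
    return {f: sum(f in item for item in items) / total for f in fields}


def coverage_difference(source_items, target_items, fields):
    """Signed difference: a negative value means the source covers the field less."""
    source = coverage(source_items, fields)
    target = coverage(target_items, fields)
    return {f: round(source[f] - target[f], 4) for f in fields}


source_job = [{"name": "a", "price": 1}, {"name": "b"}, {"name": "c", "price": 3}, {"name": "d"}]
target_job = [{"name": "a", "price": 1}, {"name": "b", "price": 2}, {"name": "c"}, {"name": "d", "price": 4}]

diff = coverage_difference(source_job, target_job, ["name", "price"])
print(diff)
```

Here `price` is covered in 2 of 4 source items and 3 of 4 target items, so the signed difference is -0.25. An absolute difference would report 0.25 and hide which job lost coverage.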
- `Arche.glance()`, #88
- Item links in Schema validation errors, #89
- Empty NAN bars on category graphs, #93
- `data_quality_report()`, #95
- Wrong number of Collection Items if it contains item 0, #112
- Responses Per Item Ratio rule
- Deprecated `expand` parameter and removed `flat_df`, since `Garbage Rule` deals with nested data itself, #133
- `Arche()` supports any iterables with item dicts, fixing jsonschema consistency, #83
- `Items.from_array` to read raw data from iterables, #83
- If reading from pandas df directly, store raw data in numpy array. See gotchas http://pandas.pydata.org/pandas-docs/stable/user_guide/gotchas.html#support-for-integer-na
- `basic_json_schema()` fails with long `1.0` types, #80
- Accept dataframes as source or target, #69
- data_quality_report plots the same "Fields Coverage" instead of green "Scraped Fields Coverage"
- Plot theme changed from ggplot2 to seaborn, #62
- Same target and source raise an error, was a warning before
- Passed rules marked with green PASSED.
- Online documentation now renders graphs https://arche.readthedocs.io/en/latest/, #41
- Error colours are back in `report_all()`.
- Deprecated `Arche.basic_json_schema()`, use `basic_json_schema()`
- Removed Quickstart.md as redundant - documentation lives in notebooks
- Allow reading private raw schemas directly from bitbucket, #58
- Progress widgets are removed before printing graphs
- New plotly v4 API
- Failing `Compare Prices For Same Urls` when url is `nan`, #67
- Empty graphs in Jupyter Notebook, #63
- Scraped Items History graphs
- Empty graphs due to lack of plotlyjs, #61
- Big notebook size, replaced cufflinks with plotly and ipython, #39
- Fields Coverage now is printed as a bar plot, #9
- Fields Counts renamed to Coverage Difference and results in 2 bar plots, #9, #51:
  - Coverage from job stats fields counts which reflects coverage for each field for both jobs
  - Coverage difference more than 5% which prints >5% difference between the coverages (was ratio difference before)
- Compare Scraped Categories renamed to Category Coverage Difference and results in 2 bar plots for each category, #52:
  - Coverage for `field` which reflects value counts (categories) coverage for the field for both jobs
  - Coverage difference more than 10% for `field` which shows >10% differences between the category coverages
- Boolean Fields plots Coverage for boolean fields graph which reflects normalized value counts for boolean fields for both jobs, #53
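
As a rough illustration of the "normalized value counts" and ">10% difference" ideas in the entries above, the following standard-library sketch computes relative value frequencies for one field in two jobs and keeps the values whose frequencies differ by more than a threshold. The helper names and sample items are hypothetical, and treating the threshold as a difference in percentage points is an assumption. Arche's own rules and plots are not reproduced here.

```python
from collections import Counter


def normalized_counts(items, field):
    """Relative frequency of each value of `field` among the items that have it."""
    values = [item[field] for item in items if field in item]
    counts = Counter(values)
    return {value: count / len(values) for value, count in counts.items()}


def large_differences(source, target, threshold):
    """Values whose relative frequency differs by more than `threshold`."""
    result = {}
    for value in set(source) | set(target):
        delta = source.get(value, 0.0) - target.get(value, 0.0)
        if abs(delta) > threshold:
            result[value] = delta
    return result


source_job = [{"in_stock": True}] * 3 + [{"in_stock": False}]
target_job = [{"in_stock": True}] * 2 + [{"in_stock": False}] * 2

source = normalized_counts(source_job, "in_stock")
target = normalized_counts(target_job, "in_stock")
flagged = large_differences(source, target, threshold=0.10)
print(source, target, flagged)
```

Normalizing by the number of values makes jobs of different sizes comparable, which is why the frequencies rather than the raw counts are compared.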
- `cufflinks` dependency
- Deprecated `category_field` tag
- CHANGES.md
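- new `arche.rules.duplicates.find_by()` to find duplicates by chosen columns

      import arche
      from arche.readers.items import JobItems

      # Load the items of job 235801/1/15 into a dataframe
      df = JobItems(0, "235801/1/15").df
      # Find duplicates by the chosen columns and print the result
      arche.rules.duplicates.find_by(df, ["title", "category"]).show()
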
- `basic_json_schema().json()` prints a schema in JSON format
- `Result.show()` to print a rule result, e.g.

      from arche.rules.garbage_symbols import garbage_symbols
      from arche.readers.items import JobItems

      items = JobItems(0, "235801/1/15")
      # Run the rule and print its result
      garbage_symbols(items).show()

- notebooks to documentation
- Tags rule returns unused tags, #2
- `basic_json_schema()` prints a schema as a python dict
- `Arche().basic_json_schema()` deprecated in favor of `arche.basic_json_schema()`
- `Arche().basic_json_schema()` not using `items_numbers` argument
- Last release without CHANGES updates