
Simpler duplicates #171

Merged
manycoding merged 5 commits into master from simpler_duplicates on Oct 24, 2019

Conversation

@manycoding (Contributor) commented on Oct 11, 2019:

Closes #131, #117

  • Add uniques arg to the main class

Rendered notebook

  1. Refactored all find_by_ rules into two methods: one which reads tags and is used in report_all, and the API one.
  2. find_by now accepts a list of columns and any combination of columns. A combination means we check equality over all the values in the given combination together, e.g. arche.rules.duplicates.find_by(df, [["url", "name"], "upc"]).show() will check that upc is unique and that all rows have a unique url and name combination (see the sketch after this list).
  3. The main find_by message changed from

     997 duplicate(s) with same upc

     to

     product_source_url, product_name contains 23 duplicated values
     upc contains 23 duplicated values

  4. uniques is added to report_all. If the schema contains any tag used in find_by_tag, it overwrites uniques:

     a.report_all(uniques=[["url", "name"], "upc"])
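
To make the combination semantics concrete, here is a minimal sketch of the check in plain pandas, not Arche's actual internals. The helper name show_duplicates and the sample data are hypothetical; only the argument shape ([["url", "name"], "upc"]) and the message format come from this PR:

```python
import pandas as pd

def show_duplicates(df: pd.DataFrame, uniques) -> None:
    # Hypothetical stand-in for arche.rules.duplicates.find_by.
    # Each entry is either a single column ("upc") or a list of columns
    # (["url", "name"]) whose values must be unique *taken together*.
    for entry in uniques:
        subset = entry if isinstance(entry, list) else [entry]
        # keep=False marks every row that belongs to a duplicate group.
        dupes = df[df.duplicated(subset=subset, keep=False)]
        if not dupes.empty:
            print(f"{', '.join(subset)} contains {len(dupes)} duplicated values")

df = pd.DataFrame(
    {
        "url": ["/a", "/a", "/b"],
        "name": ["x", "x", "y"],
        "upc": ["1", "2", "2"],
    }
)
show_duplicates(df, [["url", "name"], "upc"])
# url, name contains 2 duplicated values
# upc contains 2 duplicated values
```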

manycoding requested review from andersonberg, ejulio and simoess, and removed the request for ejulio, on Oct 11, 2019 at 16:47.
codecov bot commented on Oct 11, 2019:

Codecov Report

Merging #171 into master will decrease coverage by 0.21%.
The diff coverage is 80.48%.


@@           Coverage Diff            @@
##           master   #171      +/-   ##
========================================
- Coverage   81.21%    81%   -0.22%     
========================================
  Files          24     24              
  Lines        1624   1606      -18     
  Branches      279    279              
========================================
- Hits         1319   1301      -18     
+ Misses        252    251       -1     
- Partials       53     54       +1
Impacted Files Coverage Δ
src/arche/figures/tables.py 61.9% <ø> (+3.47%) ⬆️
src/arche/quality_estimation_algorithm.py 40.9% <0%> (-1.57%) ⬇️
src/arche/data_quality_report.py 77.27% <100%> (-0.26%) ⬇️
src/arche/arche.py 85.21% <50%> (-2.56%) ⬇️
src/arche/rules/duplicates.py 97.29% <96.55%> (-2.71%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f46c40e...e2b8fea.

manycoding merged commit 476a9dd into master on Oct 24, 2019.
manycoding deleted the simpler_duplicates branch on Oct 24, 2019 at 19:13.
@simoess (Contributor) left a comment:

Looks good. It will be useful.

@@ -126,8 +101,7 @@ def generate_quality_estimation(
         else:
             quality_estimation = (
                 adherence_to_schema_percent * 40 / 100
-                + duplicated_items_percent * 10 / 100
-                + duplicated_skus_percent * 5 / 100
+                + duplicated_items_percent * 15 / 100
A Contributor commented:

Where do these percentages come from? Why 15 percent for duplicated_items_percent?

@manycoding (Contributor, Author) replied on Oct 24, 2019:

They were set 1.5 years ago based on the experience of the QA team at that time.
I set 15 because it is the sum of the two duplicate weights, to keep compatibility for now (see #154).
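
To make the "sum for duplicates" point concrete, a minimal sketch with made-up numbers (the real generate_quality_estimation in src/arche/quality_estimation_algorithm.py has more terms than shown here):

```python
# Hypothetical percentages, purely for illustration.
duplicated_items_percent = 4
duplicated_skus_percent = 4

# Before this PR: two separate duplicate terms weighted 10% and 5%.
before = duplicated_items_percent * 10 / 100 + duplicated_skus_percent * 5 / 100

# After this PR: one consolidated term with the combined 15% weight.
after = duplicated_items_percent * 15 / 100

# Equal whenever the two input percentages agree, which is how the
# combined weight keeps old scores compatible for now (see #154).
assert before == after == 0.6
```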

Successfully merging this pull request may close these issues: Duplicates are confusing.