High level API redesign #123

manycoding · 2019-06-25T22:01:00Z

I want to move away from schema, mostly because schema-based features are now all mixed together across Arche, Matt and the standard schema.
I am thinking about fastai-like parameters https://docs.fast.ai/tabular.data.html#TabularList:

a = Arche(data, cat_names=["size"], cont_names=["price"], uniques=["id", ("url", "title", "price")])
Duplicates would then use uniques, i.e. check that every id is unique and that every row has a unique combination of url, title and price.
Categories would use cat_names.
cont_names is just an example, but it could be used to identify numerical data and then plot stats such as deviation and percentiles.
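
To make the uniques idea concrete, here is a minimal sketch of how such a check could behave. It is not Arche's implementation: `find_duplicates` is a hypothetical helper, and rows are plain dicts rather than a DataFrame so the example stays dependency-free. A string entry means "this column must be unique", and a tuple means "this combination of columns must be unique".

```python
from collections import Counter


def find_duplicates(rows, uniques):
    """Return, per uniqueness rule, the values that occur more than once.

    Hypothetical helper for illustration only, not part of Arche.
    """
    duplicates = {}
    for spec in uniques:
        # Normalize "id" to ("id",) so single columns and combinations share one code path.
        columns = (spec,) if isinstance(spec, str) else tuple(spec)
        counts = Counter(tuple(row.get(c) for c in columns) for row in rows)
        repeated = [value for value, n in counts.items() if n > 1]
        if repeated:
            duplicates[columns] = repeated
    return duplicates


rows = [
    {"id": 1, "url": "a", "title": "A", "price": 10},
    {"id": 2, "url": "a", "title": "A", "price": 10},
    {"id": 2, "url": "b", "title": "B", "price": 5},
]
result = find_duplicates(rows, uniques=["id", ("url", "title", "price")])
```

Here `result` flags id 2 as repeated and the first two rows as sharing the same url, title and price.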

Thoughts?
@ejulio @raphapassini @victor-torres @alexander-matsievsky


ejulio · 2019-07-01T14:30:34Z

This is a good idea.
It would probably be easier to write some validations and checks this way than with jsonschema 😄.
Since we don't need to bother about names, I'd suggest using full names instead of abbreviations. In this case, category_names over cat_names.
If cat_names is a list of categories, I'd go with categories, and if they are columns in the df, then category_columns.
The same goes for the other configurations.

Another idea is that data shouldn't go with Arche.
I'd prefer to instantiate Arche as a check template and then feed any data through its methods.
This would be a good fit for multi-job checks.

# since configs are arguments, we could write a jsonschema to arche params for example
a = Arche(...)  # my configs here

a.report_all(job1_data)
a.report_all(job2_data)
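
A rough sketch of the "check template" idea, under stated assumptions: `CheckTemplate` is a hypothetical stand-in for Arche, the parameter names follow the full-name suggestion above, and the report format is invented for illustration. The object holds only configuration, so the same instance can be applied to data from several jobs.

```python
from statistics import mean, pstdev


class CheckTemplate:
    """Hypothetical data-free template: stores configuration, not data."""

    def __init__(self, category_columns=(), continuous_columns=()):
        self.category_columns = tuple(category_columns)
        self.continuous_columns = tuple(continuous_columns)

    def report_all(self, rows):
        """Run every configured check against the given rows (a list of dicts)."""
        report = {"categories": {}, "continuous": {}}
        for column in self.category_columns:
            # Distinct values seen in a categorical column.
            report["categories"][column] = sorted({row[column] for row in rows})
        for column in self.continuous_columns:
            values = [row[column] for row in rows]
            # Population standard deviation, so a single row yields 0.0.
            report["continuous"][column] = {
                "mean": mean(values),
                "stdev": pstdev(values),
            }
        return report


template = CheckTemplate(category_columns=["size"], continuous_columns=["price"])

job1_data = [{"size": "S", "price": 10.0}, {"size": "M", "price": 20.0}]
job2_data = [{"size": "L", "price": 30.0}]

report1 = template.report_all(job1_data)
report2 = template.report_all(job2_data)
```

Because no data lives on the instance, the two reports are independent of each other and of the order in which the jobs are checked.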

manycoding · 2019-07-01T15:16:18Z

@ejulio

> Since we don't need to bother about names, I'd suggest using full names instead of abbreviations. In this case, category_names over cat_names.

I kind of started to like these abbreviations after getting familiar with fastai. The learning curve is the same since you have to check docstrings anyway, but with shorter names the code is smaller.

> Another idea is that data shouldn't go with Arche.

I suggested something similar in #69

source_items = Items.from_something(start, count)
target_items = Items.from_something(start, count)

# since configs are arguments, we could write a jsonschema to arche params for example
a = Arche(schema, categories, continuous, uniques)

a.report(source_items)
a.report(target_items)

a.compare(source_items, target_items)

manycoding added the Type: Question Further information is requested label Jun 25, 2019

manycoding added this to the 0.3.7 milestone Jun 28, 2019

manycoding mentioned this issue Jul 4, 2019

Duplicates are confusing #131

Closed

manycoding added the Type: API label Jul 6, 2019

manycoding removed this from the 0.3.7 milestone Sep 4, 2019
