High level API redesign #123

Open
manycoding opened this issue Jun 25, 2019 · 2 comments
Labels
Type: API, Type: Question (Further information is requested)

Comments

@manycoding
Contributor

manycoding commented Jun 25, 2019

I want to move away from the schema, mostly because schema-based features are currently all mixed together across Arche, Matt and the standard schema.
I am thinking about fastai-like parameters (see https://docs.fast.ai/tabular.data.html#TabularList):

a = Arche(data, cat_names=["size"], cont_names=["price"], uniques=["id", ("url", "title", "price")])
Duplicates would then use uniques, i.e. check that every id is unique and that every row has a unique combination of url, title and price.
Categories would use cat_names.
cont_names is just an example, but it could be used to identify numerical data and then plot stats such as deviation and percentiles.
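
To make the uniques semantics concrete, here is a minimal sketch of how such a check could behave. It is not Arche code: find_duplicates is a hypothetical helper, and it assumes items are plain dicts rather than a DataFrame. A string entry is treated as a single field, and a tuple as a combination of fields that must be unique together.

```python
from collections import Counter


def find_duplicates(items, uniques):
    """Map each uniqueness key to the values that occur more than once.

    Hypothetical helper for illustration only, not part of Arche.
    """
    duplicates = {}
    for key in uniques:
        # A plain string means one field; a tuple means a combination of fields.
        fields = (key,) if isinstance(key, str) else tuple(key)
        counts = Counter(tuple(item.get(f) for f in fields) for item in items)
        repeated = [value for value, n in counts.items() if n > 1]
        if repeated:
            duplicates[key] = repeated
    return duplicates


items = [
    {"id": 1, "url": "a", "title": "x", "price": 10.0, "size": "S"},
    {"id": 2, "url": "b", "title": "y", "price": 20.0, "size": "M"},
    {"id": 2, "url": "b", "title": "y", "price": 20.0, "size": "M"},
]
duplicates = find_duplicates(items, ["id", ("url", "title", "price")])
```

Here both keys report a duplicate, because the last two items share the same id and the same url, title and price combination.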

Thoughts?
@ejulio @raphapassini @victor-torres @alexander-matsievsky

@manycoding manycoding added the Type: Question Further information is requested label Jun 25, 2019
@manycoding manycoding added this to the 0.3.7 milestone Jun 28, 2019
@ejulio

ejulio commented Jul 1, 2019

This is a good idea.
It would probably be easier than jsonschema for writing some validations and checks 😄.
Since we don't need to bother about names, I'd suggest using full names instead of abbreviations. In this case, category_names over cat_names.
If cat_names is a list of categories, I'd go with categories, and if they are columns in the df, then category_columns.
The same goes for the other configurations.

Another idea is that data shouldn't go with Arche.
I'd prefer to instantiate Arche as a check template and then feed any data through its methods.
This would be a good fit for multi-job checks.

# since configs are arguments, we could write a jsonschema to arche params for example
a = Arche(my configs here)

a.report_all(job1_data)
a.report_all(job2_data)

@manycoding
Contributor Author

@ejulio

Since we don't need to bother about names, I'd suggest using full names instead of abbreviations. In this case, category_names over cat_names.

I kind of started to like these abbreviations after getting familiar with fastai. The learning curve is the same since you have to check docstrings anyway, but with shorter names the code is smaller.

Another idea is that data shouldn't go with Arche.

I suggested something similar in #69

source_items = Items.from_something(start, count)
target_items = Items.from_something(start, count)

# since configs are arguments, we could write a jsonschema to arche params for example
a = Arche(schema, categories, continuous, uniques)

a.report(source_items)
a.report(target_items)

a.compare(source_items, target_items)
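
For illustration, the "configs in the constructor, data through methods" idea could look roughly like the sketch below. CheckTemplate is a hypothetical stand-in, not the actual Arche class, and it assumes items are plain dicts. The report and compare method names follow the snippet above, but what they return here is an assumption.

```python
from collections import Counter


class CheckTemplate:
    """Hypothetical stand-in for the proposed Arche: holds configs, not data."""

    def __init__(self, categories=(), uniques=()):
        self.categories = list(categories)
        self.uniques = list(uniques)

    def report(self, items):
        """Run the configured checks against one collection of items."""
        result = {"duplicates": {}, "categories": {}}
        for key in self.uniques:
            fields = (key,) if isinstance(key, str) else tuple(key)
            counts = Counter(tuple(i.get(f) for f in fields) for i in items)
            # Number of extra items beyond the first occurrence of each value.
            result["duplicates"][key] = sum(n - 1 for n in counts.values() if n > 1)
        for name in self.categories:
            result["categories"][name] = Counter(i.get(name) for i in items)
        return result

    def compare(self, source_items, target_items):
        """Per-category count difference (target minus source)."""
        source = self.report(source_items)["categories"]
        target = self.report(target_items)["categories"]
        return {
            name: {
                value: target[name][value] - source[name][value]
                for value in set(source[name]) | set(target[name])
            }
            for name in self.categories
        }


job1_items = [{"id": 1, "size": "S"}, {"id": 1, "size": "M"}]
job2_items = [{"id": 3, "size": "M"}, {"id": 4, "size": "M"}]

checks = CheckTemplate(categories=["size"], uniques=["id"])
job1_report = checks.report(job1_items)
job2_report = checks.report(job2_items)
difference = checks.compare(job1_items, job2_items)
```

The same checks object is reused for both jobs, which is the multi-job case mentioned earlier in the thread.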

@manycoding manycoding removed this from the 0.3.7 milestone Sep 4, 2019