Extract the data frame of errors? #51

Open
shippy opened this issue May 17, 2018 · 2 comments

Comments

@shippy

shippy commented May 17, 2018

I find that I often need the following from the same assumption-checking code:

  1. Fail the analysis if the assumptions are violated.
  2. Separate out two data frames: (1) a data frame of the rows that violate the assumptions (to remand to data collection) and (2) a data frame of the rows that pass the checks (for further data analysis).
  3. As an alternative to 2, get a single data frame with a column that indicates whether each row passed the check.
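
Items 2 and 3 can be sketched with plain pandas, assuming the check can be written as a function that returns a boolean Series with one value per row. The names `split_by_check`, `flag_by_check`, and `price_positive` are hypothetical and are not part of engarde:

```python
import pandas as pd


def split_by_check(df, check):
    """Return (passing_rows, failing_rows) for a row-wise boolean check."""
    ok = check(df).astype(bool)
    return df[ok], df[~ok]


def flag_by_check(df, check, column="passed"):
    """Return a copy of df with a boolean column recording the check result."""
    return df.assign(**{column: check(df).astype(bool)})


def price_positive(df):
    # Hypothetical check: every price must be strictly positive.
    return df["price"] > 0


df = pd.DataFrame({"price": [10.0, -1.0, 3.5]}, index=["a", "b", "c"])
good, bad = split_by_check(df, price_positive)
flagged = flag_by_check(df, price_positive)
```

Here `bad` holds only row `b`, and `flagged` keeps all three rows plus a `passed` column. Item 1 would still need a separate step that raises if `bad` is not empty.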

I understand the original intention of engarde is to fail early, and it does provide some tools for (2), but there are two particular pain points:

  1. Getting from a failed check back to the data frames with and without errors is a little tough. In some cases, that's easy: verify_all attaches a data frame of the failing rows to the exception as AssertionError.args[1]. In others, it is less so: none_missing returns a list of (index, column) tuples, which all have to be passed to pandas.DataFrame.loc separately.
  2. Engarde raises on the first error it encounters, which means that any other checks that might fail will only be discovered once this error is worked around.
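
One possible workaround for the second pain point, sketched under assumptions: call each check in its own try/except block and collect the AssertionErrors rather than stopping at the first. The helper `collect_failures` and the two stand-in checks are hypothetical. They are not engarde functions; they only mimic the convention described above of putting a data frame of failing rows in `args[1]`:

```python
import pandas as pd


def collect_failures(df, checks):
    """Run every check; return a list of (check name, AssertionError) pairs."""
    failures = []
    for check in checks:
        try:
            check(df)
        except AssertionError as err:
            failures.append((check.__name__, err))
    return failures


# Stand-in checks that mimic the raise-on-failure style; not engarde functions.
def no_negative_price(df):
    bad_rows = df[df["price"] < 0]
    if not bad_rows.empty:
        raise AssertionError("negative price", bad_rows)
    return df


def no_missing_time(df):
    bad_rows = df[df["time"].isnull()]
    if not bad_rows.empty:
        raise AssertionError("missing time", bad_rows)
    return df


df = pd.DataFrame({"price": [10.0, -1.0, 3.5], "time": [1.0, 2.0, None]})
failures = collect_failures(df, [no_negative_price, no_missing_time])

# Gather the labels of every failing row from errors that carry a data frame.
bad_labels = set()
for _, err in failures:
    if len(err.args) > 1 and isinstance(err.args[1], pd.DataFrame):
        bad_labels.update(err.args[1].index)

is_bad = df.index.isin(list(bad_labels))
bad, good = df[is_bad], df[~is_bad]
```

Checks that report failures in some other shape, such as a list of (index, column) tuples, would need their own extraction step before the split.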

Can engarde be used for my use case, or is that too far away from engarde's philosophy?

@TomAugspurger
Collaborator

Interesting... I hadn't considered 1. Do you have any proposed APIs to support splitting the pipeline in two? I'm not quite sure what it would look like...

I did hit pain point 2 when I was using engarde more. Not sure how best to handle it either.

@shippy
Author

shippy commented May 18, 2018

Hm :) Perhaps engarde.decorators.sieve? In my head, it would maybe look like this:

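import pandas as pd
import engarde.decorators as ed

# `ed.sieve` is the proposed decorator and does not exist in engarde;
# `rational` is a row-wise check function defined elsewhere.
@ed.sieve
@ed.verify_all(rational)
def unload():
    url = "http://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/Train.csv"
    trains = pd.read_csv(url, index_col=0)
    return trains

# Proposed behavior: rows that pass the check, and rows that fail it.
trains_good, trains_bad = unload()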

sieve would have to catch all assertions, extract the indices of the rows that contain the error, and return a tuple of data frames. This might not make sense for all checks, but I think it makes sense for a lot of them?
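
A minimal sketch of how such a decorator might work, with one deliberate difference from the proposal. An exception raised by an inner decorator carries at most the failing rows, so the passing rows are lost by the time an outer decorator catches it. This hypothetical `sieve` therefore takes the row-wise checks as arguments rather than stacking on top of `verify_all`. Everything here, including the stand-in `rational` check and the in-memory data, is an assumption for illustration, not engarde's API:

```python
import functools

import pandas as pd


def sieve(*checks):
    """Hypothetical decorator factory: split the result into (good, bad).

    Each check takes a data frame and returns a boolean Series
    (True means the row passes). A row is good only if it passes every check.
    """
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            df = func(*args, **kwargs)
            ok = pd.Series(True, index=df.index)
            for check in checks:
                ok = ok & check(df).astype(bool)
            return df[ok], df[~ok]
        return wrapper
    return decorate


def rational(df):
    # Hypothetical stand-in for the row-wise check in the proposal.
    return df["price1"] < df["price2"]


@sieve(rational)
def unload():
    # Small in-memory frame instead of the remote CSV, so the sketch runs offline.
    return pd.DataFrame({"price1": [1, 5, 2], "price2": [3, 4, 6]})


trains_good, trains_bad = unload()
```

Because every check is evaluated before the split, this variant also reports rows that fail a later check. It never raises, so failing the analysis would still need an explicit test on `trains_bad`.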
