Use errors='coerce' when converting to numerical or datetime in TableVectorizer's _apply_cast
#666
Answers to your question:
I'll review tomorrow, it's getting too late
One comment: I think that we should code a mechanism to raise a warning. We can discuss this to gauge how important it is.
skrub/_table_vectorizer.py
# we want to ignore entries that cannot be converted
# to this dtype
if pd.api.types.is_numeric_dtype(dtype):
    X[col] = pd.to_numeric(X[col], errors="coerce")
With the current use of "coerce", it is hard to emit a warning when an invalid value is encountered.
If we feel that it is important to warn (I am leaning in this direction but not 100% sure), one way to do it is a try/except: first make the call with errors="raise"; if an error occurs (the "except" block), capture it, emit a warning (ideally an informative one), and then make the same call with errors="coerce".
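A minimal sketch of the try/except fallback described in this comment, for the numeric case only. The helper name `to_numeric_with_warning` and the warning text are hypothetical and are not part of skrub; the same pattern would presumably apply to `pd.to_datetime`.

```python
import warnings

import pandas as pd


def to_numeric_with_warning(series: pd.Series) -> pd.Series:
    """Convert to numeric, warning before falling back to coercion.

    Hypothetical helper for illustration; not the code merged in this PR.
    """
    try:
        # Strict attempt: raises if any entry cannot be parsed.
        return pd.to_numeric(series, errors="raise")
    except (ValueError, TypeError) as exc:
        # Tell the user that some entries are about to become NaN.
        warnings.warn(
            f"Column {series.name!r} contains values that cannot be "
            f"converted to numeric; they will be set to NaN. "
            f"Original error: {exc}",
            UserWarning,
            stacklevel=2,
        )
        # Same call, but invalid entries are coerced to NaN.
        return pd.to_numeric(series, errors="coerce")


converted = to_numeric_with_warning(pd.Series(["1", "oops", "3"], name="a"))
```

The cost of this approach is that columns with invalid entries are parsed twice, which may matter on large tables.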
elif pd.api.types.is_datetime64_any_dtype(dtype):
    X[col] = pd.to_datetime(X[col], errors="coerce")
else:
    # this should not happen
❤️
Thanks!! Merged
Fix #631
Questions: