-
Notifications
You must be signed in to change notification settings - Fork 118
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Local Testing] SNOW-904261 Support DataFrame.except_ #1076
Conversation
@@ -411,13 +403,13 @@ def test_project_should_not_be_pushed_down_through_intersect_or_except(session): | |||
assert df1.except_(df2).count() == 70 | |||
|
|||
|
|||
# TODO: Fix this, `MockExecutionPlan.attributes` are ignoring nullability for now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forgot to mark it as localtestable.
fields = [ | ||
StructField( | ||
f.name, | ||
merge_type( | ||
f.datatype, | ||
nfs.get(f.name, NullType()), | ||
name_to_datatype_b.get(f.name, NullType()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if the data type is not nullable in b?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If f.name
does not exist in name_to_datatype_b
, it shouldn't exist in name_to_nullable_b
either, which means nullable will be set to True
.
# Dedup all none rows | ||
if res_df.isnull().all(axis=1).where(lambda x: x).count() > 1: | ||
res_df = res_df.drop(index=res_df.isnull().all(axis=1).index[1:]) | ||
# If there are all none rows in cur_df, drop all none rows in res_df |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I understand this comment. Could you elaborate it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is because in pandas NaN == NaN
evaluates to False. So NOT IS IN
won't drop rows that are all NaN/None
(all-none-row's) , neither does drop_duplicates
remove those rows even if there are multiple.
bool(any([bool(item is None) for item in result[c]])), | ||
result[ | ||
c | ||
].sf_type.nullable, # bool(any([bool(item is None) for item in result[c]])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: remove the trailing comment.
I cleaned up the code and made some refactor in #1077 , merging this first. |
Please answer these questions before submitting your pull requests. Thanks!
What GitHub issue is this PR addressing? Make sure that there is an accompanying issue to your PR.
Fixes #NNNN
Fill out the following pre-review checklist:
Please describe how your code solves the related issue.
Please write a short description of how your code change solves the related issue.