Add RelationalGroupedDataFrame.pivot() #1130
if values is None:
    distinct_values = (
        self._df.select(pivot_col).distinct()._internal_collect_with_tag()
    )
    value_exprs = [Literal(v[0]) for v in distinct_values]
Should we add this in DataFrame.pivot as well, with a documentation update saying this method is not recommended?
>>> create_result = session.sql('''create or replace temp table monthly_sales(empid int, team text, amount int, month text)
... as select * from values
... (1, 'A', 10000, 'JAN'),
... (1, 'B', 400, 'JAN'),
... (2, 'A', 4500, 'JAN'),
... (2, 'A', 35000, 'JAN'),
... (1, 'B', 5000, 'FEB'),
... (1, 'A', 3000, 'FEB'),
... (2, 'B', 200, 'FEB') ''').collect()
I wonder if we could use create_dataframe then save_as_table to represent this session.sql? Maybe it is not worth it but just thinking out loud.
distinct_values = (
    self._df.select(pivot_col).distinct()._internal_collect_with_tag()
)
This is actually a blocking call, and might not be cheap. I wonder if we should highlight why this is not efficient in the documentation, in addition to saying this is not recommended.
Alternatively, should we lazily evaluate this query (put it into the query plan), or at least run an async query here and fetch the result only when the value is required?
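The "fetch the result only when the value is required" idea can be sketched in plain Python. This is only an illustration of deferred evaluation, not the Snowpark implementation: `LazyPivotValues` and `fake_distinct_query` are hypothetical names, and the callable stands in for the blocking distinct-values query.

```python
from typing import Callable, List, Optional


class LazyPivotValues:
    """Hypothetical helper: holds pivot values and fetches distinct ones on first use."""

    def __init__(
        self,
        fetch_distinct: Callable[[], List[object]],
        values: Optional[List[object]] = None,
    ) -> None:
        # fetch_distinct stands in for the blocking "select distinct" query.
        self._fetch_distinct = fetch_distinct
        self._values = values

    def resolve(self) -> List[object]:
        # Run the potentially expensive query only when the values are needed,
        # and cache the result so it runs at most once.
        if self._values is None:
            self._values = self._fetch_distinct()
        return self._values


calls: List[str] = []


def fake_distinct_query() -> List[object]:
    # Records each invocation so the deferral is observable.
    calls.append("query")
    return ["JAN", "FEB"]


lazy = LazyPivotValues(fake_distinct_query)
```

Constructing `lazy` does not run the query; the first `lazy.resolve()` does, and later calls reuse the cached list. When explicit values are passed, the query is never issued.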
Let's make sure we have a TODO for distinct values
Do we need to raise if values is None?
Right now,
I see. Both work, but an explicit error message is probably easier for users to understand :)
    v._expression if isinstance(v, Column) else Literal(v) for v in values
]
self._group_type = _PivotType(pc[0], value_exprs)
return self
You could have the return-type be typing_extensions.Self here
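A minimal sketch of that suggestion, using hypothetical stand-in classes rather than the real RelationalGroupedDataFrame. It assumes `typing_extensions` is available to the type checker; the import sits under `TYPE_CHECKING`, so nothing extra is needed at runtime.

```python
from __future__ import annotations

from typing import TYPE_CHECKING, Any, List, Optional

if TYPE_CHECKING:
    # Only type checkers evaluate this import; on Python 3.11+ typing.Self also works.
    from typing_extensions import Self


class GroupedFrame:
    """Hypothetical stand-in for a grouped DataFrame with a chainable pivot()."""

    def __init__(self) -> None:
        self.pivot_col: Optional[str] = None
        self.pivot_values: List[Any] = []

    def pivot(self, pivot_col: str, values: List[Any]) -> Self:
        # Mutate in place and return the same object so calls can be chained.
        self.pivot_col = pivot_col
        self.pivot_values = list(values)
        return self


class LoggedGroupedFrame(GroupedFrame):
    """Hypothetical subclass: with Self, pivot() is typed as returning this class."""

    def describe(self) -> str:
        return f"pivot on {self.pivot_col} over {self.pivot_values}"


chained = LoggedGroupedFrame().pivot("month", ["JAN", "FEB"])
```

With a plain `-> GroupedFrame` annotation, a type checker would reject `chained.describe()`; `Self` keeps the subclass type through the chained call.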
🚢 🚢
Please answer these questions before submitting your pull requests. Thanks!
What GitHub issue is this PR addressing? Make sure that there is an accompanying issue to your PR.
Fixes SNOW-944062: Implementation and functionality of pivot differs from PySpark and is not user-friendly #1093
Fill out the following pre-review checklist:
Please describe how your code solves the related issue.
Added a pivot method in RelationalGroupedDataFrame, which allows pivot to be accessed using df.group_by().pivot().
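To make the intended result concrete, here is a plain-Python approximation of grouping by empid, pivoting on month, and summing amount over the monthly_sales rows from the docstring example in the diff. It does not use Snowpark; `group_pivot_sum` is a hypothetical helper, and the choice of sum as the aggregate is an assumption made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# (empid, team, amount, month), copied from the docstring example in the diff.
rows: List[Tuple[int, str, int, str]] = [
    (1, "A", 10000, "JAN"),
    (1, "B", 400, "JAN"),
    (2, "A", 4500, "JAN"),
    (2, "A", 35000, "JAN"),
    (1, "B", 5000, "FEB"),
    (1, "A", 3000, "FEB"),
    (2, "B", 200, "FEB"),
]


def group_pivot_sum(
    data: List[Tuple[int, str, int, str]], pivot_values: List[str]
) -> Dict[int, Dict[str, Optional[int]]]:
    """Group by empid, then sum amount into one column per requested month."""
    totals: Dict[int, Dict[str, Optional[int]]] = defaultdict(
        lambda: {v: None for v in pivot_values}
    )
    for empid, _team, amount, month in data:
        cell = totals[empid]  # every group gets a row, even with no matching month
        if month in cell:
            current = cell[month]
            cell[month] = amount if current is None else current + amount
    return dict(totals)


result = group_pivot_sum(rows, ["JAN", "FEB"])
# result == {1: {"JAN": 10400, "FEB": 8000}, 2: {"JAN": 39500, "FEB": 200}}
```

Months that are not listed in `pivot_values` are ignored, and a group with no rows for a listed month keeps `None` in that column.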