Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add RelationalGroupedDataFrame.pivot() #1130

Merged

Conversation

sfc-gh-aalam
Copy link
Contributor

Please answer these questions before submitting your pull requests. Thanks!

  1. What GitHub issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes SNOW-944062: Implementation and functionality of pivot differs from PySpark and is not user-friendly #1093

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
  3. Please describe how your code solves the related issue.

    Added apivot method in RelationalGroupedDataFrame which allows to access pivot using df.group_by().pivot()

Comment on lines 454 to 458
if values is None:
distinct_values = (
self._df.select(pivot_col).distinct()._internal_collect_with_tag()
)
value_exprs = [Literal(v[0]) for v in distinct_values]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we add this in Dataframe.pivot as well with a documentation update saying this method is not recommended?

@sfc-gh-aalam sfc-gh-aalam marked this pull request as ready for review November 5, 2023 23:34
@sfc-gh-aalam sfc-gh-aalam requested a review from a team as a code owner November 5, 2023 23:34
Comment on lines 409 to 417
>>> create_result = session.sql('''create or replace temp table monthly_sales(empid int, team text, amount int, month text)
... as select * from values
... (1, 'A', 10000, 'JAN'),
... (1, 'B', 400, 'JAN'),
... (2, 'A', 4500, 'JAN'),
... (2, 'A', 35000, 'JAN'),
... (1, 'B', 5000, 'FEB'),
... (1, 'A', 3000, 'FEB'),
... (2, 'B', 200, 'FEB') ''').collect()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could use create_dataframe then save_as_table to represent this session.sql? Maybe it is not worth it but just thinking out loud.

Comment on lines 455 to 457
distinct_values = (
self._df.select(pivot_col).distinct()._internal_collect_with_tag()
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually a blocking call, and might not be cheap. I wonder if we should highlight why this is not efficient in the documentation, in addition to saying this is not recommended.

Alternatively, should we lazy evaluate this query (put them into the query plan), or at least do a async query here and fetch the result only when the value is required?

Copy link
Collaborator

@sfc-gh-sfan sfc-gh-sfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make sure we have a TODO for distinct values

Copy link
Collaborator

@sfc-gh-sfan sfc-gh-sfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to raise if values is None?

@sfc-gh-aalam
Copy link
Contributor Author

Let's make sure we have a TODO for distinct values

https://snowflakecomputing.atlassian.net/browse/SNOW-967385

@sfc-gh-aalam
Copy link
Contributor Author

Do we need to raise if values is None?

Right now, df.pivot doesn't raise it but we could raise a ValueError in both places saying values cannot be empty. Right now we get NoneType is not iterable

@sfc-gh-sfan
Copy link
Collaborator

I see. Both works, probably an explicit error message is easier for user to understand :)

src/snowflake/snowpark/dataframe.py Outdated Show resolved Hide resolved
src/snowflake/snowpark/relational_grouped_dataframe.py Outdated Show resolved Hide resolved
v._expression if isinstance(v, Column) else Literal(v) for v in values
]
self._group_type = _PivotType(pc[0], value_exprs)
return self
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could have the return-type be typing_extensions.Self here

Copy link
Collaborator

@sfc-gh-mkeller sfc-gh-mkeller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚢 🚢

@sfc-gh-aalam sfc-gh-aalam merged commit da4be4a into main Nov 17, 2023
51 checks passed
@sfc-gh-aalam sfc-gh-aalam deleted the aalam-SNOW-944062-usage-of-pivot-differs-from-py-spark branch November 17, 2023 17:38
@github-actions github-actions bot locked and limited conversation to collaborators Nov 17, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SNOW-944062: Implementation and functionality of pivot differs from PySpark and is not user-friendly
3 participants