SNOW-1622441: Calling UDFs recreates deleted rows as nan and shuffles row values #2072

Open
djfletcher opened this issue Aug 12, 2024 · 1 comment
Labels: bug (Something isn't working), local testing (Local Testing issues/PRs), needs triage (Initial RCA is required)

djfletcher commented Aug 12, 2024

Please answer these questions before submitting your issue. Thanks!

  1. What version of Python are you using?

Python 3.9.6 (default, Feb 3 2024, 15:58:27)
[Clang 15.0.0 (clang-1500.3.9.4)]

  2. What are the Snowpark Python and pandas versions in the environment?

pandas==2.2.2
snowflake-snowpark-python==1.20.0

  3. What did you do?

Deleted a row from a table in local testing mode. When I then select the remaining rows from that table and call a UDF on them, the deleted row gets recreated as nan and the values are shuffled.

>>> from snowflake.snowpark.functions import call_udf, col, lit
>>> from snowflake.snowpark.session import Session
>>> 
>>> 
>>> def add_one(val: int) -> int:
...     return val + 1
... 
>>> 
>>> session = Session.builder.config("local_testing", True).create()
>>> session.udf.register(add_one, name="add_one")
<snowflake.snowpark.mock._udf.MockUserDefinedFunction object at 0x149035d90>
>>> 
>>> df = session.create_dataframe([(1),(2),(3)], schema=["a"])
>>> df.write.save_as_table("my_table", table_type="temporary")
>>> 
>>> t = session.table("my_table")
>>> t.show()
-------
|"A"  |
-------
|1    |
|2    |
|3    |
-------

# row is correctly deleted
>>> t.delete(t["a"] == 1)
DeleteResult(rows_deleted=1)
>>> t.show()
-------
|"A"  |
-------
|2    |
|3    |
-------

# calling a udf recreates the deleted row with nan and shuffles the remaining values
>>> t.with_column("added", call_udf("add_one", col("a"))).show()
-----------------
|"A"  |"ADDED"  |
-----------------
|2    |4        |
|3    |nan      |
|nan  |3        |
-----------------

# `select` has the same result as `with_column`
>>> t.select(col("a"), call_udf("add_one", col("a")).alias("added")).show()
-----------------
|"A"  |"ADDED"  |
-----------------
|2    |4        |
|3    |nan      |
|nan  |3        |
-----------------

# `alias` is not the issue
>>> t.select(col("a"), call_udf("add_one", col("a"))).show()
--------------------------
|"A"  |"ADD_ONE(""A"")"  |
--------------------------
|2    |4                 |
|3    |nan               |
|nan  |3                 |
--------------------------

# udf is the issue because using a lit works
>>> t.select(col("a"), lit("blah").alias("added")).show()
-----------------
|"A"  |"ADDED"  |
-----------------
|2    |blah     |
|3    |blah     |
-----------------

  4. What did you expect to see?

The deleted row should not be recreated as nan, and the remaining rows should not be shuffled.

djfletcher added the bug, local testing, and needs triage labels Aug 12, 2024
github-actions bot changed the title from "Calling UDFs recreates deleted rows as nan and shuffles row values" to "SNOW-1622441: Calling UDFs recreates deleted rows as nan and shuffles row values" Aug 12, 2024
@donjin-master

Hey @sfc-gh-jrose, I know you are working on this. I am new to the snowpark-python open source community and would like to fix this bug. I went through the dataframe classes but wasn't able to find where the data gets scrambled when the values are returned. If you could share some insight into what to look for, that would be great. Thanks in advance.
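
One place to look, offered as an unverified guess and not a confirmed root cause: the symptom looks like pandas index misalignment. The sketch below uses plain pandas only. It assumes the local testing backend keeps table rows in pandas objects, that a delete leaves the original row labels in place, and that the UDF output is built with fresh labels starting at 0. None of that is confirmed by this thread, and `table`, `remaining`, and `udf_result` are hypothetical stand-ins, not Snowpark internals.

```python
import pandas as pd


def add_one(val: int) -> int:
    return val + 1


# Stand-in for the stored table; NOT Snowpark's actual internal structure.
table = pd.DataFrame({"A": [1, 2, 3]})

# Deleting the first row by filtering keeps the original labels 1 and 2.
remaining = table[table["A"] != 1]

# Hypothetical UDF output built without the input's index: labels restart at 0.
udf_result = pd.Series([add_one(v) for v in remaining["A"]], name="ADDED")

# Combining the two aligns on labels, so label 0 comes back as a NaN row.
misaligned = pd.concat([remaining, udf_result], axis=1)
print(misaligned)

# Reusing the input's index keeps each result next to its input row.
aligned_result = pd.Series(
    [add_one(v) for v in remaining["A"]], index=remaining.index, name="ADDED"
)
aligned = pd.concat([remaining, aligned_result], axis=1)
print(aligned)
```

The first frame has three rows with the same pairing of values as the report (2 with 4, 3 with NaN, NaN with 3), while the second has only the two surviving rows with 3 and 4. If that match is not a coincidence, the code in `snowflake.snowpark.mock` that attaches UDF results to the input rows would be a reasonable place to start reading.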
