[SNOW-990542] When converting snowpark dataframe to pandas, cast decimal columns to float type #1201

sfc-gh-xhe · 2024-01-11T21:55:31Z

Please answer these questions before submitting your pull requests. Thanks!

What GitHub issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

Fixes #SNOW-990542

pandas doesn't recognize the decimal type. When we convert a Snowpark DataFrame to a pandas DataFrame, decimal columns are not handled correctly, which leads to issues.

Fill out the following pre-review checklist:
- I am adding a new automated test(s) to verify correctness of my new code
- I am adding new logging messages
- I am adding a new telemetry message
- I am adding new credentials
- I am adding a new dependency

Please describe how your code solves the related issue.

When we convert a Snowpark DataFrame to a pandas DataFrame, we cast decimal columns to float64.
For more details, please see [thread]
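
To illustrate the behavior described above, here is a minimal sketch in plain pandas (it does not go through the Snowpark code path). It assumes the decimal column arrives as decimal.Decimal values; the column name and data are made up for illustration.

```python
from decimal import Decimal

import pandas as pd

# Illustrative data: a column of Decimal values, as a NUMBER column with a
# non-zero scale might look after conversion (assumption, not from the PR).
pdf = pd.DataFrame({"A": [Decimal("1.500000"), Decimal("2.250000")]})

# pandas keeps Decimal values as generic Python objects.
dtype_before = pdf["A"].dtype

# An explicit cast turns the column into a numeric float64 column.
pdf["A"] = pdf["A"].astype("float64")
dtype_after = pdf["A"].dtype

print(dtype_before, dtype_after)
```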

github-actions · 2024-01-11T21:55:47Z

CLA Assistant Lite bot All contributors have signed the CLA ✍️ ✅

sfc-gh-sfan

LGTM % nits

sfc-gh-sfan · 2024-01-11T23:03:22Z

tests/integ/test_df_to_pandas.py

@@ -141,6 +141,28 @@ def test_to_pandas_precision_for_number_38_0(session):
    assert pdf["A"].min() == -9223372036854775808


+def test_to_pandas_precision_for_number_38_6_and_others(session):


Let's rename this test to test_to_pandas_precision_for_non_zero_scale

Sounds good.

sfc-gh-sfan · 2024-01-11T23:06:19Z

Please also fix the test:

[gw2] linux -- Python 3.8.18 /home/runner/work/snowpark-python/snowpark-python/.tox/py38-notdoctest-ci/bin/python
.tox/py38-notdoctest-ci/lib/python3.8/site-packages/pandas/core/indexes/base.py:3653: in get_loc
    return self._engine.get_loc(casted_key)
pandas/_libs/index.pyx:147: in pandas._libs.index.IndexEngine.get_loc
    ???
pandas/_libs/index.pyx:176: in pandas._libs.index.IndexEngine.get_loc
    ???
pandas/_libs/hashtable_class_helper.pxi:7080: in pandas._libs.hashtable.PyObjectHashTable.get_item
    ???
pandas/_libs/hashtable_class_helper.pxi:7088: in pandas._libs.hashtable.PyObjectHashTable.get_item
    ???
E   KeyError: 'division'

The above exception was the direct cause of the following exception:
tests/integ/test_df_to_pandas.py:161: in test_to_pandas_precision_for_number_38_6_and_others
    assert pdf["division"].dtype == "float64"
.tox/py38-notdoctest-ci/lib/python3.8/site-packages/pandas/core/frame.py:3761: in __getitem__
    indexer = self.columns.get_loc(key)
.tox/py38-notdoctest-ci/lib/python3.8/site-packages/pandas/core/indexes/base.py:3655: in get_loc
    raise KeyError(key) from err
E   KeyError: 'division'

src/snowflake/snowpark/_internal/server_connection.py

sfc-gh-aalam

Since we are also modifying the scope of the function, we should rename _fix_pandas_df_integer to _fix_pandas_df_fixed_type.

sfc-gh-xhe · 2024-01-11T23:59:26Z

Please also fix the test:

I'm a bit confused why I have this error...

sfc-gh-sfan · 2024-01-12T00:11:57Z

I'm confused as well because I saw the stacktrace has assert pdf["division"].dtype == "float64". Your code does not have this 🤔

sfc-gh-xhe · 2024-01-12T00:31:24Z

I have read the CLA Document and I hereby sign the CLA

sfc-gh-xhe · 2024-01-12T00:35:02Z

I'm confused as well because I saw the stacktrace has assert pdf["division"].dtype == "float64". Your code does not have this 🤔

Yeah, that's from my original test. I guess it's because the pandas column name needs to be upper case, so I modified the column name in the test.
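
A minimal sketch of the case-sensitivity issue discussed here, using plain pandas. Snowflake folds unquoted identifiers to upper case, so a column created as division would typically surface as DIVISION in the resulting pandas DataFrame. The frame below is a hypothetical stand-in for that result.

```python
import pandas as pd

# Hypothetical result frame: the column name arrives in upper case.
pdf = pd.DataFrame({"DIVISION": [0.333333, 0.5]})

# pandas column lookup is case-sensitive, so the lower-case key fails.
try:
    pdf["division"]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# The upper-case key matches the actual column name.
dtype_name = str(pdf["DIVISION"].dtype)
print(lookup_failed, dtype_name)
```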

sfc-gh-sfan · 2024-01-12T01:01:45Z

@sfc-gh-aalam Do you think we should use pandas.to_numeric(pd_df[pandas_col_name], downcast="float")? I wrote the changelog referencing float64, but I'm not sure if we should just set it to float64 or let to_numeric decide.
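
For context on the question above, here is a small sketch of how the two options can differ in plain pandas. to_numeric with downcast="float" picks the smallest float dtype that can hold the values (float32 at minimum), while astype("float64") always yields float64. The sample values are illustrative.

```python
import pandas as pd

values = pd.Series([1.5, 2.25, -3.0])  # already float64

# downcast="float" lets pandas choose the smallest suitable float dtype.
downcast = pd.to_numeric(values, downcast="float")

# astype pins the dtype regardless of the values.
pinned = values.astype("float64")

print(downcast.dtype, pinned.dtype)
```

So a changelog entry that promises float64 would only hold unconditionally for the astype variant.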

sfc-gh-aalam · 2024-01-12T01:35:20Z

src/snowflake/snowpark/_internal/server_connection.py

+                # recognize decimal type.
+                pd_df[pandas_col_name] = pd_df[pandas_col_name].astype("float64")


Instead of a hard astype to float64, we should prefer to_numeric with downcast="float" in this case.

Unfortunately, after changing to downcast="float", certain cases (column C in the test) still return object (dtype('O')). It's unclear to me if this is expected. Do we want to force "float64" if the column cannot be cast to float?

That sounds good to me. What do you think? @sfc-gh-aalam

I guess that is the only option we have. We should run this by @sfc-gh-yixie before merging.
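
The approach agreed on here can be sketched roughly as follows. This is not the merged implementation: cast_decimal_column is a hypothetical helper, and the sketch assumes the column holds decimal.Decimal values. Note that pandas.to_numeric returns a new object rather than modifying its input, so the sketch assigns the result.

```python
from decimal import Decimal

import pandas as pd


def cast_decimal_column(pd_df: pd.DataFrame, col: str) -> None:
    """Hypothetical helper: convert a decimal column to a float dtype in place."""
    # Let pandas pick a float dtype first; to_numeric returns a new Series.
    converted = pd.to_numeric(pd_df[col], downcast="float")
    # If the column is still object dtype, force float64 as a fallback.
    if converted.dtype == "O":
        converted = pd_df[col].astype("float64")
    pd_df[col] = converted


# Illustrative data, not taken from the PR's test.
pdf = pd.DataFrame({"C": [Decimal("0.123456"), Decimal("7.000000")]})
cast_decimal_column(pdf, "C")
print(pdf["C"].dtype)
```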

sfc-gh-yixie · 2024-01-12T19:58:36Z

src/snowflake/snowpark/_internal/server_connection.py

+                # For decimal columns, we want to cast it into float64 because pandas doesn't
+                # recognize decimal type.
+                pandas.to_numeric(pd_df[pandas_col_name], downcast="float")
+                if pd_df[pandas_col_name].dtype == "O":


For my education, when is a pandas column dtype "0"?

It is O, not 0, LOL. It is when the dtype is object.
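
For reference, a short sketch of the "O" dtype in plain pandas: a column whose values are arbitrary Python objects, such as decimal.Decimal, is reported with dtype object, which numpy abbreviates as "O". The data is illustrative.

```python
from decimal import Decimal

import numpy as np
import pandas as pd

decimal_col = pd.Series([Decimal("1.5"), Decimal("2.5")])
float_col = pd.Series([1.5, 2.5])

# Decimal values are held as Python objects, so the dtype is object ("O").
is_object = decimal_col.dtype == "O"
same_as_numpy_object = decimal_col.dtype == np.dtype(object)

# Plain floats get a numeric dtype instead.
float_is_object = float_col.dtype == "O"

print(is_object, same_as_numpy_object, float_is_object)
```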

sfc-gh-xhe force-pushed the xhe-SNOW-990542-cast-decimal-to-float branch from 2ac08bb to 0a59f5b Compare January 11, 2024 22:02

When converting snowpark dataframe to pandas, cast decimal columns to float64

77dab49

sfc-gh-xhe force-pushed the xhe-SNOW-990542-cast-decimal-to-float branch from 0a59f5b to 77dab49 Compare January 11, 2024 22:03

sfc-gh-sfan approved these changes Jan 11, 2024

sfc-gh-aalam reviewed Jan 11, 2024

sfc-gh-aalam reviewed Jan 11, 2024

addressing comments

9e71912

sfc-gh-xhe marked this pull request as ready for review January 12, 2024 00:32

sfc-gh-xhe requested a review from a team as a code owner January 12, 2024 00:32

sfc-gh-xhe requested review from sfc-gh-mkeller and sfc-gh-aling January 12, 2024 00:32

sfc-gh-xhe changed the title ~~[DRAFT] When converting snowpark dataframe to pandas, cast decimal columns to float64~~ [SNOW-990542] When converting snowpark dataframe to pandas, cast decimal columns to float64 Jan 12, 2024

changelog

dbb9420

sfc-gh-aalam reviewed Jan 12, 2024

sfc-gh-aalam requested a review from sfc-gh-yixie January 12, 2024 01:35

sfc-gh-sfan changed the title ~~[SNOW-990542] When converting snowpark dataframe to pandas, cast decimal columns to float64~~ [SNOW-990542] When converting snowpark dataframe to pandas, cast decimal columns to float type Jan 12, 2024

sfc-gh-sfan added 4 commits January 11, 2024 19:09

use to_numeric

21441fd

Revert "use to_numeric"

3169954

This reverts commit 21441fd.

comment

9f81d05

try downcast and force float64

252954c

sfc-gh-yixie reviewed Jan 12, 2024

sfc-gh-yixie approved these changes Jan 12, 2024

sfc-gh-sfan merged commit 27ae233 into main Jan 12, 2024
57 checks passed

sfc-gh-sfan deleted the xhe-SNOW-990542-cast-decimal-to-float branch January 12, 2024 21:01

github-actions bot locked and limited conversation to collaborators Jan 12, 2024
