[SNOW-990542] When converting snowpark dataframe to pandas, cast decimal columns to float type #1201

Merged
sfc-gh-sfan merged 7 commits into main from xhe-SNOW-990542-cast-decimal-to-float on Jan 12, 2024

Conversation

sfc-gh-xhe
Contributor

@sfc-gh-xhe sfc-gh-xhe commented Jan 11, 2024

Please answer these questions before submitting your pull requests. Thanks!

  1. What GitHub issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes #SNOW-990542

pandas doesn't recognize the decimal type. When we convert a Snowpark DataFrame to a pandas DataFrame, decimal columns are not handled correctly, which results in issues.

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
  3. Please describe how your code solves the related issue.

When we convert a Snowpark DataFrame to a pandas DataFrame, we cast decimal columns to float64.
For more details, please see [thread].
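
A minimal sketch of the behavior described above, using plain pandas rather than the Snowpark code path. It assumes the decimal values arrive as `decimal.Decimal` objects, which pandas stores in an object column; the column name `A` and the values are made up for illustration.

```python
from decimal import Decimal

import pandas as pd

# Hypothetical data: Decimal values such as a NUMBER(38, 6) column might yield.
pdf = pd.DataFrame({"A": [Decimal("1.500000"), Decimal("2.250000")]})
print(pdf["A"].dtype)  # object

# Casting to float64 gives a numeric column that pandas operations understand.
pdf["A"] = pdf["A"].astype("float64")
print(pdf["A"].dtype)  # float64
```

Note that float64 cannot represent every 38-digit decimal exactly, so the cast trades exactness for a dtype pandas can work with.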

github-actions bot commented Jan 11, 2024

CLA Assistant Lite bot All contributors have signed the CLA ✍️ ✅

@sfc-gh-xhe sfc-gh-xhe force-pushed the xhe-SNOW-990542-cast-decimal-to-float branch from 2ac08bb to 0a59f5b Compare January 11, 2024 22:02
@sfc-gh-xhe sfc-gh-xhe force-pushed the xhe-SNOW-990542-cast-decimal-to-float branch from 0a59f5b to 77dab49 Compare January 11, 2024 22:03
Collaborator

@sfc-gh-sfan sfc-gh-sfan left a comment

LGTM % nits

@@ -141,6 +141,28 @@ def test_to_pandas_precision_for_number_38_0(session):
assert pdf["A"].min() == -9223372036854775808


def test_to_pandas_precision_for_number_38_6_and_others(session):
Collaborator

Let's rename this test to test_to_pandas_precision_for_non_zero_scale

Contributor Author

Sounds good.

@sfc-gh-sfan
Collaborator

Please also fix the test:

[gw2] linux -- Python 3.8.18 /home/runner/work/snowpark-python/snowpark-python/.tox/py38-notdoctest-ci/bin/python
.tox/py38-notdoctest-ci/lib/python3.8/site-packages/pandas/core/indexes/base.py:3653: in get_loc
    return self._engine.get_loc(casted_key)
pandas/_libs/index.pyx:147: in pandas._libs.index.IndexEngine.get_loc
    ???
pandas/_libs/index.pyx:176: in pandas._libs.index.IndexEngine.get_loc
    ???
pandas/_libs/hashtable_class_helper.pxi:7080: in pandas._libs.hashtable.PyObjectHashTable.get_item
    ???
pandas/_libs/hashtable_class_helper.pxi:7088: in pandas._libs.hashtable.PyObjectHashTable.get_item
    ???
E   KeyError: 'division'

The above exception was the direct cause of the following exception:
tests/integ/test_df_to_pandas.py:161: in test_to_pandas_precision_for_number_38_6_and_others
    assert pdf["division"].dtype == "float64"
.tox/py38-notdoctest-ci/lib/python3.8/site-packages/pandas/core/frame.py:3761: in __getitem__
    indexer = self.columns.get_loc(key)
.tox/py38-notdoctest-ci/lib/python3.8/site-packages/pandas/core/indexes/base.py:3655: in get_loc
    raise KeyError(key) from err
E   KeyError: 'division'

Contributor

@sfc-gh-aalam sfc-gh-aalam left a comment

Since we are also modifying the scope of the function, we should rename _fix_pandas_df_integer -> _fix_pandas_df_fixed_type

@sfc-gh-xhe
Contributor Author

I'm a bit confused why I have this error...

@sfc-gh-sfan
Collaborator

I'm confused as well because I saw the stacktrace has assert pdf["division"].dtype == "float64". Your code does not have this 🤔

@sfc-gh-xhe
Contributor Author

I have read the CLA Document and I hereby sign the CLA

@sfc-gh-xhe sfc-gh-xhe marked this pull request as ready for review January 12, 2024 00:32
@sfc-gh-xhe sfc-gh-xhe requested a review from a team as a code owner January 12, 2024 00:32
@sfc-gh-xhe
Contributor Author

I'm confused as well because I saw the stacktrace has assert pdf["division"].dtype == "float64". Your code does not have this 🤔

Yeah, that's from my original test. I guess it's because the pandas column name needs to be upper case, so I changed the column name in the test.
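
A small sketch of why the lowercase lookup fails, assuming the column came back as `DIVISION` because Snowflake stores unquoted identifiers in upper case. The frame below is a hypothetical stand-in for a `to_pandas()` result, not output from the actual test.

```python
import pandas as pd

# Hypothetical stand-in for the converted frame; the column name is upper case.
pdf = pd.DataFrame({"DIVISION": [0.5, 1.25]})

# pandas column lookup is case-sensitive.
print("division" in pdf.columns)  # False
print("DIVISION" in pdf.columns)  # True

try:
    pdf["division"]
except KeyError as err:
    # Same kind of failure as in the CI stack trace.
    print("KeyError:", err)
```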

@sfc-gh-xhe sfc-gh-xhe changed the title from "[DRAFT] When converting snowpark dataframe to pandas, cast decimal columns to float64" to "[SNOW-990542] When converting snowpark dataframe to pandas, cast decimal columns to float64" Jan 12, 2024
@sfc-gh-sfan
Collaborator

@sfc-gh-aalam Do you think we should use pandas.to_numeric(pd_df[pandas_col_name], downcast="float") instead? I wrote the changelog referencing float64, but I'm not sure whether we should hard-code float64 or let to_numeric decide.

Comment on lines 720 to 721
# recognize decimal type.
pd_df[pandas_col_name] = pd_df[pandas_col_name].astype("float64")
Contributor

Instead of a hard astype to float64, we should prefer to_numeric with downcast="float" in this case.

Collaborator

Unfortunately, after changing to downcast="float", certain cases (column C in the test) still return object (dtype('O')). It's unclear to me whether this is expected. Do we want to force "float64" if the column cannot be cast to float?

Contributor Author

That sounds good to me. What do you think? @sfc-gh-aalam

Contributor

I guess that is the only option we have. We should run this by @sfc-gh-yixie before merging.
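
The approach discussed in this thread (try `to_numeric` with `downcast="float"`, then force float64 if the column is still object) could be sketched roughly as below. `cast_decimal_column` is a hypothetical helper written for illustration, not the function in this PR. One detail worth keeping in mind is that `pandas.to_numeric` returns its result rather than modifying the column in place, so the sketch keeps the return value.

```python
from decimal import Decimal

import pandas as pd


def cast_decimal_column(col: pd.Series) -> pd.Series:
    """Hypothetical helper: try to_numeric first, fall back to float64."""
    # to_numeric returns a new Series; the result has to be kept.
    converted = pd.to_numeric(col, downcast="float")
    if converted.dtype == "O":
        # Still an object column, so force float64 as proposed in the thread.
        converted = converted.astype("float64")
    return converted


# Made-up sample column of Decimal values.
col = pd.Series([Decimal("1.5"), Decimal("2.25")], name="C")
result = cast_decimal_column(col)
print(result.dtype)  # a float dtype; the exact width depends on the downcast
```

With `downcast="float"` the resulting width may be float32 or float64 depending on the values and the pandas version, which is one reason the title talks about a "float type" rather than float64.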

@sfc-gh-sfan sfc-gh-sfan changed the title from "[SNOW-990542] When converting snowpark dataframe to pandas, cast decimal columns to float64" to "[SNOW-990542] When converting snowpark dataframe to pandas, cast decimal columns to float type" Jan 12, 2024
# For decimal columns, we want to cast it into float64 because pandas doesn't
# recognize decimal type.
pandas.to_numeric(pd_df[pandas_col_name], downcast="float")
if pd_df[pandas_col_name].dtype == "O":
Collaborator

For my education, when is a pandas column dtype "0"?

Collaborator

It is O (the letter), not 0 (zero) LOL. It means the column's dtype is object.
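
For illustration, a short sketch of the `"O"` check from the diff, using made-up Series. It relies on NumPy dtypes comparing equal to their string codes, so `"O"` matches the object dtype.

```python
from decimal import Decimal

import numpy as np
import pandas as pd

# Made-up columns: one holding Decimal objects, one holding plain floats.
decimals = pd.Series([Decimal("1.5"), Decimal("2.25")])
floats = pd.Series([1.5, 2.25])

print(decimals.dtype == "O")               # True: values are generic Python objects
print(decimals.dtype == np.dtype(object))  # True: "O" is the code for object
print(floats.dtype == "O")                 # False: this column is float64
```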

@sfc-gh-sfan sfc-gh-sfan merged commit 27ae233 into main Jan 12, 2024
57 checks passed
@sfc-gh-sfan sfc-gh-sfan deleted the xhe-SNOW-990542-cast-decimal-to-float branch January 12, 2024 21:01
@github-actions github-actions bot locked and limited conversation to collaborators Jan 12, 2024
4 participants