
Fix dataframe (de)serialization #600

Open · wants to merge 9 commits into main

Conversation

@ektar commented Feb 3, 2022

Resolves #599

@jbednar jbednar changed the title Fix dataframe serde Fix dataframe (de)serialization Feb 3, 2022
@jbednar (Member) left a comment

Looks good to me! @jlstevens ?

@philippjfr (Member)

Thanks for pushing this fix! Unfortunately, I have a few concerns. First, it seems quite inefficient to serialize to a JSON string, then load it again, and do the reverse on deserialization. I'm also a bit confused, as even a very simple df doesn't round-trip for me using this approach:

[screenshot: a simple dataframe failing to round-trip]

The other thing, which is not necessarily a problem, is that this includes a schema in the data declaration; for a simple table it might look like this:

{"schema":{"fields":[{"name":"index","type":"integer"},{"name":0,"type":"number"},{"name":1,"type":"number"},{"name":2,"type":"number"}],"primaryKey":["index"],"pandas_version":"0.20.0"},"data":[{"index":0,"0":0.5529858391,"1":0.3236336179,"2":0.8076434981},{"index":1,"0":0.6460423326,"1":0.3008845818,"2":0.4971573288},{"index":2,"0":0.0956722025,"1":0.4527357795,"2":0.7199068404},{"index":3,"0":0.4972505762,"1":0.5035357012,"2":0.4823526223},{"index":4,"0":0.5667421155,"1":0.4730645023,"2":0.5405153226},{"index":5,"0":0.212481649,"1":0.3453984203,"2":0.9464833136},{"index":6,"0":0.287443691,"1":0.1836393869,"2":0.6931539396},{"index":7,"0":0.4739893758,"1":0.3493285907,"2":0.799432673},{"index":8,"0":0.7580364284,"1":0.354098563,"2":0.7434683717},{"index":9,"0":0.3159887857,"1":0.8479990624,"2":0.5183711565}]}

This is helpful because it lets pandas accurately decode the datetime type and other custom types, but it's also a bit odd, since we try to keep the schema separate from the data.
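For illustration, here is a minimal sketch (not part of the PR) showing how the table-orient payload bundles the two pieces, and how a consumer could pull them apart:

```python
import json

import pandas as pd

df = pd.DataFrame({"A": [0.1, 0.2]})
payload = json.loads(df.to_json(orient="table"))

# The table orient emits exactly two top-level keys: the dtype
# specification pandas needs to deserialize, and the values themselves.
schema = payload["schema"]
records = payload["data"]
print(sorted(payload))  # -> ['data', 'schema']
```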

@jlstevens (Contributor)

I will have a closer look at this shortly, but looking at the schema Philipp just posted, I have to agree... it is definitely weird to have a schema that includes so many data values like that.

@philippjfr (Member)

> I have to agree... it is definitely weird to have a schema that includes so many data values like that.

It's the opposite: the data serialization includes the schema, not the other way around.

@jlstevens (Contributor)

Thanks for the correction, though either way, these two things shouldn't be mixed together like that!

@ektar (Author) commented Feb 3, 2022

Good catch!

Investigated; the issue seems specific to the pandas "table" orient. Round-trips work if the df was created from records ({0: [], 1: [], ...}) but fail if it was created from a plain numeric array like the one you show.

I added a test with a dataframe created from an array instead of records; it fails as expected. I then removed the 'orient' option to fall back to the default, but the problem now is that, with no schema present, pandas has no way to read the date strings and infer that they are supposed to be datetimes rather than strings.

Summary thus far: without more significant modifications to give pandas type hints (or making pandas' orient='table' work for matrices), the output dataframe won't match the input.

A few options:

  1. Keep orient='table' and work with the pandas devs to fix it for matrix input. Rationale: the functionality is currently broken for tables with well-defined rows/columns that contain dates, which seems like a primary use of pandas dataframes. Purely numeric tables seem mostly used for tests, since pandas offers little advantage over numpy for those. The schema is generally helpful to give pandas during deserialization so the correct dtypes come back, rather than relying on inference logic.
  2. Use the default orient, specify a date format (to preserve the time zone), and leave it to the user or a future capability to cast to the appropriate types.

orient='table': dates and dfs constructed in record format work; dfs constructed from matrices fail

import datetime

import pandas as pd
from IPython.display import display

def serde_func(df):
    orient = 'table'
    return pd.read_json(df.to_json(orient=orient), orient=orient)

def non_eq_disp(*dfs):
    for df in dfs:
        display(df)
    for df in dfs:
        display(df.dtypes)

def run_test(df):
    new_df = serde_func(df)
    eq = df.equals(new_df)
    print(f'Equal: {df.equals(new_df)}')
    if not eq:
        non_eq_disp(df, new_df)
        
print('Date test')
df = pd.DataFrame(
    {
        "A": [datetime.datetime(year, 1, 1) for year in range(2020, 2023)],
        "B": [1, 2, 3],
    }
)
run_test(df)

print('Numeric test')
df = pd.DataFrame([[3, 3], [4, 2]])
run_test(df)

orient='table' output:

Date test
Equal: True
Numeric test
---------------------------------------------------------------------------
IntCastingNaNError                        Traceback (most recent call last)
Input In [74], in <module>
     15 print('Numeric test')
     16 df = pd.DataFrame([[3, 3], [4, 2]])
...
File ~/opt/anaconda3/envs/dev-param/lib/python3.10/site-packages/pandas/core/dtypes/cast.py:1193, in astype_float_to_int_nansafe(values, dtype, copy)
   1189 """
   1190 astype with a check preventing converting NaN to an meaningless integer value.
   1191 """
   1192 if not np.isfinite(values).all():
-> 1193     raise IntCastingNaNError(
   1194         "Cannot convert non-finite values (NA or inf) to integer"
   1195     )
   1196 return values.astype(dtype, copy=copy)

IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer

default orient: matrices work; dates fail, as they're exported as epoch times and imported back as ints

def serde_func(df):
    orient=None
    return pd.read_json(df.to_json(orient=orient), orient=orient)

def non_eq_disp(*dfs):
    for df in dfs:
        display(df)
    for df in dfs:
        display(df.dtypes)

def run_test(df):
    new_df = serde_func(df)
    eq = df.equals(new_df)
    print(f'Equal: {df.equals(new_df)}')
    if not eq:
        non_eq_disp(df, new_df)
        
print('Date test')
df = pd.DataFrame(
    {
        "A": [datetime.datetime(year, 1, 1) for year in range(2020, 2023)],
        "B": [1, 2, 3],
    }
)
run_test(df)

print('Numeric test')
df = pd.DataFrame([[3, 3], [4, 2]])
run_test(df)

default orient output
[screenshot of output]

default orient with an explicit date output format: matrices work; dates come back as strings

def serde_func(df):
    orient=None
    return pd.read_json(df.to_json(orient=orient, date_format='iso'), orient=orient)

def non_eq_disp(*dfs):
    for df in dfs:
        display(df)
    for df in dfs:
        display(df.dtypes)

def run_test(df):
    new_df = serde_func(df)
    eq = df.equals(new_df)
    print(f'Equal: {df.equals(new_df)}')
    if not eq:
        non_eq_disp(df, new_df)
        
print('Date test')
df = pd.DataFrame(
    {
        "A": [datetime.datetime(year, 1, 1) for year in range(2020, 2023)],
        "B": [1, 2, 3],
    }
)
run_test(df)

print('Numeric test')
df = pd.DataFrame([[3, 3], [4, 2]])
run_test(df)

specified date format output
[screenshot of output]
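To sketch what option 2's "user casts later" step might look like (a hypothetical completion, not code from this PR), the ISO date strings survive the round trip as text and can be converted back explicitly by the caller:

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame(
    {"A": pd.to_datetime(["2020-01-01", "2021-01-01"]), "B": [1, 2]}
)

# Default orient with ISO date strings: the round trip preserves the
# values, but the date column comes back as plain strings...
out = pd.read_json(StringIO(df.to_json(date_format="iso")))

# ...so the caller (or a future param feature) casts it back.
out["A"] = pd.to_datetime(out["A"])
print(df.equals(out))
```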

@ektar (Author) commented Feb 3, 2022

One last note on the schema: it provides hints so pandas interprets columns correctly (e.g. dates), and it also captures table structure in a way the other orients can't, e.g. indexing:

import datetime

import pandas as pd
from IPython.display import display

df = pd.DataFrame(
    {
        "A": [datetime.datetime(year, 1, 1) for year in range(2020, 2023)],
        "B": [1, 2, 3]
    },
    index=((1, 1), (1, 2), (2, 1))
)
display(df)
json_str = df.to_json()
print(json_str)
display(pd.read_json(json_str))
print()
json_str = df.to_json(orient='table')
print(json_str)
display(pd.read_json(json_str, orient='table'))

output:
[screenshot of output]

@ektar (Author) commented Feb 3, 2022

As much as I like schemas, orient='table' just doesn't seem ready yet; I'm fine with implementing just casting dates to ISO strings and letting the user parse them later. At least that will stop the current crashing behavior.

If that sounds right, one comment and one question:

  • Earlier there was a question about why I serialize to JSON, load it, and then do the reverse in deserialization. This is just to fit the overall framework of how JSON serialization works in param: all serialization funcs are expected to return a dict, but pandas' to_json returns a string. The point of using pandas' to_json is that the logic doesn't need to be reimplemented here. I needed the reverse in deserialization because of the table orient and its schema. I thought I could just load the dataframe from a dict on the return, but it turns out the index then gets cast to strings instead of ints (tests failed), so I needed the full reverse round trip to get the types correct. This seemed like the simplest change; thoughts?
  • The current df.equals test will fail: the original dataframe will have datetimes, while the new dataframe will have a string column. Any preference for how to change the test? I've updated it to do the date cast and match the time zone of the source.

param/__init__.py (review thread; outdated, resolved)
@jlstevens (Contributor)

> all serialization funcs are expected to return a dict, but panda's to_json returns a string.

It is possible that I would need to run the code to understand the real issue, but this statement isn't quite correct. The idea of the serialize and deserialize methods is to use an intermediate representation of some sort, where the output of serialize is JSON-serializable. As strings (including JSON strings!) are JSON-serializable, I don't think you need json.loads or json.dumps for this to work. Essentially you would be embedding pandas' JSON output in the param JSON output, which is a little weird, but it should work.

If it weren't for potential performance concerns, I would indeed prefer the way you have done it already though!
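The two payload shapes under discussion can be contrasted in a short sketch (illustrative only; the variable names are made up):

```python
import json

import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0]})

# Shape used in this PR: parse pandas' JSON string back into a dict so
# the final param output is one uniform, structured JSON document.
as_dict = json.loads(df.to_json(orient="table"))

# Shape suggested above: embed pandas' JSON output as an opaque string
# value, skipping the extra loads/dumps round trip.
as_string = df.to_json(orient="table")

# Both are JSON-serializable; only the nesting of the payload differs.
print(type(as_dict).__name__, type(as_string).__name__)  # -> dict str
```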

@ektar (Author) commented Feb 4, 2022

> > all serialization funcs are expected to return a dict, but panda's to_json returns a string.
>
> It is possible that I would need to run the code to understand the real issue, but this statement isn't quite correct. The idea of the serialize and deserialize methods is to use an intermediate representation of some sort where the output of serialize is JSON serializable. As strings (including JSON strings!) are JSON serializable, I don't think you need to use json.loads or json.dumps for this to work. Essentially, you would be embedding pandas JSON output in the param JSON output which is a little weird, but should work.
>
> If it weren't for potential performance concerns, I would indeed prefer the way you have done it already though!

I agree it looks weird; I was copying the original implementation to keep the same return type. I think the intent is for the final JSON to be standard, not to contain long strings meant for further processing. This is used by the upstream JSON serializer, which serializes all components and then calls dumps.

@ektar (Author) commented Feb 10, 2022

@jlstevens @philippjfr - Any changes needed? Happy to implement

@jbednar jbednar added this to the v1.12.2 milestone May 16, 2022
@maximlt (Member) commented May 31, 2022

@ektar you mentioned limitations with orient='table' and that it was not yet ready. I've just made a quick attempt at using it and it looks like it's working fine.

import pandas as pd

df = pd.util.testing.makeMixedDataFrame()
out = pd.read_json(df.to_json(orient='table'), orient='table')
df.equals(out)  # True

Output of `df.info()`:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
A    5 non-null float64
B    5 non-null float64
C    5 non-null object
D    5 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(2), object(1)
memory usage: 288.0+ bytes

@jlstevens (Contributor)

I've thought about this a bit more and I'm fine with the pd.read_json(df.to_json(orient='table'), orient='table') approach which I now think is better than the current approach. Here are my observations now I've managed to clarify my thoughts about this:

  • The 'schema' in the pandas JSON should be called dtypes imho; it is not official JSON Schema, as 'datetime' is not a valid JSON Schema type. The 'schema' here is really a specification of dtypes so pandas can deserialize correctly, and it only exists when a concrete dataframe exists.
  • The Parameter schema generated by param is JSON Schema, and it exists at the parameter level even when there is no concrete dataframe present (e.g. allow_None=True with a value of None).

It is a little confusing in terms of nomenclature, but I think these two things are orthogonal and both have reasons to exist. I have a quibble with the naming used by pandas but what really matters to me is that there is a convenient way to roundtrip the data.

To me, this means that the pd.read_json(df.to_json(orient='table'), orient='table') approach looks good to me in principle. @ektar could you confirm what @maximlt asserts above about the dataframes being equal after round tripping? And if so, what are the remaining problems you see with this approach?
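To illustrate the naming point (a quick demonstration, not code from the PR), the 'schema' block emitted by pandas uses field types such as 'datetime' that are not JSON Schema types, supporting the view that it is really a dtype specification:

```python
import json

import pandas as pd

df = pd.DataFrame({"A": pd.to_datetime(["2020-01-01"]), "B": [1]})
payload = json.loads(df.to_json(orient="table"))

# Each field carries a pandas/Table Schema dtype hint; 'datetime'
# appears for the datetime64 column, which JSON Schema does not define.
types = [field["type"] for field in payload["schema"]["fields"]]
print(types)
```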

@maximlt maximlt modified the milestones: v1.12.2, v1.12.3 Jun 2, 2022
@droumis (Member) commented Jan 16, 2023

@ektar, do you intend on completing this?

@maximlt maximlt modified the milestones: v1.12.x, v2.x Apr 4, 2023
@maximlt (Member) commented Apr 12, 2023

> Investigated, seems issue just with the pandas "table" orient - roundtrips work if df created using records({0: [], 1: []...}), fails if df was created just with a numeric array like you show.

For reference, this is the issue with trying to round-trip with pandas and orient='table': pandas-dev/pandas#46392
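One possible workaround for the linked pandas issue (a sketch only, not something tested against this PR) is to stringify non-string column labels before serializing, since JSON object keys are always strings and the mismatch between integer field names and string data keys is what breaks the round trip:

```python
from io import StringIO

import pandas as pd

# Matrix-constructed frames get integer column labels, which is what
# trips up the table-orient round trip (pandas-dev/pandas#46392).
df = pd.DataFrame([[3, 3], [4, 2]])
df.columns = df.columns.astype(str)  # '0', '1' instead of 0, 1

out = pd.read_json(StringIO(df.to_json(orient="table")), orient="table")
print(df.equals(out))
```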

@maximlt (Member) commented Apr 12, 2023

In 7e5098c I've slightly updated the tests to remove code that was handling timezone conversion. That code was failing when I ran it locally (a newer Pandas version, perhaps?); removing it got the test to pass.

Successfully merging this pull request may close these issues.

Error serializing dataframes with timestamps
6 participants