
Fix dataframe (de)serialization #600

Open · wants to merge 9 commits into main

Conversation

@ektar commented Feb 3, 2022

Resolves #599

@jbednar jbednar changed the title Fix dataframe serde Fix dataframe (de)serialization Feb 3, 2022
@jbednar (Member) left a comment

Looks good to me! @jlstevens ?

@philippjfr (Member)

Thanks for pushing this fix! Unfortunately, I have a few concerns. First, it seems quite inefficient to serialize to a JSON string, then load it again, and do the reverse on deserialization. I'm also a bit confused, as even a very simple df doesn't round-trip for me using this approach:

[screenshot: a simple dataframe failing to round-trip]

The other thing, which is not necessarily a problem, is that this includes a schema in the data declaration; for a simple table it might look like this:

{"schema":{"fields":[{"name":"index","type":"integer"},{"name":0,"type":"number"},{"name":1,"type":"number"},{"name":2,"type":"number"}],"primaryKey":["index"],"pandas_version":"0.20.0"},"data":[{"index":0,"0":0.5529858391,"1":0.3236336179,"2":0.8076434981},{"index":1,"0":0.6460423326,"1":0.3008845818,"2":0.4971573288},{"index":2,"0":0.0956722025,"1":0.4527357795,"2":0.7199068404},{"index":3,"0":0.4972505762,"1":0.5035357012,"2":0.4823526223},{"index":4,"0":0.5667421155,"1":0.4730645023,"2":0.5405153226},{"index":5,"0":0.212481649,"1":0.3453984203,"2":0.9464833136},{"index":6,"0":0.287443691,"1":0.1836393869,"2":0.6931539396},{"index":7,"0":0.4739893758,"1":0.3493285907,"2":0.799432673},{"index":8,"0":0.7580364284,"1":0.354098563,"2":0.7434683717},{"index":9,"0":0.3159887857,"1":0.8479990624,"2":0.5183711565}]}

This is helpful because it lets pandas accurately decode the datetime type and other custom types, but it's also a bit odd, since we try to keep the schema separate from the data.
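For illustration, here is a minimal sketch (not part of the PR) showing how the table-orient payload bundles the two pieces, and how a consumer could pull them apart:

```python
import json

import pandas as pd

df = pd.DataFrame({"A": [0.1, 0.2]})
payload = json.loads(df.to_json(orient="table"))

# The table orient emits exactly two top-level keys: the dtype
# specification pandas needs to deserialize, and the values themselves.
schema = payload["schema"]
records = payload["data"]
print(sorted(payload))  # -> ['data', 'schema']
```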

@jlstevens (Contributor)

I will have a closer look at this shortly, but looking at the schema Philipp just posted, I have to agree... it is definitely weird to have a schema that includes so many data values like that.

@philippjfr (Member)

> I have to agree... it is definitely weird to have a schema that includes so many data values like that.

It's the opposite: the data serialization includes the schema, not the other way around.

@jlstevens (Contributor)

Thanks for the correction, though either way, these two things shouldn't be mixed together like that!

@ektar (Author) commented Feb 3, 2022

Good catch!

Investigated; the issue seems specific to the pandas "table" orient. Round-trips work if the df was created from records ({0: [], 1: [], ...}) but fail if it was created from a plain numeric array like the one you show.

I added a test with a dataframe created from an array instead of records; it fails as expected. I then removed the 'orient' option to fall back to the default, but the problem now is that, with no schema present, pandas has no way to read the date strings and infer that they are supposed to be datetimes rather than strings.

Summary thus far: without more significant modifications to give pandas type hints (or making pandas' orient='table' work for matrices), the output dataframe won't match the input.

A few options:

  1. Keep orient='table' and work with the pandas devs to fix it for matrix input. Rationale: the functionality is currently broken for tables with well-defined rows/columns that contain dates, which seems like a primary use of pandas dataframes. Purely numeric tables seem mostly used for tests, since pandas offers little advantage over numpy for those. The schema is generally helpful to give pandas during deserialization so the correct dtypes come back, rather than relying on inference logic.
  2. Use the default orient, specify a date format (to preserve the time zone), and leave it to the user or a future capability to cast to the appropriate types.

orient='table': dates and dfs constructed in record format work; dfs constructed from matrices fail

import datetime

import pandas as pd
from IPython.display import display

def serde_func(df):
    orient = 'table'
    return pd.read_json(df.to_json(orient=orient), orient=orient)

def non_eq_disp(*dfs):
    for df in dfs:
        display(df)
    for df in dfs:
        display(df.dtypes)

def run_test(df):
    new_df = serde_func(df)
    eq = df.equals(new_df)
    print(f'Equal: {df.equals(new_df)}')
    if not eq:
        non_eq_disp(df, new_df)
        
print('Date test')
df = pd.DataFrame(
    {
        "A": [datetime.datetime(year, 1, 1) for year in range(2020, 2023)],
        "B": [1, 2, 3],
    }
)
run_test(df)

print('Numeric test')
df = pd.DataFrame([[3, 3], [4, 2]])
run_test(df)

orient='table' output:

Date test
Equal: True
Numeric test
---------------------------------------------------------------------------
IntCastingNaNError                        Traceback (most recent call last)
Input In [74], in <module>
     15 print('Numeric test')
     16 df = pd.DataFrame([[3, 3], [4, 2]])
...
File ~/opt/anaconda3/envs/dev-param/lib/python3.10/site-packages/pandas/core/dtypes/cast.py:1193, in astype_float_to_int_nansafe(values, dtype, copy)
   1189 """
   1190 astype with a check preventing converting NaN to an meaningless integer value.
   1191 """
   1192 if not np.isfinite(values).all():
-> 1193     raise IntCastingNaNError(
   1194         "Cannot convert non-finite values (NA or inf) to integer"
   1195     )
   1196 return values.astype(dtype, copy=copy)

IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer

default orient: matrices work; dates fail, as they're exported as epoch times and imported back as ints

def serde_func(df):
    orient=None
    return pd.read_json(df.to_json(orient=orient), orient=orient)

def non_eq_disp(*dfs):
    for df in dfs:
        display(df)
    for df in dfs:
        display(df.dtypes)

def run_test(df):
    new_df = serde_func(df)
    eq = df.equals(new_df)
    print(f'Equal: {df.equals(new_df)}')
    if not eq:
        non_eq_disp(df, new_df)
        
print('Date test')
df = pd.DataFrame(
    {
        "A": [datetime.datetime(year, 1, 1) for year in range(2020, 2023)],
        "B": [1, 2, 3],
    }
)
run_test(df)

print('Numeric test')
df = pd.DataFrame([[3, 3], [4, 2]])
run_test(df)

default orient output
[screenshot of output]

default orient with an explicit date output format: matrices work; dates come back as strings

def serde_func(df):
    orient=None
    return pd.read_json(df.to_json(orient=orient, date_format='iso'), orient=orient)

def non_eq_disp(*dfs):
    for df in dfs:
        display(df)
    for df in dfs:
        display(df.dtypes)

def run_test(df):
    new_df = serde_func(df)
    eq = df.equals(new_df)
    print(f'Equal: {df.equals(new_df)}')
    if not eq:
        non_eq_disp(df, new_df)
        
print('Date test')
df = pd.DataFrame(
    {
        "A": [datetime.datetime(year, 1, 1) for year in range(2020, 2023)],
        "B": [1, 2, 3],
    }
)
run_test(df)

print('Numeric test')
df = pd.DataFrame([[3, 3], [4, 2]])
run_test(df)

specified date format output
[screenshot of output]
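To sketch what option 2's "user casts later" step might look like (a hypothetical completion, not code from this PR), the ISO date strings survive the round trip as text and can be converted back explicitly by the caller:

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame(
    {"A": pd.to_datetime(["2020-01-01", "2021-01-01"]), "B": [1, 2]}
)

# Default orient with ISO date strings: the round trip preserves the
# values, but the date column comes back as plain strings...
out = pd.read_json(StringIO(df.to_json(date_format="iso")))

# ...so the caller (or a future param feature) casts it back.
out["A"] = pd.to_datetime(out["A"])
print(df.equals(out))
```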

@ektar (Author) commented Feb 3, 2022

One last note on the schema: it provides hints so pandas interprets columns correctly (e.g. dates), and it also captures table structure in a way the other orients can't, e.g. indexing:

import datetime

import pandas as pd
from IPython.display import display

df = pd.DataFrame(
    {
        "A": [datetime.datetime(year, 1, 1) for year in range(2020, 2023)],
        "B": [1, 2, 3]
    },
    index=((1, 1), (1, 2), (2, 1))
)
display(df)
json_str = df.to_json()
print(json_str)
display(pd.read_json(json_str))
print()
json_str = df.to_json(orient='table')
print(json_str)
display(pd.read_json(json_str, orient='table'))

output:
[screenshot of output]

@ektar (Author) commented Feb 3, 2022

As much as I like schemas, orient='table' just doesn't seem ready yet; I'm fine with implementing just casting dates to ISO strings and letting the user parse them later. At least that will stop the current crashing behavior.

If that sounds right, one comment and one question:

  • Earlier there was a question about why I serialize to JSON, load it, and then do the reverse in deserialization. This is just to fit the overall framework of how JSON serialization works in param: all serialization funcs are expected to return a dict, but pandas' to_json returns a string. The point of using pandas' to_json is that the logic doesn't need to be reimplemented here. I needed the reverse in deserialization because of the table orient and its schema. I thought I could just load the dataframe from a dict on the return, but it turns out the index then gets cast to strings instead of ints (tests failed), so I needed the full reverse round trip to get the types correct. This seemed like the simplest change; thoughts?
  • The current df.equals test will fail: the original dataframe will have datetimes, while the new dataframe will have a string column. Any preference for how to change the test? I've updated it to do the date cast and match the time zone of the source.

param/__init__.py (review thread; outdated, resolved)
@jlstevens (Contributor)

> all serialization funcs are expected to return a dict, but panda's to_json returns a string.

It is possible that I would need to run the code to understand the real issue, but this statement isn't quite correct. The idea of the serialize and deserialize methods is to use an intermediate representation of some sort, where the output of serialize is JSON-serializable. As strings (including JSON strings!) are JSON-serializable, I don't think you need json.loads or json.dumps for this to work. Essentially you would be embedding pandas' JSON output in the param JSON output, which is a little weird, but it should work.

If it weren't for potential performance concerns, I would indeed prefer the way you have done it already though!
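The two payload shapes under discussion can be contrasted in a short sketch (illustrative only; the variable names are made up):

```python
import json

import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0]})

# Shape used in this PR: parse pandas' JSON string back into a dict so
# the final param output is one uniform, structured JSON document.
as_dict = json.loads(df.to_json(orient="table"))

# Shape suggested above: embed pandas' JSON output as an opaque string
# value, skipping the extra loads/dumps round trip.
as_string = df.to_json(orient="table")

# Both are JSON-serializable; only the nesting of the payload differs.
print(type(as_dict).__name__, type(as_string).__name__)  # -> dict str
```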

@ektar (Author) commented Feb 4, 2022

> > all serialization funcs are expected to return a dict, but panda's to_json returns a string.
>
> It is possible that I would need to run the code to understand the real issue, but this statement isn't quite correct. The idea of the serialize and deserialize methods is to use an intermediate representation of some sort where the output of serialize is JSON serializable. As strings (including JSON strings!) are JSON serializable, I don't think you need to use json.loads or json.dumps for this to work. Essentially, you would be embedding pandas JSON output in the param JSON output which is a little weird, but should work.
>
> If it weren't for potential performance concerns, I would indeed prefer the way you have done it already though!

I agree it looks weird; I was copying the original implementation to keep the same return type. I think the intent is for the final JSON to be standard, not to contain long strings meant for further processing. This is used by the upstream JSON serializer, which serializes all components and then calls dumps.

@ektar (Author) commented Feb 10, 2022

@jlstevens @philippjfr - Any changes needed? Happy to implement

@jbednar jbednar added this to the v1.12.2 milestone May 16, 2022
@maximlt (Member) commented May 31, 2022

@ektar you mentioned limitations with orient='table' and that it was not yet ready. I've just made a quick attempt at using it and it looks like it's working fine.

import pandas as pd

df = pd.util.testing.makeMixedDataFrame()
out = pd.read_json(df.to_json(orient='table'), orient='table')
df.equals(out)  # True

Output of `df.info()`:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
A    5 non-null float64
B    5 non-null float64
C    5 non-null object
D    5 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(2), object(1)
memory usage: 288.0+ bytes

@jlstevens (Contributor)

I've thought about this a bit more and I'm fine with the pd.read_json(df.to_json(orient='table'), orient='table') approach which I now think is better than the current approach. Here are my observations now I've managed to clarify my thoughts about this:

  • The 'schema' in the pandas JSON should be called dtypes imho; it is not official JSON Schema, as 'datetime' is not a valid JSON Schema type. The 'schema' here is really a specification of dtypes so pandas can deserialize correctly, and it only exists when a concrete dataframe exists.
  • The Parameter schema generated by param is JSON Schema, and it exists at the parameter level even when there is no concrete dataframe present (e.g. allow_None=True with a value of None).

It is a little confusing in terms of nomenclature, but I think these two things are orthogonal and both have reasons to exist. I have a quibble with the naming used by pandas but what really matters to me is that there is a convenient way to roundtrip the data.

To me, this means that the pd.read_json(df.to_json(orient='table'), orient='table') approach looks good to me in principle. @ektar could you confirm what @maximlt asserts above about the dataframes being equal after round tripping? And if so, what are the remaining problems you see with this approach?
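To illustrate the naming point (a quick demonstration, not code from the PR), the 'schema' block emitted by pandas uses field types such as 'datetime' that are not JSON Schema types, supporting the view that it is really a dtype specification:

```python
import json

import pandas as pd

df = pd.DataFrame({"A": pd.to_datetime(["2020-01-01"]), "B": [1]})
payload = json.loads(df.to_json(orient="table"))

# Each field carries a pandas/Table Schema dtype hint; 'datetime'
# appears for the datetime64 column, which JSON Schema does not define.
types = [field["type"] for field in payload["schema"]["fields"]]
print(types)
```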

@maximlt maximlt modified the milestones: v1.12.2, v1.12.3 Jun 2, 2022
@droumis (Member) commented Jan 16, 2023

@ektar, do you intend on completing this?

@maximlt maximlt modified the milestones: v1.12.x, v2.x Apr 4, 2023
@maximlt (Member) commented Apr 12, 2023

> Investigated, seems issue just with the pandas "table" orient - roundtrips work if df created using records({0: [], 1: []...}), fails if df was created just with a numeric array like you show.

For reference, this is the issue with trying to round-trip with pandas and orient='table': pandas-dev/pandas#46392
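One possible workaround for the linked pandas issue (a sketch only, not something tested against this PR) is to stringify non-string column labels before serializing, since JSON object keys are always strings and the mismatch between integer field names and string data keys is what breaks the round trip:

```python
from io import StringIO

import pandas as pd

# Matrix-constructed frames get integer column labels, which is what
# trips up the table-orient round trip (pandas-dev/pandas#46392).
df = pd.DataFrame([[3, 3], [4, 2]])
df.columns = df.columns.astype(str)  # '0', '1' instead of 0, 1

out = pd.read_json(StringIO(df.to_json(orient="table")), orient="table")
print(df.equals(out))
```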

@maximlt (Member) commented Apr 12, 2023

In 7e5098c I've slightly updated the tests to remove code that was handling timezone conversion. That code was failing when I ran it locally (a newer Pandas version, perhaps?); removing it got the test to pass.

Successfully merging this pull request may close these issues.

Error serializing dataframes with timestamps
6 participants