Fix dataframe (de)serialization #600
Looks good to me! @jlstevens ?
Thanks for pushing this fix! I have a few concerns, unfortunately. First, it seems quite inefficient to serialize to a JSON string and then load it again, and to do the reverse on deserialization. I'm also a bit confused, as even a very simple df doesn't roundtrip for me using this approach:

The other thing, which is not necessarily a problem, is that this includes a schema in the data declaration, e.g. for a simple table this might look like this:

{"schema":{"fields":[{"name":"index","type":"integer"},{"name":0,"type":"number"},{"name":1,"type":"number"},{"name":2,"type":"number"}],"primaryKey":["index"],"pandas_version":"0.20.0"},"data":[{"index":0,"0":0.5529858391,"1":0.3236336179,"2":0.8076434981},{"index":1,"0":0.6460423326,"1":0.3008845818,"2":0.4971573288},{"index":2,"0":0.0956722025,"1":0.4527357795,"2":0.7199068404},{"index":3,"0":0.4972505762,"1":0.5035357012,"2":0.4823526223},{"index":4,"0":0.5667421155,"1":0.4730645023,"2":0.5405153226},{"index":5,"0":0.212481649,"1":0.3453984203,"2":0.9464833136},{"index":6,"0":0.287443691,"1":0.1836393869,"2":0.6931539396},{"index":7,"0":0.4739893758,"1":0.3493285907,"2":0.799432673},{"index":8,"0":0.7580364284,"1":0.354098563,"2":0.7434683717},{"index":9,"0":0.3159887857,"1":0.8479990624,"2":0.5183711565}]}

This is helpful because it lets pandas accurately decode the datetime type or other custom types, but it's also a bit odd since we try to keep the schema separate from the data. |
I will have a closer look at this shortly but looking at the schema Philipp just posted, I have to agree...it is definitely weird to have a schema that includes so many data values like that. |
The opposite is the case, the data serialization includes the schema, not the other way around. |
Thanks for the correction, though either way, these two things shouldn't be mixed together like that! |
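To make the structure under discussion concrete, here is a minimal standard-library sketch that takes a payload shaped like the orient='table' output quoted above and separates the schema from the data records. The helper name `split_table_payload` and the shortened sample payload are illustrative assumptions, not code from this PR.

```python
import json

# Shortened sample in the shape of the orient='table' output quoted above
# (values truncated for illustration).
SAMPLE = json.dumps({
    "schema": {
        "fields": [
            {"name": "index", "type": "integer"},
            {"name": 0, "type": "number"},
        ],
        "primaryKey": ["index"],
    },
    "data": [
        {"index": 0, "0": 0.55},
        {"index": 1, "0": 0.65},
    ],
})


def split_table_payload(payload: str):
    """Hypothetical helper: return (schema, records) from a table-orient JSON string."""
    doc = json.loads(payload)
    return doc["schema"], doc["data"]


schema, records = split_table_payload(SAMPLE)
field_names = [field["name"] for field in schema["fields"]]
print(field_names)        # the schema names the column as the integer 0
print(list(records[0]))   # the records key the same column as the string "0"
```

Note that in the quoted payload the schema lists the column name as the integer 0 while the records use the string "0". That mismatch may be related to the failed roundtrip for array-built dataframes, though nothing in this thread confirms the cause.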
Good catch! I investigated, and the issue seems to be specific to the pandas "table" orient: roundtrips work if the df was created using records ({0: [], 1: []...}) but fail if the df was created from just a numeric array, as you show. I added a test with a dataframe created from an array instead of records, and it fails as expected. I then removed the 'orient' option to fall back to the default. The problem now is that, since the schema isn't present anymore, pandas has no way to read the date strings and infer that they are supposed to be datetimes rather than strings.

Summary thus far: without more significant modifications to give pandas type hints (or to make pandas' orient='table' work for matrices), the output dataframe won't match the input. A few options:
orient='table': dates and dfs constructed in record format work, dfs constructed from matrices fail

import datetime
from io import StringIO

import pandas as pd
from IPython.display import display

def serde_func(df):
    # Round-trip through JSON using the "table" orient (schema + data).
    orient = 'table'
    # StringIO: newer pandas versions expect a file-like object here.
    return pd.read_json(StringIO(df.to_json(orient=orient)), orient=orient)

def non_eq_disp(*dfs):
    # Show the frames first, then their dtypes, to make mismatches visible.
    for df in dfs:
        display(df)
    for df in dfs:
        display(df.dtypes)

def run_test(df):
    new_df = serde_func(df)
    eq = df.equals(new_df)
    print(f'Equal: {eq}')
    if not eq:
        non_eq_disp(df, new_df)

print('Date test')
df = pd.DataFrame(
    {
        "A": [datetime.datetime(year, 1, 1) for year in range(2020, 2023)],
        "B": [1, 2, 3],
    }
)
run_test(df)

print('Numeric test')
df = pd.DataFrame([[3, 3], [4, 2]])
run_test(df)

orient='table' output:
default orient: matrices work, dates fail as they're exported as epoch times and imported as ints

import datetime
from io import StringIO

import pandas as pd
from IPython.display import display

def serde_func(df):
    # Round-trip through JSON using the default orient (no schema).
    orient = None
    # StringIO: newer pandas versions expect a file-like object here.
    return pd.read_json(StringIO(df.to_json(orient=orient)), orient=orient)

def non_eq_disp(*dfs):
    # Show the frames first, then their dtypes, to make mismatches visible.
    for df in dfs:
        display(df)
    for df in dfs:
        display(df.dtypes)

def run_test(df):
    new_df = serde_func(df)
    eq = df.equals(new_df)
    print(f'Equal: {eq}')
    if not eq:
        non_eq_disp(df, new_df)

print('Date test')
df = pd.DataFrame(
    {
        "A": [datetime.datetime(year, 1, 1) for year in range(2020, 2023)],
        "B": [1, 2, 3],
    }
)
run_test(df)

print('Numeric test')
df = pd.DataFrame([[3, 3], [4, 2]])
run_test(df)

default orient, specify date output format: matrices work, dates are interpreted as strings

import datetime
from io import StringIO

import pandas as pd
from IPython.display import display

def serde_func(df):
    # Default orient again, but write dates as ISO 8601 strings.
    orient = None
    json_str = df.to_json(orient=orient, date_format='iso')
    # StringIO: newer pandas versions expect a file-like object here.
    return pd.read_json(StringIO(json_str), orient=orient)

def non_eq_disp(*dfs):
    # Show the frames first, then their dtypes, to make mismatches visible.
    for df in dfs:
        display(df)
    for df in dfs:
        display(df.dtypes)

def run_test(df):
    new_df = serde_func(df)
    eq = df.equals(new_df)
    print(f'Equal: {eq}')
    if not eq:
        non_eq_disp(df, new_df)

print('Date test')
df = pd.DataFrame(
    {
        "A": [datetime.datetime(year, 1, 1) for year in range(2020, 2023)],
        "B": [1, 2, 3],
    }
)
run_test(df)

print('Numeric test')
df = pd.DataFrame([[3, 3], [4, 2]])
run_test(df)
|
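One way to read "give pandas type hints" from the summary above is to ship a small sidecar list of datetime columns next to the data. The following is only a sketch under that assumption: `encode`, `decode`, and the `date_columns` key are hypothetical names, not the approach taken in this PR. It uses orient='split' (a standard pandas orient) so that integer column labels stay in a list rather than becoming string keys.

```python
import datetime
import json

import pandas as pd


def encode(df: pd.DataFrame) -> dict:
    """Hypothetical encoder: JSON-native dict plus a list of datetime columns."""
    date_columns = [
        col for col in df.columns
        if pd.api.types.is_datetime64_any_dtype(df[col])
    ]
    # orient="split" stores column labels in a list, so integer labels survive.
    data = json.loads(df.to_json(orient="split", date_format="iso"))
    return {"date_columns": date_columns, "data": data}


def decode(payload: dict) -> pd.DataFrame:
    """Hypothetical decoder: rebuild the frame, then re-parse the hinted columns."""
    data = payload["data"]
    df = pd.DataFrame(data["data"], index=data["index"], columns=data["columns"])
    for col in payload["date_columns"]:
        # Parse as UTC, then drop the timezone, so ISO strings with or
        # without a trailing "Z" both come back as naive timestamps.
        df[col] = pd.to_datetime(df[col], utc=True).dt.tz_localize(None)
    return df


df_dates = pd.DataFrame(
    {
        "A": [datetime.datetime(year, 1, 1) for year in range(2020, 2023)],
        "B": [1, 2, 3],
    }
)
df_matrix = pd.DataFrame([[3, 3], [4, 2]])

# Pass the payload through dumps/loads to confirm it is plain JSON.
out_dates = decode(json.loads(json.dumps(encode(df_dates))))
out_matrix = decode(json.loads(json.dumps(encode(df_matrix))))
print(out_dates.dtypes)
print(out_matrix)
```

This sketch deliberately ignores timezone-aware columns, datetime indexes, and other custom dtypes, and exact dtype equality may still vary between pandas versions, so it illustrates the idea rather than a complete fix.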
One last note on schema: the schema gives pandas hints for interpreting columns correctly (e.g. dates), and it also captures table structure in a way the other orients can't. E.g. indexing:

import datetime
from io import StringIO

import pandas as pd
from IPython.display import display

df = pd.DataFrame(
    {
        "A": [datetime.datetime(year, 1, 1) for year in range(2020, 2023)],
        "B": [1, 2, 3],
    },
    index=((1, 1), (1, 2), (2, 1)),
)
display(df)

# Default orient
json_str = df.to_json()
print(json_str)
display(pd.read_json(StringIO(json_str)))
print()

# Table orient
json_str = df.to_json(orient='table')
print(json_str)
display(pd.read_json(StringIO(json_str), orient='table'))
|
As much as I like schemas, orient='table' just doesn't seem ready yet. I'm fine with simply casting dates to ISO strings and letting the user parse them later; at least that will stop the current crashing behavior. If that sounds right, a comment and a question:
|
It is possible that I would need to run the code to understand the real issue, but this statement isn't quite correct. The idea of the If it weren't for potential performance concerns, I would indeed prefer the way you have done it already though! |
I agree it looks weird - I was copying the original implementation to keep the same return type. I think the intent is that the final JSON be standard, rather than containing long strings that are intended to be processed further. This is used by the upstream jsonserializer, which serializes all components and then calls dumps. |
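For readers following along, the difference between embedding a JSON string and embedding the decoded structure can be shown with the standard library alone. This is an illustrative sketch: `df_json_string` is a stand-in for whatever a DataFrame-to-JSON call returns, and the outer dict is a made-up example, not the project's actual serializer output.

```python
import json

# Stand-in for the string a DataFrame-to-JSON call would return.
df_json_string = '[{"A": 1, "B": 2}, {"A": 3, "B": 4}]'

# Option 1: embed the string as-is; the outer dumps escapes it a second time.
nested_string = json.dumps({"name": "table", "value": df_json_string})

# Option 2: decode first, so the outer dumps sees plain lists and dicts.
nested_native = json.dumps({"name": "table", "value": json.loads(df_json_string)})

print(nested_string)
print(nested_native)

# After decoding the outer document, option 1 still holds a string that
# needs a second json.loads, while option 2 is already a list of records.
print(type(json.loads(nested_string)["value"]).__name__)
print(type(json.loads(nested_native)["value"]).__name__)
```

Decoding before the final dumps costs an extra parse, which is the performance concern raised earlier, but it keeps the final document free of escaped, doubly encoded strings.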
@jlstevens @philippjfr - Any changes needed? Happy to implement |
@ektar you mentioned limitations with

import pandas as pd
df = pd.util.testing.makeMixedDataFrame()
out = pd.read_json(df.to_json(orient='table'), orient='table')
df.equals(out) # True

Output of `df.info()`:
|
I've thought about this a bit more and I'm fine with the
It is a little confusing in terms of nomenclature, but I think these two things are orthogonal and both have reasons to exist. I have a quibble with the naming used by pandas but what really matters to me is that there is a convenient way to roundtrip the data. To me, this means that the |
@ektar, do you intend on completing this? |
For reference, this is the issue with trying to round-trip with Pandas and |
In 7e5098c I've slightly updated the tests to remove code that was handling timezone conversion. This was failing when I ran it locally (newer Pandas version?), removing that code got the test to pass. |
Resolves #599