[Bug Fix] cast None current-snapshot-id as -1 for Backwards Compatibility #473
Conversation
Thanks for the great catch @syun64! My understanding is that we need to write -1 in the metadata file for older readers while internally we keep current_snapshot_id as None, e.g. via a custom serializer.
Great suggestion @HonahX. I tried adding a @field_serializer to the TableMetadataV1 class to test it out, but unfortunately the serialized output from model_dump_json doesn't seem to include fields whose value is None.
Thanks @syun64 for raising this. I agree with @HonahX that setting it to -1 in the serialized output makes sense.
However, while Java is the reference implementation, we should not blindly follow everything that was done there. There are some things we can improve on, since we don't carry the full history that Java does. Can you elaborate on why you need this change?
Hi @Fokko, thank you for the response! Tables created by PyIceberg currently do not persist a current_snapshot_id. This means that if no further commits are made after table creation, the table cannot be read by older versions of the Java-based engines (Trino, Spark, etc.) that require current_snapshot_id to be a valid long value, until a commit is made that writes a current_snapshot_id; otherwise they throw the error trace above when parsing the table metadata. The PR that handles current_snapshot_id as optional was only introduced in the Java code a few months ago, meaning that all Java code running versions prior to that will throw the above error when parsing table metadata created by PyIceberg in its current state. Hence I thought this was a backward-incompatible change worth noting here. As @HonahX suggested, serializing current_snapshot_id in TableMetadata as -1 would be good for backwards compatibility.
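To make the incompatibility concrete, here is a heavily trimmed illustration (not real PyIceberg output; only the fields relevant to this discussion are shown) of the two metadata shapes: Java writes current-snapshot-id = -1 when there is no current snapshot, while PyIceberg omits the key, which older Java readers reject.

```python
# Illustrative only: trimmed table-metadata JSON shapes, not actual PyIceberg output.
import json

java_style = {"format-version": 1, "current-snapshot-id": -1, "snapshots": []}  # readable by old readers
pyiceberg_style = {"format-version": 1, "snapshots": []}                        # key omitted; old readers throw

print(json.dumps(java_style))
print(json.dumps(pyiceberg_style))
```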
So it looks like using a custom @field_serializer isn't working with the current IcebergBaseModel definition for None values, because we are setting exclude_none=True when dumping the model. If we want to make Iceberg tables created by PyIceberg compatible with older versions of Java, I think we'll just have to write current_snapshot_id out as -1.
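A minimal sketch (assuming pydantic v2; this is not PyIceberg code) of the behaviour described above: when a field's value is None and the model is dumped with exclude_none=True, the field is dropped before the @field_serializer can emit -1, whereas exclude_none=False lets the serializer take effect.

```python
from typing import Optional

from pydantic import BaseModel, field_serializer


class SerializerSketch(BaseModel):
    current_snapshot_id: Optional[int] = None

    @field_serializer("current_snapshot_id")
    def serialize_current_snapshot_id(self, value: Optional[int]) -> Optional[int]:
        # Mirror the -1 convention discussed in this PR.
        return -1 if value is None else value


m = SerializerSketch()
print(m.model_dump_json(exclude_none=True))   # {}  -- the None field never reaches the output
print(m.model_dump_json(exclude_none=False))  # {"current_snapshot_id":-1}
```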
Thanks for the full context. cc @rdblue @danielcweeks WDYT? This has been fixed in Iceberg 1.4.0. We would have to deviate from the spec if we want tables written by PyIceberg to be readable by earlier versions of Java Iceberg. We could also update the spec to make the field non-optional; otherwise, other implementations will follow.
The spec calls out current-snapshot-id as optional, so I feel we should keep the spec-compliant behavior by default and put the -1 value behind a compatibility flag. The reason I think this is safe is that the situation where someone is using an older version of Java but also using Python to create tables (with no commits) is relatively narrow, and they would identify the problem quickly and turn the flag on for their use case. I feel like this is the best approach: move forward with the correct solution, but still accommodate backward compatibility.
That makes sense @danielcweeks @Fokko, thank you for sharing your opinions. Would the compatibility flag be an environment variable? If we think that supporting this change is a hassle and we want to honor the correctness of the implementation according to the spec, maybe we can make a conscious decision not to accommodate backward compatibility in this edge case. One of my goals with putting up this PR was to at least document this issue and the discussion so that others running into it can refer to it.
I prefer not to exclude certain groups (Java <1.3.0 in this case; I'm not sure which versions all the proprietary implementations are on). I think a flag is an elegant way of enabling this. I fully agree with you on the documentation part of it 👍
The flag would be a configuration option on the catalog. This way you can set it through the ~/.pyiceberg.yaml configuration file or through environment variables.
That makes sense @Fokko. What are your thoughts on my findings about the pydantic base model? I don't think there's an easy way to use a custom serializer only on the output, because we are using exclude_none=True on the model, so it will ignore the current_snapshot_id attribute if we store it as None internally. So do we want to:
1. dump the model with exclude_none=False and add a field serializer that returns -1, or
2. post-process the serialized JSON in serializers.py to inject -1?
I went forward with Option 2 and I think it looks pretty clean. Let me know what you think @Fokko.
pyiceberg/serializers.py (outdated diff):
if Config().get_bool("legacy-current-snapshot-id") and metadata.current_snapshot_id is None:
    model_dict = json.loads(model_dump)
    model_dict[CURRENT_SNAPSHOT_ID] = -1
    model_dump = json.dumps(model_dict)
I don't think this is the best place to fix this, mostly because we have to deserialize and re-serialize the metadata here, while the rest of the serialization logic is part of the Pydantic model. I think option 1 is much cleaner. We can set exclude_none to False:
@staticmethod
def table_metadata(metadata: TableMetadata, output_file: OutputFile, overwrite: bool = False) -> None:
    """Write a TableMetadata instance to an output file.

    Args:
        output_file (OutputFile): A custom implementation of the iceberg.io.file.OutputFile abstract base class.
        overwrite (bool): Whether to overwrite the file if it already exists. Defaults to `False`.
    """
    with output_file.create(overwrite=overwrite) as output_stream:
        json_bytes = metadata.model_dump_json(exclude_none=False).encode(UTF8)
        json_bytes = Compressor.get_compressor(output_file.location).bytes_compressor()(json_bytes)
        output_stream.write(json_bytes)
Oh that's a good suggestion. I'll add the field_serializer and set exclude_none to False so we can print out -1
in the output.
Overall LGTM! Just have one comment. Thanks for working on this!
pyiceberg/table/metadata.py (outdated diff):
@@ -121,7 +122,7 @@ def check_sort_orders(table_metadata: TableMetadata) -> TableMetadata:

 def construct_refs(table_metadata: TableMetadata) -> TableMetadata:
     """Set the main branch if missing."""
-    if table_metadata.current_snapshot_id is not None:
+    if table_metadata.current_snapshot_id is not None and table_metadata.current_snapshot_id != -1:
Is this -1 check still necessary? construct_refs is an after-validator; at this point, cleanup_snapshot_id should already have turned current_snapshot_id=-1 into None.
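A minimal sketch (assuming pydantic v2; the validator names mirror the PyIceberg functions being discussed, but this is not the real code) of why the extra -1 check is redundant: the before-validator normalizes -1 to None before any after-validator runs.

```python
from typing import Optional

from pydantic import BaseModel, field_validator, model_validator


class ValidatorSketch(BaseModel):
    current_snapshot_id: Optional[int] = None

    @field_validator("current_snapshot_id", mode="before")
    @classmethod
    def cleanup_snapshot_id(cls, value: Optional[int]) -> Optional[int]:
        # Treat the legacy -1 sentinel as "no current snapshot".
        return None if value == -1 else value

    @model_validator(mode="after")
    def construct_refs(self) -> "ValidatorSketch":
        # By the time an after-validator runs, -1 has already been normalized to None.
        assert self.current_snapshot_id != -1
        return self


print(ValidatorSketch(current_snapshot_id=-1).current_snapshot_id)  # None
```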
You are right! I'll submit a fix, thank you!
pyiceberg/serializers.py (outdated diff):
@@ -127,6 +127,6 @@ def table_metadata(metadata: TableMetadata, output_file: OutputFile, overwrite: bool = False) -> None:
         overwrite (bool): Whether to overwrite the file if it already exists. Defaults to `False`.
         """
         with output_file.create(overwrite=overwrite) as output_stream:
-            json_bytes = metadata.model_dump_json().encode(UTF8)
+            json_bytes = metadata.model_dump_json(exclude_none=False).encode(UTF8)
Maybe not a big deal: shall we set exclude_none=False only in the legacy mode?
I would prefer that. Let me put in this update.
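One way the conditional could look (a sketch only; the wrapping function and the Config import path are assumptions, not the exact committed change): keep the spec-compliant, compact output by default and only include None fields, so the field serializer can emit -1, when the legacy flag is set.

```python
from pyiceberg.table.metadata import TableMetadata
from pyiceberg.utils.config import Config  # assumed import path for the Config used above


def metadata_json_bytes(metadata: TableMetadata) -> bytes:
    # Widen the output only in legacy mode so that the current_snapshot_id
    # serializer can write -1 for a missing current snapshot.
    legacy = Config().get_bool("legacy-current-snapshot-id")
    return metadata.model_dump_json(exclude_none=not legacy).encode("utf-8")
```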
This looks great, one minor suggestion: could you add the legacy-current-snapshot-id key to the write options table as well: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/configuration.md#write-options
@field_serializer('current_snapshot_id')
def serialize_current_snapshot_id(self, current_snapshot_id: Optional[int]) -> Optional[int]:
    if current_snapshot_id is None and Config().get_bool("legacy-current-snapshot-id"):
        return -1
    return current_snapshot_id
Beautiful!
I have this configuration documented under "Backward Compatibility" in the documentation now. I feel a bit awkward adding it to the write options because the current list of write options maps to PyArrow Parquet properties. Are you suggesting that we also add the property to the write-options table in the documentation? Are we looking to move the documentation to the write options, or have it in both places?
Hi @Fokko @syun64, I have a similar feeling, mainly because the write-options table is for table properties, while this is a catalog-level configuration, e.g.:

catalog:
  default:
    type: glue
    legacy-current-snapshot-id: True

Let's get this in first. Thanks for the great effort on it. Thanks everyone for reviewing!
Backported to 0.6.1 via #557 ([Bug Fix] cast None current-snapshot-id as -1 for Backwards Compatibility, apache#473). Co-authored-by: Sung Yun <[email protected]>
Is this available in 0.6.1? BigQuery can't read the table because of this.
@djouallah Thanks for jumping in here. Sad to hear that BigQuery also suffers from this. This is part of 0.6.1, so setting the configuration, or writing an empty dataframe, should fix the issue.
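A sketch of the "write an empty dataframe" workaround mentioned above (the catalog name and table identifier are made up for illustration, and schema_to_pyarrow is assumed to be the appropriate conversion helper): any append, even an empty one, produces a commit that records a real current-snapshot-id.

```python
from pyiceberg.catalog import load_catalog
from pyiceberg.io.pyarrow import schema_to_pyarrow  # assumed helper for Iceberg -> Arrow schema

catalog = load_catalog("default")              # hypothetical catalog name
table = catalog.load_table("examples.events")  # hypothetical table identifier

# Append a zero-row Arrow table with the table's own schema; the resulting commit
# writes a current-snapshot-id that older readers can parse.
empty = schema_to_pyarrow(table.schema()).empty_table()
table.append(empty)
```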
@Fokko sorry for the silly question, but I could not find it in the documentation?
@djouallah It is mentioned all the way at the bottom of the configuration page: https://py.iceberg.apache.org/configuration/#backward-compatibility
Thanks, it works!
The existing PyIceberg cleanup_snapshot_id validator creates tables that are not backward compatible. On table creation, the existing behavior in Java is to create tables with current_snapshot_id = -1. Older versions of the Java implementations also seem to require that current_snapshot_id is a non-null long value, and throw an exception if it is None.
In order to preserve backward compatibility, we should cast a None current_snapshot_id to -1, rather than casting a -1 current_snapshot_id value to None.
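Taken together, a compact sketch of the end state described here (assuming pydantic v2; not the actual PyIceberg classes, and the legacy-flag check is omitted for brevity): None is written out as -1 for older readers, and -1 read back in is normalized to None internally.

```python
from typing import Optional

from pydantic import BaseModel, field_serializer, field_validator


class RoundTripSketch(BaseModel):
    current_snapshot_id: Optional[int] = None

    @field_validator("current_snapshot_id", mode="before")
    @classmethod
    def cleanup_snapshot_id(cls, value: Optional[int]) -> Optional[int]:
        return None if value == -1 else value  # -1 on disk means "no current snapshot"

    @field_serializer("current_snapshot_id")
    def serialize_current_snapshot_id(self, value: Optional[int]) -> Optional[int]:
        return -1 if value is None else value  # write -1 so older readers can parse the metadata


written = RoundTripSketch().model_dump_json(exclude_none=False)  # '{"current_snapshot_id":-1}'
assert RoundTripSketch.model_validate_json(written).current_snapshot_id is None
```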