
[Bug Fix] cast None current-snapshot-id as -1 for Backwards Compatibility #473

Merged: 11 commits merged into apache:main on Mar 6, 2024

Conversation

@sungwy (Collaborator) commented Feb 26, 2024

The existing PyIceberg cleanup_snapshot_id validator creates tables that are not backward compatible.

On table creation, the existing behavior in Java is to create tables with current_snapshot_id = -1. Older versions of the Java implementation also seem to require that current_snapshot_id is a non-null long value, throwing an exception if it is missing:

Retrying task after failure: Cannot parse missing long: current-snapshot-id
java.lang.IllegalArgumentException: Cannot parse missing long: current-snapshot-id
	at org.apache.iceberg.relocated.com.google.common.base.Preconditions.checkArgument(Preconditions.java:220)
	at org.apache.iceberg.util.JsonUtil.getLong(JsonUtil.java:133)
	at org.apache.iceberg.TableMetadataParser.fromJson(TableMetadataParser.java:436)
	at org.apache.iceberg.TableMetadataParser.fromJson(TableMetadataParser.java:303)
	at org.apache.iceberg.TableMetadataParser.read(TableMetadataParser.java:273)
	at org.apache.iceberg.TableMetadataParser.read(TableMetadataParser.java:266)

In order to preserve backward compatibility, we should cast a None current_snapshot_id to -1, rather than casting a -1 current_snapshot_id value to None.
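For illustration, a minimal sketch (not the exact PyIceberg code; names simplified) of the two normalization directions discussed above, operating on the raw metadata dict:

    CURRENT_SNAPSHOT_ID = "current-snapshot-id"

    def cleanup_snapshot_id(data: dict) -> dict:
        """Existing behavior: normalize Java's -1 sentinel to None when reading metadata."""
        if data.get(CURRENT_SNAPSHOT_ID) == -1:
            data[CURRENT_SNAPSHOT_ID] = None
        return data

    def legacy_snapshot_id(data: dict) -> dict:
        """Proposed inversion: emit -1 instead of None so older Java readers can parse it."""
        if data.get(CURRENT_SNAPSHOT_ID) is None:
            data[CURRENT_SNAPSHOT_ID] = -1
        return data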

@HonahX (Contributor) commented Feb 27, 2024

Thanks for the great catch @syun64! My understanding is that we need to write current_snapshot_id as -1 when serializing the new metadata object to JSON. Would it be better to directly update the serializer to support this backwards compatibility?
ToOutputFile.table_metadata

Internally, either -1 or None can represent "no current snapshot id". But I think None is better, as it aligns more with the spec, which states that current_snapshot_id is optional, and it makes things more intuitive. WDYT?

@sungwy (Collaborator, author) commented Feb 27, 2024

> Thanks for the great catch @syun64! My understanding is that we need to write current_snapshot_id as -1 when serializing the new metadata object to JSON. Would it be better to directly update the serializer to support this backwards compatibility? ToOutputFile.table_metadata
>
> Internally, either -1 or None can represent "no current snapshot id". But I think None is better, as it aligns more with the spec, which states that current_snapshot_id is optional, and it makes things more intuitive. WDYT?

Great suggestion @HonahX. I tried adding a @field_serializer to the TableMetadataV1 class to test it out, but unfortunately the serialized output from model_dump_json doesn't serialize the None value as -1 as we'd want. I'm still trying to figure out what about our pydantic class definition prevents the custom serializer from taking effect, for example:

from typing import Optional
from pydantic import field_serializer

class TableMetadataV1(TableMetadataCommonFields, IcebergBaseModel):
    ...
    @field_serializer('current_snapshot_id')
    def serialize_current_snapshot_id(self, current_snapshot_id: Optional[int]) -> int:
        # Cast None to -1 on output (this turned out not to fire; see below).
        return current_snapshot_id if current_snapshot_id is not None else -1

@Fokko (Contributor) commented Feb 27, 2024

Thanks @syun64 for raising this. I agree with @HonahX that setting it to -1 is not a good idea, mainly for two reasons:

  • There would be two ways of representing that the current-snapshot-id should be ignored: both -1 and None.
  • -1 does not point to a valid snapshot.

Although Java is the reference implementation, we should not blindly follow everything done there. There are some things we can improve on, since we don't carry the full history that Java does. Can you elaborate on why you need this change?

@sungwy (Collaborator, author) commented Feb 27, 2024

> Thanks @syun64 for raising this. I agree with @HonahX that setting it to -1 is not a good idea, mainly for two reasons:
>
> • There would be two ways of representing that the current-snapshot-id should be ignored: both -1 and None.
> • -1 does not point to a valid snapshot.
>
> Although Java is the reference implementation, we should not blindly follow everything done there. There are some things we can improve on, since we don't carry the full history that Java does. Can you elaborate on why you need this change?

Hi @Fokko, thank you for the response! Tables created by PyIceberg currently do not persist a current_snapshot_id. This means that if no further commits are made after table creation, the table cannot be read by older versions of Java-based engines (Trino, Spark, etc.) that require current_snapshot_id to be a valid long value, until a commit is made that sets one. Otherwise, they throw the above error trace when parsing the table metadata.

It looks like the PR to handle current_snapshot_id as optional was only introduced in the Java code a few months ago, meaning that all Java code running versions prior to that would throw the above error when parsing table metadata created by PyIceberg in its current state. Hence I thought it was a backward-incompatible change worth noting here.

I think, as @HonahX suggested, serializing the output current_snapshot_id in TableMetadata as -1 would be good for backwards compatibility.

@sungwy (Collaborator, author) commented Feb 27, 2024

So it looks like a custom @field_serializer doesn't work with None values in the current IcebergBaseModel definition, because we set exclude_none=True by default for model_dump_json. Unfortunately, the model exclusion logic runs before the custom field_serializer, meaning that if we store current_snapshot_id internally as None, we can't use a custom pydantic serializer to cast None to -1 only on output. There is a recent discussion on this specific issue in the Pydantic community, and people had to resort to hacky workarounds to support field serialization of None values when exclude_none is set to True.

If we want to make Iceberg tables created by PyIceberg compatible with older versions of Java, I think we'll either have to store current_snapshot_id as -1 in the TableMetadata, or refactor the way we identify fields to exclude when they are None in the IcebergBaseModel.
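To make the failure mode concrete, a minimal standalone sketch of the behavior described above (assuming pydantic v2; Example is a hypothetical model, not the real TableMetadata):

    from typing import Optional

    from pydantic import BaseModel, field_serializer

    class Example(BaseModel):
        current_snapshot_id: Optional[int] = None

        @field_serializer('current_snapshot_id')
        def serialize_current_snapshot_id(self, v: Optional[int]) -> int:
            # Intended to emit -1 in place of None on output.
            return v if v is not None else -1

    # With exclude_none=True (the IcebergBaseModel default), the None field is
    # dropped before the serializer runs, so -1 never appears in the output:
    print(Example().model_dump_json(exclude_none=True))   # {}
    # Without the exclusion, the serializer runs and emits -1:
    print(Example().model_dump_json(exclude_none=False))  # {"current_snapshot_id":-1}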

@Fokko (Contributor) commented Feb 28, 2024

Thanks for the full context. cc @rdblue @danielcweeks WDYT?

This has been fixed in Iceberg 1.4.0. We would have to break the spec if we want these tables to be readable by earlier versions of Java Iceberg. We could also update the spec to make the field non-optional; otherwise, other implementations will follow.

@danielcweeks (Contributor) commented

The spec calls out current_snapshot_id as optional, so omitting it "should" be valid. I think I agree with @Fokko that we shouldn't just blindly follow Java's implementation. I would suggest a compatibility flag that sets the value to -1, but default it to off.

The reason I think this is safe is that the situation where someone is using an older version of Java but also using Python to create tables (with no commits) is relatively narrow, and they would identify the problem quickly and turn the flag on for their use cases. I feel like this is the best approach: move forward with the correct solution, but still accommodate backward compatibility.

@sungwy (Collaborator, author) commented Feb 28, 2024

> The spec calls out current_snapshot_id as optional, so omitting it "should" be valid. I think I agree with @Fokko that we shouldn't just blindly follow Java's implementation. I would suggest a compatibility flag that sets the value to -1, but default it to off.
>
> The reason I think this is safe is that the situation where someone is using an older version of Java but also using Python to create tables (with no commits) is relatively narrow, and they would identify the problem quickly and turn the flag on for their use cases. I feel like this is the best approach: move forward with the correct solution, but still accommodate backward compatibility.

That makes sense @danielcweeks @Fokko. Thank you for sharing your opinions.

Would the compatibility flag be included as an environment variable? If we think that supporting this change is a hassle and we want to honor the correctness of the implementation according to the spec, maybe we can make a conscious decision not to accommodate backward compatibility in this edge case.

One of my goals in putting up this PR was to at least document this issue and the discussion, so others running into it can refer to it.

@Fokko (Contributor) commented Feb 28, 2024

I prefer not to exclude certain groups (Java <1.3.0 in this case; I'm not sure which versions all the proprietary implementations are on). I think a flag is an elegant way of enabling this. I fully agree with you on the documentation part of it 👍

> Would the compatibility flag be included as an environment variable?

The flag would be a configuration option on the catalog. This way you can set it through ~/.pyiceberg.yaml, or through an environment variable if you like.
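As an aside, a hedged sketch of how such a flag can be read at runtime; Config and get_bool are the PyIceberg utilities used later in this PR, and the key name shown is the one eventually chosen:

    from pyiceberg.utils.config import Config

    # Config merges ~/.pyiceberg.yaml with PYICEBERG_-prefixed environment
    # variables; get_bool returns None when the key is unset.
    if Config().get_bool("legacy-current-snapshot-id"):
        ...  # emit -1 instead of None when serializing current-snapshot-id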

@sungwy (Collaborator, author) commented Feb 29, 2024

> I prefer not to exclude certain groups (Java <1.3.0 in this case; I'm not sure which versions all the proprietary implementations are on). I think a flag is an elegant way of enabling this. I fully agree with you on the documentation part of it 👍
>
> > Would the compatibility flag be included as an environment variable?
>
> The flag would be a configuration option on the catalog. This way you can set it through ~/.pyiceberg.yaml, or through an environment variable if you like.

That makes sense @Fokko. What are your thoughts on my findings about the pydantic base model? I don't think there's an easy way to use a custom serializer only on the output, because we are using exclude_none=True on the model, so it will ignore the current_snapshot_id attribute if we store it as None internally.

So do we want to:

  1. Store current_snapshot_id as -1 internally if this flag is set to True? Or
  2. model_dump to a dictionary, add current_snapshot_id as -1 if not present, then json.dumps the dictionary on output serialization if the flag is set to True?

@sungwy (Collaborator, author) commented Feb 29, 2024

I went forward with Option 2, and I think it looks pretty clean. Let me know what you think @Fokko

mkdocs/docs/configuration.md (review thread, resolved)
Comment on lines 133 to 136
if Config().get_bool("legacy-current-snapshot-id") and metadata.current_snapshot_id is None:
model_dict = json.loads(model_dump)
model_dict[CURRENT_SNAPSHOT_ID] = -1
model_dump = json.dumps(model_dict)
Contributor:

I don't think this is the best place to fix this, mostly because we have to deserialize and re-serialize the metadata, while the rest of the serialization logic is part of the Pydantic model. I think option 1 is much cleaner. We can set exclude_none to False:

    @staticmethod
    def table_metadata(metadata: TableMetadata, output_file: OutputFile, overwrite: bool = False) -> None:
        """Write a TableMetadata instance to an output file.

        Args:
            output_file (OutputFile): A custom implementation of the iceberg.io.file.OutputFile abstract base class.
            overwrite (bool): Whether to overwrite the file if it already exists. Defaults to `False`.
        """
        with output_file.create(overwrite=overwrite) as output_stream:
            json_bytes = metadata.model_dump_json(exclude_none=False).encode(UTF8)
            json_bytes = Compressor.get_compressor(output_file.location).bytes_compressor()(json_bytes)
            output_stream.write(json_bytes)

@sungwy (Collaborator, author) replied:

Oh, that's a good suggestion. I'll add the field_serializer and set exclude_none to False so we can print -1 in the output.

pyiceberg/utils/config.py (review thread, resolved)
@HonahX (Contributor) left a review:

Overall LGTM! Just have one comment. Thanks for working on this!

@@ -121,7 +122,7 @@ def check_sort_orders(table_metadata: TableMetadata) -> TableMetadata:

 def construct_refs(table_metadata: TableMetadata) -> TableMetadata:
     """Set the main branch if missing."""
-    if table_metadata.current_snapshot_id is not None:
+    if table_metadata.current_snapshot_id is not None and table_metadata.current_snapshot_id != -1:
@HonahX (Contributor):

Is this -1 check still necessary? construct_refs is an after validator; at this point, cleanup_snapshot_id should already have turned current_snapshot_id=-1 into None.

@sungwy (Collaborator, author) replied:

You are right! I'll submit a fix, thank you!
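For reference, a minimal sketch (assuming pydantic v2; simplified names, not the actual PyIceberg model) of the validator ordering described above:

    from typing import Any, Optional

    from pydantic import BaseModel, model_validator

    class Meta(BaseModel):
        current_snapshot_id: Optional[int] = None

        @model_validator(mode="before")
        @classmethod
        def cleanup_snapshot_id(cls, data: dict[str, Any]) -> dict[str, Any]:
            # Runs first, on the raw input: normalize the -1 sentinel to None.
            if data.get("current_snapshot_id") == -1:
                data["current_snapshot_id"] = None
            return data

        @model_validator(mode="after")
        def construct_refs(self) -> "Meta":
            # Runs after field validation, so -1 can no longer be observed here.
            assert self.current_snapshot_id != -1
            return self

    print(Meta(current_snapshot_id=-1).current_snapshot_id)  # None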

pyiceberg/table/metadata.py (review thread, resolved)
@sungwy requested review from HonahX and Fokko on March 1, 2024
@@ -127,6 +127,6 @@ def table_metadata(metadata: TableMetadata, output_file: OutputFile, overwrite:
         overwrite (bool): Whether to overwrite the file if it already exists. Defaults to `False`.
     """
     with output_file.create(overwrite=overwrite) as output_stream:
-        json_bytes = metadata.model_dump_json().encode(UTF8)
+        json_bytes = metadata.model_dump_json(exclude_none=False).encode(UTF8)
Contributor:

Maybe not a big deal: shall we set exclude_none=False only in legacy mode?

@sungwy (Collaborator, author) replied:

I would prefer that. Let me put in this update.

@Fokko (Contributor) left a review:

This looks great, one minor suggestion:

Could you add the legacy-current-snapshot-id key to the write options table as well: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/configuration.md#write-options

Comment on lines +267 to +271
@field_serializer('current_snapshot_id')
def serialize_current_snapshot_id(self, current_snapshot_id: Optional[int]) -> Optional[int]:
if current_snapshot_id is None and Config().get_bool("legacy-current-snapshot-id"):
return -1
return current_snapshot_id
Contributor:

Beautiful!

@sungwy (Collaborator, author) commented Mar 5, 2024

> This looks great, one minor suggestion:
>
> Could you add the legacy-current-snapshot-id key to the write options table as well: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/configuration.md#write-options

I now have this configuration documented under "Backward Compatibility" in the documentation. I feel a bit awkward adding it to the write options table, because the current list of write options maps to pyarrow parquet properties.

Are you suggesting that we also add the property to the write options table in the documentation? Like:

Write options

| Key | Options | Default | Description |
| --- | --- | --- | --- |
| write.parquet.compression-codec | {uncompressed,zstd,gzip,snappy} | zstd | Sets the Parquet compression codec |
| write.parquet.compression-level | Integer | null | Parquet compression level for the codec; if not set, it is up to PyIceberg |
| write.parquet.page-size-bytes | Size in bytes | 1MB | Set a target threshold for the approximate encoded size of data pages within a column chunk |
| write.parquet.page-row-limit | Number of rows | 20000 | Set a target threshold for the approximate number of rows within a data page |
| write.parquet.dict-size-bytes | Size in bytes | 2MB | Set the dictionary page size limit per row group |
| write.parquet.row-group-limit | Number of rows | 122880 | The Parquet row group limit |
| legacy-current-snapshot-id | {True,False} | False | Set current-snapshot-id in serialized TableMetadata to -1 instead of None |

Are we looking to move the documentation to Write options or have it in both places?

@sungwy (Collaborator, author) commented Mar 6, 2024

Shall we merge this in, if the place in the documentation (Write options vs. Backward Compatibility) is the only subject for debate? We can always update the docs. I'd appreciate it if you could merge this in for me, @Fokko @HonahX, and thank you both for the thorough review!

@HonahX (Contributor) commented Mar 6, 2024

Hi @Fokko @syun64

> I feel a bit awkward adding it to the write options table, because the current list of write options maps to pyarrow parquet properties.

I have a similar feeling, mainly because the write options table is for table properties, while legacy-current-snapshot-id is a global config in ~/.pyiceberg.yaml:

catalog:
  default:
    type: glue
 
legacy-current-snapshot-id: True

> Shall we merge this in if the place in the documentation..

Let's get this in first. Thanks for the great effort on it. Thanks everyone for reviewing!

@HonahX merged commit 14c021b into apache:main on Mar 6, 2024 (7 checks passed)
@sungwy deleted the current-snapshot-bf branch on March 6, 2024
@Fokko added this to the PyIceberg 0.6.1 milestone on Mar 26, 2024
HonahX pushed a commit to HonahX/iceberg-python that referenced this pull request Mar 29, 2024
Fokko pushed a commit that referenced this pull request Mar 29, 2024
@djouallah commented:

Is this available in 0.6.1? BigQuery can't read the table because of this:

xxxx.metadata.json missing current-snapshot-id

@Fokko (Contributor) commented May 11, 2024

@djouallah Thanks for jumping in here. Sad to hear that BigQuery also suffers from this. This is part of 0.6.1, so setting the configuration, or writing an empty dataframe, should fix the issue.
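For anyone landing here, a hedged sketch of the "write an empty dataframe" workaround mentioned above (default.my_table is a placeholder, and the helper APIs assumed here may differ between PyIceberg versions):

    import pyarrow as pa

    from pyiceberg.catalog import load_catalog
    from pyiceberg.io.pyarrow import schema_to_pyarrow

    catalog = load_catalog("default")               # assumes a configured catalog
    table = catalog.load_table("default.my_table")  # placeholder identifier

    # Appending a zero-row table forces a commit, which persists a concrete
    # current-snapshot-id that older readers can parse.
    empty = pa.Table.from_pylist([], schema=schema_to_pyarrow(table.schema()))
    table.append(empty)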

@djouallah commented:

@Fokko sorry for the silly question, but I could not find it in the documentation?

@Fokko (Contributor) commented May 11, 2024

@djouallah It is mentioned all the way at the bottom of the configuration page: https://py.iceberg.apache.org/configuration/#backward-compatibility

@djouallah commented:

Thanks, it works!
