Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Snapshot logic and Summary generation #61

Merged
merged 10 commits into from
Dec 7, 2023
Merged

Conversation

Fokko
Copy link
Contributor

@Fokko Fokko commented Oct 12, 2023

This is a slimmed down version of Java, but I'm not sure if we need everything on the Python side.

I left out a lot of the stuff on the Java side because:

  • The Java side is rather complex, and I'm not sure if we want to port it one-on-one.
  • I'm not sure if we're going to write delete files anytime soon, because I would expect to people pull in a dataframe, mangle it, and then write it back (overwrite operation).

@Fokko Fokko added this to the PyIceberg 0.6.0 release milestone Oct 13, 2023
Copy link
Contributor

@HonahX HonahX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! @Fokko. It is much simpler compared to the one in Java. I have some questions related to the design and concepts. Thanks in advance for your help!

pyiceberg/table/snapshots.py Outdated Show resolved Hide resolved
pyiceberg/table/snapshots.py Show resolved Hide resolved
removed_pos_deletes: int
added_eq_deletes: int
removed_eq_deletes: int

Copy link
Contributor

@HonahX HonahX Oct 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Java Implementation I saw a flag named trustSizeAndDeleteCounts, which is set to false when we add a DELETES manifest. Based on my understanding, the purpose of this flag is to let us skip reporting size and delete counts related metrics when we add one or more DELETES manifest since we do not know the exact number of rows deleted in the manifest.

Do we want to add the flag in this PR or in the future?

ref: apache/iceberg#1367 (comment)

Copy link
Contributor Author

@Fokko Fokko Oct 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was hoping to get some insights from @rdblue on this one. When we Operation.APPEND a table we can add existing DELETE manifests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The flag is needed because we don't have the correct counts for the eq and pos deletes in delete manifests. I don't think that we need to add whole manifests in Python so I'd skip it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things that I like to avoid; complexity and trust issues!

pyiceberg/table/snapshots.py Outdated Show resolved Hide resolved
pyiceberg/table/snapshots.py Outdated Show resolved Hide resolved
pyiceberg/table/snapshots.py Outdated Show resolved Hide resolved
pyiceberg/table/snapshots.py Outdated Show resolved Hide resolved
pyiceberg/table/snapshots.py Outdated Show resolved Hide resolved
pyiceberg/table/snapshots.py Outdated Show resolved Hide resolved
pyiceberg/table/snapshots.py Outdated Show resolved Hide resolved
pyiceberg/table/snapshots.py Outdated Show resolved Hide resolved
pyiceberg/table/snapshots.py Outdated Show resolved Hide resolved
pyiceberg/table/snapshots.py Outdated Show resolved Hide resolved
pyiceberg/table/snapshots.py Outdated Show resolved Hide resolved
pyiceberg/table/snapshots.py Outdated Show resolved Hide resolved
ADDED_RECORDS = 'added-records'
DELETED_DATA_FILES = 'deleted-data-files'
DELETED_RECORDS = 'deleted-records'
EQUALITY_DELETE_FILES = 'added-equality-delete-files'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this have the ADDED_ prefix like the others?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes!

ADDED_POSITION_DELETE_FILES = f'{ADDED_POSITION_DELETES}-files'
ADDED_RECORDS = 'added-records'
DELETED_DATA_FILES = 'deleted-data-files'
DELETED_RECORDS = 'deleted-records'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DELETED_ properties look correct.

ADDED_DELETE_FILES = 'added-delete-files'
ADDED_EQUALITY_DELETES = 'added-equality-deletes'
ADDED_FILE_SIZE = 'added-files-size'
ADDED_POSITION_DELETES = 'added-position-deletes'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These first 5 look correct.

@@ -65,6 +90,25 @@ def __init__(self, operation: Operation, **data: Any) -> None:
super().__init__(operation=operation, **data)
self._additional_properties = data

def __getitem__(self, __key: str) -> Optional[Any]: # type: ignore
"""Return a key as it is a map."""
if __key == 'operation':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be OPERATION?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to be lower-case here:
image

I can make it case-insensitive

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant that we have a constant defined and don't need to embed the string. Not a big deal.

properties: Dict[str, str] = {}
set_when_positive(properties, self.added_size, ADDED_FILE_SIZE)
set_when_positive(properties, self.removed_size, REMOVED_FILE_SIZE)
set_when_positive(properties, self.added_files, ADDED_DATA_FILES)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: it would be better to use self.added_data_files and self.removed_data_files since those are the properties that we're tracking.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not a minor, thanks!

return summary


def _merge_snapshot_summaries(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really a merge? To me, a merge is like adding two summaries together. This is actually updating from a previous snapshot summary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it to _update_snapshot_summaries

Copy link
Contributor

@rdblue rdblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very close, but a couple property names are incorrect. Thanks @Fokko!

ADDED_RECORDS = 'added-records'
DELETED_DATA_FILES = 'deleted-data-files'
DELETED_RECORDS = 'deleted-records'
ADDED_EQUALITY_DELETE_FILES = 'added-equality-delete-files'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Odd that this is here rather than with the other ADDED_ properties.

@rdblue rdblue merged commit 0cbb71c into apache:main Dec 7, 2023
6 checks passed
@rdblue
Copy link
Contributor

rdblue commented Dec 7, 2023

Thanks, @Fokko! Looks great.

@Fokko Fokko deleted the fd-snapshots branch July 31, 2024 06:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants