-
Notifications
You must be signed in to change notification settings - Fork 166
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Snapshot logic and Summary generation #61
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great! @Fokko. It is much simpler compared to the one in Java. I have some questions related to the design and concepts. Thanks in advance for your help!
removed_pos_deletes: int | ||
added_eq_deletes: int | ||
removed_eq_deletes: int | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Java Implementation I saw a flag named trustSizeAndDeleteCounts
, which is set to false when we add a DELETES
manifest. Based on my understanding, the purpose of this flag is to let us skip reporting size and delete counts related metrics when we add one or more DELETES
manifest since we do not know the exact number of rows deleted in the manifest.
Do we want to add the flag in this PR or in the future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was hoping to get some insights from @rdblue on this one. When we Operation.APPEND
a table we can add existing DELETE
manifests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The flag is needed because we don't have the correct counts for the eq and pos deletes in delete manifests. I don't think that we need to add whole manifests in Python so I'd skip it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two things that I like to avoid; complexity and trust issues!
pyiceberg/table/snapshots.py
Outdated
ADDED_RECORDS = 'added-records' | ||
DELETED_DATA_FILES = 'deleted-data-files' | ||
DELETED_RECORDS = 'deleted-records' | ||
EQUALITY_DELETE_FILES = 'added-equality-delete-files' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this have the ADDED_
prefix like the others?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes!
ADDED_POSITION_DELETE_FILES = f'{ADDED_POSITION_DELETES}-files' | ||
ADDED_RECORDS = 'added-records' | ||
DELETED_DATA_FILES = 'deleted-data-files' | ||
DELETED_RECORDS = 'deleted-records' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DELETED_
properties look correct.
ADDED_DELETE_FILES = 'added-delete-files' | ||
ADDED_EQUALITY_DELETES = 'added-equality-deletes' | ||
ADDED_FILE_SIZE = 'added-files-size' | ||
ADDED_POSITION_DELETES = 'added-position-deletes' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These first 5 look correct.
pyiceberg/table/snapshots.py
Outdated
@@ -65,6 +90,25 @@ def __init__(self, operation: Operation, **data: Any) -> None: | |||
super().__init__(operation=operation, **data) | |||
self._additional_properties = data | |||
|
|||
def __getitem__(self, __key: str) -> Optional[Any]: # type: ignore | |||
"""Return a key as it is a map.""" | |||
if __key == 'operation': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be OPERATION
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant that we have a constant defined and don't need to embed the string. Not a big deal.
pyiceberg/table/snapshots.py
Outdated
properties: Dict[str, str] = {} | ||
set_when_positive(properties, self.added_size, ADDED_FILE_SIZE) | ||
set_when_positive(properties, self.removed_size, REMOVED_FILE_SIZE) | ||
set_when_positive(properties, self.added_files, ADDED_DATA_FILES) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: it would be better to use self.added_data_files
and self.removed_data_files
since those are the properties that we're tracking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's not a minor, thanks!
pyiceberg/table/snapshots.py
Outdated
return summary | ||
|
||
|
||
def _merge_snapshot_summaries( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this really a merge? To me, a merge is like adding two summaries together. This is actually updating from a previous snapshot summary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed it to _update_snapshot_summaries
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks very close, but a couple property names are incorrect. Thanks @Fokko!
ADDED_RECORDS = 'added-records' | ||
DELETED_DATA_FILES = 'deleted-data-files' | ||
DELETED_RECORDS = 'deleted-records' | ||
ADDED_EQUALITY_DELETE_FILES = 'added-equality-delete-files' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Odd that this is here rather than with the other ADDED_
properties.
Thanks, @Fokko! Looks great. |
This is a slimmed down version of Java, but I'm not sure if we need everything on the Python side.
I left out a lot of the stuff on the Java side because: