-
Notifications
You must be signed in to change notification settings - Fork 166
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Glue catalog commit table #140
Glue catalog commit table #140
Conversation
# Conflicts: # pyiceberg/table/__init__.py # tests/conftest.py
# Conflicts: # pyiceberg/catalog/glue.py # pyiceberg/table/__init__.py # pyiceberg/table/metadata.py # tests/conftest.py # tests/table/test_init.py
…nd_glue_commit # Conflicts: # pyiceberg/table/__init__.py
I'm quite busy at the moment, but I'm aiming to complete this and mark this ready by the end of the week. |
Thanks @nicor88 and @jackye1995 for the initial review. I think this PR is ready. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly looks good to me, just a nit regarding the doc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One small comment to make the regex a bit more robust, apart from that this looks good! Thanks for working on this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - thanks @HonahX! Just a few minor questions/nits. Looking forward to trying this out w/ Glue for Ray later this week and will let you know if anything comes up (if any issues pop up, we can address them in a follow-up PR)!
pyiceberg/catalog/__init__.py
Outdated
@@ -74,6 +75,8 @@ | |||
LOCATION = "location" | |||
EXTERNAL_TABLE = "EXTERNAL_TABLE" | |||
|
|||
TABLE_METADATA_FILE_NAME_REGEX = re.compile(r"""(\d+)-.*\.json""", re.X) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To better avoid potential false positives, should we change this to:
TABLE_METADATA_FILE_NAME_REGEX = re.compile(r"""(\d+)-.*\.json""", re.X) | |
TABLE_METADATA_FILE_NAME_REGEX = re.compile(r"""(\d+)-(.*)\.metadata\.json""", re.X) |
Where the 2nd capturing group will also allow us to assert UUID validity later. I'm assuming we don't want to match non-UUID strings or arbitrary JSON files, but let me know if you think this is too restrictive.
Also, since we've specified the re.X
flag anyway, we can make use of it by adding some comments here like:
TABLE_METADATA_FILE_NAME_REGEX = re.compile(r"""(\d+)-.*\.json""", re.X) | |
TABLE_METADATA_FILE_NAME_REGEX = re.compile( | |
r""" | |
(\d+) # version number | |
- # separator | |
(.*) # UUID | |
\.metadata\.json # file extension | |
""", | |
re.X | |
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion. I've updated the regex to capture the UUID. Although in java we do not check the UUID validity: https://github.com/apache/iceberg/blob/81bf8d30766b1b129b87abde15239645cb127046/core/src/main/java/org/apache/iceberg/BaseMetastoreTableOperations.java#L382-L404
I think it make sense to add the check here
pyiceberg/catalog/__init__.py
Outdated
if new_version < 0: | ||
raise ValueError(f"Table metadata version: {new_version} cannot be negative") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also return a more friendly error in the off chance that new_version is None
?
if new_version < 0: | |
raise ValueError(f"Table metadata version: {new_version} cannot be negative") | |
if new_version is None or new_version < 0: | |
raise ValueError(f"Table metadata version: `{new_version}` must be a non-negative integer") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion! Since this is an internal method, we can rely on the type check to ensure new_version is never None. I think one benefit of doing type-checking is that it reduces the number of None
checks in our code. Also, I've incorporated your suggested change to the error message.
Please let me know if you think the type-check is not enough and we'd better do the None-check here. Thanks!
Returns: | ||
int: The version of the metadata file. -1 if the file name does not have valid version string | ||
""" | ||
file_name = metadata_location.split("/")[-1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also gracefully return -1
if metadata_location
happens to be undefined?
file_name = metadata_location.split("/")[-1] | |
file_name = metadata_location.split("/")[-1] if metadata_location else "" | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar reason as above, since this is an internal method, I think we can rely on the type-check to ensure metadata_location
is not None
. But please let me know if you think type-check is not enough.
pyiceberg/catalog/__init__.py
Outdated
file_name = metadata_location.split("/")[-1] | ||
if file_name_match := TABLE_METADATA_FILE_NAME_REGEX.fullmatch(file_name): | ||
return int(file_name_match.group(1)) | ||
else: | ||
return -1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we take the above change to add a capturing group around the UUID, we can assert its validity here with something like:
file_name = metadata_location.split("/")[-1] | |
if file_name_match := TABLE_METADATA_FILE_NAME_REGEX.fullmatch(file_name): | |
return int(file_name_match.group(1)) | |
else: | |
return -1 | |
file_name = metadata_location.split("/")[-1] if metadata_location else "" | |
if file_name_match := TABLE_METADATA_FILE_NAME_REGEX.fullmatch(file_name): | |
try: | |
UUID(file_name_match.group(2)) | |
except ValueError: | |
return -1 | |
return int(file_name_match.group(1)) | |
return -1 |
(Again, assuming this type of stricter assertion against table metadata file names is actually desirable - not sure if @Fokko or @jackye1995 have any opinions or additional context here).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @pdames for reviewing! Apologies for the delay; I was on vacation last week.
pyiceberg/catalog/__init__.py
Outdated
if new_version < 0: | ||
raise ValueError(f"Table metadata version: {new_version} cannot be negative") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion! Since this is an internal method, we can rely on the type check to ensure new_version is never None. I think one benefit of doing type-checking is that it reduces the number of None
checks in our code. Also, I've incorporated your suggested change to the error message.
Please let me know if you think the type-check is not enough and we'd better do the None-check here. Thanks!
Returns: | ||
int: The version of the metadata file. -1 if the file name does not have valid version string | ||
""" | ||
file_name = metadata_location.split("/")[-1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar reason as above, since this is an internal method, I think we can rely on the type-check to ensure metadata_location
is not None
. But please let me know if you think type-check is not enough.
pyiceberg/catalog/__init__.py
Outdated
@@ -74,6 +75,8 @@ | |||
LOCATION = "location" | |||
EXTERNAL_TABLE = "EXTERNAL_TABLE" | |||
|
|||
TABLE_METADATA_FILE_NAME_REGEX = re.compile(r"""(\d+)-.*\.json""", re.X) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion. I've updated the regex to capture the UUID. Although in java we do not check the UUID validity: https://github.com/apache/iceberg/blob/81bf8d30766b1b129b87abde15239645cb127046/core/src/main/java/org/apache/iceberg/BaseMetastoreTableOperations.java#L382-L404
I think it make sense to add the check here
@@ -1586,7 +1586,7 @@ def fixture_aws_credentials() -> Generator[None, None, None]: | |||
os.environ.pop("AWS_DEFAULT_REGION") | |||
|
|||
|
|||
MOTO_SERVER = ThreadedMotoServer(port=5000) | |||
MOTO_SERVER = ThreadedMotoServer(ip_address="localhost", port=5000) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default ip address of the moto server is 0.0.0.0
. Since we only use it for unit test, I think it make more sense to start it at localhost
Thanks @HonahX for working on this, and thanks @pdames, @nicor88, and @jackye1995 for the review 🙌 |
* Implement table metadata updater first draft * fix updater error and add tests * implement _commit_table for glue * implement apply_metadata_update which is simpler * remove old implementation * re-organize method place * fix nit * fix test * add another test * clear TODO * add a combined test * Fix merge conflict * update table metadata merged * implement requirements validation * change the exception to CommitFailedException * add docstring * use regex to parse the metadata version * fix lint issue * fix CI issue * make base_metadata optional and add null check * make base_metadata optional and add null check * add integration test * default skip-archive to true and comments * refactor tests * add doc and fix test after merge * make regex more robust, thanks Fokko! * Fix review comments, thanks Patrick!
Fixes #29 #22
Implements
_commit_table
for GlueCatalog. This PR depends on:Thanks in advance for your help!