Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Glue catalog commit table #140

Merged
merged 39 commits into from
Jan 1, 2024

Conversation

HonahX
Copy link
Contributor

@HonahX HonahX commented Nov 12, 2023

Fixes #29 #22

Implements _commit_table for GlueCatalog. This PR depends on:

Thanks in advance for your help!

@HonahX HonahX mentioned this pull request Nov 12, 2023
@HonahX HonahX changed the title [WIP] Table metadata update, table requirement validation, and glue catalog commit WIP: Table metadata update, table requirement validation, and glue catalog commit Nov 12, 2023
@HonahX HonahX changed the title WIP: Table metadata update, table requirement validation, and glue catalog commit WIP: Table metadata update, table requirement validation, and glue catalog commit Nov 12, 2023
# Conflicts:
#	pyiceberg/catalog/glue.py
#	pyiceberg/table/__init__.py
#	pyiceberg/table/metadata.py
#	tests/conftest.py
#	tests/table/test_init.py
@Fokko Fokko added this to the PyIceberg 0.6.0 release milestone Dec 13, 2023
@HonahX
Copy link
Contributor Author

HonahX commented Dec 14, 2023

I'm quite busy at the moment, but I'm aiming to complete this and mark this ready by the end of the week.

@HonahX HonahX changed the title WIP: Glue catalog commit Glue catalog commit table Dec 16, 2023
@HonahX
Copy link
Contributor Author

HonahX commented Dec 16, 2023

Thanks @nicor88 and @jackye1995 for the initial review. I think this PR is ready.

cc: @Fokko @pdames @svdimchenko

@HonahX HonahX marked this pull request as ready for review December 16, 2023 02:05
Copy link
Contributor

@jackye1995 jackye1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good to me, just a nit regarding the doc

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small comment to make the regex a bit more robust, apart from that this looks good! Thanks for working on this!

Copy link
Contributor

@pdames pdames left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - thanks @HonahX! Just a few minor questions/nits. Looking forward to trying this out w/ Glue for Ray later this week and will let you know if anything comes up (if any issues pop up, we can address them in a follow-up PR)!

@@ -74,6 +75,8 @@
LOCATION = "location"
EXTERNAL_TABLE = "EXTERNAL_TABLE"

TABLE_METADATA_FILE_NAME_REGEX = re.compile(r"""(\d+)-.*\.json""", re.X)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To better avoid potential false positives, should we change this to:

Suggested change
TABLE_METADATA_FILE_NAME_REGEX = re.compile(r"""(\d+)-.*\.json""", re.X)
TABLE_METADATA_FILE_NAME_REGEX = re.compile(r"""(\d+)-(.*)\.metadata\.json""", re.X)

Where the 2nd capturing group will also allow us to assert UUID validity later. I'm assuming we don't want to match non-UUID strings or arbitrary JSON files, but let me know if you think this is too restrictive.

Also, since we've specified the re.X flag anyway, we can make use of it by adding some comments here like:

Suggested change
TABLE_METADATA_FILE_NAME_REGEX = re.compile(r"""(\d+)-.*\.json""", re.X)
TABLE_METADATA_FILE_NAME_REGEX = re.compile(
r"""
(\d+) # version number
- # separator
(.*) # UUID
\.metadata\.json # file extension
""",
re.X
)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion. I've updated the regex to capture the UUID. Although in java we do not check the UUID validity: https://github.com/apache/iceberg/blob/81bf8d30766b1b129b87abde15239645cb127046/core/src/main/java/org/apache/iceberg/BaseMetastoreTableOperations.java#L382-L404
I think it make sense to add the check here

Comment on lines 594 to 595
if new_version < 0:
raise ValueError(f"Table metadata version: {new_version} cannot be negative")
Copy link
Contributor

@pdames pdames Dec 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also return a more friendly error in the off chance that new_version is None?

Suggested change
if new_version < 0:
raise ValueError(f"Table metadata version: {new_version} cannot be negative")
if new_version is None or new_version < 0:
raise ValueError(f"Table metadata version: `{new_version}` must be a non-negative integer")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion! Since this is an internal method, we can rely on the type check to ensure new_version is never None. I think one benefit of doing type-checking is that it reduces the number of None checks in our code. Also, I've incorporated your suggested change to the error message.

Please let me know if you think the type-check is not enough and we'd better do the None-check here. Thanks!

Returns:
int: The version of the metadata file. -1 if the file name does not have valid version string
"""
file_name = metadata_location.split("/")[-1]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also gracefully return -1 if metadata_location happens to be undefined?

Suggested change
file_name = metadata_location.split("/")[-1]
file_name = metadata_location.split("/")[-1] if metadata_location else ""

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar reason as above, since this is an internal method, I think we can rely on the type-check to ensure metadata_location is not None. But please let me know if you think type-check is not enough.

pyiceberg/catalog/glue.py Outdated Show resolved Hide resolved
pyiceberg/catalog/glue.py Outdated Show resolved Hide resolved
pyiceberg/catalog/glue.py Outdated Show resolved Hide resolved
Comment on lines 616 to 620
file_name = metadata_location.split("/")[-1]
if file_name_match := TABLE_METADATA_FILE_NAME_REGEX.fullmatch(file_name):
return int(file_name_match.group(1))
else:
return -1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we take the above change to add a capturing group around the UUID, we can assert its validity here with something like:

Suggested change
file_name = metadata_location.split("/")[-1]
if file_name_match := TABLE_METADATA_FILE_NAME_REGEX.fullmatch(file_name):
return int(file_name_match.group(1))
else:
return -1
file_name = metadata_location.split("/")[-1] if metadata_location else ""
if file_name_match := TABLE_METADATA_FILE_NAME_REGEX.fullmatch(file_name):
try:
UUID(file_name_match.group(2))
except ValueError:
return -1
return int(file_name_match.group(1))
return -1

(Again, assuming this type of stricter assertion against table metadata file names is actually desirable - not sure if @Fokko or @jackye1995 have any opinions or additional context here).

Copy link
Contributor Author

@HonahX HonahX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @pdames for reviewing! Apologies for the delay; I was on vacation last week.

Comment on lines 594 to 595
if new_version < 0:
raise ValueError(f"Table metadata version: {new_version} cannot be negative")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion! Since this is an internal method, we can rely on the type check to ensure new_version is never None. I think one benefit of doing type-checking is that it reduces the number of None checks in our code. Also, I've incorporated your suggested change to the error message.

Please let me know if you think the type-check is not enough and we'd better do the None-check here. Thanks!

Returns:
int: The version of the metadata file. -1 if the file name does not have valid version string
"""
file_name = metadata_location.split("/")[-1]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar reason as above, since this is an internal method, I think we can rely on the type-check to ensure metadata_location is not None. But please let me know if you think type-check is not enough.

pyiceberg/catalog/glue.py Outdated Show resolved Hide resolved
pyiceberg/catalog/glue.py Outdated Show resolved Hide resolved
@@ -74,6 +75,8 @@
LOCATION = "location"
EXTERNAL_TABLE = "EXTERNAL_TABLE"

TABLE_METADATA_FILE_NAME_REGEX = re.compile(r"""(\d+)-.*\.json""", re.X)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion. I've updated the regex to capture the UUID. Although in java we do not check the UUID validity: https://github.com/apache/iceberg/blob/81bf8d30766b1b129b87abde15239645cb127046/core/src/main/java/org/apache/iceberg/BaseMetastoreTableOperations.java#L382-L404
I think it make sense to add the check here

@@ -1586,7 +1586,7 @@ def fixture_aws_credentials() -> Generator[None, None, None]:
os.environ.pop("AWS_DEFAULT_REGION")


MOTO_SERVER = ThreadedMotoServer(port=5000)
MOTO_SERVER = ThreadedMotoServer(ip_address="localhost", port=5000)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default ip address of the moto server is 0.0.0.0. Since we only use it for unit test, I think it make more sense to start it at localhost

@Fokko Fokko merged commit 2272e20 into apache:main Jan 1, 2024
6 checks passed
@Fokko
Copy link
Contributor

Fokko commented Jan 1, 2024

Thanks @HonahX for working on this, and thanks @pdames, @nicor88, and @jackye1995 for the review 🙌

sungwy pushed a commit to sungwy/iceberg-python that referenced this pull request Jan 13, 2024
* Implement table metadata updater first draft

* fix updater error and add tests

* implement _commit_table for glue

* implement apply_metadata_update which is simpler

* remove old implementation

* re-organize method place

* fix nit

* fix test

* add another test

* clear TODO

* add a combined test

* Fix merge conflict

* update table metadata merged

* implement requirements validation

* change the exception to CommitFailedException

* add docstring

* use regex to parse the metadata version

* fix lint issue

* fix CI issue

* make base_metadata optional and add null check

* make base_metadata optional and add null check

* add integration test

* default skip-archive to true and comments

* refactor tests

* add doc and fix test after merge

* make regex more robust, thanks Fokko!

* Fix review comments, thanks Patrick!
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement Glue Table commit
5 participants