GlueCatalog: Set Glue table input information based on Iceberg table metadata #216
It seems that this is also required to enable AWS Athena to query a table created with pyiceberg.

**Example**

Running a test to generate a table on Glue:

```python
from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import (
    StringType,
    NestedField,
)

# Assumes a Glue catalog is configured (added here for completeness;
# not part of the original snippet).
catalog = load_catalog("glue")

schema = Schema(
    NestedField(field_id=1, name="x", field_type=StringType(), required=False),
)
catalog.create_namespace("test_pyiceberg", properties=dict(location="s3://bucket/test_pyiceberg"))
catalog.create_table(
    identifier="test_pyiceberg.test",
    schema=schema,
)

import pyarrow as pa
import pandas as pd

t = catalog.load_table("test_pyiceberg.test")
to_append = pa.Table.from_pandas(pd.DataFrame([dict(x="hello!")]))
t.append(to_append)
t.scan().to_pandas()
#         x
# 0  hello!
```

The subsequent Athena query `SELECT * FROM test_pyiceberg.test` fails. At this point, updating the schema manually in Glue results in being able to successfully query via Athena.

**Other comments**

As a side note, querying the schema via Athena still works; e.g. before updating the schema in Glue, `SHOW COLUMNS FROM test_pyiceberg.test` will succeed.
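For reference, the manual Glue fix described above can be sketched with boto3. This is a hypothetical illustration, not an official procedure: the database/table names and S3 path are taken from the example, and the actual `update_table` call (which needs AWS credentials) is left commented out.

```python
# Hypothetical sketch of the manual workaround: attach the missing column
# information to the existing Glue table via a StorageDescriptor.
# Names and the S3 path are assumptions based on the example above.

table_input = {
    "Name": "test",
    "TableType": "EXTERNAL_TABLE",
    "Parameters": {"table_type": "ICEBERG"},
    "StorageDescriptor": {
        "Location": "s3://bucket/test_pyiceberg/test",
        # Mirrors the pyiceberg schema: a single optional string column "x".
        "Columns": [{"Name": "x", "Type": "string"}],
    },
}

# Applying it requires AWS credentials, e.g.:
# import boto3
# glue = boto3.client("glue")
# glue.update_table(DatabaseName="test_pyiceberg", TableInput=table_input)
```

After this update, the Glue console shows the columns and the Athena `SELECT` succeeds, matching the manual fix the comment describes.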
I confirm the behavior that @mgmarino raised. Also, any query of this type fails. Due to this behavior, query engines like Athena (I suspect Trino as well) might not be able to work with Iceberg tables created via pyiceberg without manual intervention.
FYI, I am preparing something here; I've already mostly translated the code from the Java library. I hope to have something by early next week at the latest, earlier if I can get some time this weekend. :-)
This is, I think, a more general issue with Athena, maybe Trino. This also fails for me with Iceberg tables created by Spark/Flink.
I noticed another side effect of the above. After such a table is created via pyiceberg, when I drop it via Athena the S3 location is not cleaned up; only the table reference in the Glue catalog is removed.
This may be because the location is not explicitly set on table creation (it is with the Java library). I can try to include this as well.
I have some tables created via Glue Jobs (Spark), and I will inspect the table properties created via Glue Jobs/Athena vs. pyiceberg to see the differences in the table properties that could cause that.
Resolves apache#216. This PR adds information about the schema (on update/create) and location (on create) of the table to Glue, enabling both an improved UI experience and querying with Athena.
Feature Request / Improvement
Based on discussion: #140 (comment). We can add a `StorageDescriptor` to the table input when creating/updating a table via GlueCatalog (see the Java ref). This information lets Glue display more useful details when inspecting tables through tools like the UI or CLI.
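The core of the idea can be sketched as a small helper that derives Glue `StorageDescriptor` columns from Iceberg field types. This is a hypothetical illustration of the approach, not the PR's actual code; the type mapping below is a partial assumption based on the usual Hive/Glue type names.

```python
# Hypothetical sketch: translate Iceberg schema fields into the Columns
# list of a Glue StorageDescriptor, so the schema is visible in the Glue
# console and resolvable by Athena.

# Partial Iceberg -> Glue (Hive) primitive type mapping (assumed names).
ICEBERG_TO_GLUE = {
    "string": "string",
    "int": "int",
    "long": "bigint",
    "float": "float",
    "double": "double",
    "boolean": "boolean",
    "timestamp": "timestamp",
}

def to_glue_columns(fields: list[tuple[str, str]]) -> list[dict]:
    """fields: (name, iceberg_type) pairs taken from the Iceberg schema."""
    return [
        {"Name": name, "Type": ICEBERG_TO_GLUE[iceberg_type]}
        for name, iceberg_type in fields
    ]

# For the single-column example table from this issue:
columns = to_glue_columns([("x", "string")])
```

The resulting list would slot into the `StorageDescriptor` of the Glue `TableInput` on create/update, which is what makes the columns show up in the UI and CLI.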