GlueCatalog: Set Glue table input information based on Iceberg table metadata #216

Closed
HonahX opened this issue Dec 15, 2023 · 8 comments · Fixed by #288

Comments

@HonahX
Contributor

HonahX commented Dec 15, 2023

Feature Request / Improvement

Based on the discussion in #140 (comment), we can add a StorageDescriptor to the table input when creating or updating a table via GlueCatalog (Java Ref).

This would let Glue display more useful information when tables are inspected through tools like the UI or CLI.
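
To make the proposal concrete, here is a minimal sketch of how a Glue table input with a StorageDescriptor might be derived from schema information. The `Field` class, the `ICEBERG_TO_GLUE` mapping, and `build_table_input` are hypothetical names used only for illustration; they are not pyiceberg's API, and the real implementation would also need to handle nested types. The dictionary keys follow the Glue `TableInput` structure.

```python
from dataclasses import dataclass


# Hypothetical stand-in for an Iceberg schema field; pyiceberg's real
# NestedField carries more information (field IDs, nested types, docs).
@dataclass
class Field:
    name: str
    iceberg_type: str  # primitive type name, e.g. "string" or "long"


# Assumed mapping from Iceberg primitive type names to Glue/Hive type strings.
ICEBERG_TO_GLUE = {
    "boolean": "boolean",
    "int": "int",
    "long": "bigint",
    "float": "float",
    "double": "double",
    "date": "date",
    "timestamp": "timestamp",
    "string": "string",
    "binary": "binary",
}


def build_table_input(table_name, fields, location, metadata_location):
    """Build a Glue TableInput-shaped dict that includes a StorageDescriptor."""
    # Raises KeyError for types missing from the mapping (e.g. nested types).
    columns = [
        {"Name": f.name, "Type": ICEBERG_TO_GLUE[f.iceberg_type]}
        for f in fields
    ]
    return {
        "Name": table_name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {
            "table_type": "ICEBERG",
            "metadata_location": metadata_location,
        },
        # The part this issue asks for: columns and location visible to Glue.
        "StorageDescriptor": {"Columns": columns, "Location": location},
    }
```

A dict of this shape could then be passed as `TableInput` to the Glue `create_table` or `update_table` calls.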

@mgmarino
Contributor

mgmarino commented Jan 19, 2024

It seems that this is also required to enable AWS Athena to query a table created with pyiceberg and not just useful for UI/CLI tools.

Example

Running a test to generate a table in Glue:

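from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import (
    StringType,
    NestedField,
)

# Assumption (not in the original snippet): `catalog` is a Glue-backed catalog.
catalog = load_catalog("glue", **{"type": "glue"})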

schema = Schema(
    NestedField(field_id=1, name="x", field_type=StringType(), required=False),
)
catalog.create_namespace("test_pyiceberg", properties=dict(location="s3://bucket/test_pyiceberg"))

catalog.create_table(
    identifier="test_pyiceberg.test",
    schema=schema,
)

import pyarrow as pa
import pandas as pd

t = catalog.load_table("test_pyiceberg.test")
to_append = pa.Table.from_pandas(pd.DataFrame([dict(x="hello!")]))
t.append(to_append)
t.scan().to_pandas()
#         x
# 0  hello!

The subsequent Athena query:

SELECT * FROM test_pyiceberg.test

fails with

Reason: COLUMN_NOT_FOUND: line 1:8: SELECT * not allowed from relation that has no columns

At this point, updating the schema manually in Glue then results in being able to successfully query Athena.
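
As a rough sketch of the manual workaround described above, the columns could be written to the Glue table with the Glue `get_table` and `update_table` calls. The helper name `set_glue_columns` and the `COPYABLE_KEYS` selection are assumptions for illustration; `glue_client` is expected to be something like `boto3.client("glue")`, which is passed in rather than created here.

```python
# Keys that a Glue GetTable response shares with the TableInput structure.
# The response also carries read-only keys (e.g. CreateTime, DatabaseName)
# that UpdateTable does not accept, so only these are copied over.
COPYABLE_KEYS = (
    "Name",
    "Description",
    "Owner",
    "Retention",
    "PartitionKeys",
    "TableType",
    "Parameters",
    "StorageDescriptor",
)


def set_glue_columns(glue_client, database, table, columns):
    """Overwrite the column list of an existing Glue table; return the input sent."""
    current = glue_client.get_table(DatabaseName=database, Name=table)["Table"]
    table_input = {k: current[k] for k in COPYABLE_KEYS if k in current}

    # Keep any existing storage settings and replace only the columns.
    storage = dict(table_input.get("StorageDescriptor", {}))
    storage["Columns"] = columns
    table_input["StorageDescriptor"] = storage

    glue_client.update_table(DatabaseName=database, TableInput=table_input)
    return table_input


# Example call (requires AWS credentials, so it is left commented out):
# import boto3
# set_glue_columns(boto3.client("glue"), "test_pyiceberg", "test",
#                  [{"Name": "x", "Type": "string"}])
```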

Other comments

As a side note, querying the schema via Athena works even before the schema is updated in Glue. For example,

SHOW COLUMNS FROM test_pyiceberg.test

will succeed.

@nicor88

nicor88 commented Jan 20, 2024

I can confirm the behavior that @mgmarino raised.
After the table is created in the Glue catalog, the Schema section for the table is empty, as you can see in my screenshot:
[Screenshot 2024-01-20 at 09 27 14]

Also, any query of the form SHOW CREATE TABLE test_iceberg_writing; fails in Athena, even when the columns are added manually to Glue.

Because of this behavior, query engines like Athena (and, I suspect, Trino as well) might not be able to work with Iceberg tables created via pyiceberg without manual intervention.
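
To find tables affected by the empty Schema section, one option is to list the tables in a Glue database and report those whose StorageDescriptor has no columns. This is only a sketch: `tables_without_columns` is a hypothetical helper, and `glue_client` is assumed to behave like `boto3.client("glue")`, whose `get_tables` call returns a `TableList` and an optional `NextToken`.

```python
def tables_without_columns(glue_client, database):
    """Return names of Glue tables in `database` that expose no columns."""
    names = []
    kwargs = {"DatabaseName": database}
    while True:
        page = glue_client.get_tables(**kwargs)
        for table in page.get("TableList", []):
            # A missing StorageDescriptor is treated the same as empty columns.
            columns = table.get("StorageDescriptor", {}).get("Columns", [])
            if not columns:
                names.append(table["Name"])
        token = page.get("NextToken")
        if not token:
            return names
        # Continue with the next page of results.
        kwargs["NextToken"] = token
```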

@mgmarino
Contributor

FYI, I am preparing something here; I've already translated most of the code from the Java library. I hope to have something by early next week at the latest, or earlier if I can get some time this weekend. :-)

@mgmarino
Contributor

> Also, any query of the form SHOW CREATE TABLE test_iceberg_writing; fails in Athena, even when the columns are added manually to Glue.

This is, I think, a more general issue with Athena, maybe Trino. This also fails for me with Iceberg tables created by Spark/Flink.

@nicor88

nicor88 commented Jan 20, 2024

I noticed another side effect of the above. After such a table is created via pyiceberg, dropping it via Athena does not clean up the S3 location; only the table reference in the Glue catalog is removed.

@mgmarino
Contributor

> I noticed another side effect of the above. After such a table is created via pyiceberg, dropping it via Athena does not clean up the S3 location; only the table reference in the Glue catalog is removed.

This may be because the location is not explicitly set on table creation (it is with the Java library). I can try to include this as well.

@nicor88

nicor88 commented Jan 20, 2024

> This is, I think, a more general issue with Athena, maybe Trino. This also fails for me with Iceberg tables created by Spark/Flink.

I have some tables created via Glue jobs (Spark), and SHOW CREATE TABLE works for those Iceberg tables. I suspect there is a Glue property that must be set at table creation to make this command work.

I will compare the table properties of tables created via Glue jobs/Athena with those created via pyiceberg to find the difference that could cause this.
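
A comparison like this could be sketched as a plain dictionary diff over the two tables' `Parameters` maps (for example, taken from the `Table` entry of a Glue `get_table` response). The helper name `diff_parameters` is hypothetical, and the sample values in the usage comment are made up for illustration, not real findings.

```python
def diff_parameters(left, right):
    """Compare two Glue table Parameters dicts.

    Returns (only_left, only_right, changed), where `changed` maps each
    shared key with differing values to a (left_value, right_value) tuple.
    """
    only_left = {k: left[k] for k in left.keys() - right.keys()}
    only_right = {k: right[k] for k in right.keys() - left.keys()}
    changed = {
        k: (left[k], right[k])
        for k in left.keys() & right.keys()
        if left[k] != right[k]
    }
    return only_left, only_right, changed


# Usage sketch with made-up values:
# spark_params = {"table_type": "ICEBERG", "some_property": "value"}
# pyiceberg_params = {"table_type": "ICEBERG"}
# only_spark, only_pyiceberg, changed = diff_parameters(spark_params, pyiceberg_params)
```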

mgmarino added a commit to mgmarino/iceberg-python that referenced this issue Jan 20, 2024
Resolves apache#216.

This PR adds information about the schema (on update/create) and
location (create) of the table to Glue, enabling both an improved UI
experience as well as querying with Athena.
@mgmarino
Contributor

Ok, ready to go: #288.

@nicor88 my changes with Location apparently still didn't allow DROP TABLE in Athena to clean up the resources and I didn't have a chance to investigate further, so let me know if you have any further input here.
