GlueCatalog: Set Glue table input information based on Iceberg table metadata #216

HonahX · 2023-12-15T06:18:34Z

Feature Request / Improvement

Based on the discussion in #140 (comment), we can add a StorageDescriptor to the table input when creating or updating a table via GlueCatalog. Java Ref

This lets Glue display more useful table details when tables are inspected through tools such as the UI or CLI.
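To make the proposal concrete, here is a minimal sketch of a Glue table input that carries a StorageDescriptor. The helper name `build_table_input`, the `(name, type)` field tuples, and the type mapping are assumptions made for illustration; they are not pyiceberg's actual implementation, and only primitive types are covered.

```python
from typing import Dict, List, Tuple

# Assumed mapping from Iceberg primitive type names to Glue (Hive-style) type strings.
ICEBERG_TO_GLUE_TYPE: Dict[str, str] = {
    "boolean": "boolean",
    "int": "int",
    "long": "bigint",
    "float": "float",
    "double": "double",
    "date": "date",
    "timestamp": "timestamp",
    "string": "string",
    "binary": "binary",
}


def build_table_input(
    table_name: str,
    metadata_location: str,
    location: str,
    fields: List[Tuple[str, str]],
) -> dict:
    """Build a dict shaped like Glue's TableInput (hypothetical helper)."""
    columns = []
    for name, iceberg_type in fields:
        if iceberg_type not in ICEBERG_TO_GLUE_TYPE:
            # Nested and parameterized types are out of scope for this sketch.
            raise ValueError(f"unsupported type in this sketch: {iceberg_type}")
        columns.append({"Name": name, "Type": ICEBERG_TO_GLUE_TYPE[iceberg_type]})
    return {
        "Name": table_name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {
            "table_type": "ICEBERG",
            "metadata_location": metadata_location,
        },
        # The part this issue proposes to add: columns and location.
        "StorageDescriptor": {"Columns": columns, "Location": location},
    }


table_input = build_table_input(
    "test",
    "s3://bucket/test_pyiceberg/test/metadata/example.metadata.json",  # placeholder path
    "s3://bucket/test_pyiceberg/test",
    [("x", "string")],
)
print(table_input["StorageDescriptor"])
```

A dict of this shape is what the Glue `CreateTable` and `UpdateTable` calls accept as `TableInput`.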


mgmarino · 2024-01-19T08:01:45Z

It seems that this is also required to enable AWS Athena to query a table created with pyiceberg and not just useful for UI/CLI tools.

Example

Running a test to generate a table on Glue:

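from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import (
    StringType,
    NestedField,
)

# Assumption: a Glue catalog reachable with the default AWS credentials and region.
catalog = load_catalog("glue", type="glue")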

schema = Schema(
    NestedField(field_id=1, name="x", field_type=StringType(), required=False),
)
catalog.create_namespace("test_pyiceberg", properties=dict(location="s3://bucket/test_pyiceberg"))

catalog.create_table(
    identifier="test_pyiceberg.test",
    schema=schema,
)

import pyarrow as pa
import pandas as pd

t = catalog.load_table("test_pyiceberg.test")
to_append = pa.Table.from_pandas(pd.DataFrame([dict(x="hello!")]))
t.append(to_append)
t.scan().to_pandas()
#         x
# 0  hello!

The subsequent Athena query:

SELECT * FROM test_pyiceberg.test

fails with

Reason: COLUMN_NOT_FOUND: line 1:8: SELECT * not allowed from relation that has no columns

At this point, manually updating the schema in Glue makes the Athena query succeed.
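One way to check for this condition is to look at the columns Glue has recorded for the table. The sketch below works on a dict shaped like the response of Glue's GetTable call (for instance from boto3's `get_table`); the helper name `glue_column_names` and the sample responses are made up for illustration.

```python
from typing import List


def glue_column_names(get_table_response: dict) -> List[str]:
    """Return the column names Glue holds for a table (hypothetical helper).

    Expects a dict shaped like the Glue GetTable response:
    {"Table": {"StorageDescriptor": {"Columns": [{"Name": ..., "Type": ...}]}}}
    """
    table = get_table_response.get("Table", {})
    descriptor = table.get("StorageDescriptor") or {}
    return [column["Name"] for column in descriptor.get("Columns", [])]


# Illustrative responses: a table registered without a StorageDescriptor,
# and the same table after the schema was added to Glue.
without_schema = {"Table": {"Name": "test", "Parameters": {"table_type": "ICEBERG"}}}
with_schema = {
    "Table": {
        "Name": "test",
        "Parameters": {"table_type": "ICEBERG"},
        "StorageDescriptor": {"Columns": [{"Name": "x", "Type": "string"}]},
    }
}

print(glue_column_names(without_schema))
print(glue_column_names(with_schema))
```

An empty list for a table that has data matches the empty Schema section described in this thread.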

Other comments

As a side note, querying the schema via Athena works even before the schema is updated in Glue, e.g.

SHOW COLUMNS FROM test_pyiceberg.test

will succeed.

nicor88 · 2024-01-20T08:29:25Z

I can confirm the behavior that @mgmarino raised.
After the table is created in the Glue catalog, the Schema section for the table is empty, as you can see in my screenshot:

Also, any query of the form SHOW CREATE TABLE test_iceberg_writing; fails in Athena, even when the columns are added manually to Glue.

Because of this behavior, query engines like Athena (and, I suspect, Trino as well) might not be able to work with Iceberg tables created via pyiceberg without manual intervention.

mgmarino · 2024-01-20T08:33:36Z

FYI, I am preparing something here; I've already mostly translated the code from the Java library. I hope to have something by early next week at the latest, earlier if I can get some time this weekend. :-)

mgmarino · 2024-01-20T08:35:40Z

> Also, any query of the form SHOW CREATE TABLE test_iceberg_writing; fails in Athena, even when the columns are added manually to Glue.

This is, I think, a more general issue with Athena, maybe Trino. This also fails for me with Iceberg tables created by Spark/Flink.

nicor88 · 2024-01-20T08:38:38Z

I noticed another side effect of the above. After such a table is created via pyiceberg, dropping it via Athena does not clean up the S3 location; only the table reference in the Glue catalog is removed.

mgmarino · 2024-01-20T08:41:39Z

> I noticed another side effect of the above. After such a table is created via pyiceberg, dropping it via Athena does not clean up the S3 location; only the table reference in the Glue catalog is removed.

This may be because the location is not explicitly set on table creation (it is with the Java library). I can try to include this as well.

nicor88 · 2024-01-20T08:42:12Z

> This is, I think, a more general issue with Athena, maybe Trino. This also fails for me with Iceberg tables created by Spark/Flink.

I have some tables created via Glue jobs (Spark), and SHOW CREATE TABLE works for those Iceberg tables. I suspect there is a Glue property that must be set at table creation for this command to work.

I will inspect the table properties created via Glue jobs/Athena versus pyiceberg to see which differences could cause that.
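A comparison like this can be sketched as a diff over the `Parameters` maps of two Glue tables. The helper name `diff_parameters` and the sample values are invented for illustration; they are not the properties actually observed in this thread.

```python
from typing import Dict


def diff_parameters(left: Dict[str, str], right: Dict[str, str]) -> dict:
    """Compare two Glue table Parameters maps (hypothetical helper)."""
    shared = left.keys() & right.keys()
    return {
        "only_left": {key: left[key] for key in left.keys() - right.keys()},
        "only_right": {key: right[key] for key in right.keys() - left.keys()},
        "changed": {
            key: (left[key], right[key]) for key in shared if left[key] != right[key]
        },
    }


# Made-up property sets standing in for a Spark-created and a pyiceberg-created table.
spark_table = {
    "table_type": "ICEBERG",
    "metadata_location": "s3://bucket/a/metadata/example.metadata.json",
    "example_property": "value",  # placeholder key, not a real Glue property
}
pyiceberg_table = {
    "table_type": "ICEBERG",
    "metadata_location": "s3://bucket/b/metadata/example.metadata.json",
}

result = diff_parameters(spark_table, pyiceberg_table)
print(result["only_left"])
print(result["changed"])
```

Keys that appear on only one side are the natural candidates to test first.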

Resolves apache#216. This PR adds information about the schema (on update/create) and location (create) of the table to Glue, enabling both an improved UI experience as well as querying with Athena.

mgmarino · 2024-01-20T10:08:33Z

Ok, ready to go: #288.

@nicor88 my changes with Location apparently still didn't allow DROP TABLE in Athena to clean up the resources and I didn't have a chance to investigate further, so let me know if you have any further input here.

HonahX mentioned this issue Dec 15, 2023

Glue catalog commit table #140

Merged

mgmarino mentioned this issue Jan 20, 2024

Set Glue Table Information when creating/updating tables #288

Merged

HonahX closed this as completed in #288 Jan 25, 2024
