Preserve original field order when merging schemas #98

Merged (1 commit) on Nov 25, 2024


istreeter

See snowplow/schema-ddl#213

When a schema is evolved (e.g. from `1-0-0` to `1-0-1`) we create a merged struct column combining fields from the new and old schemas.

For some loaders it is important that newly-added nested fields come after the original fields, for example Lake Loader with Hudi and Glue sync enabled.
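
To illustrate the ordering rule described above, here is a minimal sketch in Python. It is not the schema-ddl or common-streams implementation: the `merge_structs` name and the dict-based struct model are assumptions made for this example only. Fields from the old schema keep their original positions, and fields that exist only in the new schema are appended afterwards, recursively for nested structs.

```python
from typing import Any, Dict

# Hypothetical model: a struct is an insertion-ordered dict mapping
# field name -> type, where a type is either a primitive type name (str)
# or a nested struct (another dict).
Struct = Dict[str, Any]


def merge_structs(old: Struct, new: Struct) -> Struct:
    """Merge two struct definitions, keeping the old field order.

    Simplification for this sketch: when a field exists in both schemas
    and is not a struct in both, the old type is kept unchanged.
    """
    merged: Struct = {}

    # 1. Original fields first, in their original order.
    for name, old_type in old.items():
        new_type = new.get(name)
        if isinstance(old_type, dict) and isinstance(new_type, dict):
            # Nested structs are merged with the same rule.
            merged[name] = merge_structs(old_type, new_type)
        else:
            merged[name] = old_type

    # 2. Newly-added fields last, in the order the new schema lists them.
    for name, new_type in new.items():
        if name not in merged:
            merged[name] = new_type

    return merged


# Version 1-0-0 has fields `a` and `b`; version 1-0-1 adds `c` and `b.y`,
# and happens to list them before the original fields.
v1_0_0 = {"a": "string", "b": {"x": "long"}}
v1_0_1 = {"c": "string", "b": {"y": "long", "x": "long"}, "a": "string"}

merged = merge_structs(v1_0_0, v1_0_1)
print(list(merged))       # top-level field order
print(list(merged["b"]))  # nested field order
```

In this sketch the order in which the new schema lists its fields never affects the position of the original fields, which is the property the PR description asks for.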

@istreeter istreeter force-pushed the migration-field-order branch from 06fe18b to fc21028 Compare November 15, 2024 17:48
istreeter added a commit to snowplow-incubator/snowplow-lake-loader that referenced this pull request Nov 18, 2024
This overcomes a limitation with how Hudi syncs schemas to the Glue
catalog. Previously, if version `1-0-0` of a schema had fields `a` and
`b`, and then version `1-0-1` adds a field `c`, then the new field might
be added _before_ the original fields in the Hudi schema.  The new field
would get synced to Glue, but only for new partitions; it is not
back-filled to existing partitions.

After this change, the new field `c` is added _after_ the original
fields `a` and `b` in the Hudi schema.  Then there is no need to sync
the new field to existing partitions in Glue.

The problem manifested in AWS Athena with a message like:

> HIVE_PARTITION_SCHEMA_MISMATCH: There is a mismatch between the table and partition schemas.

This fix was implemented in snowplow/schema-ddl#213 and
snowplow-incubator/common-streams#98 and imported via a new version of
common-streams.

This change does not impact Delta or Iceberg, where nothing was broken.
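
The property the commit message relies on (new fields land _after_ the original fields) can be stated as a small check. The sketch below is only an illustration in Python: the `preserves_original_order` helper and the dict-based schema model are hypothetical and are not part of Hudi, Glue, or common-streams.

```python
from typing import Any, Dict

# Hypothetical model: a struct is an insertion-ordered dict mapping
# field name -> type (a primitive type name or a nested struct dict).
Struct = Dict[str, Any]


def preserves_original_order(old: Struct, merged: Struct) -> bool:
    """Return True if `merged` starts with exactly the fields of `old`,
    in the same order, at every nesting level."""
    old_names = list(old)

    # The original fields must form a prefix of the merged field list.
    if list(merged)[: len(old_names)] != old_names:
        return False

    # Apply the same rule to every field that is a struct on both sides.
    return all(
        preserves_original_order(old[name], merged[name])
        for name in old_names
        if isinstance(old[name], dict) and isinstance(merged[name], dict)
    )


original = {"a": "string", "b": "string"}

# New field `c` appended after `a` and `b`: the behaviour after this change.
appended = {"a": "string", "b": "string", "c": "string"}

# New field `c` placed before the original fields: the problematic case.
prepended = {"c": "string", "a": "string", "b": "string"}

print(preserves_original_order(original, appended))
print(preserves_original_order(original, prepended))
```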
istreeter added a commit to snowplow-incubator/snowplow-lake-loader that referenced this pull request Nov 19, 2024
istreeter added a commit to snowplow-incubator/snowplow-bigquery-loader that referenced this pull request Nov 19, 2024
For this app, the most significant changes are:

- common-streams has a big change in how it preserves the order of
  struct fields: snowplow-incubator/common-streams#98. This should not
  affect how BigQuery Loader works, but I highlight it here because of
  the potential risk if we got something wrong.
- iglu-scala-client now treats both 403 and 404 as NotFound when listing
  a series of schemas from the Iglu repos:
  snowplow/iglu-scala-client#260

@benjben benjben left a comment


👌

@istreeter istreeter force-pushed the migration-field-order branch from fc21028 to 0e3484f Compare November 25, 2024 10:47
@istreeter istreeter merged commit 200e4b5 into main Nov 25, 2024
1 check passed
@istreeter istreeter deleted the migration-field-order branch November 25, 2024 10:58