Preserve original field order when merging schemas #98
Merged
istreeter force-pushed the migration-field-order branch from 06fe18b to fc21028 (November 15, 2024 17:48)
istreeter added a commit to snowplow-incubator/snowplow-lake-loader that referenced this pull request (Nov 18, 2024)
This overcomes a limitation with how Hudi syncs schemas to the Glue catalog.

Previously, if version `1-0-0` of a schema had fields `a` and `b`, and then version `1-0-1` added a field `c`, the new field might be added _before_ the original fields in the Hudi schema. The new field would get synced to Glue, but only for new partitions; it is not back-filled to existing partitions.

After this change, the new field `c` is added _after_ the original fields `a` and `b` in the Hudi schema. Then there is no need to sync the new field to existing partitions in Glue.

The problem manifested in AWS Athena with a message like:

> HIVE_PARTITION_SCHEMA_MISMATCH: There is a mismatch between the table and partition schemas.

This fix was implemented in snowplow/schema-ddl#213 and snowplow-incubator/common-streams#98 and imported via a new version of common-streams.

This change does not impact Delta or Iceberg, where nothing was broken.
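The ordering property described in this commit message can be illustrated with a small sketch. It is a simplified model only: the helper name `appended_only` is hypothetical, fields are reduced to plain names, and it does not reproduce how Hudi, Glue, or Athena actually compare schemas. It only checks whether the original fields keep their positions when a new field is added.

```python
from typing import List


def appended_only(old_fields: List[str], new_fields: List[str]) -> bool:
    """Return True if new_fields starts with old_fields, in the same order.

    Hypothetical helper: models "new fields come after the original ones".
    """
    return new_fields[: len(old_fields)] == old_fields


# Field names taken from the example in the commit message.
v1_0_0 = ["a", "b"]
possible_before_fix = ["c", "a", "b"]  # new field placed before the originals
after_fix = ["a", "b", "c"]            # new field placed after the originals

print(appended_only(v1_0_0, possible_before_fix))  # False
print(appended_only(v1_0_0, after_fix))            # True
```

In this model, a schema for which the check holds leaves the columns of existing partitions at the same positions, which is the situation the commit message describes as needing no back-fill.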
istreeter added a commit to snowplow-incubator/snowplow-lake-loader that referenced this pull request (Nov 19, 2024), with the same commit message as the Nov 18 commit.
istreeter added a commit to snowplow-incubator/snowplow-bigquery-loader that referenced this pull request (Nov 19, 2024)
For this app, the most significant changes are:

- common-streams has a big change in how it preserves the order of struct fields: snowplow-incubator/common-streams#98. This should not impact how BigQuery loader works. But I highlight here because of the potential risk if we got something wrong.
- iglu-scala-client now treats both 403 and 404 as NotFound when listing a series of schemas from the Iglu repos: snowplow/iglu-scala-client#260
benjben approved these changes (Nov 21, 2024)
👌
See snowplow/schema-ddl#213

When a schema is evolved (e.g. from `1-0-0` to `1-0-1`), we create a merged struct column combining fields from the new and old schemas.

For some loaders it is important that newly added nested fields come after the original fields, e.g. Lake Loader with Hudi and Glue sync enabled.
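The merge behaviour described above can be sketched as follows. This is a minimal illustration in Python, not the schema-ddl or common-streams implementation (those are Scala libraries, and their actual types and functions are not shown in this conversation). The function `merge_structs` and the dict-based struct model are assumptions made for the example, and type conflicts between versions are deliberately ignored.

```python
from typing import Any, Dict

# Assumed model: a struct is a dict of field name -> type, where a type is
# either a type name (str) or a nested struct (dict). Python dicts keep
# insertion order, which stands in for column order here.
Struct = Dict[str, Any]


def merge_structs(old: Struct, new: Struct) -> Struct:
    """Merge two struct versions, keeping original fields first (hypothetical)."""
    merged: Struct = {}
    # 1. Keep every original field in its original position.
    for name, old_type in old.items():
        new_type = new.get(name)
        if isinstance(old_type, dict) and isinstance(new_type, dict):
            # Nested structs are merged recursively with the same rule.
            merged[name] = merge_structs(old_type, new_type)
        else:
            # Simplification: keep the old type and ignore type changes.
            merged[name] = old_type
    # 2. Append fields that only exist in the new version, after the originals.
    for name, new_type in new.items():
        if name not in merged:
            merged[name] = new_type
    return merged


old = {"a": "string", "b": {"x": "long"}}
# The new version lists its fields in a different order and adds `c` and `b.w`.
new = {"c": "boolean", "b": {"w": "string", "x": "long"}, "a": "string"}

merged = merge_structs(old, new)
print(list(merged))       # ['a', 'b', 'c']
print(list(merged["b"]))  # ['x', 'w']
```

The point of the sketch is the two-pass order: original fields first, at every nesting level, and additions last, regardless of the order in which the newer schema version lists them.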
istreeter force-pushed the migration-field-order branch from fc21028 to 0e3484f (November 25, 2024 10:47)