Preserve original field order during schema evolution
This overcomes a limitation with how Hudi syncs schemas to the Glue
catalog. Previously, if version `1-0-0` of a schema had fields `a` and
`b`, and then version `1-0-1` added a field `c`, the new field might
be added _before_ the original fields in the Hudi schema.  The new field
would get synced to Glue, but only for new partitions; it was not
back-filled to existing partitions.

After this change, the new field `c` is added _after_ the original
fields `a` and `b` in the Hudi schema.  Then there is no need to sync
the new field to existing partitions in Glue.
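
The ordering rule can be illustrated with a minimal sketch. It assumes a simplified `Field` type and a hypothetical `mergePreservingOrder` helper; neither is taken from schema-ddl or common-streams, whose actual implementation may differ.

```scala
// Hypothetical, simplified model of a schema field (not the schema-ddl types).
final case class Field(name: String, dataType: String)

object FieldOrder {
  // Merge the fields of a newer schema version into the existing field list.
  // Existing fields keep their positions; fields not seen before are appended
  // at the end, in the order they appear in the incoming list.
  def mergePreservingOrder(existing: List[Field], incoming: List[Field]): List[Field] = {
    val known = existing.map(_.name).toSet
    existing ++ incoming.filterNot(f => known.contains(f.name))
  }
}

// Example: version 1-0-0 has fields a and b; version 1-0-1 lists c first.
//   FieldOrder.mergePreservingOrder(
//     List(Field("a", "string"), Field("b", "string")),
//     List(Field("c", "string"), Field("a", "string"), Field("b", "string"))
//   ).map(_.name)
// yields List("a", "b", "c"), so existing partitions still see a and b first.
```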

The problem manifested in AWS Athena with a message like:

> HIVE_PARTITION_SCHEMA_MISMATCH: There is a mismatch between the table and partition schemas.

This fix was implemented in snowplow/schema-ddl#213 and
snowplow-incubator/common-streams#98 and imported via a new version of
common-streams.

This change does not impact Delta or Iceberg, where nothing was broken.
istreeter committed Nov 18, 2024
1 parent c1e275e commit 9b2376f
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion project/Dependencies.scala
@@ -49,7 +49,7 @@ object Dependencies {
val awsRegistry = "1.1.20"

// Snowplow
- val streams = "0.8.0"
+ val streams = "0.8.2-M1"
val igluClient = "4.0.0"

// Transitive overrides