Omit parquet field for a schema with no nested fields #205
Merged
istreeter force-pushed the omit-empty-struct branch 3 times, most recently from a738be5 to f72552f (April 20, 2024 15:35)
pondzix approved these changes on May 8, 2024
Snowplow users commonly use schemas with no nested fields. Previously, we created a string column for such a schema and loaded the string `{}` into it. But there is no benefit to loading this redundant data. Omitting the column for these schemas also means we can support schema evolution if the user ever adds a nested field to the empty schema. For empty schemas with `additionalProperties: true` we retain the old behaviour of loading the original JSON as a string.
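The decision described above can be illustrated with a minimal Python sketch. It is not schema-ddl code (schema-ddl is a Scala library); the function name `parquet_column` and the labels `"struct"` and `"json_string"` are hypothetical. Omitting the column only when `additionalProperties` is explicitly `false` is also an assumption, since the commit message does not say how a missing `additionalProperties` is handled.

```python
from typing import Optional


def parquet_column(schema: dict) -> Optional[str]:
    """Return a hypothetical column kind for an object schema, or None to omit the column."""
    properties = schema.get("properties") or {}
    if properties:
        # The schema has nested fields, so it maps to a structured column.
        return "struct"
    # Assumption: only an explicit `false` marks the schema as closed and empty.
    if schema.get("additionalProperties") is False:
        # Nothing could ever be loaded except `{}`, so create no column at all.
        return None
    # Empty but open schema: keep the old behaviour and load the raw JSON as a string.
    return "json_string"


closed_empty = {"type": "object", "additionalProperties": False}
open_empty = {"type": "object", "additionalProperties": True}
with_field = {"type": "object", "properties": {"id": {"type": "string"}}}

print(parquet_column(closed_empty))  # None
print(parquet_column(open_empty))    # json_string
print(parquet_column(with_field))    # struct
```

Because no column exists for the closed, empty schema, a later schema version that adds a field can introduce a structured column without conflicting with an earlier string column.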
In modern Snowplow loaders we have already switched to using Vector when manipulating data structures. We use Vector because we commonly need to join together fields from different schemas. But even the new loader code still has lots of `list.toVector` calls. If we change to Vector in schema-ddl, we can eliminate many of those unnecessary conversions.
istreeter force-pushed the omit-empty-struct branch from f72552f to 4c0505d (May 8, 2024 14:28)
istreeter added a commit to snowplow/snowplow-rdb-loader that referenced this pull request on Nov 29, 2024
This concerns schemas like:

```
{"type": "object", "additionalProperties": false}
```

Older versions of schema-ddl would convert this schema to a String (JSON) parquet column. In snowplow/schema-ddl#205 we changed the behaviour so that this schema is converted to `None`, i.e. no column is created for it. It was a good change for newer loaders (aside from RDB Loader). But it caused problems for RDB Loader in an edge-case scenario: if the schema above is evolved from `1-0-0` to `1-0-1` and the new version adds a field, then RDB Loader tries to create a column for the new field. That clashes with the old string column created by the older version of RDB Loader.

This PR returns to the original schema-ddl behaviour for schemas with no explicit properties. It does so without any change to schema-ddl, so we still get all the benefits of snowplow/schema-ddl#205 for the other loaders.
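One way to picture restoring the old behaviour on the loader side, without touching the library, is a thin wrapper that falls back to a string column whenever the library omits one. The sketch below is only an illustration in Python. The names `schema_ddl_column` and `rdb_loader_column` are hypothetical stand-ins, and the source does not show how the actual RDB Loader commit implements this.

```python
from typing import Optional


def schema_ddl_column(schema: dict) -> Optional[str]:
    """Hypothetical stand-in for the library's new behaviour after schema-ddl#205."""
    if schema.get("properties"):
        return "struct"
    # Assumption: only an explicit `false` causes the column to be omitted.
    if schema.get("additionalProperties") is False:
        return None
    return "json_string"


def rdb_loader_column(schema: dict) -> str:
    """Loader-side wrapper: always produce a column, as older versions did."""
    column = schema_ddl_column(schema)
    # Where the library omits the column, fall back to the legacy JSON string column
    # so tables created by older loader versions keep a consistent layout.
    return "json_string" if column is None else column


closed_empty = {"type": "object", "additionalProperties": False}

print(schema_ddl_column(closed_empty))  # None
print(rdb_loader_column(closed_empty))  # json_string
```

Keeping the fallback in the loader leaves the library function unchanged, so other callers still see `None` for closed, empty schemas.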