Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Omit parquet field for a schema with no nested fields #205

Merged
merged 2 commits into from
May 8, 2024

Conversation

istreeter
Copy link
Contributor

Snowplow users commonly use schemas with no nested fields. Previously, we were creating a string column and loading the string field {}. But there is no benefit to loading this redundant data.

By omitting a column for these schemas, it means we support schema evolution if the user ever adds a nested field to the empty schema.

For empty schemas with additionalProperties: true we retain the old behaviour of loading the original JSON as a string.

@coveralls
Copy link

coveralls commented Apr 19, 2024

Pull Request Test Coverage Report for Build 9003594613

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 47 of 50 (94.0%) changed or added relevant lines in 4 files are covered.
  • 7 unchanged lines in 7 files lost coverage.
  • Overall coverage increased (+0.1%) to 80.668%

Changes Missing Coverage Covered Lines Changed/Added Lines %
modules/core/src/main/scala/com.snowplowanalytics/iglu.schemaddl/parquet/Field.scala 32 33 96.97%
modules/core/src/main/scala/com.snowplowanalytics/iglu.schemaddl/parquet/FieldValue.scala 0 2 0.0%
Files with Coverage Reduction New Missed Lines %
modules/core/src/main/scala/com.snowplowanalytics/iglu.schemaddl/jsonschema/circe/NumberCodecs.scala 1 70.37%
modules/core/src/main/scala/com.snowplowanalytics/iglu.schemaddl/parquet/Caster.scala 1 84.85%
modules/core/src/main/scala/com.snowplowanalytics/iglu.schemaddl/bigquery/Row.scala 1 71.79%
modules/core/src/main/scala/com.snowplowanalytics/iglu.schemaddl/parquet/Migrations.scala 1 90.67%
modules/core/src/main/scala/com.snowplowanalytics/iglu.schemaddl/jsonschema/package.scala 1 61.76%
modules/core/src/main/scala/com.snowplowanalytics/iglu.schemaddl/jsonschema/circe/StringCodecs.scala 1 64.29%
modules/core/src/main/scala/com.snowplowanalytics/iglu.schemaddl/jsonschema/mutate/Widened.scala 1 86.36%
Totals Coverage Status
Change from base Build 8434107272: 0.1%
Covered Lines: 1256
Relevant Lines: 1557

💛 - Coveralls

istreeter added 2 commits May 8, 2024 15:28
Snowplow users commonly use schemas with no nested fields. Previously,
we were creating a string column and loading the string field `{}`. But
there is no benefit to loading this redundant data.

By omitting a column for these schemas, it means we support schema
evolution if the user ever adds a nested field to the empty schema.

For empty schemas with `additionalProperties: true` we retain the old
behaviour of loading the original JSON as a string.
In modern snowplow loaders we have already switched to using Vector when
manipulating data structures. We use vector because we commonly need to
join together fields from different shcmeas.

But even the new loader code still has lots of `list.toVector`. If we
change to Vector in schema-ddl then we can eliminate a lot of those
unnecessary conversions.
@istreeter istreeter force-pushed the omit-empty-struct branch from f72552f to 4c0505d Compare May 8, 2024 14:28
@istreeter istreeter merged commit 4c0505d into develop May 8, 2024
1 check passed
istreeter added a commit to snowplow/snowplow-rdb-loader that referenced this pull request Nov 29, 2024
This concerns schemas like:

```
{"type": "object", "additionalProperties": false}
```

Older versions of schema-ddl would convert this to a schema type to
String (JSON) parquet column.  In snowplow/schema-ddl#205 we changed the
behaviour so this schema is converted to a `None`, i.e. do not create a
column for this schema. It was a good change for newer loaders (aside
from RDB Loader).

But that caused problems for RDB Loader under an edge-case scenario: If
the schema above is evolved from `1-0-0` to `1-0-1` and the new schema
adds a field to the schema, then RDB Loader tries to create a column for
the new field.  But that clashes with the old string column created with
the older version of RDB Loader.

This PR returns to the original behaviour of schema-ddl for this schemas
with no explicit properties.  It does so without us making any change to
schema-ddl, so we still get all the benefits of snowplow/schema-ddl#205
for the other loaders.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants