Omit parquet field for a schema with no nested fields #205

istreeter · 2024-04-19T21:02:27Z

Snowplow users commonly use schemas with no nested fields. Previously, we were creating a string column and loading the string field {}. But there is no benefit to loading this redundant data.

By omitting a column for these schemas, it means we support schema evolution if the user ever adds a nested field to the empty schema.

For empty schemas with additionalProperties: true we retain the old behaviour of loading the original JSON as a string.

coveralls · 2024-04-19T21:06:01Z

Pull Request Test Coverage Report for Build 9003594613

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

For more information on this, see Tracking coverage changes with pull request builds.
To avoid this issue with future PRs, see these Recommended CI Configurations.
For a quick fix, rebase this PR at GitHub. Your next report should be accurate.

Details

47 of 50 (94.0%) changed or added relevant lines in 4 files are covered.
7 unchanged lines in 7 files lost coverage.
Overall coverage increased (+0.1%) to 80.668%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
modules/core/src/main/scala/com.snowplowanalytics/iglu.schemaddl/parquet/Field.scala	32	33	96.97%
modules/core/src/main/scala/com.snowplowanalytics/iglu.schemaddl/parquet/FieldValue.scala	0	2	0.0%

Files with Coverage Reduction	New Missed Lines	%
modules/core/src/main/scala/com.snowplowanalytics/iglu.schemaddl/jsonschema/circe/NumberCodecs.scala	1	70.37%
modules/core/src/main/scala/com.snowplowanalytics/iglu.schemaddl/parquet/Caster.scala	1	84.85%
modules/core/src/main/scala/com.snowplowanalytics/iglu.schemaddl/bigquery/Row.scala	1	71.79%
modules/core/src/main/scala/com.snowplowanalytics/iglu.schemaddl/parquet/Migrations.scala	1	90.67%
modules/core/src/main/scala/com.snowplowanalytics/iglu.schemaddl/jsonschema/package.scala	1	61.76%
modules/core/src/main/scala/com.snowplowanalytics/iglu.schemaddl/jsonschema/circe/StringCodecs.scala	1	64.29%
modules/core/src/main/scala/com.snowplowanalytics/iglu.schemaddl/jsonschema/mutate/Widened.scala	1	86.36%

Totals
Change from base Build 8434107272:	0.1%
Covered Lines:	1256
Relevant Lines:	1557

💛 - Coveralls

Snowplow users commonly use schemas with no nested fields. Previously, we were creating a string column and loading the string field `{}`. But there is no benefit to loading this redundant data. By omitting a column for these schemas, it means we support schema evolution if the user ever adds a nested field to the empty schema. For empty schemas with `additionalProperties: true` we retain the old behaviour of loading the original JSON as a string.

In modern snowplow loaders we have already switched to using Vector when manipulating data structures. We use vector because we commonly need to join together fields from different shcmeas. But even the new loader code still has lots of `list.toVector`. If we change to Vector in schema-ddl then we can eliminate a lot of those unnecessary conversions.

This concerns schemas like: ``` {"type": "object", "additionalProperties": false} ``` Older versions of schema-ddl would convert this to a schema type to String (JSON) parquet column. In snowplow/schema-ddl#205 we changed the behaviour so this schema is converted to a `None`, i.e. do not create a column for this schema. It was a good change for newer loaders (aside from RDB Loader). But that caused problems for RDB Loader under an edge-case scenario: If the schema above is evolved from `1-0-0` to `1-0-1` and the new schema adds a field to the schema, then RDB Loader tries to create a column for the new field. But that clashes with the old string column created with the older version of RDB Loader. This PR returns to the original behaviour of schema-ddl for this schemas with no explicit properties. It does so without us making any change to schema-ddl, so we still get all the benefits of snowplow/schema-ddl#205 for the other loaders.

istreeter force-pushed the omit-empty-struct branch 3 times, most recently from a738be5 to f72552f Compare April 20, 2024 15:35

istreeter mentioned this pull request Apr 20, 2024

Bump common-streams to 0.6.0 snowplow-incubator/snowplow-bigquery-loader#374

Merged

pondzix approved these changes May 8, 2024

View reviewed changes

istreeter added 2 commits May 8, 2024 15:28

istreeter force-pushed the omit-empty-struct branch from f72552f to 4c0505d Compare May 8, 2024 14:28

istreeter merged commit 4c0505d into develop May 8, 2024
1 check passed

istreeter mentioned this pull request Nov 29, 2024

Restore original schema-ddl behaviour for objects with no defined fields snowplow/snowplow-rdb-loader#1369

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Omit parquet field for a schema with no nested fields #205

Omit parquet field for a schema with no nested fields #205

istreeter commented Apr 19, 2024

coveralls commented Apr 19, 2024 •

edited

Loading

Omit parquet field for a schema with no nested fields #205

Omit parquet field for a schema with no nested fields #205

Conversation

istreeter commented Apr 19, 2024

coveralls commented Apr 19, 2024 • edited Loading

Pull Request Test Coverage Report for Build 9003594613

Warning: This coverage report may be inaccurate.

Details

💛 - Coveralls

coveralls commented Apr 19, 2024 •

edited

Loading