Address review feedback part2
pondzix committed May 10, 2024
1 parent f30c709 commit 647e935
Showing 4 changed files with 119 additions and 25 deletions.
@@ -66,5 +66,3 @@ https://github.com/snowplow-incubator/snowplow-bigquery-loader/blob/v2/config/co
</Tabs>

See the [configuration reference](/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/configuration-reference/index.md) for all possible configuration parameters.
-
-For more information on how events are stored in BigQuery, check the [mapping between Snowplow schemas and the corresponding BigQuery column types](/docs/storing-querying/schemas-in-warehouse/index.md?warehouse=bigquery).
@@ -147,7 +147,7 @@ The loader takes command line arguments `--config` with a path to the configurat
<CodeBlock language="bash">{
`docker run \\
-v /path/to/configs:/configs \\
-snowplow/snowplow-bigquery-streamloader:${versions.bqLoader} \\
+snowplow/snowplow-bigquery-streamloader:1.7.1 \\
--config=/configs/bigquery.hocon \\
--resolver=/configs/resolver.json
`}</CodeBlock>
@@ -157,7 +157,7 @@ Or you can pass the whole config as a base64-encoded string using the `--config`
<CodeBlock language="bash">{
`docker run \\
-v /path/to/resolver.json:/resolver.json \\
-snowplow/snowplow-bigquery-streamloader:${versions.bqLoader} \\
+snowplow/snowplow-bigquery-streamloader:1.7.1 \\
--config=ewogICJwcm9qZWN0SWQiOiAiY29tLWFjbWUiCgogICJsb2FkZXIiOiB7CiAgICAiaW5wdXQiOiB7CiAgICAgICJzdWJzY3JpcHRpb24iOiAiZW5yaWNoZWQtc3ViIgogICAgfQoKICAgICJvdXRwdXQiOiB7CiAgICAgICJnb29kIjogewogICAgICAgICJkYXRhc2V0SWQiOiAic25vd3Bsb3ciCiAgICAgICAgInRhYmxlSWQiOiAiZXZlbnRzIgogICAgICB9CgogICAgICAiYmFkIjogewogICAgICAgICJ0b3BpYyI6ICJiYWQtdG9waWMiCiAgICAgIH0KCiAgICAgICJ0eXBlcyI6IHsKICAgICAgICAidG9waWMiOiAidHlwZXMtdG9waWMiCiAgICAgIH0KCiAgICAgICJmYWlsZWRJbnNlcnRzIjogewogICAgICAgICJ0b3BpYyI6ICJmYWlsZWQtaW5zZXJ0cy10b3BpYyIKICAgICAgfQogICAgfQogIH0KCiAgIm11dGF0b3IiOiB7CiAgICAiaW5wdXQiOiB7CiAgICAgICJzdWJzY3JpcHRpb24iOiAidHlwZXMtc3ViIgogICAgfQoKICAgICJvdXRwdXQiOiB7CiAgICAgICJnb29kIjogJHtsb2FkZXIub3V0cHV0Lmdvb2R9ICMgd2lsbCBiZSBhdXRvbWF0aWNhbGx5IGluZmVycmVkCiAgICB9CiAgfQoKICAicmVwZWF0ZXIiOiB7CiAgICAiaW5wdXQiOiB7CiAgICAgICJzdWJzY3JpcHRpb24iOiAiZmFpbGVkLWluc2VydHMtc3ViIgogICAgfQoKICAgICJvdXRwdXQiOiB7CiAgICAgICJnb29kIjogJHtsb2FkZXIub3V0cHV0Lmdvb2R9ICMgd2lsbCBiZSBhdXRvbWF0aWNhbGx5IGluZmVycmVkCgogICAgICAiZGVhZExldHRlcnMiOiB7CiAgICAgICAgImJ1Y2tldCI6ICJnczovL2RlYWQtbGV0dGVyLWJ1Y2tldCIKICAgICAgfQogICAgfQogIH0KCiAgIm1vbml0b3JpbmciOiB7fSAjIGRpc2FibGVkCn0= \\
--resolver=/resolver.json
`}</CodeBlock>
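The base64 string itself can be produced from a HOCON file with standard tooling. A sketch (the `-w0` flag is GNU-specific; on macOS, pipe `base64` output through `tr -d '\n'` instead):

```bash
# Encode the HOCON config as a single-line base64 string for --config.
base64 -w0 /path/to/configs/bigquery.hocon
```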
@@ -169,7 +169,7 @@ For example, to override the `repeater.input.subscription` setting using system
<CodeBlock language="bash">{
`docker run \\
-v /path/to/configs:/configs \\
-snowplow/snowplow-bigquery-streamloader:${versions.bqLoader} \\
+snowplow/snowplow-bigquery-streamloader:1.7.1 \\
--config=/configs/bigquery.hocon \\
--resolver=/configs/resolver.json \\
-Drepeater.input.subscription="failed-inserts-sub"
@@ -180,7 +180,7 @@ Or to use environment variables for every setting:
<CodeBlock language="bash">{
`docker run \\
-v /path/to/resolver.json:/resolver.json \\
-snowplow/snowplow-bigquery-repeater:${versions.bqLoader} \\
+snowplow/snowplow-bigquery-repeater:1.7.1 \\
--resolver=/resolver.json \\
-Dconfig.override_with_env_vars=true
`}</CodeBlock>
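With `config.override_with_env_vars=true`, individual settings can then be supplied as environment variables. A sketch, assuming the standard Lightbend Config name mangling (a `CONFIG_FORCE_` prefix, with `_` standing in for `.` in the setting path):

```bash
# Hypothetical override of repeater.input.subscription via an env var.
docker run \
  -v /path/to/resolver.json:/resolver.json \
  -e CONFIG_FORCE_repeater_input_subscription="failed-inserts-sub" \
  snowplow/snowplow-bigquery-repeater:1.7.1 \
  --resolver=/resolver.json \
  -Dconfig.override_with_env_vars=true
```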
@@ -197,7 +197,7 @@ StreamLoader accepts `--config` and `--resolver` arguments, as well as any JVM s
<CodeBlock language="bash">{
`docker run \\
-v /path/to/configs:/configs \\
-snowplow/snowplow-bigquery-streamloader:${versions.bqLoader} \\
+snowplow/snowplow-bigquery-streamloader:1.7.1 \\
--config=/configs/bigquery.hocon \\
--resolver=/configs/resolver.json \\
-Dconfig.override_with_env_vars=true
@@ -212,7 +212,7 @@ The Dataflow Loader accepts the same two arguments as StreamLoader and [any oth
<CodeBlock language="bash">{
`docker run \\
-v /path/to/configs:/configs \\
-snowplow/snowplow-bigquery-loader:${versions.bqLoader} \\
+snowplow/snowplow-bigquery-loader:1.7.1 \\
--config=/configs/bigquery.hocon \\
--resolver=/configs/resolver.json \\
--labels={"key1":"val1","key2":"val2"} # optional Dataflow args
@@ -233,7 +233,7 @@ Mutator has three subcommands: `listen`, `create` and `add-column`.
<CodeBlock language="bash">{
`docker run \\
-v /path/to/configs:/configs \\
-snowplow/snowplow-bigquery-mutator:${versions.bqLoader} \\
+snowplow/snowplow-bigquery-mutator:1.7.1 \\
listen \\
--config=/configs/bigquery.hocon \\
--resolver=/configs/resolver.json \\
@@ -247,7 +247,7 @@ Mutator has three subcommands: `listen`, `create` and `add-column`.
<CodeBlock language="bash">{
`docker run \\
-v /path/to/configs:/configs \\
-snowplow/snowplow-bigquery-mutator:${versions.bqLoader} \\
+snowplow/snowplow-bigquery-mutator:1.7.1 \\
add-column \\
--config=/configs/bigquery.hocon \\
--resolver=/configs/resolver.json \\
@@ -264,7 +264,7 @@ The specified schema must be present in one of the Iglu registries in the resolv
<CodeBlock language="bash">{
`docker run \\
-v /path/to/configs:/configs \\
-snowplow/snowplow-bigquery-mutator:${versions.bqLoader} \\
+snowplow/snowplow-bigquery-mutator:1.7.1 \\
create \\
--config=/configs/bigquery.hocon \\
--resolver=/configs/resolver.json \\
@@ -281,7 +281,7 @@ We recommend constantly running Repeater on a small / cheap node or Docker conta
<CodeBlock language="bash">{
`docker run \\
-v /path/to/configs:/configs \\
-snowplow/snowplow-bigquery-repeater:${versions.bqLoader} \\
+snowplow/snowplow-bigquery-repeater:1.7.1 \\
--config=/configs/bigquery.hocon \\
--resolver=/configs/resolver.json \\
--bufferSize=20 \\ # size of the batch to send to the dead-letter bucket
@@ -297,19 +297,19 @@ We recommend constantly running Repeater on a small / cheap node or Docker conta
All applications are available as Docker images on Docker Hub, based on Ubuntu Focal and OpenJDK 11:

<CodeBlock language="bash">{
-`$ docker pull snowplow/snowplow-bigquery-streamloader:${versions.bqLoader}
-$ docker pull snowplow/snowplow-bigquery-loader:${versions.bqLoader}
-$ docker pull snowplow/snowplow-bigquery-mutator:${versions.bqLoader}
-$ docker pull snowplow/snowplow-bigquery-repeater:${versions.bqLoader}
+`$ docker pull snowplow/snowplow-bigquery-streamloader:1.7.1
+$ docker pull snowplow/snowplow-bigquery-loader:1.7.1
+$ docker pull snowplow/snowplow-bigquery-mutator:1.7.1
+$ docker pull snowplow/snowplow-bigquery-repeater:1.7.1
`}</CodeBlock>

-<p>We also provide an alternative lightweight set of images based on <a href="https://github.com/GoogleContainerTools/distroless">Google's "distroless" base image</a>, which may provide some security advantages for carrying fewer dependencies. These images are distinguished with the <code>{`${versions.bqLoader}-distroless`}</code> tag:</p>
+<p>We also provide an alternative lightweight set of images based on <a href="https://github.com/GoogleContainerTools/distroless">Google's "distroless" base image</a>, which may provide some security advantages for carrying fewer dependencies. These images are distinguished with the <code>{`1.7.1-distroless`}</code> tag:</p>

<CodeBlock language="bash">{
-`$ docker pull snowplow/snowplow-bigquery-streamloader:${versions.bqLoader}-distroless
-$ docker pull snowplow/snowplow-bigquery-loader:${versions.bqLoader}-distroless
-$ docker pull snowplow/snowplow-bigquery-mutator:${versions.bqLoader}-distroless
-$ docker pull snowplow/snowplow-bigquery-repeater:${versions.bqLoader}-distroless
+`$ docker pull snowplow/snowplow-bigquery-streamloader:1.7.1-distroless
+$ docker pull snowplow/snowplow-bigquery-loader:1.7.1-distroless
+$ docker pull snowplow/snowplow-bigquery-mutator:1.7.1-distroless
+$ docker pull snowplow/snowplow-bigquery-repeater:1.7.1-distroless
`}</CodeBlock>

Mutator, Repeater and Streamloader are also available as fatjar files attached to [releases](https://github.com/snowplow-incubator/snowplow-bigquery-loader/releases) in the project's Github repository.
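A fatjar can be run directly with a local JVM instead of Docker. A minimal sketch, assuming JRE 11+ and an illustrative jar file name (the actual name depends on the release asset you download):

```bash
# Run StreamLoader from a fatjar (file name is illustrative).
java -jar snowplow-bigquery-streamloader-1.7.1.jar \
  --config=/configs/bigquery.hocon \
  --resolver=/configs/resolver.json
```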
@@ -6,7 +6,7 @@ sidebar_position: 0

## Configuration

-The only breaking change from the 0.6.x series is the new format of the configuration file. That used to be a self-describing JSON but is now HOCON. Additionally, some app-specific command-line arguments have been incorporated into the config, such as Repeater's `--failedInsertsSub` option. For more details, see the [setup guide](/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/index.md#setup-guide) and
+The only breaking change from the 0.6.x series is the new format of the configuration file. That used to be a self-describing JSON but is now HOCON. Additionally, some app-specific command-line arguments have been incorporated into the config, such as Repeater's `--failedInsertsSub` option. For more details, see the [setup guide](/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/index.md#setup-guide) and [configuration reference](/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/previous-versions/bigquery-loader-1.x/configuration-reference/index.md).

Using Repeater as an example, if your configuration for 0.6.x looked like this:

@@ -5,18 +5,114 @@ sidebar_position: -20

## Configuration

-BigQuery Loader 2.0.0 brings changes to the loading setup. It is no longer neccessary to configure and deploy three independent applications (Loader, Repeator and Mutator in [1.X](/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/previous-versions/bigquery-loader-1.x/index.md)) in order to load your data to BigQuery.
+BigQuery Loader 2.0.0 brings changes to the loading setup. It is no longer necessary to configure and deploy three independent applications (Loader, Repeater and Mutator in [1.X](/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/previous-versions/bigquery-loader-1.x/index.md)) in order to load your data to BigQuery.
Starting from 2.0.0, only one application is needed, which naturally introduces some breaking changes to the configuration file structure.

See the [configuration reference](/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/configuration-reference/index.md) for all possible configuration parameters
and the minimal [configuration samples](https://github.com/snowplow-incubator/snowplow-bigquery-loader/blob/v2/config) for each of the supported cloud environments.
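For orientation, a minimal configuration might look roughly like the following. This is an illustrative sketch only: the field names are assumptions and should be verified against the configuration reference and the linked samples.

```json
{
  // Illustrative sketch; verify all field names against the reference.
  "input": {
    "subscription": "projects/my-project/subscriptions/snowplow-enriched"
  }
  "output": {
    "good": {
      "project": "my-project"
      "dataset": "snowplow"
      "table": "events"
    }
    "bad": {
      "topic": "projects/my-project/topics/snowplow-bad"
    }
  }
}
```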

## Infrastructure

In addition to Repeater and Mutator, other infrastructure components have become obsolete:
* The `types` PubSub topic connecting Loader and Mutator.
* The `failedInserts` PubSub topic connecting Loader and Repeater.
* The `deadLetter` GCS bucket used by Repeater to store data that repeatedly failed to be inserted into BigQuery.

## Events table format

-Starting from 2.0.0, BigQuery Loader no longer uses full schema version in column names for self-describing events and entities in `events` table. It uses only major schema version in the column name instead.
+Starting from 2.0.0, BigQuery Loader changes its output column naming strategy. For example, for the [ad_click event](https://github.com/snowplow/iglu-central/blob/master/schemas/com.snowplowanalytics.snowplow.media/ad_click_event/jsonschema/1-0-0):

* Before the upgrade, the new column would be named `unstruct_event_com_snowplowanalytics_snowplow_media_ad_click_event_1_0_0`.
* After the upgrade, the new column will be named `unstruct_event_com_snowplowanalytics_snowplow_media_ad_click_event_1`.

This means all new self-describing events and entities will be loaded into major-version columns, while the old full-version columns remain unchanged: they are no longer used by the new loader and have no direct effect on loading. If necessary, old columns need to be consolidated separately. If you are [modeling your data with dbt](/docs/modeling-your-data/modeling-your-data-with-dbt/index.md), you can use [this macro](https://github.com/snowplow/dbt-snowplow-utils#combine_column_versions-source) to aggregate the data across multiple columns.
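For ad-hoc consolidation outside of dbt, a query along the following lines can read a field across both generations of columns. This is only a sketch, run here through the `bq` CLI; the table name and the `percent_progress` field are illustrative:

```bash
# Read a field from the new major-version column, falling back to the
# old full-version column (all names are illustrative).
bq query --use_legacy_sql=false '
SELECT
  event_id,
  COALESCE(
    unstruct_event_com_snowplowanalytics_snowplow_media_ad_click_event_1.percent_progress,
    unstruct_event_com_snowplowanalytics_snowplow_media_ad_click_event_1_0_0.percent_progress
  ) AS percent_progress
FROM `my-project.snowplow.events`
LIMIT 10'
```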

Before 2.0.0, breaking changes introduced within the same schema family (schemas sharing the same major version) had no impact on your `events` table. Starting from 2.0.0, the loader tries to merge all changes from the same schema family and load them into a single column (with a major version suffix). In case of breaking changes, the loader creates recovery columns so that it can still load all of your data, even events referencing 'broken' schemas.

## Recovery columns

### What is schema evolution?

One of Snowplow’s key features is the ability to [define custom schemas and validate events against them](https://docs.snowplow.io/docs/understanding-tracking-design/understanding-schemas-and-validation/). Over time, users often evolve the schemas, e.g. by adding new fields or changing existing fields. To accommodate these changes, BigQuery Loader 2.0.0 automatically adjusts the database tables in the warehouse accordingly.

There are two main types of schema changes:

**Breaking**: The schema version has to be changed in a major way (`1-2-3` → `2-0-0`). As of BigQuery Loader 2.0.0, each major schema version has its own column (`..._1`, `..._2`, etc., for example: `contexts_com_snowplowanalytics_ad_click_1`).

**Non-breaking**: The schema version can be changed in a minor way (`1-2-3` → `1-3-0` or `1-2-3` → `1-2-4`). Data is stored in the same database column.

### Without recovery columns

The loader tries to format the incoming data according to the latest version of the schema it has seen (for a given major version, e.g. `1-*-*`). For example, if a batch contains events with schema versions `1-0-0`, `1-0-1` and `1-0-2`, the loader derives the output schema based on version `1-0-2`. Then the loader instructs BigQuery to adjust the database column and load the data.

This logic relies on two assumptions:

1. **Old events compatible with new schemas.** Events with older schema versions, e.g. `1-0-0` and `1-0-1`, have to be valid against the newer ones, e.g. `1-0-2`. Those that are not valid will result in failed events.

2. **Old columns compatible with new schemas.** The corresponding BigQuery columns have to be migrated correctly from one version to another. Changes such as altering the type of a field from `integer` to `string` would fail: loading would break with SQL errors, and the whole batch would be stuck and hard to recover.

These assumptions are not always clear to the users, making the process error-prone.

### With recovery columns

First, we support schema evolution that’s not strictly backwards compatible (although we still recommend against it since it can confuse downstream consumers of the data). This is done by _merging_ multiple schemas so that both old and new events can coexist. For example, suppose we have these two schemas:

```json
{
  // 1-0-0
  "properties": {
    "a": {"type": "integer"}
  }
}
```

```json
{
  // 1-0-1
  "properties": {
    "b": {"type": "integer"}
  }
}
```

These would be merged into the following:

```json
{
  // merged
  "properties": {
    "a": {"type": "integer"},
    "b": {"type": "integer"}
  }
}
```

Second, the loader does not fail when it can’t modify the database column to store both old and new events. (As a reminder, an example would be changing the type of a field from `integer` to `string`.) Instead, it creates a _temporary_ column for the new data as an exception. Users can then run SQL statements to resolve the situation as they see fit. For instance, consider these two schemas:

```json
{
  // 1-0-0
  "properties": {
    "a": {"type": "integer"}
  }
}
```

```json
{
  // 1-0-1
  "properties": {
    "a": {"type": "string"}
  }
}
```

Because `1-0-1` events cannot be loaded into the same column with `1-0-0`, the data would be put in a separate column, e.g. `contexts_com_snowplowanalytics_ad_click_1_0_1_recovered_9999999`, where:
- `1_0_1` is the version of the offending schema;
- `9999999` is a hash code unique to the schema (i.e. it will change if the schema is overwritten with a different one).

If you create a new schema `1-0-2` that reverts the offending changes and is again compatible with `1-0-0`, the data for events with that schema will be written to the original column as expected.
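To gauge whether any events actually landed in a recovery column, a query like the following can help. This is a sketch via the `bq` CLI; the column name, including the hash suffix, is purely illustrative:

```bash
# Count events sitting in a hypothetical recovery column.
bq query --use_legacy_sql=false '
SELECT COUNT(*) AS recovered_events
FROM `my-project.snowplow.events`
WHERE ARRAY_LENGTH(contexts_com_snowplowanalytics_ad_click_1_0_1_recovered_9999999) > 0'
```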

### Notes

- If no events with incorrectly evolved schemas arrive, the recovery column is not created. (You can check for recovery columns directly, as the sketch below shows.)
- It is still possible to break loading by overwriting version `1-0-0` of the schema. Please avoid doing that.

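To list any recovery columns present in your events table, you can query BigQuery's `INFORMATION_SCHEMA`. A sketch, with illustrative project, dataset and table names:

```bash
# List recovery columns in the events table.
bq query --use_legacy_sql=false <<'SQL'
SELECT column_name
FROM `my-project.snowplow`.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'events'
  AND column_name LIKE '%recovered%'
SQL
```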
You can read more about schema evolution and how recovery columns work [here](/docs/storing-querying/schemas-in-warehouse/?warehouse=bigquery#versioning).
