diff --git a/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/index.md b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/index.md index c184ae0f40..e8ac06f294 100644 --- a/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/index.md +++ b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/index.md @@ -66,5 +66,3 @@ https://github.com/snowplow-incubator/snowplow-bigquery-loader/blob/v2/config/co See the [configuration reference](/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/configuration-reference/index.md) for all possible configuration parameters. - -For more information on how events are stored in BigQuery, check the [mapping between Snowplow schemas and the corresponding BigQuery column types](/docs/storing-querying/schemas-in-warehouse/index.md?warehouse=bigquery). diff --git a/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/previous-versions/bigquery-loader-1.x/index.md b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/previous-versions/bigquery-loader-1.x/index.md index 00ad2ce692..dd4aae832d 100644 --- a/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/previous-versions/bigquery-loader-1.x/index.md +++ b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/previous-versions/bigquery-loader-1.x/index.md @@ -147,7 +147,7 @@ The loader takes command line arguments `--config` with a path to the configurat { `docker run \\ -v /path/to/configs:/configs \\ - snowplow/snowplow-bigquery-streamloader:${versions.bqLoader} \\ + snowplow/snowplow-bigquery-streamloader:1.7.1 \\ --config=/configs/bigquery.hocon \\ --resolver=/configs/resolver.json `} @@ -157,7 +157,7 @@ Or you can pass the whole config as a base64-encoded string using the `--config` { `docker run \\ -v 
/path/to/resolver.json:/resolver.json \\ - snowplow/snowplow-bigquery-streamloader:${versions.bqLoader} \\ + snowplow/snowplow-bigquery-streamloader:1.7.1 \\ --config=ewogICJwcm9qZWN0SWQiOiAiY29tLWFjbWUiCgogICJsb2FkZXIiOiB7CiAgICAiaW5wdXQiOiB7CiAgICAgICJzdWJzY3JpcHRpb24iOiAiZW5yaWNoZWQtc3ViIgogICAgfQoKICAgICJvdXRwdXQiOiB7CiAgICAgICJnb29kIjogewogICAgICAgICJkYXRhc2V0SWQiOiAic25vd3Bsb3ciCiAgICAgICAgInRhYmxlSWQiOiAiZXZlbnRzIgogICAgICB9CgogICAgICAiYmFkIjogewogICAgICAgICJ0b3BpYyI6ICJiYWQtdG9waWMiCiAgICAgIH0KCiAgICAgICJ0eXBlcyI6IHsKICAgICAgICAidG9waWMiOiAidHlwZXMtdG9waWMiCiAgICAgIH0KCiAgICAgICJmYWlsZWRJbnNlcnRzIjogewogICAgICAgICJ0b3BpYyI6ICJmYWlsZWQtaW5zZXJ0cy10b3BpYyIKICAgICAgfQogICAgfQogIH0KCiAgIm11dGF0b3IiOiB7CiAgICAiaW5wdXQiOiB7CiAgICAgICJzdWJzY3JpcHRpb24iOiAidHlwZXMtc3ViIgogICAgfQoKICAgICJvdXRwdXQiOiB7CiAgICAgICJnb29kIjogJHtsb2FkZXIub3V0cHV0Lmdvb2R9ICMgd2lsbCBiZSBhdXRvbWF0aWNhbGx5IGluZmVycmVkCiAgICB9CiAgfQoKICAicmVwZWF0ZXIiOiB7CiAgICAiaW5wdXQiOiB7CiAgICAgICJzdWJzY3JpcHRpb24iOiAiZmFpbGVkLWluc2VydHMtc3ViIgogICAgfQoKICAgICJvdXRwdXQiOiB7CiAgICAgICJnb29kIjogJHtsb2FkZXIub3V0cHV0Lmdvb2R9ICMgd2lsbCBiZSBhdXRvbWF0aWNhbGx5IGluZmVycmVkCgogICAgICAiZGVhZExldHRlcnMiOiB7CiAgICAgICAgImJ1Y2tldCI6ICJnczovL2RlYWQtbGV0dGVyLWJ1Y2tldCIKICAgICAgfQogICAgfQogIH0KCiAgIm1vbml0b3JpbmciOiB7fSAjIGRpc2FibGVkCn0= \\ --resolver=/resolver.json `} @@ -169,7 +169,7 @@ For example, to override the `repeater.input.subscription` setting using system { `docker run \\ -v /path/to/configs:/configs \\ - snowplow/snowplow-bigquery-streamloader:${versions.bqLoader} \\ + snowplow/snowplow-bigquery-streamloader:1.7.1 \\ --config=/configs/bigquery.hocon \\ --resolver=/configs/resolver.json \\ -Drepeater.input.subscription="failed-inserts-sub" @@ -180,7 +180,7 @@ Or to use environment variables for every setting: { `docker run \\ -v /path/to/resolver.json:/resolver.json \\ - snowplow/snowplow-bigquery-repeater:${versions.bqLoader} \\ + snowplow/snowplow-bigquery-repeater:1.7.1 \\ --resolver=/resolver.json \\ 
-Dconfig.override_with_env_vars=true `} @@ -197,7 +197,7 @@ StreamLoader accepts `--config` and `--resolver` arguments, as well as any JVM s { `docker run \\ -v /path/to/configs:/configs \\ - snowplow/snowplow-bigquery-streamloader:${versions.bqLoader} \\ + snowplow/snowplow-bigquery-streamloader:1.7.1 \\ --config=/configs/bigquery.hocon \\ --resolver=/configs/resolver.json \\ -Dconfig.override_with_env_vars=true @@ -212,7 +212,7 @@ The Dataflow Loader accepts the same two arguments as StreamLoader and [any oth { `docker run \\ -v /path/to/configs:/configs \\ - snowplow/snowplow-bigquery-loader:${versions.bqLoader} \\ + snowplow/snowplow-bigquery-loader:1.7.1 \\ --config=/configs/bigquery.hocon \\ --resolver=/configs/resolver.json \\ --labels={"key1":"val1","key2":"val2"} # optional Dataflow args @@ -233,7 +233,7 @@ Mutator has three subcommands: `listen`, `create` and `add-column`. { `docker run \\ -v /path/to/configs:/configs \\ - snowplow/snowplow-bigquery-mutator:${versions.bqLoader} \\ + snowplow/snowplow-bigquery-mutator:1.7.1 \\ listen \\ --config=/configs/bigquery.hocon \\ --resolver=/configs/resolver.json \\ @@ -247,7 +247,7 @@ Mutator has three subcommands: `listen`, `create` and `add-column`. 
{ `docker run \\ -v /path/to/configs:/configs \\ - snowplow/snowplow-bigquery-mutator:${versions.bqLoader} \\ + snowplow/snowplow-bigquery-mutator:1.7.1 \\ add-column \\ --config=/configs/bigquery.hocon \\ --resolver=/configs/resolver.json \\ @@ -264,7 +264,7 @@ The specified schema must be present in one of the Iglu registries in the resolv { `docker run \\ -v /path/to/configs:/configs \\ - snowplow/snowplow-bigquery-mutator:${versions.bqLoader} \\ + snowplow/snowplow-bigquery-mutator:1.7.1 \\ create \\ --config=/configs/bigquery.hocon \\ --resolver=/configs/resolver.json \\ @@ -281,7 +281,7 @@ We recommend constantly running Repeater on a small / cheap node or Docker conta { `docker run \\ -v /path/to/configs:/configs \\ - snowplow/snowplow-bigquery-repeater:${versions.bqLoader} \\ + snowplow/snowplow-bigquery-repeater:1.7.1 \\ --config=/configs/bigquery.hocon \\ --resolver=/configs/resolver.json \\ --bufferSize=20 \\ # size of the batch to send to the dead-letter bucket @@ -297,19 +297,19 @@ We recommend constantly running Repeater on a small / cheap node or Docker conta All applications are available as Docker images on Docker Hub, based on Ubuntu Focal and OpenJDK 11: { -`$ docker pull snowplow/snowplow-bigquery-streamloader:${versions.bqLoader} -$ docker pull snowplow/snowplow-bigquery-loader:${versions.bqLoader} -$ docker pull snowplow/snowplow-bigquery-mutator:${versions.bqLoader} -$ docker pull snowplow/snowplow-bigquery-repeater:${versions.bqLoader} +`$ docker pull snowplow/snowplow-bigquery-streamloader:1.7.1 +$ docker pull snowplow/snowplow-bigquery-loader:1.7.1 +$ docker pull snowplow/snowplow-bigquery-mutator:1.7.1 +$ docker pull snowplow/snowplow-bigquery-repeater:1.7.1 `} -

We also provide an alternative lightweight set of images based on Google's "distroless" base image, which may provide some security advantages for carrying fewer dependencies. These images are distinguished with the {`${versions.bqLoader}-distroless`} tag:

+

We also provide an alternative lightweight set of images based on Google's "distroless" base image, which may provide some security advantages for carrying fewer dependencies. These images are distinguished with the {`1.7.1-distroless`} tag:

{ -`$ docker pull snowplow/snowplow-bigquery-streamloader:${versions.bqLoader}-distroless -$ docker pull snowplow/snowplow-bigquery-loader:${versions.bqLoader}-distroless -$ docker pull snowplow/snowplow-bigquery-mutator:${versions.bqLoader}-distroless -$ docker pull snowplow/snowplow-bigquery-repeater:${versions.bqLoader}-distroless +`$ docker pull snowplow/snowplow-bigquery-streamloader:1.7.1-distroless +$ docker pull snowplow/snowplow-bigquery-loader:1.7.1-distroless +$ docker pull snowplow/snowplow-bigquery-mutator:1.7.1-distroless +$ docker pull snowplow/snowplow-bigquery-repeater:1.7.1-distroless `} Mutator, Repeater and Streamloader are also available as fatjar files attached to [releases](https://github.com/snowplow-incubator/snowplow-bigquery-loader/releases) in the project's Github repository. diff --git a/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/upgrade-guides/1-0-x-upgrade-guide/index.md b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/upgrade-guides/1-0-x-upgrade-guide/index.md index 8c5eacb261..88e1871849 100644 --- a/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/upgrade-guides/1-0-x-upgrade-guide/index.md +++ b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/upgrade-guides/1-0-x-upgrade-guide/index.md @@ -6,7 +6,7 @@ sidebar_position: 0 ## Configuration -The only breaking change from the 0.6.x series is the new format of the configuration file. That used to be a self-describing JSON but is now HOCON. Additionally, some app-specific command-line arguments have been incorporated into the config, such as Repeater's `--failedInsertsSub` option. For more details, see the [setup guide](/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/index.md#setup-guide) and +The only breaking change from the 0.6.x series is the new format of the configuration file. 
That used to be a self-describing JSON but is now HOCON. Additionally, some app-specific command-line arguments have been incorporated into the config, such as Repeater's `--failedInsertsSub` option. For more details, see the [setup guide](/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/index.md#setup-guide) and [configuration reference](/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/previous-versions/bigquery-loader-1.x/configuration-reference/index.md). Using Repeater as an example, if your configuration for 0.6.x looked like this: diff --git a/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/upgrade-guides/2-0-0-upgrade-guide/index.md b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/upgrade-guides/2-0-0-upgrade-guide/index.md index bcd6cfc7d2..3f0037cf75 100644 --- a/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/upgrade-guides/2-0-0-upgrade-guide/index.md +++ b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/upgrade-guides/2-0-0-upgrade-guide/index.md @@ -5,18 +5,114 @@ sidebar_position: -20 ## Configuration -BigQuery Loader 2.0.0 brings changes to the loading setup. It is no longer neccessary to configure and deploy three independent applications (Loader, Repeator and Mutator in [1.X](/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/previous-versions/bigquery-loader-1.x/index.md)) in order to load your data to BigQuery. +BigQuery Loader 2.0.0 brings changes to the loading setup. It is no longer necessary to configure and deploy three independent applications (Loader, Repeater and Mutator in [1.X](/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/previous-versions/bigquery-loader-1.x/index.md)) in order to load your data to BigQuery. 
Starting from 2.0.0 only one appliction is needed, which naturally introduces some breaking changes to the configuration file structure. See the [configuration reference](/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/configuration-reference/index.md) for all possible configuration parameters and the minimal [configuration samples](https://github.com/snowplow-incubator/snowplow-bigquery-loader/blob/v2/config) for each of supported cloud environments. +## Infrastructure + +Apart from Repeater and Mutator, there are also other infrastructure components that have become obsolete: +* The `types` PubSub topic connecting Loader and Mutator. +* The `failedInserts` PubSub topic connecting Loader and Repeater. +* The `deadLetter` GCS bucket used by Repeater to store data that repeatedly failed to be inserted into BigQuery. + ## Events table format -Starting from 2.0.0, BigQuery Loader no longer uses full schema version in column names for self-describing events and entities in `events` table. It uses only major schema version in the column name instead. +Starting from 2.0.0, BigQuery Loader changes its output column naming strategy. For example, for the [ad_click event](https://github.com/snowplow/iglu-central/blob/master/schemas/com.snowplowanalytics.snowplow.media/ad_click_event/jsonschema/1-0-0): + +* Before the upgrade, a new column would be named `unstruct_event_com_snowplowanalytics_snowplow_media_ad_click_event_1_0_0`. +* After the upgrade, a new column will be named `unstruct_event_com_snowplowanalytics_snowplow_media_ad_click_event_1`. This means all new self-describing events and entities will be loaded to 'major version' - oriented columns, whereas old 'full version' - oriented columns remain unchanged, are no longer used by the new loader, and have no direct effect on loading. If neccessary, old columns need to be consolidated separately. 
If you are [modeling your data with dbt](/docs/modeling-your-data/modeling-your-data-with-dbt/index.md), you can use [this macro](https://github.com/snowplow/dbt-snowplow-utils#combine_column_versions-source) to aggregate the data across multiple columns. -Before 2.0.0, breaking changes introduced for the same schema family (shemas using the same major version) had no impact on your `events` table. Starting from 2.0.0, loader tries to merge all changes from the same schema family and load to the single column (with major version suffix). In case of breaking changes, loader creates recovery columns to try to load all your data, even the one referencing 'broken' schemas. You can read more about schema evolution and how recovery columns work [here](/docs/storing-querying/schemas-in-warehouse/?warehouse=bigquery#versioning). +## Recovery columns + +### What is schema evolution? + +One of Snowplow’s key features is the ability to [define custom schemas and validate events against them](https://docs.snowplow.io/docs/understanding-tracking-design/understanding-schemas-and-validation/). Over time, users often evolve the schemas, e.g. by adding new fields or changing existing fields. To accommodate these changes, BigQuery Loader 2.0.0 automatically adjusts the database tables in the warehouse accordingly. + +There are two main types of schema changes: + +**Breaking**: The schema version has to be changed in a major way (`1-2-3` → `2-0-0`). As of BigQuery Loader 2.0.0, each major schema version has its own column (`..._1`, `..._2`, etc.; for example, `contexts_com_snowplowanalytics_ad_click_1`). + +**Non-breaking**: The schema version can be changed in a minor way (`1-2-3` → `1-3-0` or `1-2-3` → `1-2-4`). Data is stored in the same database column. + +### Without recovery columns + +The loader tries to format the incoming data according to the latest version of the schema it saw (for a given major version, e.g. `1-*-*`). 
For example, if a batch contains events with schema versions `1-0-0`, `1-0-1` and `1-0-2`, the loader derives the output schema based on version `1-0-2`. Then the loader instructs BigQuery to adjust the database column and load the data. + +This logic relies on two assumptions: + +1. **Old events compatible with new schemas.** Events with older schema versions, e.g. `1-0-0` and `1-0-1`, have to be valid against the newer ones, e.g. `1-0-2`. Those that are not valid will result in failed events. + +2. **Old columns compatible with new schemas.** The corresponding BigQuery columns have to be migrated correctly from one version to another. Changes, such as altering the type of a field from `integer` to `string`, would fail. Loading would break with SQL errors and the whole batch would be stuck and hard to recover. + +These assumptions are not always clear to the users, making the process error-prone. + +### With recovery columns + +First, we support schema evolution that’s not strictly backwards compatible (although we still recommend against it since it can confuse downstream consumers of the data). This is done by _merging_ multiple schemas so that both old and new events can coexist. For example, suppose we have these two schemas: + +```json +{ + // 1-0-0 + "properties": { + "a": {"type": "integer"} + } +} +``` + +```json +{ + // 1-0-1 + "properties": { + "b": {"type": "integer"} + } +} +``` + +These would be merged into the following: +```json +{ + // merged + "properties": { + "a": {"type": "integer"}, + "b": {"type": "integer"} + } +} +``` + + +Second, the loader does not fail when it can’t modify the database column to store both old and new events. (As a reminder, an example would be changing the type of a field from `integer` to `string`.) Instead, it creates a _temporary_ column for the new data as an exception. The users can then run SQL statements to resolve this situation as they see fit. 
For instance, consider these two schemas: +```json +{ + // 1-0-0 + "properties": { + "a": {"type": "integer"} + } +} +``` + +```json +{ + // 1-0-1 + "properties": { + "a": {"type": "string"} + } +} +``` + +Because `1-0-1` events cannot be loaded into the same column as `1-0-0` events, the data would be put in a separate column, e.g. `contexts_com_snowplowanalytics_ad_click_1_0_1_recovered_9999999`, where: + - `1_0_1` is the version of the offending schema; + - `9999999` is a hash code unique to the schema (i.e. it will change if the schema is overwritten with a different one). + +If you create a new schema `1-0-2` that reverts the offending changes and is again compatible with `1-0-0`, the data for events with that schema will be written to the original column as expected. + +### Notes +- The recovery column is only created if events with incorrectly evolved schemas actually arrive. +- It is still possible to break loading by overwriting version `1-0-0` of the schema. Please avoid doing that. + +You can read more about schema evolution and how recovery columns work [here](/docs/storing-querying/schemas-in-warehouse/?warehouse=bigquery#versioning).
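
The naming and merging rules described in this upgrade guide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the loader's actual implementation: the function names (`major_version_column`, `merge_properties`, `recovery_column`) are hypothetical, the merge only compares the top-level `type` of each property, and the hash is passed in by the caller because the guide does not say how it is computed.

```python
import re


def major_version_column(vendor, name, version, prefix="unstruct_event"):
    """Build a column name that keeps only the major part of the version."""
    major = version.split("-")[0]
    # Replace every run of non-alphanumeric characters with an underscore.
    snake = re.sub(r"[^A-Za-z0-9]+", "_", f"{vendor}_{name}").lower()
    return f"{prefix}_{snake}_{major}"


def merge_properties(old, new):
    """Merge two 'properties' maps and list the fields whose type changed."""
    merged = dict(old)
    conflicts = []
    for field, spec in new.items():
        if field in merged and merged[field]["type"] != spec["type"]:
            # Incompatible change: this data would go to a recovery column.
            conflicts.append(field)
        else:
            merged[field] = spec
    return merged, conflicts


def recovery_column(base, version, schema_hash):
    """Name a recovery column; schema_hash is a placeholder supplied by the caller."""
    return f"{base}_{version.replace('-', '_')}_recovered_{schema_hash}"


if __name__ == "__main__":
    print(major_version_column(
        "com.snowplowanalytics.snowplow.media", "ad_click_event", "1-0-0"))
    # Compatible evolution: both fields end up in the merged schema.
    print(merge_properties({"a": {"type": "integer"}},
                           {"b": {"type": "integer"}}))
    # Incompatible evolution: field "a" is reported as a conflict.
    print(merge_properties({"a": {"type": "integer"}},
                           {"a": {"type": "string"}}))
    print(recovery_column("contexts_com_snowplowanalytics_ad_click",
                          "1-0-1", 9999999))
```

A non-empty conflict list corresponds to the case where the loader would write the new data to a recovery column instead of altering the existing one.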