Releases: snowplow/snowplow-rdb-loader

4.1.0

04 Jul 12:07

Concurrent streaming transformers for horizontal scaling

Before version 4.1.0, it was only possible to run a single instance of the streaming transformer at any one time. Running multiple instances simultaneously caused a race condition, described in detail in a previous Discourse thread. The old setup worked well for low-volume pipelines, but it meant the streaming solution could not scale up to higher volumes.

In version 4.1.0 we have worked around the problem by changing the directory names in S3 to contain a UUID unique to each running transformer. Before version 4.1.0, an output directory might be named run=2022-05-01-00-00-14; in 4.1.0 the same directory might be named run=2022-05-01-00-00-14-b4cac3e5-9948-40e3-bd68-38abcf01cdf9. Directory names produced by the batch transformer are not affected.

With this simple change, you can now safely scale out your streaming transformer to have multiple instances running in parallel.
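The naming scheme can be sketched as follows. This is an illustrative Python snippet, not the transformer's actual code: it shows how a window timestamp and a per-instance UUID combine into a collision-free run directory name.

```python
import uuid
from datetime import datetime, timezone

def run_directory(window: datetime, instance_id: uuid.UUID) -> str:
    """Sketch of the 4.1.0 naming scheme: the window timestamp is
    suffixed with a UUID unique to the transformer instance, so two
    instances closing the same window never collide in S3."""
    return f"run={window:%Y-%m-%d-%H-%M-%S}-{instance_id}"

window = datetime(2022, 5, 1, 0, 0, 14, tzinfo=timezone.utc)
print(run_directory(window, uuid.uuid4()))
```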

Databricks loader supports generated columns

If you load into Databricks, a great way to set up your table is to partition based on the date of the event using a generated column:

CREATE TABLE IF NOT EXISTS snowplow.events (
  app_id                      VARCHAR(255),
  collector_tstamp            TIMESTAMP       NOT NULL,
  event_name                  VARCHAR(1000),
  -- Lots of other fields go here

  -- Collector timestamp date for partitioning
  collector_tstamp_date       DATE GENERATED ALWAYS AS (DATE(collector_tstamp))
)
PARTITIONED BY (collector_tstamp_date, event_name);

This partitioning strategy is very efficient for analytic queries that filter by collector_tstamp. The Snowplow/Databricks dbt web model works particularly well with this partitioning scheme.
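For example, a typical analytic query benefits from partition pruning when it filters on the generated date column (table and column names as in the DDL above; the query itself is illustrative):

```sql
-- Scans only one day's partitions instead of the whole table
SELECT event_name, COUNT(*) AS n_events
FROM snowplow.events
WHERE collector_tstamp_date = DATE '2022-05-01'
GROUP BY event_name;
```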

In RDB Loader version 4.1.0 we made a small change to the Databricks loader so it accounts for these generated columns when loading.

Upgrading to 4.1.0

If you are already using a recent version of RDB Loader (3.0.0 or higher), then upgrading to 4.1.0 is as simple as pulling the newest Docker images. No changes to your configuration files are needed.

docker pull snowplow/transformer-kinesis:4.1.0
docker pull snowplow/rdb-loader-redshift:4.1.0
docker pull snowplow/rdb-loader-snowflake:4.1.0
docker pull snowplow/rdb-loader-databricks:4.1.0

The Snowplow docs site has a full guide to running the RDB Loader.

Changelog

  • Databricks loader: Support for generated columns (#951)
  • Loader: Use explicit schema name everywhere (#952)
  • Loader: Jars cannot load jsch (#942)
  • Snowflake loader: region and account configuration fields should be optional (#947)
  • Loader: Include the SQLState when logging a SQLException (#941)
  • Loader: Handle run directories with UUID suffix in folder monitoring (#949)
  • Add UUID to streaming transformer directory structure (#945)

4.0.4

20 Jun 16:29

A patch release that makes transformer-kinesis more configurable via the HOCON file.

Changelog

  • Transformer kinesis: make Kinesis consumer more configurable (#865)
  • Transformer: split batch and streaming configs (#937)

4.0.3

16 Jun 19:20

A bug-fix release, which only affects the streaming transformer.

Changelog

  • Transformer kinesis: version 4.0.2 throws java.lang.InterruptedException: sleep interrupted (#938)

4.0.2

15 Jun 07:57

This patch release contains several improvements to make the loaders and the streaming transformer more resilient against failures. It also patches dependencies to their latest versions to mitigate security vulnerabilities.

Common

  • Set region in the SQS client builder (#587)
  • Common: Snyk action should only run on push to master (#929)

Loaders

  • Use forked version of jsch lib for ssh (#927)
  • Recover from exceptions on alerting webhook (#925)
  • Add logging around using SSH tunnel (#923)
  • Timeouts on JDBC statements (#914)
  • Bump snowflake-jdbc to 3.13.9 (#928)
  • Make ON_ERROR copy option configurable (#912)

Transformer Kinesis

  • Bump parquet-hadoop to 1.12.3 (#933)
  • Exclude hadoop transitive dependencies (#932)
  • Always end up in consistent state (#873)
  • No checkpointing until after SQS message is sent (#917)
  • Add missing hadoop-aws dependency for s3 parquet files upload (#920)

Batch Transformer

  • Add fileFormat field to formats section of example hocon (#848)

4.0.1

03 Jun 09:17

Common

  • Change http4s client backend to blaze-client (#905)

Loader Common

  • Fix sqs visibility extensions when processing retries (#908)

Databricks Loader

  • Bump Databricks JDBC driver to 2.6.25 (#910)

4.0.0

26 May 12:32

In this release, we are introducing our new Databricks Loader. The Databricks Loader loads Parquet-transformed data, so we've added wide row Parquet support to both the Batch Transformer and the Stream Transformer.

We've also included various improvements and bug fixes for the Stream Transformer, bringing it one step closer to production-ready.

Common

  • Change http4s client backend to async-http-client (#903)
  • Bump http4s to 0.21.33 (#902)

Loader

  • Add Databricks as a destination (#860)
  • Check if target is ready before submitting the statement (#846)
  • Emit latency statistics on constant intervals (#795)
  • Add load_tstamp (#815, #571)

Batch Transformer

  • Support Parquet output option (#896)

Transformer Kinesis

  • Support Parquet output option (#900)
  • Report metrics (#862)
  • Add telemetry (#863)
  • Write shredding_complete.json to S3 (#867)
  • Use output of transformation in updating global state (#824)
  • Fix updating total and bad number of events counter in global state (#823)
  • Add tests for whole processing pipeline (#835)
  • Fix passing checkpoint action during creation of windowed records (#762)

3.0.3

18 May 13:49

Common

  • Common: bump schema-ddl to 0.15.0 (#894)

Loader

  • Loader: bump version of load_succeeded schema to 3.0.0 (#889)

3.0.2

12 May 12:53

Common

  • Common: bump snowplow-scala-analytics-sdk to 3.0.1 (#872)
  • Common: publish arm64 and amd64 docker images (#875)
  • Common: publish distroless docker image (#877)
  • Common: bump jackson-databind to 2.13.2.2 (#879)

3.0.1

29 Apr 09:56

Snowflake Loader

  • Snowflake Loader: fix folder monitoring copy statement (#851)
  • Snowflake Loader: make default 'storage.type' Snowflake (#828)
  • Snowflake Loader: resume warehouse for each loading (#843)

3.0.0

01 Apr 15:16

Loader

  • Add Snowflake support (#792)
  • Support loading wide row (#791)
  • Extract redshift loader into a separate module (#790)
  • Modularize configuration to support multiple destinations (#789)

Transformer

  • Rename shredders to transformers (#793)
  • Batch: add invalid timestamp check (#652)
  • Batch: transform events to wide row (#649)
  • Kinesis: add invalid timestamp check (#659)
  • Kinesis: transform events to wide row (#650)
  • Batch: make it possible to disable spark caching via config (#808)
  • Batch: remove event validation (#805)