Releases: snowplow/snowplow-rdb-loader

5.6.2

10 Jul 10:05

Fixes a regression which, under rare circumstances, caused exceptions like:

Load failed and will not be retried: [Amazon](500310) Invalid operation: cannot alter column "xyz" of relation "com_example_foo_2", target column size should be different; = SqlState: 0A000: [Amazon](500310) Invalid operation: cannot alter column "xyz" of relation "com_example_foo_2", target column size should be different;

Changelog

  • Fix pattern matching on known exception for alter table failures (#1283)

5.6.1

28 Jun 14:16

A patch release to address small bugs which crept in with the 5.5.x series. These bugs only affect pipelines using SSH tunnels or pipelines sending failed events to Kinesis from the batch transformer.

Changelog

  • Loader: fix "dispatcher is shutdown" error when setting up SSH tunnel (#1278)
  • Batch transformer: use singleton badrows sink (#1274)
  • Batch transformer: custom iterator returning good data only (#1272)
  • Common: replace release-manager with s3-sync-action (#1152)

5.6.0

08 Jun 07:41

Starting with this version, loaders automatically create the database schema specified in your config on initialization, if it doesn't already exist. No further configuration is needed to enable this.

For this feature to work, the loader's database user needs permission to create schemas. If the user doesn't have the necessary permission, the loader simply skips this step; in that case, you will need to create the schema manually before running the loader.
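
For example, on Redshift you could grant this permission with a statement like the following (an illustrative sketch; the connection details, database name "snowplow" and user "loader_user" are hypothetical):

# Grant the loader's database user the right to create schemas (hypothetical names)
psql "host=my-cluster.example.redshift.amazonaws.com port=5439 dbname=snowplow user=admin" \
  -c 'GRANT CREATE ON DATABASE snowplow TO loader_user;'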

This feature only affects new deployments. If you are already successfully running the loader, nothing will change.

Changelog

  • Loader: Create the database schema on startup (#1266)
  • Stream transformer: Use Http4s client for iglu lookups (#1258)

5.5.0

26 May 08:15

Config parsing improvements

Before version 5.5.0, the only way of passing configuration to the application was to provide Base64-encoded HOCON (for the application config) and JSON (for the Iglu resolver config) as command-line options.

Starting from version 5.5.0, it's possible to provide a full path to the configuration files instead. Here is an example that mounts a config directory into the docker container at run time:

docker run \
  -v /path/to/config:/myconfig \
  snowplow/rdb-loader-redshift:5.5.0 \
  --config /myconfig/loader.hocon \
  --iglu-config /myconfig/resolver.json

It's no longer necessary to use Base64-encoded strings on the command line, but to preserve compatibility the old way of configuring is still supported, as sketched below.
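
For reference, the old style looked roughly like this (a sketch; base64 -w0 is the GNU coreutils flag for single-line output):

# Encode the config files on the host and pass them as command-line options
docker run snowplow/rdb-loader-redshift:5.5.0 \
  --config "$(base64 -w0 /path/to/config/loader.hocon)" \
  --iglu-config "$(base64 -w0 /path/to/config/resolver.json)"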

What's more, it's now possible to provide a HOCON file for the Iglu resolver configuration, just as for the application configuration. This is important because it lets you use all the great features of the HOCON format for Iglu as well, such as environment variable resolution. A plain JSON file is still supported.
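
For instance, a resolver HOCON file could read an Iglu repository API key from the environment (a minimal sketch; the repository details and the IGLU_API_KEY variable are hypothetical):

{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-3"
  "data": {
    "cacheSize": 500
    "repositories": [
      {
        "name": "My private repo"
        "priority": 0
        "vendorPrefixes": [ "com.example" ]
        "connection": {
          "http": {
            "uri": "https://iglu.example.com/api"
            # HOCON substitutes this from the environment at run time
            "apikey": ${IGLU_API_KEY}
          }
        }
      }
    ]
  }
}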

These changes apply to all the loader (Redshift, Snowflake, Databricks) and transformer (batch, streaming) applications.

Improved robustness of the loader

We've made quite a few small under-the-hood improvements which we hope will make the loader more resilient to transient failures. We identified some of the most common edge-case error scenarios in which previous versions of the loader might hit an error, e.g. due to a stale connection or a network issue. The changes include better handling of old connections and retrying on transient failures.

Batch Transformer: transform_duration metric

The batch transformer can now send a new metric to CloudWatch, if configured: transform_duration, the time taken to transform an input folder.

Upgrading

If you are already using a recent version of RDB Loader (3.0.0 or higher) then upgrading to 5.5.0 is as simple as pulling the newest docker images.
There are no changes needed to your configuration files.

docker pull snowplow/rdb-loader-redshift:5.5.0
docker pull snowplow/rdb-loader-snowflake:5.5.0
docker pull snowplow/rdb-loader-databricks:5.5.0
docker pull snowplow/transformer-pubsub:5.5.0
docker pull snowplow/transformer-kinesis:5.5.0

Starting from this version, the batch transformer requires Java 11 on EMR (the default is Java 8), for instance by running this script as a bootstrap action (it needs to be stored on S3):

#!/bin/bash

set -e

sudo update-alternatives --set java /usr/lib/jvm/java-11-amazon-corretto.x86_64/bin/java

exit 0
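
For example, you could upload the script to S3 and reference it when creating the EMR cluster (an illustrative sketch; the bucket name, script name and cluster options are hypothetical):

# Upload the bootstrap script
aws s3 cp ./use-java-11.sh s3://my-bucket/bootstrap/use-java-11.sh

# Create the cluster with the bootstrap action attached
# (other cluster options shown here are placeholders)
aws emr create-cluster \
  --release-label emr-6.10.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --bootstrap-actions Path=s3://my-bucket/bootstrap/use-java-11.sh,Name="Use Java 11"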

The Snowplow docs website has a full guide to running RDB Loader and the transformer.

Changelog

  • Bump Snowflake driver to 3.13.30 (#1256)
  • Upgrade Databricks JDBC driver (#1254)
  • Config parsing improvements (#1252)
  • Loader: limit the total time spent retrying a failed load (#1251)
  • Loader: do not skip batches on warehouse connection failures (#1250)
  • Loader: Do not attempt rollback when connection is already closed (#1240)
  • Use sbt-snowplow-release to build docker images (#1222)
  • Loader: Improvements to webhook alerts (#1238)
  • Add load_tstamp column to table definitions (#1233)
  • Loader: Disable warnings on incomplete shredding for the streaming transformer (#967)
  • Batch Transformer: emit transform_duration metric (#1236)
  • Batch Transformer: use JDK 11 in assembly (#1241)
  • Bump dependencies with CVEs (#1234)
  • Loader: Retry failures for all warehouse operations (#1225)
  • Loader: Avoid errors for "Connection is not available" (#1223)
  • Upgrade to Cats Effect 3 (#1219)

5.4.3

25 May 08:17

A bug fix release to address an issue with schema values and nullable columns in the Databricks loader.

Changelog

  • Bump schema-ddl to 0.18.2 (#1248)

5.4.2

25 May 08:16

A bug fix release to address an issue with duplicated normalised columns in the transformer's parquet output.

Changelog

  • Transformer parquet: avoid duplicating normalised columns such as eventType and event_type (#1242)

5.4.1

05 Apr 10:13

A patch release which fixes writing bad rows to the output directory during parquet transformation in the batch transformer. See issue #1229 for details.

5.4.0

20 Mar 13:38

This release brings a few features and bug fixes improving stability and observability of RDB Loader.

Full changelog

  • Transformer: add flag disabling atomic fields truncation to the transformer (#1217)
  • Loader fix: loaded batches must not get stuck in the retry queue (#1210)
  • Databricks loader: resilient to duplicate entries in manifest table (#1213)
  • Loader: make temp credentials session duration configurable (#1215)
  • Upgrade schema-ddl to 0.17.1 (#1207)
  • Snowflake ready check that does not require operate permission on the warehouse (#1195)
  • Add Kinesis/Pubsub badrows sink to streaming transformer (#1189)
  • Add Kinesis badrows sink to batch transformer (#1188)
  • Transformer: set event limit per parquet partition (#1178)
  • Transformer bad row count metric (#1171)
  • Don't use Spark sink for parquet badrows (#1168)

5.3.2

08 Mar 15:36

This patch release brings a few features and bug fixes improving stability and observability of RDB Loader.

Full changelog

  • Loader: single COPY statement for each unique schema for Redshift (#1202)
  • Loader: Improve management of temporary credentials (#1205)
  • Scan Docker images with Snyk container monitor in ci.yml (#1191)
  • Add alert webhook message with summary of folder monitoring issues (#1173)
  • Enforce timeout on rolling back failed transaction (#1194)
  • Loader: Databricks surface driver log level (#1180)

5.3.1

25 Jan 15:09

In 5.3.0, we introduced a bug in the Snowflake Loader that prevented it from copying contexts and unstructured events to the events table. This is fixed in version 5.3.1. Thanks to mgkeen for reporting the issue. Recovery instructions for missing data can be found in the Discourse post.

Also in this version, we've started using VARCHAR instead of CHAR for standard fields when creating the events table in the Databricks Loader (GitHub issue).

Full changelog

  • Snowflake Loader: fix contexts and unstructured event data not being copied into Snowflake (#1185)
  • Add sts runtime dependency for v1 and v2 AWS SDKs (#1183)
  • Databricks loader: Use VARCHAR instead of CHAR when creating events table (#1175)