Lake Loader 0.4.0 (#911)
istreeter authored Jun 7, 2024
1 parent 46d9544 commit 492349b
Showing 8 changed files with 68 additions and 15 deletions.
@@ -50,3 +50,25 @@ import Link from '@docusaurus/Link';
 <td><code>telemetry.userProvidedId</code></td>
 <td>Optional. See <Link to="/docs/getting-started-on-community-edition/telemetry/#how-can-i-help">here</Link> for more information.</td>
 </tr>
+<tr>
+<td><code>inMemBatchBytes</code></td>
+<td>Optional. Default value 25600000. Controls how many bytes of events are buffered in memory before saving the batch to local disk. The default value works well for most reasonably sized VMs.</td>
+</tr>
+<tr>
+<td><code>cpuParallelismFraction</code></td>
+<td>
+Optional. Default value 0.75.
+Controls how the app splits the workload into concurrent batches which can be run in parallel.
+For example, if there are 4 available processors and cpuParallelismFraction = 0.75, then the app processes 3 batches concurrently.
+The default value works well for most workloads.
+</td>
+</tr>
+<tr>
+<td><code>numEagerWindows</code></td>
+<td>
+Optional. Default value 1.
+Controls how eagerly the loader starts processing the next timed window even when the previous timed window is still finalizing (committing into the lake).
+By default, the loader starts processing a new timed window while the single previous window is still finalizing, but it does not start a new window if any older windows are still finalizing.
+The default value works well for most workloads.
+</td>
+</tr>
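
For orientation, here is a minimal sketch of how these three options might look in a Lake Loader HOCON configuration file. The values simply restate the defaults described above; the surrounding top-level object is an assumption for the example, not part of this diff.

```hocon
{
  # Buffer roughly 25.6 MB of enriched events in memory before saving a batch to local disk
  "inMemBatchBytes": 25600000

  # With 4 available processors, 4 * 0.75 = 3 batches are processed concurrently
  "cpuParallelismFraction": 0.75

  # Start the next timed window while at most 1 earlier window is still finalizing
  "numEagerWindows": 1
}
```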
@@ -1,3 +1,7 @@
+```mdx-code-block
+import Link from '@docusaurus/Link';
+```
+
 <tr>
 <td><code>output.good.location</code></td>
 <td>Required, e.g. <code>gs://mybucket/events</code>. URI of the bucket location to which to write Snowplow enriched events in Delta format. The URI should start with the following prefix:
@@ -9,6 +13,9 @@
 </td>
 </tr>
 <tr>
-<td><code>output.good.dataSkippingColumns</code></td>
-<td>Optional. A list of column names which will be brought to the "left-hand-side" of the events table, to enable Delta's <a href="https://docs.delta.io/latest/optimizations-oss.html#data-skipping" target="_blank">data skipping feature</a>. Defaults to the important Snowplow timestamp columns: <code>load_tstamp</code>, <code>collector_tstamp</code>, <code>derived_tstamp</code>, <code>dvce_created_tstamp</code>.</td>
+<td><code>output.good.deltaTableProperties.*</code></td>
+<td>
+Optional. A map of key/value strings corresponding to Delta's table properties.
+These can be anything <Link to="https://docs.delta.io/latest/table-properties.html">from the Delta table properties documentation</Link>.
+The default properties include configuring Delta's <Link to="https://docs.delta.io/latest/optimizations-oss.html#data-skipping">data skipping feature</Link> for the important Snowplow timestamp columns: <code>load_tstamp</code>, <code>collector_tstamp</code>, <code>derived_tstamp</code>, <code>dvce_created_tstamp</code>.</td>
 </tr>
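
A hedged sketch of the new deltaTableProperties option in HOCON. The output.good.type value and the delta.logRetentionDuration override are illustrative assumptions; only location and deltaTableProperties come from the rows above.

```hocon
"output": {
  "good": {
    # Assumed for the example; Delta is one of the supported output formats
    "type": "Delta"
    "location": "gs://mybucket/events"

    # Merged with the loader's defaults, which already configure data skipping for
    # load_tstamp, collector_tstamp, derived_tstamp and dvce_created_tstamp
    "deltaTableProperties": {
      "delta.logRetentionDuration": "interval 7 days"
    }
  }
}
```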
@@ -10,9 +10,9 @@
 </tr>
 <tr>
 <td><code>output.good.hudiWriteOptions.*</code></td>
-<td>Optional. A map of key/value strings corresponding to Hudi's configuration options for writing into a table. The default options configure `load_tstamp` as the table's partition field.</td>
+<td>Optional. A map of key/value strings corresponding to Hudi's configuration options for writing into a table. The default options configure <code>load_tstamp</code> as the table's partition field.</td>
 </tr>
 <tr>
-<td><code>output.good.hudiTableOptions.*</code></td>
-<td>Optional. A map of key/value strings corresponding to Hudi's configuration options for creating a table. The default options configure `load_tstamp` as the table's partition field.</td>
+<td><code>output.good.hudiTableProperties.*</code></td>
+<td>Optional. A map of key/value strings corresponding to Hudi's configuration options for creating a table. The default options configure <code>load_tstamp</code> as the table's partition field.</td>
 </tr>
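
A rough illustration of the renamed option: hudiTableProperties (formerly hudiTableOptions) next to hudiWriteOptions. The location URI and the specific Hudi keys are assumptions chosen for the example, not values taken from this diff.

```hocon
"output": {
  "good": {
    "type": "Hudi"
    # Illustrative location only
    "location": "s3a://mybucket/events"

    # Options applied when writing into the table, merged with defaults
    # that partition the table by load_tstamp
    "hudiWriteOptions": {
      "hoodie.datasource.write.hive_style_partitioning": "true"
    }

    # Options applied when creating the table (previously hudiTableOptions)
    "hudiTableProperties": {
      "hoodie.metadata.enable": "true"
    }
  }
}
```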
@@ -1,11 +1,7 @@
-<tr>
-<td><code>output.good.type</code></td>
-<td>Required, set this to <code>Iceberg</code>.</td>
-</tr>
-<tr>
-<td><code>output.good.catalog.type</code></td>
-<td>Required, set this to <code>BigLake</code></td>
-</tr>
+```mdx-code-block
+import Link from '@docusaurus/Link';
+```
+
 <tr>
 <td><code>output.good.location</code></td>
 <td>Required, e.g. <code>gs://mybucket/</code>. URI of the bucket location to which to write Snowplow enriched events in Iceberg format. The URI should start with <code>gs://</code>.</td>
@@ -18,6 +14,14 @@
 <td><code>output.good.table</code></td>
 <td>Required. The name of the table in the BigLake database</td>
 </tr>
+<tr>
+<td><code>output.good.icebergTableProperties.*</code></td>
+<td>
+Optional. A map of key/value strings corresponding to Iceberg's table properties.
+These can be anything <Link to="https://iceberg.apache.org/docs/latest/configuration/">from the Iceberg table properties documentation</Link>.
+The default properties include configuring Iceberg's column-level statistics for the important Snowplow timestamp columns: <code>load_tstamp</code>, <code>collector_tstamp</code>, <code>derived_tstamp</code>, <code>dvce_created_tstamp</code>.
+</td>
+</tr>
 <tr>
 <td><code>output.good.catalog.project</code></td>
 <td>Required. The GCP project owning the BigLake catalog</td>
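
For the BigLake flavour, a minimal sketch of the new icebergTableProperties map. The table and catalog.project values, and the compression-codec property, are illustrative assumptions layered on top of the defaults described above.

```hocon
"output": {
  "good": {
    "location": "gs://mybucket/"
    "table": "events"

    # Overrides merged with defaults that enable column-level statistics
    # for the Snowplow timestamp columns
    "icebergTableProperties": {
      "write.parquet.compression-codec": "zstd"
    }

    "catalog": {
      "project": "my-gcp-project"
    }
  }
}
```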
@@ -22,6 +22,14 @@ import Link from '@docusaurus/Link';
 <td><code>output.good.table</code></td>
 <td>Required. The name of the table in the Glue database</td>
 </tr>
+<tr>
+<td><code>output.good.icebergTableProperties.*</code></td>
+<td>
+Optional. A map of key/value strings corresponding to Iceberg's table properties.
+These can be anything <Link to="https://iceberg.apache.org/docs/latest/configuration/">from the Iceberg table properties documentation</Link>.
+The default properties include configuring Iceberg's column-level statistics for the important Snowplow timestamp columns: <code>load_tstamp</code>, <code>collector_tstamp</code>, <code>derived_tstamp</code>, <code>dvce_created_tstamp</code>.
+</td>
+</tr>
 <tr>
 <td><code>output.good.catalog.options.*</code></td>
 <td>
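
The Glue flavour accepts the same icebergTableProperties override shown above, so the sketch below instead illustrates how it might sit next to the catalog.options map; the table name and the specific catalog option key are assumptions, not taken from this diff.

```hocon
"output": {
  "good": {
    "table": "events"

    # Extra options passed through to the underlying Iceberg Glue catalog;
    # the key below is only an illustration
    "catalog": {
      "options": {
        "glue.skip-archive": "true"
      }
    }
  }
}
```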
@@ -1,3 +1,7 @@
+```mdx-code-block
+import Link from '@docusaurus/Link';
+```
+
 <tr>
 <td><code>input.topicName</code></td>
 <td>Required. Name of the Kafka topic for the source of enriched events.</td>
@@ -20,5 +24,5 @@
 </tr>
 <tr>
 <td><code>output.bad.producerConf.*</code></td>
-<td>Optional. A map of key/value pairs for <a href="https://docs.confluent.io/platform/current/installation/configuration/producer-configs.html" target="_blank">any standard Kafka producer configuration option</a>.</td>
+<td>Optional. A map of key/value pairs for <Link to="https://docs.confluent.io/platform/current/installation/configuration/producer-configs.html">any standard Kafka producer configuration option</Link>.</td>
 </tr>
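
A small sketch of how the Kafka source and failed-events sink might be configured. The output.bad.topicName key and the chosen producer options are assumptions; producerConf itself is the map documented above, and the values in it are standard Kafka producer settings.

```hocon
"input": {
  "topicName": "enriched"
}

"output": {
  "bad": {
    # Assumed key name for the failed-events topic
    "topicName": "bad"

    # Any standard Kafka producer configuration option can go in this map
    "producerConf": {
      "client.id": "snowplow-lake-loader"
      "compression.type": "zstd"
    }
  }
}
```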
@@ -26,6 +26,14 @@
 <td><code>input.bufferSize</code></td>
 <td>Optional. Default value 1. The number of batches of events which are pre-fetched from kinesis. The default value is known to work well.</td>
 </tr>
+<tr>
+<td><code>input.workerIdentifier</code></td>
+<td>Optional. Defaults to the <code>HOSTNAME</code> environment variable. The name of this KCL worker, used in the DynamoDB lease table.</td>
+</tr>
+<tr>
+<td><code>input.leaseDuration</code></td>
+<td>Optional. Default value <code>10 seconds</code>. The duration of shard leases. KCL workers must periodically refresh leases in the DynamoDB table before this duration expires.</td>
+</tr>
 <tr>
 <td><code>output.bad.streamName</code></td>
 <td>Required. Name of the Kinesis stream that will receive failed events.</td>
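
A hedged sketch combining the new Kinesis options with the existing ones. The input.streamName key name and the stream names are assumptions; the HOCON environment-variable substitution is used here only to mirror the documented HOSTNAME default.

```hocon
"input": {
  # Assumed key name for the stream of enriched events
  "streamName": "enriched"

  # Number of pre-fetched batches; the default of 1 is known to work well
  "bufferSize": 1

  # HOCON resolves ${HOSTNAME} from the environment, mirroring the documented default
  "workerIdentifier": ${HOSTNAME}

  # KCL workers must renew their DynamoDB leases within this duration
  "leaseDuration": "10 seconds"
}

"output": {
  "bad": {
    "streamName": "bad"
  }
}
```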
2 changes: 1 addition & 1 deletion src/componentVersions.js
@@ -36,7 +36,7 @@ export const versions = {
 rdbLoader: '6.0.0',
 s3Loader: '2.2.8',
 s3Loader22x: '2.2.8',
-lakeLoader: '0.3.0',
+lakeLoader: '0.4.1',
 snowflakeStreamingLoader: '0.2.2',
 
 // Data Modelling
