Release Normalize version 0.4.0 (#1081)
github-actions[bot] authored Dec 4, 2024
1 parent 07f4bf3 commit c33acaf
Showing 5 changed files with 283 additions and 13 deletions.
@@ -41,11 +41,12 @@ import {versions} from '@site/src/componentVersions';
<TabItem value="normalize" label="Snowplow Normalize">

<ReactMarkdown children={`
| snowplow-normalize version | dbt versions | BigQuery | Databricks | Redshift | Snowflake | Postgres | Spark |
| -------------------------------- | ----------------- | :------: | :--------: | :------: | :-------: | :------: | :------: |
| ${versions.dbtSnowplowNormalize} | >=1.4.0 to <2.0.0 |||||||
| 0.3.5 | >=1.4.0 to <2.0.0 |||||||
| 0.2.3 | >=1.3.0 to <2.0.0 |||||||
| 0.1.0 | >=1.0.0 to <2.0.0 |||||||
`} remarkPlugins={[remarkGfm]} />

</TabItem>
@@ -16,22 +16,20 @@ import Badges from '@site/src/components/Badges';
<Badges badgeType="Maintained"></Badges>&nbsp;
<Badges badgeType="SPAL"></Badges>


# Snowplow Normalize Package

:::note
Normalize in this context means [database normalization](https://en.wikipedia.org/wiki/Database_normalization), not statistical normalization: these models produce flatter data.
:::

**The package source code can be found in the [snowplow/dbt-snowplow-normalize repo](https://github.com/snowplow/dbt-snowplow-normalize), and the docs for the [macro design are here](https://snowplow.github.io/dbt-snowplow-normalize/#/overview/snowplow_normalize).**

The package provides [macros](https://docs.getdbt.com/docs/build/jinja-macros) and a Python script that are used to generate your normalized events, filtered events, and users tables for use in downstream ETL tools such as Census. See the [Model Design](#model-design) section for further details on these tables.

The package only includes the base incremental scratch model and does not have any derived models; instead, it generates models in your project as if they were custom models you had built on top of the [Snowplow incremental tables](/docs/modeling-your-data/modeling-your-data-with-dbt/package-mechanics/incremental-processing/index.md), using the `_this_run` table as the base for new events to process each run. See the [configuration](/docs/modeling-your-data/modeling-your-data-with-dbt/dbt-configuration/index.md) section for the variables that apply to the incremental model.

:::note
The incremental model is simplified compared to the standard unified model: this package does not use sessions to identify which historic events to reprocess. Instead, it uses the `snowplow__session_timestamp` (which defaults to `collector_tstamp`) together with the package variables sketched below to identify which events to (re)process.
:::
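
As a quick orientation, here is a minimal `dbt_project.yml` sketch of those incremental-processing variables, using the package defaults listed in the variable schema further down this page. Treat the values as defaults to adjust for your own data volumes, not as recommendations:

```yaml
# Sketch only: values shown are the package defaults, not recommendations
vars:
  snowplow__start_date: '2020-01-01'              # where processing starts on a first run or full refresh
  snowplow__session_timestamp: 'collector_tstamp' # timestamp used to window which events to (re)process
  snowplow__backfill_limit_days: 30               # max days of new data processed per run
  snowplow__lookback_window_hours: 6              # re-scan window for late-arriving events
```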

## Model Design
@@ -54,7 +54,7 @@ For each `event_names` listed, a model is generated for records with matching ev
For example, if you have 3 `event_names` entries listed as `['page_view']`, `['page_ping']`, and `['link_click', 'deep_link_click']`, then 3 models will be generated, each containing only those respective events from the atomic events table.
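
As a deliberately trimmed sketch, the `events` section of such a configuration could look like the following; the column choices are arbitrary examples and a real configuration file typically contains additional keys (see the full example later on this page and the package docs):

```json
{
  "events": [
    { "event_names": ["page_view"],                     "event_columns": ["domain_userid", "collector_tstamp", "app_id"] },
    { "event_names": ["page_ping"],                     "event_columns": ["domain_userid", "collector_tstamp", "app_id"] },
    { "event_names": ["link_click", "deep_link_click"], "event_columns": ["domain_userid", "collector_tstamp", "app_id"] }
  ]
}
```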

### Filtered Events Model
A single model is built that provides the `event_id`, the `snowplow__partition_tstamp` (defaults to `collector_tstamp`), and the name of the Normalized Event Model that each event was processed into; it does not include records for events whose type is not in your configuration. The model file itself is a series of `UNION` statements, as sketched below.
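
A minimal sketch of that generated structure, assuming two hypothetical normalized models (the real file is generated for you, and its model and column names depend on your configuration):

```sql
-- Illustrative only: model names and the output column alias are hypothetical;
-- the timestamp column follows snowplow__partition_tstamp (collector_tstamp by default)
select event_id, collector_tstamp, 'snowplow_page_view_events' as event_table_name
from {{ ref('snowplow_page_view_events') }}

union all

select event_id, collector_tstamp, 'snowplow_link_click_events' as event_table_name
from {{ ref('snowplow_link_click_events') }}
```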

### Users Model
The users model provides a more traditional view of a Users table than that presented in the other Snowplow dbt packages. This model has one row per non-null `user_id` (or other identifier column you specify), and takes the latest (based on `collector_tstamp`) values from the specified contexts to ensure you always have the latest version of the information that you choose to collect about your users. This is designed to be immediately usable in downstream tools. The model file itself consists of lists of variables and a macro call.
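
The SQL itself is produced by the package's macro, but the "latest value per user" pattern it describes can be sketched roughly as follows; table, column, and context names here are hypothetical, not the package's actual output:

```sql
-- Rough sketch of the "latest value per user" idea only, not the generated model
with ranked as (
    select
        user_id,
        email,                               -- hypothetical column taken from a user context
        collector_tstamp,
        row_number() over (partition by user_id order by collector_tstamp desc) as rn
    from {{ ref('snowplow_normalized_events') }}  -- hypothetical upstream model
    where user_id is not null
)

select
    user_id,
    email,
    collector_tstamp as latest_collector_tstamp
from ranked
where rn = 1
```
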
@@ -114,8 +114,39 @@ Depending on the use case it should either be the catalog (for Unity Catalog use

:::

### 10. Change the default partition timestamp *(optional)*

The package uses a configurable partition timestamp column, controlled by the `snowplow__partition_tstamp` variable:

```yaml
vars:
  snowplow__partition_tstamp: "collector_tstamp" # Default value; any override should be a timestamp column
```

The purpose of this variable is to let you partition the derived tables on a different timestamp (e.g. `derived_tstamp`) that is better suited to analytics in the downstream layer.

:::warning Important Note on Custom Partition Timestamps
If you change `snowplow__partition_tstamp` to a different column (e.g., "loader_tstamp"), you MUST ensure that this column is included in the `event_columns` list in your normalize configuration for each event. Failing to do so will cause the models to fail, as the partition column must be present in the normalized output.

Example configuration when using a custom partition timestamp:
```json
{
  "events": [
    {
      "event_names": ["page_view"],
      "event_columns": [
        "domain_userid",
        "loader_tstamp", // Must include your custom partition timestamp here
        "app_id"
      ],
      // ... rest of configuration
    }
  ]
}
```
:::

### 11. Run your model(s)

You can now run your models for the first time by running the command below (see the [operation](/docs/modeling-your-data/modeling-your-data-with-dbt/dbt-operation/index.md) page for more information on operating the package):

2 changes: 1 addition & 1 deletion src/componentVersions.js
@@ -47,7 +47,7 @@ export const versions = {
dbtSnowplowMobile: '1.0.0',
dbtSnowplowMediaPlayer: '0.9.2',
dbtSnowplowUtils: '0.17.1',
dbtSnowplowNormalize: '0.4.0',
dbtSnowplowFractribution: '0.3.6',
dbtSnowplowEcommerce: '0.9.0',

240 changes: 240 additions & 0 deletions src/components/JsonSchemaValidator/Schemas/dbtNormalize_0.4.0.json
@@ -0,0 +1,240 @@
{
"definitions": {
"passthrough_vars": {
"type": "array",
"description": "> Click the plus sign to add a new entry",
"minItems": 0,
"items": {
"title": "Type",
"oneOf": [
{
"type": "string",
"title": "Column Name"
},
{
"type": "object",
"title": "SQL & Alias",
"properties": {
"sql": {
"type": "string"
},
"alias": {
"type": "string"
}
},
"required": [
"sql",
"alias"
],
"additionalProperties": false
}
]
},
"uniqueItems": true
}
},
"type": "object",
"properties": {
"snowplow__atomic_schema": {
"recommendFullRefresh": true,
"order": 3,
"consoleGroup": "required",
"type": "string",
"title": "Schema",
"description": "Schema (dataset) that contains your atomic events",
"longDescription": "The schema (dataset for BigQuery) that contains your atomic events table.",
"packageDefault": "atomic",
"group": "Warehouse and Tracker"
},
"snowplow__database": {
"recommendFullRefresh": true,
"order": 1,
"consoleGroup": "required",
"type": "string",
"title": "Database",
"description": "Database that contains your atomic events",
"longDescription": "The database that contains your atomic events table.",
"packageDefault": "target.database",
"group": "Warehouse and Tracker"
},
"snowplow__dev_target_name": {
"recommendFullRefresh": false,
"order": 87,
"consoleGroup": "advanced",
"type": "string",
"title": "Dev Target",
"description": "Target name of your development environment as defined in your `profiles.yml` file",
"longDescription": "The [target name](https://docs.getdbt.com/docs/core/connect-data-platform/profiles.yml) of your development environment as defined in your `profiles.yml` file. See the [Manifest Tables](/docs/modeling-your-data/modeling-your-data-with-dbt/package-mechanics/manifest-tables/) section for more details.",
"packageDefault": "dev",
"group": "Warehouse and Tracker"
},
"snowplow__events": {
"recommendFullRefresh": false,
"order": 9999,
"consoleGroup": "advanced",
"type": "string",
"title": "Events Table",
"description": "Reference to your events table",
"longDescription": "This is used internally by the packages to reference your events table based on other variable values and should not be changed.",
"packageDefault": "events",
"group": "Warehouse and Tracker"
},
"snowplow__allow_refresh": {
"recommendFullRefresh": true,
"order": 39,
"consoleGroup": "advanced",
"type": "boolean",
"title": "Allow Refresh",
"group": "Operation and Logic",
"longDescription": "Used as the default value to return from the `allow_refresh()` macro. This macro determines whether the manifest tables can be refreshed or not, depending on your environment. See the [Manifest Tables](/docs/modeling-your-data/modeling-your-data-with-dbt/package-mechanics/manifest-tables/) section for more details.",
"packageDefault": "false"
},
"snowplow__backfill_limit_days": {
"recommendFullRefresh": false,
"order": 41,
"consoleGroup": "advanced",
"type": "number",
"minimum": 0,
"title": "Backfill Limit",
"group": "Operation and Logic",
"longDescription": "The maximum numbers of days of new data to be processed since the latest event processed. Please refer to the [incremental logic](/docs/modeling-your-data/modeling-your-data-with-dbt/package-mechanics/incremental-processing/#package-state) section for more details.",
"packageDefault": "30",
"description": "The maximum numbers of days of new data to be processed since the latest event processed"
},
"snowplow__days_late_allowed": {
"recommendFullRefresh": true,
"order": 42,
"consoleGroup": "advanced",
"type": "number",
"minimum": 0,
"title": "Days Late Allowed",
"group": "Operation and Logic",
"longDescription": "The maximum allowed number of days between the event creation and it being sent to the collector. Exists to reduce lengthy table scans that can occur as a result of late arriving data.",
"packageDefault": "3",
"description": "The maximum allowed number of days between the event creation and it being sent to the collector"
},
"snowplow__lookback_window_hours": {
"recommendFullRefresh": false,
"order": 43,
"consoleGroup": "advanced",
"type": "number",
"minimum": 0,
"title": "Event Lookback Window",
"longDescription": "The number of hours to look before the latest event processed - to account for late arriving data, which comes out of order.",
"packageDefault": "6",
"group": "Operation and Logic",
"description": "The number of hours to look before the latest event processed - to account for late arriving data, which comes out of order"
},
"snowplow__start_date": {
"recommendFullRefresh": false,
"order": 6,
"consoleGroup": "required",
"type": "string",
"format": "date",
"title": "Start Date",
"group": "Operation and Logic",
"longDescription": "The date to start processing events from in the package on first run or a full refresh, based on `collector_tstamp`",
"packageDefault": "2020-01-01",
"description": "The date to start processing events from in the package on first run or a full refresh, based on `collector_tstamp`"
},
"snowplow__session_timestamp": {
"recommendFullRefresh": false,
"order": 55,
"consoleGroup": "advanced",
"type": "string",
"title": "Start Date",
"group": "Operation and Logic",
"longDescription": "This determines which timestamp is used to process sessions of data. It's a good idea to have this timestamp be the same timestamp as the field you partition your events table on.",
"packageDefault": "collector_tstamp",
"description": "Determines which timestamp is used to process sessions of data"
},
"snowplow__partition_tstamp": {
"recommendFullRefresh": true,
"order": 56,
"consoleGroup": "advanced",
"type": "string",
"title": "Start Date",
"group": "Operation and Logic",
"longDescription": "This determines which timestamp is used to partition the derived tables. You need to make sure that this timestamp will be present in the flattened events table.",
"packageDefault": "collector_tstamp",
"description": "This determines which timestamp is used to partition the derived tables."
},
"snowplow__upsert_lookback_days": {
"recommendFullRefresh": false,
"order": 126,
"consoleGroup": "advanced",
"type": "number",
"minimum": 0,
"title": "Upsert Lookback Days",
"group": "Operation and Logic",
"longDescription": "Number of days to look back over the incremental derived tables during the upsert. Where performance is not a concern, should be set to as long a value as possible. Having too short a period can result in duplicates. Please see the [Snowplow Optimized Materialization](/docs/modeling-your-data/modeling-your-data-with-dbt/package-mechanics/optimized-upserts/) section for more details.",
"packageDefault": "30",
"description": "Number of days to look back over the incremental derived tables during the upsert"
},
"snowplow__app_id": {
"recommendFullRefresh": false,
"order": 8,
"consoleGroup": "basic",
"type": "array",
"description": "> Click the plus sign to add a new entry",
"minItems": 0,
"title": "App IDs",
"longDescription": "A list of `app_id`s to filter the events table on for processing within the package.",
"packageDefault": "[ ] (no filter applied)",
"group": "Contexts, Filters, and Logs",
"items": {
"type": "string"
}
},
"snowplow__databricks_catalog": {
"recommendFullRefresh": true,
"order": 2,
"consoleGroup": "required",
"type": "string",
"title": "(Databricks) Catalog",
"warehouse": "Databricks",
"group": "Warehouse Specific",
"longDescription": "The catalogue your atomic events table is in. Depending on the use case it should either be the catalog (for Unity Catalog users from databricks connector 1.1.1 onwards, defaulted to `hive_metastore`) or the same value as your `snowplow__atomic_schema` (unless changed it should be 'atomic').",
"packageDefault": "hive_metastore",
"description": "The catalogue your atomic events table is in"
},
"snowplow__derived_tstamp_partitioned": {
"recommendFullRefresh": false,
"order": 9,
"consoleGroup": "basic",
"type": "boolean",
"warehouse": "Bigquery",
"title": "(Bigquery) Derived Timestamp Partition",
"longDescription": "Boolean to enable filtering the events table on `derived_tstamp` in addition to `collector_tstamp`.",
"packageDefault": "true",
"group": "Warehouse Specific"
},
"snowplow__grant_select_to": {
"recommendFullRefresh": false,
"order": 106,
"consoleGroup": "advanced",
"type": "array",
"description": "> Click the plus sign to add a new entry",
"minItems": 0,
"items": {
"type": "string",
"title": "User/Role"
},
"title": "Grant Select List",
"group": "Warehouse and Tracker",
"longDescription": "A list of users to grant select to all tables created by this package to.",
"packageDefault": "[]"
},
"snowplow__grant_schema_usage": {
"recommendFullRefresh": false,
"order": 105,
"consoleGroup": "advanced",
"type": "boolean",
"description": "Enable granting usage on schemas",
"title": "Enable grant usage",
"group": "Warehouse and Tracker",
"longDescription": "Enables granting usage on schemas interacted with on a dbt run",
"packageDefault": "true"
}
}
}
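
The schema above enumerates the package's configuration variables for the setup validator. As a hedged illustration of how a handful of them map onto `dbt_project.yml`, using the names and defaults from the schema (the database, schema, app ID, and role values below are placeholders):

```yaml
# Sketch only: variable names and defaults come from the schema above;
# the database, app_id, and grant values are hypothetical placeholders
vars:
  snowplow__database: 'my_database'        # defaults to target.database
  snowplow__atomic_schema: 'atomic'
  snowplow__app_id: ['my-app-id']          # default [] applies no filter
  snowplow__allow_refresh: false
  snowplow__grant_select_to: ['reporter']  # hypothetical role
```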
