Option to change partition key #44

Open
tv2-tjebsen opened this issue Aug 5, 2024 · 2 comments
Labels
status:needs_triage Needs maintainer triage. type:enhancement New features or improvements to existing features.

Comments

@tv2-tjebsen

Is your feature request related to a problem? Please describe.

When using normalize from within BI tools (we use Looker), we want to use derived_tstamp instead of collector_tstamp, as it better represents when the events actually happened.

When we let users filter on time using only derived_tstamp, all partitions are scanned, since BigQuery has no knowledge of how those two timestamps are related. To combat this we've created a rubber band filter that filters collector_tstamp to within +/- one day of the derived_tstamp range. This has the downside that it doesn't catch late-arriving events. See the SQL examples below.
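For illustration, the two-sided form of that rubber band filter might look roughly like this, assuming the table is partitioned on collector_tstamp (the dates and the one-day pad are illustrative; the user's derived_tstamp range drives the padded collector_tstamp bounds):

SELECT
  app_id
FROM
  `warehouse.snowplow_normalize.my_events`
WHERE
  -- the user's filter on the "true" event time
  TIMESTAMP_TRUNC(derived_tstamp, DAY) BETWEEN TIMESTAMP("2024-08-01") AND TIMESTAMP("2024-08-05")
  -- the same range padded by one day on the partition column, so BigQuery can prune partitions
  AND collector_tstamp BETWEEN TIMESTAMP_SUB(TIMESTAMP("2024-08-01"), INTERVAL 1 DAY)
                           AND TIMESTAMP_ADD(TIMESTAMP("2024-08-05"), INTERVAL 1 DAY)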

Furthermore, I imagine the use of collector_tstamp is based on the assumption that the atomic table is partitioned on this field. With the new loader structure for BigQuery (my_schema_1, not my_schema_1_2_3), our events table is partitioned on load_tstamp, which I assume also has implications for the data scanned during the actual runs of the normalize package, as those also need to scan all partitions.

Describe the solution you'd like

One option (and probably the easiest) would be to use start_tstamp, similar to what the unified models use. From my understanding this is mostly the same as derived_tstamp, so partitioning the derived tables on it would let BigQuery use the partitions when the user filters on derived_tstamp.

Another option would be to let the user set the partition key and have it propagated through the code. Currently, collector_tstamp is hard-coded into the model_gen.py script as well as the normalize_events macro.

I imagine this could be part of the config section at the top of the normalize_config.json file (and it should probably be the same for all models, as otherwise complexity would have to be added to the models):

"config":{
        ...
        "partition_key": "derived_tstamp",
    }

There might also be a need for a var in dbt_project.yml that specifies which partition key is used for the atomic table, as I imagine this can take a number of different values. This would probably also need to be considered in order to have the manifest work best and reduce the data processed (?):

snowplow__atomic_schema_partition_key: 'load_tstamp'
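For clarity, a sketch of how this proposed var might sit in dbt_project.yml (snowplow__atomic_schema_partition_key is the proposal above, not an existing package variable):

# dbt_project.yml (sketch)
vars:
  snowplow_normalize:
    snowplow__atomic_schema_partition_key: 'load_tstamp'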

Describe alternatives you've considered

As mentioned above, we have created a filter that is always applied so that we don't query all the available data every time. Furthermore, we always enforce date filters, so that users don't query too much data just because they can.

Example query without the rubber band filters applied:

SELECT
  app_id
FROM
  `warehouse.snowplow_normalize.my_events`
WHERE
  TIMESTAMP_TRUNC(derived_tstamp, DAY) >= TIMESTAMP("2024-08-01")

This query processes 6.05 GB of data, despite only querying 5 days of data (as of August 5th).

With an added filter on collector_tstamp, we can reduce the amount processed severalfold:

SELECT
  app_id
FROM
  `warehouse.snowplow_normalize.my_events`
WHERE
  TIMESTAMP_TRUNC(derived_tstamp, DAY) >= TIMESTAMP("2024-08-01")
  AND TIMESTAMP_TRUNC(collector_tstamp, DAY) >= TIMESTAMP("2024-08-01")

This query only processes 0.7 GB.

In other queries we have tried, this has the effect of lowering the data processed from 65 GB to around 2 GB, and we currently only have around a month of data. This problem will only grow if we don't enforce a collector_tstamp filter as well.

As mentioned above, this approach has the downside that it may exclude late-arriving events.

Additional context

N/A

Are you interested in contributing towards this feature?

Yes, my team at TV 2 Norway and I like the normalize package and are willing to contribute.

@tv2-tjebsen tv2-tjebsen added the type:enhancement New features or improvements to existing features. label Aug 5, 2024
@github-actions github-actions bot added the status:needs_triage Needs maintainer triage. label Aug 5, 2024
@agnessnowplow
Contributor

Hi @tv2-tjebsen, thank you for the detailed explanation; we understand the need for flexibility with partitioning/filtering for optimization. Regarding optimizing the event table querying, we already support changing the incremental timestamp field with the variable snowplow__session_timestamp, which in your case should ideally be set to load_tstamp, as that is what your atomic events table is partitioned on. The lookback_window_hours could also be reduced in this case.
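A minimal dbt_project.yml sketch of that suggestion (the lookback variable is snowplow__lookback_window_hours in the Snowplow packages, if I'm not mistaken; values are illustrative):

# dbt_project.yml (sketch)
vars:
  snowplow_normalize:
    snowplow__session_timestamp: 'load_tstamp'   # match the atomic table's partition column
    snowplow__lookback_window_hours: 3           # reduce from the default if load latency allows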

As for the partitioning of the actual normalized derived tables: currently the additional flat columns to take from the events table (event_columns in the config file) are optional, so derived_tstamp would have to be hardcoded into the package (perhaps renamed to start_tstamp to be consistent with other packages, as you said) and then possibly added as a default partitioning field. This, however, would be a breaking change, so we need to think about how best to support it. Alternatively, we could tie the derived table partitioning field to a universal variable, but then things could break easily, so there would need to be checks at model-creation time that the column exists for all the event specifications within the config. I will raise this with the team to add as a feature request; until then, the only option I see is to overwrite the partition_by field in the model config for all the derived tables as a "custom" solution.
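As a rough sketch of that per-model override, using dbt-bigquery's partition_by config (the model name and granularity here are illustrative):

-- models/snowplow_normalize/my_events.sql (sketch): hand-set the partitioning on a derived table
{{
  config(
    partition_by = {
      "field": "derived_tstamp",
      "data_type": "timestamp",
      "granularity": "day"
    }
  )
}}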

@tv2-tjebsen
Author

tv2-tjebsen commented Aug 8, 2024

Thanks for your great reply, @agnessnowplow! I was not aware of the snowplow__session_timestamp variable. Thanks for pointing it out; it's now implemented in our instance of normalize.

I agree with the arguments on the partitioning part, and the fastest path to the goal for us would probably be a more custom solution; I guess it could be solved by just changing the model_gen script slightly. I still think it would be a great addition to the package to have it more aligned with the unified setup (or as a variable). It's totally understandable that you can't introduce breaking changes and need to prioritize ✌️
