Option to change partition key #44

Open
tv2-tjebsen opened this issue Aug 5, 2024 · 2 comments
Labels
status:needs_triage Needs maintainer triage. type:enhancement New features or improvements to existing features.

Comments

@tv2-tjebsen

Is your feature request related to a problem? Please describe.

When using normalize from within BI tools (we use Looker), we want to use derived_tstamp instead of collector_tstamp, as it better represents when the events actually happened.

When we let users filter on time using only derived_tstamp, all partitions are scanned, since BigQuery has no knowledge of how those two timestamps are related. To combat this we've created a rubber band filter that filters collector_tstamp to within +/- one day of the derived_tstamp range. This has the downside that it doesn't catch late-arriving events. See the SQL examples below.
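For illustration, the two-sided form of that rubber band filter might look roughly like this, assuming the table is partitioned on collector_tstamp (the dates and the one-day pad are illustrative; the user's derived_tstamp range drives the padded collector_tstamp bounds):

SELECT
  app_id
FROM
  `warehouse.snowplow_normalize.my_events`
WHERE
  -- the user's filter on the "true" event time
  TIMESTAMP_TRUNC(derived_tstamp, DAY) BETWEEN TIMESTAMP("2024-08-01") AND TIMESTAMP("2024-08-05")
  -- the same range padded by one day on the partition column, so BigQuery can prune partitions
  AND collector_tstamp BETWEEN TIMESTAMP_SUB(TIMESTAMP("2024-08-01"), INTERVAL 1 DAY)
                           AND TIMESTAMP_ADD(TIMESTAMP("2024-08-05"), INTERVAL 1 DAY)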

Furthermore, I imagine the use of collector_tstamp is based on the assumption that the atomic table is partitioned on this field. With the new loader structure for BigQuery (my_schema_1, not my_schema_1_2_3), our events table is partitioned on load_tstamp, which I assume also has implications for the data scanned during the actual runs of the normalize package, as those also need to scan all partitions.

Describe the solution you'd like

One option (and probably the easiest) would be to use start_tstamp, similar to what the unified models use. From my understanding this is mostly the same as derived_tstamp, so partitioning the derived tables on it would let BigQuery use the partitions when the user filters on derived_tstamp.

Another option would be to let the user set the partition key and have it propagated through the code. Currently, collector_tstamp is hard-coded into the model_gen.py script as well as the normalize_events macro.

I imagine this could be part of the config section at the top of the normalize_config.json file (and it should probably be the same for all models, as otherwise complexity would have to be added to the models):

"config":{
        ...
        "partition_key": "derived_tstamp",
    }

There might also be a need for a var in dbt_project.yml that specifies which partition key is used for the atomic table, as I imagine this can take a number of different values. This would probably also need to be considered in order to have the manifest work best and reduce the data processed (?):

snowplow__atomic_schema_partition_key: 'load_tstamp'
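For clarity, a sketch of how this proposed var might sit in dbt_project.yml (snowplow__atomic_schema_partition_key is the proposal above, not an existing package variable):

# dbt_project.yml (sketch)
vars:
  snowplow_normalize:
    snowplow__atomic_schema_partition_key: 'load_tstamp'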

Describe alternatives you've considered

As mentioned above, we have created a filter that is always applied so that we don't query all the available data every time. Furthermore, we always enforce date filters, so that users don't query too much data just because they can.

Example query without the rubber band filters applied:

SELECT
  app_id
FROM
  `warehouse.snowplow_normalize.my_events`
WHERE
  TIMESTAMP_TRUNC(derived_tstamp, DAY) >= TIMESTAMP("2024-08-01")

This query processes 6.05 GB of data, despite only querying 5 days of data (as of August 5th).

With an added filter on collector_tstamp, we can reduce the amount processed severalfold:

SELECT
  app_id
FROM
  `warehouse.snowplow_normalize.my_events`
WHERE
  TIMESTAMP_TRUNC(derived_tstamp, DAY) >= TIMESTAMP("2024-08-01")
  AND TIMESTAMP_TRUNC(collector_tstamp, DAY) >= TIMESTAMP("2024-08-01")

This query only processes 0.7 GB.

In other queries we have tried, this has the effect of lowering the data processed from 65 GB to around 2 GB, and we currently only have around a month of data. This problem will only grow if we don't enforce a collector_tstamp filter as well.

As mentioned above, this approach has the downside that it may exclude late-arriving events.

Additional context

N/A

Are you interested in contributing towards this feature?

Yes, my team at TV 2 Norway and I like the normalize package and are willing to contribute.

@tv2-tjebsen tv2-tjebsen added the type:enhancement New features or improvements to existing features. label Aug 5, 2024
@github-actions github-actions bot added the status:needs_triage Needs maintainer triage. label Aug 5, 2024
@agnessnowplow
Contributor

Hi @tv2-tjebsen, thank you for the detailed explanation; we understand the need for flexibility with partitioning/filtering for optimization. Regarding optimizing the event table querying, we already support changing the incremental timestamp field with the variable snowplow__session_timestamp, which in your case should ideally be set to load_tstamp, as that is what your atomic events table is partitioned on. The lookback_window_hours could also be reduced in this case.
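A minimal dbt_project.yml sketch of that suggestion (the lookback variable is snowplow__lookback_window_hours in the Snowplow packages, if I'm not mistaken; values are illustrative):

# dbt_project.yml (sketch)
vars:
  snowplow_normalize:
    snowplow__session_timestamp: 'load_tstamp'   # match the atomic table's partition column
    snowplow__lookback_window_hours: 3           # reduce from the default if load latency allows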

As for the partitioning of the actual normalized derived tables: currently the additional flat columns to take from the events table (event_columns in the config file) are optional, so derived_tstamp would have to be hardcoded into the package (perhaps renamed to start_tstamp to be consistent with other packages, as you said) and then possibly added as a default partitioning field. This, however, would be a breaking change, so we need to think about how best to support it. Alternatively, we could tie the derived table partitioning field to a universal variable, but then things could break easily, so there would need to be checks at model-creation time that the column exists for all the event specifications within the config. I will raise this with the team to add as a feature request; until then, the only option I see is to overwrite the partition_by field in the model config for all the derived tables as a "custom" solution.
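As a rough sketch of that per-model override, using dbt-bigquery's partition_by config (the model name and granularity here are illustrative):

-- models/snowplow_normalize/my_events.sql (sketch): hand-set the partitioning on a derived table
{{
  config(
    partition_by = {
      "field": "derived_tstamp",
      "data_type": "timestamp",
      "granularity": "day"
    }
  )
}}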

@tv2-tjebsen
Author

tv2-tjebsen commented Aug 8, 2024

Thanks for your great reply, @agnessnowplow! I was not aware of the snowplow__session_timestamp variable. Thanks for pointing it out; it's now implemented in our instance of normalize.

I agree with the arguments on the partitioning part, and the fastest path to the goal for us would probably be a more custom solution; I guess it could be solved by just changing the model_gen script slightly. I still think it would be a great addition to the package to have it more aligned with the unified setup (or as a variable). It's totally understandable that you can't introduce breaking changes and need to prioritize ✌️
