Feature request: volume anomalies pre-aggregate data before computing statistics #1278
Closed
garfieldthesam started this conversation in Product and features
Replies: 1 comment
Closing and turning this into an issue instead.
I need to run volume anomaly tests on very large tables. However, I can't do so performantly, because the compiled query does not pre-aggregate the row-count data. For example, this is what the first CTE looks like for a test run on my Databricks cluster:
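The first CTE of the compiled test is shaped roughly like the following. This is an illustrative reconstruction, not the exact compiled SQL, and the table name is a placeholder:

```sql
-- Illustrative sketch: the compiled test's first CTE selects every column
-- of the monitored table with no aggregation, so all row-level data flows
-- into the later statistics CTEs. Table name is a placeholder.
with monitored_table as (
    select * from my_catalog.my_schema.my_large_table
)
select * from monitored_table
```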
For large tables (especially very wide ones), this is prohibitively expensive for two reasons:

- `select *` fetches every column of the table, which isn't necessary in principle for building a row-count time series; this is especially costly on columnar databases.
- The statistics are computed over row-level data rather than pre-aggregated counts, so the full table is scanned on every test run.

As a result, we've had to build a cumbersome workaround: we maintain derived data-quality metrics tables that summarize the large table's metrics each day, and then run elementary `column` tests on those.

I'd like to request a rearchitecture of how the volume anomalies code works to improve performance on large tables.
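For comparison, a pre-aggregating first CTE could collapse the table to one row per time bucket inside the warehouse before any statistics are computed. This is a sketch under assumed names: the table is a placeholder and `event_ts` stands in for whatever timestamp column the test is configured on:

```sql
-- Sketch: reduce the table to (bucket, row_count) pairs up front, so only
-- a small time series feeds the anomaly statistics instead of every row
-- and column. Table and column names are placeholders.
with daily_row_counts as (
    select
        date_trunc('day', event_ts) as bucket_start,
        count(*) as row_count
    from my_catalog.my_schema.my_large_table
    group by 1
)
select bucket_start, row_count
from daily_row_counts
```

On a columnar engine this touches only the timestamp column and returns one row per day, rather than materializing the whole table.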