add subdaily granularity #5882

Merged
merged 23 commits into from
Aug 15, 2024
Changes from 16 commits
ef816e5
add bits
mirnawong1 Jul 31, 2024
6626a1c
add timespine
mirnawong1 Jul 31, 2024
b49a75b
add
mirnawong1 Aug 2, 2024
c5a594d
new page and rn
mirnawong1 Aug 2, 2024
23b373b
Merge branch 'current' into sub-granularity
mirnawong1 Aug 2, 2024
0c8362e
Merge branch 'current' into sub-granularity
mirnawong1 Aug 2, 2024
f510116
Update website/docs/docs/dbt-versions/release-notes.md
mirnawong1 Aug 2, 2024
97d5125
Merge branch 'current' into sub-granularity
mirnawong1 Aug 2, 2024
7700ae1
Merge branch 'current' into sub-granularity
mirnawong1 Aug 12, 2024
e517b12
Merge branch 'current' into sub-granularity
mirnawong1 Aug 12, 2024
9028011
update time spine and dimensions docs
Jstein77 Aug 9, 2024
31a5fc1
updates for sub daily granualrity
Jstein77 Aug 13, 2024
0cf05df
Merge branch 'current' into sub-granularity
Jstein77 Aug 13, 2024
ffb1b2e
spelling + grammar updates
Jstein77 Aug 13, 2024
23ac774
Merge branch 'current' into sub-granularity
runleonarun Aug 13, 2024
ce7dd8a
Apply suggestions from code review
matthewshaver Aug 14, 2024
695a1f6
address comments
Jstein77 Aug 14, 2024
84e0c1e
Update semantic-models.md
Jstein77 Aug 14, 2024
eb7a262
Update metrics-overview.md
Jstein77 Aug 14, 2024
8b71b5b
Apply suggestions from code review
matthewshaver Aug 15, 2024
8289f21
Merge branch 'current' into sub-granularity
matthewshaver Aug 15, 2024
2a35c8d
Apply suggestions from code review
matthewshaver Aug 15, 2024
44d405f
Update website/docs/docs/build/metricflow-time-spine.md
matthewshaver Aug 15, 2024
66 changes: 26 additions & 40 deletions website/docs/docs/build/dimensions.md
@@ -6,20 +6,18 @@ sidebar_label: "Dimensions"
tags: [Metrics, Semantic Layer]
---

Dimensions are a way to group or filter information based on categories or time. It's like a special label that helps organize and analyze data.

In a data platform, dimensions are part of a larger structure called a semantic model. It's created along with other elements like [entities](/docs/build/entities) and [measures](/docs/build/measures) and used to add more details to your data that can't be easily added up or combined. In SQL, dimensions are typically included in the `group by` clause of your SQL query.
Dimensions represent the non-aggregatable columns in your data set, which are the attributes, features, or characteristics that describe or categorize data. In the context of the dbt Semantic Layer, dimensions are part of a larger structure called a semantic model. They are created along with other elements like [entities](/docs/build/entities) and [measures](/docs/build/measures) and used to add more details to your data. In SQL, dimensions are typically included in the `group by` clause of your SQL query.

<!--dimensions are non-aggregatable expressions that define the level of aggregation for a metric used to define how data is sliced or grouped in a metric. Since groups can't be aggregated, they're considered to be a property of the primary or unique entity of the table.

Groups are defined within semantic models, alongside entities and measures, and correspond to non-aggregatable columns in your dbt model that provides categorical or time-based context. In SQL, dimensions is typically included in the GROUP BY clause.-->

All dimensions require a `name`, `type` and in some cases, an `expr` parameter. The `name` for your dimension must be unique to the semantic model and can not be the same as an existing `entity` or `measure` within that same model.
All dimensions require a `name` and `type` and, in some cases, can optionally include an `expr` parameter. The `name` for your Dimension must be unique within the same semantic model.
Contributor

It seems redundant to say both "in some cases" and "optionally" - maybe pick one or the other?

Contributor

Yup good call. I will update.


| Parameter | Description | Type |
| --------- | ----------- | ---- |
| `name` | Refers to the name of the group that will be visible to the user in downstream tools. It can also serve as an alias if the column name or SQL query reference is different and provided in the `expr` parameter. <br /><br /> Dimension names should be unique within a semantic model, but they can be non-unique across different models as MetricFlow uses [joins](/docs/build/join-logic) to identify the right dimension. | Required |
| `type` | Specifies the type of group created in the semantic model. There are two types:<br /><br />- **Categorical**: Group rows in a table by categories like geography, color, and so on. <br />- **Time**: Point to a date field in the data platform. Must be of type TIMESTAMP or equivalent in the data platform engine. <br /> - You can also use time dimensions to specify time spans for [slowly changing dimensions](/docs/build/dimensions#scd-type-ii) tables. | Required |
| `type` | Specifies the type of group created in the semantic model. There are two types:<br /><br />- **Categorical**: Describe attributes or features like geography or sales region. <br />- **Time**: Time-based dimensions like timestamps or dates. | Required |
| `type_params` | Specific type params such as if the time is primary or used as a partition | Required |
| `description` | A clear description of the dimension | Optional |
| `expr` | Defines the underlying column or SQL query for a dimension. If no `expr` is specified, MetricFlow will use the column with the same name as the group. You can use the column name itself to input a SQL expression. | Optional |
@@ -48,6 +46,8 @@ semantic_models:
agg_time_dimension: order_date
# --- entities ---
entities:
- name: transaction
type: primary
...
# --- measures ---
measures:
@@ -56,14 +56,18 @@ semantic_models:
dimensions:
- name: order_date
type: time
label: "Date of transaction" # Recommend adding a label to define the value displayed in downstream tools
expr: date_trunc('day', ts)
type_params:
time_granularity: day
label: "Date of transaction" # Recommend adding a label to provide more context to users consuming the data
expr: ts
- name: is_bulk_transaction
type: categorical
expr: case when quantity > 10 then true else false end
```

MetricFlow requires that all dimensions have a primary entity. This is to guarantee unique dimension names. If your data source doesn't have a primary entity, you need to assign the entity a name using the `primary_entity: entity_name` key. It doesn't necessarily have to map to a column in that table and assigning the name doesn't affect query generation.
Dimensions are bound to the primary entity of the semantic model in which they are defined. For example, if a dimension called `is_bulk_transaction` is defined in a model with `transaction` as a primary entity, then `is_bulk_transaction` is scoped to the `transaction` entity. To reference this dimension you would use the fully qualified dimension name `transaction__is_bulk_transaction`.
Contributor

Might be nice to use an example dimension name that makes it somewhat clear why we bind it to the entity name. E.g., something like transaction__country or just changing the name to something like transaction__is_bulk would make this feel less redundant.

Contributor

Fixed.
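
As a rough sketch of the naming suggestion in this thread: a dimension named `country`, defined in a semantic model whose primary entity is `transaction`, would be referenced as `transaction__country`. The semantic model name, the `ref` target, and the `country` column are hypothetical placeholders, not part of the PR.

```yaml
semantic_models:
  - name: transactions                 # hypothetical semantic model name
    model: ref('fact_transactions')    # placeholder dbt model
    entities:
      - name: transaction
        type: primary
    dimensions:
      - name: country                  # hypothetical column; referenced as transaction__country
        type: categorical
```

The entity prefix is what keeps `country` unambiguous if another semantic model also defines a dimension with the same name.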


MetricFlow requires that all semantic models have a primary entity. This is to guarantee unique dimension names. If your data source doesn't have a primary entity, you need to assign the entity a name using the `primary_entity` key. It doesn't necessarily have to map to a column in that table and assigning the name doesn't affect query generation. An example of defining a primary entity for a data source that doesn't have a primary entity column is below:
Contributor

Can we add that for a virtual primary entity like this, you should try to make the name unique? I don't think we enforce that (we should) but it's definitely helpful

Contributor

👍


```yaml
semantic_model:
@@ -93,7 +97,7 @@ This section further explains the dimension definitions, along with examples. Di

## Categorical

Categorical is used to group metrics by different categories such as product type, color, or geographical area. They can refer to existing columns in your dbt model or be calculated using a SQL expression with the `expr` parameter. An example of a category dimension is `is_bulk_transaction`, which is a group created by applying a case statement to the underlying column `quantity`. This allows users to group or filter the data based on bulk transactions.
Categorical dimensions are used to group metrics by different attributes, features, or characteristics such as product type. They can refer to existing columns in your dbt model or be calculated using a SQL expression with the `expr` parameter. An example of a categorical dimension is `is_bulk_transaction`, which is a group created by applying a case statement to the underlying column `quantity`. This allows users to group or filter the data based on bulk transactions.

```yaml
dimensions:
@@ -104,15 +108,10 @@ dimensions:

## Time

:::tip use datetime data type if using BigQuery
To use BigQuery as your data platform, time dimensions columns need to be in the datetime data type. If they are stored in another type, you can cast them to datetime using the `expr` property. Time dimensions are used to group metrics by different levels of time, such as day, week, month, quarter, and year. MetricFlow supports these granularities, which can be specified using the `time_granularity` parameter.
:::

Time has additional parameters specified under the `type_params` section. When you query one or more metrics in MetricFlow using the CLI, the default time dimension for a single metric is the aggregation time dimension, which you can refer to as `metric_time` or use the dimensions' name.
Time has additional parameters specified under the `type_params` section. When you query one or more metrics, the default time dimension for each metric is the aggregation time dimension, which you can refer to as `metric_time` or use the dimension's name.

You can use multiple time groups in separate metrics. For example, the `users_created` metric uses `created_at`, and the `users_deleted` metric uses `deleted_at`:


```bash
# dbt Cloud users
dbt sl query --metrics users_created,users_deleted --group-by metric_time__year --order-by metric_time__year
@@ -121,40 +120,27 @@ dbt sl query --metrics users_created,users_deleted --group-by metric_time__year
mf query --metrics users_created,users_deleted --group-by metric_time__year --order-by metric_time__year
```


You can set `is_partition` for time or categorical dimensions to define specific time spans. Additionally, use the `type_params` section to set `time_granularity` to adjust aggregation detail (like daily, weekly, and so on):
You can set `is_partition` for time to define specific time spans. Additionally, use the `type_params` section to set `time_granularity` to adjust aggregation details (hourly, daily, weekly, and so on).

<Tabs>

<TabItem value="is_partition" label="is_partition">

Use `is_partition: True` to show that a dimension exists over a specific time window. For example, a date-partitioned dimensional table. When you query metrics from different tables, the dbt Semantic Layer uses this parameter to ensure that the correct dimensional values are joined to measures.

You can also use `is_partition` for [categorical](#categorical) dimensions as well.

MetricFlow enables metric aggregation during query time. For example, you can aggregate the `messages_per_month` measure. If you originally had a `time_granularity` for the time dimensions `metric_time`, you can specify a yearly granularity for aggregation in your query:

```bash
# dbt Cloud users
dbt sl query --metrics messages_per_month --group-by metric_time__year --order-by metric_time__year

# dbt Core users
mf query --metrics messages_per_month --group-by metric_time__year --order metric_time__year
```

```yaml
dimensions:
- name: created_at
type: time
label: "Date of creation"
expr: date_trunc('day', ts_created) # ts_created is the underlying column name from the table
is_partition: True
expr: ts_created # ts_created is the underlying column name from the table
is_partition: True
type_params:
time_granularity: day
- name: deleted_at
type: time
label: "Date of deletion"
expr: date_trunc('day', ts_deleted) # ts_deleted is the underlying column name from the table
expr: ts_deleted # ts_deleted is the underlying column name from the table
is_partition: True
type_params:
time_granularity: day
@@ -173,28 +159,28 @@ measures:

<TabItem value="time_gran" label="time_granularity">

`time_granularity` specifies the smallest level of detail that a measure or metric should be reported at, such as daily, weekly, monthly, quarterly, or yearly. Different granularity options are available, and each metric must have a specified granularity. For example, a metric specified with weekly granularity couldn't be aggregated to a daily grain.
`time_granularity` specifies the grain of a time dimension. MetricFlow will transform the underlying column to the specified granularity. For example, if you add hourly granularity to a time dimension column, MetricFlow will run a `date_trunc` function to convert the timestamp to hourly. You can easily change the time grain at query time and aggregate it to a coarser grain, for example, from hourly to monthly. However, you can't go from a coarser grain to a finer grain (monthly to hourly).
Contributor

@mirnawong1 This section mentions hourly granularity, which isn't available for <=1.8. We should keep this section for 1.9+, but can we swap the word "hourly" with "daily" for <=1.8?


The current options for time granularity are day, week, month, quarter, and year.
Any granularity supported by your engine's `date_trunc` function will work, with the most common granularities being hour, day, week, month, quarter, and year.
Contributor

This isn't quite accurate (e.g., look at the options available for snowflake). Might be better to just list the options we support.
For sub-daily options, we support these (for all engines unless otherwise noted):

  • nanosecond (snowflake only)
  • microsecond (all engines except trino)
  • millisecond
  • second
  • minute
  • hour

Contributor

Updated
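
A minimal sketch of a sub-daily time dimension using one of the granularities listed in this thread. It reuses the `created_at` / `ts_created` names from the PR's own examples; choosing `minute` is only an illustration, and per the other review comments sub-daily grains are assumed to apply to versions 1.9+ only.

```yaml
dimensions:
  - name: created_at
    type: time
    label: "Date of creation"
    expr: ts_created              # underlying column name from the table
    type_params:
      time_granularity: minute    # sub-daily grain; nanosecond is Snowflake-only, microsecond excludes Trino
```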


Aggregation between metrics with different granularities is possible, with the Semantic Layer returning results at the highest granularity by default. For example, when querying two metrics with daily and monthly granularity, the resulting aggregation will be at the monthly level.
Aggregation between metrics with different granularities is possible, with the Semantic Layer returning results at the coarser granularity by default. For example, when querying two metrics with daily and monthly granularity, the resulting aggregation will be at the monthly level.
Contributor

I think coarsest would be grammatically correct here


```yaml
dimensions:
- name: created_at
type: time
label: "Date of creation"
expr: date_trunc('day', ts_created) # ts_created is the underlying column name from the table
expr: ts_created # ts_created is the underlying column name from the table
is_partition: True
type_params:
time_granularity: day
time_granularity: hour
Contributor

@mirnawong1 Can we swap in day instead of hour for <=1.8?

- name: deleted_at
type: time
label: "Date of deletion"
expr: date_trunc('day', ts_deleted) # ts_deleted is the underlying column name from the table
expr: ts_deleted # ts_deleted is the underlying column name from the table
is_partition: True
type_params:
time_granularity: day
time_granularity: day

measures:
- name: users_deleted
@@ -213,7 +199,7 @@ measures:
### SCD Type II

:::caution
Currently, there are limitations in supporting SCDs.
Currently, semantic models with SCD Type II dimensions cannot contain measures.
:::

MetricFlow supports joins against dimensions values in a semantic model built on top of a slowly changing dimension (SCD) Type II table. This is useful when you need a particular metric sliced by a group that changes over time, such as the historical trends of sales by a customer's country.
87 changes: 79 additions & 8 deletions website/docs/docs/build/metricflow-time-spine.md
@@ -6,11 +6,38 @@ sidebar_label: "MetricFlow time spine"
tags: [Metrics, Semantic Layer]
---

MetricFlow uses a timespine table to construct cumulative metrics. By default, MetricFlow expects the timespine table to be named `metricflow_time_spine` and doesn't support using a different name.
Contributor

@mirnawong1 For this entire file - can we use this deleted text for versions <=1.8, instead of the new text? The new text should only be for 1.9+.

It's common in analytics engineering to have a date dimension or "time spine" table as a base table for different types of time-based joins and aggregations. The structure of this table is typically a base column of daily or hourly dates, with additional columns for other time grains, like fiscal quarter, defined based on the base column. You can join other tables to the time spine on the base column to calculate metrics like revenue at a point in time, or to aggregate to a specific time grain.

To create this table, you need to create a model in your dbt project called `metricflow_time_spine` and add the following code:
MetricFlow requires you to define a time spine table as a project level configuration, which then is used for various time-based joins and aggregations, like cumulative metrics.
Contributor

Can we add that the time spine needs to have day grain at minimum?

Contributor

Added


<File name='metricflow_time_spine.sql'>
If you already have a date dimension or time spine table in your dbt project you can simply point MetricFlow at this table. To do this, update the `model` configuration to use this table in the semantic layer. For example, given the following directory structure, you can create two time spine configurations, `time_spine_hourly` and `time_spine_daily`.
Contributor

Do you think people migrating from the old time spine will think they need to rename the model? Not sure if we want to add a note about that (that you can keep the old name) to avoid confusion!

Contributor

Added a note about this.
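
To make the migration point in this thread concrete, here is a sketch of how an existing `metricflow_time_spine` model with a `date_day` column might be configured without renaming it. It reuses only the keys shown in the PR's new YAML example; whether every version accepts the legacy name this way is an assumption based on the reviewer's remark that the old name can be kept.

```yaml
models:
  - name: metricflow_time_spine               # existing model name kept as-is (assumption per review thread)
    time_spine:
      standard_granularity_column: date_day   # column for the standard grain of the table
    columns:
      - name: date_day
        granularity: day                      # day is the minimum grain the time spine needs
```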


![Time spine directory structure](/img/docs/building-metrics/time_spines.png)
matthewshaver marked this conversation as resolved.


```yaml
models:
- name: time_spine_hourly
time_spine:
standard_granularity_column: date_hour # column for the standard grain of your table
columns:
- name: date_hour
granularity: hour # set granularity at column-level for standard_granularity_column
- name: time_spine_daily
time_spine:
standard_granularity_column: date_day # column for the standard grain of your table
columns:
- name: date_day
granularity: day # set granularity at column-level for standard_granularity_column
```

To break down the configuration above, take the first entry: it points to a model called `time_spine_hourly` and sets the time spine configurations under the `time_spine` key. The `standard_granularity_column` is the finest grain of the table, in this case, hourly. It needs to reference a column defined under the `columns` key, in this case, `date_hour`. The `standard_granularity_column` is used as the join key for the time spine table when joining tables in MetricFlow. The granularity of the `standard_granularity_column` is set at the column level, in this case, `hour`.


If you need to create a time spine table from scratch, add the following code to your dbt project.
The example creates a time spine at a daily grain and an hourly grain. We recommend creating both an hourly and a daily time spine; MetricFlow will use the appropriate time spine based on the granularity of the metric selected to minimize data scans.
Contributor

Can we add some more detail here? Some things I think it would be helpful to know:

  • MetricFlow will use the time spine with the largest compatible granularity for a given query to ensure the most efficient query possible
  • You can add a time spine for each granularity you intend to use if minor query efficiency is more important to you than setup time / space constraints
  • We recommend having a time spine at the finest grain used in any of your dimensions to avoid unexpected errors

Contributor

Added more context


<File name='time_spine_daily.sql'>

<VersionBlock lastVersion="1.6">

@@ -27,7 +54,7 @@ with days as (
dbt_utils.date_spine(
'day',
"to_date('01/01/2000','mm/dd/yyyy')",
"to_date('01/01/2027','mm/dd/yyyy')"
"to_date('01/01/2025','mm/dd/yyyy')"
)
}}

@@ -39,6 +66,9 @@ final as (
)

select * from final
-- filter the time spine to a specific range
where date_day > dateadd(year, -4, current_timestamp())
and date_day < dateadd(day, 30, current_timestamp())
```

</VersionBlock>
@@ -58,7 +88,7 @@ with days as (
dbt.date_spine(
'day',
"to_date('01/01/2000','mm/dd/yyyy')",
"to_date('01/01/2027','mm/dd/yyyy')"
"to_date('01/01/2025','mm/dd/yyyy')"
)
}}

@@ -70,6 +100,8 @@ final as (
)

select * from final
where date_day > dateadd(year, -4, current_timestamp())
and date_day < dateadd(day, 30, current_timestamp())
```

</VersionBlock>
@@ -86,7 +118,7 @@ with days as (
{{dbt_utils.date_spine(
'day',
"DATE(2000,01,01)",
"DATE(2030,01,01)"
"DATE(2025,01,01)"
)
}}
),
@@ -98,6 +130,9 @@ final as (

select *
from final
-- filter the time spine to a specific range
where date_day > dateadd(year, -4, current_timestamp())
and date_day < dateadd(day, 30, current_timestamp())
```

</VersionBlock>
@@ -112,7 +147,7 @@ with days as (
{{dbt.date_spine(
'day',
"DATE(2000,01,01)",
"DATE(2030,01,01)"
"DATE(2025,01,01)"
)
}}
),
@@ -124,8 +159,44 @@ final as (

select *
from final
-- filter the time spine to a specific range
where date_day > dateadd(year, -4, current_timestamp())
and date_day < dateadd(day, 30, current_timestamp())
```

</VersionBlock>

You only need to include the `date_day` column in the table. MetricFlow can handle broader levels of detail, but it doesn't currently support finer grains.
Contributor

@mirnawong1 for old versions, we can update this to say:
"...but finer grains are only supported in versions 1.9+."

## Hourly time spine
<File name='time_spine_hourly.sql'>

```sql
-- filename: time_spine_hourly.sql
{{
config(
materialized = 'table',
)
}}

with hours as (

{{
dbt.date_spine(
'hour',
"to_date('01/01/2000','mm/dd/yyyy')",
"to_date('01/01/2025','mm/dd/yyyy')"
)
}}

),

final as (
select cast(date_hour as timestamp) as date_hour
from hours
)

select * from final
-- filter the time spine to a specific range
where date_hour > dateadd(year, -4, current_timestamp())
and date_hour < dateadd(day, 30, current_timestamp())
```
</File>