
add subdaily granularity #5882

Merged
merged 23 commits on Aug 15, 2024
Conversation

@mirnawong1 mirnawong1 commented Aug 2, 2024

resolves #5857
resolves #5908

this pr adds draft content to explain subdaily granularities in MF.

[ ] Needs PM review
[ ] Needs docs review

Outstanding questions

  • Can the user use both the `default_grain` and `time_granularity`? How does each connect to the `time_spine`, and when should a user use which one, or is it up to them?
  • What should we communicate wrt cumulative metrics?
  • @Jstein77 do you think we should use the new 'sub-daily' page to explain granularities in general?

@mirnawong1 mirnawong1 requested a review from a team as a code owner August 2, 2024 15:12

<!--dimensions are non-aggregatable expressions that define the level of aggregation for a metric used to define how data is sliced or grouped in a metric. Since groups can't be aggregated, they're considered to be a property of the primary or unique entity of the table.

Groups are defined within semantic models, alongside entities and measures, and correspond to non-aggregatable columns in your dbt model that provides categorical or time-based context. In SQL, dimensions is typically included in the GROUP BY clause.-->

All dimensions require a `name`, `type` and in some cases, an `expr` parameter. The `name` for your dimension must be unique to the semantic model and can not be the same as an existing `entity` or `measure` within that same model.
All dimensions require a `name` and `type` and, in some cases, can optionally include an `expr` parameter. The `name` for your Dimension must be unique within the same semantic model.

It seems redundant to say both "in some cases" and "optionally" - maybe pick one or the other?


Yup good call. I will update.

```
  - name: is_bulk_transaction
    type: categorical
    expr: case when quantity > 10 then true else false end
```

MetricFlow requires that all dimensions have a primary entity. This is to guarantee unique dimension names. If your data source doesn't have a primary entity, you need to assign the entity a name using the `primary_entity: entity_name` key. It doesn't necessarily have to map to a column in that table and assigning the name doesn't affect query generation.
Dimensions are bound to the primary entity of the semantic model in which they are defined. For example, if a dimension called `is_bulk_transaction` is defined in a model with `transaction` as a primary entity, then `is_bulk_transaction` is scoped to the `transaction` entity. To reference this dimension you would use the fully qualified dimension name `transaction__is_bulk_transaction`.

Might be nice to use an example dimension name that makes it somewhat clear why we bind it to the entity name. E.g., something like transaction__country or just changing the name to something like transaction__is_bulk would make this feel less redundant.


Fixed.


MetricFlow requires that all semantic models have a primary entity. This is to guarantee unique dimension names. If your data source doesn't have a primary entity, you need to assign the entity a name using the `primary_entity` key. It doesn't necessarily have to map to a column in that table and assigning the name doesn't affect query generation. An example of defining a primary entity for a data source that doesn't have a primary entity column is below:
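Purely as an illustration (the model and entity names here are invented, not taken from this PR), such a configuration could look roughly like this:

```
semantic_models:
  - name: bookings_source
    model: ref('fct_bookings')
    # fct_bookings has no single column that uniquely identifies a row,
    # so assign a name for the virtual primary entity here
    primary_entity: booking
```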

Can we add that for a virtual primary entity like this, you should try to make the name unique? I don't think we enforce that (we should) but it's definitely helpful


👍


The current options for time granularity are day, week, month, quarter, and year.
Any granularity supported by your engine's `date_trunc` function will work, with the most common granularities being hour, day, week, month, quarter, and year.

This isn't quite accurate (e.g., look at the options available for snowflake). Might be better to just list the options we support.
For sub-daily options, we support these for all engines (unless otherwise noted):

  • nanosecond (snowflake only)
  • microsecond (all engines except trino)
  • millisecond
  • second
  • minute
  • hour


Updated


Aggregation between metrics with different granularities is possible, with the Semantic Layer returning results at the highest granularity by default. For example, when querying two metrics with daily and monthly granularity, the resulting aggregation will be at the monthly level.
Aggregation between metrics with different granularities is possible, with the Semantic Layer returning results at the coarser granularity by default. For example, when querying two metrics with daily and monthly granularity, the resulting aggregation will be at the monthly level.

I think coarsest would be grammatically correct here


<File name='metricflow_time_spine.sql'>
If you already have a date dimension or time spine table in your dbt project you can simply point MetricFlow at this table. To do this, update the `model` configuration to use this table in the semantic layer. For example, given the following directory structure, you can create two time spine configurations, `time_spine_hourly` and `time_spine_daily`.
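As a rough sketch of what those two configurations might look like (assuming the 1.9-style `time_spine` model YAML and invented column names `date_hour` and `date_day`):

```
models:
  - name: time_spine_hourly
    time_spine:
      standard_granularity_column: date_hour  # hourly column in your existing table
    columns:
      - name: date_hour
        granularity: hour
  - name: time_spine_daily
    time_spine:
      standard_granularity_column: date_day  # daily column in your existing table
    columns:
      - name: date_day
        granularity: day
```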

Do you think people migrating from the old time spine will think they need to rename the model? Not sure if we want to add a note about that (that you can keep the old name) to avoid confusion!


Added a note about this.



If you need to create a time spine table from scratch, add the following code to your dbt project.
The example creates a time spine at a daily grain and an hourly grain. We recommend creating both an hourly and daily time spine, MetricFlow will use the appropriate time spine based on the granularity of the metric selected to minimize data scans.

Can we add some more detail here? Some things I think it would be helpful to know:

  • MetricFlow will use the time spine with the largest compatible granularity for a given query to ensure the most efficient query possible
  • You can add a time spine for each granularity you intend to use if minor query efficiency is more important to you than setup time / space constraints
  • We recommend having a time spine at the finest grain used in any of your dimensions to avoid unexpected errors


Added more context

### Conversion metrics
## Default granularity for metircs

It's possible to define a default time granularity for metrics that differs from the granularity of the default aggregation time dimensions (`metric_time`). This is useful if your time dimension has a very fine grain, like second or hour, but you typically query metrics rolled up at a coarser grain. The granularity can be set using the `time_granularity` parameter on the metric and defaults to `day`.
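A minimal sketch of how that parameter might be set on a metric (the metric and measure names are invented for illustration):

```
metrics:
  - name: order_total
    label: Order total
    type: simple
    type_params:
      measure: order_total
    # report this metric at a monthly grain by default, even if the
    # underlying time dimension is defined at an hourly grain
    time_granularity: month
```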

Would note that while it defaults to day, if day is not available because the dimension is defined at a coarser granularity, it will default to the defined granularity for the dimension!

@@ -84,7 +84,7 @@ semantic_models:
- name: transaction_date
type: time
type_params:
time_granularity: day
time_granularity: day # Additional options include hour, week, month, quarter, year, and so on.

seems weird to exclude other options like second and below if we're going to list so many. do we need this list at all?

- MetricFlow requires all dimensions to be tied to a primary entity.
Dimensions have the following characteristics:

- There are two types of dimensions: categorical and time. Categorical dimensions are for things you can't measure in numbers, while time dimensions represent dates.

"...while time dimensions represent dates and timestamps"

website/sidebars.js (outdated; resolved)

@courtneyholcomb courtneyholcomb left a comment


@mirnawong1 leaving comments here for what should be version blocked!

@@ -173,28 +161,34 @@ measures:

<TabItem value="time_gran" label="time_granularity">

`time_granularity` specifies the smallest level of detail that a measure or metric should be reported at, such as daily, weekly, monthly, quarterly, or yearly. Different granularity options are available, and each metric must have a specified granularity. For example, a metric specified with weekly granularity couldn't be aggregated to a daily grain.
`time_granularity` specifies the grain of a time dimension. MetricFlow will transform the underlying column to the specified granularity. For example, if you add hourly granularity to a time dimension column, MetricFlow will run a `date_trunc` function to convert the timestamp to hourly. You can easily change the time grain at query time and aggregate it to a coarser grain, for example, from hourly to monthly. However, you can't go from a coarser grain to a finer grain (monthly to hourly).
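For illustration (an invented dimension, not one from this PR), an hourly time dimension could be declared like this, and MetricFlow would truncate the underlying timestamp to the hour:

```
dimensions:
  - name: created_at
    type: time
    expr: created_at_ts  # raw timestamp column
    type_params:
      time_granularity: hour  # applied roughly as date_trunc('hour', created_at_ts)
```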

@mirnawong1 This section mentions hourly granularity, which isn't available for <=1.8. We should keep this section for 1.9+, but can we swap the word "hourly" with "daily" for <=1.8?


The current options for time granularity are day, week, month, quarter, and year.
Our supported granularities are:
* nanosecond (Snowflake only)

@mirnawong1 These sub-daily granularity options are showing up for all versions. Can we keep them all for 1.9+, but remove anything smaller than day for <=1.8?

is_partition: True
type_params:
time_granularity: day
time_granularity: hour

@mirnawong1 Can we swap in day instead of hour for <=1.8?

@@ -6,11 +6,45 @@ sidebar_label: "MetricFlow time spine"
tags: [Metrics, Semantic Layer]
---

MetricFlow uses a timespine table to construct cumulative metrics. By default, MetricFlow expects the timespine table to be named `metricflow_time_spine` and doesn't support using a different name.

@mirnawong1 For this entire file - can we use this deleted text for versions <=1.8, instead of the new text? The new text should only be for 1.9+.

```

</VersionBlock>

You only need to include the `date_day` column in the table. MetricFlow can handle broader levels of detail, but it doesn't currently support finer grains.

@mirnawong1 for old versions, we can update this to say:
"...but finer grains are only supported in versions 1.9+."

import SLCourses from '/snippets/_sl-course.md';

<SLCourses/>

### Conversion metrics
## Default granularity for metircs

@mirnawong1 This whole section called "Default granularity for metrics" should be version blocked to 1.9+.
Also noting that "metrics" is misspelled in the title (though maybe that's already fixed in production!)

@@ -232,10 +283,20 @@ filter: |
{{ TimeDimension('time_dimension', 'granularity') }}

filter: |
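Purely to illustrate the shape such a filter takes (the entity, dimension, grain, and date below are invented), a complete example might look like:

```
filter: |
  {{ TimeDimension('order__ordered_at', 'month') }} >= '2024-01-01'
```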

@mirnawong1 Can we version block this metric filter example to versions 1.8+?
