From 0a4268884790d71168e8671a47a528e448ea011c Mon Sep 17 00:00:00 2001 From: Soren Spicknall Date: Wed, 16 Aug 2023 16:25:16 -0500 Subject: [PATCH] dbt dev docs updates - linting and some extra context/links (#2896) --- docs/warehouse/developing_dbt_models.md | 94 +++++++++++++++++-------- 1 file changed, 66 insertions(+), 28 deletions(-) diff --git a/docs/warehouse/developing_dbt_models.md b/docs/warehouse/developing_dbt_models.md index eb37addf3b..dc6abd6936 100644 --- a/docs/warehouse/developing_dbt_models.md +++ b/docs/warehouse/developing_dbt_models.md @@ -1,13 +1,14 @@ (developing-dbt-models)= + # Developing models in dbt Information related to contributing to the [Cal-ITP dbt project](https://github.com/cal-itp/data-infra/tree/main/warehouse). ## Resources -* If you have questions specific to our project or you encounter any issues when developing, please reach out in the [`#data-warehouse-devs`](https://cal-itp.slack.com/archives/C050ZNDUL21) or [`#data-office-hours`](https://cal-itp.slack.com/archives/C02KH3DGZL7) Cal-ITP Slack channels. +* If you have questions specific to our project or you encounter any issues when developing, please bring those questions to the [`#data-warehouse-devs`](https://cal-itp.slack.com/archives/C050ZNDUL21) or [`#data-office-hours`](https://cal-itp.slack.com/archives/C02KH3DGZL7) Cal-ITP Slack channels. Working through questions "in public" helps build shared knowledge that's searchable later on. * For Cal-ITP-specific data warehouse documentation, including high-level concepts and naming conventions, see [our Cal-ITP dbt documentation site](https://dbt-docs.calitp.org/#!/overview). This documentation is automatically generated by dbt, and incorporates the table- and column-level documentation that developers enter in YAML files in the dbt project. -* For general dbt concepts (for example, dbt [Jinja](https://docs.getdbt.com/guides/advanced/using-jinja) or [tests](https://docs.getdbt.com/docs/build/tests)), see the [general dbt documentation site](https://docs.getdbt.com/docs/introduction). +* For general dbt concepts (for example, [models](https://docs.getdbt.com/docs/build/models), dbt [Jinja](https://docs.getdbt.com/guides/advanced/using-jinja) or [tests](https://docs.getdbt.com/docs/build/tests)), see the [general dbt documentation site](https://docs.getdbt.com/docs/introduction). * For general SQL or BigQuery concepts (for example, [tables](https://cloud.google.com/bigquery/docs/tables-intro), [views](https://cloud.google.com/bigquery/docs/views-intro), or [window functions](https://cloud.google.com/bigquery/docs/reference/standard-sql/window-function-calls)), see the [BigQuery docs site](https://cloud.google.com/bigquery/docs). ## How to contribute to the dbt project @@ -36,12 +37,15 @@ Once your models are working the way you want and you have added all necessary d Because the warehouse is collectively maintained and changes can affect a variety of users, please open PRs against `main` when work is ready to merge and keep an eye out for comments and questions from reviewers, who might require tweaks before merging. #### Video walkthrough + For an example of working with dbt in JupyterHub, see the recording of the [original onboarding call in April 2023 (requires Cal-ITP Google Drive access).](https://drive.google.com/file/d/1NDh_4u0-ROsH0w8J3Z1ccn_ICLAHtDhX/view?usp=drive_link) A few notes on this video: + * The documentation shown is an older version of this docs page; the information shared verbally is correct but the page has been updated. 
* The bug encountered towards the end of the video (that prevented us from running dbt tests) has been fixed.
* The code owners mentioned in the video have changed; consult in Slack for process guidance.

(modeling-considerations)=
+
## Modeling considerations

When developing or updating dbt models, there are some considerations which may differ from a notebook-based analysis. These can be thought of as a checklist or decision tree of questions that you should run through whenever you are editing or creating a dbt model. Longer explanations of each item are included below.

@@ -50,18 +54,18 @@ When developing or updating dbt models, there are some considerations which may
flowchart TD

workflow_type[Are you fixing a bug or creating something new?]
-identify_bug[Identify the cause of your bug.]
-change_models[Make your changes.]
+identify_bug[Identify the cause of your bug]
+change_models[Make your changes]
tool_choice[Should it be a dbt model?]
-not_dbt[Use a notebook or dashboard for your analysis.]
+not_dbt[Use a notebook or dashboard for your analysis]
grain[What is the grain of your model?]
grain_exists[Is there already a model with your desired grain?]
-new_column[Add a column to the existing model.]
-new_model[Create a new model.]
-test_changes[Test your changes.]
-new_column[Add a column to the existing model.]
-tests_and_docs[Add dbt tests and documentation.]
-merge_model_changes[Merge your changes.]
+new_column[Add a column to the existing model]
+new_model[Create a new model]
+test_changes[Test your changes]
+new_column[Add a column to the existing model]
+tests_and_docs[Add dbt tests and documentation]
+merge_model_changes[Merge your changes]

workflow_type -- fixing a bug --> identify_bug
identify_bug --> change_models
@@ -80,7 +84,8 @@
tests_and_docs --> merge_model_changes
```

(identify-bug)=
-### Identify the cause of your bug.
+
+### Identify the cause of your bug

```{admonition} Example bug troubleshooting walkthrough
Here is a series of recordings showing a workflow for debugging a failing dbt test. The resulting PR is [#2892](https://github.com/cal-itp/data-infra/pull/2892).
@@ -92,6 +97,7 @@ Here is a series of recordings showing a workflow for debugging a failing dbt te
```

Usually, bugs are caused by:
+
* New or historical data issues. For example, an agency may be doing something in their GTFS data that we didn't expect and this may have broken one of our models. This can happen with brand new data that is coming in or in historical data that wasn't included in local testing (this is especially relevant for RT data, where local testing usually includes a very small subset of the full data.)
* GTFS or data schema bugs. Sometimes we may have misinterpreted the GTFS spec (or another incoming data model) and modeled something incorrectly.
* SQL bugs. Sometimes we may have written SQL incorrectly (for example, used the wrong kind of join.)
@@ -109,18 +115,22 @@ If you noticed an issue that wasn't caused by a failing test, you can start with

In either case, you may need to consider upstream models. To identify your model's parents, you can look at the [dbt docs website](https://dbt-docs.calitp.org/#!/overview) page for your model. [See the dbt docs](https://docs.getdbt.com/docs/collaborate/documentation#navigating-the-documentation-site) for how to look at the model's lineage. You can modify the model selector in the bottom middle to just `+<your_model>` to only see the model's parents.
You can also run `poetry run dbt ls -s +<your_model> --resource-type model` to see a model's parents just on the command line.

Try to figure out where the root cause of the problem is occurring. This may involve running ad-hoc SQL queries to inspect the models involved.

(tool_choice)=
+
### Should it be a dbt model?

Changes to dbt models are likely to be appropriate when one or more of the following is true:
+
* There is a consistent or ongoing need for this data. dbt can ensure that transformations are performed consistently at scale, every day.
* The data is big. Doing transformations in BigQuery can be more performant than doing them in notebooks or any workflow where the large data must be loaded into local memory.
-* We want to use the same model across multiple domains or tools. The BigQuery data warehouse is the easiest way to provide consistent data throughout the Cal-ITP data ecosystem (in JupyterHub, Metabase, open data publishing, the reports site, etc.)
+* We want to use the same data across multiple domains or tools. The BigQuery data warehouse is the easiest way to provide consistent data throughout the Cal-ITP data ecosystem (in JupyterHub, Metabase, open data publishing, the reports site, etc.)

dbt models may not be appropriate when:
-* You are doing exploratory data analysis. It will almost always be faster to do initial exploration of data via Jupyter/Python than in SQL.
+
+* You are doing exploratory data analysis, especially on inconsistently-constructed data. It will almost always be faster to do initial exploration of data via Jupyter/Python than in SQL. If you only plan to use the data for a short period of time, or plan to reshape it many times speculatively before you settle on a more long-lived form, you probably don't need to represent it with a dbt model quite yet.
* You want to apply a simple transformation (for example, a grouped summary or filter) to answer a specific question. In this case, it may be more appropriate to create a Metabase dashboard with the desired transformations.

(model-grain)=
+
### What is the grain of your model?

[*Grain* means "what does a row represent"](https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/grain/). For example: Do you want one row per route per day? One row per fare transaction? One row per organization per month?

This concept of grain can be one of the biggest differences between notebook-based analysis and warehouse analytics engineering. In notebooks, you may be making a lot of transformations and saving each step out as its own dataframe, and you may use functions for reusable transformation steps. In warehouse development, we want to be focused on making reusable models where the data itself is the common building block across analyses. That often means trying to make only one table in the warehouse for each grain, regardless of how many different types of analysis it might be used for.

(grain-exists)=
+
### Is there already a model with this grain?

If there is already a model with the grain you are targeting, you should almost always add new columns to that existing model rather than making a new model with the same grain.
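A quick way to check whether an existing table already matches the grain you need is to compare its total row count against the number of distinct key combinations. The sketch below assumes a hypothetical mart table and key columns; substitute whatever model you are actually evaluating.

```sql
-- Hypothetical table and column names, for illustration only
SELECT
    COUNT(*) AS row_ct,
    COUNT(DISTINCT CONCAT(CAST(service_date AS STRING), '__', route_id)) AS key_ct
FROM `mart_gtfs.fct_daily_routes`
-- If row_ct = key_ct, the table is already at the service_date + route_id grain,
-- so adding a column to it is usually better than creating a new model.
```
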
@@ -141,20 +152,24 @@ To figure out if there is a model with your desired grain, you can [search the d For dimensions you may need to think more about whether you are truly making a new dimension, or whether you are simply applying a filter on an existing dimension (for example, `dim_bus_vehicles` could be a subset of an existing `dim_vehicles` dimension model, in which case you could just add a boolean column `is_bus` on `dim_vehicles` rather than making a new dedicated model.) (change-models)= -### Make your changes. + +### Make your changes The kind of changes you make will depend on what you've discovered in previous steps. Fixing a bug might involve changing model SQL, changing tests to reflect a new understanding, or something else. Adding a column might involve a simple change on one model or require updating several parent models. Creating a new model for brand new data may involve a new external table or it might be a straightforward transformation just in the mart. Some examples of different types of changes are listed below. If you find yourself making big changes that seem likely to significantly affect other users, you may need to step back and convene a conversation to make sure that everyone is on board; see for example [this Google Doc about Airtable schema changes](https://docs.google.com/document/d/1F4METWYNip5nobcPZSUg-5XGtC1XX1hWMQfEmERPPd4/edit#heading=h.dsfdelw2zz3n) where stakeholders confirmed how they wanted to handle schema changes in the warehouse. The [downstream impacts section below](model-downstream-impacts) has suggestions for how to assess the impacts of your changes. #### Example bug fix PRs + Here are a few example `data-infra` PRs that fixed past bugs: -- [PR #2076](https://github.com/cal-itp/data-infra/pull/2076) fixed two bugs: There was a hardcoded incorrect value in our SQL that was causing Sundays to not appear in our scheduled service index (SQL syntax bug), and there was a bug in how we were handling the relationship between `calendar_dates` and `calendar` (GTFS logic bug). -- [PR #2623](https://github.com/cal-itp/data-infra/pull/2623) fixed bugs caused by unexpected calendar data from a producer. +* [PR #2076](https://github.com/cal-itp/data-infra/pull/2076) fixed two bugs: There was a hardcoded incorrect value in our SQL that was causing Sundays to not appear in our scheduled service index (SQL syntax bug), and there was a bug in how we were handling the relationship between `calendar_dates` and `calendar` (GTFS logic bug). +* [PR #2623](https://github.com/cal-itp/data-infra/pull/2623) fixed bugs caused by unexpected calendar data from a producer. #### Example new column PRs + Here are a few example `data-infra` PRs that added columns to existing models: + * [PR #2778](https://github.com/cal-itp/data-infra/pull/2778) is a simple example of adding a column that already exists in staging to a mart table. * For intermediate examples of adding a column in a staging table and propagating it through a few different downstream models, see * [PR #2768](https://github.com/cal-itp/data-infra/pull/2768) @@ -162,14 +177,17 @@ Here are a few example `data-infra` PRs that added columns to existing models: * [PR #2383](https://github.com/cal-itp/data-infra/pull/2383) adds a column to Airtable data end-to-end (starting from the raw data/external tables; this involves non-dbt code). 
#### Example new model PRs
+
Here are a few `data-infra` PRs that created brand new models:
+
* [PR #2686](https://github.com/cal-itp/data-infra/pull/2686) created a new model based on existing warehouse data.
* For examples of adding models to dbt end-to-end (starting from raw data/external tables; this involves non-dbt code), see:
  * [PR #2509](https://github.com/cal-itp/data-infra/pull/2509)
  * [PR #2781](https://github.com/cal-itp/data-infra/pull/2781)

(test-changes)=
-### Test your changes.
+
+### Test your changes

Once you have made some changes, it is important to test them.

@@ -182,26 +200,35 @@ The first step is running your changes in the test/staging environment. You can
What to test/check will vary based on what you're doing, but below are some example things to consider.

#### Column values
+
Are the values in your column/model what you expect? For example, are there nulls? Does the column have all the values you anticipated (for example, if you have a day of the week column, is data from all 7 days present)? If it's numeric, what are the minimum and maximum values; do they make sense (for example, if you have a percentage column, is it always between 0 and 100)? What is the most common value?
+
* To check nulls:

  ```sql
  SELECT * FROM <your_table> WHERE <column> IS NULL
  ```

* To check distinct values in a column:

  ```sql
  SELECT DISTINCT <column> FROM <your_table>
  ```

* To check min/max:

  ```sql
  SELECT MIN(<column>), MAX(<column>) FROM <your_table>
  ```

* To check most common values:

  ```sql
  SELECT
  <column>,
  COUNT(*) AS ct
@@ -215,12 +242,14 @@
To confirm that the grain is what you expect, you should check whether an anticipated unique key is actually unique. For example, if you were making a daily shapes table, you might expect that `date + feed_key + shape_id` would be unique. Similarly, you should have a ballpark idea of the order of magnitude of the number of rows you expect. If you're making a yearly organizations table and your table has a million rows, something is likely off. Some example queries could be:

* To check row count:

  ```sql
  SELECT COUNT(*) FROM <your_table>
  ```
@@ -228,7 +257,8 @@
* To check row count by some attribute (for example, rows per date):

  ```sql
  SELECT <attribute_column>, COUNT(*) AS ct
  FROM <your_table>
  GROUP BY 1
  ```

* To check uniqueness based on a combination of a few columns:

  ```sql
  WITH tbl AS (
  SELECT * FROM <your_table>
  ),
@@ -250,6 +280,7 @@
  LEFT JOIN tbl USING (<col_1>, <col_2>, <col_3>)
  ORDER BY <col_1>, <col_2>, <col_3>
  ```

#### Performance

While testing, you should keep an eye on the performance (cost/data efficiency) of the model:
@@ -276,7 +307,9 @@ If you make your table incremental, you should make sure to run both a full refr
```

(model-downstream-impacts)=
+
#### Downstream impacts
+
Another important consideration is the potential downstream impacts of your changes, particularly if you are changing existing models.

You can run dbt tests on the downstream models using `poetry run dbt test -s <your_model>+`. You should make sure that your changes do not cause new test failures in downstream models.

Check which models are downstream of your changes using `poetry run dbt ls -s <your_model>+ --resource-type exposure`.

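To make the selector syntax above concrete, here is a hedged sketch of how you might exercise a changed model and everything downstream of it from the warehouse directory; the model name is invented for illustration.

```bash
# Hypothetical model name, for illustration only
MODEL=fct_daily_scheduled_service

# Build and test the model plus everything downstream of it in your test environment
poetry run dbt run -s "${MODEL}+"
poetry run dbt test -s "${MODEL}+"

# List downstream models and any exposures (dashboards, published data) that depend on it
poetry run dbt ls -s "${MODEL}+" --resource-type model
poetry run dbt ls -s "${MODEL}+" --resource-type exposure
```
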
#### Other considerations
+
Other questions will be more specific to your changes or goals, but it's usually a good idea to take a second and brainstorm things that you would expect to be true and check whether your model reflects them. For example, we expect more trip activity during AM/PM peak periods than in the middle of the night; is that true in your model? What is the balance of weekend to weekday activity in your model, and does it make sense for the context?

(tests-and-docs)=
-### Add dbt tests and documentation.
+
+### Add dbt tests and documentation

Once you are satisfied with your changes, you should add tests and documentation, both of which are vital to keeping the project maintainable over time.

(dbt-tests)=
+
#### dbt tests

[dbt tests](https://docs.getdbt.com/docs/build/tests) help us ensure baseline model validity and guarantees over time (for example: "this ID is unique"). A dbt test failure should be something that you'd want to fix quickly to ensure models work for downstream users. So, **a test failure should be something you'd want to act on** by doing something like fixing, dropping, or adding a warning flag on failing rows. A test can also be thought of as an assertion: a not-null test on a column asserts that that column is never null.
@@ -317,7 +353,9 @@ Try to write a description that will make sense to future readers. It is helpful

Model documentation should make the [grain](model-grain) clear.

(merge-model-changes)=
+
### Merge your changes
+
Once you have finished work, you should make a PR to get your changes merged into `main`. PRs that sit and become stale may become problematic if other people make changes to models before they merge that cause them to behave unexpectedly.

Once your changes merge, if they will impact other users (for example by changing a high-traffic model), you may want to announce your changes on Slack in [`#data-warehouse-devs`](https://cal-itp.slack.com/archives/C050ZNDUL21), [`#data-analysis`](https://cal-itp.slack.com/archives/C02H6JUSS9L), or a similar channel.
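As a recap of the tests-and-documentation step above, here is a minimal, hypothetical sketch of the YAML where dbt column descriptions and tests live; the model, column, and test choices are invented for illustration and are not actual Cal-ITP models.

```yaml
version: 2

models:
  - name: fct_daily_scheduled_service  # hypothetical model name
    description: One row per service date per schedule feed (grain is service_date + feed_key).
    columns:
      - name: service_date
        description: The service date being summarized.
        tests:
          - not_null
      - name: feed_key
        description: Key of the GTFS schedule feed that produced this row.
        tests:
          - not_null
      - name: ttl_service_hours
        description: Total scheduled service hours for this feed on this date.
```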