From f8dfdb311e45912842f17425363fd27143684441 Mon Sep 17 00:00:00 2001
From: Laurie <55149902+lauriemerrell@users.noreply.github.com>
Date: Wed, 16 Aug 2023 10:12:38 -0500
Subject: [PATCH] dbt modeling guidance docs (#2874)

* start working on modeling decisions flowchart
* fix admonition
* fix single quotes
* expand on bug identification and test linking
* fixing bugs, grain overview
* fix box type and more docs
* example testing
* add testing, documentation docs
* fix links
* rephrase some things, finish flowchat
* fix emoji breaking mermaid and add callout about incremental models
* add example stakeholder doc and rearrange a bit
* move incremental warning, a few more tweaks
* actually move incremental warning
* pr comments
* add loom links for debugging a failing test
* reorder
* tweak video link name
* clarify bug types
* more clarifications, esp for incremental models
* linter
* reference correct github action
---
 docs/warehouse/developing_dbt_models.md | 314 ++++++++++++++++++++++--
 1 file changed, 295 insertions(+), 19 deletions(-)

diff --git a/docs/warehouse/developing_dbt_models.md b/docs/warehouse/developing_dbt_models.md
index 618af21c28..eb37addf3b 100644
--- a/docs/warehouse/developing_dbt_models.md
+++ b/docs/warehouse/developing_dbt_models.md
@@ -5,55 +5,331 @@ Information related to contributing to the [Cal-ITP dbt project](https://github.

## Resources

-* If you have questions specific to our project or you encounter any issues when developing, please reach out in the `#data-warehouse-devs` or `#data-office-hours` Cal-ITP Slack channels.
+* If you have questions specific to our project or you encounter any issues when developing, please reach out in the [`#data-warehouse-devs`](https://cal-itp.slack.com/archives/C050ZNDUL21) or [`#data-office-hours`](https://cal-itp.slack.com/archives/C02KH3DGZL7) Cal-ITP Slack channels.
* For Cal-ITP-specific data warehouse documentation, including high-level concepts and naming conventions, see [our Cal-ITP dbt documentation site](https://dbt-docs.calitp.org/#!/overview). This documentation is automatically generated by dbt, and incorporates the table- and column-level documentation that developers enter in YAML files in the dbt project.
-* For general dbt concepts (for example, dbt [Jinja](https://docs.getdbt.com/guides/advanced/using-jinja) or [tests](https://docs.getdbt.com/docs/build/tests)), see the [general dbt documentation site](https://docs.getdbt.com/docs/introduction)
+* For general dbt concepts (for example, dbt [Jinja](https://docs.getdbt.com/guides/advanced/using-jinja) or [tests](https://docs.getdbt.com/docs/build/tests)), see the [general dbt documentation site](https://docs.getdbt.com/docs/introduction).
* For general SQL or BigQuery concepts (for example, [tables](https://cloud.google.com/bigquery/docs/tables-intro), [views](https://cloud.google.com/bigquery/docs/views-intro), or [window functions](https://cloud.google.com/bigquery/docs/reference/standard-sql/window-function-calls)), see the [BigQuery docs site](https://cloud.google.com/bigquery/docs).

## How to contribute to the dbt project

### Getting started

-To get set up to contribute to the dbt project via JupyterHub, follow [the README in the data-infra repo warehouse folder](https://github.com/cal-itp/data-infra/blob/main/warehouse/README.md#setting-up-the-project-in-your-jupyterhub-personal-server). If you hit any trouble with setup, let folks know in the #data-warehouse-devs or #data-office-hours channel in the Cal-ITP Slack.

+To get set up to contribute to the dbt project via JupyterHub, follow [the README in the data-infra repo warehouse folder](https://github.com/cal-itp/data-infra/blob/main/warehouse/README.md#setting-up-the-project-in-your-jupyterhub-personal-server). If you hit any trouble with setup, let folks know in the [`#data-warehouse-devs`](https://cal-itp.slack.com/archives/C050ZNDUL21) or [`#data-office-hours`](https://cal-itp.slack.com/archives/C02KH3DGZL7) Cal-ITP Slack channels.

-We also recommend that everyone who does dbt development joins the `#data-warehouse-devs` channel in the Cal-ITP Slack workspace to ask questions, collaborate, and build shared knowledge.
+We recommend that everyone who does dbt development joins the [`#data-warehouse-devs`](https://cal-itp.slack.com/archives/C050ZNDUL21) channel in the Cal-ITP Slack to ask questions, collaborate, and build shared knowledge.

### Developer workflow

+```{admonition} See next section for modeling considerations
+This section describes the high-level mechanics/process of the developer workflow to edit the dbt project.
+**Please read the [next section](developing_dbt_models#modeling-considerations) for things you should consider from the data modeling perspective.**
+```

To test your work while developing dbt models, you can edit the `.sql` files for your models, save your changes, and then [run the model from the command line](https://github.com/cal-itp/data-infra/tree/main/warehouse#dbt-commands) to execute the SQL you updated. To inspect tables as you are working, the fastest method is usually to run some manual test queries or "preview" the tables in the [BigQuery user interface](https://console.cloud.google.com/bigquery?project=cal-itp-data-infra-staging). You can also use something like [`pandas.read_gbq`](https://pandas.pydata.org/docs/reference/api/pandas.read_gbq.html) to perform example queries in a notebook.

When you run dbt commands locally on JupyterHub, your models will be created in the `cal-itp-data-infra-staging.<username>_<dataset>` BigQuery dataset. Note that this is in the `cal-itp-data-infra-staging` Google Cloud Platform project, *not* the production `cal-itp-data-infra` project.

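As a concrete sketch of that kind of spot check, the query below previews a model you just built in your namespaced staging dataset; you could run it in the BigQuery console or through `pandas.read_gbq`. The `<username>` and `<your model>` names are placeholders to replace with your own, not real objects.

```
-- Preview a model you just built with dbt on JupyterHub.
-- Replace <username> and <your model> with your own dataset and model names.
SELECT *
FROM `cal-itp-data-infra-staging.<username>_mart_gtfs.<your model>`
LIMIT 100
```
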
-Once your models are working the way you want, please make sure to update the associated YAML files (there will generally be one or two YAML files per folder with model tests, documentation, and additional configuration.) Especially if you created a brand-new model, you will want to add tests for things like unique, non-null primary keys and valid foreign keys. The YAML is also where table- and column-level documentation is populated. [Here is an example YAML file from our project](https://github.com/cal-itp/data-infra/blob/main/warehouse/models/mart/gtfs/_mart_gtfs_dims.yml), and [here is an example PR that created a new mart table with accompanying documentation](https://github.com/cal-itp/data-infra/pull/2097).
+Once your models are working the way you want and you have added all necessary documentation and tests in YAML files ([see below](developing_dbt_models#modeling-considerations) for more on modeling, documentation, and testing considerations), you are ready to merge.

+Because the warehouse is collectively maintained and changes can affect a variety of users, please open PRs against `main` when work is ready to merge and keep an eye out for comments and questions from reviewers, who might require tweaks before merging.
-Because the warehouse is collectively maintained and changes can affect a variety of users, please open PRs against `main` when work is ready to merge and keep an eye out for comments and questions from reviewers, who might require tweaks before merging. See CONTRIBUTING.md in the repo for more information on GitHub practices.)

+#### Video walkthrough
+For an example of working with dbt in JupyterHub, see the recording of the [original onboarding call in April 2023 (requires Cal-ITP Google Drive access)](https://drive.google.com/file/d/1NDh_4u0-ROsH0w8J3Z1ccn_ICLAHtDhX/view?usp=drive_link). A few notes on this video:
+* The documentation shown is an older version of this docs page; the information shared verbally is correct but the page has been updated.
+* The bug encountered towards the end of the video (that prevented us from running dbt tests) has been fixed.
+* The code owners mentioned in the video have changed; consult in Slack for process guidance.

+(modeling-considerations)=
## Modeling considerations

-When developing dbt models, there are some considerations which may differ from considerations for a notebook-based analysis.
+When developing or updating dbt models, there are some considerations which may differ from a notebook-based analysis. These can be thought of as a checklist or decision tree of questions that you should run through whenever you are editing or creating a dbt model. Longer explanations of each item are included below.

+```{mermaid}
flowchart TD

workflow_type[Are you fixing a bug or creating something new?]
identify_bug[Identify the cause of your bug.]
change_models[Make your changes.]
tool_choice[Should it be a dbt model?]
not_dbt[Use a notebook or dashboard for your analysis.]
grain[What is the grain of your model?]
grain_exists[Is there already a model with your desired grain?]
new_column[Add a column to the existing model.]
new_model[Create a new model.]
test_changes[Test your changes.]
new_column[Add a column to the existing model.]
tests_and_docs[Add dbt tests and documentation.]
merge_model_changes[Merge your changes.]

workflow_type -- fixing a bug --> identify_bug
identify_bug --> change_models
workflow_type -- creating something new --> tool_choice
tool_choice -- dbt model --> grain
tool_choice -- not dbt --> not_dbt
grain --> grain_exists
grain_exists -- there is an existing model with this grain --> new_column
grain_exists -- there is no existing model with this grain --> new_model
new_column --> change_models
new_model --> change_models
change_models --> test_changes
test_changes -- identify issues --> change_models
test_changes -- no issues --> tests_and_docs
tests_and_docs --> merge_model_changes
+```

+(identify-bug)=
+### Identify the cause of your bug.

+```{admonition} Example bug troubleshooting walkthrough
+Here is a series of recordings showing a workflow for debugging a failing dbt test. The resulting PR is [#2892](https://github.com/cal-itp/data-infra/pull/2892).

+1. [Find the compiled test SQL](https://www.loom.com/share/0bf1eaa6d3374be782eb18859f24e08f?sid=0ab9251a-723d-4a5a-9d77-0be3b116a021)
+2. [Run the test SQL](https://www.loom.com/share/e57a163ecd8c4b15af0959fb0b4ab3eb?sid=36b5cf66-9832-4538-8813-c3dd982e6a77)
+3. [Confirm the nature of the problem](https://www.loom.com/share/cf82e6a7ab824d8dbd572d9371ccf6dc?sid=9d31aa40-ff34-4c34-9fd9-22985c7c57e4)
+4. [Plan a fix](https://www.loom.com/share/99133f1172c44540a683e423f4ad91ef?sid=e199aed5-00e0-4acc-98de-24f696e4267e)
+```

+Usually, bugs are caused by:
+* New or historical data issues. For example, an agency may be doing something in their GTFS data that we didn't expect and this may have broken one of our models. This can happen with brand new data that is coming in or in historical data that wasn't included in local testing (this is especially relevant for RT data, where local testing usually includes a very small subset of the full data.)
+* GTFS or data schema bugs. Sometimes we may have misinterpreted the GTFS spec (or another incoming data model) and modeled something incorrectly.
+* SQL bugs. Sometimes we may have written SQL incorrectly (for example, used the wrong kind of join.)
-### When to develop or update a model

How to investigate the bug depends on how the bug was noticed.
-One key question to ask is whether a given data need is best met by a new dbt model or updates to an existing model vs. some other tool or process.

If there was a failing dbt test, you can `dbt compile` locally to compile the project SQL. You can then find the SQL for the failing test (follow the [dbt testing FAQ under "one of my tests failed, how can I debug it?"](https://docs.getdbt.com/docs/build/tests#faqs) to find the compiled test SQL). Run that SQL in BigQuery to see the rows that are failing.
-Changes to dbt models are likely to be appropriate, and often beneficial over other approaches, when one or more of the following is true:
-* There is a consistent or ongoing need for the same transformations. dbt can ensure that transformations are performed consistently at scale, every day.
-* Transformations are needed on large data. Doing transformations in BigQuery can be more performant than doing them in notebooks or any workflow where the large data must be loaded into local memory.

```{note}
When you `dbt compile` locally, you will compile SQL that's pointed at the staging project and your namespaced dataset. Make sure to change those references when you run the compiled SQL. So, `cal-itp-data-infra-staging.laurie_mart_gtfs.fct_scheduled_trips` would become `cal-itp-data-infra.mart_gtfs.fct_scheduled_trips`.
+```

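To give a rough sense of what you will find, a compiled generic test is just a query that selects the failing rows (the exact SQL varies by dbt version). A `not_null` test, after swapping the staging references for production ones, looks roughly like the sketch below; the model and column names are only illustrative.

```
-- Approximate shape of a compiled not_null test, pointed at production.
-- The test passes when this query returns zero rows.
SELECT *
FROM `cal-itp-data-infra.mart_gtfs.fct_scheduled_trips`
WHERE trip_id IS NULL
```
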
+If you noticed an issue that wasn't caused by a failing test, you can start with the model that you noticed the problem in.

+In either case, you may need to consider upstream models. To identify your model's parents, you can look at the [dbt docs website](https://dbt-docs.calitp.org/#!/overview) page for your model. [See the dbt docs](https://docs.getdbt.com/docs/collaborate/documentation#navigating-the-documentation-site) for how to look at the model's lineage. You can modify the model selector in the bottom middle to just `+` to only see the model's parents. You can also run `poetry run dbt ls -s +<your model> --resource-type model` to see a model's parents just on the command line. Try to figure out where the root cause of the problem is occurring. This may involve running ad-hoc SQL queries to inspect the models involved.

+(tool_choice)=
+### Should it be a dbt model?

+Changes to dbt models are likely to be appropriate when one or more of the following is true:
+* There is a consistent or ongoing need for this data. dbt can ensure that transformations are performed consistently at scale, every day.
+* The data is big. Doing transformations in BigQuery can be more performant than doing them in notebooks or any workflow where the large data must be loaded into local memory.
* We want to use the same model across multiple domains or tools. The BigQuery data warehouse is the easiest way to provide consistent data throughout the Cal-ITP data ecosystem (in JupyterHub, Metabase, open data publishing, the reports site, etc.)

-dbt model updates may not be appropriate when:
-* There is insufficient support in dbt or BigQuery for the necessary tooling. The biggest current example is geospatial work; once we have [Python models in the dbt project](https://github.com/cal-itp/data-infra/issues/2359), there will be fewer limitations.
-* You are doing exploratory data analysis, especially on inconsistently-constructed data. It will almost always be faster to do initial exploration of data via Jupyter/Python than in SQL.
+dbt models may not be appropriate when:
+* You are doing exploratory data analysis. It will almost always be faster to do initial exploration of data via Jupyter/Python than in SQL.
+* You want to apply a simple transformation (for example, a grouped summary or filter) to answer a specific question. In this case, it may be more appropriate to create a Metabase dashboard with the desired transformations.

+(model-grain)=
+### What is the grain of your model?

+[*Grain* means "what does a row represent"](https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/grain/). For example: Do you want one row per route per day? One row per fare transaction? One row per organization per month?

+This concept of grain can be one of the biggest differences between notebook-based analysis and warehouse analytics engineering. In notebooks, you may be making a lot of transformations and saving each step out as its own dataframe, and you may use functions for reusable transformation steps. In warehouse development, we want to be focused on making reusable models where the data itself is the common building block across analyses. That often means trying to make only one table in the warehouse for each grain, regardless of how many different types of analysis it might be used for.

+(grain-exists)=
+### Is there already a model with this grain?

+If there is already a model with the grain you are targeting, you should almost always add new columns to that existing model rather than making a new model with the same grain.

+```{admonition} Example: fct_scheduled_trips
+Consider [`fct_scheduled_trips`](https://dbt-docs.calitp.org/#!/model/model.calitp_warehouse.fct_scheduled_trips). This is our core trip-level table. Every scheduled trip should have a row in this model and attributes that you might want from that trip should be present for easy access. As a result, this table has a lot of columns, because when we need new information about trips, we add it here. For example, when we wanted to fix time zone handling for trips, we [added those columns](https://github.com/cal-itp/data-infra/pull/2457) instead of creating a new model.
+```

+To figure out if there is a model with your desired grain, you can [search the dbt docs](https://dbt-docs.calitp.org/#!/overview) for relevant terms. For example, if you want a table of routes, you can search "routes" to see what models already exist. You can also explore the dependency tree for a related table (like `dim_routes`) to see if you can find a table that looks like it has the right grain. You can also see [our dbt docs homepage](https://dbt-docs.calitp.org/#!/overview) for a discussion of table naming conventions to interpret dimension, fact, and bridge tables.

+For dimensions you may need to think more about whether you are truly making a new dimension, or whether you are simply applying a filter on an existing dimension (for example, `dim_bus_vehicles` could be a subset of an existing `dim_vehicles` dimension model, in which case you could just add a boolean column `is_bus` on `dim_vehicles` rather than making a new dedicated model.)

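As a sketch of what that kind of change can look like (all model and column names here are hypothetical, not actual warehouse objects), adding the flag is often just one extra expression in the existing dimension's SQL rather than a whole new model:

```
-- Hypothetical: flag bus vehicles on an existing dimension instead of
-- building a separate dim_bus_vehicles model.
SELECT
    *,
    vehicle_type = 'bus' AS is_bus
FROM {{ ref('stg_vehicles') }}
```
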
+(change-models)=
+### Make your changes.

+The kind of changes you make will depend on what you've discovered in previous steps. Fixing a bug might involve changing model SQL, changing tests to reflect a new understanding, or something else. Adding a column might involve a simple change on one model or require updating several parent models. Creating a new model for brand new data may involve a new external table or it might be a straightforward transformation just in the mart. Some examples of different types of changes are listed below, and a minimal sketch of a brand-new model follows the example PR lists.

+If you find yourself making big changes that seem likely to significantly affect other users, you may need to step back and convene a conversation to make sure that everyone is on board; see for example [this Google Doc about Airtable schema changes](https://docs.google.com/document/d/1F4METWYNip5nobcPZSUg-5XGtC1XX1hWMQfEmERPPd4/edit#heading=h.dsfdelw2zz3n) where stakeholders confirmed how they wanted to handle schema changes in the warehouse. The [downstream impacts section below](model-downstream-impacts) has suggestions for how to assess the impacts of your changes.

+#### Example bug fix PRs
+Here are a few example `data-infra` PRs that fixed past bugs:
+- [PR #2076](https://github.com/cal-itp/data-infra/pull/2076) fixed two bugs: There was a hardcoded incorrect value in our SQL that was causing Sundays to not appear in our scheduled service index (SQL syntax bug), and there was a bug in how we were handling the relationship between `calendar_dates` and `calendar` (GTFS logic bug).
+- [PR #2623](https://github.com/cal-itp/data-infra/pull/2623) fixed bugs caused by unexpected calendar data from a producer.

+#### Example new column PRs
+Here are a few example `data-infra` PRs that added columns to existing models:
+* [PR #2778](https://github.com/cal-itp/data-infra/pull/2778) is a simple example of adding a column that already exists in staging to a mart table.
+* For intermediate examples of adding a column in a staging table and propagating it through a few different downstream models, see
+  * [PR #2768](https://github.com/cal-itp/data-infra/pull/2768)
+  * [PR #2601](https://github.com/cal-itp/data-infra/pull/2686)
+* [PR #2383](https://github.com/cal-itp/data-infra/pull/2383) adds a column to Airtable data end-to-end (starting from the raw data/external tables; this involves non-dbt code).

+#### Example new model PRs
+Here are a few `data-infra` PRs that created brand new models:
+* [PR #2686](https://github.com/cal-itp/data-infra/pull/2686) created a new model based on existing warehouse data.
+* For examples of adding models to dbt end-to-end (starting from raw data/external tables; this involves non-dbt code), see:
+  * [PR #2509](https://github.com/cal-itp/data-infra/pull/2509)
+  * [PR #2781](https://github.com/cal-itp/data-infra/pull/2781)

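For orientation, a brand-new model's `.sql` file is typically just a `SELECT` built from `ref()`s to existing models, with a clear grain. The sketch below is hypothetical (the source model and column names are made up) and only illustrates the shape, not a real model in the project:

```
-- Hypothetical new mart model with a grain of one row per organization per month.
WITH transactions AS (
    SELECT * FROM {{ ref('stg_payments__transactions') }}
)

SELECT
    organization_key,
    DATE_TRUNC(transaction_date, MONTH) AS month,
    COUNT(*) AS n_transactions
FROM transactions
GROUP BY 1, 2
```
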
+(test-changes)=
+### Test your changes.

+Once you have made some changes, it is important to test them.

+```{admonition} Different types of testing
+Functional testing during development is different than adding dbt tests ([described below](dbt-tests)). dbt tests ensure some floor of model validity over time; while developing, you should run more holistic tests to ensure that your code is working as expected.
+```

+The first step is running your changes in the test/staging environment. You can run a command like `poetry run dbt run -s +<your model>` to run your model and its antecedents. Your models will be created in the `cal-itp-data-infra-staging.<username>_<dataset>` BigQuery dataset. Note that this is in the `cal-itp-data-infra-staging` Google Cloud Platform project, *not* the production `cal-itp-data-infra` project.

+What to test/check will vary based on what you're doing, but below are some example things to consider.

+#### Column values
+Are the values in your column/model what you expect? For example, are there nulls? Does the column have all the values you anticipated (for example, if you have a day of the week column, is data from all 7 days present)? If it's numeric, what are the minimum and maximum values; do they make sense (for example, if you have a percentage column, is it always between 0 and 100)? What is the most common value?
+* To check nulls:
+  ```
+  SELECT * FROM <table>
+  WHERE <column> IS NULL
+  ```
+* To check distinct values in a column:
+  ```
+  SELECT DISTINCT <column>
+  FROM <table>
+  ```
+* To check min/max:
+  ```
+  SELECT
+    MIN(<column>),
+    MAX(<column>)
+  FROM <table>
+  ```
+* To check most common values:
+  ```
+  SELECT
+    <column>,
+    COUNT(*) AS ct
+  FROM <table>
+  GROUP BY 1
+  ORDER BY ct DESC
+  ```

+#### Row count and uniqueness

+To confirm that the grain is what you expect, you should check whether an anticipated unique key is actually unique. For example, if you were making a daily shapes table, you might expect that `date + feed_key + shape_id` would be unique. Similarly, you should have a ballpark idea of the order of magnitude of the number of rows you expect. If you're making a yearly organizations table and your table has a million rows, something is likely off. Some example queries could be:

+* To check row count:
+  ```
+  SELECT COUNT(*) FROM <table>
+  ```

+* To check row count by some attribute (for example, rows per date):
+  ```
+  SELECT <column>, COUNT(*) AS ct
+  FROM <table>
+  GROUP BY 1
+  ORDER BY 1
+  ```

+* To check uniqueness based on a combination of a few columns:
+  ```
+  WITH tbl AS (
+    SELECT * FROM <table>
+  ),

+  dups AS (
+    SELECT
+      <column 1>,
+      <column 2>,
+      <column 3>,
+      COUNT(*) AS ct
+    FROM tbl
+    -- adjust this based on the number of columns that make the composite unique key
+    GROUP BY 1, 2, 3
+    HAVING ct > 1
+  )

+  SELECT *
+  FROM dups
+  LEFT JOIN tbl USING (<column 1>, <column 2>, <column 3>)
+  ORDER BY <column 1>, <column 2>, <column 3>
+  ```

#### Performance

+While testing, you should keep an eye on the performance (cost/data efficiency) of the model:

+* When you run the dbt model locally, look at how many bytes are billed to build the model(s).
+* Before you run test queries, [check the bytes estimates](https://cloud.google.com/bigquery/docs/best-practices-costs#use-query-validator) (these may not be accurate for queries on [views](https://cloud.google.com/bigquery/docs/views-intro#view_pricing) or [clustered tables](https://cloud.google.com/bigquery/docs/clustered-tables#clustered_table_pricing)).
+* After you run test queries, look at the total bytes billed after the fact in the **Job Information** tab in the **Query results** section of the BigQuery console.

+If the model takes more than 100 GB to build, or if test queries seem to be reading a lot of data (this is subjective; it's ok to build a sense over time), you may want to consider performance optimizations.

+Below are a few options to improve performance. [Data infra PR #2711](https://github.com/cal-itp/data-infra/pull/2711) has examples of several different types of performance interventions.

+* If the model is expensive to **build**: First, try to figure out what specific steps are expensive. You can run individual portions of your model SQL in the BigQuery console to assess the performance of individual [CTEs](https://docs.getdbt.com/terms/cte).
+  * If the model involves transformations on a lot of data that doesn't need to be reprocessed every day, you may want to make the model [incremental](https://docs.getdbt.com/docs/build/incremental-models) (a minimal sketch follows this list). You can run `poetry run dbt ls -s config.materialized:incremental --resource-type model` to see examples of other incremental models in the repo.
+  * If the model reads data from an expensive parent table, you may want to consider leveraging clustering or partitioning on that parent table to make a join or select more efficient. See [this comment on data infra PR #2743](https://github.com/cal-itp/data-infra/pull/2743#pullrequestreview-1570532320) for an example of a case where changing a join condition was a more appropriate performance intervention than making the table incremental.
+* If the model is expensive to **query**: The main interventions to make a model more efficient to query involve changing the data storage.
+  * Consider storing it as a [table rather than a view](https://docs.getdbt.com/docs/build/materializations).
+  * If the model is already a table, you can consider [partitioning](https://cloud.google.com/bigquery/docs/partitioned-tables) or [clustering](https://cloud.google.com/bigquery/docs/clustered-tables#when_to_use_clustering) on columns that will commonly be used as filters.

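As a reference for that option, here is a minimal sketch of the incremental pattern in a dbt model (the model, column, and source names are illustrative; existing incremental models in the repo are the better guide):

```
-- Minimal incremental model sketch: only process rows newer than what the
-- table already contains; a full refresh rebuilds everything.
{{ config(materialized='incremental') }}

SELECT
    key,
    dt,
    some_value
FROM {{ ref('stg_some_source__records') }}

{% if is_incremental() %}
-- `this` refers to the already-built version of this model in the warehouse
WHERE dt > (SELECT MAX(dt) FROM {{ this }})
{% endif %}
```
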
+```{warning}
+Incremental models have two different run modes: **full refreshes** (which re-process all historical data available) and **incremental runs** that load data in batches based on your incremental logic. These two modes run different code.

+If you make your table incremental, you should make sure to run both a full refresh (use the `--full-refresh` flag) and an incremental run (after the table has already been built once; no flag) in your testing to ensure that both are working as expected.
+```

+(model-downstream-impacts)=
+#### Downstream impacts
+Another important consideration is the potential downstream impacts of your changes, particularly if you are changing existing models.

+You can run dbt tests on the downstream models using `poetry run dbt test -s <your model>+`. You should make sure that your changes do not cause new test failures in downstream models.

+Check which models are downstream of your changes using `poetry run dbt ls -s <your model>+ --resource-type model`. If your model has a lot of descendants, consider performing additional tests to ensure that your changes will not cause problems downstream.

+To check for impacts on defined downstream artifacts (like the reports site and open data publishing), you can check which [exposures](https://docs.getdbt.com/docs/build/exposures) are downstream of your model using `poetry run dbt ls -s <your model>+ --resource-type exposure`.

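Beyond selector checks, one lightweight sanity check on an important descendant is to rebuild it in your staging dataset and compare it against production; the dataset and model names below are placeholders:

```
-- Compare a downstream model you rebuilt in staging against production.
SELECT 'production' AS version, COUNT(*) AS row_count
FROM `cal-itp-data-infra.mart_gtfs.<downstream model>`
UNION ALL
SELECT 'staging rebuild' AS version, COUNT(*) AS row_count
FROM `cal-itp-data-infra-staging.<username>_mart_gtfs.<downstream model>`
```
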
+#### Other considerations
+Other questions will be more specific to your changes or goals, but it's usually a good idea to take a second and brainstorm things that you would expect to be true and check whether your model reflects them. For example, we expect more trip activity during AM/PM peak periods than in the middle of the night; is that true in your model? What is the balance of weekend to weekday activity in your model, and does it make sense for the context?

+(tests-and-docs)=
+### Add dbt tests and documentation.

+Once you are satisfied with your changes, you should add tests and documentation, both of which are vital to keeping the project maintainable over time.

+(dbt-tests)=
+#### dbt tests

+[dbt tests](https://docs.getdbt.com/docs/build/tests) help us ensure baseline model validity and guarantees over time (for example: "this ID is unique"). A dbt test failure should be something that you'd want to fix quickly to ensure models work for downstream users. So, **a test failure should be something you'd want to act on** by doing something like fixing, dropping, or adding a warning flag on failing rows. A test can also be thought of as an assertion: a not-null test on a column asserts that that column is never null.

+dbt tests are run every day in Airflow and alert when models fail. Because they run every day and execute SQL code, there is some tradeoff with cost: we don't want to test excessively because that could become wasteful.

+We usually prefer to have tests on [tables (rather than views)](https://docs.getdbt.com/docs/build/materializations) for cost reasons. Most tables, especially in mart datasets, should have at least a primary key test that tests that there is a unique, non-null column; this is one way to monitor that the [grain](model-grain) of the model is stable and is not being violated.

+You may want to find a model similar to the one you're changing and see what tests that other model has.

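To make the "tests are assertions" framing concrete: a dbt test compiles to a query that returns the rows violating the assertion, and the test passes when that query returns nothing. Conceptually, a unique, non-null primary key check boils down to something like the sketch below (the table and column names are illustrative; the actual generic tests are configured in YAML):

```
-- Conceptual shape of a primary key check: any row returned is a violation
-- (a null key or a duplicated key).
SELECT
    key,
    COUNT(*) AS ct
FROM `cal-itp-data-infra.mart_gtfs.<your model>`
GROUP BY key
HAVING key IS NULL OR ct > 1
```
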
-Because the dbt project is run and built every day, we want to be mindful of cost and be efficient in how we build and structure our models. The [dbt docs materializations page](https://docs.getdbt.com/docs/build/materializations#overview) provides a good overview of different materialization options and associated considerations. +```{warning} +[Incremental models](https://docs.getdbt.com/docs/build/incremental-models) downstream of your changes may require a **full refresh** after your changes merge. -When developing a new model, or updating an existing model, it is helpful to keep an eye on the number of bytes billed to build the model (this information is printed in the terminal output from dbt.) As a rule of thumb in our project, models that take more than 100 GB to build should probably be optimized a bit more, potentially by being made [incremental](https://docs.getdbt.com/docs/build/materializations#incremental). +To check for incremental models downstream of your model, run `poetry run dbt ls -s +,config.materialized:incremental --resource-type model`. If you need to refresh incremental models: +1. Wait for the [build-dbt](https://github.com/cal-itp/data-infra/actions/workflows/build-dbt.yml) GitHub action associated with your PR to complete after you merge. -Performance is one of the hardest things to manage when you are new to developing in SQL, so please don't hesitate to ask questions (the `#data-warehouse-devs` or `#data-office-hours` Cal-ITP Slack channels are good places to ask) as you get used to the options. +2. Go into the [Airflow UI](https://o1d2fa0877cf3fb10p-tp.appspot.com/home) and go to the [transform_warehouse_full_refresh DAG](https://github.com/cal-itp/data-infra/tree/main/airflow/dags/transform_warehouse_full_refresh). **Specify appropriate model selectors to only refresh models that were affected by your changes** and then run the DAG task. +``` ## Helpful talks and presentations