From 9b5230938b392f8e6ca1a7c7ea319074bc26f101 Mon Sep 17 00:00:00 2001 From: mirnawong1 Date: Fri, 12 Jul 2024 16:05:35 +0100 Subject: [PATCH 1/9] test --- website/docs/docs/introduction.md | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/website/docs/docs/introduction.md b/website/docs/docs/introduction.md index 5301dae396d..64f1da9652b 100644 --- a/website/docs/docs/introduction.md +++ b/website/docs/docs/introduction.md @@ -9,6 +9,16 @@ pagination_prev: null dbt compiles and runs your analytics code against your data platform, enabling you and your team to collaborate on a single source of truth for metrics, insights, and business definitions. This single source of truth, combined with the ability to define tests for your data, reduces errors when logic changes, and alerts you when issues arise. +## Title case header + +The tech writer writes teh documentation. It is organized by the writer too recieve this program. + +I can't can't seem to understand teh main points e.g. etc. + + +## About dbt Cloud +dbt Cloud is a data transformation workflow. + Read more about why we want to enable analysts to work more like software engineers in [The dbt Viewpoint](/community/resources/viewpoint). Learn how other data practitioners around the world are using dbt by [joining the dbt Community](https://www.getdbt.com/community/join-the-community). 
From e32b100280fbc8bcca610472167013efe8084922 Mon Sep 17 00:00:00 2001 From: Mirna Wong <89008547+mirnawong1@users.noreply.github.com> Date: Fri, 12 Jul 2024 16:07:57 +0100 Subject: [PATCH 2/9] Update introduction.md --- website/docs/docs/introduction.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/docs/docs/introduction.md b/website/docs/docs/introduction.md index 64f1da9652b..5dfaa1af839 100644 --- a/website/docs/docs/introduction.md +++ b/website/docs/docs/introduction.md @@ -9,7 +9,7 @@ pagination_prev: null dbt compiles and runs your analytics code against your data platform, enabling you and your team to collaborate on a single source of truth for metrics, insights, and business definitions. This single source of truth, combined with the ability to define tests for your data, reduces errors when logic changes, and alerts you when issues arise. -## Title case header +## Title Case Header The tech writer writes teh documentation. It is organized by the writer too recieve this program. From d5776b7b80c8c72accb171887209e55d31c44e51 Mon Sep 17 00:00:00 2001 From: Ly Nguyen Date: Thu, 15 Aug 2024 11:34:10 -0700 Subject: [PATCH 3/9] Check a guide --- website/docs/guides/bigquery-qs.md | 321 ++++++++++++++++++++++------- 1 file changed, 249 insertions(+), 72 deletions(-) diff --git a/website/docs/guides/bigquery-qs.md b/website/docs/guides/bigquery-qs.md index 1ba5f7b0021..5401d57f2b6 100644 --- a/website/docs/guides/bigquery-qs.md +++ b/website/docs/guides/bigquery-qs.md @@ -13,101 +13,222 @@ recently_updated: true ## Introduction -In this quickstart guide, you'll learn how to use dbt Cloud with BigQuery. It will show you how to: +In this quickstart guide, you'll learn how to use dbt Cloud with Snowflake. It will show you how to: -- Create a Google Cloud Platform (GCP) project. -- Access sample data in a public dataset. -- Connect dbt Cloud to BigQuery. +- Create a new Snowflake worksheet. +- Load sample data into your Snowflake account. 
+- Connect dbt Cloud to Snowflake. - Take a sample query and turn it into a model in your dbt project. A model in dbt is a select statement. +- Add sources to your dbt project. Sources allow you to name and describe the raw data already loaded into Snowflake. - Add tests to your models. - Document your models. - Schedule a job to run. +Snowflake also provides a quickstart for you to learn how to use dbt Cloud. It makes use of a different public dataset (Knoema Economy Data Atlas) than what's shown in this guide. For more information, refer to [Accelerating Data Teams with dbt Cloud & Snowflake](https://quickstarts.snowflake.com/guide/accelerating_data_teams_with_snowflake_and_dbt_cloud_hands_on_lab/) in the Snowflake docs. + :::tip Videos for you You can check out [dbt Fundamentals](https://learn.getdbt.com/courses/dbt-fundamentals) for free if you're interested in course learning with videos. -::: +You can also watch the [YouTube video on dbt and Snowflake](https://www.youtube.com/watch?v=kbCkwhySV_I&list=PL0QYlrC86xQm7CoOH6RS7hcgLnd3OQioG). +::: + ### Prerequisites​ -- You have a [dbt Cloud account](https://www.getdbt.com/signup/). -- You have a [Google account](https://support.google.com/accounts/answer/27441?hl=en). -- You can use a personal or work account to set up BigQuery through [Google Cloud Platform (GCP)](https://cloud.google.com/free). +- You have a [dbt Cloud account](https://www.getdbt.com/signup/). +- You have a [trial Snowflake account](https://signup.snowflake.com/). During trial account creation, make sure to choose the **Enterprise** Snowflake edition so you have `ACCOUNTADMIN` access. For a full implementation, you should consider organizational questions when choosing a cloud provider. For more information, see [Introduction to Cloud Platforms](https://docs.snowflake.com/en/user-guide/intro-cloud-platforms.html) in the Snowflake docs. For the purposes of this setup, all cloud providers and regions will work so choose whichever you’d like. 
### Related content - Learn more with [dbt Learn courses](https://learn.getdbt.com) +- [How we configure Snowflake](https://blog.getdbt.com/how-we-configure-snowflake/) - [CI jobs](/docs/deploy/continuous-integration) - [Deploy jobs](/docs/deploy/deploy-jobs) - [Job notifications](/docs/deploy/job-notifications) - [Source freshness](/docs/deploy/source-freshness) -## Create a new GCP project +## Create a new Snowflake worksheet +1. Log in to your trial Snowflake account. +2. In the Snowflake UI, click **+ Worksheet** in the upper right corner to create a new worksheet. + +## Load data +The data used here is stored as CSV files in a public S3 bucket, and the following steps guide you through preparing your Snowflake account for that data and loading it. + +1. Create a new virtual warehouse, two new databases (one for raw data, the other for future dbt development), and two new schemas (one for `jaffle_shop` data, the other for `stripe` data). + + To do this, run these SQL commands by typing them into the Editor of your new Snowflake worksheet and clicking **Run** in the upper right corner of the UI: + ```sql + create warehouse transforming; + create database raw; + create database analytics; + create schema raw.jaffle_shop; + create schema raw.stripe; + ``` + +2. In the `raw` database and `jaffle_shop` and `stripe` schemas, create three tables and load relevant data into them: + + - First, delete all contents (empty) in the Editor of the Snowflake worksheet. Then, run this SQL command to create the `customers` table: + + ```sql + create table raw.jaffle_shop.customers + ( id integer, + first_name varchar, + last_name varchar + ); + ``` -1. Go to the [BigQuery Console](https://console.cloud.google.com/bigquery) after you log in to your Google account. If you have multiple Google accounts, make sure you’re using the correct one. -2. 
Create a new project from the [Manage resources page](https://console.cloud.google.com/projectcreate?previousPage=%2Fcloud-resource-manager%3Fwalkthrough_id%3Dresource-manager--create-project%26project%3D%26folder%3D%26organizationId%3D%23step_index%3D1&walkthrough_id=resource-manager--create-project). For more information, refer to [Creating a project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#creating_a_project) in the Google Cloud docs. GCP automatically populates the Project name field for you. You can change it to be more descriptive for your use. For example, `dbt Learn - BigQuery Setup`. + - Delete all contents in the Editor, then run this command to load data into the `customers` table: -## Create BigQuery datasets + ```sql + copy into raw.jaffle_shop.customers (id, first_name, last_name) + from 's3://dbt-tutorial-public/jaffle_shop_customers.csv' + file_format = ( + type = 'CSV' + field_delimiter = ',' + skip_header = 1 + ); + ``` + - Delete all contents in the Editor (empty), then run this command to create the `orders` table: + ```sql + create table raw.jaffle_shop.orders + ( id integer, + user_id integer, + order_date date, + status varchar, + _etl_loaded_at timestamp default current_timestamp + ); + ``` -1. From the [BigQuery Console](https://console.cloud.google.com/bigquery), click **Editor**. Make sure to select your newly created project, which is available at the top of the page. -1. Verify that you can run SQL queries. 
Copy and paste these queries into the Query Editor: + - Delete all contents in the Editor, then run this command to load data into the `orders` table: + ```sql + copy into raw.jaffle_shop.orders (id, user_id, order_date, status) + from 's3://dbt-tutorial-public/jaffle_shop_orders.csv' + file_format = ( + type = 'CSV' + field_delimiter = ',' + skip_header = 1 + ); + ``` + - Delete all contents in the Editor (empty), then run this command to create the `payment` table: + ```sql + create table raw.stripe.payment + ( id integer, + orderid integer, + paymentmethod varchar, + status varchar, + amount integer, + created date, + _batched_at timestamp default current_timestamp + ); + ``` + - Delete all contents in the Editor, then run this command to load data into the `payment` table: + ```sql + copy into raw.stripe.payment (id, orderid, paymentmethod, status, amount, created) + from 's3://dbt-tutorial-public/stripe_payments.csv' + file_format = ( + type = 'CSV' + field_delimiter = ',' + skip_header = 1 + ); + ``` +3. Verify that the data is loaded by running these SQL queries. Confirm that you can see output for each one. ```sql - select * from `dbt-tutorial.jaffle_shop.customers`; - select * from `dbt-tutorial.jaffle_shop.orders`; - select * from `dbt-tutorial.stripe.payment`; + select * from raw.jaffle_shop.customers; + select * from raw.jaffle_shop.orders; + select * from raw.stripe.payment; ``` - Click **Run**, then check for results from the queries. For example: -
- -
-2. Create new datasets from the [BigQuery Console](https://console.cloud.google.com/bigquery). For more information, refer to [Create datasets](https://cloud.google.com/bigquery/docs/datasets#create-dataset) in the Google Cloud docs. Datasets in BigQuery are equivalent to schemas in a traditional database. On the **Create dataset** page: - - **Dataset ID** — Enter a name that fits the purpose. This name is used like schema in fully qualified references to your database objects such as `database.schema.table`. As an example for this guide, create one for `jaffle_shop` and another one for `stripe` afterward. - - **Data location** — Leave it blank (the default). It determines the GCP location of where your data is stored. The current default location is the US multi-region. All tables within this dataset will share this location. - - **Enable table expiration** — Leave it unselected (the default). The default for the billing table expiration is 60 days. Because billing isn’t enabled for this project, GCP defaults to deprecating tables. - - **Google-managed encryption key** — This option is available under **Advanced options**. Allow Google to manage encryption (the default). -
- -
-3. After you create the `jaffle_shop` dataset, create one for `stripe` with all the same values except for **Dataset ID**. - -## Generate BigQuery credentials {#generate-bigquery-credentials} -In order to let dbt connect to your warehouse, you'll need to generate a keyfile. This is analogous to using a database username and password with most other data warehouses. - -1. Start the [GCP credentials wizard](https://console.cloud.google.com/apis/credentials/wizard). Make sure your new project is selected in the header. If you do not see your account or project, click your profile picture to the right and verify you are using the correct email account. For **Credential Type**: - - From the **Select an API** dropdown, choose **BigQuery API** - - Select **Application data** for the type of data you will be accessing - - Click **Next** to create a new service account. -2. Create a service account for your new project from the [Service accounts page](https://console.cloud.google.com/projectselector2/iam-admin/serviceaccounts?supportedpurview=project). For more information, refer to [Create a service account](https://developers.google.com/workspace/guides/create-credentials#create_a_service_account) in the Google Cloud docs. As an example for this guide, you can: - - Type `dbt-user` as the **Service account name** - - From the **Select a role** dropdown, choose **BigQuery Job User** and **BigQuery Data Editor** roles and click **Continue** - - Leave the **Grant users access to this service account** fields blank - - Click **Done** -3. Create a service account key for your new project from the [Service accounts page](https://console.cloud.google.com/iam-admin/serviceaccounts?walkthrough_id=iam--create-service-account-keys&start_index=1#step_index=1). For more information, refer to [Create a service account key](https://cloud.google.com/iam/docs/creating-managing-service-account-keys#creating) in the Google Cloud docs. 
When downloading the JSON file, make sure to use a filename you can easily remember. For example, `dbt-user-creds.json`. For security reasons, dbt Labs recommends that you protect this JSON file like you would your identity credentials; for example, don't check the JSON file into your version control software. - -## Connect dbt Cloud to BigQuery​ -1. Create a new project in [dbt Cloud](https://cloud.getdbt.com/). From **Account settings** (using the gear menu in the top right corner), click **+ New Project**. +## Connect dbt Cloud to Snowflake + +There are two ways to connect dbt Cloud to Snowflake. The first option is Partner Connect, which provides a streamlined setup to create your dbt Cloud account from within your new Snowflake trial account. The second option is to create your dbt Cloud account separately and build the Snowflake connection yourself (connect manually). If you want to get started quickly, dbt Labs recommends using Partner Connect. If you want to customize your setup from the very beginning and gain familiarity with the dbt Cloud setup flow, dbt Labs recommends connecting manually. + + + + +Using Partner Connect allows you to create a complete dbt account with your [Snowflake connection](/docs/cloud/connect-data-platform/connect-snowflake), [a managed repository](/docs/collaborate/git/managed-repository), [environments](/docs/build/custom-schemas#managing-environments), and credentials. + +1. In the Snowflake UI, click on the home icon in the upper left corner. In the left sidebar, select **Data Products**. Then, select **Partner Connect**. Find the dbt tile by scrolling or by searching for dbt in the search bar. Click the tile to connect to dbt. + + + + If you’re using the classic version of the Snowflake UI, you can click the **Partner Connect** button in the top bar of your account. From there, click on the dbt tile to open up the connect box. + + + +2. 
In the **Connect to dbt** popup, find the **Optional Grant** option and select the **RAW** and **ANALYTICS** databases. This will grant access for your new dbt user role to each database. Then, click **Connect**. + + + + + +3. Click **Activate** when a popup appears: + + + + + +4. After the new tab loads, you will see a form. If you already created a dbt Cloud account, you will be asked to provide an account name. If you haven't created an account, you will be asked to provide an account name and password. + + + +5. After you have filled out the form and clicked **Complete Registration**, you will be logged into dbt Cloud automatically. + +6. From your **Account Settings** in dbt Cloud (using the gear menu in the upper right corner), choose the "Partner Connect Trial" project and select **snowflake** in the overview table. Select edit and update the fields **Database** and **Warehouse** to be `analytics` and `transforming`, respectively. + + + + + + + + + +1. Create a new project in dbt Cloud. From **Account settings** (using the gear menu in the top right corner), click **+ New Project**. 2. Enter a project name and click **Continue**. -3. For the warehouse, click **BigQuery** then **Next** to set up your connection. -4. Click **Upload a Service Account JSON File** in settings. -5. Select the JSON file you downloaded in [Generate BigQuery credentials](#generate-bigquery-credentials) and dbt Cloud will fill in all the necessary fields. -6. Click **Test Connection**. This verifies that dbt Cloud can access your BigQuery account. -7. Click **Next** if the test succeeded. If it failed, you might need to go back and regenerate your BigQuery credentials. +3. For the warehouse, click **Snowflake** then **Next** to set up your connection. + + + +4. Enter your **Settings** for Snowflake with: + * **Account** — Find your account by using the Snowflake trial account URL and removing `snowflakecomputing.com`. The order of your account information will vary by Snowflake version. 
For example, Snowflake's Classic console URL might look like: `oq65696.west-us-2.azure.snowflakecomputing.com`. The AppUI or Snowsight URL might look more like: `snowflakecomputing.com/west-us-2.azure/oq65696`. In both examples, your account will be: `oq65696.west-us-2.azure`. For more information, see [Account Identifiers](https://docs.snowflake.com/en/user-guide/admin-account-identifier.html) in the Snowflake docs. + + + + * **Role** — Leave blank for now. You can update this to a default Snowflake role later. + * **Database** — `analytics`. This tells dbt to create new models in the analytics database. + * **Warehouse** — `transforming`. This tells dbt to use the transforming warehouse that was created earlier. + + +5. Enter your **Development Credentials** for Snowflake with: + * **Username** — The username you created for Snowflake. The username is not your email address and is usually your first and last name together in one word. + * **Password** — The password you set when creating your Snowflake account. + * **Schema** — You’ll notice that the schema name has been auto-created for you. By convention, this is `dbt_`. This is the schema connected directly to your development environment, and it's where your models will be built when running dbt within the Cloud IDE. + * **Target name** — Leave as the default. + * **Threads** — Leave as 4. This is the number of simultaneous connections that dbt Cloud will make to build models concurrently. + + + +6. Click **Test Connection**. This verifies that dbt Cloud can access your Snowflake account. +7. If the connection test succeeds, click **Next**. If it fails, you may need to check your Snowflake settings and credentials. + + + ## Set up a dbt Cloud managed repository - +If you used Partner Connect, you can skip to [initializing your dbt project](#initialize-your-dbt-project-and-start-developing) because Partner Connect provides you with a managed repository. 
Otherwise, you will need to create your repository connection. + ## Initialize your dbt project​ and start developing Now that you have a repository configured, you can initialize your project and start development in dbt Cloud: 1. Click **Start developing in the IDE**. It might take a few minutes for your project to spin up for the first time as it establishes your git connection, clones your repo, and tests the connection to the warehouse. -2. Above the file tree to the left, click **Initialize dbt project**. This builds out your folder structure with example models. -3. Make your initial commit by clicking **Commit and sync**. Use the commit message `initial commit` and click **Commit**. This creates the first commit to your managed repo and allows you to open a branch where you can add new dbt code. +2. Above the file tree to the left, click **Initialize your project**. This builds out your folder structure with example models. +3. Make your initial commit by clicking **Commit and sync**. Use the commit message `initial commit`. This creates the first commit to your managed repo and allows you to open a branch where you can add new dbt code. 4. You can now directly query data from your warehouse and execute `dbt run`. You can try this out now: - - Click **+ Create new file**, add this query to the new file, and click **Save as** to save the new file: + - Click **+ Create new file**, add this query to the new file, and click **Save as** to save the new file: ```sql - select * from `dbt-tutorial.jaffle_shop.customers` + select * from raw.jaffle_shop.customers ``` - In the command line bar at the bottom, enter `dbt run` and click **Enter**. You should see a `dbt run succeeded` message. @@ -124,7 +245,6 @@ Name the new branch `add-customers-model`. 2. Name the file `customers.sql`, then click **Create**. 3. Copy the following query into the file and click **Save**. 
- ```sql with customers as ( @@ -133,7 +253,7 @@ with customers as ( first_name, last_name - from `dbt-tutorial`.jaffle_shop.customers + from raw.jaffle_shop.customers ), @@ -145,7 +265,7 @@ orders as ( order_date, status - from `dbt-tutorial`.jaffle_shop.orders + from raw.jaffle_shop.orders ), @@ -187,14 +307,6 @@ select * from final Later, you can connect your business intelligence (BI) tools to these views and tables so they only read cleaned up data rather than raw data in your BI tool. -#### FAQs - - - - - - - ## Change the way your model is materialized @@ -218,7 +330,7 @@ Later, you can connect your business intelligence (BI) tools to these views and first_name, last_name - from `dbt-tutorial`.jaffle_shop.customers + from raw.jaffle_shop.customers ``` @@ -232,7 +344,7 @@ Later, you can connect your business intelligence (BI) tools to these views and order_date, status - from `dbt-tutorial`.jaffle_shop.orders + from raw.jaffle_shop.orders ``` @@ -295,14 +407,79 @@ Later, you can connect your business intelligence (BI) tools to these views and This time, when you performed a `dbt run`, separate views/tables were created for `stg_customers`, `stg_orders` and `customers`. dbt inferred the order to run these models. Because `customers` depends on `stg_customers` and `stg_orders`, dbt builds `customers` last. You do not need to explicitly define these dependencies. - #### FAQs {#faq-2} - +## Build models on top of sources + +Sources make it possible to name and describe the data loaded into your warehouse by your extract and load tools. By declaring these tables as sources in dbt, you can: +- select from source tables in your models using the `{{ source() }}` function, helping define the lineage of your data +- test your assumptions about your source data +- calculate the freshness of your source data + +1. Create a new YML file `models/sources.yml`. +2. Declare the sources by copying the following into the file and clicking **Save**. 
+ + + + ```yml + version: 2 + + sources: + - name: jaffle_shop + description: This is a replica of the Postgres database used by our app + database: raw + schema: jaffle_shop + tables: + - name: customers + description: One record per customer. + - name: orders + description: One record per order. Includes cancelled and deleted orders. + ``` + + + +3. Edit the `models/stg_customers.sql` file to select from the `customers` table in the `jaffle_shop` source. + + + + ```sql + select + id as customer_id, + first_name, + last_name + + from {{ source('jaffle_shop', 'customers') }} + ``` + + + +4. Edit the `models/stg_orders.sql` file to select from the `orders` table in the `jaffle_shop` source. + + + + ```sql + select + id as order_id, + user_id as customer_id, + order_date, + status + + from {{ source('jaffle_shop', 'orders') }} + ``` + + + +5. Execute `dbt run`. + + The results of your `dbt run` will be exactly the same as the previous step. Your `stg_customers` and `stg_orders` + models will still query from the same raw data source in Snowflake. By using `source`, you can + test and document your raw data and also understand the lineage of your sources. + + From 0f83ce0b8ac6b21f75d0ae8ac44d328daa84e82f Mon Sep 17 00:00:00 2001 From: Ly Nguyen Date: Thu, 15 Aug 2024 11:45:55 -0700 Subject: [PATCH 4/9] Check a reference page --- .../resource-configs/fabric-configs.md | 905 +++++++++++++++++- 1 file changed, 856 insertions(+), 49 deletions(-) diff --git a/website/docs/reference/resource-configs/fabric-configs.md b/website/docs/reference/resource-configs/fabric-configs.md index 8ab0a63a644..e4ab525c7f5 100644 --- a/website/docs/reference/resource-configs/fabric-configs.md +++ b/website/docs/reference/resource-configs/fabric-configs.md @@ -3,103 +3,910 @@ title: "Microsoft Fabric DWH configurations" id: "fabric-configs" --- -## Materializations + -Ephemeral materialization is not supported due to T-SQL not supporting nested CTEs. 
It may work in some cases when you're working with very simple ephemeral models. +## Use `project` and `dataset` in configurations -### Tables +- `schema` is interchangeable with the BigQuery concept of `dataset` +- `database` is interchangeable with the BigQuery concept of `project` -Tables are default materialization. +For our reference documentation, you can declare `project` in place of `database`. +This will allow you to read and write from multiple BigQuery projects. The same applies to `dataset`. + +## Using table partitioning and clustering + +### Partition clause + +BigQuery supports the use of a [partition by](https://cloud.google.com/bigquery/docs/data-definition-language#specifying_table_partitioning_options) clause to easily partition a table by a column or expression. This option can help decrease latency and cost when querying large tables. Note that partition pruning [only works](https://cloud.google.com/bigquery/docs/querying-partitioned-tables#pruning_limiting_partitions) when partitions are filtered using literal values (so selecting partitions using a subquery won't improve performance). + +The `partition_by` config can be supplied as a dictionary with the following format: + +```python +{ + "field": "<field name>", + "data_type": "<timestamp | date | datetime | int64>", + "granularity": "<hour | day | month | year>", + + # Only required if data_type is "int64" + "range": { + "start": <int>, + "end": <int>, + "interval": <int> + } +} +``` + +#### Partitioning by a date or timestamp + +When using a `datetime` or `timestamp` column to partition data, you can create partitions with a granularity of hour, day, month, or year. A `date` column supports granularity of day, month and year. Daily partitioning is the default for all column types. + +If the `data_type` is specified as a `date` and the granularity is day, dbt will supply the field as-is +when configuring table partitioning. 
+ defaultValue="source" + values={[ + { label: 'Source code', value: 'source', }, + { label: 'Compiled code', value: 'compiled', }, + ] +}> + - + - +```sql +{{ config( + materialized='table', + partition_by={ + "field": "created_at", + "data_type": "timestamp", + "granularity": "day" + } +)}} + +select + user_id, + event_name, + created_at + +from {{ ref('events') }} +``` + + + + + + + + +```sql +create table `projectname`.`analytics`.`bigquery_table` +partition by timestamp_trunc(created_at, day) +as ( + + select + user_id, + event_name, + created_at + + from `analytics`.`events` + +) +``` + + + + + + +#### Partitioning by an "ingestion" date or timestamp + +BigQuery supports an [older mechanism of partitioning](https://cloud.google.com/bigquery/docs/partitioned-tables#ingestion_time) based on the time when each row was ingested. While we recommend using the newer and more ergonomic approach to partitioning whenever possible, for very large datasets, there can be some performance improvements to using this older, more mechanistic approach. [Read more about the `insert_overwrite` incremental strategy below](#copying-ingestion-time-partitions). + +dbt will always instruct BigQuery to partition your table by the values of the column specified in `partition_by.field`. By configuring your model with `partition_by.time_ingestion_partitioning` set to `True`, dbt will use that column as the input to a `_PARTITIONTIME` pseudocolumn. Unlike with newer column-based partitioning, you must ensure that the values of your partitioning column match exactly the time-based granularity of your partitions. 
+ + + + + + +```sql +{{ config( + materialized="incremental", + partition_by={ + "field": "created_date", + "data_type": "timestamp", + "granularity": "day", + "time_ingestion_partitioning": true + } +) }} + +select + user_id, + event_name, + created_at, + -- values of this column must match the data type + granularity defined above + timestamp_trunc(created_at, day) as created_date + +from {{ ref('events') }} +``` + + + + + + + + +```sql +create table `projectname`.`analytics`.`bigquery_table` (`user_id` INT64, `event_name` STRING, `created_at` TIMESTAMP) +partition by timestamp_trunc(_PARTITIONTIME, day); + +insert into `projectname`.`analytics`.`bigquery_table` (_partitiontime, `user_id`, `event_name`, `created_at`) +select created_date as _partitiontime, * EXCEPT(created_date) from ( + select + user_id, + event_name, + created_at, + -- values of this column must match granularity defined above + timestamp_trunc(created_at, day) as created_date + + from `projectname`.`analytics`.`events` +); +``` + + + + + + +#### Partitioning with integer buckets + +If the `data_type` is specified as `int64`, then a `range` key must also +be provided in the `partition_by` dict. dbt will use the values provided in +the `range` dict to generate the partitioning clause for the table. 
+ + + + + + +```sql +{{ config( + materialized='table', + partition_by={ + "field": "user_id", + "data_type": "int64", + "range": { + "start": 0, + "end": 100, + "interval": 10 + } + } +)}} + +select + user_id, + event_name, + created_at + +from {{ ref('events') }} +``` + + + + + + + + +```sql +create table analytics.bigquery_table +partition by range_bucket( + user_id, + generate_array(0, 100, 10) +) +as ( + + select + user_id, + event_name, + created_at + + from analytics.events + +) +``` + + + + + + +#### Additional partition configs + +If your model has `partition_by` configured, you may optionally specify two additional configurations: + +- `require_partition_filter` (boolean): If set to `true`, anyone querying this model _must_ specify a partition filter; otherwise, their query will fail. This is recommended for very large tables with obvious partitioning schemes, such as event streams grouped by day. Note that this will affect other dbt models or tests that try to select from this model, too. + +- `partition_expiration_days` (integer): If set for date- or timestamp-type partitions, the partition will expire that many days after the date it represents. For example, a partition representing `2021-01-01`, set to expire after 7 days, will no longer be queryable as of `2021-01-08`, its storage costs will be zeroed out, and its contents will eventually be deleted. Note that [table expiration](#controlling-table-expiration) will take precedence if specified. + + + +```sql +{{ config( + materialized = 'table', + partition_by = { + "field": "created_at", + "data_type": "timestamp", + "granularity": "day" + }, + require_partition_filter = true, + partition_expiration_days = 7 +)}} + +``` + + + +### Clustering Clause + +BigQuery tables can be [clustered](https://cloud.google.com/bigquery/docs/clustered-tables) to colocate related data. 
+ +Clustering on a single column: + + ```sql {{ - config( - materialized='table' - ) + config( + materialized = "table", + cluster_by = "order_id", + ) }} -select * -from ... +select * from ... ``` - +Clustering on multiple columns: + + - +```sql +{{ + config( + materialized = "table", + cluster_by = ["customer_id", "order_id"], + ) +}} + +select * from ... +``` + + + +## Managing KMS Encryption + +[Customer managed encryption keys](https://cloud.google.com/bigquery/docs/customer-managed-encryption) can be configured for BigQuery tables using the `kms_key_name` model configuration. - +### Using KMS Encryption + +To specify the KMS key name for a model (or a group of models), use the `kms_key_name` model configuration. The following example sets the `kms_key_name` for all of the models in the `encrypted/` directory of your dbt project. + + ```yaml + +name: my_project +version: 1.0.0 + +... + models: - your_project_name: - materialized: view - staging: - materialized: table + my_project: + encrypted: + +kms_key_name: 'projects/PROJECT_ID/locations/global/keyRings/test/cryptoKeys/quickstart' ``` - +## Labels and Tags - +### Specifying labels + +dbt supports the specification of BigQuery labels for the tables and views that it creates. These labels can be specified using the `labels` model config. + +The `labels` config can be provided in a model config, or in the `dbt_project.yml` file, as shown below. + + BigQuery key-value pair entries for labels longer than 63 characters are truncated. 
+ +**Configuring labels in a model file** + + + +```sql +{{ + config( + materialized = "table", + labels = {'contains_pii': 'yes', 'contains_pie': 'no'} + ) +}} + +select * from {{ ref('another_model') }} +``` + + + +**Configuring labels in dbt_project.yml** + + + +```yaml + +models: + my_project: + snowplow: + +labels: + domain: clickstream + finance: + +labels: + domain: finance +``` + + + + + + + +### Specifying tags +BigQuery table and view *tags* can be created by supplying an empty string for the label value. + + + +```sql +{{ + config( + materialized = "table", + labels = {'contains_pii': ''} + ) +}} + +select * from {{ ref('another_model') }} +``` + + + +### Policy tags +BigQuery enables [column-level security](https://cloud.google.com/bigquery/docs/column-level-security-intro) by setting [policy tags](https://cloud.google.com/bigquery/docs/best-practices-policy-tags) on specific columns. + +dbt enables this feature as a column resource property, `policy_tags` (_not_ a node config). + + + +```yaml +version: 2 + +models: +- name: policy_tag_table + columns: + - name: field + policy_tags: + - 'projects//locations//taxonomies//policyTags/' +``` + + + +Please note that in order for policy tags to take effect, [column-level `persist_docs`](/reference/resource-configs/persist_docs) must be enabled for the model, seed, or snapshot. Consider using [variables](/docs/build/project-variables) to manage taxonomies and make sure to add the required security [roles](https://cloud.google.com/bigquery/docs/column-level-security-intro#roles) to your BigQuery service account key. + +## Merge behavior (incremental models) + +The [`incremental_strategy` config](/docs/build/incremental-strategy) controls how dbt builds incremental models. dbt uses a [merge statement](https://cloud.google.com/bigquery/docs/reference/standard-sql/dml-syntax) on BigQuery to refresh incremental tables. 
+ +The `incremental_strategy` config can be set to one of two values: + - `merge` (default) + - `insert_overwrite` -## Seeds +### Performance and cost -By default, `dbt-fabric` will attempt to insert seed files in batches of 400 rows. -If this exceeds Microsoft Fabric Synapse Data Warehouse 2100 parameter limit, the adapter will automatically limit to the highest safe value possible. +The operations performed by dbt while building a BigQuery incremental model can +be made cheaper and faster by using [clustering keys](#clustering-clause) in your +model configuration. See [this guide](https://discourse.getdbt.com/t/benchmarking-incremental-strategies-on-bigquery/981) for more information on performance tuning for BigQuery incremental models. -To set a different default seed value, you can set the variable `max_batch_size` in your project configuration. +**Note:** These performance and cost benefits are applicable to incremental models +built with either the `merge` or the `insert_overwrite` incremental strategy. - +### The `merge` strategy +The `merge` incremental strategy will generate a `merge` statement that looks +something like: + +```sql +merge into {{ destination_table }} DEST +using ({{ model_sql }}) SRC +on SRC.{{ unique_key }} = DEST.{{ unique_key }} + +when matched then update ... +when not matched then insert ... +``` + +The `merge` approach automatically updates new data in the destination incremental table but requires scanning all source tables referenced in the model SQL, as well as the destination table. This can be slow and expensive for large data volumes. [Partitioning and clustering](#using-table-partitioning-and-clustering) techniques mentioned earlier can help mitigate these issues. + +**Note:** The `unique_key` configuration is required when the `merge` incremental +strategy is selected.
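To make the semantics of that `merge` statement concrete, here is a minimal Python sketch that mimics it on in-memory rows. The function name `merge_rows` and the sample rows are hypothetical, and the sketch overwrites a matched row wholesale rather than updating selected columns. It only illustrates the "matched then update, not matched then insert" behavior keyed on `unique_key`; it is not how dbt or BigQuery implement the operation.

```python
def merge_rows(dest, src, unique_key):
    """Mimic `when matched then update / when not matched then insert`.

    `dest` and `src` are lists of dicts; a new list is returned and the
    inputs are left untouched.
    """
    # Index destination rows by their unique key (dicts keep insertion order).
    merged = {row[unique_key]: dict(row) for row in dest}
    for row in src:
        # A matching key is overwritten (update); a new key is appended (insert).
        merged[row[unique_key]] = dict(row)
    return list(merged.values())


# Hypothetical sample data: one row to update (id 2) and one to insert (id 3).
dest = [{"id": 1, "status": "placed"}, {"id": 2, "status": "placed"}]
src = [{"id": 2, "status": "shipped"}, {"id": 3, "status": "placed"}]
result = merge_rows(dest, src, "id")
```

Note that every source row has to be compared against the destination by key, which loosely mirrors why the real statement needs to scan both sides.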
+ +### The `insert_overwrite` strategy + +The `insert_overwrite` strategy generates a merge statement that replaces entire partitions +in the destination table. **Note:** this configuration requires that the model is configured +with a [Partition clause](#partition-clause). The `merge` statement that dbt generates +when the `insert_overwrite` strategy is selected looks something like: + +```sql +/* + Create a temporary table from the model SQL +*/ +create temporary table {{ model_name }}__dbt_tmp as ( + {{ model_sql }} +); + +/* + If applicable, determine the partitions to overwrite by + querying the temp table. +*/ + +declare dbt_partitions_for_replacement array; +set (dbt_partitions_for_replacement) = ( + select as struct + array_agg(distinct date(max_tstamp)) + from `my_project`.`my_dataset`.{{ model_name }}__dbt_tmp +); + +/* + Overwrite partitions in the destination table which match + the partitions in the temporary table +*/ +merge into {{ destination_table }} DEST +using {{ model_name }}__dbt_tmp SRC +on FALSE + +when not matched by source and {{ partition_column }} in unnest(dbt_partitions_for_replacement) +then delete + +when not matched then insert ... +``` + +For a complete writeup on the mechanics of this approach, see +[this explainer post](https://discourse.getdbt.com/t/bigquery-dbt-incremental-changes/982). + +#### Determining partitions to overwrite + +dbt is able to determine the partitions to overwrite dynamically from the values +present in the temporary table, or statically using a user-supplied configuration. + +The "dynamic" approach is simplest (and the default), but the "static" approach +will reduce costs by eliminating multiple queries in the model build script. + +#### Static partitions + +To supply a static list of partitions to overwrite, use the `partitions` configuration. 
+ + + +```sql +{% set partitions_to_replace = [ + 'timestamp(current_date)', + 'timestamp(date_sub(current_date, interval 1 day))' +] %} + +{{ + config( + materialized = 'incremental', + incremental_strategy = 'insert_overwrite', + partition_by = {'field': 'session_start', 'data_type': 'timestamp'}, + partitions = partitions_to_replace + ) +}} + +with events as ( + + select * from {{ref('events')}} + + {% if is_incremental() %} + -- recalculate yesterday + today + where timestamp_trunc(event_timestamp, day) in ({{ partitions_to_replace | join(',') }}) + {% endif %} + +), + +... rest of model ... +``` + + + +This example model serves to replace the data in the destination table for both +_today_ and _yesterday_ every day that it is run. It is the fastest and cheapest +way to incrementally update a table using dbt. If we wanted this to run more dynamically +(say, always for the past 3 days), we could leverage dbt’s baked-in [datetime macros](https://github.com/dbt-labs/dbt-core/blob/dev/octavius-catto/core/dbt/include/global_project/macros/etc/datetime.sql) and write a few of our own. + +Think of this as "full control" mode. You must ensure that expressions or literal values in the `partitions` config have proper quoting when templated, and that they match the `partition_by.data_type` (`timestamp`, `datetime`, `date`, or `int64`). Otherwise, the filter in the incremental `merge` statement will raise an error. + +#### Dynamic partitions + +If no `partitions` configuration is provided, dbt will instead: + +1. Create a temporary table for your model SQL +2. Query the temporary table to find the distinct partitions to be overwritten +3. Query the destination table to find the _max_ partition in the database + +When building your model SQL, you can take advantage of the introspection performed +by dbt to filter for only _new_ data. The max partition in the destination table +will be available using the `_dbt_max_partition` BigQuery scripting variable.
**Note:** +this is a BigQuery SQL variable, not a dbt Jinja variable, so no jinja brackets are +required to access this variable. + +**Example model SQL:** + +```sql +{{ + config( + materialized = 'incremental', + partition_by = {'field': 'session_start', 'data_type': 'timestamp'}, + incremental_strategy = 'insert_overwrite' + ) +}} + +with events as ( + + select * from {{ref('events')}} + + {% if is_incremental() %} + + -- recalculate latest day's data + previous + -- NOTE: The _dbt_max_partition variable is used to introspect the destination table + where date(event_timestamp) >= date_sub(date(_dbt_max_partition), interval 1 day) + +{% endif %} + +), + +... rest of model ... +``` + +#### Copying partitions + +If you are replacing entire partitions in your incremental runs, you can opt to do so with the [copy table API](https://cloud.google.com/bigquery/docs/managing-tables#copy-table) and partition decorators rather than a `merge` statement. While this mechanism doesn't offer the same visibility and ease of debugging as the SQL `merge` statement, it can yield significant savings in time and cost for large datasets because the copy table API does not incur any costs for inserting the data - it's equivalent to the `bq cp` gcloud command line interface (CLI) command. + +You can enable this by switching on `copy_partitions: True` in the `partition_by` configuration. This approach works only in combination with "dynamic" partition replacement. + + + +```sql +{{ config( + materialized="incremental", + incremental_strategy="insert_overwrite", + partition_by={ + "field": "created_date", + "data_type": "timestamp", + "granularity": "day", + "time_ingestion_partitioning": true, + "copy_partitions": true + } +) }} + +select + user_id, + event_name, + created_at, + -- values of this column must match the data type + granularity defined above + timestamp_trunc(created_at, day) as created_date + +from {{ ref('events') }} +``` + + + + + +``` +... 
+[0m16:03:13.017641 [debug] [Thread-3 (]: BigQuery adapter: Copying table(s) "/projects/projectname/datasets/analytics/tables/bigquery_table__dbt_tmp$20230112" to "/projects/projectname/datasets/analytics/tables/bigquery_table$20230112" with disposition: "WRITE_TRUNCATE" +... +``` + + + +## Controlling table expiration + +By default, dbt-created tables never expire. You can configure certain model(s) +to expire after a set number of hours by setting `hours_to_expiration`. + +:::info Note +The `hours_to_expiration` only applies to initial creation of the underlying table. It doesn't reset for incremental models when they do another run. +::: + + + +```yml +models: + [](/reference/resource-configs/resource-path): + +hours_to_expiration: 6 + +``` + + + + + +```sql + +{{ config( + hours_to_expiration = 6 +) }} + +select ... + +``` + + + +## Authorized Views + +If the `grant_access_to` config is specified for a model materialized as a +view, dbt will grant the view model access to select from the list of datasets +provided. See [BQ docs on authorized views](https://cloud.google.com/bigquery/docs/share-access-views) +for more details. + + + + + +```yml +models: + [](/reference/resource-configs/resource-path): + +grant_access_to: + - project: project_1 + dataset: dataset_1 + - project: project_2 + dataset: dataset_2 +``` + + + + + +```sql + +{{ config( + grant_access_to=[ + {'project': 'project_1', 'dataset': 'dataset_1'}, + {'project': 'project_2', 'dataset': 'dataset_2'} + ] +) }} +``` + + + +Views with this configuration will be able to select from objects in `project_1.dataset_1` and `project_2.dataset_2`, even when they are located elsewhere and queried by users who do not otherwise have access to `project_1.dataset_1` and `project_2.dataset_2`. 
+ + + +## Materialized views + +The BigQuery adapter supports [materialized views](https://cloud.google.com/bigquery/docs/materialized-views-intro) +with the following configuration parameters: + +| Parameter | Type | Required | Default | Change Monitoring Support | +|----------------------------------------------------------------------------------|------------------------|----------|---------|---------------------------| +| [`on_configuration_change`](/reference/resource-configs/on_configuration_change) | `` | no | `apply` | n/a | +| [`cluster_by`](#clustering-clause) | `[]` | no | `none` | drop/create | +| [`partition_by`](#partition-clause) | `{}` | no | `none` | drop/create | +| [`enable_refresh`](#auto-refresh) | `` | no | `true` | alter | +| [`refresh_interval_minutes`](#auto-refresh) | `` | no | `30` | alter | +| [`max_staleness`](#auto-refresh) (in Preview) | `` | no | `none` | alter | +| [`description`](/reference/resource-properties/description) | `` | no | `none` | alter | +| [`labels`](#specifying-labels) | `{: }` | no | `none` | alter | +| [`hours_to_expiration`](#controlling-table-expiration) | `` | no | `none` | alter | +| [`kms_key_name`](#using-kms-encryption) | `` | no | `none` | alter | + + + + + + + + +```yaml +models: + [](/reference/resource-configs/resource-path): + [+](/reference/resource-configs/plus-prefix)[materialized](/reference/resource-configs/materialized): materialized_view + [+](/reference/resource-configs/plus-prefix)[on_configuration_change](/reference/resource-configs/on_configuration_change): apply | continue | fail + [+](/reference/resource-configs/plus-prefix)[cluster_by](#clustering-clause): | [] + [+](/reference/resource-configs/plus-prefix)[partition_by](#partition-clause): + - field: + - data_type: timestamp | date | datetime | int64 + # only if `data_type` is not 'int64' + - granularity: hour | day | month | year + # only if `data_type` is 'int64' + - range: + - start: + - end: + - interval: + 
[+](/reference/resource-configs/plus-prefix)[enable_refresh](#auto-refresh): true | false + [+](/reference/resource-configs/plus-prefix)[refresh_interval_minutes](#auto-refresh): + [+](/reference/resource-configs/plus-prefix)[max_staleness](#auto-refresh): + [+](/reference/resource-configs/plus-prefix)[description](/reference/resource-properties/description): + [+](/reference/resource-configs/plus-prefix)[labels](#specifying-labels): {: } + [+](/reference/resource-configs/plus-prefix)[hours_to_expiration](#controlling-table-expiration): + [+](/reference/resource-configs/plus-prefix)[kms_key_name](#using-kms-encryption): +``` + + + + + + + + + ```yaml -vars: - max_batch_size: 200 # Any integer less than or equal to 2100 will do. +version: 2 + +models: + - name: [] + config: + [materialized](/reference/resource-configs/materialized): materialized_view + [on_configuration_change](/reference/resource-configs/on_configuration_change): apply | continue | fail + [cluster_by](#clustering-clause): | [] + [partition_by](#partition-clause): + - field: + - data_type: timestamp | date | datetime | int64 + # only if `data_type` is not 'int64' + - granularity: hour | day | month | year + # only if `data_type` is 'int64' + - range: + - start: + - end: + - interval: + [enable_refresh](#auto-refresh): true | false + [refresh_interval_minutes](#auto-refresh): + [max_staleness](#auto-refresh): + [description](/reference/resource-properties/description): + [labels](#specifying-labels): {: } + [hours_to_expiration](#controlling-table-expiration): + [kms_key_name](#using-kms-encryption): +``` + + + + + + + + + + +```jinja +{{ config( + [materialized](/reference/resource-configs/materialized)='materialized_view', + [on_configuration_change](/reference/resource-configs/on_configuration_change)="apply" | "continue" | "fail", + [cluster_by](#clustering-clause)="" | [""], + [partition_by](#partition-clause)={ + "field": "", + "data_type": "timestamp" | "date" | "datetime" | "int64", + + #
only if `data_type` is not 'int64' + "granularity": "hour" | "day" | "month" | "year", + + # only if `data_type` is 'int64' + "range": { + "start": , + "end": , + "interval": , + } + }, + + # auto-refresh options + [enable_refresh](#auto-refresh)= true | false, + [refresh_interval_minutes](#auto-refresh)=, + [max_staleness](#auto-refresh)="", + + # additional options + [description](/reference/resource-properties/description)="", + [labels](#specifying-labels)={ + "": "", + }, + [hours_to_expiration](#controlling-table-expiration)=, + [kms_key_name](#using-kms-encryption)="", +) }} ``` -## Snapshots + + + + +Many of these parameters correspond to their table counterparts and have been linked above. +The set of parameters unique to materialized views covers [auto-refresh functionality](#auto-refresh). + +Learn more about these parameters in BigQuery's docs: +- [CREATE MATERIALIZED VIEW statement](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#create_materialized_view_statement) +- [materialized_view_option_list](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#materialized_view_option_list) + +### Auto-refresh -Columns in source tables can not have any constraints. -If, for example, any column has a `NOT NULL` constraint, an error will be thrown. +| Parameter | Type | Required | Default | Change Monitoring Support | +|------------------------------|--------------|----------|---------|---------------------------| +| `enable_refresh` | `` | no | `true` | alter | +| `refresh_interval_minutes` | `` | no | `30` | alter | +| `max_staleness` (in Preview) | `` | no | `none` | alter | -## Indexes +BigQuery supports [automatic refresh](https://cloud.google.com/bigquery/docs/materialized-views-manage#automatic_refresh) configuration for materialized views.
+By default, a materialized view will automatically refresh within 5 minutes of changes in the base table, but not more frequently than once every 30 minutes. +BigQuery only officially supports the configuration of the frequency (the "once every 30 minutes" frequency); +however, there is a feature in preview that allows for the configuration of the staleness (the "5 minutes" refresh). +dbt will monitor these parameters for changes and apply them using an `ALTER` statement. -Indexes are not supported by Microsoft Fabric Synapse Data Warehouse. Any Indexes provided as a configuration is ignored by the adapter. +Learn more about these parameters in BigQuery's docs: +- [materialized_view_option_list](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#materialized_view_option_list) +- [max_staleness](https://cloud.google.com/bigquery/docs/materialized-views-create#max_staleness) -## Grants with auto provisioning +### Limitations -Grants with auto provisioning is not supported by Microsoft Fabric Synapse Data Warehouse at this time. +As with most data platforms, there are limitations associated with materialized views. Some worth noting include: -## Incremental +- Materialized view SQL has a [limited feature set](https://cloud.google.com/bigquery/docs/materialized-views-create#supported-mvs). +- Materialized view SQL cannot be updated; the materialized view must go through a `--full-refresh` (DROP/CREATE). +- The `partition_by` clause on a materialized view must match that of the underlying base table. +- While materialized views can have descriptions, materialized view *columns* cannot. +- Recreating/dropping the base table requires recreating/dropping the materialized view. -Fabric supports both `delete+insert` and `append` strategy. +Find more information about materialized view limitations in Google's BigQuery [docs](https://cloud.google.com/bigquery/docs/materialized-views-intro#limitations). 
-If a unique key is not provided, it will default to the `append` strategy. + -## Permissions + -The Microsoft Entra identity (user or service principal) must be a Fabric Workspace admin to work on the database level at this time. Fine grain access control will be incorporated in the future. +## Python models -## cross-database macros +The BigQuery adapter supports Python models with the following additional configuration parameters: -Not supported at this time. +| Parameter | Type | Required | Default | Valid values | +|-------------------------|-------------|----------|-----------|------------------| +| `enable_list_inference` | `` | no | `True` | `True`, `False` | +| `intermediate_format` | `` | no | `parquet` | `parquet`, `orc` | -## dbt-utils +### The `enable_list_inference` parameter +The `enable_list_inference` parameter enables a PySpark data frame to read multiple records in the same operation. +By default, this is set to `True` to support the default `intermediate_format` of `parquet`. -Not supported at this time. However, dbt-fabric offers some utils macros. Please check out [utils macros](https://github.com/microsoft/dbt-fabric/tree/main/dbt/include/fabric/macros/utils). +### The `intermediate_format` parameter +The `intermediate_format` parameter specifies which file format to use when writing records to a table. The default is `parquet`. 
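As an illustration of where these two parameters would go, here is a minimal sketch of a dbt Python model. It assumes that `enable_list_inference` and `intermediate_format` can be passed through `dbt.config()` like other model configs, which the text above does not state explicitly, and `events` is a hypothetical upstream model. Treat it as a starting point under those assumptions rather than a confirmed recipe.

```python
def model(dbt, session):
    # Assumption: the BigQuery-specific parameters are set like any other
    # model config; both values shown here are the documented defaults.
    dbt.config(
        materialized="table",
        enable_list_inference=True,
        intermediate_format="parquet",
    )

    # `events` is a hypothetical upstream model used only for illustration.
    events = dbt.ref("events")

    # A Python model returns the data frame that should be written out.
    return events
```

Because both values are the defaults, this configuration only makes the choice explicit; switching `intermediate_format` to `orc` would be the one-line change to try the other valid value.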
+ From 47a7eff24a8b042bab4aa5b9fb4543ca6a209fd1 Mon Sep 17 00:00:00 2001 From: Ly Nguyen Date: Thu, 15 Aug 2024 11:54:49 -0700 Subject: [PATCH 5/9] Check a blog --- ...-building-your-semantic-layer-in-pieces.md | 198 +++++++++++++----- 1 file changed, 142 insertions(+), 56 deletions(-) diff --git a/website/blog/2024-07-08-building-your-semantic-layer-in-pieces.md b/website/blog/2024-07-08-building-your-semantic-layer-in-pieces.md index 53704131700..de1990047ad 100644 --- a/website/blog/2024-07-08-building-your-semantic-layer-in-pieces.md +++ b/website/blog/2024-07-08-building-your-semantic-layer-in-pieces.md @@ -9,90 +9,176 @@ date: 2024-07-10 is_featured: true --- -The [dbt Semantic Layer](/docs/use-dbt-semantic-layer/dbt-sl) is founded on the idea that data transformation should be both _flexible_, allowing for on-the-fly aggregations grouped and filtered by definable dimensions and _version-controlled and tested_. Like any other codebase, you should have confidence that your transformations express your organization’s business logic correctly. Historically, you had to choose between these options, but the dbt Semantic Layer brings them together. This has required new paradigms for _how_ you express your transformations though. +At dbt Labs, we’ve always believed in meeting analytics engineers where they are. That’s why we’re so excited to announce that today, analytics engineers within the Microsoft Ecosystem can use dbt Cloud with not only Microsoft Fabric but also Azure Synapse Analytics Dedicated SQL Pools (ASADSP). - +Since the early days of dbt, folks have been interested having MSFT data platforms. 
Huge shoutout to [Mikael Ene](https://github.com/mikaelene) and [Jacob Mastel](https://github.com/jacobm001) for their efforts back in 2019 on the original SQL Server adapters ([dbt-sqlserver](https://github.com/dbt-msft/dbt-sqlserver) and [dbt-mssql](https://github.com/jacobm001/dbt-mssql), respectively) -Because of this, we’ve noticed when talking to dbt users that they _want_ to adopt the Semantic Layer, but feel daunted by the idea of migrating their transformations to this new paradigm. The good news is that you do _not_ need to make a huge one-time migration. +The journey for the Azure Synapse dbt adapter, dbt-synapse, is closely tied to my journey with dbt. I was the one who forked dbt-sqlserver into dbt-synapse in April of 2020. I had first learned of dbt only a month earlier and knew immediately that my team needed the tool. With a great deal of assistance from Jeremy and experts at Microsoft, my team and I got it off the ground and started using it. When I left my team at Avanade in early 2022 to join dbt Labs, I joked that I wasn’t actually leaving the team; I was just temporarily embedding at dbt Labs to expedite dbt Labs getting into Cloud. Two years later, I can tell my team that the mission has been accomplished! Kudos to all the folks who have contributed to the TSQL adapters either directly in GitHub or in the community Slack channels. The integration would not exist if not for you! -We’re here to discuss another way: building a Semantic Layer in pieces. Our goal is to make sure you derive increased leverage and velocity from each step on your journey. If you’re eager to start building but have limited bandwidth (like most busy analytics engineers), this one is especially for you. + -## System of a noun: deciding what happens where +## Fabric Best Practices -When you’re using the dbt Semantic Layer, you want to _minimize_ _the modeling that exists outside of dbt_. Eliminate it completely if you can. Why? 
+With the introduction of dbt Cloud support for Microsoft Fabric and Azure Synapse Analytics Dedicated SQL Pools, we're opening up new possibilities for analytics engineers in the Microsoft Ecosystem. -- It’s **duplicative, patchy, and confusing** as discussed above. -- It’s **less powerful**. -- You **can’t** **test** it. -- Depending on the tool, oftentimes you **can’t** **version control** it. +The goal of this blog is to ensure a great experience for: -What you want is a unified development flow that handles **normalized transformation in dbt models** and **dynamic denormalization in the dbt Semantic Layer** (meaning it dynamically combines and reshapes normalized data models into different formats whenever you need them). +- end-user data analysts who rely upon the data products built with dbt +- the analytics engineers, who should predominately spend time creating and maintaining data products instead of maintaining and spinning up infrastructure +- data engineers who focus on data movement and ingestion into Synapse -:::info -🏎️ **The Semantic Layer is a denormalization engine.** dbt transforms your data into clean, normalized marts. The dbt Semantic Layer is a denormalization engine that dynamically connects and molds these building blocks into the maximum amount of shapes available _dynamically_. -::: +To achieve this goal, this post will cover four main areas: + +- Microsoft Fabric: the future of data warehousing in the Microsoft/Azure stack +- strategic recommendations for provisioning a Synapse environment +- data modeling in dbt: Synapse style +- considerations for upstream and downstream of a Synapse-backed dbt project + +With that, let’s dive in! + +## Fabric is the future + +Many data teams currently use Azure Synapse dedicated pools. However, Fabric Synapse Data Warehouse is the future of data warehousing in the Microsoft Ecosystem.
Azure Synapse Analytics will remain available for a few more years, but Microsoft’s main focus is on Fabric, as we can see in their roadmap and launches. + +Because data platform migrations are complex and time-consuming, it’s perfectly reasonable to still be using dbt with Azure Synapse for the next two years while the migration is under way. Thankfully, if your team is already using ASADSP, transitioning to the new Cloud offering will be much more straightforward than the migration from on-premises databases to the Cloud. + +In addition, if you're already managing your Synapse warehouse with a dbt project, you'll benefit from an even smoother migration process. Your DDL statements will be automatically handled, reducing the need for manual refactoring. + +Bottom line, Fabric is the future of data warehousing for Microsoft customers, and Synapse will be deprecated at an as-of-yet undeclared End-of-Life. + +There’s undeniable potential offered by Fabric with its: + +- fully-separated storage and compute, and +- pay-per-second compute. + +These two things alone greatly simplify the below section on Resource Provisioning. + +For more information, see: -This enables a more **flexible consumption layer**, meaning downstream tools (like AI or dashboards) can sit as directly on top of Semantic Layer-generated artifacts and APIs as possible, and focus on what makes them shine instead of being burdened by basic dynamic modeling and aggregation tasks. Any tool-specific constructs should typically operate as close to **transparent pass-throughs** as you can make them, primarily serving to surface metrics and dimensions from the Semantic Layer in your downstream tool. There may be exceptions of course, but as a general guiding principle this gets you the most dynamic denormalization ability, and thus value, from your Semantic Layer code.
+- the official guide: [Migration: Azure Synapse Analytics dedicated SQL pools to Fabric](https://learn.microsoft.com/en-us/fabric/data-warehouse/migration-synapse-dedicated-sql-pool-warehouse). +- this blog about [the Future of Azure Synapse Analytics](https://blog.fabric.microsoft.com/en-us/blog/microsoft-fabric-explained-for-existing-synapse-users/) -So now we’ve established the system, let’s dig into the _plan_ for how we can get there iteratively. +## Resource Provisioning -## The plan: towards iterative velocity +Here are some considerations if you’re setting up an environment from scratch. If the infrastructure of multiple Synapse dedicated SQL pools and a Git repo already exist, you can skip to the next section, though a review of the below as a refresher wouldn’t hurt. -1. **Identify a Data Product that is impactful** Find something that is in heavy use and high value, but fairly narrow scope. **Don’t start with a broad executive dashboard** that shows metrics from across the company because you’re looking to optimize for migrating the **smallest amount of modeling for the highest amount of impact** that you can. +### minimize pools; maximize DWUs - For example, a good starting place would be a dashboard focused on Customer Acquisition Cost (CAC) that relies on a narrow set of metrics and underlying tables that are nonetheless critical for your company. -2. **Catalog the models and their columns that service the Data Product**, both **in dbt _and_ the BI tool**, including rollups, metrics tables, and marts that support those. Pay special attention to aggregations as these will constitute _metrics_. You can reference [this example Google Sheet](https://docs.google.com/spreadsheets/d/1BR62C5jY6L5f5NvieMcA7OVldSFxu03Y07TG3waq0As/edit?usp=sharing) for one-way you might track this. -3. 
[**Melt the frozen rollups**](https://docs.getdbt.com/best-practices/how-we-build-our-metrics/semantic-layer-6-terminology) in your dbt project, as well as variations modeled in your BI tool, **into Semantic Layer code.** We’ll go much more in-depth on this process, and we encourage you to read more about this tactical terminology (frozen, rollup, etc) in the link — it will be used throughout this article! -4. **Create a parallel version of your data product that points to Semantic Layer artifacts, audit, and then publish.** Creating in parallel takes the pressure off, allowing you to fix any issues and publish gracefully. You’ll keep the existing Data Product as-is while swapping the clone to be supplied with data from the Semantic Layer. +#### definitions - +- dedicated SQL pools: effectively one data warehouse +- Data warehouse units (DWUs): the size of the cluster -These steps constitute an **iterative piece** you will ship as you **progressively** move code into your Semantic Layer. As we dig into how to do this, we’ll discuss the **immediate value** this provides to your team and stakeholders. Broadly, it enables you to drastically increase [**iteration velocity**](https://www.linkedin.com/posts/rauchg_iteration-velocity-is-the-right-metric-to-activity-7087498430226313216-BVIP?utm_source=share&utm_medium=member_desktop). +#### number of pools -The process of **melting static, frozen tables** into more flexible, fluid, **dynamic Semantic Layer code** is not complex, but it’s helpful to dig into the specific steps in the process. In the next section, we’ll dive into what this looks like in practice so you have a solid understanding of the "what’s required". +With Synapse, a warehouse is both storage and compute. That is to say, to access data, the cluster needs to be on and warmed up. 
-This is the most **technical, detailed, and specific section of this article**, so make sure to bookmark it and **reference it** as often as you can until the process becomes as intuitive as regular modeling in dbt! +If you only have one team of analytics engineers, you should have two SQL pools: one for development and one for production. If you have multiple distinct teams that will be modeling data in Synapse using dbt, consider using dbt Cloud’s Mesh paradigm to enable cross-team collaboration. -## Migrating a chunk: step-by-step +Each should be at the highest tier that you can afford. You should also consider purchasing “year-long reservations” for a steep discount. -### 1. Identify target +Some folks will recommend looking into scaling pools up and down based on demand. However, I’ve learned from personal experience that this optimization is not a free lunch and will require significant investment to not only build out but maintain. A large enough instance that is on whenever needed keeps at least half an engineer's time free to work on actual data modeling rather than platform maintenance. -1. **Identify a relatively normalized mart that is powering rollups in dbt**. If you do your rollups in your BI tool, start there. But we recommend starting with the frozen tables in dbt _first_ and moving through the flow of the DAG progressively, bringing logic in your BI tool into play last. This is because we want to iteratively break up these frozen concepts in such a way that we benefit from earlier parts of the chain being migrated already. Think "moving left-to-right in a big DAG" that spans all your tools. - - ✅ `orders`, `customers` — these are basic concepts powering your business, so should be marts models materialized via dbt. - - ❌ `active_accounts_per_week` — this is built on top of the above, and something we want to generate dynamically in the dbt Semantic Layer.
- - Put another way: `customers` and `orders` are **normalized building blocks**, `active_accounts_per_week` is a **rollup** and we always want to _migrate those to the Semantic Layer_. - +#### DWUs -### 2. Catalog the inputs +The starting tier is `DW100c`, which costs $1.20/hour and has limitations such as only allowing 4 concurrent queries. To add 4 concurrent queries, you must increase the DWH tier. For every increase in 100 `c`'s, you gain an additional 4 concurrent queries. -1. Identify **normalized columns** and **ignore any aggregation columns** for now. For example, `order_id`, `ordered_at`, `customer_id`, `order_total` are fields we want to put in our semantic model, a window function that sums `customer_cac` _statically_ in the dbt model is _not_ a field we want in our semantic model because we want to _dynamically_ codify that calculation as a metric in the Semantic Layer. - 1. If you find in the next step that you can’t express a certain calculation in the Semantic Layer yet, use dbt to model it**.** This is the beauty of having your Semantic Layer code integrated in your dbt codebase, it’s easy to manage the push and pull of the line between the Transformation and Semantic Layers because you’re managing **a cohesive set of code and tooling.** +If this warehouse is intended to be the single source of truth for data analysts, you should design it to perform for that use case. In all likelihood, that means paying for a higher tier. Just like the above-discussed potential for saving money by turning the cluster on and off as needed, paying for a lower tier introduces another host of problems. If the limitation of 4 concurrent queries becomes a bottleneck, your choice is to either -### 3. Write Semantic Layer code +- design infrastructure to push the data out of Synapse and into an Azure SQL db or elsewhere +- increase the tier of service paid (i.e. increase the `DWU`s) -1. 
**Start with the semantic model** going through column by column and putting all identified columns from Step 2 into the 3 semantic buckets: - 1. [**Entities**](/docs/build/entities) — these are the spine of your semantic concepts or objects, you can think of them as roughly correlating to IDs or keys that form the grain. - 2. [**Dimensions**](/docs/build/dimensions) — these are ways of grouping and bucketing these objects or concepts, such as time and categories. - 3. [**Measures**](/docs/build/measures) — these are numeric values that you want to aggregate such as an order total or number of times a user clicked an ad. -2. **Create metrics for the aggregation columns** we didn’t encode into the semantic model. -3. Now, **identify a rollup you want to melt**. Refer to the [earlier example](#1-identify-target) to help distinguish these types of models. -4. **Repeat these steps for any** **other concepts** that you need to create that rollup e.g. `active_accounts_per_week` may need **both `customers` and `orders`.** -5. **Create metrics for the aggregation columns present in the rollup**. If your rollup references multiple models, put metrics in the YAML file that is most closely related to the grain or key aggregation of the table. For example, `active_accounts_per_week` is aggregated at a weekly time grain, but the key metric counts customer accounts, so we’d want to put that metric in the `customers.yml` or `sem_customers.yml` file (depending on [the naming system](/best-practices/how-we-build-our-metrics/semantic-layer-7-semantic-structure) you prefer). If it also contained a metric aggregating total orders in a given week, we’d put that metric into `orders.yml` or `sem_orders.yml`. -6. **Create [saved queries with exports](/docs/build/saved-queries)** configured to materialize your new Semantic Layer-based artifacts into the warehouse in parallel with the frozen rollup. This will allow us to shift consumption tools and audit results. 
+I’m of the opinion that minimizing Cloud spend should not come at the expense of developer productivity — both sides of the equation need to be considered. As such, I advocate predominately for the latter of the above two choices. -### 4. Connect external tools in parallel +### Deployment Resources -1. Now, **shift your external analysis tool to point at the Semantic Layer exports instead of the rollup**. Remember, we only want to shift the pointers for the rollup that we’ve migrated, everything else should stay pointing to frozen rollups. We’re migrating iteratively in pieces! - 1. If your downstream tools have an integration with the Semantic Layer, you’ll want to set that up as well. This will allow not only [declarative caching](/docs/use-dbt-semantic-layer/sl-cache#declarative-caching) of common query patterns with exports but also easy, totally dynamic on-the-fly queries. -2. Once you’ve replicated the previous state of things, with the Semantic Layer providing the data instead of frozen rollups, now you’re ready to **shift the transformations happening in your BI tool into the Semantic Layer**, following the same process. -3. Finally, to **feel the new speed and power you’ve unlocked**, ask a stakeholder for a dimension or metric that’s on their wishlist for the data product you’re working with. Then, bask in the glory of amazing them when you ship it an hour later! +In the Microsoft ecosystem, data warehouse deployments are more commonly conducted with Azure Data Factory instead of Azure DevOps pipelines or GitHub Actions. We recommend separating dbt project deployments from any ingestion pipeline defined in ADF. -:::tip -💁🏻‍♀️ If your BI tool allows it, make sure to do the BI-related steps above **in a development environment**. If it doesn’t have these capabilities, stick with duplicating the data product you’re re-building and perform this there so you can swap it later after you’ve tested it thoroughly. 
+However, if you must use ADF as the deployment pipeline, it is possible to use dbt Cloud APIs. Running dbt Core within Azure Data Factory can be challenging as there’s no easy way to install and invoke dbt Core, because there’s no easy way to install and run Python. The workarounds aren’t great, for example: Setting up dbt calls via Azure Serverless Functions and invoking them from ADF. + +### access control + +#### permissions for analytics engineers + +:::caution +⚠️ User-based Azure Active Directory authentication is not yet supported in dbt Cloud. As a workaround, consider having a [Service Principal](https://learn.microsoft.com/en-us/entra/identity-platform/app-objects-and-service-principals?tabs=browser) made for each contributing Analytics Engineer for use in dbt Cloud ::: -## Deep impact +In the development warehouse, each user should have the following privileges: `EXECUTE`, `SELECT`, `INSERT`, `UPDATE`, and `DELETE`. + +#### service principal permissions + +In addition, a service principal is required for dbt Cloud to directly interact with both the warehouse and your Git service provider (e.g. GitHub or Azure DevOps). + +Only the Service Principal in charge of deployment has the above permissions in production. End users have only `SELECT` access to this environment. + +## Model Considerations + +The magic begins when the environments are provisioned and dbt Cloud is connected. + +With dbt on Synapse, you can own the entire data transformation workflow from raw data to modeled data that data analysts and end users rely upon. The end product of which will be documented and tested. + +With dbt Cloud, things are even more streamlined. The dbt Cloud CLI allows developers to build only the models they need for a PR, deferring to the production environment for dependencies. There’s also dbt Explorer, which now has column-level lineage. 
+ +While there are already platform-agnostic best practice guides that still apply for Synapse, there are some additional factors related to data distribution and indexing. + +### distributions & indices + +Working in ASADSP, it is important to remember that you’re working in a [Massively-Parallel Processing (MPP) architecture](https://www.indicative.com/resource/what-is-massively-parallel-processing-mpp/). + +What this means for an analytics engineer working with dedicated SQL pools is that every table model must have an `index` and `distribution` configured. In `dbt-synapse` the defaults are: + +- index: `CLUSTERED COLUMNSTORE INDEX` +- distribution: `ROUND_ROBIN` + +If you want something different, you can define it like below. For more information, see [dbt docs: configurations for Azure Synapse DWH: Indices and distributions](https://docs.getdbt.com/reference/resource-configs/azuresynapse-configs#indices-and-distributions). + +```sql +{{ + config( + index='HEAP', + dist='ROUND_ROBIN' + ) +}} +SELECT * FROM {{ ref('some_model') }} +``` + +A distribution specifies how the table rows should be stored across the 60 nodes of the cluster. The goal is to provide a configuration that both: + +1. ensures data is split evenly across the nodes of the cluster, and +2. minimizes inter-node movement of data. + +For example, imagine querying a 100-row seed table in a downstream model. Using `distribution=ROUND_ROBIN` instructs the pool to evenly distribute the rows across the 60 nodes, which equates to having only one or two rows in each node. This makes `SELECT`-ing all of these rows an operation that touches all 60 nodes. The end result is that the query will run much slower than you might expect. + +The optimal distribution here is `REPLICATE`, which will load a full copy of the table to every node. In this scenario, any node can return the 100 rows without coordination from the others. 
This is ideal for a lookup table, which could limit the result set within each node before aggregating each node’s results. + + +#### more information + +- [Guidance for designing distributed tables using dedicated SQL pool in Azure Synapse Analytics](https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute) +- [source code for `synapse__create_table_as()` macro](https://github.com/microsoft/dbt-synapse/blob/master/dbt/include/synapse/macros/materializations/models/table/create_table_as.sql) + + +## Deployments & Ecosystem + +With the infrastructure in place and the analytics engineers enabled with best practices, the final piece is to think through how a dbt project sits in the larger data stack of your organization both upstream and downstream. + +### Upstream + +In dbt, we assume the data has already been ingested into the warehouse raw. This follows a broader paradigm known as Extract-Load-Transform (ELT). The same goes for dbt with Azure Synapse. The goal should be to have the data ingested into Synapse that is as “untouched” as possible from when it came from the upstream source system. It’s common for data teams using Azure Data Factory to continue to employ an ETL paradigm where data is transformed before it even lands in the warehouse. We do not recommend this, as it results in critical data transformation living outside of the dbt project, and therefore undocumented. + +If you have not already, engage the central/upstream data engineering team to devise a plan to integrate data extraction and movement in tools such as SSIS and Azure Data Factory with the transformation performed via dbt Cloud. + +### Downstream Consumers (Power BI) + +It is extremely common in the MSFT data ecosystem to have significant amounts of data modeling live within Power BI reports and/or datasets. This is OK up to a certain point. + +The correct approach is not to mandate that all data modeling should be done in dbt with `SQL`. 
Instead, seek out the most business-critical Power BI datasets and reports. Any modeling done in those reports should be upstreamed into the dbt project where it can be properly tested and documented. + +There should be a continuous effort to take any Power Query code written in PBI as transformation code and to upstream it into the data warehouse where the modeling can be tested, documented, reused by others and deployed with confidence. + +## Conclusion -The first time you turn around a newly sliced, diced, filtered, and rolled up metric table for a stakeholder in under an hour instead of a week, not only you, but the stakeholder will immediately feel the value and power of the Semantic Layer. +There’s great opportunity in dbt Cloud today for data teams using Azure Synapse. While Fabric is the future, there are meaningful considerations when it comes to resource provisioning, model design, and deployments within the larger ecosystem. -dbt Labs’ mission is to create and disseminate organizational knowledge. This process, and building a Semantic Layer generally, is about encoding organizational knowledge in such a way that it creates and disseminates _leverage_. Enabled by this process, you can start building your Semantic Layer _today_, without waiting for the magical capacity for a giant overhaul to materialize. Building iterative velocity as you progress, your team can finally make any BI tool deliver the way you need it to. +As we look ahead, we're excited about the possibilities that Microsoft Fabric holds for the future of data analytics. With dbt Cloud and Azure Synapse, analytics engineers can disseminate knowledge with confidence to the rest of their organization. From ca057f95a8f342580de74b26b3b82048abae59d1 Mon Sep 17 00:00:00 2001 From: Ly Nguyen Date: Thu, 15 Aug 2024 12:25:24 -0700 Subject: [PATCH 6/9] Revert "Check a blog" This reverts commit 47a7eff24a8b042bab4aa5b9fb4543ca6a209fd1. 
--- ...-building-your-semantic-layer-in-pieces.md | 198 +++++------------- 1 file changed, 56 insertions(+), 142 deletions(-) diff --git a/website/blog/2024-07-08-building-your-semantic-layer-in-pieces.md b/website/blog/2024-07-08-building-your-semantic-layer-in-pieces.md index de1990047ad..53704131700 100644 --- a/website/blog/2024-07-08-building-your-semantic-layer-in-pieces.md +++ b/website/blog/2024-07-08-building-your-semantic-layer-in-pieces.md @@ -9,176 +9,90 @@ date: 2024-07-10 is_featured: true --- -At dbt Labs, we’ve always believed in meeting analytics engineers where they are. That’s why we’re so excited to announce that today, analytics engineers within the Microsoft Ecosystem can use dbt Cloud with not only Microsoft Fabric but also Azure Synapse Analytics Dedicated SQL Pools (ASADSP). +The [dbt Semantic Layer](/docs/use-dbt-semantic-layer/dbt-sl) is founded on the idea that data transformation should be both _flexible_, allowing for on-the-fly aggregations grouped and filtered by definable dimensions and _version-controlled and tested_. Like any other codebase, you should have confidence that your transformations express your organization’s business logic correctly. Historically, you had to choose between these options, but the dbt Semantic Layer brings them together. This has required new paradigms for _how_ you express your transformations though. -Since the early days of dbt, folks have been interested having MSFT data platforms. Huge shoutout to [Mikael Ene](https://github.com/mikaelene) and [Jacob Mastel](https://github.com/jacobm001) for their efforts back in 2019 on the original SQL Server adapters ([dbt-sqlserver](https://github.com/dbt-msft/dbt-sqlserver) and [dbt-mssql](https://github.com/jacobm001/dbt-mssql), respectively) + -The journey for the Azure Synapse dbt adapter, dbt-synapse, is closely tied to my journey with dbt. I was the one who forked dbt-sqlserver into dbt-synapse in April of 2020. 
I had first learned of dbt only a month earlier and knew immediately that my team needed the tool. With a great deal of assistance from Jeremy and experts at Microsoft, my team and I got it off the ground and started using it. When I left my team at Avanade in early 2022 to join dbt Labs, I joked that I wasn’t actually leaving the team; I was just temporarily embedding at dbt Labs to expedite dbt Labs getting into Cloud. Two years later, I can tell my team that the mission has been accomplished! Kudos to all the folks who have contributed to the TSQL adapters either directly in GitHub or in the community Slack channels. The integration would not exist if not for you! +Because of this, we’ve noticed when talking to dbt users that they _want_ to adopt the Semantic Layer, but feel daunted by the idea of migrating their transformations to this new paradigm. The good news is that you do _not_ need to make a huge one-time migration. - +We’re here to discuss another way: building a Semantic Layer in pieces. Our goal is to make sure you derive increased leverage and velocity from each step on your journey. If you’re eager to start building but have limited bandwidth (like most busy analytics engineers), this one is especially for you. -## Fabric Best Practices +## System of a noun: deciding what happens where -With the introduction of dbt Cloud support for Microsoft Fabric and Azure Synapse Analytics Dedicated SQL Pools, we're opening up new possibilities for analytics engineers in the Microsoft Ecosystem. +When you’re using the dbt Semantic Layer, you want to _minimize_ _the modeling that exists outside of dbt_. Eliminate it completely if you can. Why? -The goal of this blog is to ensure a great experience for both +- It’s **duplicative, patchy, and confusing** as discussed above. +- It’s **less powerful**. +- You **can’t** **test** it. +- Depending on the tool, oftentimes you **can’t** **version control** it. 
-- end-user data analysts who rely upon the data products built with dbt and -- the analytics engineers, who should predominately spend time creating and maintaining data products instead of maintaining and spinning up infrastructure -- data engineers who focus on data movement and ingestion into Synapse +What you want is a unified development flow that handles **normalized transformation in dbt models** and **dynamic denormalization in the dbt Semantic Layer** (meaning it dynamically combines and reshapes normalized data models into different formats whenever you need them). -To achieve this goal, this post will cover four main areas - -- Microsoft Fabric: the future of data warehousing in the Microsoft/Azure stack -- strategic recommendations for provisioning Synapse environment -- data modeling in dbt: Synapse style -- Considerations for upstream and downstream of a Synapse-backed dbt project - -With that, let’s dive in! - -## Fabric is the future - -Many data teams currently use Azure Synapse dedicated pools. However, Fabric Synapse Data Warehouse is the future of data warehousing in the Microsoft Ecosystem. Azure Synapse Analytics will remain available for a few more years, but Microsoft’s main focus is on Fabric as we can see in their roadmap and launches. - -Because data platform migrations are complex and time-consuming, it’s perfectly reasonable to still be using dbt with Azure Synapse for the next two years while the migration is under way. Thankfully, if your team already is using ASADSP, transitioning to the new Cloud offering will be much more straightforward than the migration from on-premise databases to the Cloud. - -In addition, if you're already managing your Synapse warehouse with a dbt project, you'll benefit from an even smoother migration process. Your DDL statements will be automatically handled, reducing the need for manual refactoring. 
- -Bottom line, Fabric is the future of data warehousing for Microsoft customers, and Synapse will be deprecated at an as-of-yet undeclared End-of-Life. - - There’s undeniable potential offered by Fabric with its: - -- fully-separated storage and compute, and -- pay-per-second compute. - -These two things alone greatly simplify the below section on Resource Provisioning. - -For more information, see: - -- the official guide: [Migration: Azure Synapse Analytics dedicated SQL pools to Fabric](https://learn.microsoft.com/en-us/fabric/data-warehouse/migration-synapse-dedicated-sql-pool-warehouse). -- this blog about [the Future of Azure Synapse Analytics](https://blog.fabric.microsoft.com/en-us/blog/microsoft-fabric-explained-for-existing-synapse-users/) - -## Resource Provisioning - -Here are some considerations if you’re setting up an environment from scratch. If the infrastructure of multiple Synapse dedicated SQL pools and a Git repo already exist, you can skip to the next section, though a review of the below as a refresher wouldn’t hurt. - -### minimize pools; maximize DWUs - -#### definitions - -- dedicated SQL pools: effectively one data warehouse -- Data warehouse units (DWUs): the size of the cluster - -#### number of pools - -With Synapse, a warehouse is both storage and compute. That is to say, to access data, the cluster needs to be on and warmed up. - -If you only have one team of analytics engineers, you should have two SQL pools: one for development and one for production. If you have multiple distinct teams that will be modeling data in Synapse using dbt, consider using dbt Cloud’s Mesh paradigm to enable cross-team collaboration. - -Each should be at the highest tier that you can afford. You should also consider purchasing “year-long reservations” for a steep discount. - -Some folks will recommend looking into scaling up and down pools based on demand. 
However, I’ve learned from personal experience that this optimization is not a free lunch and will require significant investment to not only build out but maintain. A large enough instance that is on whenever needed keeps at least half an engineer’s time free to work on actual data modeling rather than platform maintenance. - -#### DWUs - -The starting tier is `DW100c`, which costs $1.20/hour and has limitations such as only allowing 4 concurrent queries. To add 4 concurrent queries, you must increase the DWH tier. For every increase in 100 `c`'s, you gain an additional 4 concurrent queries. - -If this warehouse is intended to be the single source of truth for data analysts, you should design it to perform for that use case. In all likelihood, that means paying for a higher tier. Just like the above-discussed potential for saving money by turning the cluster on and off as needed, paying for a lower tier introduces another host of problems. If the limitation of 4 concurrent queries becomes a bottleneck, your choice is to either - -- design infrastructure to push the data out of Synapse and into an Azure SQL db or elsewhere -- increase the tier of service paid (i.e. increase the `DWU`s) - -I’m of the opinion that minimizing Cloud spend should not come at the expense of developer productivity — both sides of the equation need to be considered. As such, I advocate predominately for the latter of the above two choices. - -### Deployment Resources - -In the Microsoft ecosystem, data warehouse deployments are more commonly conducted with Azure Data Factory instead of Azure DevOps pipelines or GitHub Actions. We recommend separating dbt project deployments from any ingestion pipeline defined in ADF. - -However, if you must use ADF as the deployment pipeline, it is possible to use dbt Cloud APIs. Running dbt Core within Azure Data Factory can be challenging as there’s no easy way to install and invoke dbt Core, because there’s no easy way to install and run Python. 
The workarounds aren’t great, for example: Setting up dbt calls via Azure Serverless Functions and invoking them from ADF. - -### access control - -#### permissions for analytics engineers - -:::caution -⚠️ User-based Azure Active Directory authentication is not yet supported in dbt Cloud. As a workaround, consider having a [Service Principal](https://learn.microsoft.com/en-us/entra/identity-platform/app-objects-and-service-principals?tabs=browser) made for each contributing Analytics Engineer for use in dbt Cloud +:::info +🏎️ **The Semantic Layer is a denormalization engine.** dbt transforms your data into clean, normalized marts. The dbt Semantic Layer is a denormalization engine that dynamically connects and molds these building blocks into the maximum amount of shapes available _dynamically_. ::: -In the development warehouse, each user should have the following privileges: `EXECUTE`, `SELECT`, `INSERT`, `UPDATE`, and `DELETE`. - -#### service principal permissions - -In addition, a service principal is required for dbt Cloud to directly interact with both the warehouse and your Git service provider (e.g. GitHub or Azure DevOps). - -Only the Service Principal in charge of deployment has the above permissions in production. End users have only `SELECT` access to this environment. +This enables a more **flexible consumption layer**, meaning downstream tools (like AI or dashboards) can sit as directly on top of Semantic Layer-generated artifacts and APIs as possible, and focus on what makes them shine instead of being burdened by basic dynamic modeling and aggregation tasks. Any tool-specific constructs should typically operate as close to **transparent pass-throughs** as you can make them, primarily serving to surface metrics and dimensions from the Semantic Layer in your downstream tool. There may be exceptions of course, but as a general guiding principle this gets you the most dynamic denormalization ability, and thus value, from your Semantic Layer code. 
-## Model Considerations +So now we’ve established the system, let’s dig into the _plan_ for how we can get there iteratively. -The magic begins when the environments are provisioned and dbt Cloud is connected. +## The plan: towards iterative velocity -With dbt on Synapse, you can own the entire data transformation workflow from raw data to modeled data that data analysts and end users rely upon. The end product of which will be documented and tested. +1. **Identify a Data Product that is impactful** Find something that is in heavy use and high value, but fairly narrow scope. **Don’t start with a broad executive dashboard** that shows metrics from across the company because you’re looking to optimize for migrating the **smallest amount of modeling for the highest amount of impact** that you can. -With dbt Cloud, things are even more streamlined. The dbt Cloud CLI allows developers to build only the models they need for a PR, deferring to the production environment for dependencies. There’s also dbt Explorer, which now has column-level lineage. + For example, a good starting place would be a dashboard focused on Customer Acquisition Cost (CAC) that relies on a narrow set of metrics and underlying tables that are nonetheless critical for your company. +2. **Catalog the models and their columns that service the Data Product**, both **in dbt _and_ the BI tool**, including rollups, metrics tables, and marts that support those. Pay special attention to aggregations as these will constitute _metrics_. You can reference [this example Google Sheet](https://docs.google.com/spreadsheets/d/1BR62C5jY6L5f5NvieMcA7OVldSFxu03Y07TG3waq0As/edit?usp=sharing) for one-way you might track this. +3. 
[**Melt the frozen rollups**](https://docs.getdbt.com/best-practices/how-we-build-our-metrics/semantic-layer-6-terminology) in your dbt project, as well as variations modeled in your BI tool, **into Semantic Layer code.** We’ll go much more in-depth on this process, and we encourage you to read more about this tactical terminology (frozen, rollup, etc) in the link — it will be used throughout this article! +4. **Create a parallel version of your data product that points to Semantic Layer artifacts, audit, and then publish.** Creating in parallel takes the pressure off, allowing you to fix any issues and publish gracefully. You’ll keep the existing Data Product as-is while swapping the clone to be supplied with data from the Semantic Layer. -While there are already platform-agnostic best practice guides that still apply for Synapse, there are some additional factors related to data distribution and indexing. + -### distributions & indices +These steps constitute an **iterative piece** you will ship as you **progressively** move code into your Semantic Layer. As we dig into how to do this, we’ll discuss the **immediate value** this provides to your team and stakeholders. Broadly, it enables you to drastically increase [**iteration velocity**](https://www.linkedin.com/posts/rauchg_iteration-velocity-is-the-right-metric-to-activity-7087498430226313216-BVIP?utm_source=share&utm_medium=member_desktop). -Working in ASADSP, it is important to remember that you’re working in a [Massively-Parallel Processing (MPP) architecture](https://www.indicative.com/resource/what-is-massively-parallel-processing-mpp/). +The process of **melting static, frozen tables** into more flexible, fluid, **dynamic Semantic Layer code** is not complex, but it’s helpful to dig into the specific steps in the process. In the next section, we’ll dive into what this looks like in practice so you have a solid understanding of the "what’s required". 
-What this means for an analytics engineer working using dedicated SQL pools is that for every table model, it must have an `index` and `distribution` configured. In `dbt-synapse` the defaults are: +This is the most **technical, detailed, and specific section of this article**, so make sure to bookmark it and **reference it** as often as you can until the process becomes as intuitive as regular modeling in dbt! -- index: `CLUSTERED COLUMNSTORE INDEX` -- distribution `ROUND_ROBIN` +## Migrating a chunk: step-by-step -If you want something different, you can define it like below. For more information, see [dbt docs: configurations for Azure Synapse DWH: Indices and distributions](https://docs.getdbt.com/reference/resource-configs/azuresynapse-configs#indices-and-distributions). +### 1. Identify target -```sql -{{ - config( - index='HEAP', - dist='ROUND_ROBIN' - ) -}} -SELECT * FROM {{ ref('some_model') }} -``` +1. **Identify a relatively normalized mart that is powering rollups in dbt**. If you do your rollups in your BI tool, start there. But we recommend starting with the frozen tables in dbt _first_ and moving through the flow of the DAG progressively, bringing logic in your BI tool into play last. This is because we want to iteratively break up these frozen concepts in such a way that we benefit from earlier parts of the chain being migrated already. Think "moving left-to-right in a big DAG" that spans all your tools. + - ✅ `orders`, `customers` — these are basic concepts powering your business, so should be marts models materialized via dbt. + - ❌ `active_accounts_per_week` — this is built on top of the above, and something we want to generate dynamically in the dbt Semantic Layer. + - Put another way: `customers` and `orders` are **normalized building blocks**, `active_accounts_per_week` is a **rollup** and we always want to _migrate those to the Semantic Layer_. + -A distribution specifies how the table rows should be stored across the 60 nodes of the cluster. 
The goal is to provide a configuration that both: +### 2. Catalog the inputs -1. ensures data is split evenly across the nodes of the cluster, and -2. minimizes inter-node movement of data. +1. Identify **normalized columns** and **ignore any aggregation columns** for now. For example, `order_id`, `ordered_at`, `customer_id`, `order_total` are fields we want to put in our semantic model, a window function that sums `customer_cac` _statically_ in the dbt model is _not_ a field we want in our semantic model because we want to _dynamically_ codify that calculation as a metric in the Semantic Layer. + 1. If you find in the next step that you can’t express a certain calculation in the Semantic Layer yet, use dbt to model it**.** This is the beauty of having your Semantic Layer code integrated in your dbt codebase, it’s easy to manage the push and pull of the line between the Transformation and Semantic Layers because you’re managing **a cohesive set of code and tooling.** -For example, imagine querying a 100-row seed table in a downstream model. Using `distribution=ROUND_ROBIN` instructs the pool to evenly distribute the rows across the 60 nodes, which equates to having only one or two rows in each node. This makes `SELECT`-ing all of these rows an operation that touches all 60 nodes. The end result is that the query will run much slower than you might expect. +### 3. Write Semantic Layer code -The optimal distribution here is `REPLICATE`, which will load a full copy of the table to every node. In this scenario, any node can return the 100 rows without coordination from the others. This is ideal for a lookup table, which could limit the result set within each node before aggregating each node’s results. +1. **Start with the semantic model** going through column by column and putting all identified columns from Step 2 into the 3 semantic buckets: + 1. 
[**Entities**](/docs/build/entities) — these are the spine of your semantic concepts or objects, you can think of them as roughly correlating to IDs or keys that form the grain. + 2. [**Dimensions**](/docs/build/dimensions) — these are ways of grouping and bucketing these objects or concepts, such as time and categories. + 3. [**Measures**](/docs/build/measures) — these are numeric values that you want to aggregate such as an order total or number of times a user clicked an ad. +2. **Create metrics for the aggregation columns** we didn’t encode into the semantic model. +3. Now, **identify a rollup you want to melt**. Refer to the [earlier example](#1-identify-target) to help distinguish these types of models. +4. **Repeat these steps for any** **other concepts** that you need to create that rollup e.g. `active_accounts_per_week` may need **both `customers` and `orders`.** +5. **Create metrics for the aggregation columns present in the rollup**. If your rollup references multiple models, put metrics in the YAML file that is most closely related to the grain or key aggregation of the table. For example, `active_accounts_per_week` is aggregated at a weekly time grain, but the key metric counts customer accounts, so we’d want to put that metric in the `customers.yml` or `sem_customers.yml` file (depending on [the naming system](/best-practices/how-we-build-our-metrics/semantic-layer-7-semantic-structure) you prefer). If it also contained a metric aggregating total orders in a given week, we’d put that metric into `orders.yml` or `sem_orders.yml`. +6. **Create [saved queries with exports](/docs/build/saved-queries)** configured to materialize your new Semantic Layer-based artifacts into the warehouse in parallel with the frozen rollup. This will allow us to shift consumption tools and audit results. +### 4. Connect external tools in parallel -#### more information +1. 
Now, **shift your external analysis tool to point at the Semantic Layer exports instead of the rollup**. Remember, we only want to shift the pointers for the rollup that we’ve migrated, everything else should stay pointing to frozen rollups. We’re migrating iteratively in pieces! + 1. If your downstream tools have an integration with the Semantic Layer, you’ll want to set that up as well. This will allow not only [declarative caching](/docs/use-dbt-semantic-layer/sl-cache#declarative-caching) of common query patterns with exports but also easy, totally dynamic on-the-fly queries. +2. Once you’ve replicated the previous state of things, with the Semantic Layer providing the data instead of frozen rollups, now you’re ready to **shift the transformations happening in your BI tool into the Semantic Layer**, following the same process. +3. Finally, to **feel the new speed and power you’ve unlocked**, ask a stakeholder for a dimension or metric that’s on their wishlist for the data product you’re working with. Then, bask in the glory of amazing them when you ship it an hour later! -- [Guidance for designing distributed tables using dedicated SQL pool in Azure Synapse Analytics](https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute) -- [source code for `synapse__create_table_as()` macro](https://github.com/microsoft/dbt-synapse/blob/master/dbt/include/synapse/macros/materializations/models/table/create_table_as.sql) - - -## Deployments & Ecosystem - -With the infrastructure in place and the analytics engineers enabled with best practices, the final piece is to think through how a dbt project sits in the larger data stack of your organization both upstream and downstream. - -### Upstream - -In dbt, we assume the data has already been ingested into the warehouse raw. This follows a broader paradigm known as Extract-Load-Transform (ELT). The same goes for dbt with Azure Synapse. 
The goal should be for the data ingested into Synapse to be as “untouched” as possible from when it came from the upstream source system. It’s common for data teams using Azure Data Factory to continue to employ an ETL paradigm where data is transformed before it even lands in the warehouse. We do not recommend this, as it results in critical data transformation living outside of the dbt project, and therefore undocumented. - -If you have not already, engage the central/upstream data engineering team to devise a plan to integrate data extraction and movement in tools such as SSIS and Azure Data Factory with the transformation performed via dbt Cloud. - -### Downstream Consumers (Power BI) - -It is extremely common in the Microsoft data ecosystem to have significant amounts of data modeling live within Power BI reports and/or datasets. This is OK up to a certain point. - -The correct approach is not to mandate that all data modeling should be done in dbt with `SQL`. Instead, seek out the most business-critical Power BI datasets and reports. Any modeling done in those reports should be upstreamed into the dbt project where it can be properly tested and documented. - -There should be a continuous effort to take any Power Query code written in Power BI as transformation code and upstream it into the data warehouse where the modeling can be tested, documented, reused by others and deployed with confidence. +:::tip +💁🏻‍♀️ If your BI tool allows it, make sure to do the BI-related steps above **in a development environment**. If it doesn’t have these capabilities, stick with duplicating the data product you’re re-building and perform this there so you can swap it later after you’ve tested it thoroughly. +::: -## Conclusion +## Deep impact -There’s great opportunity in dbt Cloud today for data teams using Azure Synapse. While Fabric is the future, there are meaningful considerations when it comes to resource provisioning, model design, and deployments within the larger ecosystem.
+The first time you turn around a newly sliced, diced, filtered, and rolled-up metric table for a stakeholder in under an hour instead of a week, both you and the stakeholder will immediately feel the value and power of the Semantic Layer. -As we look ahead, we're excited about the possibilities that Microsoft Fabric holds for the future of data analytics. With dbt Cloud and Azure Synapse, analytics engineers can disseminate knowledge with confidence to the rest of their organization. +dbt Labs’ mission is to create and disseminate organizational knowledge. This process, and building a Semantic Layer generally, is about encoding organizational knowledge in such a way that it creates and disseminates _leverage_. Enabled by this process, you can start building your Semantic Layer _today_, without waiting for the magical capacity for a giant overhaul to materialize. Building iterative velocity as you progress, your team can finally make any BI tool deliver the way you need it to. From 94de24d1cb64fdd803d1a357247ae531a78bce00 Mon Sep 17 00:00:00 2001 From: Ly Nguyen Date: Thu, 15 Aug 2024 12:29:27 -0700 Subject: [PATCH 7/9] Revert "Check a reference page" This reverts commit 0f83ce0b8ac6b21f75d0ae8ac44d328daa84e82f. --- .../resource-configs/fabric-configs.md | 905 +----------------- 1 file changed, 49 insertions(+), 856 deletions(-) diff --git a/website/docs/reference/resource-configs/fabric-configs.md b/website/docs/reference/resource-configs/fabric-configs.md index e4ab525c7f5..8ab0a63a644 100644 --- a/website/docs/reference/resource-configs/fabric-configs.md +++ b/website/docs/reference/resource-configs/fabric-configs.md @@ -3,910 +3,103 @@ title: "Microsoft Fabric DWH configurations" id: "fabric-configs" --- - +## Materializations -## Use `project` and `dataset` in configurations +Ephemeral materialization is not supported because T-SQL does not support nested CTEs. It may work in some cases when you're working with very simple ephemeral models.
-- `schema` is interchangeable with the BigQuery concept of `dataset` -- `database` is interchangeable with the BigQuery concept of `project` +### Tables -For our reference documentation, you can declare `project` in place of `database`. -This will allow you to read and write from multiple BigQuery projects. Same for `dataset`. - -## Using table partitioning and clustering - -### Partition clause - -BigQuery supports the use of a [partition by](https://cloud.google.com/bigquery/docs/data-definition-language#specifying_table_partitioning_options) clause to easily partition a table by a column or expression. This option can help decrease latency and cost when querying large tables. Note that partition pruning [only works](https://cloud.google.com/bigquery/docs/querying-partitioned-tables#pruning_limiting_partitions) when partitions are filtered using literal values (so selecting partitions using a subquery won't improve performance). - -The `partition_by` config can be supplied as a dictionary with the following format: - -```python -{ - "field": "", - "data_type": "", - "granularity": "" - - # Only required if data_type is "int64" - "range": { - "start": , - "end": , - "interval": - } -} -``` - -#### Partitioning by a date or timestamp - -When using a `datetime` or `timestamp` column to partition data, you can create partitions with a granularity of hour, day, month, or year. A `date` column supports granularity of day, month and year. Daily partitioning is the default for all column types. - -If the `data_type` is specified as a `date` and the granularity is day, dbt will supply the field as-is -when configuring table partitioning. +Tables are the default materialization.
- - - - -```sql -{{ config( - materialized='table', - partition_by={ - "field": "created_at", - "data_type": "timestamp", - "granularity": "day" - } -)}} - -select - user_id, - event_name, - created_at - -from {{ ref('events') }} -``` - - - - - - - - -```sql -create table `projectname`.`analytics`.`bigquery_table` -partition by timestamp_trunc(created_at, day) -as ( - - select - user_id, - event_name, - created_at - - from `analytics`.`events` - -) -``` - - - - - - -#### Partitioning by an "ingestion" date or timestamp - -BigQuery supports an [older mechanism of partitioning](https://cloud.google.com/bigquery/docs/partitioned-tables#ingestion_time) based on the time when each row was ingested. While we recommend using the newer and more ergonomic approach to partitioning whenever possible, for very large datasets, there can be some performance improvements to using this older, more mechanistic approach. [Read more about the `insert_overwrite` incremental strategy below](#copying-ingestion-time-partitions). - -dbt will always instruct BigQuery to partition your table by the values of the column specified in `partition_by.field`. By configuring your model with `partition_by.time_ingestion_partitioning` set to `True`, dbt will use that column as the input to a `_PARTITIONTIME` pseudocolumn. Unlike with newer column-based partitioning, you must ensure that the values of your partitioning column match exactly the time-based granularity of your partitions. 
- - - - - - -```sql -{{ config( - materialized="incremental", - partition_by={ - "field": "created_date", - "data_type": "timestamp", - "granularity": "day", - "time_ingestion_partitioning": true - } -) }} - -select - user_id, - event_name, - created_at, - -- values of this column must match the data type + granularity defined above - timestamp_trunc(created_at, day) as created_date - -from {{ ref('events') }} -``` - - - - - - - - -```sql -create table `projectname`.`analytics`.`bigquery_table` (`user_id` INT64, `event_name` STRING, `created_at` TIMESTAMP) -partition by timestamp_trunc(_PARTITIONTIME, day); - -insert into `projectname`.`analytics`.`bigquery_table` (_partitiontime, `user_id`, `event_name`, `created_at`) -select created_date as _partitiontime, * EXCEPT(created_date) from ( - select - user_id, - event_name, - created_at, - -- values of this column must match granularity defined above - timestamp_trunc(created_at, day) as created_date - - from `projectname`.`analytics`.`events` -); -``` - - - - - - -#### Partitioning with integer buckets - -If the `data_type` is specified as `int64`, then a `range` key must also -be provided in the `partition_by` dict. dbt will use the values provided in -the `range` dict to generate the partitioning clause for the table. 
- - - - - - -```sql -{{ config( - materialized='table', - partition_by={ - "field": "user_id", - "data_type": "int64", - "range": { - "start": 0, - "end": 100, - "interval": 10 - } - } -)}} - -select - user_id, - event_name, - created_at - -from {{ ref('events') }} -``` - - - - - - - - -```sql -create table analytics.bigquery_table -partition by range_bucket( - customer_id, - generate_array(0, 100, 10) -) -as ( - - select - user_id, - event_name, - created_at - - from analytics.events - -) -``` - - - - - - -#### Additional partition configs - -If your model has `partition_by` configured, you may optionally specify two additional configurations: - -- `require_partition_filter` (boolean): If set to `true`, anyone querying this model _must_ specify a partition filter, otherwise their query will fail. This is recommended for very large tables with obvious partitioning schemes, such as event streams grouped by day. Note that this will affect other dbt models or tests that try to select from this model, too. - -- `partition_expiration_days` (integer): If set for date- or timestamp-type partitions, the partition will expire that many days after the date it represents. E.g. A partition representing `2021-01-01`, set to expire after 7 days, will no longer be queryable as of `2021-01-08`, its storage costs zeroed out, and its contents will eventually be deleted. Note that [table expiration](#controlling-table-expiration) will take precedence if specified. - - - -```sql -{{ config( - materialized = 'table', - partition_by = { - "field": "created_at", - "data_type": "timestamp", - "granularity": "day" - }, - require_partition_filter = true, - partition_expiration_days = 7 -)}} - -``` - - - -### Clustering Clause - -BigQuery tables can be [clustered](https://cloud.google.com/bigquery/docs/clustered-tables) to colocate related data. - -Clustering on a single column: - - - -```sql -{{ - config( - materialized = "table", - cluster_by = "order_id", - ) -}} - -select * from ... 
-``` - - - -Clustering on a multiple columns: - - - -```sql -{{ - config( - materialized = "table", - cluster_by = ["customer_id", "order_id"], - ) -}} - -select * from ... -``` - - - -## Managing KMS Encryption - -[Customer managed encryption keys](https://cloud.google.com/bigquery/docs/customer-managed-encryption) can be configured for BigQuery tables using the `kms_key_name` model configuration. - -### Using KMS Encryption - -To specify the KMS key name for a model (or a group of models), use the `kms_key_name` model configuration. The following example sets the `kms_key_name` for all of the models in the `encrypted/` directory of your dbt project. - - - -```yaml - -name: my_project -version: 1.0.0 - -... - -models: - my_project: - encrypted: - +kms_key_name: 'projects/PROJECT_ID/locations/global/keyRings/test/cryptoKeys/quickstart' -``` - - - -## Labels and Tags - -### Specifying labels - -dbt supports the specification of BigQuery labels for the tables and views that it creates. These labels can be specified using the `labels` model config. - -The `labels` config can be provided in a model config, or in the `dbt_project.yml` file, as shown below. - - BigQuery key-value pair entries for labels larger than 63 characters are truncated. - -**Configuring labels in a model file** - - - -```sql -{{ - config( - materialized = "table", - labels = {'contains_pii': 'yes', 'contains_pie': 'no'} - ) -}} - -select * from {{ ref('another_model') }} -``` - - - -**Configuring labels in dbt_project.yml** - - - -```yaml - -models: - my_project: - snowplow: - +labels: - domain: clickstream - finance: - +labels: - domain: finance -``` - - - - +defaultValue="model" +values={[ +{label: 'Model config', value: 'model'}, +{label: 'Project config', value: 'project'} +]} +> - + -### Specifying tags -BigQuery table and view *tags* can be created by supplying an empty string for the label value. 
- - + ```sql {{ - config( - materialized = "table", - labels = {'contains_pii': ''} - ) + config( + materialized='table' + ) }} -select * from {{ ref('another_model') }} +select * +from ... ``` -### Policy tags -BigQuery enables [column-level security](https://cloud.google.com/bigquery/docs/column-level-security-intro) by setting [policy tags](https://cloud.google.com/bigquery/docs/best-practices-policy-tags) on specific columns. - -dbt enables this feature as a column resource property, `policy_tags` (_not_ a node config). - - - -```yaml -version: 2 - -models: -- name: policy_tag_table - columns: - - name: field - policy_tags: - - 'projects//locations//taxonomies//policyTags/' -``` - - - -Please note that in order for policy tags to take effect, [column-level `persist_docs`](/reference/resource-configs/persist_docs) must be enabled for the model, seed, or snapshot. Consider using [variables](/docs/build/project-variables) to manage taxonomies and make sure to add the required security [roles](https://cloud.google.com/bigquery/docs/column-level-security-intro#roles) to your BigQuery service account key. - -## Merge behavior (incremental models) - -The [`incremental_strategy` config](/docs/build/incremental-strategy) controls how dbt builds incremental models. dbt uses a [merge statement](https://cloud.google.com/bigquery/docs/reference/standard-sql/dml-syntax) on BigQuery to refresh incremental tables. - -The `incremental_strategy` config can be set to one of two values: - - `merge` (default) - - `insert_overwrite` - -### Performance and cost - -The operations performed by dbt while building a BigQuery incremental model can -be made cheaper and faster by using [clustering keys](#clustering-keys) in your -model configuration. See [this guide](https://discourse.getdbt.com/t/benchmarking-incremental-strategies-on-bigquery/981) for more information on performance tuning for BigQuery incremental models. 
- -**Note:** These performance and cost benefits are applicable to incremental models -built with either the `merge` or the `insert_overwrite` incremental strategy. - -### The `merge` strategy - The `merge` incremental strategy will generate a `merge` statement that looks - something like: - -```merge -merge into {{ destination_table }} DEST -using ({{ model_sql }}) SRC -on SRC.{{ unique_key }} = DEST.{{ unique_key }} - -when matched then update ... -when not matched then insert ... -``` - -The 'merge' approach automatically updates new data in the destination incremental table but requires scanning all source tables referenced in the model SQL, as well as destination tables. This can be slow and expensive for large data volumes. [Partitioning and clustering](#using-table-partitioning-and-clustering) techniques mentioned earlier can help mitigate these issues. - -**Note:** The `unique_key` configuration is required when the `merge` incremental -strategy is selected. - -### The `insert_overwrite` strategy - -The `insert_overwrite` strategy generates a merge statement that replaces entire partitions -in the destination table. **Note:** this configuration requires that the model is configured -with a [Partition clause](#partition-clause). The `merge` statement that dbt generates -when the `insert_overwrite` strategy is selected looks something like: - -```sql -/* - Create a temporary table from the model SQL -*/ -create temporary table {{ model_name }}__dbt_tmp as ( - {{ model_sql }} -); - -/* - If applicable, determine the partitions to overwrite by - querying the temp table. 
-*/ - -declare dbt_partitions_for_replacement array; -set (dbt_partitions_for_replacement) = ( - select as struct - array_agg(distinct date(max_tstamp)) - from `my_project`.`my_dataset`.{{ model_name }}__dbt_tmp -); - -/* - Overwrite partitions in the destination table which match - the partitions in the temporary table -*/ -merge into {{ destination_table }} DEST -using {{ model_name }}__dbt_tmp SRC -on FALSE - -when not matched by source and {{ partition_column }} in unnest(dbt_partitions_for_replacement) -then delete - -when not matched then insert ... -``` - -For a complete writeup on the mechanics of this approach, see -[this explainer post](https://discourse.getdbt.com/t/bigquery-dbt-incremental-changes/982). - -#### Determining partitions to overwrite - -dbt is able to determine the partitions to overwrite dynamically from the values -present in the temporary table, or statically using a user-supplied configuration. - -The "dynamic" approach is simplest (and the default), but the "static" approach -will reduce costs by eliminating multiple queries in the model build script. - -#### Static partitions - -To supply a static list of partitions to overwrite, use the `partitions` configuration. - - - -```sql -{% set partitions_to_replace = [ - 'timestamp(current_date)', - 'timestamp(date_sub(current_date, interval 1 day))' -] %} - -{{ - config( - materialized = 'incremental', - incremental_strategy = 'insert_overwrite', - partition_by = {'field': 'session_start', 'data_type': 'timestamp'}, - partitions = partitions_to_replace - ) -}} - -with events as ( - - select * from {{ref('events')}} - - {% if is_incremental() %} - -- recalculate yesterday + today - where timestamp_trunc(event_timestamp, day) in ({{ partitions_to_replace | join(',') }}) - {% endif %} - -), - -... rest of model ... -``` - - - -This example model serves to replace the data in the destination table for both -_today_ and _yesterday_ every day that it is run. 
It is the fastest and cheapest -way to incrementally update a table using dbt. If we wanted this to run more dynamically— -let’s say, always for the past 3 days—we could leverage dbt’s baked-in [datetime macros](https://github.com/dbt-labs/dbt-core/blob/dev/octavius-catto/core/dbt/include/global_project/macros/etc/datetime.sql) and write a few of our own. - -Think of this as "full control" mode. You must ensure that expressions or literal values in the the `partitions` config have proper quoting when templated, and that they match the `partition_by.data_type` (`timestamp`, `datetime`, `date`, or `int64`). Otherwise, the filter in the incremental `merge` statement will raise an error. - -#### Dynamic partitions - -If no `partitions` configuration is provided, dbt will instead: - -1. Create a temporary table for your model SQL -2. Query the temporary table to find the distinct partitions to be overwritten -3. Query the destination table to find the _max_ partition in the database - -When building your model SQL, you can take advantage of the introspection performed -by dbt to filter for only _new_ data. The max partition in the destination table -will be available using the `_dbt_max_partition` BigQuery scripting variable. **Note:** -this is a BigQuery SQL variable, not a dbt Jinja variable, so no jinja brackets are -required to access this variable. - -**Example model SQL:** - -```sql -{{ - config( - materialized = 'incremental', - partition_by = {'field': 'session_start', 'data_type': 'timestamp'}, - incremental_strategy = 'insert_overwrite' - ) -}} - -with events as ( - - select * from {{ref('events')}} - - {% if is_incremental() %} - - -- recalculate latest day's data + previous - -- NOTE: The _dbt_max_partition variable is used to introspect the destination table - where date(event_timestamp) >= date_sub(date(_dbt_max_partition), interval 1 day) - -{% endif %} - -), - -... rest of model ... 
-``` - -#### Copying partitions - -If you are replacing entire partitions in your incremental runs, you can opt to do so with the [copy table API](https://cloud.google.com/bigquery/docs/managing-tables#copy-table) and partition decorators rather than a `merge` statement. While this mechanism doesn't offer the same visibility and ease of debugging as the SQL `merge` statement, it can yield significant savings in time and cost for large datasets because the copy table API does not incur any costs for inserting the data - it's equivalent to the `bq cp` gcloud command line interface (CLI) command. - -You can enable this by switching on `copy_partitions: True` in the `partition_by` configuration. This approach works only in combination with "dynamic" partition replacement. - - - -```sql -{{ config( - materialized="incremental", - incremental_strategy="insert_overwrite", - partition_by={ - "field": "created_date", - "data_type": "timestamp", - "granularity": "day", - "time_ingestion_partitioning": true, - "copy_partitions": true - } -) }} - -select - user_id, - event_name, - created_at, - -- values of this column must match the data type + granularity defined above - timestamp_trunc(created_at, day) as created_date - -from {{ ref('events') }} -``` - - - - - -``` -... -[0m16:03:13.017641 [debug] [Thread-3 (]: BigQuery adapter: Copying table(s) "/projects/projectname/datasets/analytics/tables/bigquery_table__dbt_tmp$20230112" to "/projects/projectname/datasets/analytics/tables/bigquery_table$20230112" with disposition: "WRITE_TRUNCATE" -... -``` - - - -## Controlling table expiration - -By default, dbt-created tables never expire. You can configure certain model(s) -to expire after a set number of hours by setting `hours_to_expiration`. - -:::info Note -The `hours_to_expiration` only applies to initial creation of the underlying table. It doesn't reset for incremental models when they do another run. 
-::: - - - -```yml -models: - [](/reference/resource-configs/resource-path): - +hours_to_expiration: 6 - -``` - - - - - -```sql - -{{ config( - hours_to_expiration = 6 -) }} - -select ... - -``` - - - -## Authorized Views - -If the `grant_access_to` config is specified for a model materialized as a -view, dbt will grant the view model access to select from the list of datasets -provided. See [BQ docs on authorized views](https://cloud.google.com/bigquery/docs/share-access-views) -for more details. - - - - - -```yml -models: - [](/reference/resource-configs/resource-path): - +grant_access_to: - - project: project_1 - dataset: dataset_1 - - project: project_2 - dataset: dataset_2 -``` - - - - - -```sql - -{{ config( - grant_access_to=[ - {'project': 'project_1', 'dataset': 'dataset_1'}, - {'project': 'project_2', 'dataset': 'dataset_2'} - ] -) }} -``` - - - -Views with this configuration will be able to select from objects in `project_1.dataset_1` and `project_2.dataset_2`, even when they are located elsewhere and queried by users who do not otherwise have access to `project_1.dataset_1` and `project_2.dataset_2`. 
- - - -## Materialized views - -The BigQuery adapter supports [materialized views](https://cloud.google.com/bigquery/docs/materialized-views-intro) -with the following configuration parameters: - -| Parameter | Type | Required | Default | Change Monitoring Support | -|----------------------------------------------------------------------------------|------------------------|----------|---------|---------------------------| -| [`on_configuration_change`](/reference/resource-configs/on_configuration_change) | `` | no | `apply` | n/a | -| [`cluster_by`](#clustering-clause) | `[]` | no | `none` | drop/create | -| [`partition_by`](#partition-clause) | `{}` | no | `none` | drop/create | -| [`enable_refresh`](#auto-refresh) | `` | no | `true` | alter | -| [`refresh_interval_minutes`](#auto-refresh) | `` | no | `30` | alter | -| [`max_staleness`](#auto-refresh) (in Preview) | `` | no | `none` | alter | -| [`description`](/reference/resource-properties/description) | `` | no | `none` | alter | -| [`labels`](#specifying-labels) | `{: }` | no | `none` | alter | -| [`hours_to_expiration`](#controlling-table-expiration) | `` | no | `none` | alter | -| [`kms_key_name`](#using-kms-encryption) | `` | no | `none` | alter | - - - + - + - + ```yaml models: - [](/reference/resource-configs/resource-path): - [+](/reference/resource-configs/plus-prefix)[materialized](/reference/resource-configs/materialized): materialized_view - [+](/reference/resource-configs/plus-prefix)[on_configuration_change](/reference/resource-configs/on_configuration_change): apply | continue | fail - [+](/reference/resource-configs/plus-prefix)[cluster_by](#clustering-clause): | [] - [+](/reference/resource-configs/plus-prefix)[partition_by](#partition-clause): - - field: - - data_type: timestamp | date | datetime | int64 - # only if `data_type` is not 'int64' - - granularity: hour | day | month | year - # only if `data_type` is 'int64' - - range: - - start: - - end: - - interval: - 
[+](/reference/resource-configs/plus-prefix)[enable_refresh](#auto-refresh): true | false - [+](/reference/resource-configs/plus-prefix)[refresh_interval_minutes](#auto-refresh): - [+](/reference/resource-configs/plus-prefix)[max_staleness](#auto-refresh): - [+](/reference/resource-configs/plus-prefix)[description](/reference/resource-properties/description): - [+](/reference/resource-configs/plus-prefix)[labels](#specifying-labels): {: } - [+](/reference/resource-configs/plus-prefix)[hours_to_expiration](#acontrolling-table-expiration): - [+](/reference/resource-configs/plus-prefix)[kms_key_name](##using-kms-encryption): + your_project_name: + materialized: view + staging: + materialized: table ``` + - - - - -```yaml -version: 2 - -models: - - name: [] - config: - [materialized](/reference/resource-configs/materialized): materialized_view - [on_configuration_change](/reference/resource-configs/on_configuration_change): apply | continue | fail - [cluster_by](#clustering-clause): | [] - [partition_by](#partition-clause): - - field: - - data_type: timestamp | date | datetime | int64 - # only if `data_type` is not 'int64' - - granularity: hour | day | month | year - # only if `data_type` is 'int64' - - range: - - start: - - end: - - interval: - [enable_refresh](#auto-refresh): true | false - [refresh_interval_minutes](#auto-refresh): - [max_staleness](#auto-refresh): - [description](/reference/resource-properties/description): - [labels](#specifying-labels): {: } - [hours_to_expiration](#acontrolling-table-expiration): - [kms_key_name](##using-kms-encryption): -``` +## Seeds - +By default, `dbt-fabric` will attempt to insert seed files in batches of 400 rows. +If this exceeds the 2100-parameter limit of Microsoft Fabric Synapse Data Warehouse, the adapter will automatically reduce the batch size to the highest safe value. - +To set a different default batch size for seeds, set the variable `max_batch_size` in your project configuration.
+ - - - - -```jinja -{{ config( - [materialized](/reference/resource-configs/materialized)='materialized_view', - [on_configuration_change](/reference/resource-configs/on_configuration_change)="apply" | "continue" | "fail", - [cluster_by](#clustering-clause)="" | [""], - [partition_by](#partition-clause)={ - "field": "", - "data_type": "timestamp" | "date" | "datetime" | "int64", - - # only if `data_type` is not 'int64' - "granularity": "hour" | "day" | "month" | "year, - - # only if `data_type` is 'int64' - "range": { - "start": , - "end": , - "interval": , - } - }, - - # auto-refresh options - [enable_refresh](#auto-refresh)= true | false, - [refresh_interval_minutes](#auto-refresh)=, - [max_staleness](#auto-refresh)="", - - # additional options - [description](/reference/resource-properties/description)="", - [labels](#specifying-labels)={ - "": "", - }, - [hours_to_expiration](#acontrolling-table-expiration)=, - [kms_key_name](##using-kms-encryption)="", -) }} +```yaml +vars: + max_batch_size: 200 # Any integer less than or equal to 2100 will do. ``` - - - - -Many of these parameters correspond to their table counterparts and have been linked above. -The set of parameters unique to materialized views covers [auto-refresh functionality](#auto-refresh). 
-Learn more about these parameters in BigQuery's docs: -- [CREATE MATERIALIZED VIEW statement](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#create_materialized_view_statement) -- [materialized_view_option_list](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#materialized_view_option_list) - -### Auto-refresh +## Snapshots -| Parameter | Type | Required | Default | Change Monitoring Support | -|------------------------------|--------------|----------|---------|---------------------------| -| `enable_refresh` | `` | no | `true` | alter | -| `refresh_interval_minutes` | `` | no | `30` | alter | -| `max_staleness` (in Preview) | `` | no | `none` | alter | +Columns in source tables cannot have any constraints. +If, for example, any column has a `NOT NULL` constraint, an error will be thrown. -BigQuery supports [automatic refresh](https://cloud.google.com/bigquery/docs/materialized-views-manage#automatic_refresh) configuration for materialized views. -By default, a materialized view will automatically refresh within 5 minutes of changes in the base table, but not more frequently than once every 30 minutes. -BigQuery only officially supports the configuration of the frequency (the "once every 30 minutes" frequency); -however, there is a feature in preview that allows for the configuration of the staleness (the "5 minutes" refresh). -dbt will monitor these parameters for changes and apply them using an `ALTER` statement. +## Indexes -Learn more about these parameters in BigQuery's docs: -- [materialized_view_option_list](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#materialized_view_option_list) -- [max_staleness](https://cloud.google.com/bigquery/docs/materialized-views-create#max_staleness) +Indexes are not supported by Microsoft Fabric Synapse Data Warehouse. Any indexes provided as a configuration are ignored by the adapter.
-### Limitations +## Grants with auto provisioning -As with most data platforms, there are limitations associated with materialized views. Some worth noting include: +Grants with auto provisioning are not supported by Microsoft Fabric Synapse Data Warehouse at this time. -- Materialized view SQL has a [limited feature set](https://cloud.google.com/bigquery/docs/materialized-views-create#supported-mvs). -- Materialized view SQL cannot be updated; the materialized view must go through a `--full-refresh` (DROP/CREATE). -- The `partition_by` clause on a materialized view must match that of the underlying base table. -- While materialized views can have descriptions, materialized view *columns* cannot. -- Recreating/dropping the base table requires recreating/dropping the materialized view. +## Incremental -Find more information about materialized view limitations in Google's BigQuery [docs](https://cloud.google.com/bigquery/docs/materialized-views-intro#limitations). +Fabric supports both the `delete+insert` and `append` strategies. - +If a unique key is not provided, the adapter defaults to the `append` strategy. - +## Permissions -## Python models +The Microsoft Entra identity (user or service principal) must be a Fabric Workspace admin to work on the database level at this time. Fine-grained access control will be incorporated in the future. -The BigQuery adapter supports Python models with the following additional configuration parameters: +## Cross-database macros -| Parameter | Type | Required | Default | Valid values | -|-------------------------|-------------|----------|-----------|------------------| -| `enable_list_inference` | `` | no | `True` | `True`, `False` | -| `intermediate_format` | `` | no | `parquet` | `parquet`, `orc` | +Not supported at this time. -### The `enable_list_inference` parameter -The `enable_list_inference` parameter enables a PySpark data frame to read multiple records in the same operation. 
-By default, this is set to `True` to support the default `intermediate_format` of `parquet`. +## dbt-utils -### The `intermediate_format` parameter -The `intermediate_format` parameter specifies which file format to use when writing records to a table. The default is `parquet`. +Not supported at this time. However, dbt-fabric offers some utils macros. Please check out [utils macros](https://github.com/microsoft/dbt-fabric/tree/main/dbt/include/fabric/macros/utils). - From fad90675d3dc308c636dccd64913045ee9bb6537 Mon Sep 17 00:00:00 2001 From: Ly Nguyen Date: Thu, 15 Aug 2024 12:33:09 -0700 Subject: [PATCH 8/9] Revert "Check a guide" This reverts commit d5776b7b80c8c72accb171887209e55d31c44e51. --- website/docs/guides/bigquery-qs.md | 321 +++++++---------------------- 1 file changed, 72 insertions(+), 249 deletions(-) diff --git a/website/docs/guides/bigquery-qs.md b/website/docs/guides/bigquery-qs.md index 5401d57f2b6..1ba5f7b0021 100644 --- a/website/docs/guides/bigquery-qs.md +++ b/website/docs/guides/bigquery-qs.md @@ -13,222 +13,101 @@ recently_updated: true ## Introduction -In this quickstart guide, you'll learn how to use dbt Cloud with Snowflake. It will show you how to: +In this quickstart guide, you'll learn how to use dbt Cloud with BigQuery. It will show you how to: -- Create a new Snowflake worksheet. -- Load sample data into your Snowflake account. -- Connect dbt Cloud to Snowflake. +- Create a Google Cloud Platform (GCP) project. +- Access sample data in a public dataset. +- Connect dbt Cloud to BigQuery. - Take a sample query and turn it into a model in your dbt project. A model in dbt is a select statement. -- Add sources to your dbt project. Sources allow you to name and describe the raw data already loaded into Snowflake. - Add tests to your models. - Document your models. - Schedule a job to run. -Snowflake also provides a quickstart for you to learn how to use dbt Cloud. 
It makes use of a different public dataset (Knoema Economy Data Atlas) than what's shown in this guide. For more information, refer to [Accelerating Data Teams with dbt Cloud & Snowflake](https://quickstarts.snowflake.com/guide/accelerating_data_teams_with_snowflake_and_dbt_cloud_hands_on_lab/) in the Snowflake docs. - :::tip Videos for you You can check out [dbt Fundamentals](https://learn.getdbt.com/courses/dbt-fundamentals) for free if you're interested in course learning with videos. - -You can also watch the [YouTube video on dbt and Snowflake](https://www.youtube.com/watch?v=kbCkwhySV_I&list=PL0QYlrC86xQm7CoOH6RS7hcgLnd3OQioG). ::: - + ### Prerequisites​ -- You have a [dbt Cloud account](https://www.getdbt.com/signup/). -- You have a [trial Snowflake account](https://signup.snowflake.com/). During trial account creation, make sure to choose the **Enterprise** Snowflake edition so you have `ACCOUNTADMIN` access. For a full implementation, you should consider organizational questions when choosing a cloud provider. For more information, see [Introduction to Cloud Platforms](https://docs.snowflake.com/en/user-guide/intro-cloud-platforms.html) in the Snowflake docs. For the purposes of this setup, all cloud providers and regions will work so choose whichever you’d like. +- You have a [dbt Cloud account](https://www.getdbt.com/signup/). +- You have a [Google account](https://support.google.com/accounts/answer/27441?hl=en). +- You can use a personal or work account to set up BigQuery through [Google Cloud Platform (GCP)](https://cloud.google.com/free). ### Related content - Learn more with [dbt Learn courses](https://learn.getdbt.com) -- [How we configure Snowflake](https://blog.getdbt.com/how-we-configure-snowflake/) - [CI jobs](/docs/deploy/continuous-integration) - [Deploy jobs](/docs/deploy/deploy-jobs) - [Job notifications](/docs/deploy/job-notifications) - [Source freshness](/docs/deploy/source-freshness) -## Create a new Snowflake worksheet -1. 
Log in to your trial Snowflake account. -2. In the Snowflake UI, click **+ Worksheet** in the upper right corner to create a new worksheet. - -## Load data -The data used here is stored as CSV files in a public S3 bucket and the following steps will guide you through how to prepare your Snowflake account for that data and upload it. - -1. Create a new virtual warehouse, two new databases (one for raw data, the other for future dbt development), and two new schemas (one for `jaffle_shop` data, the other for `stripe` data). - - To do this, run these SQL commands by typing them into the Editor of your new Snowflake worksheet and clicking **Run** in the upper right corner of the UI: - ```sql - create warehouse transforming; - create database raw; - create database analytics; - create schema raw.jaffle_shop; - create schema raw.stripe; - ``` - -2. In the `raw` database and `jaffle_shop` and `stripe` schemas, create three tables and load relevant data into them: - - - First, delete all contents (empty) in the Editor of the Snowflake worksheet. Then, run this SQL command to create the `customer` table: - - ```sql - create table raw.jaffle_shop.customers - ( id integer, - first_name varchar, - last_name varchar - ); - ``` +## Create a new GCP project​ - - Delete all contents in the Editor, then run this command to load data into the `customer` table: +1. Go to the [BigQuery Console](https://console.cloud.google.com/bigquery) after you log in to your Google account. If you have multiple Google accounts, make sure you’re using the correct one. +2. Create a new project from the [Manage resources page](https://console.cloud.google.com/projectcreate?previousPage=%2Fcloud-resource-manager%3Fwalkthrough_id%3Dresource-manager--create-project%26project%3D%26folder%3D%26organizationId%3D%23step_index%3D1&walkthrough_id=resource-manager--create-project). 
For more information, refer to [Creating a project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#creating_a_project) in the Google Cloud docs. GCP automatically populates the Project name field for you. You can change it to be more descriptive for your use. For example, `dbt Learn - BigQuery Setup`. - ```sql - copy into raw.jaffle_shop.customers (id, first_name, last_name) - from 's3://dbt-tutorial-public/jaffle_shop_customers.csv' - file_format = ( - type = 'CSV' - field_delimiter = ',' - skip_header = 1 - ); - ``` - - Delete all contents in the Editor (empty), then run this command to create the `orders` table: - ```sql - create table raw.jaffle_shop.orders - ( id integer, - user_id integer, - order_date date, - status varchar, - _etl_loaded_at timestamp default current_timestamp - ); - ``` +## Create BigQuery datasets - - Delete all contents in the Editor, then run this command to load data into the `orders` table: - ```sql - copy into raw.jaffle_shop.orders (id, user_id, order_date, status) - from 's3://dbt-tutorial-public/jaffle_shop_orders.csv' - file_format = ( - type = 'CSV' - field_delimiter = ',' - skip_header = 1 - ); - ``` - - Delete all contents in the Editor (empty), then run this command to create the `payment` table: - ```sql - create table raw.stripe.payment - ( id integer, - orderid integer, - paymentmethod varchar, - status varchar, - amount integer, - created date, - _batched_at timestamp default current_timestamp - ); - ``` - - Delete all contents in the Editor, then run this command to load data into the `payment` table: - ```sql - copy into raw.stripe.payment (id, orderid, paymentmethod, status, amount, created) - from 's3://dbt-tutorial-public/stripe_payments.csv' - file_format = ( - type = 'CSV' - field_delimiter = ',' - skip_header = 1 - ); - ``` -3. Verify that the data is loaded by running these SQL queries. Confirm that you can see output for each one. +1. 
From the [BigQuery Console](https://console.cloud.google.com/bigquery), click **Editor**. Make sure to select your newly created project, which is available at the top of the page. +1. Verify that you can run SQL queries. Copy and paste these queries into the Query Editor: ```sql - select * from raw.jaffle_shop.customers; - select * from raw.jaffle_shop.orders; - select * from raw.stripe.payment; + select * from `dbt-tutorial.jaffle_shop.customers`; + select * from `dbt-tutorial.jaffle_shop.orders`; + select * from `dbt-tutorial.stripe.payment`; ``` -## Connect dbt Cloud to Snowflake - -There are two ways to connect dbt Cloud to Snowflake. The first option is Partner Connect, which provides a streamlined setup to create your dbt Cloud account from within your new Snowflake trial account. The second option is to create your dbt Cloud account separately and build the Snowflake connection yourself (connect manually). If you want to get started quickly, dbt Labs recommends using Partner Connect. If you want to customize your setup from the very beginning and gain familiarity with the dbt Cloud setup flow, dbt Labs recommends connecting manually. - - - - -Using Partner Connect allows you to create a complete dbt account with your [Snowflake connection](/docs/cloud/connect-data-platform/connect-snowflake), [a managed repository](/docs/collaborate/git/managed-repository), [environments](/docs/build/custom-schemas#managing-environments), and credentials. - -1. In the Snowflake UI, click on the home icon in the upper left corner. In the left sidebar, select **Data Products**. Then, select **Partner Connect**. Find the dbt tile by scrolling or by searching for dbt in the search bar. Click the tile to connect to dbt. - - - - If you’re using the classic version of the Snowflake UI, you can click the **Partner Connect** button in the top bar of your account. From there, click on the dbt tile to open up the connect box. - - - -2. 
In the **Connect to dbt** popup, find the **Optional Grant** option and select the **RAW** and **ANALYTICS** databases. This will grant access for your new dbt user role to each database. Then, click **Connect**. - - - - - -3. Click **Activate** when a popup appears: - - - - - -4. After the new tab loads, you will see a form. If you already created a dbt Cloud account, you will be asked to provide an account name. If you haven't created account, you will be asked to provide an account name and password. - - - -5. After you have filled out the form and clicked **Complete Registration**, you will be logged into dbt Cloud automatically. - -6. From your **Account Settings** in dbt Cloud (using the gear menu in the upper right corner), choose the "Partner Connect Trial" project and select **snowflake** in the overview table. Select edit and update the fields **Database** and **Warehouse** to be `analytics` and `transforming`, respectively. - - - - - - - - - -1. Create a new project in dbt Cloud. From **Account settings** (using the gear menu in the top right corner), click **+ New Project**. + Click **Run**, then check for results from the queries. For example: +
+ +
+2. Create new datasets from the [BigQuery Console](https://console.cloud.google.com/bigquery). For more information, refer to [Create datasets](https://cloud.google.com/bigquery/docs/datasets#create-dataset) in the Google Cloud docs. Datasets in BigQuery are equivalent to schemas in a traditional database. On the **Create dataset** page: + - **Dataset ID** — Enter a name that fits the purpose. This name is used like schema in fully qualified references to your database objects such as `database.schema.table`. As an example for this guide, create one for `jaffle_shop` and another one for `stripe` afterward. + - **Data location** — Leave it blank (the default). It determines the GCP location of where your data is stored. The current default location is the US multi-region. All tables within this dataset will share this location. + - **Enable table expiration** — Leave it unselected (the default). The default for the billing table expiration is 60 days. Because billing isn’t enabled for this project, GCP defaults to deprecating tables. + - **Google-managed encryption key** — This option is available under **Advanced options**. Allow Google to manage encryption (the default). +
+ +
+3. After you create the `jaffle_shop` dataset, create one for `stripe` with all the same values except for **Dataset ID**. + +## Generate BigQuery credentials {#generate-bigquery-credentials} +In order to let dbt connect to your warehouse, you'll need to generate a keyfile. This is analogous to using a database username and password with most other data warehouses. + +1. Start the [GCP credentials wizard](https://console.cloud.google.com/apis/credentials/wizard). Make sure your new project is selected in the header. If you do not see your account or project, click your profile picture to the right and verify you are using the correct email account. For **Credential Type**: + - From the **Select an API** dropdown, choose **BigQuery API** + - Select **Application data** for the type of data you will be accessing + - Click **Next** to create a new service account. +2. Create a service account for your new project from the [Service accounts page](https://console.cloud.google.com/projectselector2/iam-admin/serviceaccounts?supportedpurview=project). For more information, refer to [Create a service account](https://developers.google.com/workspace/guides/create-credentials#create_a_service_account) in the Google Cloud docs. As an example for this guide, you can: + - Type `dbt-user` as the **Service account name** + - From the **Select a role** dropdown, choose **BigQuery Job User** and **BigQuery Data Editor** roles and click **Continue** + - Leave the **Grant users access to this service account** fields blank + - Click **Done** +3. Create a service account key for your new project from the [Service accounts page](https://console.cloud.google.com/iam-admin/serviceaccounts?walkthrough_id=iam--create-service-account-keys&start_index=1#step_index=1). For more information, refer to [Create a service account key](https://cloud.google.com/iam/docs/creating-managing-service-account-keys#creating) in the Google Cloud docs. 
When downloading the JSON file, make sure to use a filename you can easily remember. For example, `dbt-user-creds.json`. For security reasons, dbt Labs recommends that you protect this JSON file like you would your identity credentials; for example, don't check the JSON file into your version control software. + +## Connect dbt Cloud to BigQuery​ +1. Create a new project in [dbt Cloud](https://cloud.getdbt.com/). From **Account settings** (using the gear menu in the top right corner), click **+ New Project**. 2. Enter a project name and click **Continue**. -3. For the warehouse, click **Snowflake** then **Next** to set up your connection. - - - -4. Enter your **Settings** for Snowflake with: - * **Account** — Find your account by using the Snowflake trial account URL and removing `snowflakecomputing.com`. The order of your account information will vary by Snowflake version. For example, Snowflake's Classic console URL might look like: `oq65696.west-us-2.azure.snowflakecomputing.com`. The AppUI or Snowsight URL might look more like: `snowflakecomputing.com/west-us-2.azure/oq65696`. In both examples, your account will be: `oq65696.west-us-2.azure`. For more information, see [Account Identifiers](https://docs.snowflake.com/en/user-guide/admin-account-identifier.html) in the Snowflake docs. - - - - * **Role** — Leave blank for now. You can update this to a default Snowflake role later. - * **Database** — `analytics`. This tells dbt to create new models in the analytics database. - * **Warehouse** — `transforming`. This tells dbt to use the transforming warehouse that was created earlier. - - +3. For the warehouse, click **BigQuery** then **Next** to set up your connection. +4. Click **Upload a Service Account JSON File** in settings. +5. Select the JSON file you downloaded in [Generate BigQuery credentials](#generate-bigquery-credentials) and dbt Cloud will fill in all the necessary fields. +6. Click **Test Connection**. 
This verifies that dbt Cloud can access your BigQuery account. +7. Click **Next** if the test succeeded. If it failed, you might need to go back and regenerate your BigQuery credentials. -5. Enter your **Development Credentials** for Snowflake with: - * **Username** — The username you created for Snowflake. The username is not your email address and is usually your first and last name together in one word. - * **Password** — The password you set when creating your Snowflake account. - * **Schema** — You’ll notice that the schema name has been auto created for you. By convention, this is `dbt_`. This is the schema connected directly to your development environment, and it's where your models will be built when running dbt within the Cloud IDE. - * **Target name** — Leave as the default. - * **Threads** — Leave as 4. This is the number of simultaneous connects that dbt Cloud will make to build models concurrently. - - - -6. Click **Test Connection**. This verifies that dbt Cloud can access your Snowflake account. -7. If the connection test succeeds, click **Next**. If it fails, you may need to check your Snowflake settings and credentials. - -
-
## Set up a dbt Cloud managed repository -If you used Partner Connect, you can skip to [initializing your dbt project](#initialize-your-dbt-project-and-start-developing) as the Partner Connect provides you with a managed repository. Otherwise, you will need to create your repository connection. - + ## Initialize your dbt project​ and start developing Now that you have a repository configured, you can initialize your project and start development in dbt Cloud: 1. Click **Start developing in the IDE**. It might take a few minutes for your project to spin up for the first time as it establishes your git connection, clones your repo, and tests the connection to the warehouse. -2. Above the file tree to the left, click **Initialize your project**. This builds out your folder structure with example models. -3. Make your initial commit by clicking **Commit and sync**. Use the commit message `initial commit`. This creates the first commit to your managed repo and allows you to open a branch where you can add new dbt code. +2. Above the file tree to the left, click **Initialize dbt project**. This builds out your folder structure with example models. +3. Make your initial commit by clicking **Commit and sync**. Use the commit message `initial commit` and click **Commit**. This creates the first commit to your managed repo and allows you to open a branch where you can add new dbt code. 4. You can now directly query data from your warehouse and execute `dbt run`. You can try this out now: - - Click **+ Create new file**, add this query to the new file, and click **Save as** to save the new file: + - Click **+ Create new file**, add this query to the new file, and click **Save as** to save the new file: ```sql - select * from raw.jaffle_shop.customers + select * from `dbt-tutorial.jaffle_shop.customers` ``` - In the command line bar at the bottom, enter `dbt run` and click **Enter**. You should see a `dbt run succeeded` message. 
@@ -245,6 +124,7 @@ Name the new branch `add-customers-model`. 2. Name the file `customers.sql`, then click **Create**. 3. Copy the following query into the file and click **Save**. + ```sql with customers as ( @@ -253,7 +133,7 @@ with customers as ( first_name, last_name - from raw.jaffle_shop.customers + from `dbt-tutorial`.jaffle_shop.customers ), @@ -265,7 +145,7 @@ orders as ( order_date, status - from raw.jaffle_shop.orders + from `dbt-tutorial`.jaffle_shop.orders ), @@ -307,6 +187,14 @@ select * from final Later, you can connect your business intelligence (BI) tools to these views and tables so they only read cleaned up data rather than raw data in your BI tool. +#### FAQs + + + + + + + ## Change the way your model is materialized @@ -330,7 +218,7 @@ Later, you can connect your business intelligence (BI) tools to these views and first_name, last_name - from raw.jaffle_shop.customers + from `dbt-tutorial`.jaffle_shop.customers ```
@@ -344,7 +232,7 @@ Later, you can connect your business intelligence (BI) tools to these views and order_date, status - from raw.jaffle_shop.orders + from `dbt-tutorial`.jaffle_shop.orders ```
@@ -407,79 +295,14 @@ Later, you can connect your business intelligence (BI) tools to these views and This time, when you performed a `dbt run`, separate views/tables were created for `stg_customers`, `stg_orders` and `customers`. dbt inferred the order to run these models. Because `customers` depends on `stg_customers` and `stg_orders`, dbt builds `customers` last. You do not need to explicitly define these dependencies. + #### FAQs {#faq-2} -## Build models on top of sources - -Sources make it possible to name and describe the data loaded into your warehouse by your extract and load tools. By declaring these tables as sources in dbt, you can: -- select from source tables in your models using the `{{ source() }}` function, helping define the lineage of your data -- test your assumptions about your source data -- calculate the freshness of your source data - -1. Create a new YML file `models/sources.yml`. -2. Declare the sources by copying the following into the file and clicking **Save**. - - - - ```yml - version: 2 - - sources: - - name: jaffle_shop - description: This is a replica of the Postgres database used by our app - database: raw - schema: jaffle_shop - tables: - - name: customers - description: One record per customer. - - name: orders - description: One record per order. Includes cancelled and deleted orders. - ``` - - - -3. Edit the `models/stg_customers.sql` file to select from the `customers` table in the `jaffle_shop` source. - - - - ```sql - select - id as customer_id, - first_name, - last_name - - from {{ source('jaffle_shop', 'customers') }} - ``` - - - -4. Edit the `models/stg_orders.sql` file to select from the `orders` table in the `jaffle_shop` source. - - - - ```sql - select - id as order_id, - user_id as customer_id, - order_date, - status - - from {{ source('jaffle_shop', 'orders') }} - ``` - - - -5. Execute `dbt run`. - - The results of your `dbt run` will be exactly the same as the previous step. 
Your `stg_customers` and `stg_orders` - models will still query from the same raw data source in Snowflake. By using `source`, you can - test and document your raw data and also understand the lineage of your sources. - - + From 32134e9c7dc8ba4798ad285c629180adfe65992a Mon Sep 17 00:00:00 2001 From: Ly Nguyen Date: Thu, 15 Aug 2024 12:46:29 -0700 Subject: [PATCH 9/9] Test linter --- website/docs/docs/trusted-adapters.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/docs/docs/trusted-adapters.md b/website/docs/docs/trusted-adapters.md index 80a24b96fe0..81637fd2bda 100644 --- a/website/docs/docs/trusted-adapters.md +++ b/website/docs/docs/trusted-adapters.md @@ -9,7 +9,7 @@ Trusted adapters take part in the Trusted Adapter Program, including a commitmen Free and open-source tools for the data professional are increasingly abundant. This is by-and-large a *good thing*, however it requires due diligence that wasn't required in a paid-license, closed-source software world. As a user, there are questions to answer important before taking a dependency on an open-source project. The trusted adapter designation is meant to streamline this process for end users. -### Trusted adapter specifications +### Trusted Adapter Specifications Refer to the [Build, test, document, and promote adapters](/guides/adapter-creation) guide for more information, particularly if you are an adapter maintainer considering having your adapter be added to the trusted list.