
moving more guides
runleonarun committed Nov 4, 2023
1 parent 1b75021 commit c8e1cb1
Showing 22 changed files with 628 additions and 602 deletions.
3 changes: 1 addition & 2 deletions website/docs/guides/airflow-and-dbt-cloud.md
@@ -2,10 +2,9 @@
title: Airflow and dbt Cloud
id: airflow-and-dbt-cloud
time_to_complete: '60 minutes'
platform: 'dbt-cloud'
icon: 'guides'
hide_table_of_contents: true
tags: ['airflow', 'dbt Cloud', 'orchestration']
tags: ['dbt Cloud', 'Orchestration']
level: 'Intermediate'
recently_updated: true
---
3 changes: 1 addition & 2 deletions website/docs/guides/building-packages.md
@@ -5,10 +5,9 @@ description: When you have dbt code that might help others, you can create a pac
displayText: Building dbt packages
hoverSnippet: Learn how to create packages for dbt.
time_to_complete: '60 minutes'
platform: 'dbt-core'
icon: 'guides'
hide_table_of_contents: true
tags: ['packages', 'dbt Core', 'legacy']
tags: ['dbt Core', 'legacy']
level: 'Advanced'
recently_updated: true
---
7 changes: 0 additions & 7 deletions website/docs/guides/creating-new-materializations.md
@@ -172,13 +172,6 @@ For more information on the `config` dbt Jinja function, see the [config](/refer

## Materialization precedence


:::info New in 0.15.1

The materialization resolution order was poorly defined in versions of dbt prior to 0.15.1. Please use this guide for versions of dbt greater than or equal to 0.15.1.

:::

dbt will pick the materialization macro in the following order (lower takes priority):

1. global project - default

Large diffs are not rendered by default.

@@ -1,17 +1,26 @@
---
title: How to optimize and troubleshoot dbt models on Databricks
sidebar_label: "How to optimize and troubleshoot dbt models on Databricks"
title: Optimize and troubleshoot dbt models on Databricks
sidebar_label: "Optimize and troubleshoot dbt models on Databricks"
description: "Learn more about optimizing and troubleshooting your dbt models on Databricks"
displayText: Optimizing and troubleshooting your dbt models on Databricks
hoverSnippet: Learn how to optimize and troubleshoot your dbt models on Databricks.
time_to_complete: '30 minutes'
icon: 'databricks'
hide_table_of_contents: true
tags: ['Databricks', 'dbt Core','dbt Cloud']
level: 'Intermediate'
recently_updated: true
---

## Introduction

Continuing our Databricks and dbt guide series from the previous [guide](/guides/dbt-ecosystem/databricks-guides/how-to-set-up-your-databricks-dbt-project), it’s time to talk about performance optimization. In this follow-up post, we outline simple strategies to optimize for cost, performance, and simplicity when architecting your data pipelines. We’ve encapsulated these strategies in a three-part framework:

- Platform Components
- Patterns & Best Practices
- Performance Troubleshooting

## 1. Platform Components
## Platform Components

As you start to develop your dbt projects, one of the first decisions you will make is which backend infrastructure to run your models against. Databricks offers SQL warehouses, All-Purpose Compute, and Jobs Compute, each optimized for the workloads it caters to. Our recommendation is to use Databricks SQL warehouses for all of your SQL workloads. Compared to the other compute options, SQL warehouses are optimized for SQL workloads, and they can scale both vertically to support larger workloads and horizontally to support concurrency. SQL warehouses are also easier to manage and provide out-of-the-box features such as query history to help audit and optimize your SQL workloads. Of the Serverless, Pro, and Classic SQL warehouse types that Databricks offers, our standard recommendation is to leverage serverless warehouses. You can explore the features of each warehouse type in the [Compare features section](https://www.databricks.com/product/pricing/databricks-sql) on the Databricks pricing page.

@@ -31,7 +40,7 @@ Another technique worth implementing is to provision separate SQL warehouses for

Because serverless warehouses spin up in a matter of seconds, setting your auto-stop configuration to a lower threshold will not impact SLAs or the end-user experience. In the SQL Workspace UI, the default value is 10 minutes, and you can lower it to 5 minutes. If you need a more custom setting, you can set the threshold to as low as 1 minute with the [API](https://docs.databricks.com/sql/api/sql-endpoints.html#).
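
As a rough illustration of that API call, here is a minimal sketch in Python. It is not code from this guide: it assumes the SQL Warehouses `edit` endpoint and `auto_stop_mins` field described in the linked API docs, plus placeholder values for the workspace host, warehouse ID, and a personal access token.

```python
import os

import requests

# Placeholders -- replace with your workspace host and SQL warehouse ID.
HOST = "https://<your-workspace>.cloud.databricks.com"
WAREHOUSE_ID = "<your-warehouse-id>"
TOKEN = os.environ["DATABRICKS_TOKEN"]  # personal access token

# Lower the auto-stop threshold to 1 minute on the warehouse.
resp = requests.post(
    f"{HOST}/api/2.0/sql/warehouses/{WAREHOUSE_ID}/edit",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"auto_stop_mins": 1},
)
resp.raise_for_status()
print(f"Auto-stop updated, HTTP {resp.status_code}")
```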

## 2. Patterns & Best Practices
## Patterns & Best Practices

Now that we have a solid sense of the infrastructure components, we can shift our focus to best practices and design patterns for pipeline development. We recommend the staging/intermediate/mart approach, which is analogous to the bronze/silver/gold medallion architecture recommended by Databricks. Let’s dissect each stage further.

@@ -121,7 +130,7 @@ incremental_predicates = [
}}
```

## 3. Performance Troubleshooting
## Performance Troubleshooting

Performance troubleshooting refers to the process of identifying and resolving issues that impact the performance of your dbt models and overall data pipelines. By improving the speed and performance of your Lakehouse platform, you will be able to process data faster, handle large and complex queries more effectively, and deliver faster time to market. Let’s go into detail on three effective strategies you can implement.

@@ -166,7 +175,7 @@ Now you might be wondering, how do you identify opportunities for performance im

With the [dbt Cloud Admin API](/docs/dbt-cloud-apis/admin-cloud-api), you can pull the dbt artifacts from your dbt Cloud run, put the generated `manifest.json` into an S3 bucket, stage it, and model the data using the [dbt artifacts package](https://hub.getdbt.com/brooklyn-data/dbt_artifacts/latest/). That package can help you identify inefficiencies in your dbt models and pinpoint opportunities for improvement.
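
As a minimal sketch of that flow (assuming the v2 run-artifacts endpoint, a hypothetical bucket name, and account, run, and token values supplied through environment variables), pulling the manifest and landing it in S3 might look like this:

```python
import os

import boto3
import requests

ACCOUNT_ID = os.environ["DBT_CLOUD_ACCOUNT_ID"]
RUN_ID = os.environ["DBT_CLOUD_RUN_ID"]
API_KEY = os.environ["DBT_CLOUD_API_KEY"]

# Download the manifest.json artifact produced by the dbt Cloud run.
resp = requests.get(
    f"https://cloud.getdbt.com/api/v2/accounts/{ACCOUNT_ID}/runs/{RUN_ID}/artifacts/manifest.json",
    headers={"Authorization": f"Token {API_KEY}"},
)
resp.raise_for_status()

# Stage the artifact in S3 so it can be loaded and modeled with the dbt artifacts package.
s3 = boto3.client("s3")
s3.put_object(
    Bucket="my-dbt-artifacts-bucket",  # hypothetical bucket name
    Key=f"dbt_artifacts/run_{RUN_ID}/manifest.json",
    Body=resp.content,
)
```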

## Conclusion
### Conclusion

This concludes the second guide in our series on “Working with Databricks and dbt”, following [How to set up your Databricks and dbt Project](/guides/dbt-ecosystem/databricks-guides/how-to-set-up-your-databricks-dbt-project).

2 changes: 1 addition & 1 deletion website/docs/guides/debugging-schema-names.md
@@ -8,7 +8,7 @@ time_to_complete: '45 minutes'
platform: 'dbt-core'
icon: 'guides'
hide_table_of_contents: true
tags: ['schema names', 'dbt Core', 'legacy']
tags: ['dbt Core', 'legacy']
level: 'Advanced'
recently_updated: true
---
@@ -1,5 +1,16 @@
# How to set up your Databricks and dbt project

---
title: How to set up your Databricks and dbt project
sidebar_label: "How to set up your Databricks and dbt project"
description: "Learn more about setting up your dbt project with Databricks"
displayText: Setting up your dbt project with Databricks
hoverSnippet: Learn how to set up your dbt project with Databricks.
time_to_complete: '30 minutes'
icon: 'databricks'
hide_table_of_contents: true
tags: ['Databricks', 'dbt Core','dbt Cloud']
level: 'Intermediate'
recently_updated: true
---

Databricks and dbt Labs are partnering to help data teams think like software engineering teams and ship trusted data, faster. The dbt-databricks adapter enables dbt users to leverage the latest Databricks features in their dbt project. Hundreds of customers are now using dbt and Databricks to build expressive and reliable data pipelines on the Lakehouse, generating data assets that enable analytics, ML, and AI use cases throughout the business.

@@ -80,9 +91,9 @@ For your development credentials/profiles.yml:

During your first invocation of `dbt run`, dbt will create the developer schema if it doesn't already exist in the dev catalog.

### Defining your dbt deployment environment
## Defining your dbt deployment environment

Last, we need to give dbt a way to deploy code outside of development environments. To do so, we’ll use dbt [environments](https://docs.getdbt.com/docs/collaborate/environments) to define the production targets that end users will interact with.
We need to give dbt a way to deploy code outside of development environments. To do so, we’ll use dbt [environments](https://docs.getdbt.com/docs/collaborate/environments) to define the production targets that end users will interact with.

Core projects can use [targets in profiles](https://docs.getdbt.com/docs/core/connection-profiles#understanding-targets-in-profiles) to separate environments. [dbt Cloud environments](https://docs.getdbt.com/docs/cloud/develop-in-the-cloud#set-up-and-access-the-cloud-ide) allow you to define environments via the UI and [schedule jobs](/guides/databricks#create-and-run-a-job) for specific environments.

@@ -94,10 +105,10 @@ Let’s set up our deployment environment:
4. Set the schema to the default for your prod environment. This can be overridden by [custom schemas](https://docs.getdbt.com/docs/build/custom-schemas#what-is-a-custom-schema) if you need to use more than one.
5. Provide your Service Principal token.

### Connect dbt to your git repository
## Connect dbt to your git repository

Next, you’ll need somewhere to store and version control your code that allows you to collaborate with teammates. Connect your dbt project to a git repository with [dbt Cloud](/guides/databricks#set-up-a-dbt-cloud-managed-repository). [Core](/guides/manual-install#create-a-repository) projects will use the git CLI.

## Next steps
### Next steps

Now that your project is configured, you can start transforming your Databricks data with dbt. To help you scale efficiently, we recommend you follow our best practices, starting with the ["Unity Catalog best practices" guide](dbt-unity-catalog-best-practices).
Now that your project is configured, you can start transforming your Databricks data with dbt. To help you scale efficiently, we recommend you follow our best practices, starting with [Unity Catalog best practices](/best-practices/dbt-unity-catalog-best-practices); then you can [Optimize dbt models on Databricks](/guides/how_to_optimize_dbt_models_on_databricks).
@@ -4,7 +4,7 @@ id: how-to-use-databricks-workflows-to-run-dbt-cloud-jobs
description: Learn how to use Databricks workflows to run dbt Cloud jobs
displayText: "Use Databricks workflows to run dbt Cloud jobs"
hoverSnippet: Learn how to use Databricks workflows to run dbt Cloud jobs
time_to_complete: '60 minutes'
icon: 'databricks'
hide_table_of_contents: true
tags: ['Databricks', 'dbt Core','dbt Cloud','Orchestration']
level: 'Intermediate'
recently_updated: true
---
## Introduction

Using Databricks workflows to call the dbt Cloud job API can be useful for several reasons:

@@ -13,7 +20,7 @@ Using Databricks workflows to call the dbt Cloud job API can be useful for sever
3. [**Separation of concerns —**](https://en.wikipedia.org/wiki/Separation_of_concerns) Detailed logs for dbt jobs in the dbt Cloud environment can lead to more modularity and efficient debugging, making it easier to isolate bugs quickly while still seeing the overall status in Databricks.
4. **Custom job triggering —** Use a Databricks workflow to trigger dbt Cloud jobs based on custom conditions or logic that aren't natively supported by dbt Cloud's scheduling feature. This can give you more flexibility in terms of when and how your dbt Cloud jobs run.

## Prerequisites
### Prerequisites

- An active [Team or Enterprise dbt Cloud account](https://www.getdbt.com/pricing/)
- An existing, configured [dbt Cloud deploy job](/docs/deploy/deploy-jobs)
@@ -29,7 +36,7 @@ To use Databricks workflows for running dbt Cloud jobs, you need to perform the
- [Create a Databricks Python notebook](#create-a-databricks-python-notebook)
- [Configure the workflows to run the dbt Cloud jobs](#configure-the-workflows-to-run-the-dbt-cloud-jobs)

### Set up a Databricks secret scope
## Set up a Databricks secret scope

1. Retrieve a **[User API Token](https://docs.getdbt.com/docs/dbt-cloud-apis/user-tokens#user-api-tokens)** or **[Service Account Token](https://docs.getdbt.com/docs/dbt-cloud-apis/service-tokens#generating-service-account-tokens)** from dbt Cloud
2. Set up a **Databricks secret scope**, which is used to securely store your dbt Cloud API key.
@@ -47,7 +54,7 @@ databricks secrets put --scope <YOUR_SECRET_SCOPE> --key <YOUR_SECRET_KEY> --s
5. Replace **`<YOUR_DBT_CLOUD_API_KEY>`** with the actual API key value that you copied from dbt Cloud in step 1.
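
As a quick sanity check (a sketch that assumes the scope and key names used above), you can confirm from a notebook that the stored secret is readable; Databricks redacts the secret value itself in notebook output:

```python
# Run inside a Databricks notebook, where `dbutils` is provided by the runtime.
api_key = dbutils.secrets.get(scope="<YOUR_SECRET_SCOPE>", key="<YOUR_SECRET_KEY>")
print(len(api_key) > 0)  # prints True if the key was stored and is readable
```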


### Create a Databricks Python notebook
## Create a Databricks Python notebook

1. [Create a **Databricks Python notebook**](https://docs.databricks.com/notebooks/notebooks-manage.html), which executes a Python script that calls the dbt Cloud job API.
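
The notebook script from this guide is collapsed in the diff below. As a rough sketch of the core idea (assuming placeholder account, job, and secret names, and the dbt Cloud v2 `jobs/{id}/run/` and `runs/{id}/` endpoints), the notebook reads the API key from the secret scope, triggers the deploy job, and polls the run until it reaches a terminal status:

```python
import time

import requests

# Placeholders -- replace with your own values.
ACCOUNT_ID = 12345   # dbt Cloud account ID
JOB_ID = 67890       # dbt Cloud deploy job ID
BASE_URL = "https://cloud.getdbt.com/api/v2"

# Read the API key stored in the Databricks secret scope (dbutils is provided by the notebook runtime).
api_key = dbutils.secrets.get(scope="<YOUR_SECRET_SCOPE>", key="<YOUR_SECRET_KEY>")
headers = {"Authorization": f"Token {api_key}"}

# Trigger the dbt Cloud job.
run = requests.post(
    f"{BASE_URL}/accounts/{ACCOUNT_ID}/jobs/{JOB_ID}/run/",
    headers=headers,
    json={"cause": "Triggered from a Databricks workflow"},
).json()["data"]

# Poll until dbt Cloud reports a terminal status (10 = success, 20 = error, 30 = cancelled).
status = None
while status not in (10, 20, 30):
    time.sleep(30)
    status = requests.get(
        f"{BASE_URL}/accounts/{ACCOUNT_ID}/runs/{run['id']}/",
        headers=headers,
    ).json()["data"]["status"]

if status != 10:
    raise RuntimeError(f"dbt Cloud run {run['id']} finished with status {status}")
```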

@@ -165,7 +172,7 @@ DbtJobRunStatus.SUCCESS
You can cancel the job from dbt Cloud if necessary.
:::

### Configure the workflows to run the dbt Cloud jobs
## Configure the workflows to run the dbt Cloud jobs

You can set up workflows directly from the notebook OR by adding this notebook to one of your existing workflows:

@@ -206,6 +213,4 @@ You can set up workflows directly from the notebook OR by adding this notebook t

Multiple Workflow tasks can be set up using the same notebook by configuring the `job_id` parameter to point to different dbt Cloud jobs.
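
For instance (a minimal sketch, assuming each Workflow task passes `job_id` as a notebook parameter), the shared notebook can read the parameter through a widget instead of hard-coding the job ID:

```python
# Read the dbt Cloud job ID passed in by the Databricks Workflow task.
# Each task can supply a different value through its base parameters.
dbutils.widgets.text("job_id", "")           # declare the parameter with an empty default
job_id = int(dbutils.widgets.get("job_id"))  # for example, 67890 for one task and 67891 for another
```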

## Closing

Using Databricks workflows to access the dbt Cloud job API can improve integration of your data pipeline processes and enable scheduling of more complex workflows.
2 changes: 1 addition & 1 deletion website/docs/guides/migrating-from-spark-to-databricks.md
@@ -8,7 +8,7 @@ time_to_complete: '30 minutes'
platform: ['dbt-core','dbt-cloud']
icon: 'guides'
hide_table_of_contents: true
tags: ['migration', 'dbt Core','dbt Cloud']
tags: ['Migration', 'dbt Core','dbt Cloud']
level: 'Intermediate'
recently_updated: true
---
2 changes: 1 addition & 1 deletion website/docs/guides/migrating-from-stored-procedures.md
@@ -8,7 +8,7 @@ time_to_complete: '30 minutes'
platform: 'dbt-core'
icon: 'guides'
hide_table_of_contents: true
tags: ['materializations', 'dbt Core']
tags: ['Migration', 'dbt Core']
level: 'Beginner'
recently_updated: true
---

This file was deleted.

