[daggy-u] [dbt] - Add Lesson 7 (DEV-55) #20096

Merged
merged 9 commits into from
Feb 27, 2024
8 changes: 7 additions & 1 deletion docs/dagster-university/pages/dagster-dbt.md
Original file line number Diff line number Diff line change
@@ -28,4 +28,10 @@ title: Dagster + dbt
- [Overview](/dagster-dbt/lesson-4/1-overview)
- [Speeding up the development cycle](/dagster-dbt/lesson-4/2-speeding-up-the-development-cycle)
- [Debugging failed runs](/dagster-dbt/lesson-4/3-debugging-failed-runs)
- [Customizing your execution](/dagster-dbt/lesson-4/4-customizing-your-execution)
- [Customizing your execution](/dagster-dbt/lesson-4/4-customizing-your-execution)
- Lesson 7: Deploying to Dagster Cloud
- [Overview](/dagster-dbt/lesson-7/1-overview)
- [Pushing the project to GitHub](/dagster-dbt/lesson-7/2-pushing-the-project-to-github)
- [Setting up Dagster Cloud](/dagster-dbt/lesson-7/3-setting-up-dagster-cloud)
- [Creating the manifest with GitHub Actions](/dagster-dbt/lesson-7/4-creating-the-manifest-with-github-actions)
- [Preparing for a successful run](/dagster-dbt/lesson-7/5-preparing-for-a-successful-run)
@@ -0,0 +1,13 @@
---
title: "Lesson 7: Overview"
module: 'dbt_dagster'
lesson: '7'
---

# Overview

At this point, you have a fully integrated Dagster and dbt project! You’ve learned how to load dbt models as Dagster assets, create dependencies, add partitions, and execute and monitor the resulting pipeline in the Dagster UI.

In this lesson, we’ll deploy your Dagster+dbt project so it can run in both local and production environments. We’ll also walk through some of the considerations involved in bundling your dbt project with Dagster.

You’ll learn how to prepare the project for deployment to Dagster Cloud, including pushing the project to GitHub and setting up CI/CD to account for your dbt project. We’re using Dagster Cloud because it’s a standardized and controlled experience that we can walk you through, but the general patterns apply however you deploy Dagster.
@@ -0,0 +1,17 @@
---
title: "Lesson 7: Pushing the project to GitHub"
module: 'dbt_dagster'
lesson: '7'
---

# Pushing the project to GitHub

We’ll be using GitHub in this lesson because Dagster Cloud has a native integration with GitHub to quickly get deployment set up. This functionality can be easily replicated if your company uses a different version control provider, but we’ll standardize on using GitHub for now. Whether you use the command line or an app like GitHub Desktop is up to you.

1. Because you cloned this project, it’ll already have a git history and context. Let’s delete that by running `rm -rf .git`.
2. Create a new repository on GitHub.
3. Push the code from your project into this GitHub repository’s `main` branch.
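The steps above can be sketched on the command line. This sketch runs against a throwaway directory so it is safe to execute; in practice you’d run the same commands inside your cloned course project, and the remote URL is a placeholder for the repository you created in step 2:

```shell
# Illustrative sketch of steps 1–3; directory and file contents are stand-ins.
project_dir=$(mktemp -d)
cd "$project_dir"
rm -rf .git                                   # step 1: drop the cloned history
git init -q -b main
printf 'DUCKDB_DATABASE=...\n' > .env         # stand-in project files
printf '.env\n' > .gitignore                  # keep secrets out of the commit
git add .
git -c user.name=demo -c user.email=demo@example.com commit -q -m "Initial commit"
# Step 3 requires a real repository, so it is left commented out here:
# git remote add origin git@github.com:<your-account>/<your-repo>.git
# git push -u origin main
git log --oneline
```

Note that because `.gitignore` lists `.env`, the `git add .` above never stages the secrets file, which is exactly the behavior the callout below asks you to verify.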

{% callout %}
> 💡 **Important!** Make sure the `.env` file in your project isn’t included in your commit! The starter project for this course should have it listed in `.gitignore`, but it’s wise to double-check before accidentally committing sensitive files.
{% /callout %}
@@ -0,0 +1,46 @@
---
title: "Lesson 7: Setting up Dagster Cloud"
module: 'dbt_dagster'
lesson: '7'
---

# Setting up Dagster Cloud

Now that the project is set up and ready in GitHub, it’s time to move to Dagster Cloud. To keep things simple, we’ll use a [Serverless deployment](https://docs.dagster.io/dagster-cloud/deployment/serverless) to deploy our project. This option offloads managing the required infrastructure to Dagster Labs.

1. Sign up for a new [Dagster Cloud trial account](https://dagster.cloud/signup). Even if you already have an account, create a new one for this course. **Note:** When you sign up for a new account, you’ll automatically begin a free trial. You won’t be charged for anything after the trial unless you enter a credit card.
2. Complete the signup flow by creating an organization and finishing your user profile.
3. When prompted to select a deployment type, click **Serverless**.
4. The next step is to add our project to Dagster Cloud! Click the **Import a Dagster project** option and do the following:
1. In the **Git scope** field, select the GitHub account or organization that contains your project repository.

{% callout %}
> 💡 **Don’t see the right account/organization?** You may need to install the Dagster Cloud GitHub app first. To do this, click **+ Add account or organization.** You’ll be redirected to GitHub to complete the setup, and then automatically sent back to Dagster Cloud when finished. If you’re installing within your company’s GitHub organization, you may need your company’s GitHub admin to approve the app.
{% /callout %}

2. In the **Repository** field, select the repository containing your Dagster project.
3. Click **Deploy**. Note that the deployment can take a few minutes. Feel free to go grab a snack while you’re waiting!

---

## What happens when Dagster Cloud deploys code?

When Dagster deploys the code, a few things happen:

- Dagster creates a new code location for the repository in Dagster Cloud in the `prod` deployment
- Dagster adds two GitHub Action files to the repository:
- `.github/workflows/deploy.yml` - This file sets up Continuous Deployment (CD) for the repository. We won’t talk through all the steps here, but a high-level summary is that every time a change is made to the `main` branch of your repository, this GitHub Action will build your Dagster project and deploy it to Dagster Cloud.
- `.github/workflows/branch_deployments.yml` - This file enables the use of [Branch Deployments](https://docs.dagster.io/dagster-cloud/managing-deployments/branch-deployments), a Dagster Cloud feature that automatically creates staging environments for your Dagster code with every pull request. We won’t work with Branch Deployments during this lesson, but we highly recommend trying them out!
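We won’t reproduce the generated `deploy.yml` here, but the CD behavior described above boils down to a standard GitHub Actions push trigger. A minimal, illustrative sketch (not the file’s actual contents; the name will differ in the real file):

```yaml
# Illustrative only: the generated workflow contains many more steps.
name: Dagster Cloud Serverless Deploy
on:
  push:
    branches:
      - main    # every merge to main rebuilds and redeploys the project
```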

---

## Checking deployment status

It looks like the deployment completed, but the code location failed to load. If we look at the GitHub Actions logs for the job, we’ll see the following error in the **Python Executable Deploy** step:

```text
Error: Some locations failed to load after being synced by the agent:
Error loading dagster_university: {'__typename': 'PythonError', 'message': "FileNotFoundError: [Errno 2] No such file or directory: '/venvs/3eca07cc1eb5/lib/python3.8/site-packages/working_directory/root/analytics/target/manifest.json'\n" ...
```

It looks like the deployment failed because Dagster could not find a dbt manifest file. In the next section of this lesson, we’ll walk you through fixing this.
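To see why a missing manifest is fatal, here’s an illustrative sketch (the function and paths are assumptions based on the error above, not the course’s actual code): dagster-dbt resolves dbt models from `target/manifest.json` when the code location loads, so a missing manifest aborts the load before any asset is defined.

```python
import json
from pathlib import Path

# Hypothetical sketch of manifest loading at code-location startup.
def load_manifest(dbt_project_dir: str) -> dict:
    manifest_path = Path(dbt_project_dir) / "target" / "manifest.json"
    if not manifest_path.exists():
        # In CI, nothing has run `dbt parse` yet, so this is what fails.
        raise FileNotFoundError(
            f"No dbt manifest at {manifest_path}; run `dbt parse` first"
        )
    return json.loads(manifest_path.read_text())
```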
@@ -0,0 +1,46 @@
---
title: "Lesson 7: Creating the manifest with GitHub Actions"
module: 'dbt_dagster'
lesson: '7'
---

# Creating the manifest with GitHub Actions

To recap, our deployment failed in the last section because Dagster couldn’t find a dbt manifest file, which it needs to turn dbt models into Dagster assets. This is because, during local development, we built this file by running `dbt parse`, as you learned in Lesson 3 and refined in Lesson 4. However, Dagster Cloud’s out-of-the-box `deploy.yml` GitHub Action isn’t aware that you’re also trying to deploy a dbt project with Dagster.

To get our deployment working, we need to add a step to our GitHub Actions workflow that runs the dbt commands required to generate the `manifest.json`. Specifically, we need to run `dbt deps` and `dbt parse` in the dbt project, just like you did during local development.

1. In your Dagster project, locate the `.github/workflows` directory.
2. Open the `deploy.yml` file.
3. Locate the `Checkout for Python Executable Deploy` step, which should be on or near line 38.
4. After this step, add the following:

```yaml
- name: Parse dbt project and package with Dagster project
if: steps.prerun.outputs.result == 'pex-deploy'
run: |
pip install pip --upgrade
pip install dbt-duckdb
cd project-repo/analytics
dbt deps
dbt parse
shell: bash
```

5. Save and commit the changes. Make sure to push them to the remote!

Once the new step is pushed to the remote, GitHub will automatically try to run a new job using the updated workflow.

At this point, your dbt project will be successfully deployed onto Dagster Cloud and you should be able to see your models in the asset graph!

{% table %}

- Successful deployment
- dbt assets in the Asset graph

---

- ![Successful deployment screen in Dagster Cloud](/images/dagster-dbt/lesson-7/successful-cloud-setup.png)
- ![dbt models in the Asset graph in Dagster Cloud](/images/dagster-dbt/lesson-7/asset-graph.png)

{% /table %}
@@ -0,0 +1,142 @@
---
title: "Lesson 7: Preparing for a successful run"
module: 'dbt_dagster'
lesson: '7'
---

# Preparing for a successful run

{% callout %}
>💡 **Heads up! This section is optional.** The goal of this lesson was to teach you how to successfully **deploy** to Dagster Cloud, which you completed in the last section. Preparing for a successful run requires some external services that may not match the ones your organization uses, so we’ve opted to make this section optional.
{% /callout %}

In previous lessons, you followed along by adding our example code to your local project. You successfully materialized the assets in the project and stored the resulting data in a local DuckDB database.

**This section will be a little different.** To keep things simple, we’ll walk you through the steps required to run the pipeline successfully, but we won’t perform a run in this lesson. Production deployments can be complicated and require a lot of setup. We won’t go through setting up the external resources because we want to stay focused on running your pipeline in Dagster Cloud. For this lesson, assume we already have our storage set up and ready to go.

---

## Deployment overview

Since you'll be deploying your project in production, you'll need production systems to read and write your assets. In this case, we'll use:

- **Amazon S3** to store the files we were saving to our local file system. The data will be small enough to fit within AWS's free tier. For more information on how to set up an S3 bucket, see [this guide](https://www.gormanalysis.com/blog/connecting-to-aws-s3-with-python/).
- **MotherDuck** to replace our local DuckDB instance and query the data in our S3 bucket. MotherDuck is a cloud-based data warehouse for storing and querying data, and it is currently free to set up. For more information on how to set up MotherDuck, see [their documentation](https://motherduck.com/docs/getting-started), along with how to [connect it to your AWS S3 bucket](https://motherduck.com/docs/integrations/amazon-s3).

The code you cloned in the starter project already has some logic to dynamically switch between local and cloud storage, along with the paths to reference. To trigger the switch, set an environment variable called `DAGSTER_ENVIRONMENT` to `prod`. This tells the pipeline to use the production paths and storage.
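A hypothetical sketch of what that switching logic might look like (the function and bucket name are illustrative, not the starter project’s actual code):

```python
import os

# Illustrative environment-based path resolution; the bucket name is a placeholder.
def get_storage_path(filename: str) -> str:
    env = os.getenv("DAGSTER_ENVIRONMENT", "local")
    if env == "prod":
        return f"s3://dagster-university-bucket/{filename}"  # hypothetical bucket
    return f"data/staging/{filename}"

print(get_storage_path("taxi_trips.parquet"))
```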

In summary, before you can run this pipeline in Dagster Cloud, you’ll need to:

1. Set up an S3 bucket to store the files/assets that we download and generate
2. Sign up for a free MotherDuck account to replace our local DuckDB instance
3. Connect an S3 user with access to the S3 bucket to the MotherDuck account
4. Add a new production target to the dbt project
5. Add the environment variables for the S3 user and MotherDuck token to Dagster Cloud

We’ll show you how to do steps 4 and 5 so you can complete them with your own credentials when you’re ready.

---

## Adding a production target to profiles.yml

The first step we’ll take is to add a second target to the `dagster_dbt_university` profile in our project’s `analytics/profiles.yml`. A [‘target’ in dbt](https://docs.getdbt.com/docs/core/connect-data-platform/connection-profiles#understanding-targets-in-profiles) describes a connection to a data warehouse, which, up until this point in the course, has been a local DuckDB instance.

To maintain the separation of our development and production environments, we’ll add a `prod` target to our project’s profiles:

```yaml
dagster_dbt_university:
target: dev
outputs:
dev:
type: duckdb
path: '../{{ env_var("DUCKDB_DATABASE", "data/staging/data.duckdb") }}'
prod:
type: duckdb
path: '{{ env_var("MOTHERDUCK_TOKEN", "") }}'
```

Because we’re still using a DuckDB-backed database, our `type` will also be `duckdb` for `prod`. Save and commit the file to git before continuing.

**Note:** While dbt supports more platforms than just DuckDB, our project is set up to only work with this database type. If you use a different platform `type` for future projects, the configuration will vary depending on the platform being connected. Refer to [dbt’s documentation](https://docs.getdbt.com/docs/supported-data-platforms) for more information and examples.

---

## Adding a prod target to deploy.yml

Next, we need to update the dbt commands in the `.github/workflows/deploy.yml` file to target the new `prod` profile. This will ensure that dbt uses the correct connection details when the GitHub Action runs as part of our Dagster Cloud deployment.

Open the file, scroll to the dbt step you added, and append `--target prod` to the `dbt parse` command. This command should be on or around line 52:

```yaml
- name: Parse dbt project and package with Dagster project
if: steps.prerun.outputs.result == 'pex-deploy'
run: |
pip install pip --upgrade
pip install dbt-duckdb
cd project-repo/analytics
dbt deps
dbt parse --target prod ## add this flag
shell: bash
```

Save and commit the file to git. Don’t forget to push to remote!

---

## Adding environment variables to Dagster Cloud

The last step in preparing for a successful run is to move environment variables to Dagster Cloud. These variables were available to us via the `.env` file while we were working locally, but now that we’ve moved to a different environment, we’ll need to make them accessible again.

### Environment variables

The following table contains the environment variables we need to create in Dagster Cloud:

{% table %}

- Variable {% width="30%" %}
- Description

---

- `DUCKDB_DATABASE`
- The path to the DuckDB database, as referenced in `profiles.yml`

---

- `MOTHERDUCK_TOKEN`
- The MotherDuck service token used by the `prod` target in `profiles.yml`

---

- `DAGSTER_ENVIRONMENT`
- Set this to `prod`

---

- `AWS_ACCESS_KEY_ID`
- The access key ID for the S3 bucket.

---

- `AWS_SECRET_ACCESS_KEY`
- The secret access key associated with the S3 bucket.

---

- `AWS_REGION`
- The region the S3 bucket is located in.

{% /table %}

### Creating environment variables

1. In the Dagster Cloud UI, click **Deployment > Environment variables**.
2. Click the **Add environment variable** button on the right side of the screen.
3. In the **Create environment variable** window, fill in the following:
1. **Name** - The name of the environment variable. For example: `DUCKDB_DATABASE`
2. **Value** - The value of the environment variable.
3. **Code location scope** - Deselect the **All code locations** option and check only the code location for this course’s project.
4. Click **Save.**

Repeat these steps until all the environment variables have been added.
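Once saved, these variables behave like ordinary environment variables inside your code location. A minimal sketch of reading them (variable names come from the table above; the defaults here are placeholders for local runs, not the project’s actual values):

```python
import os

# Names match the environment variables table; defaults are placeholders.
dagster_environment = os.getenv("DAGSTER_ENVIRONMENT", "local")
duckdb_database = os.getenv("DUCKDB_DATABASE", "data/staging/data.duckdb")
aws_region = os.getenv("AWS_REGION", "")

print(f"environment={dagster_environment}, database={duckdb_database}")
```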

---

## Running the pipeline

TODO: Add once unblocked