[daggy-u] [dbt] - Add dbt Lesson 2 (DEV-60) (#19868)
## Summary & Motivation

This PR adds the content for Lesson 2 of the Dagster + dbt module to
Dagster University.

TODOs:

- [x] Part 1 - Confirm project clone command is correct / do we need to
add a destination location to prevent naming collisions?
- [x] Part 2 - Import snippet for `setup.py` once available
- [x] Part 2 - Add screenshot of asset graph in UI
- [x] Part 3 - Confirm `clone` command for dbt project to `analytics`
- [x] Part 5 - Should we add logs or errors for `dbt build`?

## How I Tested These Changes

---------

Co-authored-by: Tim Castillo <[email protected]>
3 people authored Feb 27, 2024
1 parent 2c97cce commit e6047ae
Showing 7 changed files with 245 additions and 0 deletions.
17 changes: 17 additions & 0 deletions docs/dagster-university/pages/dagster-dbt.md
@@ -0,0 +1,17 @@
---
title: Dagster + dbt
---

# Dagster + dbt

- Lesson 1: Introduction
  - [What's dbt?](/dagster-dbt/lesson-1/1-whats-dbt)
  - [Why use dbt and Dagster together?](/dagster-dbt/lesson-1/2-why-use-dbt-and-dagster-together)
  - [How do dbt models relate to Dagster assets?](/dagster-dbt/lesson-1/3-how-do-dbt-models-relate-to-dagster-assets)
  - [Project preview](/dagster-dbt/lesson-1/4-project-preview)
- Lesson 2: Installation & Setup
  - [Requirements](/dagster-dbt/lesson-2/1-requirements)
  - [Set up the Dagster project](/dagster-dbt/lesson-2/2-set-up-the-dagster-project)
  - [Set up the dbt project](/dagster-dbt/lesson-2/3-set-up-the-dbt-project)
  - [dbt project files](/dagster-dbt/lesson-2/4-dbt-project-files)
  - [Verify dbt installation](/dagster-dbt/lesson-2/5-verify-dbt-installation)
@@ -0,0 +1,34 @@
---
title: "Lesson 2: Requirements"
module: 'dagster_dbt'
lesson: '2'
---

## Requirements

To complete this course, you’ll need:

- **To install git.** Refer to the [Git documentation](https://github.com/git-guides/install-git) if you don’t have this installed.
- **To have Python installed.** Dagster supports Python 3.9 - 3.12.
- **To install a package manager like pip or Poetry.** If you need to install a package manager, refer to the following installation guides:
  - [pip](https://pip.pypa.io/en/stable/installation/)
  - [Poetry](https://python-poetry.org/docs/)

To check that Python and pip are already installed in your environment, run the following. If you use Poetry, run `poetry --version` instead of `pip --version`:

```shell
python --version
pip --version
```
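
The same check can also be scripted. The following is a minimal sketch rather than part of the course material: the helper names `python_is_supported` and `find_package_manager` are hypothetical, and the version bounds are taken from the supported range stated above, assumed to be inclusive.

```python
import shutil
import sys

# Supported range as stated in the requirements above (assumed inclusive).
MIN_PYTHON = (3, 9)
MAX_PYTHON = (3, 12)


def python_is_supported(version=None):
    """Return True if the (major, minor) version falls in the supported range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_PYTHON <= major_minor <= MAX_PYTHON


def find_package_manager():
    """Return the first package manager found on PATH, or None if neither exists."""
    for name in ("pip", "poetry"):
        if shutil.which(name) is not None:
            return name
    return None


if __name__ == "__main__":
    print("Python supported:", python_is_supported())
    print("Package manager:", find_package_manager())
```

Passing an explicit tuple such as `python_is_supported((3, 8))` lets you test a version other than the one running the script.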

---

## Clone the Dagster University project

Even if you’ve already completed the Dagster Essentials course, you should still clone the project as some things may have changed.

Run the following to clone the project:

```bash
git clone https://github.com/dagster-io/project-dagster-university -b module/dagster-and-dbt/starter
```
@@ -0,0 +1,65 @@
---
title: "Lesson 2: Set up the Dagster project"
module: 'dagster_dbt'
lesson: '2'
---

# Set up the Dagster project

After downloading the Dagster University project, you’ll need to make a few changes to finish setting things up.

First, you’ll add a few additional dependencies to the project:

- `dagster-dbt` - Dagster’s integration library for dbt. This will also install `dbt-core` and `dagster` as dependencies.
- `dbt-duckdb` - A library for using dbt with DuckDB, which we’ll use to store the dbt models we create.

Locate the `setup.py` file in the root of the Dagster University project. Open the file and replace it with the following:

```python
from setuptools import find_packages, setup

setup(
    name="dagster_university",
    packages=find_packages(exclude=["dagster_university_tests"]),
    install_requires=[
        "dagster==1.6.*",
        "dagster-cloud",
        "dagster-duckdb",
        "dagster-dbt",
        "dbt-duckdb",
        "geopandas",
        "kaleido",
        "pandas",
        "plotly",
        "shapely",
        "smart_open[s3]",
        "s3fs",
        "smart_open",
        "boto3",
    ],
    extras_require={"dev": ["dagster-webserver", "pytest"]},
)
```

{% callout %}
💡 **Heads up!** We strongly recommend installing the project dependencies inside a Python virtual environment. If you need a primer on virtual environments, including creating and activating one, check out this [blog post](https://dagster.io/blog/python-packages-primer-2).
{% /callout %}

Then, run the following in the command line to copy the `.env.example` file to `.env` and install the dependencies:

```bash
cd project-dagster-university
cp .env.example .env
pip install -e ".[dev]"
```

The `-e` flag installs the project in editable mode, which means you can modify existing Dagster assets without having to reload the code location. This shortens the time it takes to test a change. However, you’ll still need to reload the code location in the Dagster UI when adding new assets or installing additional dependencies.
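
If you want to confirm that the new dependencies were actually installed into the active environment, a quick check along these lines may help. This is a sketch, not part of the course: `missing_distributions` is a hypothetical helper, and the list of names is just the subset of `setup.py` entries relevant to this module.

```python
from importlib import metadata

# Distributions this module depends on (taken from the setup.py above).
REQUIRED = ["dagster", "dagster-dbt", "dbt-duckdb"]


def missing_distributions(names):
    """Return the distribution names that are not installed in this environment."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_distributions(REQUIRED)
    print("Missing:", missing or "none")
```

An empty result suggests the install step succeeded; any names listed would need to be installed before continuing.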

To confirm everything works:

1. Run `dagster dev` from the project directory.
2. Navigate to the Dagster UI ([`http://localhost:3000`](http://localhost:3000/)) in your browser.
3. Open the asset graph by clicking **Assets > View global asset lineage**.
4. Click **Materialize all** to materialize all the assets in the project. **For partitioned assets**, you can materialize just the most recent partition.

![The Asset Graph in the Dagster UI](/images/dagster-dbt/lesson-2/asset-graph.png)
@@ -0,0 +1,28 @@
---
title: "Lesson 2: Set up the dbt project"
module: 'dagster_dbt'
lesson: '2'
---

# Set up the dbt project

Next, you’ll notice that there is a dbt project called `analytics` in the repository you cloned. Throughout this module, you’ll add new dbt models and see them reflected in Dagster.

1. Navigate into the directory by running:

   ```bash
   cd analytics
   ```

2. Next, install dbt package dependencies by running:

   ```bash
   dbt deps
   ```

3. In a file explorer or IDE, open the `analytics` directory. You should see the following files, which define the sources and models we’ll use to get started:

   - `models/sources/raw_taxis.yml`
   - `models/staging/staging.yml`
   - `models/staging/stg_trips.sql`
   - `models/staging/stg_zones.sql`
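
To check for the expected files without opening a file explorer, something like the following could work. It is only a sketch: `missing_files` is a hypothetical helper, and the example paths are the two YAML files from the list above.

```python
from pathlib import Path


def missing_files(project_dir, relative_paths):
    """Return the relative paths that do not exist as files under project_dir."""
    root = Path(project_dir)
    return [p for p in relative_paths if not (root / p).is_file()]


if __name__ == "__main__":
    expected = [
        "models/sources/raw_taxis.yml",
        "models/staging/staging.yml",
    ]
    # Assumes the script is run from the analytics directory.
    print("Missing:", missing_files(".", expected) or "none")
```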
@@ -0,0 +1,59 @@
---
title: "Lesson 2: dbt project files"
module: 'dagster_dbt'
lesson: '2'
---

# dbt project files

Before we get started building out the dbt project, let’s go over some of the files in the project.

---

## dbt_project.yml

From the dbt docs:

> Every [dbt project](https://docs.getdbt.com/docs/build/projects) needs a `dbt_project.yml` file — this is how dbt knows a directory is a dbt project. It also contains important information that tells dbt how to operate your project.

Refer to [the dbt documentation](https://docs.getdbt.com/reference/dbt_project.yml) for more information about `dbt_project.yml`.

---

## profiles.yml

The next file we’ll cover is the `profiles.yml` file. This file contains connection details for your data platform, such as those for the DuckDB database we’ll use in this course. In this step, we’ll set up a `dev` environment for the project to use, which is where the DuckDB database is located.

Before we start working, you should know:

- **Don’t put credentials in this file!** We’ll be pushing `profiles.yml` to git, which would expose any credentials it contains. When we set up the file, we’ll show you how to use environment variables to store connection details securely.
- **We’ll create the file in the `analytics` directory, instead of in dbt’s recommended `.dbt`.** We’re doing this for a few reasons:
  - It allows dbt to use the same environment variables as Dagster
  - It standardizes the way connections are created as more people contribute to the project

### Set up profiles.yml

Now you’re ready - let’s go!

1. Navigate to the `analytics` directory.
2. In this folder, create a `profiles.yml` file.
3. Copy the following code into the file:

```yaml
dagster_dbt_university:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: '../{{ env_var("DUCKDB_DATABASE", "data/staging/data.duckdb") }}'
```

Let’s review what this does:

- Creates a profile named `dagster_dbt_university`
- Sets the default target (data warehouse) for the `dagster_dbt_university` profile to `dev`
- Defines one target: `dev`
- Sets the `type` to `duckdb`
- Sets the `path` using a [dbt macro](https://docs.getdbt.com/reference/dbt-jinja-functions/env_var) to reference the `DUCKDB_DATABASE` environment variable in the project’s `.env` file, falling back to `data/staging/data.duckdb` if the variable isn’t set. With this, your dbt models will be built in the same DuckDB database where your Dagster assets are materialized.

The `DUCKDB_DATABASE` environment variable holds a path relative to the project’s root directory. Because dbt is run from the `analytics` directory, we prefixed the path with `../` so that it resolves correctly.
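
To see how the `../` prefix and the default value combine, here is a rough Python approximation of the path resolution. It is not how dbt implements `env_var`; it assumes dbt is run from the `analytics` directory and that relative paths resolve against that directory. The function name `resolve_duckdb_path` and the `/proj` paths are hypothetical.

```python
import os

# Default value passed to env_var() in the profiles.yml above.
DEFAULT_DB = "data/staging/data.duckdb"


def resolve_duckdb_path(analytics_dir, environ=None):
    """Approximate where the `path` setting points, given the analytics directory."""
    env = os.environ if environ is None else environ
    # Mirror env_var("DUCKDB_DATABASE", default): use the default when unset.
    relative_to_root = env.get("DUCKDB_DATABASE", DEFAULT_DB)
    # The "../" prefix steps from analytics/ up to the project root.
    return os.path.normpath(os.path.join(analytics_dir, "..", relative_to_root))


if __name__ == "__main__":
    print(resolve_duckdb_path("/proj/analytics", {}))
    print(resolve_duckdb_path("/proj/analytics", {"DUCKDB_DATABASE": "data/other.duckdb"}))
```

In both cases the result lands under the project root rather than inside `analytics`, which is the point of the prefix.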
@@ -0,0 +1,42 @@
---
title: "Lesson 2: Verify dbt installation"
module: 'dagster_dbt'
lesson: '2'
---

# Verify dbt installation

Before continuing, let’s run the dbt project from the command line to confirm that everything is configured correctly.

From the `analytics` directory, run the following command:

```bash
dbt build
```

The two staging models should materialize successfully and pass their tests:

```bash
19:56:02 dbt build
19:56:03 Running with dbt=1.7.8
19:56:05 Registered adapter: duckdb=1.7.1
19:56:05 Unable to do partial parsing because saved manifest not found. Starting full parse.
19:56:07 Found 2 models, 2 tests, 2 sources, 0 exposures, 0 metrics, 505 macros, 0 groups, 0 semantic models
19:56:07
19:56:07 Concurrency: 1 threads (target='dev')
19:56:07
19:56:07 1 of 4 START sql table model main.stg_trips .................................... [RUN]
19:56:09 1 of 4 OK created sql table model main.stg_trips ............................... [OK in 1.53s]
19:56:09 2 of 4 START sql table model main.stg_zones .................................... [RUN]
19:56:09 2 of 4 OK created sql table model main.stg_zones ............................... [OK in 0.07s]
19:56:09 3 of 4 START test accepted_values_stg_zones_borough__Manhattan__Bronx__Brooklyn__Queens__Staten_Island__EWR [RUN]
19:56:09 3 of 4 PASS accepted_values_stg_zones_borough__Manhattan__Bronx__Brooklyn__Queens__Staten_Island__EWR [PASS in 0.06s]
19:56:09 4 of 4 START test not_null_stg_zones_zone_id ................................... [RUN]
19:56:09 4 of 4 PASS not_null_stg_zones_zone_id ......................................... [PASS in 0.04s]
19:56:09
19:56:09 Finished running 2 table models, 2 tests in 0 hours 0 minutes and 1.95 seconds (1.95s).
19:56:09
19:56:09 Completed successfully
19:56:09
19:56:09 Done. PASS=4 WARN=0 ERROR=0 SKIP=0 TOTAL=4
```
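
If you capture this output in a script, the final summary line is the easiest part to check. The sketch below parses it with a regular expression; the pattern is based only on the log shown above (dbt 1.7.8) and may not match other dbt versions, and `parse_summary` is a hypothetical helper.

```python
import re

# Pattern modeled on the "Done." line in the log above; other dbt versions may differ.
SUMMARY_RE = re.compile(
    r"Done\. PASS=(\d+) WARN=(\d+) ERROR=(\d+) SKIP=(\d+) TOTAL=(\d+)"
)


def parse_summary(log_text):
    """Return the summary counts as a dict, or None if no summary line is found."""
    match = SUMMARY_RE.search(log_text)
    if match is None:
        return None
    keys = ("PASS", "WARN", "ERROR", "SKIP", "TOTAL")
    return dict(zip(keys, map(int, match.groups())))


if __name__ == "__main__":
    line = "19:56:09 Done. PASS=4 WARN=0 ERROR=0 SKIP=0 TOTAL=4"
    counts = parse_summary(line)
    print("Build succeeded:", counts is not None and counts["ERROR"] == 0)
```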

1 comment on commit e6047ae

@github-actions

Deploy preview for dagster-university ready!

✅ Preview
https://dagster-university-2ckfjr396-elementl.vercel.app

Built with commit e6047ae.
This pull request is being automatically deployed with vercel-action
