dbt + Trino: Starburst Galaxy covid demo

Inspired by the Cinco de Trino repo by @jtcohen6!

There's a fair amount of setup work. The entire value proposition of Trino and Galaxy is being able to query and transform data regardless of where it lives. To demo this, you have to create at least one data store and load data into it. Then you must set up a Galaxy account and give it access to the external data stores as well as to the location where output data will be stored. The silver lining is that you only have to do this once!

For the data source setup required for this tutorial, please see INFRA_SETUP.MD.

What you'll need:

Why are we using so many data sources? This data lakehouse tutorial takes you through every step of creating a reporting structure, including getting your sources into the land layer in S3. Starburst Galaxy's superpower with dbt is federating data from multiple sources into one dbt repository, and using several sources demonstrates that use case alongside the data lakehouse use case. If you are only interested in S3, you can run all the TPCH and AWS models without creating a Snowflake login. The Snowflake models will fail, but the rest should complete.
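If you would rather skip the Snowflake models than let them fail, dbt's node selection can exclude them. The sketch below is only an illustration: the folder name `models/snowflake` is an assumption, not something taken from this repository, so check the project's `models/` directory and adjust the path before relying on it.

```shell
# Hypothetical location of the Snowflake-backed models; adjust to the repo's real layout.
SNOWFLAKE_MODELS="${SNOWFLAKE_MODELS:-models/snowflake}"

# Build every model except those under that folder, using dbt's path selector.
if command -v dbt >/dev/null 2>&1; then
  dbt build --exclude "path:${SNOWFLAKE_MODELS}"
else
  echo "dbt not found; would run: dbt build --exclude path:${SNOWFLAKE_MODELS}"
fi
```

Models that depend on the excluded ones would still be attempted, so they may need to be excluded as well.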

You will also need:

  • A dbt installation of your choosing. I used a Python virtual environment on my M1 Mac because that was the most commonly recommended approach. The steps I followed are listed below in this README. Review the other dbt Core installation options to pick what works best for you.

Tutorial Information

The goal of this tutorial is to showcase the power of dbt and Starburst Galaxy together. It demonstrates two superpowers:

  1. Query federation across multiple data sources - dbt is a transformation tool, so it can only work on data that has already landed in a storage system it can reach. Starburst Galaxy removes that limitation by letting you query your data across multiple sources.
  2. Data lakehouse analytics - In this lab, we are going to build our lakehouse reporting structure in S3. We use slightly different names from the traditional Land, Structure, and Consume layers to accommodate dbt conventions: Land = Stage, Structure = Intermediate, Consume = Aggregate. For more information about the Starburst data lakehouse, visit this blog.

The demo itself

Installing dbt in your local environment

  1. Install the dbt-trino adapter plugin, which allows you to use dbt together with Trino / Starburst Galaxy. You may want to do this inside a Python virtual environment. Below I list the steps I took to create my virtual environment.
python3 -m venv dbt-env
source dbt-env/bin/activate
pip install --upgrade pip wheel setuptools
pip install dbt-trino

Check that the installed dbt and adapter versions are up to date:

dbt --version
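If `dbt --version` is not found or reports an unexpected version, the usual cause is that the virtual environment is not active in the current shell. A minimal check might look like the following; `venv_status` is a helper name made up for this example, and it relies only on the `VIRTUAL_ENV` variable that the `activate` script exports.

```shell
# Report whether a Python virtual environment is active in this shell.
# `source dbt-env/bin/activate` exports VIRTUAL_ENV; it is unset otherwise.
venv_status() {
  if [ -n "${VIRTUAL_ENV:-}" ]; then
    echo "active: ${VIRTUAL_ENV}"
  else
    echo "inactive"
  fi
}

venv_status

# Show the installed adapter version, or say so if pip does not know the package.
python3 -m pip show dbt-trino 2>/dev/null | grep -i '^Version' || echo "dbt-trino not installed"
```

If the status is `inactive`, re-run `source dbt-env/bin/activate` from the directory where the environment was created.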

Other helpful links to getting started with setting up your virtual environment:

Getting Started with this repository

  1. Clone this GitHub repo to your local machine, then change into the new dbt-galaxy-covid-demo directory: git clone https://github.com/monimiller/dbt-galaxy-covid-demo.git

  2. Copy sample.profiles.yml to ~/.dbt/profiles.yml in your home directory. (Why? This file will contain your password for connecting to Trino/Starburst, so you don't want it checked into git.)

mkdir -p ~/.dbt
cp ./sample.profiles.yml ~/.dbt/profiles.yml
  3. Open the file, and update the fields denoted by <> with your own user, password, cluster, etc. Specify dbt_aws_tgt as your catalog if you want Iceberg tables. If not, use dbt_aws_source. You can keep the sample schema.

  4. Verify that you can connect to Trino / Starburst Galaxy. (If your Galaxy cluster is stopped, it may take a few moments for it to resume.)

dbt debug
  5. Install dbt packages (dbt_utils) for use in the project:
dbt deps
  6. Try running dbt. (dbt build runs the models and tests together, in dependency order.)
dbt run
dbt test
dbt build
  7. Generate and view documentation:
dbt docs generate
dbt docs serve
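For the profile step above, a filled-in profiles.yml for the dbt-trino adapter might look roughly like the sketch below. This is not the repository's sample.profiles.yml: the profile name `my_galaxy_profile`, the target name `dev`, the thread count, and the use of `ldap` username/password authentication on port 443 are all assumptions for illustration, and the `<>` placeholders are left for you to fill in. The snippet writes to a scratch file so it cannot overwrite your real profile.

```shell
# Write an illustrative profile to a scratch file (not ~/.dbt) so nothing is overwritten.
PROFILE_SKETCH="${TMPDIR:-/tmp}/profiles.sketch.yml"

cat > "$PROFILE_SKETCH" <<'EOF'
# "my_galaxy_profile" is a placeholder; it must match the `profile:` key in dbt_project.yml.
my_galaxy_profile:
  target: dev
  outputs:
    dev:
      type: trino
      method: ldap            # username/password authentication (assumed here)
      user: <user>
      password: <password>
      host: <cluster-host>
      port: 443
      http_scheme: https
      database: dbt_aws_tgt   # the catalog; use dbt_aws_source if you do not want Iceberg tables
      schema: <schema>
      threads: 4
EOF

# Count the fields that still hold <> placeholders and need your own values.
grep -c '<.*>' "$PROFILE_SKETCH"
```

In dbt-trino the `database` field holds the Trino catalog name, which is why the catalog choice from the profile step goes there. After editing the real file, `dbt debug` is the quickest way to confirm the values are accepted.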

More on Starburst Galaxy

  • Get started with the query federation tutorial!
  • Get started with the data lake analytics tutorial!

More on dbt + Trino

Watch recordings from past Trino community broadcasts: