There's a fair amount of setup work. The entire value proposition of Trino and Galaxy is being able to grab and transform data regardless of where it lives. To demo this, you have to create at least one place for data to live and put data into it. Then you must set up a Galaxy account and give it access to the external data stores as well as to wherever output data will be stored. The silver lining is that you only have to do this once, ever!
This demo works with either dbt Core or dbt Cloud. Both require you to complete the steps in INFRA_SETUP.MD to set up the appropriate data sources. You will need:
- A Starburst Galaxy account. This is the easiest way to get up and running with Trino and see the power of Trino + dbt.
- An AWS account to connect a catalog to S3. AWS acts as both a source and a target catalog in this example.
- Any Snowflake login; sign up for a free account if you don't have one. You don't strictly need Snowflake for the demo, but skipping it means altering some models yourself.
Why are we using so many data sources? For this data lakehouse tutorial, we take you through all the steps of creating a reporting structure, including getting your sources into your land layer in S3. Starburst Galaxy's superpower with dbt is being able to federate data from multiple different sources into one dbt repository, so showing multiple sources demonstrates that use case in addition to the data lakehouse use case. If you are only interested in using S3, you can run all the TPCH and AWS models without creating a Snowflake login; the Snowflake models will fail, but the rest should complete. A sketch of the selection syntax is below.
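For example, dbt's node selection syntax can skip the Snowflake models entirely. The selector names below (`snowflake`, `tpch`, `aws`) are illustrative assumptions, not necessarily the folder names used in this repo:

```bash
# Run everything except models under a hypothetical snowflake/ folder
dbt run --exclude snowflake

# Or run only the TPCH and AWS models (folder names are illustrative)
dbt run --select tpch aws
```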
You will also need:
- A dbt installation of your choosing (Core or Cloud).
- For Core: I used a virtual environment on my M1 Mac because that was the most recommended approach; a sketch of the steps follows this list. Review the other dbt Core installation options to pick what works best for you.
- For Cloud: I registered for a free account and used this repository in dbt Cloud. This option requires fewer first-time setup steps. If you don't know which to pick, use this.
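Here is a minimal sketch of the Core setup on macOS/Linux, assuming you want the `dbt-trino` adapter to talk to Starburst Galaxy (the environment name is arbitrary):

```bash
# Create and activate a virtual environment
python3 -m venv dbt-env
source dbt-env/bin/activate

# Install dbt Core and the Trino adapter (dbt-trino pulls in dbt-core)
python -m pip install --upgrade pip
python -m pip install dbt-trino

# Confirm the install
dbt --version
```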
The goal of this tutorial is to showcase the power of dbt + Starburst Galaxy together by demonstrating two superpowers:
- Query federation across multiple data sources - dbt specializes in transformation and can normally only be used after the data has landed in a single storage solution. Starburst Galaxy fixes that by letting you query your data from multiple sources at once; see the example after this list.
- Data lakehouse analytics - in this lab, we build our lakehouse reporting structure in S3, using slightly different naming conventions from the traditional Land, Structure, and Consume layers to accommodate dbt standards: Land = Stage, Structure = Intermediate, Consume = Aggregate. For more information about the Starburst data lakehouse, visit this blog.
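To give a flavor of federation, here is a minimal sketch of a dbt model that joins an S3-backed table with a Snowflake table through Starburst Galaxy. The source and column names (`s3_lake`, `snowflake_src`, and so on) are hypothetical, not the ones defined in this repo:

```sql
-- A hypothetical consume/aggregate-layer model.
-- Starburst Galaxy maps each dbt source to its own catalog,
-- so a single SELECT can federate S3 data with Snowflake data.
select
    o.order_id,
    o.order_total,
    c.customer_segment
from {{ source('s3_lake', 'orders') }} as o           -- table in the S3 catalog
join {{ source('snowflake_src', 'customers') }} as c  -- table in the Snowflake catalog
    on o.customer_id = c.customer_id
```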
For the dbt Core tutorial, visit this blog for more information and use CORE.MD as your README to run this demo with dbt Core.
For the dbt Cloud tutorial, visit this blog for more information and use CLOUD.MD as your README to run this demo with dbt Cloud.
Shout out to @dataders for his awesome help! Inspired by the Cinco de Trino repo by @jtcohen6!