Initial Infrastructure

Initial setup

What you'll need:

  • An AWS account
  • A Starburst Galaxy account (the free trial works)
  • A Snowflake account (optional; see the note below)

Why are we using so many data sources? This data lakehouse tutorial takes you through all the steps of creating a reporting structure, including the steps to get your sources into your land layer in S3. Starburst Galaxy's superpower with dbt is being able to federate data from multiple different sources into one dbt repository, so showing multiple sources demonstrates that use case alongside the data lakehouse use case. If you are interested in only using S3, you can run all the TPCH and AWS models without creating a Snowflake login; the Snowflake portion will fail, but the rest should complete.

You will also need:

  • A dbt installation of your choosing. I used a virtual environment on my M1 Mac because that was the most recommended approach; I'll add those steps below in this README. Review the other dbt Core installation documentation to pick what works best for you.

Getting Started with AWS

  1. Sign up for an AWS account.
  2. Create an S3 bucket in the Ohio region (us-east-2). You must use this region because it is where the COVID-19 public data lake lives. Use all the defaults.
  3. Create an AWS access key that will be used as the Starburst Galaxy authentication method for connecting to S3.
    • Go to the IAM Management Console
    • Select Users
    • Select Add Users
    • Provide a descriptive user name, like <username>-aws-dbt
    • Select AWS Credential Type: Access key - Programmatic access
    • Set Permissions: Attach existing policies directly
    • Add the following policy: AmazonS3FullAccess
  4. Finish creating the access key with the rest of the defaults, and then save your AWS Access Key and Secret Access Key.

Getting Started with Starburst Galaxy

Create the source AWS catalog in Starburst Galaxy

  1. Sign up for a Starburst Galaxy free trial.

  2. Set up Starburst Galaxy and configure the S3 catalog to access your S3 bucket.

a. Navigate to the Catalogs tab and click Configure a Catalog. b. Create an S3 catalog with the following settings:

  • Catalog name: dbt_aws_source
  • Add a relevant description
  • Authenticate to S3 through the AWS Access Key/Secret created earlier
  • Metastore configuration: "I don't have a metastore"
  • Default directory name: source
  • Enable Allow creating external tables
  • Enable Allow writing to external tables
  • Select default table format: Hive
  • Hit Skip on the Set Permissions page

More information on connecting to S3 is available in the Starburst Galaxy documentation.

Create a Cluster in Starburst Galaxy

A cluster in Starburst Galaxy provides the resources needed to run queries against your catalogs. You can access the catalog data exposed by running clusters in the Query Editor; a quick sanity check is sketched after the list below.

  • Click Create a new cluster
  • Enter cluster name: dbt-aws
  • Also select the tpch catalog
  • Cluster size: Free
  • Cluster type: Standard
  • Catalogs: dbt_aws_source (select the catalog previously created)
  • Cloud provider region: US East (Ohio) aka us-east-2
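
Once the cluster is running, it's worth a quick sanity check from the Query Editor before moving on. This only touches the built-in tpch sample catalog, so nothing here depends on your S3 setup:

-- quick sanity check against the built-in sample catalog
SELECT name, regionkey
FROM tpch.tiny.nation
LIMIT 5;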

Create the target AWS catalog in Starburst Galaxy

  1. Configure a new S3 catalog to access your S3 bucket. This step is optional, depending on whether you want your structure and consume tables in Iceberg format (a sketch of what this enables follows the list below).

a. Navigate to the Catalogs tab and click Configure a Catalog. b. Create an S3 catalog with the following settings:

  • Catalog name: dbt_aws_tgt
  • Add a relevant description
  • Authenticate to S3 through the AWS Access Key/Secret created earlier
  • Metastore configuration: "I don't have a metastore"
  • Default directory name: target
  • Enable Allow creating external tables
  • Enable Allow writing to external tables
  • Select default table format: Iceberg
  • Hit Skip on the Set Permissions page
  • Add this catalog to the existing dbt-aws cluster
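
If you did create this Iceberg catalog, anything built in it will be stored as Iceberg tables. As a rough sketch of what that looks like in SQL (the schema and table names below are made up for illustration, and dbt will create the real ones for you):

-- hypothetical names for illustration; dbt creates the real schemas and tables
CREATE SCHEMA dbt_aws_tgt.example_structure
WITH (location = 's3://dbt-aws-<username>/target/example_structure');

-- tables created in this catalog default to the Iceberg format
CREATE TABLE dbt_aws_tgt.example_structure.example_table AS
SELECT 1 AS placeholder_column;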

Create a Snowflake catalog

Follow the instructions to create and configure your Snowflake catalog. Give your catalog a descriptive name such as dbt-snow. I made a Snowflake account just for this demo and didn't need to sign up for anything beyond the free trial. Connect your catalog to the dbt-aws cluster. Once you log into Snowflake (in the account admin role), navigate to the Marketplace tab on the left-hand side, search for the free COVID-19 Epidemiological Data listing, and click Open. Connect to the COVID19 database.

Note: You may need to update permissions if you do not have access to the account admin role.

grant import share on account to role DEMOROLE;
grant create database on account to role DEMOROLE;
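
After accepting the Marketplace listing, a quick way to confirm the share landed is to check for the database from a Snowflake worksheet (not Galaxy):

-- run in Snowflake: the shared database should now exist
SHOW DATABASES LIKE 'COVID19';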

At this point, you should have at least three catalogs attached to your cluster: tpch, dbt_aws_source, and your Snowflake catalog. If you want to use Iceberg, you should have four, adding dbt_aws_tgt.
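
To confirm this from the Query Editor, you can list the catalogs attached to the cluster and peek at the Snowflake catalog. The quoted name below assumes you called the catalog dbt-snow; hyphenated names need double quotes in SQL.

-- list every catalog attached to the dbt-aws cluster
SHOW CATALOGS;

-- hyphenated catalog names must be double-quoted
SHOW SCHEMAS FROM "dbt-snow";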

Create your AWS source tables

Since tpch is a sample catalog, and the COVID19.PUBLIC data from Snowflake syncs automatically, the only source you need to create yourself is the AWS table from the COVID-19 data lake.

  1. Follow the Starburst Galaxy data lake tutorial to create the enigma_jhu table using the existing cluster and catalog you created.
  2. Configure role-based access control. You need to grant access to both the COVID-19 data lake and the bucket you created for this tutorial:
  • s3://covid19-lake/*
  • s3://dbt-aws-<username>/*
  3. Create a schema in the source catalog, pointing at the bucket you created earlier.
create schema aws_covid_data_lake with (location='s3://dbt-aws-<username>/');
  4. Create the enigma_jhu table:
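-- external Hive table over the public enigma_jhu JSON files in the COVID-19 data lake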
CREATE TABLE enigma_jhu (
    fips varchar,
    admin2 varchar,
    province_state varchar,
    country_region varchar,
    last_update varchar,
    latitude double,
    longitude double,
    confirmed varchar,
    combined_key varchar
)
WITH (
    format = 'JSON',
    external_location = 's3://covid19-lake/enigma-jhu/json/'
);

You should now be able to see all your sources within your cluster.
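
As a final check, you can hit each of the three sources from the same Query Editor session. The tpch and AWS names match what was created above (this assumes the aws_covid_data_lake schema was created in the dbt_aws_source catalog); the last statement simply lists what the Snowflake share exposes.

-- built-in sample catalog
SELECT count(*) AS tpch_rows
FROM tpch.tiny.orders;

-- table created from the COVID-19 data lake
SELECT count(*) AS covid_data_lake_rows
FROM dbt_aws_source.aws_covid_data_lake.enigma_jhu;

-- list the tables exposed by the Snowflake share (pick any of these to query)
SHOW TABLES FROM "dbt-snow".public;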