Initial Infrastructure

Initial setup

What you'll need:

  • An AWS account
  • A Starburst Galaxy account (the free trial works)
  • A Snowflake account (optional; see the note below)

Why are we using so many data sources? This data lakehouse tutorial takes you through all the steps of creating a reporting structure, including the steps to get your sources into your land layer in S3. Starburst Galaxy's superpower with dbt is being able to federate data from multiple different sources into one dbt repository, so showing multiple sources demonstrates that use case alongside the data lakehouse use case. If you are interested in only using S3, you can run all the TPCH and AWS models without creating a Snowflake login; the Snowflake portion will fail, but the rest should complete.

You will also need:

  • A dbt installation of your choosing. I used a virtual environment on my M1 Mac because that was the most recommended approach; I'll add those steps below in this README. Review the other dbt Core installation documentation to pick what works best for you.

Getting Started with AWS

  1. Sign up for an AWS account.
  2. Create an S3 bucket in the Ohio region (us-east-2). You must use this region because it is where the COVID-19 public data lake lives. Use all the defaults.
  3. Create an AWS access key that will be used as the Starburst Galaxy authentication method for connecting to S3.
    • Go to the IAM Management Console
    • Select Users
    • Select Add Users
    • Provide a descriptive user name, like <username>-aws-dbt
    • Select AWS Credential Type: Access key - Programmatic access
    • Set Permissions: Attach existing policies directly
    • Add the following policy: AmazonS3FullAccess
  4. Finish creating the access key with the rest of the defaults, and then save your AWS Access Key and Secret Access Key.

Getting Started with Starburst Galaxy

Create the source AWS catalog in Starburst Galaxy

  1. Sign up for a Starburst Galaxy free trial.

  2. Set up Starburst Galaxy and configure the S3 catalog to access your S3 bucket.

a. Navigate to the Catalogs tab and click Configure a Catalog. b. Create an S3 catalog with the following settings:

  • Catalog name: dbt_aws_source
  • Add a relevant description
  • Authenticate to S3 through the AWS Access Key/Secret created earlier
  • Metastore configuration: "I don't have a metastore"
  • Default directory name: source
  • Enable Allow creating external tables
  • Enable Allow writing to external tables
  • Select default table format: Hive
  • Hit Skip on the Set Permissions page

More information on connecting to S3 is available in the Starburst Galaxy documentation.

Create a Cluster in Starburst Galaxy

A cluster in Starburst Galaxy provides the resources needed to run queries against your catalogs. You can access the catalog data exposed by running clusters in the Query Editor; a quick sanity check is sketched after the list below.

  • Click Create a new cluster
  • Enter cluster name: dbt-aws
  • Also select the tpch catalog
  • Cluster size: Free
  • Cluster type: Standard
  • Catalogs: dbt_aws_source (select the catalog previously created)
  • Cloud provider region: US East (Ohio) aka us-east-2
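
Once the cluster is running, it's worth a quick sanity check from the Query Editor before moving on. This only touches the built-in tpch sample catalog, so nothing here depends on your S3 setup:

-- quick sanity check against the built-in sample catalog
SELECT name, regionkey
FROM tpch.tiny.nation
LIMIT 5;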

Create the target AWS catalog in Starburst Galaxy

  1. Configure a new S3 catalog to access your S3 bucket. This step is optional, depending on whether you want your structure and consume tables in Iceberg format (a sketch of what this enables follows the list below).

a. Navigate to the Catalogs tab and click Configure a Catalog. b. Create an S3 catalog with the following settings:

  • Catalog name: dbt_aws_tgt
  • Add a relevant description
  • Authenticate to S3 through the AWS Access Key/Secret created earlier
  • Metastore configuration: "I don't have a metastore"
  • Default directory name: target
  • Enable Allow creating external tables
  • Enable Allow writing to external tables
  • Select default table format: Iceberg
  • Hit Skip on the Set Permissions page
  • Add this catalog to the existing dbt-aws cluster
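
If you did create this Iceberg catalog, anything built in it will be stored as Iceberg tables. As a rough sketch of what that looks like in SQL (the schema and table names below are made up for illustration, and dbt will create the real ones for you):

-- hypothetical names for illustration; dbt creates the real schemas and tables
CREATE SCHEMA dbt_aws_tgt.example_structure
WITH (location = 's3://dbt-aws-<username>/target/example_structure');

-- tables created in this catalog default to the Iceberg format
CREATE TABLE dbt_aws_tgt.example_structure.example_table AS
SELECT 1 AS placeholder_column;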

Create a Snowflake catalog

Follow the instructions to create and configure your Snowflake catalog. Give your catalog a descriptive name such as dbt-snow. I made a Snowflake account just for this demo and didn't need to sign up for anything beyond the free trial. Connect your catalog to the dbt-aws cluster. Once you log into Snowflake (in the account admin role), navigate to the Marketplace tab on the left-hand side, search for the free COVID-19 Epidemiological Data listing, and click Open. Connect to the COVID19 database.

Note: You may need to update permissions if you do not have access to the account admin role.

grant import share on account to role DEMOROLE;
grant create database on account to role DEMOROLE;
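
After accepting the Marketplace listing, a quick way to confirm the share landed is to check for the database from a Snowflake worksheet (not Galaxy):

-- run in Snowflake: the shared database should now exist
SHOW DATABASES LIKE 'COVID19';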

At this point, you should have at least three catalogs attached to your cluster: tpch, dbt_aws_source, and your Snowflake catalog. If you want to use Iceberg, you should have four, adding dbt_aws_tgt.
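
To confirm this from the Query Editor, you can list the catalogs attached to the cluster and peek at the Snowflake catalog. The quoted name below assumes you called the catalog dbt-snow; hyphenated names need double quotes in SQL.

-- list every catalog attached to the dbt-aws cluster
SHOW CATALOGS;

-- hyphenated catalog names must be double-quoted
SHOW SCHEMAS FROM "dbt-snow";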

Create your AWS source tables

Since tpch is a sample catalog, and the COVID19.PUBLIC data from Snowflake syncs automatically, the only source you need to create yourself is the AWS table from the COVID-19 data lake.

  1. Follow the Starburst Galaxy data lake tutorial to create the enigma_jhu table using the existing cluster and catalog you created.
  2. Configure role-based access control. You need to grant access to both the COVID-19 data lake and the bucket you created for this tutorial:
  • s3://covid19-lake/*
  • s3://dbt-aws-<username>/*
  3. Create a schema in the source catalog, pointing at the bucket you created earlier.
create schema aws_covid_data_lake with (location='s3://dbt-aws-<username>/');
  4. Create the enigma_jhu table:
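-- external Hive table over the public enigma_jhu JSON files in the COVID-19 data lake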
CREATE TABLE enigma_jhu (
    fips varchar,
    admin2 varchar,
    province_state varchar,
    country_region varchar,
    last_update varchar,
    latitude double,
    longitude double,
    confirmed varchar,
    combined_key varchar
)
WITH (
    format = 'JSON',
    external_location = 's3://covid19-lake/enigma-jhu/json/'
);

You should now be able to see all your sources within your cluster.
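
As a final check, you can hit each of the three sources from the same Query Editor session. The tpch and AWS names match what was created above (this assumes the aws_covid_data_lake schema was created in the dbt_aws_source catalog); the last statement simply lists what the Snowflake share exposes.

-- built-in sample catalog
SELECT count(*) AS tpch_rows
FROM tpch.tiny.orders;

-- table created from the COVID-19 data lake
SELECT count(*) AS covid_data_lake_rows
FROM dbt_aws_source.aws_covid_data_lake.enigma_jhu;

-- list the tables exposed by the Snowflake share (pick any of these to query)
SHOW TABLES FROM "dbt-snow".public;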