Skip to content

Latest commit

 

History

History
125 lines (98 loc) · 3.97 KB

File metadata and controls

125 lines (98 loc) · 3.97 KB

dbt and PostgreSQL for Analytics Engineering

Python dbt PostgreSQL

License

This project is meant for experimenting with dbt and the dbt-postgres adapter for Analytics, using NYC TLC Trip Record dataset as the datasource, with Kimball dimensional modeling technique.

NOTE: This is not meant for production use at scale, but rather for educational purposes only. Consider using RedShift, BigQuery, Snowflake or Databricks instead. If those options are too costy, or if you need something for on-premises deploy, consider Clickhouse instead.

Tech Stack

Up and Running

Developer Setup

1. Install the dependencies on pyproject.toml:

uv sync

2. Activate the virtualenv created by uv:

source .venv/bin/activate

3. (Optional) Install pre-commit:

brew install pre-commit

From root folder where .pre-commit-config.yaml is located, run:

pre-commit install

4. Setup dbt profiles.yaml accordingly (use the profiles.tmpl.yaml as template)

4.1. By default, the profiles_dir is the user '$HOME/.dbt/'

mkdir -p ~/.dbt/
cat profiles.tmpl.yml >> ~/.dbt/profiles.yml

4.2. Set the environment variables for dbt-postgres:

export DBT_POSTGRES_HOST=localhost
export DBT_POSTGRES_PORT=5432
export DBT_POSTGRES_DATABASE=nyc_taxi
export DBT_POSTGRES_SOURCE_SCHEMA=public
export DBT_POSTGRES_TARGET_SCHEMA=nyc_tlc_record_data
export DBT_POSTGRES_USER=postgres
export DBT_POSTGRES_PASSWORD=postgres

5. Install dbt dependencies and trigger the pipeline

5.1. Run dbt deps to install dbt plugins

dbt deps

5.2. Run dbt seed to push/create the tables from the .csv seed files to the target schema

dbt seed

5.3. Run dbt run to trigger the dbt models to run

dbt build --target [prod|dev]

# Alternatively you can run only a subset of the models with:

## +models/staging: Runs the dependencies/preceding models first that lead 
## to 'models/staging', and then the target models
dbt [build|run] --select +models/staging --target [prod|dev]

## models/staging+: Runs the target models first, and then all models that depend on it
dbt [build|run] --select models/staging+ --target [prod|dev]

6. Generate the Docs and the Data Lineage graph with:

dbt docs generate
dbt docs serve

Access the generated docs at:

open http://localhost:8080

Containerization and Testing

1. Build the Docker Image with:

docker build -t dbt-postgres:latest . --no-cache

2. Start a container with it:

docker run -d --rm \
  -e DBT_POSTGRES_HOST=host.docker.internal \
  -e DBT_POSTGRES_DATABASE=nyc_taxi \
  -e DBT_POSTGRES_SOURCE_SCHEMA=public \
  -e DBT_POSTGRES_TARGET_SCHEMA=nyc_tlc_record_data \
  -e DBT_POSTGRES_USER=postgres \
  -e DBT_POSTGRES_PASSWORD=postgres \
  --name dbt-postgres \
  dbt-postgres

TODO:

  • PEP-517: Packaging and dependency management with uv
  • Bootstrap dbt with PostgreSQL Adapter (dbt-postgres)
  • Generate and serve docs and Data Lineage Graphs locally
  • Add dbt macro to configure target schemas dinamically
  • Run dbt-core in Docker