- Sparkify, a music streaming startup, has grown its user base and song database and wants to move its processes and data onto the cloud. Its data resides in S3: a directory of JSON logs of user activity in the app, and a directory of JSON metadata on the songs in the app.
- This project builds an ETL pipeline for a database hosted on Redshift: it loads data from S3 into staging tables on Redshift, then executes SQL statements that create the analytics tables from those staging tables (see the sketch below).
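The two stages map onto two kinds of SQL statements. Below is a minimal sketch of what they might look like, written as Python strings in the style of `sql_queries.py`; the IAM role ARN, region, JSONPaths file, and the staging column names are assumptions, not the project's actual values.

```python
# Illustrative only: the real statements live in sql_queries.py.
# The role ARN, region, JSONPaths file, and column names are placeholders/assumptions.

staging_events_copy = """
    COPY staging_events
    FROM 's3://udacity-dend/log_data'
    IAM_ROLE '{iam_role_arn}'
    REGION 'us-west-2'
    FORMAT AS JSON '{jsonpaths_file}';
"""

songplay_table_insert = """
    INSERT INTO songplays (start_time, user_id, level, song_id,
                           artist_id, session_id, location, user_agent)
    SELECT TIMESTAMP 'epoch' + e.ts / 1000 * INTERVAL '1 second',
           e.userId, e.level, s.song_id, s.artist_id,
           e.sessionId, e.location, e.userAgent
    FROM staging_events e
    JOIN staging_songs  s
      ON e.song = s.title AND e.artist = s.artist_name
    WHERE e.page = 'NextSong';
"""
```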
- One source contains info about songs and artists:
  `s3://udacity-dend/song_data`
- The second source contains activity logs of all users of the app:
  `s3://udacity-dend/log_data`
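To get a quick look at the raw files before loading them, you can list a few objects under each prefix. This is an optional helper, not one of the project scripts; it assumes `boto3` is installed, AWS credentials are configured, and the bucket lives in `us-west-2`.

```python
# Optional helper (not part of the project scripts): peek at the raw S3 data.
# Assumes AWS credentials are configured and the bucket's region is us-west-2.
import boto3

s3 = boto3.client('s3', region_name='us-west-2')

for prefix in ('song_data', 'log_data'):
    resp = s3.list_objects_v2(Bucket='udacity-dend', Prefix=prefix, MaxKeys=5)
    print(f'--- {prefix} ---')
    for obj in resp.get('Contents', []):
        print(obj['Key'], obj['Size'])
```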
- `staging_songs`: info about songs and artists, loaded from `song_data`
- `staging_events`: actions of all users, loaded from `log_data`
- `songplays`: records in the event data associated with song plays
- `users`: users in the application
- `songs`: songs in the music database
- `artists`: artists in the music database
- `time`: timestamps of records in `songplays`, broken down into specific units
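As an illustration of the schema, here is a minimal sketch of the fact-table DDL in the style of `sql_queries.py`; the column names, types, and distribution/sort keys are assumptions inferred from the descriptions above, not the project's exact definitions.

```python
# Sketch only: column names, types, and dist/sort keys are assumptions
# inferred from the table descriptions above. The real DDL is in sql_queries.py.

songplay_table_create = """
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id INT IDENTITY(0, 1) PRIMARY KEY,
        start_time  TIMESTAMP NOT NULL SORTKEY,
        user_id     INT       NOT NULL,
        level       VARCHAR,
        song_id     VARCHAR   DISTKEY,
        artist_id   VARCHAR,
        session_id  INT,
        location    VARCHAR,
        user_agent  VARCHAR
    );
"""
```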
- `dwh.cfg`: config file containing info about S3, IAM, and Redshift.
- `sql_queries.py`: Python file containing all SQL queries, imported by the other scripts.
- `create_tables.py`: Python script to drop old tables and create new ones.
- `etl.py`: Python script to execute the ETL process in two stages: copy the JSON files from S3 into the staging tables, then load from the staging tables into the fact & dimension tables in AWS Redshift.
- `README.md`: this file, a description of the project.
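All of the scripts share the same pattern for reading `dwh.cfg` and opening a connection to the cluster. A minimal sketch with `configparser` and `psycopg2` follows; the section and key names are assumptions and must match whatever your `dwh.cfg` actually defines.

```python
# Sketch of how create_tables.py / etl.py might read dwh.cfg and connect.
# Section and key names below are assumptions; align them with your own dwh.cfg.
import configparser

import psycopg2

config = configparser.ConfigParser()
config.read('dwh.cfg')

conn = psycopg2.connect(
    host=config['CLUSTER']['HOST'],
    dbname=config['CLUSTER']['DB_NAME'],
    user=config['CLUSTER']['DB_USER'],
    password=config['CLUSTER']['DB_PASSWORD'],
    port=config['CLUSTER']['DB_PORT'],
)
cur = conn.cursor()
```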
- Setup in AWS: create a Redshift cluster with an IAM user and IAM role, and store the connection info in `dwh.cfg`.
- Run `create_tables.py` to DROP old tables and CREATE new ones in Redshift.
- Run `etl.py` to execute the ETL process.
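For reference, here is a hedged sketch of the two-stage flow inside `etl.py`, assuming a connection opened as in the `dwh.cfg` sketch above and that `sql_queries.py` exposes `copy_table_queries` and `insert_table_queries` lists (these names are assumptions).

```python
# Sketch of the two-stage flow inside etl.py, assuming a (cur, conn) pair
# opened as in the dwh.cfg sketch above, and that sql_queries.py exposes
# copy_table_queries and insert_table_queries (names are assumptions).
from sql_queries import copy_table_queries, insert_table_queries


def load_staging_tables(cur, conn):
    # Stage 1: COPY raw JSON from S3 into the staging tables.
    for query in copy_table_queries:
        cur.execute(query)
        conn.commit()


def insert_tables(cur, conn):
    # Stage 2: fill the fact & dimension tables from the staging tables.
    for query in insert_table_queries:
        cur.execute(query)
        conn.commit()
```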