Udacity-Project-Data-Lakes

The aim of this project is to move the Sparkify data warehouse to a data lake, as the company's user base and song database have grown even more.

Approach

Build an S3-hosted data lake and an ETL pipeline that loads data from S3, processes it with Spark, and loads it back into S3 as a set of dimensional tables.
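
As a rough sketch of the pipeline's entry point, the ETL script might create its Spark session along these lines (the app name and hadoop-aws coordinates are assumptions, not taken from this repo):

```python
from pyspark.sql import SparkSession

# Minimal sketch: the Spark session used throughout the pipeline.
# The hadoop-aws package lets Spark talk to S3 via the s3a:// scheme;
# its version must match the Hadoop build bundled with your Spark.
spark = (SparkSession.builder
         .appName("sparkify-data-lake")  # assumed app name
         .config("spark.jars.packages",
                 "org.apache.hadoop:hadoop-aws:3.3.4")
         .getOrCreate())
```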

The data for the project resides in S3 at the following locations:

  1. Song data: s3://udacity-dend/song_data
  2. Log data: s3://udacity-dend/log_data

The song data is a subset of the Million Song Dataset, and the log data consists of JSON log files generated by an event simulator.
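
As an illustration, loading both datasets could look like the sketch below; the glob patterns assume the usual layout of these datasets (song files nested three directories deep, log files grouped by year and month) and are not confirmed by the repo:

```python
# Read the raw JSON datasets from S3. The s3a:// scheme requires the
# hadoop-aws package configured on the Spark session above.
song_df = spark.read.json("s3a://udacity-dend/song_data/*/*/*/*.json")
log_df = spark.read.json("s3a://udacity-dend/log_data/*/*/*.json")
```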

Database and Schema Design

The schema design employs dimensional modelling to enable fast retrieval of song-play analysis data by the Sparkify analytics team.

The design includes one fact table, songplays, which records facts about the songs a user listened to on Sparkify, such as the song's duration and the time the user started listening (a sketch of deriving the tables follows the dimension list below).

The dimension tables include:

  • artists: records about artists, such as name, location, latitude, and longitude
  • songs: records about songs, such as title, year, and duration
  • time: records of timestamps, such as start time, week, and month
  • users: records about users, such as first name, last name, and gender
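
Here is a hedged sketch of deriving these tables from the raw dataframes. The column names follow the fields listed above plus the usual fields of these datasets (ts, page, userId, sessionId, song, artist); all of them are assumptions rather than facts confirmed by the repo:

```python
from pyspark.sql.functions import col, from_unixtime, monotonically_increasing_id

# songs dimension: one row per song
songs_table = (song_df
    .select("song_id", "title", "artist_id", "year", "duration")
    .dropDuplicates(["song_id"]))

# artists dimension: one row per artist
artists_table = (song_df
    .selectExpr("artist_id",
                "artist_name as name",
                "artist_location as location",
                "artist_latitude as latitude",
                "artist_longitude as longitude")
    .dropDuplicates(["artist_id"]))

# songplays fact table: song-play log events ("NextSong" pages) joined
# to song metadata so each play records which song and artist it was.
# The ts field is assumed to be epoch milliseconds.
plays = (log_df
    .where(col("page") == "NextSong")
    .withColumn("start_time",
                from_unixtime(col("ts") / 1000).cast("timestamp")))

songplays_table = (plays
    .join(song_df, (plays.song == song_df.title) &
                   (plays.artist == song_df.artist_name), "left")
    .select("start_time",
            col("userId").alias("user_id"),
            "song_id",
            "artist_id",
            col("sessionId").alias("session_id"))
    .withColumn("songplay_id", monotonically_increasing_id()))
```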

How to run the project

  1. Enter your AWS access key and secret key in the dl.cfg config file (a sketch of how etl.py might read it follows these steps)
  2. Run etl.py to load data from S3, process it with Spark into analytics tables, and load them back into S3 as dimensional tables
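
For reference, a minimal sketch of reading the credentials; the [AWS] section name and key names are assumptions about dl.cfg's layout:

```python
import configparser
import os

# Read AWS credentials from dl.cfg and export them so that Spark's
# S3 connector can pick them up from the environment.
config = configparser.ConfigParser()
config.read("dl.cfg")

os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]
```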

References

- Reference #1: Parquet Files (see the write sketch below)

- Reference #2: API Reference
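
Tying back to the Parquet reference, the analytics tables are written back to S3 as Parquet files. A minimal sketch, where the output bucket is a placeholder and partitioning songs by year and artist_id is an assumed (though common) choice for this dataset:

```python
# Write the analytics tables back to S3 as Parquet. Partitioning the
# songs table by year and artist_id lets Spark prune partitions for
# typical per-year or per-artist queries.
output = "s3a://sparkify-data-lake/"  # placeholder bucket

(songs_table.write
    .mode("overwrite")
    .partitionBy("year", "artist_id")
    .parquet(output + "songs/"))

artists_table.write.mode("overwrite").parquet(output + "artists/")
```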
