The aim of this project is to move Sparkify's data warehouse into a data lake, as the company's user base and song database have grown even more.
We build an S3-hosted data lake and an ETL pipeline that loads data from S3, processes it with Spark, and writes it back to S3 as a set of dimensional tables.
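At a high level the whole pipeline is a single Spark job. Below is a minimal sketch of that flow, assuming a simple create-session-then-process structure; the function names, the hadoop-aws version, and the output bucket are illustrative assumptions, not necessarily what etl.py uses:

```python
from pyspark.sql import SparkSession

def create_spark_session():
    # Pull in the hadoop-aws package so Spark can read and write s3a:// paths;
    # the version should match your Hadoop build.
    return (SparkSession.builder
            .appName("sparkify-data-lake")
            .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0")
            .getOrCreate())

def main():
    spark = create_spark_session()
    input_path = "s3a://udacity-dend/"       # source data described below
    output_path = "s3a://my-sparkify-lake/"  # hypothetical output bucket
    # process_song_data(spark, input_path, output_path)  # would build the songs and artists tables
    # process_log_data(spark, input_path, output_path)   # would build the users, time and songplays tables

if __name__ == "__main__":
    main()
```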
Data for the project resides in S3 at the following locations:
- Song data: s3://udacity-dend/song_data
- Log data: s3://udacity-dend/log_data
The song data is a subset of the Million Song Dataset, and the log data consists of JSON log files generated by an event simulator.
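As a quick illustration, the two datasets might be read with Spark roughly as follows; the wildcard paths assume the usual year/month (logs) and A/B/C (songs) nesting of these buckets:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparkify-data-lake").getOrCreate()

# Read the raw JSON datasets from S3.
song_df = spark.read.json("s3a://udacity-dend/song_data/*/*/*/*.json")
log_df = spark.read.json("s3a://udacity-dend/log_data/*/*/*.json")

# Only "NextSong" events in the logs represent actual song plays.
log_df = log_df.filter(log_df.page == "NextSong")
```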
The schema design uses dimensional modelling to enable fast retrieval of song-play analysis data by the Sparkify analytics team.
My design includes one fact table, songplays, which records facts about the songs users listened to on Sparkify, such as the song's duration and the time the user started listening; how these tables can be built is sketched after the list below.
The dimension tables include:
- artists, records about artists such as name, location, latitude, and longitude
- songs, records about songs such as title, year, and duration
- time, records of timestamps such as start_time, week, and month
- users, records of users such as first name, last name, and gender
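The sketch below is a hedged illustration (not necessarily the exact code in etl.py) of how the songs dimension and the songplays fact table can be derived from the raw DataFrames read above and written back to S3 as partitioned parquet; the output bucket name is hypothetical:

```python
from pyspark.sql import functions as F

output_path = "s3a://my-sparkify-lake/"  # hypothetical output bucket

# songs dimension, built from song_df (read in the earlier sketch),
# partitioned by year and artist for faster selective reads.
songs_table = song_df.select("song_id", "title", "artist_id", "year", "duration").dropDuplicates()
songs_table.write.mode("overwrite").partitionBy("year", "artist_id").parquet(output_path + "songs/")

# songplays fact table: match log events (log_df) to song metadata on title and duration.
# Unmatched plays keep a null song_id/artist_id because of the left join.
songplays_table = (
    log_df.join(song_df,
                (log_df.song == song_df.title) & (log_df.length == song_df.duration),
                "left")
          .select(
              F.monotonically_increasing_id().alias("songplay_id"),
              (log_df.ts / 1000).cast("timestamp").alias("start_time"),
              log_df.userId.alias("user_id"),
              log_df.level,
              song_df.song_id,
              song_df.artist_id,
              log_df.sessionId.alias("session_id"),
              log_df.location,
              log_df.userAgent.alias("user_agent"))
          .withColumn("year", F.year("start_time"))
          .withColumn("month", F.month("start_time"))
)
songplays_table.write.mode("overwrite").partitionBy("year", "month").parquet(output_path + "songplays/")
```

The artists, users, and time dimensions follow the same select-deduplicate-write pattern.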
To run the project:
- Enter your AWS access key and secret key in the dl.cfg configuration file (an example layout is sketched after these steps)
- Run etl.py to load the data from S3, process it with Spark into analytics tables, and write them back to S3 as dimensional tables
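For reference, here is a hedged sketch of a possible dl.cfg layout and of how etl.py could export the credentials before Spark touches S3; the [AWS] section and key names are assumptions and should be matched to the actual dl.cfg template:

```python
# Sketch of how etl.py could pick up AWS credentials from dl.cfg.
# Assumed dl.cfg layout (section and key names are an assumption; keep this file out of version control):
#
#   [AWS]
#   AWS_ACCESS_KEY_ID = <your access key>
#   AWS_SECRET_ACCESS_KEY = <your secret key>
import configparser
import os

config = configparser.ConfigParser()
config.read("dl.cfg")

os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]
```

With the credentials exported, running `python etl.py` executes the full load, process, and write cycle described above.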