The aim of this project is to move Sparkify's data warehouse into a data lake, as the company's user base and song database have grown even more.
We build an S3-hosted data lake and an ETL pipeline that loads data from S3, processes it with Spark, and writes it back to S3 as a set of dimensional tables.
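At a high level the whole pipeline is a single Spark job. Below is a minimal sketch of that flow, assuming a simple create-session-then-process structure; the function names, the hadoop-aws version, and the output bucket are illustrative assumptions, not necessarily what etl.py uses:

```python
from pyspark.sql import SparkSession

def create_spark_session():
    # Pull in the hadoop-aws package so Spark can read and write s3a:// paths;
    # the version should match your Hadoop build.
    return (SparkSession.builder
            .appName("sparkify-data-lake")
            .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0")
            .getOrCreate())

def main():
    spark = create_spark_session()
    input_path = "s3a://udacity-dend/"       # source data described below
    output_path = "s3a://my-sparkify-lake/"  # hypothetical output bucket
    # process_song_data(spark, input_path, output_path)  # would build the songs and artists tables
    # process_log_data(spark, input_path, output_path)   # would build the users, time and songplays tables

if __name__ == "__main__":
    main()
```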
Data for the project resides in S3 at the following locations:
- Song data: s3://udacity-dend/song_data
- Log data: s3://udacity-dend/log_data
The song data is a subset of the Million Song Dataset, and the log data consists of JSON log files generated by an event simulator.
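As a quick illustration, the two datasets might be read with Spark roughly as follows; the wildcard paths assume the usual year/month (logs) and A/B/C (songs) nesting of these buckets:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparkify-data-lake").getOrCreate()

# Read the raw JSON datasets from S3.
song_df = spark.read.json("s3a://udacity-dend/song_data/*/*/*/*.json")
log_df = spark.read.json("s3a://udacity-dend/log_data/*/*/*.json")

# Only "NextSong" events in the logs represent actual song plays.
log_df = log_df.filter(log_df.page == "NextSong")
```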
The schema design uses dimensional modelling to enable fast retrieval of song-play analysis data by the Sparkify analytics team.
My design includes one fact table, songplays, which records facts about the songs users listened to on Sparkify, such as the song's duration and the time the user started listening; how these tables can be built is sketched after the list below.
The dimension tables include:
- artists, records about artists such as name, location, latitude, and longitude
- songs, records about songs such as title, year, and duration
- time, records of timestamps such as start_time, week, and month
- users, records of users such as first name, last name, and gender
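The sketch below is a hedged illustration (not necessarily the exact code in etl.py) of how the songs dimension and the songplays fact table can be derived from the raw DataFrames read above and written back to S3 as partitioned parquet; the output bucket name is hypothetical:

```python
from pyspark.sql import functions as F

output_path = "s3a://my-sparkify-lake/"  # hypothetical output bucket

# songs dimension, built from song_df (read in the earlier sketch),
# partitioned by year and artist for faster selective reads.
songs_table = song_df.select("song_id", "title", "artist_id", "year", "duration").dropDuplicates()
songs_table.write.mode("overwrite").partitionBy("year", "artist_id").parquet(output_path + "songs/")

# songplays fact table: match log events (log_df) to song metadata on title and duration.
# Unmatched plays keep a null song_id/artist_id because of the left join.
songplays_table = (
    log_df.join(song_df,
                (log_df.song == song_df.title) & (log_df.length == song_df.duration),
                "left")
          .select(
              F.monotonically_increasing_id().alias("songplay_id"),
              (log_df.ts / 1000).cast("timestamp").alias("start_time"),
              log_df.userId.alias("user_id"),
              log_df.level,
              song_df.song_id,
              song_df.artist_id,
              log_df.sessionId.alias("session_id"),
              log_df.location,
              log_df.userAgent.alias("user_agent"))
          .withColumn("year", F.year("start_time"))
          .withColumn("month", F.month("start_time"))
)
songplays_table.write.mode("overwrite").partitionBy("year", "month").parquet(output_path + "songplays/")
```

The artists, users, and time dimensions follow the same select-deduplicate-write pattern.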
To run the project:
- Enter your AWS access key and secret key in the dl.cfg configuration file (an example layout is sketched after these steps)
- Run etl.py to load the data from S3, process it with Spark into analytics tables, and write them back to S3 as dimensional tables
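For reference, here is a hedged sketch of a possible dl.cfg layout and of how etl.py could export the credentials before Spark touches S3; the [AWS] section and key names are assumptions and should be matched to the actual dl.cfg template:

```python
# Sketch of how etl.py could pick up AWS credentials from dl.cfg.
# Assumed dl.cfg layout (section and key names are an assumption; keep this file out of version control):
#
#   [AWS]
#   AWS_ACCESS_KEY_ID = <your access key>
#   AWS_SECRET_ACCESS_KEY = <your secret key>
import configparser
import os

config = configparser.ConfigParser()
config.read("dl.cfg")

os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]
```

With the credentials exported, running `python etl.py` executes the full load, process, and write cycle described above.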