The goal of this project is to generate a PostgreSQL database that hosts structured data related to songs and user activity, with the purpose of having a playground where we can obtain quick insights via SQL queries. To implement this, we organize the database into different tables following a star schema and ingest the data by means of an ETL (extract, transform, load) procedure.
The data used in this project is contained in two datasets: the Song dataset and the Log dataset. Both datasets can be found in the directory ./data.
The Song dataset is a subset of the Million Song Dataset. The data is organized with the following file structure:
./data/song_data/
└── A
    ├── A
    │   ├── A
    │   ├── B
    │   └── C
    └── B
        ├── A
        ├── B
        └── C
where the parameters of the song (song_id, title, duration, etc.) and the information about its artist (artist_id, artist_name, artist_location, etc.) are written in JSON format. As an example, you can find below the data corresponding to the song ./data/song_data/A/A/A/TRAAAAW128F429D538.json:
{
"num_songs":1,
"artist_id":"ARD7TVE1187B99BFB1",
"artist_latitude":null,
"artist_longitude":null,
"artist_location":"California - LA",
"artist_name":"Casual",
"song_id":"SOMZWCG12A8C13C480",
"title":"I Didn't Mean To",
"duration":218.93179,
"year":0
}
The Log dataset contains data generated by the event simulator Eventsim. Like the Song files, the log entries are written in JSON format, and the log files are labeled by date. They contain data similar to the activity logs produced by a music streaming application. Below you can find an example that illustrates the schema of the log files:
{
"artist":"Des'ree",
"auth":"Logged In",
"firstName":"Kaylee",
"gender":"F",
"itemInSession":1,
"lastName":"Summers",
"length":246.30812,
"level":"free",
"location":"Phoenix-Mesa-Scottsdale, AZ",
"method":"PUT",
"page":"NextSong",
"registration":1540344794796.0,
"sessionId":139,
"song":"You Gotta Be",
"status":200,
"ts":1541106106796,
"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"",
"userId":"8"
}
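To make the file formats concrete, the snippet below shows one way such files could be loaded in Python with pandas. It is only a sketch: the log file path is hypothetical (only the top-level ./data directory is documented), and it assumes each file stores its records as newline-delimited JSON.

```python
import pandas as pd

# The song file below is the example quoted above; the log path is a
# hypothetical illustration, since only ./data is documented.
song_file = "./data/song_data/A/A/A/TRAAAAW128F429D538.json"
log_file = "./data/log_data/2018-11-01-events.json"  # assumed name/layout

# Assuming each file stores its records as newline-delimited JSON
song_df = pd.read_json(song_file, lines=True)  # one-row DataFrame
log_df = pd.read_json(log_file, lines=True)    # one row per logged event

# Keep only song-play events, as suggested by the "page" field in the example
log_df = log_df[log_df["page"] == "NextSong"]
print(song_df[["song_id", "title", "artist_name"]])
```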
Since we are going to create the database from scratch, we will design it with a star schema. The star schema is widely used for analytical (OLAP) workloads, since it keeps aggregation queries simple and fast. To this end, we employ the songplays table as the fact table, and users, songs, artists and time as dimension tables. While the fact table encapsulates the information about the songs played by the users over time, the dimension tables serve descriptive purposes: song, user, artist and time related data. The fields corresponding to each of them are detailed below.
songplays (fact table):
songplay_id SERIAL PRIMARY KEY,
start_time BIGINT,
user_id INT,
level VARCHAR,
song_id VARCHAR,
artist_id VARCHAR,
session_id INT,
location VARCHAR,
user_agent VARCHAR

users (dimension table):
user_id INT PRIMARY KEY,
first_name VARCHAR,
last_name VARCHAR,
gender VARCHAR,
level VARCHAR

songs (dimension table):
song_id VARCHAR PRIMARY KEY,
title VARCHAR,
artist_id VARCHAR,
year INT,
duration FLOAT

artists (dimension table):
artist_id VARCHAR PRIMARY KEY,
artist_name VARCHAR,
artist_location VARCHAR,
artist_latitude FLOAT,
artist_longitude FLOAT

time (dimension table):
start_time BIGINT PRIMARY KEY,
hour INT,
day INT,
week INT,
month INT,
year INT,
weekday INT
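For reference, the field lists above map directly onto SQL DDL. Below is a minimal sketch for two of the tables; the authoritative statements live in ./create_tables.py and may differ in details such as NOT NULL constraints.

```sql
-- Sketch only: column names and types taken from the field lists above
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,
    start_time  BIGINT,
    user_id     INT,
    level       VARCHAR,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);

CREATE TABLE IF NOT EXISTS users (
    user_id    INT PRIMARY KEY,
    first_name VARCHAR,
    last_name  VARCHAR,
    gender     VARCHAR,
    level      VARCHAR
);
```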
As one can see in the following figure, the fact table is connected to the dimension tables through foreign keys, which makes it possible to use JOIN statements to bring the data together. This is useful when we need insights that require information from different tables.
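For instance, a query of the following shape (a sketch written for illustration, not taken from the project files) joins the songplays fact table with the users and time dimension tables to count song plays per subscription level and weekday:

```sql
-- Sketch: count song plays per user level and weekday by joining
-- the fact table with two of the dimension tables
SELECT u.level,
       t.weekday,
       COUNT(*) AS plays
FROM songplays sp
JOIN users u ON sp.user_id = u.user_id
JOIN time t ON sp.start_time = t.start_time
GROUP BY u.level, t.weekday
ORDER BY plays DESC;
```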
An automated procedure was created to set up the database with the aforementioned star schema and to integrate the song and log data into it. This procedure is carried out in two separate steps: (1) database generation and (2) data ingestion. These stages are encapsulated in the Python scripts ./create_tables.py and ./etl.py, respectively. Below you can find a brief description of how they work.
- Database generation: the script ./create_tables.py makes use of the Python library psycopg2 to create both the database, named sparkify, and the fact and dimension tables. This is achieved by means of CREATE DATABASE and CREATE TABLE SQL DDL statements. As a measure of caution, before any CREATE operation a DROP DATABASE (or DROP TABLE) statement is executed.
- Data ingestion: the script ./etl.py contains the ETL pipeline employed to ingest the song and log data located in ./data into the songplays, users, songs, artists and time tables. Similarly to the step above, the psycopg2 library is used to set up a connection to the sparkify database and, after the data is extracted and transformed into the correct format, different INSERT SQL statements are employed to ingest the data into these tables. This procedure is divided into two separate methods, process_song_file and process_log_file, according to the source of the raw data. A simplified sketch of both steps is shown right after this list.
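To make the flow concrete, below is a heavily simplified sketch of both stages in Python. The connection credentials are placeholders, only the songs table is handled, and the real logic in ./create_tables.py and ./etl.py differs in scope and detail.

```python
import glob
import pandas as pd
import psycopg2

# (1) Database generation: drop and re-create the sparkify database and its tables.
# Credentials below are placeholders; adjust to your local PostgreSQL setup.
conn = psycopg2.connect("host=127.0.0.1 dbname=postgres user=postgres password=postgres")
conn.set_session(autocommit=True)
cur = conn.cursor()
cur.execute("DROP DATABASE IF EXISTS sparkify")
cur.execute("CREATE DATABASE sparkify")
conn.close()

conn = psycopg2.connect("host=127.0.0.1 dbname=sparkify user=postgres password=postgres")
cur = conn.cursor()
cur.execute("DROP TABLE IF EXISTS songs")
cur.execute("""
    CREATE TABLE IF NOT EXISTS songs (
        song_id VARCHAR PRIMARY KEY, title VARCHAR,
        artist_id VARCHAR, year INT, duration FLOAT)
""")
conn.commit()

# (2) Data ingestion: extract each song file, transform it and insert it into songs
for filepath in glob.glob("./data/song_data/**/*.json", recursive=True):
    df = pd.read_json(filepath, lines=True)  # one JSON record per song file
    row = df.iloc[0]
    cur.execute(
        "INSERT INTO songs (song_id, title, artist_id, year, duration) "
        "VALUES (%s, %s, %s, %s, %s) ON CONFLICT (song_id) DO NOTHING",
        (row["song_id"], row["title"], row["artist_id"], int(row["year"]), float(row["duration"])),
    )
conn.commit()
conn.close()
```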
After the database is set up one can execute SQL queries to obtain insights on the data. Example queries are provided below:
Example query 1: How many users purchased the paid subscription? In comparison, are there more free tier users than paid subscribers?
SELECT
level,
COUNT(level) as level_count
FROM users
GROUP BY level
ORDER BY level_count DESC
Example query 2: What are the five longest songs in the database?
SELECT
title,
duration
FROM songs
ORDER BY duration DESC
LIMIT 5
Example query 3: How many different songs and artists are there in the database?
SELECT
COUNT(DISTINCT(song_id)) AS number_songs,
COUNT(DISTINCT(artist_id)) AS number_artists
FROM songs
LIMIT 5
The queries above are included in the notebook ./example_queries.ipynb.
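Besides the notebook, queries like these can also be run directly from Python with psycopg2; a minimal sketch, again with placeholder credentials:

```python
import psycopg2

# Placeholder credentials; adjust to your local PostgreSQL setup
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkify user=postgres password=postgres")
cur = conn.cursor()

# Example query 1: number of users per subscription level
cur.execute("""
    SELECT level, COUNT(level) AS level_count
    FROM users
    GROUP BY level
    ORDER BY level_count DESC
""")
for level, level_count in cur.fetchall():
    print(level, level_count)

conn.close()
```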
Below you can find the requirements to run the scripts ./create_tables.py and ./etl.py.