Building a fully scalable ETL (Extract, Transform, Load) pipeline to handle large volumes of transaction data for a café business.
- Project Overview
- Project Vision
- Sprint 1 - ETL-Pipline
- Sprint 2 - AWS-Data-Warehouse
In this project, a fully scalable ETL (Extract, Transform, Load) pipeline is built to handle large volumes of transaction data for a café business that has many branches. This pipeline will collect all the transaction data generated by each individual café branch and place it in a single location. By being able to easily query the company's data as a whole, the client will drastically increase their ability to identify company-wide trends and insights.
Using agile methodology, the project was built slowly and adding more complexity to it as time progresses. The project consist of five sprints, where each sprint is a week in length with different task from the product owner in the product backlog.
Below is a vision of what both parties would like to produce as the end result.
- Each night a CSV for each branch will be uploaded to the cloud.
- The system we have developed will read each file and perform ETL steps.
- Data will be stored in a data warehouse
- Data analytics software will be used to create Business Intelligence analytics for the client
- Application monitoring software will be used to produce operational metrics, such as system errors, up-time and more.
Sprint 1 product backlog consist of:
A sample CSV file for the type of data we will be dealing with was provided by the client. As part of the PoC (Proof of Concept) stage the file was used to create our Extract stage of the pipeline. Requirements.txt was installed in other to install all library that will be used.
"""
pip install -r requirements.txt
"""
Python Script and Pandas Dataframe was used to extract the csv file for transformation.
The data contains sensitive data, which comes in the form of debit/credit card long numbers and customer names. This sensitive data was removed before it is stored.
After the transaction data has been cleaned, it was normalize into 3NF form using pandas library and jupyter Notebook. The data was separated the into a single record for the transaction itself, and one record each per basket item. Each basket record have a foreign key pointing to the transactions primary key.
SQL script was generated using psycopg2 and SQLALchemy to connect to the Postgresql Database
Schema was designed using the SQL data defination language to create tables and get the ER diagram shown below: