Extract, process and unify data from multiple sources into a single PostgreSQL database.
The project goal is to transform and centralize the data infrastructure of a multinational retailer. The existing decentralized data architecture posed challenges in terms of efficiency, data accessibility, and strategic decision-making. By centralizing the data, the project aimed to streamline information management, enhance data consistency, and provide a unified and comprehensive view of the company's operations.
To achieve this goal, the project extracts and cleans data from multiple sources and uploads them into a new central PostgreSQL database.
- Key platforms and technologies: PostgreSQL, AWS (S3 Buckets, Amazon RDS), REST API
- Files parsed: Structured (JSON, CSV), Unstructured (PDF)
- Languages and libraries: Python (Pandas, NumPy, Boto3, SQLAlchemy)
Each table was extracted from a different source. Procedures followed to extract the data include (see the sketch after this list):
- Retrieving data from AWS RDS databases
- Retrieving data through REST API requests
- Downloading files from AWS S3 buckets
- Parsing:
- Unstructured (PDF) files
- Structured or Semi-Structured (CSV and JSON) files
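For illustration, the sketch below shows roughly how each of these extraction routes can be implemented with the libraries listed above (Boto3, Pandas, SQLAlchemy) plus the requests package. The bucket, endpoint, credential and table names are placeholders rather than the project's actual configuration; the real logic lives in data_extraction.py.

```python
import boto3
import pandas as pd
import requests
from sqlalchemy import create_engine

# NOTE: the bucket, key, endpoint, credential and table names below are
# illustrative placeholders, not the project's actual configuration.

# 1. Retrieve a table from an AWS RDS database via SQLAlchemy
rds_engine = create_engine("postgresql://user:password@rds-host:5432/source_db")
users_df = pd.read_sql_table("legacy_users", rds_engine)

# 2. Retrieve records through a REST API request
response = requests.get(
    "https://example.com/api/stores/1",
    headers={"x-api-key": "YOUR_API_KEY"},
)
store_df = pd.json_normalize(response.json())

# 3. Download a structured file from an S3 bucket and parse it with Pandas
s3 = boto3.client("s3")
s3.download_file("example-bucket", "data/products.csv", "products.csv")
products_df = pd.read_csv("products.csv")

# Parsing unstructured PDF files requires an extra dependency
# (e.g. tabula-py), which is omitted from this sketch.
```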
The database follows a star schema design, with orders_table as its central (fact) table.
A total of 5 dimension tables (dim_users, dim_store_details, dim_products, dim_date_times and dim_card_details) are then linked to the fact table through primary and foreign key relationships, as specified below.
The first step is to extract and clean the data originating from multiple sources and create a single, unified database containing this information.
To achieve this, three main utility class modules were created:
- database_utils.py: Utility class to connect and upload data to a database.
- data_extraction.py: Utility class to extract data from multiple sources, including: REST APIs, S3 buckets, structured and unstructured data files (e.g. .csv, .json, .pdf)
- data_cleaning.py: Utility class to clean data from specific data sources.
The main application logic to extract, clean and upload data to the central database is then defined in:
- main.py: Main script containing the application logic. It extracts and cleans data from multiple sources and uploads them to a local database (i.e. PostgreSQL), as sketched below.
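As a rough sketch of how these modules fit together in main.py (the class and method names here are illustrative assumptions, not necessarily the names used in the repository):

```python
# Hypothetical sketch of the flow in main.py -- class and method names are
# illustrative and may differ from the actual implementation.
from database_utils import DatabaseConnector
from data_extraction import DataExtractor
from data_cleaning import DataCleaning


def run_pipeline():
    connector = DatabaseConnector()   # wraps credentials and the SQLAlchemy engine
    extractor = DataExtractor()       # pulls data from RDS, S3, APIs and files
    cleaner = DataCleaning()          # fixes types, nulls and invalid records

    raw_users = extractor.read_rds_table("legacy_users")   # hypothetical method
    clean_users = cleaner.clean_user_data(raw_users)       # hypothetical method
    connector.upload_to_db(clean_users, "dim_users")       # hypothetical method


if __name__ == "__main__":
    run_pipeline()
```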
Now that the database has been created, the next step is to define the relationships between the database tables.
Before defining these relationships, all columns in the tables were cast to their correct data types. Information on the casting can be found in the casting .sql files inside the sql_schema/ folder.
To connect the fact table with the dimension tables, primary and foreign keys were defined, as shown in the primary_keys.sql and foreign_keys.sql files in the sql_schema/ folder.
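For reference, a minimal sketch of what such casting and key statements can look like when executed through SQLAlchemy is shown below. The column names and data types are illustrative assumptions, not the exact contents of the sql_schema/ files.

```python
from sqlalchemy import create_engine, text

# Illustrative only: the table/column names and types below are assumptions,
# not the exact statements in the sql_schema/ files.
engine = create_engine("postgresql://user:password@localhost:5432/sales_data")

with engine.begin() as conn:
    # Cast a column to its correct data type
    conn.execute(text(
        "ALTER TABLE orders_table ALTER COLUMN product_quantity TYPE SMALLINT "
        "USING product_quantity::SMALLINT;"
    ))
    # Define a primary key on a dimension table ...
    conn.execute(text("ALTER TABLE dim_users ADD PRIMARY KEY (user_uuid);"))
    # ... and the matching foreign key on the fact table
    conn.execute(text(
        "ALTER TABLE orders_table "
        "ADD FOREIGN KEY (user_uuid) REFERENCES dim_users (user_uuid);"
    ))
```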
Queries were performed to analyse and answer key questions on business performance.
All the queries can be found in the sql_analysis/ folder.
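As an example of the kind of query involved, the sketch below aggregates store counts per country, similar in spirit to stores_per_country.sql (the exact SQL in that file may differ), and loads the result into a Pandas DataFrame.

```python
import pandas as pd
from sqlalchemy import create_engine

# Illustrative query, similar in spirit to sql_analysis/stores_per_country.sql;
# the exact SQL in the repository may differ.
engine = create_engine("postgresql://user:password@localhost:5432/sales_data")

query = """
    SELECT country_code AS country, COUNT(*) AS number_of_stores
    FROM dim_store_details
    GROUP BY country_code
    ORDER BY number_of_stores DESC;
"""
print(pd.read_sql_query(query, engine))
```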
You can find the answers to some key questions below:
Stores per country (stores_per_country.sql):

country | number_of_stores |
---|---|
GB | 266 |
DE | 141 |
US | 32 |
Stores per locality (stores_per_locality.sql):

locality | total_no_stores |
---|---|
Chapletown | 14 |
Belper | 13 |
Bushey | 12 |
Exeter | 11 |
Arbroath | 10 |
High Wycombe | 10 |
Rutherglen | 10 |
Total sales per month (sales_per_month.sql):

total_sales | month |
---|---|
669827.29 | 8 |
664560.19 | 1 |
653897.52 | 10 |
647025.12 | 5 |
642987.86 | 7 |
642189.12 | 3 |
Online vs offline sales (sales_online.sql):

numbers_of_sales | product_quantities | location |
---|---|---|
26494 | 105981 | Web |
91202 | 366106 | Offline |
Sales per store type (sales_per_store_type.sql):

store_type | total_sales | percentage_total(%) |
---|---|---|
Local | 3397481.64 | 44.36 |
Web Portal | 1718046.71 | 22.43 |
Super Store | 1218631.59 | 15.91 |
Mall Kiosk | 695367.25 | 9.08 |
Outlet | 629175.05 | 8.22 |
Sales per month and year (sales_per_month_year.sql):

total_sales | year | month |
---|---|---|
27883.39 | 1994 | 3 |
27166.77 | 2019 | 1 |
27029.08 | 2009 | 8 |
26628.90 | 1997 | 11 |
26310.97 | 2018 | 12 |
26174.55 | 2017 | 9 |
26150.51 | 2019 | 8 |
25790.12 | 2010 | 5 |
25594.29 | 2000 | 1 |
25513.78 | 1996 | 8 |
Staff count per country (staff_count_per_country.sql):

total_staff_numbers | country_code |
---|---|
13132 | GB |
6054 | DE |
1245 | US |
Sales per store type in Germany (sales_per_store_type_germany.sql):

total_sales | store_type | country_code |
---|---|---|
1103848.76 | Local | DE |
382822.64 | Super Store | DE |
246432.01 | Mall Kiosk | DE |
197393.70 | Outlet | DE |
Average time between sales per year (time_between_sales_per_year.sql):

year | avg_time_between_sales |
---|---|
1998 | {"hours":2,"minutes":8,"seconds":6,"milliseconds":538.161} |
2003 | {"hours":2,"minutes":8,"seconds":36,"milliseconds":218.084} |
2017 | {"hours":2,"minutes":8,"seconds":39,"milliseconds":348.444} |
2005 | {"hours":2,"minutes":9,"milliseconds":174.073} |
1992 | {"hours":2,"minutes":9,"seconds":32,"milliseconds":62.921} |
2015 | {"hours":2,"minutes":9,"seconds":36,"milliseconds":903.552} |
1997 | {"hours":2,"minutes":9,"seconds":46,"milliseconds":934.207} |
2014 | {"hours":2,"minutes":10,"seconds":5,"milliseconds":558.445} |
2006 | {"hours":2,"minutes":10,"seconds":15,"milliseconds":623.512} |
2004 | {"hours":2,"minutes":10,"seconds":29,"milliseconds":774.281} |
2001 | {"hours":2,"minutes":10,"seconds":34,"milliseconds":39.781} |
2018 | {"hours":2,"minutes":10,"seconds":37,"milliseconds":13.92} |
1994 | {"hours":2,"minutes":10,"seconds":38,"milliseconds":998.756} |
2019 | {"hours":2,"minutes":10,"seconds":40,"milliseconds":488.303} |
2000 | {"hours":2,"minutes":10,"seconds":55,"milliseconds":241.739} |
1996 | {"hours":2,"minutes":11,"milliseconds":84.763} |
1999 | {"hours":2,"minutes":11,"seconds":5,"milliseconds":102.046} |
2010 | {"hours":2,"minutes":11,"seconds":7,"milliseconds":808.24} |
2007 | {"hours":2,"minutes":11,"seconds":8,"milliseconds":918.642} |
2009 | {"hours":2,"minutes":11,"seconds":18,"milliseconds":622.594} |
2021 | {"hours":2,"minutes":11,"seconds":44,"milliseconds":547.326} |
2012 | {"hours":2,"minutes":11,"seconds":47,"milliseconds":533.684} |
2020 | {"hours":2,"minutes":11,"seconds":58,"milliseconds":890.476} |
2011 | {"hours":2,"minutes":12,"seconds":19,"milliseconds":240.745} |
2016 | {"hours":2,"minutes":12,"seconds":50,"milliseconds":458.354} |
1995 | {"hours":2,"minutes":13,"milliseconds":53.404} |
2022 | {"hours":2,"minutes":13,"seconds":2,"milliseconds":3.698} |
2008 | {"hours":2,"minutes":13,"seconds":3,"milliseconds":770.202} |
2002 | {"hours":2,"minutes":13,"seconds":39,"milliseconds":915.69} |
1993 | {"hours":2,"minutes":15,"seconds":35,"milliseconds":481.806} |
2013 | {"hours":2,"minutes":17,"seconds":13,"milliseconds":712.533} |
For this step, we recommend cloning the repository using Git. If you do not have Git installed on your machine, check the resources here.
- Open a command prompt.
- Change the current working directory to the location where you want the cloned directory.
- Type git clone, and then paste the URL of the GitHub repository.
git clone https://github.com/pili-zhangqiu/multinational-retail-data-centralisation.git
To create the virtual environment, follow the instructions provided on the official Python page here.
After activating the virtual environment, install all the necessary libraries listed in requirements.txt. To do so, run the command:
$ pip install -r /path/to/requirements.txt
If you don't have Python, you can find more information on how to install it here.
To run this project, you will have to set up the following credentials and databases; a sketch of how the credentials can be loaded follows the list below:
- PostgreSQL - Store your credentials in a file called:
- db_creds_local.yaml
- AWS - Store your credentials in files called:
- db_creds_aws_sso.yaml
- db_creds_aws_api.yaml
- db_creds_aws_rds.yaml
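A minimal sketch of how such a credentials file might be read and turned into a SQLAlchemy engine is shown below. The YAML key names and the use of PyYAML are assumptions for illustration; check database_utils.py for the fields it actually expects.

```python
import yaml
from sqlalchemy import create_engine

# The key names below (HOST, USER, ...) are an assumption for illustration;
# check database_utils.py for the fields it actually expects.
with open("db_creds_local.yaml") as f:
    creds = yaml.safe_load(f)

engine = create_engine(
    f"postgresql://{creds['USER']}:{creds['PASSWORD']}"
    f"@{creds['HOST']}:{creds['PORT']}/{creds['DATABASE']}"
)
```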
If you followed the previous steps on installation, you should be able to run the main.py script to start the data centralisation process. This will initialise the data extraction and cleaning processes, after which the processed data will be uploaded to a central PostgreSQL database.
To run the script, follow the next steps:
- Open a command prompt.
- Navigate to the directory where the main.py script is located.
- Use the python command to run the script.
python main.py
The repository structure is as follows:
multinational-retail-data-centralisation/
├── main.py
│ Run this script to create a local database from the extracted data.
│
├── database_utils.py
│ Utility class to interact with databases.
├── data_extraction.py
│ Utility class to extract data from databases, API calls or files.
├── data_cleaning.py
│ Utility class to clean dataframes.
├── validation_utils.yaml
│
├── sql_schema/
│ Contains .sql files to cast columns and create relationships between tables.
│ ├── cast_dim_card_details_table.sql
│ ├── cast_dim_date_times_table.sql
│ ├── cast_dim_products_table.sql
│ ├── cast_dim_store_details_table.sql
│ ├── cast_dim_users_table.sql
│ ├── cast_orders_table.sql
│ ├── primary_keys.sql
│ └── foreign_keys.sql
│
├── sql_analysis/
│ Contains .sql files to perform data analysis.
│ ├── sales_online.sql
│ ├── sales_per_month.sql
│ ├── sales_per_month_year.sql
│ ├── sales_per_store_type.sql
│ ├── sales_per_store_type_germany.sql
│ ├── staff_count_per_country.sql
│ ├── stores_per_country.sql
│ ├── stores_per_locality.sql
│ └── time_between_sales_per_year.sql
│
├── img/
│ └── (...)
├── README.md
├── LICENSE.txt
└── .gitignore
Distributed under the Apache 2.0 License. See LICENSE.txt for more information.