Extract, process and unify data from multiple sources into a single PostgreSQL database.
The project goal is to transform and centralize the data infrastructure of a multinational retailer. The existing decentralized data architecture posed challenges in terms of efficiency, data accessibility, and strategic decision-making. By centralizing the data, the project aimed to streamline information management, enhance data consistency, and provide a unified and comprehensive view of the company's operations.
To achieve this goal, the project extracts and cleans data from multiple sources and uploads them into a new central PostgreSQL database.
- Key platforms and technologies: PostgreSQL, AWS (S3 Buckets, Amazon RDS), REST API
- Files parsed: Structured (JSON, CSV), Unstructured (PDF)
- Languages and libraries: Python (Pandas, NumPy, Boto3, SQLAlchemy)
Each table was extracted from a different source. Procedures followed to extract the data include (see the sketch after this list):
- Retrieving data from AWS RDS databases
- Retrieving data through REST API requests
- Downloading files from AWS S3 buckets
- Parsing:
- Unstructured (PDF) files
- Structured or Semi-Structured (CSV and JSON) files
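For illustration, the sketch below shows roughly how each of these extraction routes can be implemented with the libraries listed above (Boto3, Pandas, SQLAlchemy) plus the requests package. The bucket, endpoint, credential and table names are placeholders rather than the project's actual configuration; the real logic lives in data_extraction.py.

```python
import boto3
import pandas as pd
import requests
from sqlalchemy import create_engine

# NOTE: the bucket, key, endpoint, credential and table names below are
# illustrative placeholders, not the project's actual configuration.

# 1. Retrieve a table from an AWS RDS database via SQLAlchemy
rds_engine = create_engine("postgresql://user:password@rds-host:5432/source_db")
users_df = pd.read_sql_table("legacy_users", rds_engine)

# 2. Retrieve records through a REST API request
response = requests.get(
    "https://example.com/api/stores/1",
    headers={"x-api-key": "YOUR_API_KEY"},
)
store_df = pd.json_normalize(response.json())

# 3. Download a structured file from an S3 bucket and parse it with Pandas
s3 = boto3.client("s3")
s3.download_file("example-bucket", "data/products.csv", "products.csv")
products_df = pd.read_csv("products.csv")

# Parsing unstructured PDF files requires an extra dependency
# (e.g. tabula-py), which is omitted from this sketch.
```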
The database follows a star schema design, with orders_table as its central (fact) table.
A total of 5 dimension tables (dim_users, dim_store_details, dim_products, dim_date_times and dim_card_details) are then linked to the fact table through primary and foreign key relationships, as specified below.
The first step is to extract and clean the data originating from multiple sources and create a single, unified database containing this information.
To achieve this, three main utility class modules were created:
- database_utils.py: Utility class to connect and upload data to a database.
- data_extraction.py: Utility class to extract data from multiple sources, including: REST APIs, S3 buckets, structured and unstructured data files (e.g. .csv, .json, .pdf)
- data_cleaning.py: Utility class to clean data from specific data sources.
The main application logic to extract, clean and upload data to the central database is then defined in:
- main.py: Main script containing the application logic. It extracts and cleans data from multiple sources and uploads them to a local database (i.e. PostgreSQL), as sketched below.
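As a rough sketch of how these modules fit together in main.py (the class and method names here are illustrative assumptions, not necessarily the names used in the repository):

```python
# Hypothetical sketch of the flow in main.py -- class and method names are
# illustrative and may differ from the actual implementation.
from database_utils import DatabaseConnector
from data_extraction import DataExtractor
from data_cleaning import DataCleaning


def run_pipeline():
    connector = DatabaseConnector()   # wraps credentials and the SQLAlchemy engine
    extractor = DataExtractor()       # pulls data from RDS, S3, APIs and files
    cleaner = DataCleaning()          # fixes types, nulls and invalid records

    raw_users = extractor.read_rds_table("legacy_users")   # hypothetical method
    clean_users = cleaner.clean_user_data(raw_users)       # hypothetical method
    connector.upload_to_db(clean_users, "dim_users")       # hypothetical method


if __name__ == "__main__":
    run_pipeline()
```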
Now that the database has been created, the next step is to define the relationships between the database tables.
Before defining these relationships, all columns in the tables were cast to their correct data types. Information on the casting can be found in the casting .sql files inside the sql_schema/ folder.
To connect the fact table with the dimension tables, primary and foreign keys were defined, as shown in the primary_keys.sql and foreign_keys.sql files in the sql_schema/ folder.
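For reference, a minimal sketch of what such casting and key statements can look like when executed through SQLAlchemy is shown below. The column names and data types are illustrative assumptions, not the exact contents of the sql_schema/ files.

```python
from sqlalchemy import create_engine, text

# Illustrative only: the table/column names and types below are assumptions,
# not the exact statements in the sql_schema/ files.
engine = create_engine("postgresql://user:password@localhost:5432/sales_data")

with engine.begin() as conn:
    # Cast a column to its correct data type
    conn.execute(text(
        "ALTER TABLE orders_table ALTER COLUMN product_quantity TYPE SMALLINT "
        "USING product_quantity::SMALLINT;"
    ))
    # Define a primary key on a dimension table ...
    conn.execute(text("ALTER TABLE dim_users ADD PRIMARY KEY (user_uuid);"))
    # ... and the matching foreign key on the fact table
    conn.execute(text(
        "ALTER TABLE orders_table "
        "ADD FOREIGN KEY (user_uuid) REFERENCES dim_users (user_uuid);"
    ))
```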
Queries were performed to analyse and answer key questions on business performance.
All the queries can be found in the sql_analysis/ folder.
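As an example of the kind of query involved, the sketch below aggregates store counts per country, similar in spirit to stores_per_country.sql (the exact SQL in that file may differ), and loads the result into a Pandas DataFrame.

```python
import pandas as pd
from sqlalchemy import create_engine

# Illustrative query, similar in spirit to sql_analysis/stores_per_country.sql;
# the exact SQL in the repository may differ.
engine = create_engine("postgresql://user:password@localhost:5432/sales_data")

query = """
    SELECT country_code AS country, COUNT(*) AS number_of_stores
    FROM dim_store_details
    GROUP BY country_code
    ORDER BY number_of_stores DESC;
"""
print(pd.read_sql_query(query, engine))
```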
You can find the answers to some key questions below:
Stores per country (stores_per_country.sql):

country | number_of_stores |
---|---|
GB | 266 |
DE | 141 |
US | 32 |
Stores per locality (stores_per_locality.sql):

locality | total_no_stores |
---|---|
Chapletown | 14 |
Belper | 13 |
Bushey | 12 |
Exeter | 11 |
Arbroath | 10 |
High Wycombe | 10 |
Rutherglen | 10 |
Total sales per month (sales_per_month.sql):

total_sales | month |
---|---|
669827.29 | 8 |
664560.19 | 1 |
653897.52 | 10 |
647025.12 | 5 |
642987.86 | 7 |
642189.12 | 3 |
Online vs offline sales (sales_online.sql):

numbers_of_sales | product_quantities | location |
---|---|---|
26494 | 105981 | Web |
91202 | 366106 | Offline |
Sales per store type (sales_per_store_type.sql):

store_type | total_sales | percentage_total(%) |
---|---|---|
Local | 3397481.64 | 44.36 |
Web Portal | 1718046.71 | 22.43 |
Super Store | 1218631.59 | 15.91 |
Mall Kiosk | 695367.25 | 9.08 |
Outlet | 629175.05 | 8.22 |
Sales per month and year (sales_per_month_year.sql):

total_sales | year | month |
---|---|---|
27883.39 | 1994 | 3 |
27166.77 | 2019 | 1 |
27029.08 | 2009 | 8 |
26628.90 | 1997 | 11 |
26310.97 | 2018 | 12 |
26174.55 | 2017 | 9 |
26150.51 | 2019 | 8 |
25790.12 | 2010 | 5 |
25594.29 | 2000 | 1 |
25513.78 | 1996 | 8 |
Staff count per country (staff_count_per_country.sql):

total_staff_numbers | country_code |
---|---|
13132 | GB |
6054 | DE |
1245 | US |
Sales per store type in Germany (sales_per_store_type_germany.sql):

total_sales | store_type | country_code |
---|---|---|
1103848.76 | Local | DE |
382822.64 | Super Store | DE |
246432.01 | Mall Kiosk | DE |
197393.70 | Outlet | DE |
Average time between sales per year (time_between_sales_per_year.sql):

year | avg_time_between_sales |
---|---|
1998 | {"hours":2,"minutes":8,"seconds":6,"milliseconds":538.161} |
2003 | {"hours":2,"minutes":8,"seconds":36,"milliseconds":218.084} |
2017 | {"hours":2,"minutes":8,"seconds":39,"milliseconds":348.444} |
2005 | {"hours":2,"minutes":9,"milliseconds":174.073} |
1992 | {"hours":2,"minutes":9,"seconds":32,"milliseconds":62.921} |
2015 | {"hours":2,"minutes":9,"seconds":36,"milliseconds":903.552} |
1997 | {"hours":2,"minutes":9,"seconds":46,"milliseconds":934.207} |
2014 | {"hours":2,"minutes":10,"seconds":5,"milliseconds":558.445} |
2006 | {"hours":2,"minutes":10,"seconds":15,"milliseconds":623.512} |
2004 | {"hours":2,"minutes":10,"seconds":29,"milliseconds":774.281} |
2001 | {"hours":2,"minutes":10,"seconds":34,"milliseconds":39.781} |
2018 | {"hours":2,"minutes":10,"seconds":37,"milliseconds":13.92} |
1994 | {"hours":2,"minutes":10,"seconds":38,"milliseconds":998.756} |
2019 | {"hours":2,"minutes":10,"seconds":40,"milliseconds":488.303} |
2000 | {"hours":2,"minutes":10,"seconds":55,"milliseconds":241.739} |
1996 | {"hours":2,"minutes":11,"milliseconds":84.763} |
1999 | {"hours":2,"minutes":11,"seconds":5,"milliseconds":102.046} |
2010 | {"hours":2,"minutes":11,"seconds":7,"milliseconds":808.24} |
2007 | {"hours":2,"minutes":11,"seconds":8,"milliseconds":918.642} |
2009 | {"hours":2,"minutes":11,"seconds":18,"milliseconds":622.594} |
2021 | {"hours":2,"minutes":11,"seconds":44,"milliseconds":547.326} |
2012 | {"hours":2,"minutes":11,"seconds":47,"milliseconds":533.684} |
2020 | {"hours":2,"minutes":11,"seconds":58,"milliseconds":890.476} |
2011 | {"hours":2,"minutes":12,"seconds":19,"milliseconds":240.745} |
2016 | {"hours":2,"minutes":12,"seconds":50,"milliseconds":458.354} |
1995 | {"hours":2,"minutes":13,"milliseconds":53.404} |
2022 | {"hours":2,"minutes":13,"seconds":2,"milliseconds":3.698} |
2008 | {"hours":2,"minutes":13,"seconds":3,"milliseconds":770.202} |
2002 | {"hours":2,"minutes":13,"seconds":39,"milliseconds":915.69} |
1993 | {"hours":2,"minutes":15,"seconds":35,"milliseconds":481.806} |
2013 | {"hours":2,"minutes":17,"seconds":13,"milliseconds":712.533} |
For this step, we recommend cloning the repository using Git. If you do not have Git installed on your machine, check the resources here.
- Open a command prompt.
- Change the current working directory to the location where you want the cloned directory.
- Type git clone, and then paste the URL of the GitHub repository.
git clone https://github.com/pili-zhangqiu/multinational-retail-data-centralisation.git
To create the virtual environment, follow the instructions provided on the official Python page here.
After activating the virtual environment, install all the necessary libraries listed in requirements.txt. To do so, run the command:
$ pip install -r /path/to/requirements.txt
If you don't have Python, you can find more information on how to install it here.
To run this project, you will have to set up the following credentials and databases; a sketch of how the credentials can be loaded follows the list below:
- PostgreSQL - Store your credentials in a file called:
- db_creds_local.yaml
- AWS - Store your credentials in files called:
- db_creds_aws_sso.yaml
- db_creds_aws_api.yaml
- db_creds_aws_rds.yaml
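A minimal sketch of how such a credentials file might be read and turned into a SQLAlchemy engine is shown below. The YAML key names and the use of PyYAML are assumptions for illustration; check database_utils.py for the fields it actually expects.

```python
import yaml
from sqlalchemy import create_engine

# The key names below (HOST, USER, ...) are an assumption for illustration;
# check database_utils.py for the fields it actually expects.
with open("db_creds_local.yaml") as f:
    creds = yaml.safe_load(f)

engine = create_engine(
    f"postgresql://{creds['USER']}:{creds['PASSWORD']}"
    f"@{creds['HOST']}:{creds['PORT']}/{creds['DATABASE']}"
)
```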
If you followed the previous steps on installation, you should be able to run the main.py script to start the data centralisation process. This will initialise the data extraction and cleaning processes, after which the processed data will be uploaded to a central PostgreSQL database.
To run the script, follow the next steps:
- Open a command prompt.
- Navigate to the directory where the main.py script is located.
- Use the python command to run the script.
python main.py
The repository structure is as follows:
multinational-retail-data-centralisation/
├── main.py
│ Run this script to create a local database from the extracted data.
│
├── database_utils.py
│ Utility class to interact with databases.
├── data_extraction.py
│ Utility class to extract data from databases, API calls or files.
├── data_cleaning.py
│ Utility class to clean dataframes.
├── validation_utils.yaml
│
├── sql_schema/
│ Contains .sql files to cast columns and create relationships between tables.
│ ├── cast_dim_card_details_table.sql
│ ├── cast_dim_date_times_table.sql
│ ├── cast_dim_products_table.sql
│ ├── cast_dim_store_details_table.sql
│ ├── cast_dim_users_table.sql
│ ├── cast_orders_table.sql
│ ├── primary_keys.sql
│ └── foreign_keys.sql
│
├── sql_analysis/
│ Contains .sql files to perform data analysis.
│ ├── sales_online.sql
│ ├── sales_per_month.sql
│ ├── sales_per_month_year.sql
│ ├── sales_per_store_type.sql
│ ├── sales_per_store_type_germany.sql
│ ├── staff_count_per_country.sql
│ ├── stores_per_country.sql
│ ├── stores_per_locality.sql
│ └── time_between_sales_per_year.sql
│
├── img/
│ └── (...)
├── README.md
├── LICENSE.txt
└── .gitignore
Distributed under the Apache 2.0 License. See LICENSE.txt for more information.