Data-Collection-Pipeline

An implementation of an industry-grade data collection pipeline that scales in the cloud.

Step by Step Process

In this data collection pipeline project, I collected tabular data and images from a website (https://www.boohoo.com). This data will be stored in a relational database and a data lake in the cloud.

Step 1: Selecting a Website

As a lover of everything fashion and beauty, I had three websites in mind to scrape. The one used in this project is https://www.boohoo.com.

Step 2: Scraping

In this step, Selenium and Requests are imported to scrape the website. A scraper class was created that contains all the methods used to navigate the website and get the required data. Before the website could be accessed, an accept-cookies iframe appeared. The initializer clicks it immediately so that the data can be accessed.
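The initializer behaviour described above could look roughly like the sketch below. It is not the code from boohoo.py: the class name `Scraper`, the helper `make_chrome_driver`, and both CSS selectors are hypothetical placeholders, and the real selectors would have to be read from the page. The driver is passed in so the class can be exercised without opening a browser.

```python
from typing import Any

# Selenium's By.CSS_SELECTOR is the plain string "css selector"; using the
# string directly keeps this module importable without Selenium installed.
CSS = "css selector"


def make_chrome_driver() -> Any:
    """Create a Chrome driver. Selenium is imported lazily, only when needed."""
    from selenium import webdriver
    return webdriver.Chrome()


class Scraper:
    """Hypothetical scraper class; the default selectors are placeholders."""

    def __init__(
        self,
        driver: Any,
        url: str = "https://www.boohoo.com",
        cookie_frame_selector: str = "iframe#cookie-frame",      # assumption
        accept_button_selector: str = "button#accept-cookies",   # assumption
    ) -> None:
        self.driver = driver
        self.url = url
        self.cookie_frame_selector = cookie_frame_selector
        self.accept_button_selector = accept_button_selector
        # Open the site and dismiss the cookie prompt straight away.
        self.driver.get(self.url)
        self.cookies_accepted = self.accept_cookies()

    def accept_cookies(self) -> bool:
        """Switch into the cookie iframe, click accept, and switch back out."""
        try:
            frame = self.driver.find_element(CSS, self.cookie_frame_selector)
            self.driver.switch_to.frame(frame)
            self.driver.find_element(CSS, self.accept_button_selector).click()
            return True
        except Exception:
            # With Selenium installed, catching NoSuchElementException
            # specifically would be the tighter choice.
            return False
        finally:
            # Always return to the main document, whether or not the click worked.
            self.driver.switch_to.default_content()
```

With Selenium and a Chrome driver available, `Scraper(make_chrome_driver())` would open the site and attempt the click. In practice an explicit wait before looking for the iframe is usually needed, since the prompt may load after the page does.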

Step 3: Retrieving and Storing Data

Several pieces of information were extracted from the website and stored in a dictionary. Scraped images
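As a rough illustration of storing scraped fields in a dictionary, the sketch below builds one record per product and writes it to disk as JSON. The field names (`id`, `name`, `price`, `image_urls`), the folder layout, and the function names are assumptions made for the example, not taken from boohoo.py. The image download uses `requests.get` and is imported lazily so the dictionary part runs on its own.

```python
import json
import uuid
from pathlib import Path
from typing import Dict, List


def build_record(name: str, price: str, image_urls: List[str]) -> Dict:
    """Collect one product's scraped fields in a dictionary (hypothetical fields)."""
    return {
        "id": str(uuid.uuid4()),         # locally generated unique ID (assumption)
        "name": name,
        "price": price,
        "image_urls": list(image_urls),  # copy so later edits don't leak in
    }


def save_record(record: Dict, out_dir: Path) -> Path:
    """Write the record to <out_dir>/<id>/data.json and return that path."""
    folder = out_dir / record["id"]
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / "data.json"
    path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return path


def download_image(url: str, dest: Path) -> None:
    """Fetch one image with Requests and save its bytes to dest."""
    import requests  # imported lazily; only needed for downloads

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    dest.write_bytes(response.content)
```

A record built this way can later be loaded into a relational table, while the image files and JSON go to the data lake, matching the storage targets named at the top of this README.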

