Data-Collection-Pipeline

An implementation of an industry grade data collection pipeline that runs scalably in the cloud.

Step by Step Process

In this data collection pipeline project, I collected tabular data and images from a website (https://www.boohoo.com). The data will be stored in a relational database and a data lake in the cloud.

Step 1: Selecting a Website

As a lover of everything about fashion and beauty, I had three websites I wanted to scrape:

  1. https://www.cultbeauty.co.uk
  2. https://www.boohoo.com
  3. https://www.prettylittlething.com

Step 2: Scraping

In this step, Selenium and Requests are imported to scrape the website, and a scraper class was created that contains all the methods used to navigate the site and extract the required data. Before the website could be accessed, an accept-cookies iframe appeared. The initializer clicks it immediately so that the data can be accessed.
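The initializer behaviour described above could be sketched roughly as follows. This is a minimal illustration, not the repository's actual class: the names `Scraper`, `accept_cookies`, `make_chrome_driver`, and the `cookie_xpath` / `frame_xpath` parameters are assumptions, and the real XPaths have to be found by inspecting the page. The driver is passed in, so any Selenium WebDriver (or a stand-in with the same methods) can be used.

```python
import time


class Scraper:
    """Thin wrapper around a Selenium-style WebDriver (illustrative sketch)."""

    def __init__(self, driver, url, cookie_xpath, frame_xpath=None, timeout=10.0):
        self.driver = driver
        self.driver.get(url)
        # Dismiss the consent prompt up front so later lookups are not blocked.
        self.cookies_accepted = self.accept_cookies(cookie_xpath, frame_xpath, timeout)

    def accept_cookies(self, cookie_xpath, frame_xpath=None, timeout=10.0):
        """Click the accept button if it shows up within `timeout` seconds."""
        deadline = time.monotonic() + timeout
        while True:
            if self._try_click(cookie_xpath, frame_xpath):
                return True
            if time.monotonic() >= deadline:
                return False  # no prompt appeared; carry on without clicking
            time.sleep(0.5)

    def _try_click(self, cookie_xpath, frame_xpath):
        # "xpath" is the string value of selenium's By.XPATH constant.
        if frame_xpath is not None:
            frames = self.driver.find_elements("xpath", frame_xpath)
            if not frames:
                return False
            self.driver.switch_to.frame(frames[0])
        try:
            buttons = self.driver.find_elements("xpath", cookie_xpath)
            if not buttons:
                return False
            buttons[0].click()
            return True
        finally:
            # Always return to the main document after looking inside the iframe.
            if frame_xpath is not None:
                self.driver.switch_to.default_content()

    def close(self):
        self.driver.quit()


def make_chrome_driver():
    """Create a real Chrome driver; selenium is imported only when needed."""
    from selenium import webdriver  # third-party: pip install selenium
    return webdriver.Chrome()
```

With Selenium installed, a scraper would be created with something like `Scraper(make_chrome_driver(), "https://www.boohoo.com", cookie_xpath=...)`, where the XPath is whatever matches the site's accept button at the time of scraping.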

Step 3: Retrieving and Storing Data

Some information was extracted from the website and stored in a dictionary. Scraped images
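As a rough sketch of this step, the snippet below stores scraped details in a dictionary, writes it to disk as JSON, and downloads an image with Requests. The field names (`name`, `price`, `url`, `image_urls`), the function names, and the folder layout are all assumptions made for illustration; the README does not say which fields were collected or how files were organised.

```python
import json
import uuid
from pathlib import Path


def build_record(name, price, url, image_urls):
    """Collect the scraped details for one product into a dictionary."""
    return {
        "id": str(uuid.uuid4()),  # unique id so records can be told apart
        "name": name,
        "price": price,
        "url": url,
        "image_urls": list(image_urls),
    }


def save_record(record, out_dir):
    """Write the record to <out_dir>/<id>/data.json and return that path."""
    folder = Path(out_dir) / record["id"]
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / "data.json"
    path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return path


def download_image(url, dest_path, http=None):
    """Fetch one image and save its bytes to dest_path.

    `http` is any object with a requests-style get(); it defaults to the
    requests module, imported lazily so the rest of the code runs without it.
    """
    if http is None:
        import requests  # third-party: pip install requests
        http = requests
    response = http.get(url, timeout=30)
    response.raise_for_status()
    dest = Path(dest_path)
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(response.content)
    return dest
```

Passing `http` explicitly makes it possible to reuse a `requests.Session` across many downloads, or to substitute a stub when testing without network access.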

Step 4: Testing
