An implementation of an industry-grade data collection pipeline that runs scalably in the cloud.
In this data collection pipeline project, I collected tabular data and images from a website (https://www.boohoo.com). This data will be stored in a relational database and a data lake in the cloud.
Being a lover of all things fashion and beauty, I had three websites I wanted to scrape:
In this step, Selenium and Requests are imported to scrape the website, and a scraper class is created that contains all the methods used to navigate the site and extract the required data. Before the website can be accessed, an "accept cookies" iframe appears, so the initializer clicks the accept button immediately to make the data accessible.
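The cookie handling described above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the class layout, the `build_driver` helper, the `cookies_accepted` attribute, and both XPath locators are assumptions made for the example, and the real locators have to be read off the live page. The driver is passed in so the class can be exercised with a stub, and Selenium is imported only inside `build_driver`.

```python
class Scraper:
    """Sketch of a scraper that dismisses a cookie iframe on start-up."""

    # Hypothetical locators: inspect the live page to find the real ones.
    COOKIE_FRAME_XPATH = '//iframe[contains(@id, "cookie")]'
    ACCEPT_BUTTON_XPATH = '//button[contains(., "Accept")]'

    def __init__(self, driver, url="https://www.boohoo.com"):
        self.driver = driver
        self.driver.get(url)
        # Dismiss the cookie prompt straight away so later methods can
        # reach the page content.
        self.cookies_accepted = self.accept_cookies()

    def accept_cookies(self):
        """Click the accept button inside the cookie iframe.

        Returns True if the button was clicked, False otherwise.
        """
        try:
            frame = self.driver.find_element("xpath", self.COOKIE_FRAME_XPATH)
            self.driver.switch_to.frame(frame)
            self.driver.find_element("xpath", self.ACCEPT_BUTTON_XPATH).click()
            return True
        except Exception:
            # With a real driver this is typically NoSuchElementException
            # (no banner, or different markup). The broad catch keeps the
            # sketch free of a top-level Selenium import.
            return False
        finally:
            # Always return to the main document before scraping.
            self.driver.switch_to.default_content()


def build_driver(wait_seconds=10):
    """Create a Chrome driver (needs Selenium and a local Chrome install)."""
    from selenium import webdriver  # lazy import, see lead-in

    driver = webdriver.Chrome()
    # Implicit wait gives the cookie iframe time to appear.
    driver.implicitly_wait(wait_seconds)
    return driver
```

With Selenium and Chrome installed, `scraper = Scraper(build_driver())` would open the site and attempt to accept the cookies. Switching back with `default_content()` in the `finally` clause matters because any later `find_element` call would otherwise search inside the iframe rather than the main page.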
Some pieces of information were extracted from this website and stored in a dictionary. Scraped images