- 📖 About the Project
- ⚙️ Getting Started
- 💡 Running the Scraper
- 📂 Project Structure
- ✨ Customization
- 📜 License
- 📚 References
## 📖 About the Project

This project demonstrates how to build a web scraper using Scrapy, a powerful Python framework for web scraping, and store the extracted data in MongoDB, a flexible NoSQL database.
The scraper is designed to extract product information from Amazon. It:
- Extracts relevant data like product names, prices, ratings, and images.
- Handles pagination to scrape data across multiple pages.
- Stores the extracted data in a MongoDB database for further analysis or processing.
The scraper can be used for various products, not just books. You can search for any product category by passing the desired keyword.
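For a concrete picture of what ends up in the database, a single stored document might look roughly like this (the field names and values here are assumptions for illustration; the actual fields are defined in `items.py`):

```python
# Hypothetical example of one scraped product as stored in MongoDB.
# Field names and values are illustrative only; see books/items.py for the real schema.
example_document = {
    "name": "Example Wireless Mouse",
    "price": "$24.99",
    "rating": "4.5 out of 5 stars",
    "image_url": "https://m.media-amazon.com/images/I/example.jpg",
}
```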
## ⚙️ Getting Started

Follow these steps to get the project running on your local machine. 🚀
Before running the scraper, ensure you have the following installed:
- 🐍 Python 3.9 or higher
- 🕷️ Scrapy
- 💾 MongoDB
- 🔗 pymongo
1. Clone the Repository:

   ```bash
   git clone https://github.com/yourusername/books_scraper.git
   cd books_scraper
   ```

2. Set Up a Virtual Environment:

   ```bash
   python -m venv venv
   ```

3. Activate the Virtual Environment:

   - On Windows:

     ```bash
     venv\Scripts\activate
     ```

   - On Unix or macOS:

     ```bash
     source venv/bin/activate
     ```

4. Install the Required Packages:

   ```bash
   pip install scrapy pymongo
   ```

5. Configure MongoDB:

   Ensure MongoDB is installed and running on your local machine. The default connection settings in the project are:

   - Host: `localhost`
   - Port: `27017`
   - Database: `books_db`
   - Collection: `books`

   If your MongoDB configuration differs, update the settings in `settings.py` accordingly.
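As a rough sketch, the MongoDB-related settings in `settings.py` might look like the following (the setting names `MONGO_URI`, `MONGO_DATABASE`, and `MONGO_COLLECTION`, as well as the pipeline class path, are assumptions; check the repository's `settings.py` for the actual names):

```python
# Hypothetical MongoDB settings in books/settings.py.
# The real setting names and pipeline class path may differ in this project.
MONGO_URI = "mongodb://localhost:27017"
MONGO_DATABASE = "books_db"
MONGO_COLLECTION = "books"

# Enable the pipeline that writes scraped items to MongoDB.
ITEM_PIPELINES = {
    "books.pipelines.MongoDBPipeline": 300,
}
```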
## 💡 Running the Scraper

To execute the scraper, use the following command:

```bash
scrapy crawl book -a keyword="laptops"
```
The scraper will:
- Use the passed keyword (default is "books").
- Start at the specified Amazon search URL.
- Navigate through the pages and extract data like product names, prices, ratings, and images.
- Store the extracted data in the MongoDB database.
If you do not pass a keyword, the scraper will default to searching for "books". Example:

```bash
scrapy crawl book
```
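Under the hood, Scrapy passes `-a` arguments to the spider's constructor. Below is a simplified sketch of how the keyword might be wired into the start URL; the actual `book_spider.py` may be structured differently:

```python
# Simplified sketch of a spider that accepts a -a keyword argument.
# The real implementation in books/spiders/book_spider.py may differ.
import scrapy


class BookSpider(scrapy.Spider):
    name = "book"

    def __init__(self, keyword="books", *args, **kwargs):
        super().__init__(*args, **kwargs)
        # -a keyword="laptops" arrives here as the keyword argument.
        self.keyword = keyword
        self.start_urls = [f"https://www.amazon.com/s?k={self.keyword}"]
```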
## 📂 Project Structure

The project follows Scrapy's standard structure:
```
books_scraper/
├── books/
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders/
│       ├── __init__.py
│       └── book_spider.py
├── scrapy.cfg
└── README.md
```
- `items.py`: Defines the data structure for the scraped items.
- `pipelines.py`: Contains the pipeline for processing and storing items in MongoDB.
- `settings.py`: Configuration settings for the Scrapy project, including MongoDB connection details.
- `spiders/book_spider.py`: The main spider responsible for scraping Amazon.
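To illustrate how the MongoDB storage step in `pipelines.py` can work, here is a minimal pipeline sketch (the class name and setting names are assumptions, not the project's exact code):

```python
# Minimal sketch of a Scrapy item pipeline that writes items to MongoDB.
# Names like MongoDBPipeline, MONGO_URI, and MONGO_DATABASE are illustrative.
import pymongo


class MongoDBPipeline:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Pull connection details from settings.py, falling back to the project defaults.
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI", "mongodb://localhost:27017"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "books_db"),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Each scraped item becomes one document in the "books" collection.
        self.db["books"].insert_one(dict(item))
        return item
```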
## ✨ Customization

To adapt the scraper for different keywords or websites:

1. Pass a Keyword Dynamically:

   The spider can be run with a dynamic keyword using the `-a` argument. Example for scraping products related to "laptops":

   ```bash
   scrapy crawl book -a keyword="laptops"
   ```

   By default, if no keyword is passed, the scraper will search for "books".

2. Update the `start_urls`:

   Modify the `start_urls` list in `book_spider.py` to point to a different website or category.

3. Adjust the Parsing Logic:

   Ensure the CSS selectors in the `parse` method of `book_spider.py` accurately target the desired data fields on the new website (see the sketch after this list).

4. Handle Pagination:

   If the target website uses a different pagination structure, update the pagination handling logic in the `parse` method accordingly.
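As a reference point for steps 3 and 4, a simplified `parse` method (defined inside the spider class) with CSS selectors and pagination handling might look like this; the selectors are assumptions and need to be verified against the target site's current markup:

```python
# Simplified sketch of a parse method with CSS selectors and pagination.
# The selectors below are illustrative; Amazon's markup changes frequently,
# so verify them against the live page (and book_spider.py) before use.
def parse(self, response):
    for product in response.css("div.s-result-item"):
        yield {
            "name": product.css("h2 a span::text").get(),
            "price": product.css("span.a-price > span.a-offscreen::text").get(),
            "rating": product.css("span.a-icon-alt::text").get(),
            "image_url": product.css("img.s-image::attr(src)").get(),
        }

    # Follow the "Next" link if one exists; adjust this selector to match
    # the pagination structure of the target website.
    next_page = response.css("a.s-pagination-next::attr(href)").get()
    if next_page:
        yield response.follow(next_page, callback=self.parse)
```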
## 📜 License

This project is licensed under the MIT License. See the LICENSE file for details. 📄
## 📚 References

For more detailed information on the tools and techniques used in this project, refer to the following resources:
If you like this project, please give it a ⭐ by clicking the star button at the top of the repository! It helps others discover the project and motivates me to improve it further. ❤️
This update allows users to pass a custom keyword for the search; if no keyword is passed, the default is "books".