Spring-Crawler is a fun showcase project that crawls product data from hepsiburada.com and persists it in a PostgreSQL database.
PS: This project is for educational purposes only. Please respect the terms of service of the websites you crawl and do not use this project for unethical purposes.
- Crawl product data from hepsiburada.com
- Save product information to PostgreSQL database
- Distributed crawling using Redis for message queuing
- RESTful API endpoints for submitting URLs and checking product status
- Spring Boot
- PostgreSQL
- Redis
- Java
- Docker
- Swagger
- Submit URLs via the `/submit` endpoint
  - Only URLs from hepsiburada.com are accepted; others are rejected.
  - Duplicate URLs are saved and crawled only once.
- URLs are stored in PostgreSQL and messages are sent to Redis (see the sketch after this list)
- Workers listen for messages from Redis
- When notified, workers fetch URLs from PostgreSQL and crawl hepsiburada.com
- Crawled data is saved back to PostgreSQL
- Users can check product status via the `/product` endpoint
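The sketch below shows roughly how the submit-and-notify half of this flow could look in Spring. The class, repository, and channel names (`UrlSubmissionService`, `ProductRepository`, `crawl-requests`) are assumptions for illustration, not the project's actual code.

```java
// Hypothetical sketch of the submit -> PostgreSQL -> Redis hand-off.
// Product, ProductStatus and ProductRepository are assumed domain types.
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Service;

@Service
public class UrlSubmissionService {

    private static final String CRAWL_CHANNEL = "crawl-requests"; // assumed channel name

    private final ProductRepository productRepository;
    private final StringRedisTemplate redisTemplate;

    public UrlSubmissionService(ProductRepository productRepository,
                                StringRedisTemplate redisTemplate) {
        this.productRepository = productRepository;
        this.redisTemplate = redisTemplate;
    }

    public Product submit(String url) {
        // Only hepsiburada.com URLs are accepted; everything else is rejected.
        if (!url.startsWith("https://www.hepsiburada.com/")) {
            throw new IllegalArgumentException("Only hepsiburada.com URLs are accepted");
        }
        // Duplicate URLs are saved and crawled only once.
        return productRepository.findByUrl(url).orElseGet(() -> {
            Product product = productRepository.save(new Product(url, ProductStatus.PENDING));
            // Notify the workers via Redis pub/sub; they load the URL from PostgreSQL and crawl it.
            redisTemplate.convertAndSend(CRAWL_CHANNEL, String.valueOf(product.getId()));
            return product;
        });
    }
}
```

A worker subscribed to the same Redis channel would then mark the product IN_PROGRESS, fetch and parse the page, and save the result back to PostgreSQL as COMPLETED or FAILED.

A submitted URL moves through the following statuses: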
- PENDING: URL submitted, waiting to be processed
- IN_PROGRESS: Currently being crawled
- COMPLETED: Crawling finished, data saved
- FAILED: Crawling failed, unable to save data
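These states map naturally onto a small enum. The sketch below is purely illustrative; the enum name is an assumption rather than the project's actual type.

```java
// Lifecycle states described above (the enum name itself is an assumption).
public enum ProductStatus {
    PENDING,      // URL submitted, waiting to be processed
    IN_PROGRESS,  // currently being crawled
    COMPLETED,    // crawling finished, data saved
    FAILED        // crawling failed, unable to save data
}
```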
- Swagger UI: http://localhost:8080/swagger-ui.html
- `/submit`: Submit a product URL for crawling
  - Method: POST
  - Body: `{ "url": "https://www.hepsiburada.com/product-url" }`
- `/product`: Get product crawling status
  - Method: GET
  - Query Param: `id` (product ID)
- `/products`: Get all products
  - Method: GET
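A minimal controller exposing these three endpoints could look like the sketch below. The controller, service, and repository names are assumptions based on the endpoint descriptions above, not the project's actual classes.

```java
// Hypothetical REST controller matching the endpoints listed above.
// UrlSubmissionService, Product and ProductRepository are assumed types.
import java.util.List;
import org.springframework.web.bind.annotation.*;

@RestController
public class ProductController {

    private final UrlSubmissionService submissionService;
    private final ProductRepository productRepository;

    public ProductController(UrlSubmissionService submissionService,
                             ProductRepository productRepository) {
        this.submissionService = submissionService;
        this.productRepository = productRepository;
    }

    // POST /submit  body: { "url": "https://www.hepsiburada.com/product-url" }
    @PostMapping("/submit")
    public Product submit(@RequestBody SubmitRequest request) {
        return submissionService.submit(request.url());
    }

    // GET /product?id=<product ID>
    @GetMapping("/product")
    public Product product(@RequestParam Long id) {
        return productRepository.findById(id).orElseThrow();
    }

    // GET /products
    @GetMapping("/products")
    public List<Product> products() {
        return productRepository.findAll();
    }

    // Request body for /submit.
    public record SubmitRequest(String url) {}
}
```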
- Clone the repository
- Start PostgreSQL and Redis (Docker works well for this)
- Run the Spring Boot application
- Access the Swagger UI at http://localhost:8080/swagger-ui.html
- Submit a product URL using the `/submit` endpoint
- Check the product status using the `/product` endpoint
- View all products using the `/products` endpoint
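If you want to exercise the API without Swagger UI, a plain Java `HttpClient` works too. This is only a hypothetical client snippet; the product URL and the id used in the status check are placeholders.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Submits a product URL for crawling and then checks its status.
public class CrawlerClientExample {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // 1. Submit a hepsiburada.com product URL (placeholder URL).
        HttpRequest submit = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/submit"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(
                        "{ \"url\": \"https://www.hepsiburada.com/product-url\" }"))
                .build();
        System.out.println("Submit: " + client.send(submit, HttpResponse.BodyHandlers.ofString()).body());

        // 2. Check the crawling status of a product (id=1 is just an example).
        HttpRequest status = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/product?id=1"))
                .GET()
                .build();
        System.out.println("Status: " + client.send(status, HttpResponse.BodyHandlers.ofString()).body());
    }
}
```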
Contributions, issues, and feature requests are welcome!
This project is licensed under the MIT License.
Note: This is a really simple showcase and proof of concept for web crawling. In large-scale production environments, web crawling becomes significantly more complex. Many additional factors need to be considered, such as:
- Scalability and distributed crawling
- Respect for robots.txt and crawl-delay directives
- IP rotation and proxy management
- Handling rate limits and anti-scraping measures
- Data deduplication and storage optimization
- Error handling and retry mechanisms
- Legal and ethical considerations
This project serves as a starting point for understanding basic crawling concepts, but real-world implementations require careful planning and additional infrastructure.