Spring-Crawler is a fun showcase project that crawls product data from hepsiburada.com and persists it in a PostgreSQL database.
PS: This project is for educational purposes only. Please respect the terms of service of the websites you crawl and do not use this project for unethical purposes.
- Crawl product data from hepsiburada.com
- Save product information to PostgreSQL database
- Distributed crawling using Redis for message queuing
- RESTful API endpoints for submitting URLs and checking product status
- Spring Boot
- PostgreSQL
- Redis
- Java
- Docker
- Swagger
- Submit URLs via the `/submit` endpoint
  - Only URLs from hepsiburada.com are accepted; others are rejected.
  - Duplicate URLs are saved and crawled only once.
- URLs are stored in PostgreSQL and messages are sent to Redis (see the sketch after this list)
- Workers listen for messages from Redis
- When notified, workers fetch URLs from PostgreSQL and crawl hepsiburada.com
- Crawled data is saved back to PostgreSQL
- Users can check product status via the `/product` endpoint
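The sketch below shows roughly how the submit-and-notify half of this flow could look in Spring. The class, repository, and channel names (`UrlSubmissionService`, `ProductRepository`, `crawl-requests`) are assumptions for illustration, not the project's actual code.

```java
// Hypothetical sketch of the submit -> PostgreSQL -> Redis hand-off.
// Product, ProductStatus and ProductRepository are assumed domain types.
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Service;

@Service
public class UrlSubmissionService {

    private static final String CRAWL_CHANNEL = "crawl-requests"; // assumed channel name

    private final ProductRepository productRepository;
    private final StringRedisTemplate redisTemplate;

    public UrlSubmissionService(ProductRepository productRepository,
                                StringRedisTemplate redisTemplate) {
        this.productRepository = productRepository;
        this.redisTemplate = redisTemplate;
    }

    public Product submit(String url) {
        // Only hepsiburada.com URLs are accepted; everything else is rejected.
        if (!url.startsWith("https://www.hepsiburada.com/")) {
            throw new IllegalArgumentException("Only hepsiburada.com URLs are accepted");
        }
        // Duplicate URLs are saved and crawled only once.
        return productRepository.findByUrl(url).orElseGet(() -> {
            Product product = productRepository.save(new Product(url, ProductStatus.PENDING));
            // Notify the workers via Redis pub/sub; they load the URL from PostgreSQL and crawl it.
            redisTemplate.convertAndSend(CRAWL_CHANNEL, String.valueOf(product.getId()));
            return product;
        });
    }
}
```

A worker subscribed to the same Redis channel would then mark the product IN_PROGRESS, fetch and parse the page, and save the result back to PostgreSQL as COMPLETED or FAILED.

A submitted URL moves through the following statuses: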
- PENDING: URL submitted, waiting to be processed
- IN_PROGRESS: Currently being crawled
- COMPLETED: Crawling finished, data saved
- FAILED: Crawling failed, unable to save data
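These states map naturally onto a small enum. The sketch below is purely illustrative; the enum name is an assumption rather than the project's actual type.

```java
// Lifecycle states described above (the enum name itself is an assumption).
public enum ProductStatus {
    PENDING,      // URL submitted, waiting to be processed
    IN_PROGRESS,  // currently being crawled
    COMPLETED,    // crawling finished, data saved
    FAILED        // crawling failed, unable to save data
}
```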
- Swagger UI: http://localhost:8080/swagger-ui.html
- `/submit`: Submit a product URL for crawling
  - Method: POST
  - Body: `{ "url": "https://www.hepsiburada.com/product-url" }`
- `/product`: Get product crawling status
  - Method: GET
  - Query Param: `id` (product ID)
- `/products`: Get all products
  - Method: GET
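A minimal controller exposing these three endpoints could look like the sketch below. The controller, service, and repository names are assumptions based on the endpoint descriptions above, not the project's actual classes.

```java
// Hypothetical REST controller matching the endpoints listed above.
// UrlSubmissionService, Product and ProductRepository are assumed types.
import java.util.List;
import org.springframework.web.bind.annotation.*;

@RestController
public class ProductController {

    private final UrlSubmissionService submissionService;
    private final ProductRepository productRepository;

    public ProductController(UrlSubmissionService submissionService,
                             ProductRepository productRepository) {
        this.submissionService = submissionService;
        this.productRepository = productRepository;
    }

    // POST /submit  body: { "url": "https://www.hepsiburada.com/product-url" }
    @PostMapping("/submit")
    public Product submit(@RequestBody SubmitRequest request) {
        return submissionService.submit(request.url());
    }

    // GET /product?id=<product ID>
    @GetMapping("/product")
    public Product product(@RequestParam Long id) {
        return productRepository.findById(id).orElseThrow();
    }

    // GET /products
    @GetMapping("/products")
    public List<Product> products() {
        return productRepository.findAll();
    }

    // Request body for /submit.
    public record SubmitRequest(String url) {}
}
```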
- Clone the repository
- Start PostgreSQL and Redis (Docker works well for this)
- Run the Spring Boot application
- Access the Swagger UI at http://localhost:8080/swagger-ui.html
- Submit a product URL using the `/submit` endpoint
- Check the product status using the `/product` endpoint
- View all products using the `/products` endpoint
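If you want to exercise the API without Swagger UI, a plain Java `HttpClient` works too. This is only a hypothetical client snippet; the product URL and the id used in the status check are placeholders.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Submits a product URL for crawling and then checks its status.
public class CrawlerClientExample {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // 1. Submit a hepsiburada.com product URL (placeholder URL).
        HttpRequest submit = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/submit"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(
                        "{ \"url\": \"https://www.hepsiburada.com/product-url\" }"))
                .build();
        System.out.println("Submit: " + client.send(submit, HttpResponse.BodyHandlers.ofString()).body());

        // 2. Check the crawling status of a product (id=1 is just an example).
        HttpRequest status = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/product?id=1"))
                .GET()
                .build();
        System.out.println("Status: " + client.send(status, HttpResponse.BodyHandlers.ofString()).body());
    }
}
```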
Contributions, issues, and feature requests are welcome!
This project is licensed under the MIT License.
Note: This is a really simple showcase and proof of concept for web crawling. In large-scale production environments, web crawling becomes significantly more complex. Many additional factors need to be considered, such as:
- Scalability and distributed crawling
- Respect for robots.txt and crawl-delay directives
- IP rotation and proxy management
- Handling rate limits and anti-scraping measures
- Data deduplication and storage optimization
- Error handling and retry mechanisms
- Legal and ethical considerations
This project serves as a starting point for understanding basic crawling concepts, but real-world implementations require careful planning and additional infrastructure.