Concept Web Scraper

This project is a Python-based web scraping application designed to extract, cache, and publish data from a target website. The application is built to run periodically, ensuring the most recent data is always available.

Architecture

The main script, app.py, orchestrates the process by leveraging classes from the scraping module. The FiftyA class is responsible for the actual web scraping, while ScrapeCacheContainer and ProducerContainer manage caching and data publishing, respectively.

The Scraper class in scraping/Scraper.py is a generic web scraper, equipped with a list of user-agent headers to rotate through for requests, a common technique to avoid being blocked by the target website.

Deployment

The project is containerized using Docker, allowing for easy deployment and scaling. The application and a Redis server are run as separate services, as defined in the docker-compose.yml file.

Future Improvements

Error Handling: Enhance error handling, especially for network errors during web scraping.
Logging: Improve logging to provide more insight into the application's operation and performance.
Testing: Add unit tests to ensure the reliability of the code.
Data Validation: Validate data before caching and publishing to ensure its integrity.
Scalability: Consider ways to scale the scraping process, such as distributed scraping, if the amount of data to be scraped is large or grows over time. Ideally with seperate workers, we can improve performance and avoid getting blocked from servers.

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
.github/workflows		.github/workflows
.vscode		.vscode
src		src
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
docker-compose.yml		docker-compose.yml
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Concept Web Scraper

Architecture

Deployment

Future Improvements

About

Releases

Packages

Languages

License

National-Police-Data-Coalition/Webscraper

Folders and files

Latest commit

History

Repository files navigation

Concept Web Scraper

Architecture

Deployment

Future Improvements

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages