This project demonstrates an approach to building a data-scraping API with FastAPI.
The scraping target is toscrape.com, a site built specifically so that people can practise their scraping skills.
The project uses:
- FastAPI - the API web framework;
- aiohttp - asynchronous requests over HTTP/HTTPS;
- bs4 with the lxml parser - parsing raw HTML documents.
All used packages are listed in requirements.txt.
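To show how these packages fit together, here is a minimal, self-contained sketch of the fetch-and-parse flow. It targets quotes.toscrape.com (one of the toscrape.com training sources); the CSS selectors are assumptions based on that page's markup, not the project's actual code:

```python
import asyncio

import aiohttp
from bs4 import BeautifulSoup


async def fetch_quotes(url: str = "https://quotes.toscrape.com/") -> list[dict]:
    # Request the raw HTML asynchronously via aiohttp.
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            html = await response.text()
    # Parse the document with bs4 backed by the lxml parser.
    soup = BeautifulSoup(html, "lxml")
    return [
        {
            "text": block.select_one(".text").get_text(strip=True),
            "author": block.select_one(".author").get_text(strip=True),
        }
        for block in soup.select(".quote")
    ]


if __name__ == "__main__":
    print(asyncio.run(fetch_quotes()))
```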
№ | Path | Description |
---|---|---|
1. | app/ | Package with all project logic. |
2. | app/core/ | Package that manages configurations. |
3. | app/core/config.py | Environment variable parsing using the Starlette approach (sketched just below the table). |
4. | app/quotes/ | Package of implemented scraper itself. |
5. | app/quotes/api.py | API methods for quotes scraper. |
6. | app/quotes/core.py | Scraper functionality: requests, parsing and serialisation. |
7. | app/quotes/examples.py | Examples of input and output data for API methods. |
8. | app/quotes/schemas.py | Structures for input and output data. |
9. | app/quotes/tests.py | Tests for quotes scraper. |
10. | app/router/ | Package storing endpoints for specific API versions. |
11. | app/router/api_v1/endpoints.py | Endpoints for API version 1. |
12. | app/conftest.py | Configuration for the pytest module. |
13. | app/main.py | Entrypoint for the scraping-api app. |
14. | .dockerignore | List of files ignored by Docker. |
15. | .gitignore | List of files ignored by Git. |
16. | dev.env | List of environment variables. |
17. | Dockerfile | Build and deployment instructions for the Docker image. |
18. | LICENSE | Description of license conditions. |
19. | pytest.ini | pytest configuration (test file locations). |
20. | README.md | Current file with project documentation. |
21. | requirements.txt | List of all Python packages used in the project. |
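As an illustration of row 3 above, this is a minimal sketch of Starlette-style settings loading, assuming dev.env holds key=value pairs; the variable names here are illustrative, not the project's actual settings:

```python
from starlette.config import Config

# Reads key=value pairs from dev.env, falling back to process env vars.
config = Config("dev.env")

DEBUG: bool = config("DEBUG", cast=bool, default=False)
PROJECT_NAME: str = config("PROJECT_NAME", default="data-scraping-api")
```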
Writing tests is a must-have practice in any project. Tests for scrapers have their own specifics because scrapers deal with dynamic data - data that changes constantly. So write tests carefully: they should be more abstract, asserting the structure of the scraped data rather than exact values, as in the sketch below.
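A sketch of that "structure, not values" testing style; get_quotes and its return shape are hypothetical stand-ins for the project's actual scraper interface in app/quotes/core.py:

```python
import asyncio

from app.quotes.core import get_quotes  # hypothetical import


def test_quotes_have_expected_shape():
    quotes = asyncio.run(get_quotes())
    assert quotes, "the scraper should return at least one quote"
    for quote in quotes:
        # Assert field presence and types, never exact text that may change.
        assert isinstance(quote["text"], str) and quote["text"]
        assert isinstance(quote["author"], str) and quote["author"]
```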
The code in this project is written in compliance with PEP 8. Sticking to a code style is extremely important: it helps other people understand the code. All imports in this project are sorted with the isort package, which keeps the codebase clean and consistent.
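For instance, isort arranges imports into alphabetised groups - standard library, then third-party, then local packages - separated by blank lines (the exact imports below are illustrative):

```python
# Standard library
import asyncio

# Third-party packages
import aiohttp
from bs4 import BeautifulSoup

# Local application packages
from app.quotes.schemas import Quote  # hypothetical local import
```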
Please be respectful to other developers!
Command to create a virtual environment under the project root; it isolates the project from other development work so there are no dependency collisions:
python -m venv venv/
Activate the virtual environment:
source venv/bin/activate
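On Windows, the equivalent command is venv\Scripts\activate.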
Command to install all dependencies into the created virtual environment:
pip install -r requirements.txt
Command to run tests:
pytest
Command to run the service:
uvicorn app.main:app
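The app.main:app target above points at the FastAPI instance in app/main.py. A minimal sketch of what that entrypoint might look like, with the router import path inferred from the project layout table (the names are assumptions):

```python
from fastapi import FastAPI

from app.router.api_v1.endpoints import router as api_v1_router  # assumed name

app = FastAPI(title="data-scraping-api")
app.include_router(api_v1_router, prefix="/api/v1")
```

During development, running uvicorn app.main:app --reload restarts the server automatically on code changes.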
More details on uvicorn can be found in its official documentation.
Command to build the Docker image (from the project root folder):
docker build -t data-scraping-api .
Command to run the service, publishing container port 80 on host port 4001 for access via http://localhost:4001:
docker run -d --name data-scraping-api -p 4001:80 data-scraping-api
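Once the container is up, FastAPI's interactive API documentation should be available at http://localhost:4001/docs (assuming the default docs settings are kept).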
To see all the available configurations and options, go to the page of the Docker base image: uvicorn-gunicorn-fastapi-docker.