A Python Scrapy-based crawler that scrapes the international section of Spiegel and stores the results in a MongoDB database.
- Install docker
- Install docker-compose
- Ensure port 27017 is open and not used by other processes (see the sketch after this list for one way to check)
- Open a terminal and clone this repository
```
git clone https://github.com/Logician724/spiegel-crawler.git
```
- Open the repository directory
```
cd spiegel-crawler
```
- Run docker-compose in detached mode
```
docker-compose up --build -d
```
- [Optional] To continuously read the logs, type into your terminal in your repository directory
```
docker-compose logs -f
```
- To stop the crawler, type into your terminal in your repository directory
```
docker-compose down
```
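If you are unsure whether another process is already using port 27017, you can check before starting the containers. This is a minimal sketch in Python using only the standard library; it is not part of the repository.

```python
# Check whether something is already listening on localhost:27017.
# Illustrative only; not part of the spiegel-crawler repository.
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a process is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1)
        # connect_ex returns 0 when the connection succeeds, i.e. the port is taken.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    if port_in_use(27017):
        print("Port 27017 is in use; stop the conflicting process first.")
    else:
        print("Port 27017 is free.")
```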
The script uses the scrapy library to crawl through the Spiegel international section. The most important modules are:

- `spiegel_spider.py`: Houses the crawler logic for extracting info from the website.
- `pipelines.py`: Contains a few pipelines that process the information extracted by the crawler (a sketch of what such pipelines look like follows this list):
  - `ArticleValidationPipeline`: Runs basic validation on the datatypes and lengths resulting from extraction.
  - `ArticleFormatPipeline`: Mostly strips leading and trailing whitespace from all data points.
  - `ArticleMongoDBPipeline`: Houses the logic that runs the duplicate check and either inserts a new document or updates an existing one in the database.
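For orientation, here is a minimal, hypothetical sketch of Scrapy item pipelines in the style described above. The class names match the description, but the internals and the field names (such as `url` as the duplicate key) are assumptions for illustration, not the repository's actual code.

```python
# Hypothetical sketch of pipelines in the style of pipelines.py.
# Field names such as "url" are assumptions for illustration.
import pymongo
from scrapy.exceptions import DropItem


class ArticleFormatPipeline:
    """Strip leading and trailing whitespace from every string field."""

    def process_item(self, item, spider):
        for key, value in item.items():
            if isinstance(value, str):
                item[key] = value.strip()
        return item


class ArticleMongoDBPipeline:
    """Upsert articles into MongoDB, treating the article URL as the duplicate key."""

    def __init__(self, mongo_uri, mongo_db, mongo_col):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.mongo_col = mongo_col

    @classmethod
    def from_crawler(cls, crawler):
        # Pull connection settings from the Scrapy settings object.
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI"),
            mongo_db=crawler.settings.get("MONGO_DB", "crawler"),
            mongo_col=crawler.settings.get("MONGO_COL", "crawler"),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.collection = self.client[self.mongo_db][self.mongo_col]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        if not item.get("url"):
            raise DropItem("Cannot deduplicate an article without a URL")
        # upsert=True either inserts a new document or updates the existing one,
        # which is the duplicate-check behaviour described above.
        self.collection.update_one(
            {"url": item["url"]}, {"$set": dict(item)}, upsert=True
        )
        return item
```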
There are 2 containers in the system:

- `spiegel_crawler` container: Runs the cron process that schedules the crawler script to run:
  - Once on script initialization, for the purpose of testing.
  - Once every 15 minutes, specifically on the 0th, 15th, 30th, and 45th minute of each hour (i.e. a cron schedule such as `0,15,30,45 * * * *`).
- `mongo` container: Runs the MongoDB database engine and listens for connections on port 27017.
There are a few ways to access the crawl result set, the fastest of which is as follows.

- Make sure the crawler is still running by inspecting the output of
```
docker container ls
```
  If you find a container named `mongo`, as in the following example output, then the crawler is still running.
```
CONTAINER ID   IMAGE                     COMMAND                  CREATED          STATUS          PORTS                                           NAMES
98bb44c05822   spiegel_crawler_crawler   "./entrypoint.sh"        32 minutes ago   Up 32 minutes                                                   spiegel-crawler
65d71518b49a   mongo                     "docker-entrypoint.s…"   32 minutes ago   Up 32 minutes   0.0.0.0:27017->27017/tcp, :::27017->27017/tcp   mongo
```
- Open the mongo shell
```
docker exec -it mongo mongo -u root -p root
```
- Switch to the crawler database
```
use crawler
```
- Return the crawler result set
```
db.crawler.find()
```
  A limited sample of the result set will be returned. As instructed, you can type `it` for more documents.
- You can filter your query results by following the MongoDB find API docs, e.g. to return all documents that contain the word `Free` in their abstract:
```
db.crawler.find({abstract: {$regex: "Free"}})
```
- Once done, type `exit` to leave the mongo shell.
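You can also query the result set from Python with pymongo instead of the mongo shell. The sketch below assumes the defaults shown above (`root`/`root` credentials, database `crawler`, collection `crawler`, MongoDB exposed on `localhost:27017`); adjust it if you changed the configuration.

```python
# Query the crawl result set from the host with pymongo.
# Assumes the defaults: root/root credentials, database "crawler",
# collection "crawler", MongoDB exposed on localhost:27017.
import pymongo

client = pymongo.MongoClient("mongodb://root:root@localhost:27017")
collection = client["crawler"]["crawler"]

# Equivalent of db.crawler.find({abstract: {$regex: "Free"}}) in the mongo shell.
for doc in collection.find({"abstract": {"$regex": "Free"}}).limit(5):
    print(doc)

client.close()
```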
The instructions above assume the default configuration in `env.default` has not been modified. In case of modification, you will have to change your commands accordingly.
The repository uses an `env.default` file that houses the default configuration for the repository for quick startup. The following variables can be changed or added in the `env.default` file to customize the setup (a sketch of how the crawler-side variables might be consumed follows this list):

- `MONGO_URI`: Used by the crawler to connect to the correct database. The connection string follows the standard MongoDB format.
- `MONGO_DB`: Used by the crawler to configure the name of the database into which the crawl run results are dumped. The default name is `crawler`.
- `MONGO_COL`: Used by the crawler to configure the name of the collection, inside the database specified by `MONGO_DB`, into which the crawl run results are dumped. The default collection name is `crawler`.
- `MONGO_INITDB_ROOT_USERNAME`: Used by the `mongo` image to configure the MongoDB root username.
- `MONGO_INITDB_ROOT_PASSWORD`: Used by the `mongo` image to configure the MongoDB root password.
- For more info on environment customization for the `mongo` image, check the official `mongo` image documentation on Docker Hub.
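As an illustration of how the crawler side might consume these variables, here is a hedged sketch; the actual wiring in the repository may differ (for example, going through Scrapy settings instead of reading the environment directly), and the fallback URI is an assumption.

```python
# Hypothetical sketch of consuming the crawler-side environment variables.
# The fallback values mirror the documented defaults and are assumptions.
import os

import pymongo

MONGO_URI = os.environ.get("MONGO_URI", "mongodb://root:root@localhost:27017")
MONGO_DB = os.environ.get("MONGO_DB", "crawler")
MONGO_COL = os.environ.get("MONGO_COL", "crawler")

client = pymongo.MongoClient(MONGO_URI)
collection = client[MONGO_DB][MONGO_COL]
print(collection.estimated_document_count(), "documents in",
      f"{MONGO_DB}.{MONGO_COL}")
client.close()
```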