Crawl Managers
Previous Chapter: General Description
The simplest workflow can be defined with the CrawlManager class. This class schedules a
single spider job. It is not very useful by itself, but it helps to illustrate the basic concepts.
The first step is to create a crawl manager script in your project repository, to be deployed in ScrapyCloud. Save the following lines in a file called, for example, scripts/crawlmanager.py:
from shub_workflow.crawl import CrawlManager

if __name__ == '__main__':
    crawlmanager = CrawlManager()
    crawlmanager.run()
Then add a suitable scripts line to your project's setup.py. For example:
import glob
from setuptools import setup, find_packages

setup(
    name='project',
    version='1.0',
    packages=find_packages(),
    scripts=glob.glob('scripts/*.py'),
    entry_points={'scrapy': ['settings = myproject.settings']},
)
Let's analyze the help printed when the script is called from the command line with the -h option:
> python crawlmanager.py -h
usage: You didn't set description for this script. Please set description property accordingly.
[-h] [--project-id PROJECT_ID] [--name NAME] [--flow-id FLOW_ID] [--tag TAG]
[--loop-mode SECONDS] [--max-running-jobs MAX_RUNNING_JOBS]
[--spider-args SPIDER_ARGS] [--job-settings JOB_SETTINGS]
[--units UNITS]
spider
positional arguments:
spider Spider name
optional arguments:
-h, --help show this help message and exit
--project-id PROJECT_ID
Overrides target project id.
--name NAME Script name.
--flow-id FLOW_ID If given, use the given flow id.
--tag TAG Add given tag to the scheduled jobs. Can be given
multiple times.
--loop-mode SECONDS If provided, manager will run in loop mode, with a
cycle each given number of seconds. Default: 0
--max-running-jobs MAX_RUNNING_JOBS
If given, don't allow more than the given jobs running
at once. Default: inf
--resume-workflow Resume workflow. You must use it in combination with --flow-id in order to set the flow id of the worklow you want to resume.
--spider-args SPIDER_ARGS
Spider arguments dict in json format
--job-settings JOB_SETTINGS
Job settings dict in json format
--units UNITS Set number of ScrapyCloud units for each job
Some of the options are inherited from parent classes; others are added by the CrawlManager class itself. The first thing that may grab your attention is the initial description message: You didn't set description for this script. Please set description property accordingly. Every script subclassed from the base script class prints this message if no description was defined for it (or for a parent class). To define one, add the description property. In our example, it could be something like this:
from shub_workflow.crawl import CrawlManager as SHCrawlManager

class CrawlManager(SHCrawlManager):

    @property
    def description(self):
        return 'Crawl manager for MyProject.'

if __name__ == '__main__':
    crawlmanager = CrawlManager()
    crawlmanager.run()
Let's focus on the command line options and arguments. The first seven options (from --project-id to --resume-workflow) are inherited from the base script class.
--project-id
When a shub-workflow script runs in ScrapyCloud, the project id where it operates is autodetected: by default it is the id of the ScrapyCloud project where the script itself is running.
In the context of a script that schedules other jobs (from now on, a manager script), like our crawl manager, this project id determines the target project where the children jobs run. For some applications you may want to run jobs in a different project than the one where the manager is running; in those cases, provide the --project-id option. It is also possible to run the manager outside ScrapyCloud. In that case the project id cannot be autodetected, so you must provide it either with the --project-id option or with the PROJECT_ID environment variable.
(When a shub-workflow script is invoked on the command line, it tries to guess the project id from the default entry in the project's scrapinghub.yml. To override it, or to provide it when no such entry is available, use either the PROJECT_ID environment variable or the --project-id command line option.)
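The lookup described above can be pictured with a small sketch. This is not the library's code: the helper name resolve_project_id, the precedence (command line option first, then PROJECT_ID, then the default entry) and the plain dict standing in for an already parsed scrapinghub.yml are all assumptions made for illustration.

```python
import os
from typing import Mapping, Optional


def resolve_project_id(
    cli_value: Optional[str],
    environ: Mapping[str, str] = os.environ,
    shub_config: Optional[dict] = None,
) -> Optional[int]:
    """Return the first project id found, or None (hypothetical helper)."""
    # 1. An explicit --project-id value wins (assumed precedence).
    if cli_value:
        return int(cli_value)
    # 2. Then the PROJECT_ID environment variable.
    if environ.get("PROJECT_ID"):
        return int(environ["PROJECT_ID"])
    # 3. Finally the default entry of an already parsed scrapinghub.yml
    #    (the {"projects": {"default": ...}} layout is an assumption).
    default = (shub_config or {}).get("projects", {}).get("default")
    return int(default) if default is not None else None


config = {"projects": {"default": 12345}}  # hypothetical project id
print(resolve_project_id(None, environ={}, shub_config=config))
print(resolve_project_id("999", environ={"PROJECT_ID": "777"}, shub_config=config))
```

Passing the environment and the parsed configuration as arguments keeps the sketch easy to try without touching real settings.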
--name
The --name
option allows to assign a workflow name to the script. The same script can run in the context of many different workflows (not only instances of the same
workflow), and a name identification can be useful in some situations.
--flow-id
The flow id identifies a specific instance of a workflow. If this option is not provided, a flow id is autogenerated, added to the job tags of the manager script itself, and propagated to all the children jobs it schedules. In this way, different jobs running in ScrapyCloud can be related to the same workflow instance, which allows consistency between them in ways that we will see later. You may also want to set the flow id via the command line, for example when resuming jobs or when manually scheduling jobs associated with a specific workflow instance.
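As a rough illustration of how a flow id can tie jobs together through tags, here is a minimal sketch. The FLOW_ID=<value> tag format, the use of a UUID, and both helper names are assumptions for the example, not a description of the library internals.

```python
from typing import Iterable, List, Optional
from uuid import uuid4

FLOW_ID_PREFIX = "FLOW_ID="  # assumed tag format, for illustration only


def get_or_create_flow_id(tags: Iterable[str], cli_flow_id: Optional[str] = None) -> str:
    """Pick the flow id from --flow-id, from existing tags, or generate one."""
    if cli_flow_id:
        return cli_flow_id
    for tag in tags:
        if tag.startswith(FLOW_ID_PREFIX):
            return tag[len(FLOW_ID_PREFIX):]
    # Nothing given: autogenerate a new identifier.
    return str(uuid4())


def child_tags(flow_id: str, extra_tags: Iterable[str] = ()) -> List[str]:
    """Tags a child job would carry: the flow id tag plus any custom tags."""
    return [f"{FLOW_ID_PREFIX}{flow_id}", *extra_tags]


flow_id = get_or_create_flow_id(tags=[])
print(child_tags(flow_id, ["weekly"]))
```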
--tag
The --tag
command line option adds custom tags to the children jobs. It can be given multiple times.
--loop-mode
By default, a workflow manager script performs a single loop and exits. The crawl manager, for example, will schedule a spider job and finish. If you set loop mode, however, it will stay alive, looping every given number of seconds and checking the status of the scheduled job on each loop. Once the job is finished, the crawl manager finishes too. This is not very useful for this crawl manager, but most workflows need their manager to work in loop mode, for scheduling new jobs as previous ones finish, monitoring the status of the workflow, etc. To make the crawl manager script work in loop mode, you can either:
- In your custom crawl manager class, set the class attribute loop_mode to an integer that determines the number of seconds the manager must sleep between loop executions (loop_mode = 0, which is the default, disables looping).
- Override the default looping value of your class with the command line option --loop-mode.
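The difference between a single pass and loop mode can be sketched with a toy loop. The function run_manager and its workflow_loop callback are hypothetical names for this example and do not reproduce the library's implementation.

```python
import time
from typing import Callable


def run_manager(workflow_loop: Callable[[], bool], loop_mode: int = 0) -> int:
    """Run workflow_loop once, or repeatedly every loop_mode seconds.

    workflow_loop returns True while there is still work to do.
    Returns the number of loops executed.
    """
    loops = 0
    while True:
        loops += 1
        keep_going = workflow_loop()
        # loop_mode == 0 disables looping: a single pass and exit.
        if not loop_mode or not keep_going:
            return loops
        time.sleep(loop_mode)


# Toy job that reports "running" twice and then "finished".
statuses = iter(["running", "running", "finished"])
print(run_manager(lambda: next(statuses) != "finished", loop_mode=1))
```

With loop_mode left at 0 the callback runs exactly once, which mirrors the default behaviour described above.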
--max-running-jobs
Another setting inherited from the base workflow manager limits the maximum number of children jobs that can be running at a given moment. By default
there is no maximum. You can set a limit either with the class attribute default_max_jobs or with the command line option --max-running-jobs.
--resume-workflow
A flag option. It must be used in combination with --flow-id. When you set a flow id in this way and add --resume-workflow, the crawl manager will infer the status
of a workflow by reading information from all the jobs with the same flow id, and resume from there. At the moment it is only implemented for CrawlManager and its subclasses.
In this case, the manager will check all the spider jobs with the same flow id and, if some are running, it will acquire them as its own. This is needed, for example, to
avoid scheduling more jobs than allowed by max running jobs.
For more complex workflow classes (e.g. the Graph Manager introduced in the next chapter), this is still a to-do feature.
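The idea of acquiring running jobs when resuming can be pictured as follows. This is only a sketch under stated assumptions: jobs are plain dicts with hypothetical key, tags and state fields, and the FLOW_ID=<value> tag format is assumed for illustration.

```python
from typing import Dict, List


def acquire_running_jobs(jobs: List[Dict], flow_id: str) -> List[str]:
    """Return the keys of running jobs that belong to the given flow id."""
    wanted_tag = f"FLOW_ID={flow_id}"  # assumed tag format
    return [
        job["key"]
        for job in jobs
        if wanted_tag in job["tags"] and job["state"] == "running"
    ]


# Hypothetical job listing for the example.
jobs = [
    {"key": "12345/1/10", "tags": ["FLOW_ID=3a20"], "state": "running"},
    {"key": "12345/1/11", "tags": ["FLOW_ID=3a20"], "state": "finished"},
    {"key": "12345/1/12", "tags": ["FLOW_ID=ffff"], "state": "running"},
]
owned = acquire_running_jobs(jobs, "3a20")

# Jobs acquired this way count against the limit on running jobs.
max_running_jobs = 2
free_slots = max(max_running_jobs - len(owned), 0)
print(owned, free_slots)
```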
The remaining options, and the main argument, are added by the CrawlManager class itself. They are self-explanatory, considering the purpose of the crawl manager script.
Let's illustrate the usage of the crawl manager. Suppose you have a spider called amazon.com that accepts some parameters like department and search_string.
From the command line, assuming you have a fully installed development environment for your project, you may call your script in this way:
> python crawlmanager.py amazon.com --spider-args='{"department": "books", "search_string": "winnie the witch"}' --job-settings='{"CONCURRENT_REQUESTS": 2}'
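Since --spider-args and --job-settings expect dictionaries in JSON format, it can be convenient to generate the command rather than hand-write the quoting. A minimal sketch using only the standard library (the spider name and values are the ones from the example above):

```python
import json
import shlex

spider_args = {"department": "books", "search_string": "winnie the witch"}
job_settings = {"CONCURRENT_REQUESTS": 2}

# Each list element is one command line argument; json.dumps guarantees valid JSON.
command = [
    "python",
    "crawlmanager.py",
    "amazon.com",
    f"--spider-args={json.dumps(spider_args)}",
    f"--job-settings={json.dumps(job_settings)}",
]

# shlex.join (Python 3.8+) adds the shell quoting needed to paste the line in a terminal.
print(shlex.join(command))
```

The same command list can be passed directly to subprocess.run, in which case no shell quoting is needed at all.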
All crawl managers support an implicit target spider via the class attribute spider. If it is provided, the spider command line argument is unavailable:
class MyCrawlManager(...):
    loop_mode = 120
    spider = "amazon.com"
So the command line call will be the same as before, but without the spider argument:
> python crawlmanager.py [--spider-args=... ...]
The periodic crawl manager is very similar to the simple one described in the previous section. But instead of scheduling a single spider job and finishing, it checks the job status on each loop and, when the job finishes, schedules a new one. To activate this behaviour you need to set loop mode as explained above. Example:
from shub_workflow.crawl import PeriodicCrawlManager

class CrawlManager(PeriodicCrawlManager):

    # check every 180 seconds the status of the scheduled job
    loop_mode = 180

    @property
    def description(self):
        return 'Periodic Crawl manager for MyProject.'

if __name__ == '__main__':
    crawlmanager = CrawlManager()
    crawlmanager.run()
The generator crawl manager (GeneratorCrawlManager) also schedules a spider periodically (in fact, it is a subclass of the PeriodicCrawlManager), but instead of being controlled
by an infinite loop, it is controlled by a generator that provides the arguments for each spider job it will schedule. Once the generator stops
iterating and all scheduled jobs are completed, the crawl manager itself finishes.
The generator method, set_parameters_gen, is an abstract method that needs to be overridden. It must yield dictionaries of {argument name: argument value} pairs. Each yielded dictionary overrides the base spider arguments already defined on the command line, if any.
On each loop, the manager checks whether the number of running spiders is below the maximum number of jobs allowed (controlled either by the attribute default_max_jobs or by the command line). If so, it takes as many argument dictionaries from the generator as there are free slots, and schedules a new job for each one.
For other details, see the code.
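The slot-filling step described above can be sketched in plain Python. The function next_jobs and its parameters are hypothetical and only mimic the described behaviour: take as many dictionaries from the generator as there are free slots, and let each one override the base arguments.

```python
from itertools import islice
from typing import Dict, Iterator, List


def next_jobs(
    params_gen: Iterator[Dict],
    base_args: Dict,
    running_jobs: int,
    max_running_jobs: int,
) -> List[Dict]:
    """Return the spider arguments for the jobs to schedule on this loop."""
    free_slots = max(max_running_jobs - running_jobs, 0)
    # islice consumes only free_slots items, so the rest stay in the generator.
    # Yielded values override the base arguments given on the command line.
    return [{**base_args, **params} for params in islice(params_gen, free_slots)]


gen = ({"input_file": f"part_{i}.csv"} for i in range(5))
base = {"department": "books", "input_file": "ignored.csv"}

first = next_jobs(gen, base, running_jobs=1, max_running_jobs=4)   # 3 free slots
second = next_jobs(gen, base, running_jobs=3, max_running_jobs=4)  # 1 free slot
print(len(first), len(second))
```

Because the generator keeps its position between calls, each loop continues where the previous one stopped.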
The generator crawl manager is useful, for example, when each spider job needs to process files from an S3 folder. A very simple example:
from shub_workflow.crawl import GeneratorCrawlManager
from shub_workflow.deliver.futils import list_folder

INPUT_FOLDER = "s3://mybucket/myinputfolder"

class CrawlManager(GeneratorCrawlManager):

    loop_mode = 120
    default_max_jobs = 4
    spider = "myspider"
    description = "My generator manager"

    def set_parameters_gen(self):
        for input_file in list_folder(INPUT_FOLDER):
            yield {
                "input_file": input_file,
            }
Here, the attribute spider (or the command line argument for the spider, in case the attribute is not provided) indicates which spider to use by default when scheduling
a new job. In the above example, the spider myspider with argument input_file=<...> will be scheduled for each input file found in the listed folder.
However, the spider name itself can be included in the yielded parameters. Example:
from shub_workflow.crawl import GeneratorCrawlManager
from shub_workflow.deliver.futils import list_folder

INPUT_FOLDER = "s3://mybucket/myinputfolder"

class CrawlManager(GeneratorCrawlManager):

    loop_mode = 120
    default_max_jobs = 4
    spider = "myspider"
    description = "My generator manager"

    def set_parameters_gen(self):
        for input_file in list_folder(INPUT_FOLDER):
            spider = input_file.split("_")[0]
            yield {
                "spider": spider,
                "input_file": input_file,
            }
In the specific example code above, the spider class attribute may seem unnecessary. However, it disables the command line argument that sets the spider.
ScrapyCloud parameters like project_id (for cross-project scheduling), units and tags can also be included in the yielded parameters.
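As an illustration, a generator that yields those extra parameters alongside the spider arguments might look like the standalone sketch below. The key names come from the text above; the value types (an integer project id, a list of tags), the concrete values and the function name parameters_gen are assumptions for the example.

```python
from typing import Dict, Iterable, Iterator


def parameters_gen(input_files: Iterable[str]) -> Iterator[Dict]:
    """Yield one parameter dictionary per input file (illustrative only)."""
    for input_file in input_files:
        yield {
            # The spider name is taken from the file name prefix.
            "spider": input_file.split("_")[0],
            "input_file": input_file,
            # ScrapyCloud parameters (hypothetical values).
            "project_id": 12345,
            "units": 2,
            "tags": ["s3-input"],
        }


for params in parameters_gen(["amazon_0001.csv", "ebay_0001.csv"]):
    print(params["spider"], params["project_id"], params["units"], params["tags"])
```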
Next Chapter: Managing Hubstorage Crawl Frontiers