Crawl Managers
Previous Chapter: Table of Contents and Introduction
The simplest workflow can be defined with the CrawlManager class. This class schedules a
single spider job. It is not very useful by itself, but it helps to illustrate basic concepts.
The first step is to create a crawl manager script in your project repository for deployment in ScrapyCloud. Save the following lines in a file called, for example, scripts/crawlmanager.py:
from shub_workflow.crawl import CrawlManager

if __name__ == '__main__':
    crawlmanager = CrawlManager()
    crawlmanager.run()
and add a proper scripts line in your project setup.py. For example:
import glob
from setuptools import setup, find_packages

setup(
    name = 'project',
    version = '1.0',
    packages = find_packages(),
    scripts = glob.glob('scripts/*.py'),
    entry_points = {'scrapy': ['settings = myproject.settings']},
)
Let's analyze the help printed when the script is called without parameters from the command line:
> python crawlmanager.py -h
usage: You didn't set description for this script. Please set description property accordingly.
[-h] [--project-id PROJECT_ID] [--name NAME] [--flow-id FLOW_ID] [--tag TAG]
[--loop-mode SECONDS] [--max-running-jobs MAX_RUNNING_JOBS]
[--spider-args SPIDER_ARGS] [--job-settings JOB_SETTINGS]
[--units UNITS]
spider
positional arguments:
spider Spider name
optional arguments:
-h, --help show this help message and exit
--project-id PROJECT_ID
Overrides target project id.
--name NAME Script name.
--flow-id FLOW_ID If given, use the given flow id.
--tag TAG Add given tag to the scheduled jobs. Can be given
multiple times.
--loop-mode SECONDS If provided, manager will run in loop mode, with a
cycle each given number of seconds. Default: 0
--max-running-jobs MAX_RUNNING_JOBS
If given, don't allow more than the given jobs running
at once. Default: inf
--resume-workflow Resume workflow. You must use it in combination with --flow-id in order to set the flow id of the workflow you want to resume.
--spider-args SPIDER_ARGS
Spider arguments dict in json format
--job-settings JOB_SETTINGS
Job settings dict in json format
--units UNITS Set number of ScrapyCloud units for each job
Some of the options are inherited from parent classes; others are added by the CrawlManager class itself. The first thing that may grab your attention is the usage line: You didn't set description for this script. Please set description property accordingly. Every script subclassed from the base script class will print this message if a description was not defined for it (or for a parent class). To define one, add the description property. In our example, it could be something like this:
from shub_workflow.crawl import CrawlManager as SHCrawlManager

class CrawlManager(SHCrawlManager):

    @property
    def description(self):
        return 'Crawl manager for MyProject.'

if __name__ == '__main__':
    crawlmanager = CrawlManager()
    crawlmanager.run()
Let's focus on the command line options and arguments. The first seven options (from --project-id to --resume-workflow) are inherited from the base script class.
--project-id
When a shub-workflow script runs in ScrapyCloud, the project id where it operates is autodetected: by default it is the id of the ScrapyCloud project where the script itself is running.
In the context of a script that schedules other jobs (from now on, a manager script), like our crawl manager, this project id determines the target project where the child jobs must run. But for some applications you may want to run jobs in a different project than the one where the manager is running, so you can provide the --project-id option for those cases. It is also possible to run the manager outside ScrapyCloud. In this case the project id cannot be autodetected, so you must provide it either with the --project-id option or the PROJECT_ID environment variable.
When a shub-workflow script is invoked on the command line, it tries to guess the project id from the default entry in the project scrapinghub.yml. To override it, or to provide it when such an entry is not available, use either the PROJECT_ID environment variable or the --project-id command line option.
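For example, when running the manager locally (outside ScrapyCloud), either of the following invocations would target a hypothetical project 12345 with a hypothetical spider myspider:
> PROJECT_ID=12345 python crawlmanager.py myspider
> python crawlmanager.py myspider --project-id=12345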
--name
The --name option allows you to assign a workflow name to the script. The same script can run in the context of many different workflows (not only instances of the same workflow), and a name identification can be useful in some situations.
--flow-id
The flow id identifies a specific instance of a workflow. If this option is not provided, it is autogenerated, added to the job tags of the manager script itself, and propagated to all its scheduled children. In this way, different jobs running in ScrapyCloud can be related to the same instance of a workflow, which allows consistency between the jobs running in it, in ways we will see later. You may also want to override the flow id via command line when resuming jobs, for example, or for manually scheduling jobs associated with a specific workflow instance.
--tag
The --tag command line option allows you to add custom tags to the child jobs.
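As an illustration (all option values below are hypothetical), the three options above can be combined in a single call; note that --tag can be repeated to add several tags:
> python crawlmanager.py myspider --name=books-crawl --flow-id=books-20240101 --tag=BOOKS --tag=DAILY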
--loop-mode
By default, a workflow manager script performs a single loop and exits. This crawl manager, for example, will schedule a spider job and finish. But if you set loop mode, it stays alive, looping every given number of seconds and checking on each loop the status of the scheduled job. Once the job is finished, the crawl manager finishes too. This is not very useful for this simple crawl manager, but most workflows need their manager to work in loop mode, in order to schedule new jobs as previous ones finish, monitor the status of the workflow, etc. In order for the crawl manager script to work in loop mode, you can either:
- In your custom crawl manager class, set the class attribute loop_mode to an integer that determines the number of seconds the manager must sleep between loop executions (loop_mode = 0, which is the default, disables looping). See the sketch after this list.
- Override that default looping value from the command line with the --loop-mode option.
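A minimal sketch of the class-attribute approach, building on the earlier example (the 300-second value is just an illustrative choice):
from shub_workflow.crawl import CrawlManager as SHCrawlManager

class MyCrawlManager(SHCrawlManager):

    # sleep 300 seconds between loop executions instead of performing a single pass
    loop_mode = 300

    @property
    def description(self):
        return 'Looping crawl manager for MyProject.'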
--max-running-jobs
Another configuration inherited from the base workflow manager allows you to set the maximum number of child jobs that can be running at a given moment. By default there is no maximum. You can put a limit on this number either with the class attribute default_max_jobs, or with the command line option --max-running-jobs.
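For example, to cap the number of concurrently running child jobs at two from the command line (the value is illustrative; this limit becomes especially relevant for the manager classes described later, which schedule many jobs):
> python crawlmanager.py myspider --max-running-jobs=2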
--resume-workflow
A flag option. It must be used in combination with --flow-id. When you set a flow id in this way and add --resume-workflow, the crawl manager will infer the status
of the workflow by reading information from all the jobs with the same flow id, and resume from there. At the moment this is only implemented for CrawlManager and its subclasses.
In this case, the manager will check all the spider jobs with the same flow id and, if some are running, it will acquire them as its own. This is needed, for example, in order
to avoid scheduling more jobs than allowed by the maximum of running jobs.
For more complex workflow classes (i.e. the Graph Manager introduced in the next chapter), this is still a TO DO feature.
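For a crawl manager, resuming a previously started workflow instance could look like this (the flow id below is a hypothetical placeholder):
> python crawlmanager.py myspider --flow-id=3a2b1c2d --resume-workflow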
The remaining options, and the main argument, are added by the CrawlManager class itself, and they are self-explanatory considering the purpose of the crawl manager script.
So, let's exemplify the usage of the crawl manager. Let's suppose you have a spider called amazon.com that accepts some parameters like department and search_string.
From the command line, assuming you have a fully installed development environment for your project, you may call your script in this way:
> python crawlmanager.py amazon.com --spider-args='{"department": "books", "search_string": "winnie the witch"}' --job-settings='{"CONCURRENT_REQUESTS": 2}'
PeriodicCrawlManager
The periodic crawl manager is very similar to the simple one described in the previous section. But instead of scheduling a single spider job, on each loop it checks the status of the scheduled job, and when that job finishes, it schedules a new one. For activating this behaviour you need to set loop mode as explained above. Example:
from shub_workflow.crawl import PeriodicCrawlManager

class CrawlManager(PeriodicCrawlManager):

    # check every 180 seconds the status of the scheduled job
    loop_mode = 180

    @property
    def description(self):
        return 'Periodic Crawl manager for MyProject.'

if __name__ == '__main__':
    crawlmanager = CrawlManager()
    crawlmanager.run()
GeneratorCrawlManager
This crawl manager also schedules spiders periodically (in fact, it is a subclass of PeriodicCrawlManager), but instead of being controlled
by an infinite loop, it is controlled by a generator that provides the arguments for each spider job it will schedule. Once the generator stops
iterating and all scheduled jobs are completed, the crawl manager finishes itself.
The generator method is an abstract class method that needs to be overridden. It must yield dictionaries of {argument name: argument value} pairs. Each newly yielded dictionary will override the base spider arguments already defined on the command line, if any.
On each loop, the manager checks whether the number of running spiders is below the maximum number of jobs allowed (controlled either by the attribute default_max_jobs or by the command line). If so, it takes a new dictionary of arguments from the generator and schedules a new job with them. For other details see the code.
This is useful, for example, when each spider job needs to process files from an S3 folder. A very simple example:
from shub_workflow.crawl import GeneratorCrawlManager
from shub_workflow.deliver.futils import list_folder

INPUT_FOLDER = "s3://mybucket/myinputfolder"

class CrawlManager(GeneratorCrawlManager):

    loop_mode = 120
    default_max_jobs = 4
    description = "My generator manager"

    def set_parameters_gen(self):
        for input_file in list_folder(INPUT_FOLDER):
            yield {
                "input_file": input_file,
            }
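If the manager is also launched with --spider-args, the dictionaries yielded by set_parameters_gen override those base arguments, as noted above. A hypothetical invocation for the manager sketched here could look like:
> python crawlmanager.py myspider --spider-args='{"mode": "incremental"}'
where myspider and the mode argument are illustrative names only.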
All crawl managers support an implicit target spider via the class attribute spider. If it is provided, the spider command line argument is not available:
class MyCrawlManager(...):

    loop_mode = 120
    spider = "amazon.com"
So the command line call will be the same as before, but without the spider argument:
> python crawlmanager.py [--spider-args=... ...]
Next Chapter: Managing Hubstorage Crawl Frontiers