
Crawl Managers


Previous Chapter: General Description


Simplest workflow with one spider task

The simplest workflow can be defined with the CrawlManager class. This class schedules a single spider job. It is not very useful by itself, but it helps to illustrate basic concepts. The first step is to create a crawl manager script in your project repository for deploying in ScrapyCloud. Save the following lines in a file called, for example, scripts/crawlmanager.py:

from shub_workflow.crawl import CrawlManager

if __name__ == '__main__':
    crawlmanager = CrawlManager()
    crawlmanager.run()

and add a proper scripts line to your project's setup.py. For example:

import glob
from setuptools import setup, find_packages

setup(
    name             = 'project',
    version          = '1.0',
    packages         = find_packages(),
    scripts          = glob.glob('scripts/*.py'),
    entry_points     = {'scrapy': ['settings = myproject.settings']}
)

Let's analyze the help printed when the script is called with the -h option from the command line:

> python crawlmanager.py -h
usage: You didn't set description for this script. Please set description property accordingly.
       [-h] [--project-id PROJECT_ID] [--name NAME] [--flow-id FLOW_ID] [--tag TAG]
       [--loop-mode SECONDS] [--max-running-jobs MAX_RUNNING_JOBS] [--resume-workflow]
       [--spider-args SPIDER_ARGS] [--job-settings JOB_SETTINGS]
       [--units UNITS]
       spider

positional arguments:
  spider                Spider name

optional arguments:
  -h, --help            show this help message and exit
  --project-id PROJECT_ID
                        Overrides target project id.
  --name NAME           Script name.
  --flow-id FLOW_ID     If given, use the given flow id.
  --tag TAG             Add given tag to the scheduled jobs. Can be given
                        multiple times.
  --loop-mode SECONDS   If provided, manager will run in loop mode, with a
                        cycle each given number of seconds. Default: 0
  --max-running-jobs MAX_RUNNING_JOBS
                        If given, don't allow more than the given jobs running
                        at once. Default: inf
  --resume-workflow     Resume workflow. You must use it in combination with
                        --flow-id in order to set the flow id of the workflow
                        you want to resume.
  --spider-args SPIDER_ARGS
                        Spider arguments dict in json format
  --job-settings JOB_SETTINGS
                        Job settings dict in json format
  --units UNITS         Set number of ScrapyCloud units for each job

Some of the options are inherited from parent classes; others are added by the CrawlManager class. The first thing that may grab your attention is the initial description message: You didn't set description for this script. Please set description property accordingly. Every script subclassed from the base script class will print this message if a description was not defined for it (or for a parent class). To define one, add the description property. In our example, it could be something like this:

from shub_workflow.crawl import CrawlManager as SHCrawlManager


class CrawlManager(SHCrawlManager):

    @property
    def description(self):
        return 'Crawl manager for MyProject.'


if __name__ == '__main__':
    crawlmanager = CrawlManager()
    crawlmanager.run()

Let's focus on the command line options and arguments. The first seven options (from --project-id to --resume-workflow) are inherited from the base script class.

  • --project-id

When a shub-workflow script runs in ScrapyCloud, the project id where it operates is autodetected: by default it is the id of the ScrapyCloud project where the script itself is running. In the context of a script that schedules other jobs (from now on, a manager script), like our crawl manager, this project id determines the target project where the child jobs must run. But for some applications you may want to run jobs in a different project than the one where the manager is running, so the --project-id option is provided for those cases. It is also possible to run the manager outside ScrapyCloud. In that case the project id cannot be autodetected, so you must provide it either with the --project-id option or with the PROJECT_ID environment variable.

When a shub-workflow script is invoked on the command line, it tries to guess the project id from the default entry in the project's scrapinghub.yml. To override it, or to provide it when no such entry is available, use either the PROJECT_ID environment variable or the --project-id command line option.
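For example, either of the following invocations explicitly targets a project (the project id 123456 and the spider name myspider are just illustrative):

> PROJECT_ID=123456 python crawlmanager.py myspider
> python crawlmanager.py myspider --project-id=123456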

  • --name

The --name option overrides the manager attribute name. This attribute assigns a workflow name to the script. The same script can run in the context of many different workflows (not only instances of the same workflow), and a name identification is useful in many situations, such as recognizing owned jobs when resuming a workflow. In particular, any object derived from WorkFlowManager requires a name, either as a class attribute or passed via command line. In addition, different scripts that may run on the same workflow must have different names.
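For instance, continuing the subclass from the previous example, the name could be set as a class attribute ('crawl' is just an illustrative value):

class CrawlManager(SHCrawlManager):

    name = 'crawl'

    @property
    def description(self):
        return 'Crawl manager for MyProject.'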

  • --flow-id

The flow id identifies a specific instance of a workflow. If this option is not provided, it is autogenerated, added to the job tags of the manager script itself, and propagated to all its scheduled children. In this way, different jobs running in ScrapyCloud can be related to the same instance of a workflow, which allows consistency between the jobs running on it, in ways that we will see later. You may also want to override the flow id via command line when resuming jobs, for example, or when manually scheduling jobs associated with a specific workflow instance.

  • --tag

The --tag command line option allows you to add custom tags to the child jobs, and it can be given multiple times.
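For example (the spider name and the tag values are just illustrative):

> python crawlmanager.py myspider --tag DAILY --tag BOOKS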

  • --loop-mode

By default, a workflow manager script performs a single loop and exits. This crawl manager, for example, will schedule a spider job and finish. But if you set loop mode, it stays alive, looping every given number of seconds and checking on each loop the status of the scheduled job. Once that job finishes, the crawl manager finishes too. This is not very useful for this crawl manager, but most workflows need their manager to work in loop mode, in order to schedule new jobs as previous ones finish, monitor the status of the workflow, etc. To make the crawl manager script work in loop mode, you can either:

  • In your custom crawl manager class, set the class attribute loop_mode to an integer that determines the number of seconds the manager must sleep between loop executions (loop_mode = 0, the default, disables looping).

  • Or override the class default from the command line with the --loop-mode option, as in the sketch below.
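A minimal sketch of both alternatives, assuming the CrawlManager subclass defined earlier and an illustrative value of 120 seconds:

class CrawlManager(SHCrawlManager):

    # sleep 120 seconds between loop executions
    loop_mode = 120

or, equivalently, from the command line (myspider is a hypothetical spider name):

> python crawlmanager.py myspider --loop-mode=120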

  • --max-running-jobs

Another configuration inherited from the base workflow manager allows you to set the maximum number of child jobs that can be running at a given moment. By default there is no maximum. You can limit this number either with the class attribute default_max_jobs or with the command line option --max-running-jobs.
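For example, with the hypothetical spider name myspider, a limit of two concurrently running jobs can be set like this:

> python crawlmanager.py myspider --max-running-jobs=2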

  • --resume-workflow

A flag option. It must be used in combination with --flow-id. When you set a flow id in this way and add --resume-workflow, the crawl manager will infer the status of the workflow by reading information from all the jobs with the same flow id, and resume from there. At the moment this is only implemented for CrawlManager and its subclasses. In this case, the manager will check all the spider jobs with the same flow id and, if any are running, it will acquire them as its own. This is needed, for example, in order to avoid scheduling more jobs than allowed by the max running jobs limit.

For more complex workflow classes (i.e. the Graph Manager introduced in the next chapter), this is still a TO DO feature.
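For CrawlManager and its subclasses, a typical resume invocation would look like this (the flow id value is a placeholder for the one tagged on the original manager job, and myspider is a hypothetical spider name):

> python crawlmanager.py myspider --flow-id=<flow id of the workflow to resume> --resume-workflow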

The remaining options, and the main argument, are added by the CrawlManager class itself, and they are self-explanatory considering the purpose of the crawl manager script.

So, let's exemplify the usage of the crawl manager. Suppose you have a spider called amazon.com that accepts some parameters like department and search_string. From the command line, assuming you have a fully installed development environment for your project, you may call your script in this way:

> python crawlmanager.py amazon.com --spider-args='{"department": "books", "search_string": "winnie the witch"}' --job-settings='{"CONCURRENT_REQUESTS": 2}'

All crawl managers support an implicit target spider via the class attribute spider. If it is provided, the spider command line argument becomes unavailable:

class MyCrawlManager(...):

    name = 'crawl'
    loop_mode = 120
    spider = "amazon.com"

So the command line call will be the same as before, but without the spider argument:

> python crawlmanager.py [--spider-args=... ...]

Periodic Crawl Manager

The periodic crawl manager is very similar to the simple one described in the previous section. But instead of scheduling a single spider job and finishing, on each loop it checks the status of the scheduled job and, when that job finishes, it schedules a new one. For activating this behaviour you need to set loop mode, as explained above. Example:

from shub_workflow.crawl import PeriodicCrawlManager


class CrawlManager(PeriodicCrawlManager):

    name = 'crawl'
    # check every 180 seconds the status of the scheduled job
    loop_mode = 180

    @property
    def description(self):
        return 'Periodic Crawl manager for MyProject.'


if __name__ == '__main__':
    crawlmanager = CrawlManager()
    crawlmanager.run()

Generator Crawl Manager

This crawl manager also schedules spider jobs periodically (in fact, it is a subclass of PeriodicCrawlManager), but instead of being controlled by an infinite loop, it is controlled by a generator that provides the arguments for each spider job it will schedule. Once the generator stops iterating and all scheduled jobs are completed, the crawl manager itself finishes.

The generator method, set_parameters_gen(), is an abstract method that needs to be overridden. It must yield dictionaries of {argument name: argument value} pairs. Each yielded dictionary will override the base spider arguments already defined on the command line, if any.

On each loop, the manager checks whether the number of running spider jobs is below the maximum number of jobs allowed (controlled either by the attribute default_max_jobs or by command line). If so, it takes as many dictionaries of arguments from the generator as needed to fill the free slots, and schedules a new job for each one. For other details, see the code.

This is useful, for example, when each spider job needs to process files from an s3 folder. A very simple example:

from shub_workflow.crawl import GeneratorCrawlManager
from shub_workflow.deliver.futils import list_folder


INPUT_FOLDER = "s3://mybucket/myinputfolder"


class CrawlManager(GeneratorCrawlManager):

    name = 'crawl'
    loop_mode = 120
    default_max_jobs = 4
    spider = "myspider"

    description = "My generator manager"

    def set_parameters_gen(self):
        for input_file in list_folder(INPUT_FOLDER):
            yield {
                "input_file": input_file,
            }

Here, the attribute spider (or the command line argument for the spider, in case the attribute is not provided) indicates which spider to use by default when scheduling a new job. In the above example, the spider myspider will be scheduled with the argument input_file=<...> for each input file found in the listed folder.

However, the spider name itself can be included in the yielded parameters. Example:

from shub_workflow.crawl import GeneratorCrawlManager
from shub_workflow.deliver.futils import list_folder


INPUT_FOLDER = "s3://mybucket/myinputfolder"


class CrawlManager(GeneratorCrawlManager):

    name = 'crawl'
    loop_mode = 120
    default_max_jobs = 4
    spider = "myspider"

    description = "My generator manager"

    def set_parameters_gen(self):
        for input_file in list_folder(INPUT_FOLDER):
            spider = input_file.split("_")[0]
            yield {
                "spider": spider,
                "input_file": input_file,
            }

In the specific example code above, the spider class attribute may seem unnecessary. However, it allows you to disable the command line argument that sets the spider.

ScrapyCloud parameters like project_id (for cross-project scheduling), units, tags and job_settings can be included in the yielded parameters as well.
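For example, the generator of the first example above could also yield those parameters (the values here are purely illustrative assumptions):

    def set_parameters_gen(self):
        for input_file in list_folder(INPUT_FOLDER):
            yield {
                "input_file": input_file,
                # illustrative ScrapyCloud parameters for the scheduled job
                "units": 2,
                "tags": ["DAILY"],
                "job_settings": {"CONCURRENT_REQUESTS": 2},
            }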

Handling bad outcomes

When a spider job finishes with an abnormal finish status (outcome), we typically want to do something about it: for example, raise an alert somewhere, or retry the spider with modified spider arguments. For handling jobs with a bad outcome, you must override the method bad_outcome_hook(), available in all crawl manager classes. This method will be called when a job finishes with any of the outcomes defined in the list attribute self.failed_outcomes, which by default contains the following ones:

    base_failed_outcomes = (
        "failed",
        "killed by oom",
        "cancelled",
        "cancel_timeout",
        "memusage_exceeded",
        "cancelled (stalled)",
    )

defined in the WorkFlowManager class. You can append any other custom failed outcome to self.failed_outcomes, as in the sketch below.
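A minimal sketch of that, appending a hypothetical extra outcome in the manager constructor (it assumes the base constructor signature is passed through untouched):

class CrawlManager(GeneratorCrawlManager):

    (...)

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # also treat this (hypothetical) close reason as a failed outcome
        self.failed_outcomes.append("closespider_timeout")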

Let's suppose some spiders may finish with memory problems, and in that case you want to retry them with a bigger number of units. In that case, you can add the following method to your generator crawl manager:

class CrawlManager(GeneratorCrawlManager):

    (...)

    def bad_outcome_hook(self, spider, outcome, spider_args_override, jobkey):
        if outcome == "memusage_exceeded" and spider_args_override.get("units") == 1:
            spider_args_override["units"] = 6
            self.add_job(spider, spider_args_override)

The code above instructs the manager to add a new job with an increased number of units, and all other parameters unchanged, when a spider finishes with the outcome memusage_exceeded. Jobs added with this method will run first, before the manager continues processing the set_parameters_gen() generator.

Note: the method add_job() is only available on GeneratorCrawlManager. Its purpose is not compatible with the use cases of CrawlManager and PeriodicCrawlManager.


Next Chapter: Managing Hubstorage Crawl Frontiers