Modified from https://github.com/jschnurr/scrapyscript. Scrapyscript provides a minimalist interface for invoking Scrapy directly from your code. Define Jobs that include your spider and any objects you would like to pass to the running spider, then pass them to an instance of Processor, which will block, run the spiders, and return a list of consolidated results.
Useful for leveraging the vast power of Scrapy from existing code, or for running Scrapy from a Celery job.
- Python 2.7 or 3.5
- Tested on Linux only (other platforms may work as well)
pip install scrapy-script
Let's create a spider that retrieves the page title from two popular websites.
import json

from scrapy import Request
from scrapy.spiders import Spider

from scrapy_script import Job, Processor
# Define a Scrapy Spider, which can accept *args or **kwargs
# https://doc.scrapy.org/en/latest/topics/spiders.html#spider-arguments
class PythonSpider(Spider):
    name = 'myspider'

    def start_requests(self):
        yield Request(self.url)

    def parse(self, response):
        title = response.xpath('//title/text()').extract()
        return {'url': response.request.url, 'title': title}
# Create jobs for each instance. *args and **kwargs supplied here will
# be passed to the spider constructor at runtime
githubJob = Job(PythonSpider, url='http://www.github.com')
pythonJob = Job(PythonSpider, url='http://www.python.org')
# Create a Processor, optionally passing in a Scrapy Settings object.
processor = Processor(settings=None)
# Start the reactor, and block until all spiders complete.
processor.run([githubJob, pythonJob])
data = processor.data()
# get crawl count
count = processor.count()
# Print the consolidated results
print(json.dumps(data, indent=4))
{
    "myspider": [
        {
            "url": "http://www.github.com",
            "title": ["GitHub"]
        },
        {
            "url": "http://www.python.org",
            "title": ["Welcome to Python.org"]
        }
    ]
}
As per the Scrapy docs, a Spider must return an iterable of Request objects and/or dicts or Item objects.
Requests will be consumed by Scrapy inside the Job. Dicts or Item objects will be queued and output together when all spiders have finished.
Due to the way billiard handles communication between processes, each dict or Item must be picklable using pickle protocol 0.
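For example, a single spider can mix the two freely. The sketch below (the spider name and selectors are illustrative, not part of the library) yields follow-up Requests from parse() and result dicts from parse_page(), and uses pickle.dumps(item, 0) to verify an item against the protocol-0 constraint:

import pickle

from scrapy import Request
from scrapy.spiders import Spider

class LinkSpider(Spider):
    name = 'linkspider'

    def start_requests(self):
        yield Request(self.url)

    def parse(self, response):
        # Requests are consumed by Scrapy inside the Job
        for href in response.xpath('//a/@href').extract():
            yield Request(response.urljoin(href), callback=self.parse_page)

    def parse_page(self, response):
        # Dicts are queued and returned together once all spiders finish
        item = {
            'url': response.url,
            'title': response.xpath('//title/text()').extract_first(),
        }
        pickle.dumps(item, 0)  # raises if the item is not protocol-0 picklable
        yield item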
A Job is a single request to run a specific spider, optionally passing in *args or **kwargs, which are passed through to the spider constructor at runtime.
def __init__(self, spider, *args, **kwargs):
    '''Parameters:
        spider (spidercls): the spider to be run for this job.
        *args / **kwargs: optional arguments, passed through to the
            spider constructor at runtime.
    '''
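Keyword arguments supplied to a Job reach the spider via Scrapy's default Spider.__init__, which copies them onto the instance; that is how self.url became available in the example above:

# kwargs become attributes on the spider instance at runtime
job = Job(PythonSpider, url='http://www.example.com')  # spider sees self.url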
A Twisted reactor for running spiders. Blocks until all have finished.
class Processor(Process):
    def __init__(self, settings=None):
        '''
        Parameters:
            settings (scrapy.settings.Settings) - settings to apply.
                Defaults to Scrapy defaults.
        '''
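For example, a crawl can be configured by building a scrapy.settings.Settings object and passing it in (a sketch; the setting values here are arbitrary):

from scrapy.settings import Settings

settings = Settings()
settings.set('USER_AGENT', 'Mozilla/5.0 (compatible; example-bot)')
settings.set('DOWNLOAD_TIMEOUT', 30)
processor = Processor(settings=settings)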
Starts the Scrapy engine and executes all jobs, blocking until they complete. Returns the consolidated results in a single list; after a run, the results are also available from data(), and count() reports the number of items scraped.
def run(self, jobs):
    '''
    Parameters:
        jobs ([Job]) - one or more Job objects to be processed.

    Returns:
        List of objects yielded by the spiders after all jobs have run.
    '''
Scrapyscript spawns a subprocess to host the Twisted reactor, since a reactor cannot be restarted within the same process. Billiard provides a fork of the multiprocessing library that supports Celery, which allows you to schedule Scrapy spiders to run as Celery tasks.
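A minimal sketch of that pattern, reusing PythonSpider from the example above (the Celery app name and broker URL are placeholders, not part of this library):

from celery import Celery

from scrapy_script import Job, Processor
# PythonSpider as defined in the example above

app = Celery('tasks', broker='redis://localhost:6379/0')  # placeholder broker

@app.task
def crawl(url):
    # Each call creates a fresh Processor, so the crawl runs in its own
    # billiard subprocess with a new Twisted reactor.
    processor = Processor(settings=None)
    processor.run([Job(PythonSpider, url=url)])
    return processor.data()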
Updates, additional features or bug fixes are always welcome.
1.0.0 - 10-Dec-2017 - API changes to pass *args and **kwargs to running spider
0.1.0 - 28-May-2017 - patches to support Celery 4+ and Billiard 3.5+.
Thanks to @mrge and @bmartel.
The MIT License (MIT). See LICENCE file for details.