Ehsan-U/scrapy-nodriver

scrapy-nodriver: Nodriver integration for Scrapy

A Scrapy Download Handler which performs requests using Nodriver. It can be used to handle pages that require JavaScript (among other things), while adhering to the regular Scrapy workflow (i.e. without interfering with request scheduling, item processing, etc).

What sets this package apart from packages like Scrapy-Playwright is its focus on staying undetected by most anti-bot solutions. Direct CDP (Chrome DevTools Protocol) communication provides better resistance against web application firewalls (WAFs) and also improves performance.

Requirements

Since version 2.0, which added coroutine syntax and asyncio support, Scrapy can integrate asyncio-based projects such as Nodriver.
Note: Chrome must be installed on the system.

Minimum required versions

  • Python >= 3.8
  • Scrapy >= 2.0 (!= 2.4.0)

Installation

scrapy-nodriver is available on PyPI and can be installed with pip:

pip install scrapy-nodriver

nodriver is defined as a dependency, so it gets installed automatically.

Activation

Download handler

Replace the default http and/or https Download Handlers through DOWNLOAD_HANDLERS:

# settings.py
DOWNLOAD_HANDLERS = {
    "http": "scrapy_nodriver.handler.ScrapyNodriverDownloadHandler",
    "https": "scrapy_nodriver.handler.ScrapyNodriverDownloadHandler",
}

Note that the ScrapyNodriverDownloadHandler class inherits from the default http/https handler. Unless explicitly marked (see Basic usage), requests will be processed by the regular Scrapy download handler.

Twisted reactor

Install the asyncio-based Twisted reactor:

# settings.py
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

This is the default in new projects since Scrapy 2.7.

Basic usage

Set the nodriver Request.meta key to download a request using Nodriver:

import scrapy

class AwesomeSpider(scrapy.Spider):
    name = "awesome"

    def start_requests(self):
        # GET request
        yield scrapy.Request("https://httpbin.org/get", meta={"nodriver": True})

    def parse(self, response, **kwargs):
        # 'response' contains the page as seen by the browser
        return {"url": response.url}

Supported settings

NODRIVER_MAX_CONCURRENT_PAGES

Type Optional[int], defaults to the value of Scrapy's CONCURRENT_REQUESTS setting

Maximum amount of allowed concurrent Nodriver pages.

NODRIVER_MAX_CONCURRENT_PAGES = 8

NODRIVER_BLOCKED_URLS

Type Optional[List], default None

Block resources on the page.

NODRIVER_BLOCKED_URLS = [
    "*/*.jpg",
    "*/*.png",
    "*/*.gif",
    "*/*.webp",
    "*/*.svg",
    "*/*.ico"
]
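
To get a feel for which requests such patterns would catch, here is a minimal sketch that assumes the patterns behave like shell-style wildcards, where `*` matches any run of characters. It uses Python's `fnmatch` module as a stand-in; the actual matching is done by the browser and may differ in detail, and `is_blocked` is a hypothetical helper, not part of scrapy-nodriver.

```python
from fnmatch import fnmatchcase

# Patterns copied from the NODRIVER_BLOCKED_URLS example.
BLOCKED_URLS = ["*/*.jpg", "*/*.png", "*/*.gif", "*/*.webp", "*/*.svg", "*/*.ico"]


def is_blocked(url: str, patterns=BLOCKED_URLS) -> bool:
    """Hypothetical helper: True if the URL matches any wildcard pattern."""
    return any(fnmatchcase(url, pattern) for pattern in patterns)


print(is_blocked("https://example.org/static/logo.png"))  # True
print(is_blocked("https://example.org/index.html"))       # False
print(is_blocked("https://example.org/a.jpg?size=large"))  # False: the URL does not end in .jpg
```

Under this assumption, a URL with a query string after the extension is not matched, so a pattern such as `*/*.jpg*` may be needed to cover those cases.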

NODRIVER_HEADLESS

Type Optional[bool], default True

Whether to run the browser in headless mode (without a visible window).

NODRIVER_HEADLESS = True

Supported Request.meta keys

nodriver

Type bool, default False

If set to a value that evaluates to True the request will be processed by Nodriver.

return scrapy.Request("https://example.org", meta={"nodriver": True})
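
"Evaluates to True" suggests that Python's usual truthiness rules apply to the value. The sketch below illustrates that reading with a hypothetical helper, `uses_nodriver`; it is not the handler's actual code, and the handler may check the key differently.

```python
def uses_nodriver(meta: dict) -> bool:
    """Hypothetical helper mirroring the 'evaluates to True' rule for the nodriver key."""
    return bool(meta.get("nodriver"))


print(uses_nodriver({"nodriver": True}))     # True
print(uses_nodriver({}))                     # False: key unset, regular handler
print(uses_nodriver({"nodriver": 0}))        # False
print(uses_nodriver({"nodriver": "false"}))  # True: any non-empty string is truthy
```

Under this reading, passing an actual bool avoids surprises such as the string "false" enabling Nodriver.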

nodriver_include_page

Type bool, default False

If True, the Nodriver page (a nodriver.Tab object) that was used to download the request will be available in the callback at response.meta['nodriver_page']. If False (or unset), the page will be closed immediately after processing the request.

Important!

This meta key is entirely optional: it's NOT necessary for the page to load or for any asynchronous operation to be performed (specifically, it's NOT necessary for PageMethod objects to be applied). Use it only if you need access to the Page object in the callback that handles the response.

For more information and important notes see Receiving Page objects in callbacks.

return scrapy.Request(
    url="https://example.org",
    meta={"nodriver": True, "nodriver_include_page": True},
)

nodriver_page_methods

Type Iterable[PageMethod], default ()

An iterable of scrapy_nodriver.page.PageMethod objects to indicate actions to be performed on the page before returning the final response. See Executing actions on pages.

nodriver_page

Type Optional[nodriver.Tab], default None

A Nodriver page to be used to download the request. If unspecified, a new page is created for each request. This key can be used in conjunction with nodriver_include_page to make a chain of requests using the same page. For instance:

from nodriver import Tab

def start_requests(self):
    yield scrapy.Request(
        url="https://httpbin.org/get",
        meta={"nodriver": True, "nodriver_include_page": True},
    )

def parse(self, response, **kwargs):
    page: Tab = response.meta["nodriver_page"]
    yield scrapy.Request(
        url="https://httpbin.org/headers",
        callback=self.parse_headers,
        meta={"nodriver": True, "nodriver_page": page},
    )

Receiving Page objects in callbacks

Setting nodriver_include_page to True makes the nodriver.Tab object used for the request available in the callback at response.meta["nodriver_page"]. For instance:

from nodriver import Tab
import scrapy

class AwesomeSpiderWithPage(scrapy.Spider):
    name = "page_spider"

    def start_requests(self):
        yield scrapy.Request(
            url="https://example.org",
            callback=self.parse_first,
            meta={"nodriver": True, "nodriver_include_page": True},
            errback=self.errback_close_page,
        )

    def parse_first(self, response):
        page: Tab = response.meta["nodriver_page"]
        return scrapy.Request(
            url="https://example.com",
            callback=self.parse_second,
            meta={"nodriver": True, "nodriver_include_page": True, "nodriver_page": page},
            errback=self.errback_close_page,
        )

    async def parse_second(self, response):
        page: Tab = response.meta["nodriver_page"]
        title = await page.evaluate("document.title")  # "Example Domain"
        await page.close()
        return {"title": title}

    async def errback_close_page(self, failure):
        page: Tab = failure.request.meta["nodriver_page"]
        await page.close()

Notes:

  • When passing nodriver_include_page=True, make sure pages are always closed when they are no longer used. It's recommended to set a Request errback to make sure pages are closed even if a request fails (if nodriver_include_page=False pages are automatically closed upon encountering an exception). This is important, as open pages count towards the limit set by NODRIVER_MAX_CONCURRENT_PAGES and crawls could freeze if the limit is reached and pages remain open indefinitely.
  • Defining callbacks as async def is only necessary if you need to await things; it's NOT necessary if you just need to pass the Page object from one callback to another (see the example above).
  • Any network operations resulting from awaiting a coroutine on a Page object (get, etc) will be executed directly by Nodriver, bypassing the Scrapy request workflow (Scheduler, Middlewares, etc).
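
The first note can be illustrated with a minimal, standard-library-only sketch. It assumes the page limit behaves like a semaphore with one slot per open page; this is an assumption for illustration, not necessarily how scrapy-nodriver implements it. `FakePage`, `open_page` and `MAX_CONCURRENT_PAGES` are hypothetical names.

```python
import asyncio

MAX_CONCURRENT_PAGES = 2  # stands in for NODRIVER_MAX_CONCURRENT_PAGES


class FakePage:
    """Hypothetical stand-in for a browser page; holds one slot until closed."""

    def __init__(self, slots: asyncio.Semaphore):
        self._slots = slots
        self.closed = False

    async def close(self):
        if not self.closed:
            self.closed = True
            self._slots.release()  # give the slot back


async def open_page(slots: asyncio.Semaphore) -> FakePage:
    await slots.acquire()  # waits here while every slot is taken
    return FakePage(slots)


async def main() -> str:
    slots = asyncio.Semaphore(MAX_CONCURRENT_PAGES)
    pages = [await open_page(slots) for _ in range(MAX_CONCURRENT_PAGES)]
    try:
        # All slots are held by open pages, so one more page cannot be created.
        await asyncio.wait_for(open_page(slots), timeout=0.1)
        outcome = "opened"
    except asyncio.TimeoutError:
        outcome = "stalled"
    finally:
        # Closing in a finally block frees the slots even if something failed.
        for page in pages:
            await page.close()
    # With the slots released, a new page can be opened again.
    extra = await asyncio.wait_for(open_page(slots), timeout=0.1)
    await extra.close()
    return outcome


print(asyncio.run(main()))  # stalled
```

In a real spider there is no timeout to rescue the crawl, which is why closing pages in both the callback and the errback matters.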

Executing actions on pages

An ordered iterable (e.g. list, tuple) of PageMethod objects can be passed in the nodriver_page_methods Request.meta key to request methods to be invoked, in order, on the Page object before returning the final Response to the callback.

This is useful when you need to perform certain actions on a page (like scrolling down or clicking links) and you want to handle only the final result in your callback.

PageMethod class

scrapy_nodriver.page.PageMethod(method: str, *args, **kwargs):

Represents a method to be called (and awaited if necessary) on a nodriver.Tab object (e.g. "select", "save_screenshot", "evaluate", etc). method is the name of the method, *args and **kwargs are passed when calling such method. The return value will be stored in the PageMethod.result attribute.

For instance:

from scrapy import Request
from scrapy_nodriver.page import PageMethod

def start_requests(self):
    yield Request(
        url="https://example.org",
        meta={
            "nodriver": True,
            "nodriver_page_methods": [
                PageMethod("save_screenshot", filename="example.jpeg", full_page=True),
            ],
        },
    )

def parse(self, response, **kwargs):
    screenshot = response.meta["nodriver_page_methods"][0]
    # screenshot.result contains the image file path

produces the same effect as:

def start_requests(self):
    yield Request(
        url="https://example.org",
        meta={"nodriver": True, "nodriver_include_page": True},
    )

async def parse(self, response, **kwargs):
    page = response.meta["nodriver_page"]
    filepath = await page.save_screenshot(filename="example.jpeg", full_page=True)
    await page.close()
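
The behaviour described for PageMethod (look up the named method, call it with the stored arguments, await the result if necessary, keep the return value in `result`) can be sketched with the standard library alone. `SketchPageMethod`, `FakeTab` and `apply_page_methods` are hypothetical names used for illustration; this is not the package's implementation.

```python
import asyncio
import inspect


class SketchPageMethod:
    """Hypothetical re-creation of the PageMethod idea; not the package's class."""

    def __init__(self, method: str, *args, **kwargs):
        self.method = method
        self.args = args
        self.kwargs = kwargs
        self.result = None


class FakeTab:
    """Hypothetical stand-in for nodriver.Tab with one sync and one async method."""

    url = "https://example.org"

    def describe(self, prefix: str) -> str:
        return f"{prefix}: {self.url}"

    async def save_screenshot(self, filename: str, full_page: bool = False) -> str:
        return filename  # pretend the file was written and return its path


async def apply_page_methods(page, page_methods):
    """Call each method in order, awaiting the outcome only when it is awaitable."""
    for pm in page_methods:
        outcome = getattr(page, pm.method)(*pm.args, **pm.kwargs)
        if inspect.isawaitable(outcome):
            outcome = await outcome
        pm.result = outcome
    return page_methods


methods = [
    SketchPageMethod("describe", "visited"),
    SketchPageMethod("save_screenshot", filename="example.jpeg", full_page=True),
]
asyncio.run(apply_page_methods(FakeTab(), methods))
print([pm.result for pm in methods])  # ['visited: https://example.org', 'example.jpeg']
```

Because the results are stored on the objects themselves, the callback can read them back from the same list it passed in the request meta.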

Supported methods

Refer to the upstream docs for the Tab class to see available methods.

Scroll down on an infinite scroll page, take a screenshot of the full page

import scrapy
from scrapy_nodriver.page import PageMethod

class ScrollSpider(scrapy.Spider):
    name = "scroll"

    def start_requests(self):
        yield scrapy.Request(
            url="http://quotes.toscrape.com/scroll",
            meta=dict(
                nodriver=True,
                nodriver_include_page=True,
                nodriver_page_methods=[
                    PageMethod("wait_for", "div.quote"),
                    PageMethod("evaluate", "window.scrollBy(0, document.body.scrollHeight)"),
                    PageMethod("wait_for", "div.quote:nth-child(11)"),  # 10 per page
                ],
            ),
        )

    async def parse(self, response, **kwargs):
        page = response.meta["nodriver_page"]
        await page.save_screenshot(filename="quotes.jpeg", full_page=True)
        await page.close()
        return {"quote_count": len(response.css("div.quote"))}  # quotes from several pages

Known issues

No proxy support

Specifying a proxy via the proxy Request meta key is not supported.

Reporting issues

Before opening an issue please make sure the unexpected behavior can only be observed by using this package and not with standalone Nodriver. To do this, translate your spider code to a reasonably close Nodriver script: if the issue also occurs this way, you should instead report it upstream. For instance:

import scrapy
from scrapy_nodriver.page import PageMethod

class ExampleSpider(scrapy.Spider):
    name = "example"

    def start_requests(self):
        yield scrapy.Request(
            url="https://example.org",
            meta=dict(
                nodriver=True,
                nodriver_page_methods=[
                    PageMethod("save_screenshot", filename="example.jpeg", full_page=True),
                ],
            ),
        )

translates roughly to:

import asyncio
import nodriver as uc

async def main():
    browser = await uc.start()
    page = await browser.get("https://example.org")
    await page.save_screenshot(filename="example.jpeg", full_page=True)
    await page.close()

if __name__ == '__main__':
    uc.loop().run_until_complete(main())