
Use parslepy with scrapy

Parsing the New York Times Technology article summaries

In this example, we will consider this page http://www.nytimes.com/pages/technology/index.html.

We can write a Scrapy spider to fetch this page:

from scrapy.spider import BaseSpider

class NYTimesSpider(BaseSpider):
    name = "NYTimes"
    allowed_domains = ["nytimes.com"]
    start_urls = ["http://www.nytimes.com/pages/technology/index.html"]

    def parse(self, response):
        ...

For each article we're interested in its title, its author, the URL of the full article, a summary, a thumbnail picture, and the publication date. Let's define a Scrapy Item for this data:

from scrapy.item import Item, Field
from scrapy.contrib.loader.processor import TakeFirst

class NYTimesNewsItem(Item):
    title = Field(output_processor=TakeFirst())
    author = Field(output_processor=TakeFirst())
    summary = Field()
    image = Field()
    url = Field()
    timestamp = Field()
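
TakeFirst is a Scrapy output processor that returns the first non-null, non-empty value collected for a field. A quick illustration:

from scrapy.contrib.loader.processor import TakeFirst

proc = TakeFirst()
proc([None, u'', u'By VINDU GOEL'])  # -> u'By VINDU GOEL'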

If you look at the HTML source code of the page, we're interested in what's inside the <div class="aColumn">. Article summaries live in DIVs with either class story or class ledeStory. In each article we can find (sketched below):

  • the title of the article, in an H1 or H3 tag depending on the containing DIV class
  • the author of the article, in an H6 tag with class byline
  • the URL of the full article, in an A element inside the title
  • a summary, in a paragraph P with class summary
  • a thumbnail picture (optional), in an IMG inside a DIV with class thumbnail
  • the publication date, in the data-utc-timestamp attribute of a SPAN with class timestamp
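
For reference, here is a simplified sketch of the markup we're targeting (reconstructed from the selectors in the parselet below, not copied from the actual NYTimes source):

<div class="aColumn">
  <div class="ledeStory">
    <h1><a href="http://...">Article title</a></h1>
    <h6 class="byline">By AUTHOR NAME</h6>
    <p class="summary">Summary text...</p>
    <div class="thumbnail"><img src="http://..." alt="..."/></div>
  </div>
  <div class="story">
    <h3><a href="http://...">Another article title</a></h3>
    <span class="timestamp" data-utc-timestamp="...">...</span>
    ...
  </div>
</div>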

You can represent the extraction of these articles with a Parsley script like this (which we'll save in a file called nytimes__technology.let.json):

{
    "--(div.aColumn)": {
        "newsitems(div.story, div.ledeStory)": [{
            "--(h1, h3)": {"title": ".", "url": "a @href"},
            "author": ".byline",
            "timestamp": "span.timestamp @data-utc-timestamp",
            "summary": "parsley:strnl(.//p)",
            "image(img)": {"url": "@src", "alt": "@alt"}
        }]
    }
}
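
A few notes on the parselet syntax: a "--(selector)" key restricts the nested rules to the matching element and merges their results into the parent (note that no "--" key appears in the extracted output, which is why "newsitems" sits at the top level); "newsitems(div.story, div.ledeStory)" names a scope; wrapping the nested object in [...] collects one result per matching element; and parsley:strnl() is one of parslepy's extension functions, used here to extract the summary paragraphs' text content.

You can also try the parselet outside of Scrapy. A minimal sketch, assuming you saved a copy of the page as nytimes.html (a hypothetical file name):

import pprint
import parslepy

# load the extraction rules from the JSON file...
with open("parselets/nytimes__technology.let.json") as jsonfp:
    parselet = parslepy.Parselet.from_jsonfile(jsonfp)

# ...and run them on the saved page; parse() accepts a file-like object
with open("nytimes.html") as htmlfp:
    pprint.pprint(parselet.parse(htmlfp))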

You can use parslepy directly to extract the data from the HTML page; let's do that in the parse callback of the spider:

import cStringIO as StringIO
import pprint

import parslepy
from scrapy.spider import BaseSpider

class NYTimesSpider(BaseSpider):
    name = "NYTimes"
    allowed_domains = ["nytimes.com"]
    start_urls = ["http://www.nytimes.com/pages/technology/index.html"]

    def __init__(self, parseletfile=None, **kwargs):
        super(NYTimesSpider, self).__init__(**kwargs)
        if parseletfile:
            with open(parseletfile) as jsonfp:
                self.parselet = parslepy.Parselet.from_jsonfile(jsonfp)

    def parse(self, response):
        extracted = self.parselet.parse(StringIO.StringIO(response.body))
        pprint.pprint(extracted)

Here, we pass the parselet script file as a spider argument on the command line:

scrapy crawl NYTimes -a parseletfile=parselets/nytimes__technology.let.json

The extracted variable now looks like this:

{u'newsitems': [{u'author': u'By VINDU GOEL',
                 u'image': {u'alt': "A sign outside of Facebook's headquarters in Menlo Park, Calif. The company on Friday disclosed information about government requests for data, the vast majority of which did not pertain to national security matters.",
                            u'url': 'http://graphics8.nytimes.com/images/2013/06/16/business/15bits-facebook-data/15bits-facebook-data-sfSpan.jpg'},
                 u'summary': "A sign outside of Facebook's headquarters in Menlo Park, Calif. The company on Friday disclosed information about government requests for data, the vast majority of which did not pertain to national security matters.",
                 u'timestamp': None,
                 u'title': u'Facebook Discloses Basic Data on Law-Enforcement Requests',
                 u'url': 'http://bits.blogs.nytimes.com/2013/06/14/facebook-discloses-basic-data-on-law-enforcement-requests/?ref=technology'},
...
                {u'author': u'By E.C. GOGOLAK',
                 u'image': {u'alt': u"George Gasc\xf3n, San Francisco's district attorney, center, along with Attorney General Eric T. Schneiderman of New York, second from right, at a press conference Thursday to announce the formation of the Secure Our Smartphones initiative.",
                            u'url': 'http://graphics8.nytimes.com/images/2013/06/14/business/13bits-gascon-smartphone/13bits-gascon-smartphone-thumbStandard.jpg'},
                 u'summary': 'Prosecutors from New York State and San Francisco want phone makers to add features that would make stealing a smartphone pointless.',
                 u'timestamp': None,
                 u'title': u'Smartphone Makers Pressed to Address Growing Theft Problem',
                 u'url': 'http://bits.blogs.nytimes.com/2013/06/13/smartphone-makers-pressed-to-address-growing-theft-problem/?ref=technology'}]}
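
Note that timestamp is None in these items: the span.timestamp selector matched nothing in these article summaries, so parslepy filled the key with None.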

Scrapy has an ItemLoader class to help populate fields in items. Let's create a special loader to handle Parsley parselets:

import cStringIO as StringIO

from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader.processor import TakeFirst

class NYTimesItemLoader(ItemLoader):
    default_output_processor = TakeFirst()

class ParsleyItemClassLoader(object):
    def __init__(self, item_class, item_loader_class, parselet, item_key, response, **context):
        self.item_class = item_class
        self.item_loader_class = item_loader_class
        self.parselet = parselet
        self.item_key = item_key
        self.response = response
        self.context = context

    def iter_items(self):
        # run the parselet on the response body...
        self.extracted = self.parselet.parse(StringIO.StringIO(self.response.body))
        # ...and populate one item per extracted entry
        for item_value in self.extracted.get(self.item_key, []):
            loader = self.item_loader_class(self.item_class(), **self.context)
            # add_value(None, dict) spreads the dict's keys onto the item's fields
            loader.add_value(None, item_value)
            yield loader.load_item()

The item_key argument indicates under which key of the Parsley output the items are found (in our case, "newsitems", the list of articles).

And the spider becomes:

class NYTimesSpider(BaseSpider):
    name = "NYTimes"
    allowed_domains = ["nytimes.com"]
    start_urls = ["http://www.nytimes.com/pages/technology/index.html"]

    def __init__(self, parseletfile=None, **kwargs):
        super(NYTimesSpider, self).__init__(**kwargs)
        if parseletfile:
            with open(parseletfile) as jsonfp:
                self.parselet = parslepy.Parselet.from_jsonfile(jsonfp)

    def parse(self, response):
        loader = ParsleyItemClassLoader(
            NYTimesNewsItem,
            NYTimesItemLoader,
            self.parselet,
            item_key="newsitems",
            response=response)
        return loader.iter_items()
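
Run it as before; with Scrapy's feed exports you can also write the loaded items to a file (items.json is just an example output name):

scrapy crawl NYTimes -a parseletfile=parselets/nytimes__technology.let.json -o items.json -t json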