# Use parslepy with scrapy
In this example, we will consider this page http://www.nytimes.com/pages/technology/index.html.
We can write a Scrapy spider to fetch this page:

```python
from scrapy.spider import BaseSpider

class NYTimesSpider(BaseSpider):
    name = "NYTimes"
    allowed_domains = ["nytimes.com"]
    start_urls = ["http://www.nytimes.com/pages/technology/index.html"]

    def parse(self, response):
        ...
```
For each article we're interested in its title, its author, the URL of the full article, a summary, a thumbnail picture, and the publication date. Let's define a Scrapy `Item` for this data:

```python
from scrapy.item import Item, Field
from scrapy.contrib.loader.processor import TakeFirst

class NYTimesNewsItem(Item):
    title = Field(output_processor=TakeFirst())
    author = Field(output_processor=TakeFirst())
    summary = Field()
    image = Field()
    url = Field()
    timestamp = Field()
```
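The `TakeFirst` output processor simply returns the first non-null, non-empty value from the list of values collected for a field. As a rough illustration of that behavior (a minimal stand-in, not Scrapy's actual implementation), consider:

```python
def take_first(values):
    """Return the first value that is neither None nor an empty string,
    mimicking the behavior of Scrapy's TakeFirst output processor."""
    for value in values:
        if value is not None and value != '':
            return value

print(take_first([None, '', 'By VINDU GOEL', 'extra']))  # By VINDU GOEL
```

This matters here because a parselet can extract several values for the same key, while fields like `title` and `author` should hold a single string.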
If you look at the HTML source code of the page, we're interested in what's inside `<div class="aColumn">`. Article summaries are inside `div` elements, either with class `story` or class `ledeStory`. In each article we can find:

- the title of the article, in an `h1` or `h3` tag depending on the containing `div` class
- the author of the article, in an `h6` tag with class `byline`
- the URL of the full article, in an `a` element inside the title
- a summary, in a `p` paragraph with class `summary`
- an optional thumbnail picture, in an `img` inside a `div` with class `thumbnail`

You can represent the extraction of these articles with a Parsley script like this one (which we'll save in a file called `nytimes__technology.let.json`):
```json
{
    "--(div.aColumn)": {
        "newsitems(div.story, div.ledeStory)": [{
            "--(h1, h3)": {"title": ".", "url": "a @href"},
            "author": ".byline",
            "timestamp": "span.timestamp @data-utc-timestamp",
            "summary": "parsley:strnl(.//p)",
            "image(img)": {"url": "@src", "alt": "@alt"}
        }]
    }
}
```
You can use `parslepy` directly to extract the data from the HTML page (let's do that in the `parse` callback of the spider):
```python
import cStringIO as StringIO
import pprint

import parslepy
from scrapy.spider import BaseSpider

class NYTimesSpider(BaseSpider):
    name = "NYTimes"
    allowed_domains = ["nytimes.com"]
    start_urls = ["http://www.nytimes.com/pages/technology/index.html"]

    def __init__(self, parseletfile=None):
        if parseletfile:
            with open(parseletfile) as jsonfp:
                self.parselet = parslepy.Parselet.from_jsonfile(jsonfp)

    def parse(self, response):
        extracted = self.parselet.parse(StringIO.StringIO(response.body))
        pprint.pprint(extracted)
```
Here, we pass the parselet script file as an option to the spider:

```
scrapy crawl NYTimes -a parseletfile=parselets/nytimes__technology.let.json
```
The `extracted` variable now looks like this:

```python
{u'newsitems': [{u'author': u'By VINDU GOEL',
                 u'image': {u'alt': "A sign outside of Facebook's headquarters in Menlo Park, Calif. The company on Friday disclosed information about government requests for data, the vast majority of which did not pertain to national security matters.",
                            u'url': 'http://graphics8.nytimes.com/images/2013/06/16/business/15bits-facebook-data/15bits-facebook-data-sfSpan.jpg'},
                 u'summary': "A sign outside of Facebook's headquarters in Menlo Park, Calif. The company on Friday disclosed information about government requests for data, the vast majority of which did not pertain to national security matters.",
                 u'timestamp': None,
                 u'title': u'Facebook Discloses Basic Data on Law-Enforcement Requests',
                 u'url': 'http://bits.blogs.nytimes.com/2013/06/14/facebook-discloses-basic-data-on-law-enforcement-requests/?ref=technology'},
                ...
                {u'author': u'By E.C. GOGOLAK',
                 u'image': {u'alt': u"George Gasc\xf3n, San Francisco's district attorney, center, along with Attorney General Eric T. Schneiderman of New York, second from right, at a press conference Thursday to announce the formation of the Secure Our Smartphones initiative.",
                            u'url': 'http://graphics8.nytimes.com/images/2013/06/14/business/13bits-gascon-smartphone/13bits-gascon-smartphone-thumbStandard.jpg'},
                 u'summary': 'Prosecutors from New York State and San Francisco want phone makers to add features that would make stealing a smartphone pointless.',
                 u'timestamp': None,
                 u'title': u'Smartphone Makers Pressed to Address Growing Theft Problem',
                 u'url': 'http://bits.blogs.nytimes.com/2013/06/13/smartphone-makers-pressed-to-address-growing-theft-problem/?ref=technology'}]}
```
Scrapy has an `ItemLoader` class to help populate fields in items. Let's create a special loader to handle Parsley parselets:
```python
import cStringIO as StringIO

from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader.processor import TakeFirst

class NYTimesItemLoader(ItemLoader):
    default_output_processor = TakeFirst()

class ParsleyItemClassLoader(object):

    def __init__(self, item_class, item_loader_class, parselet,
                 item_key, response, **context):
        self.item_class = item_class
        self.item_loader_class = item_loader_class
        self.parselet = parselet
        self.item_key = item_key
        self.response = response

    def iter_items(self):
        self.extracted = self.parselet.parse(StringIO.StringIO(self.response.body))
        for item_value in self.extracted.get(self.item_key):
            loader = self.item_loader_class(self.item_class())
            loader.add_value(None, item_value)
            yield loader.load_item()
```
The `item_key` parameter indicates where the items are to be fetched in the Parsley-extracted data (in our case, the list of articles under the `newsitems` key).
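Stripped of the Scrapy machinery, the loader's job reduces to looking up `item_key` in the extracted mapping and building one item per entry, with the output processor keeping only the first value when a field was extracted as a list. A minimal dictionary-based sketch of that flow (the sample data is invented, and plain dicts stand in for `Item` objects):

```python
def iter_items(extracted, item_key):
    """Yield one item dict per entry found under item_key.

    When the parselet produced a list for a field, keep only the first
    element -- roughly what TakeFirst does as a default output processor.
    """
    for item_value in extracted.get(item_key, []):
        yield {
            field: value[0] if isinstance(value, list) and value else value
            for field, value in item_value.items()
        }

extracted = {"newsitems": [{"title": "Headline", "author": ["By A. WRITER"]}]}
for item in iter_items(extracted, "newsitems"):
    print(item)  # {'title': 'Headline', 'author': 'By A. WRITER'}
```

The real `ParsleyItemClassLoader` does the same walk, but delegates the per-field processing to the `ItemLoader` subclass so that Scrapy's usual input/output processors apply.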
And the spider becomes:
```python
class NYTimesSpider(BaseSpider):
    name = "NYTimes"
    allowed_domains = ["nytimes.com"]
    start_urls = ["http://www.nytimes.com/pages/technology/index.html"]

    def __init__(self, parseletfile=None):
        if parseletfile:
            with open(parseletfile) as jsonfp:
                self.parselet = parslepy.Parselet.from_jsonfile(jsonfp)

    def parse(self, response):
        loader = ParsleyItemClassLoader(
            NYTimesNewsItem,
            NYTimesItemLoader,
            self.parselet,
            item_key="newsitems",
            response=response)
        return loader.iter_items()
```