Item definition and various utilities and helpers for working with the CDR format in scrapy.
The main supported format is CDR v3.1, but there is also CDR v2 support for uploading to ES
and a v2 -> v3 converter. If you need CDR v2 support, use scrapy-cdr==0.1.2.
pip install scrapy-cdr
If you need ElasticSearch support, run this instead:
pip install scrapy-cdr[es]
scrapy-cdr requires setuptools 18.0+.
from scrapy_cdr import text_cdr_item

def parse(response):
    yield text_cdr_item(
        response,
        crawler_name='my scrapy crawler',
        team_name='my team',
        item_cls=MyCDRItem,  # optional
    )
There is also scrapy_cdr.cdr_item for non-text items, and an item definition in scrapy_cdr.CDRItem.
scrapy_cdr.media_pipeline.CDRMediaPipeline downloads media items
and puts them into the "objects" field of the CDR item according to the CDR v3 schema.
Add the pipeline to ITEM_PIPELINES in settings.py:
ITEM_PIPELINES = {
    'scrapy_cdr.media_pipeline.CDRMediaPipeline': 1,
}
Also you probably want to allow redirects in the media pipeline:
MEDIA_ALLOW_REDIRECTS = True
Set FILES_STORE as you would for the scrapy FilesPipeline.
In the crawler, put the URLs to download into the "objects" field of the CDR item, for example:
yield scrapy_cdr.utils.text_cdr_item(
    response,
    crawler_name='name',
    team_name='team',
    objects=['http://example.com/1.png', 'http://example.com/2.png'],
)
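Putting the settings mentioned above together, a settings.py fragment might look like the sketch below. The storage path is a hypothetical placeholder; the setting names are the ones named in this section.

```python
# settings.py (sketch): enable the CDR media pipeline.
ITEM_PIPELINES = {
    'scrapy_cdr.media_pipeline.CDRMediaPipeline': 1,
}

# Follow redirects when downloading media.
MEDIA_ALLOW_REDIRECTS = True

# Hypothetical local directory, configured as for the scrapy FilesPipeline.
FILES_STORE = '/path/to/media'
```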
Optionally, customize CDRMediaPipeline:
- FILES_MAX_CACHE sets the maximum size of the downloader cache, and is 10000 by default (unlike the unbounded cache used in scrapy).
- Set the CDR_S3_RELATIVE_URLS = False option to use absolute URLs in the objects array (obj_stored_url) when data is stored to S3. By default, S3 URLs are relative, i.e. they don't contain path and bucket. Local paths are always relative, regardless of this option.
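As an illustration, both options could be set in settings.py as in this sketch. The bucket name and the cache size are made-up values; only the setting names come from the text above.

```python
# settings.py (sketch): optional CDRMediaPipeline tuning.

# Hypothetical S3 location; the bucket name is a placeholder.
FILES_STORE = 's3://my-bucket/media/'

# Store absolute URLs in obj_stored_url instead of the default relative ones.
CDR_S3_RELATIVE_URLS = False

# Use a smaller downloader cache than the default of 10000 (arbitrary example value).
FILES_MAX_CACHE = 5000
```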
Optionally, subclass CDRMediaPipeline and re-define some methods:
- media_request method if you want to customize how media items are downloaded.
- s3_path method if you are storing media items in S3 (FILES_STORE is "s3://...") and want to customize the S3 URL of stored items. When URLs are absolute (CDR_S3_RELATIVE_URLS = False), the default is "https://" URLs for public items (if FILES_STORE_S3_ACL is public-read or public-read-write), and "s3://" URLs for private items (the default in scrapy).
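The default scheme choice for absolute S3 URLs described above can be summarized by the following standalone sketch. It only illustrates the rule and is not the pipeline's actual code: default_s3_scheme is a hypothetical helper, and treating a missing ACL as private is an assumption based on private being the scrapy default.

```python
# ACL values that the text above describes as public.
PUBLIC_ACLS = {'public-read', 'public-read-write'}


def default_s3_scheme(acl=None):
    """Return the URL scheme for absolute stored URLs (illustrative only)."""
    # Public ACLs map to "https://"; anything else is treated as private.
    return 'https://' if acl in PUBLIC_ACLS else 's3://'
```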
The cdr-es-upload script takes care of generating the timestamp_index field
and can be used for uploading or deleting CDR items.
Please see cdr-es-upload --help for help on command line options.
To convert CDR v2 items to v3, use the cdr-v2-to-v3 script:
cdr-v2-to-v3 items.v2.jl.gz items.v3.jl.gz --broken
Note that this script does not support media items.
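To spot-check a converted file, it can be read back with the Python standard library. This is a minimal sketch that assumes, based on the .jl.gz extension, that the output is gzip-compressed JSON lines with one item per line. read_items is a hypothetical helper, not part of scrapy-cdr.

```python
import gzip
import json


def read_items(path):
    """Yield one dict per non-empty line of a gzipped JSON lines file."""
    with gzip.open(path, 'rt', encoding='utf8') as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

For example, `for item in read_items('items.v3.jl.gz'): print(sorted(item))` would print the field names of each converted item.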
License is MIT.