Scrapy CDR

An item definition plus utilities and helpers for working with the CDR format in Scrapy. The main supported format is CDR v3.1; CDR v2 is supported only for uploading to Elasticsearch and as input to the v2 -> v3 converter. If you need full CDR v2 support, use scrapy-cdr==0.1.2.

Contents

Install
Usage
License

Install

pip install scrapy-cdr

If you need Elasticsearch support, run this instead:

pip install scrapy-cdr[es]

scrapy-cdr requires setuptools 18.0+.

Usage

Items

from scrapy_cdr import text_cdr_item

def parse(response):
    yield text_cdr_item(
        response,
        crawler_name='my scrapy crawler',
        team_name='my team',
        item_cls=MyCDRItem,  # optional: your own item class
    )

There is also scrapy_cdr.cdr_item for non-text items, and an item definition in scrapy_cdr.CDRItem.

Media items

scrapy_cdr.media_pipeline.CDRMediaPipeline downloads media items and puts them into the "objects" field of the CDR item, following the CDR v3 schema.

Add the pipeline to ITEM_PIPELINES in settings.py:
```
ITEM_PIPELINES = {
    'scrapy_cdr.media_pipeline.CDRMediaPipeline': 1,
}
```
You will probably also want to allow redirects in the media pipeline:
```
MEDIA_ALLOW_REDIRECTS = True
```
Set FILES_STORE as you would for the Scrapy FilesPipeline.
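
Taken together, the settings described so far might look like the fragment below. This is only a sketch: the FILES_STORE value is a placeholder chosen for illustration, not something this README prescribes.

```python
# settings.py (sketch; adjust values to your project)
ITEM_PIPELINES = {
    'scrapy_cdr.media_pipeline.CDRMediaPipeline': 1,
}
MEDIA_ALLOW_REDIRECTS = True
# Placeholder location; an "s3://..." URL can be used for S3 storage instead.
FILES_STORE = '/path/to/media'
```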

In the crawler, put the URLs to download into the "objects" field of the CDR item, for example:

import scrapy_cdr.utils

def parse(response):
    yield scrapy_cdr.utils.text_cdr_item(
        response,
        crawler_name='name',
        team_name='team',
        objects=['http://example.com/1.png', 'http://example.com/2.png'],
    )

Optionally, customize CDRMediaPipeline via settings:
- FILES_MAX_CACHE sets the maximum size of the downloader cache; it is 10000 by default (unlike the unbounded cache used in Scrapy).
- Set CDR_S3_RELATIVE_URLS = False to use absolute URLs in the objects array (obj_stored_url) when data is stored in S3. By default, S3 URLs are relative, i.e. they don't contain the path and bucket. Local paths are always relative, regardless of this option.

Optionally, subclass CDRMediaPipeline and redefine some methods:
- media_request, if you want to customize how media items are downloaded.
- s3_path, if you are storing media items in S3 (FILES_STORE is "s3://...") and want to customize the S3 URL of stored items. When URLs are absolute (CDR_S3_RELATIVE_URLS = False), the default is "https://" URLs for public items (when FILES_STORE_S3_ACL is public-read or public-read-write) and "s3://" URLs for private items (private is the default ACL in Scrapy).
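
The URL rules above can be summarized in a small standalone function. This is an illustrative sketch, not the pipeline's code: the function name, its parameters, and the exact "https://" URL layout are assumptions made for the example.

```python
PUBLIC_ACLS = {'public-read', 'public-read-write'}


def sketch_stored_url(path, bucket, prefix='', acl='private', relative=True):
    """Illustrate the obj_stored_url rules described above (hypothetical helper)."""
    if relative:
        # Default behaviour: only the path inside the store, no bucket or prefix.
        return path
    key = '{}{}'.format(prefix, path)
    if acl in PUBLIC_ACLS:
        # Assumed layout for public objects; the real pipeline may build it differently.
        return 'https://{}.s3.amazonaws.com/{}'.format(bucket, key)
    # Private objects keep an s3:// URL.
    return 's3://{}/{}'.format(bucket, key)
```

For example, sketch_stored_url('full/abc.png', 'my-bucket', 'media/') returns the relative path unchanged, while passing relative=False yields an "s3://" URL for the default private ACL and an "https://" URL when acl is 'public-read'.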

Uploading to Elasticsearch

The cdr-es-upload script takes care of generating the timestamp_index field and can be used to upload or delete CDR items. See cdr-es-upload --help for the command line options.

Converting from CDR v2 format

Use the cdr-v2-to-v3 script:

cdr-v2-to-v3 items.v2.jl.gz items.v3.jl.gz --broken

Note that this script does not support media items.
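
To inspect the converted output, a short reader like the one below may help. It is a sketch that assumes the files are gzip-compressed JSON lines (one JSON object per line), as the .jl.gz extension suggests; iter_jl_gz is a hypothetical helper name, not part of scrapy-cdr.

```python
import gzip
import json


def iter_jl_gz(path):
    """Yield one decoded item per non-empty line of a gzipped JSON lines file."""
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Calling list(iter_jl_gz('items.v3.jl.gz')) would load every item into memory; iterating lazily is preferable for large crawls.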

License

License is MIT.
