- A cdr-kafka-upload command for uploading CDR results to Kafka (use scrapy-cdr[kafka] to install dependencies).
Packaging fixes:
- elasticsearch dependencies are optional now;
- futures package is no longer installed in Python 3.
Better CDR 3.1 compatibility:
- by default CDRMediaPipeline outputs relative URLs for stored files instead of absolute URLs;
- new CDR_S3_RELATIVE_URLS option which allows switching back to absolute URLs (set it to False);
- version number of CDR items is set to 3.1 instead of 3.0.
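A possible settings fragment for switching back to absolute URLs, as a sketch only. The pipeline path and the CDR_S3_RELATIVE_URLS name come from this changelog; the priority value and the bucket name are placeholders, and configuring storage through FILES_STORE is an assumption based on Scrapy's standard FilesPipeline.

```python
# settings.py (sketch; values are illustrative placeholders)
ITEM_PIPELINES = {
    # Pipeline path as named in this changelog; the priority 1 is arbitrary.
    "scrapy_cdr.media_pipeline.CDRMediaPipeline": 1,
}

# Assumption: storage is configured the same way as Scrapy's FilesPipeline.
FILES_STORE = "s3://example-bucket/media/"  # hypothetical bucket name

# Relative URLs are the default; False restores absolute URLs.
CDR_S3_RELATIVE_URLS = False
```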
- cdr-es-upload: fix in --reverse-domains option, use logging, allow setting log file and log level
- cdr-es-download: add filtering by --id
- CDRMediaPipeline does not keep extensions in file names
- CDRMediaPipeline limits downloader cache by default to 10k items
- an option to put files in a reverse domain folder structure for cdr-es-upload (this also strips extensions)
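To illustrate what a reverse domain folder structure can look like, here is a minimal sketch. The helper name reverse_domain_path and the exact layout (reversed host labels followed by a SHA-1 of the URL, with no extension) are assumptions made for illustration; the changelog does not specify the layout the tool actually uses.

```python
import hashlib
from urllib.parse import urlsplit


def reverse_domain_path(url: str) -> str:
    """Hypothetical helper: map a URL to 'com/example/www/<sha1-of-url>'."""
    host = urlsplit(url).hostname or ""
    # Reverse the host labels so files from one domain share a folder prefix.
    parts = [label for label in reversed(host.split(".")) if label]
    # Use a hash of the full URL as the file name; no extension is kept.
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return "/".join(parts + [digest])
```

With this layout, all files fetched from www.example.com end up under com/example/www/, which keeps per-domain listings cheap.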
- cdr-es-upload fixes: run in constant memory, proceed after ES upload errors (e.g. exceeding upload size).
- cdr-es-upload fixes: add --max-chunk-bytes and set it to 10 MB by default (was 100 MB before), proceed after indexing errors.
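The idea behind a byte limit per upload chunk can be sketched in plain Python. This is not the script's implementation: chunk_by_bytes is a hypothetical helper, and serializing each document as one JSON line is an assumption. The option name resembles the max_chunk_bytes parameter of the elasticsearch-py bulk helpers, but whether the script passes the value through to them is not stated here.

```python
import json
from typing import Iterable, Iterator, List


def chunk_by_bytes(docs: Iterable[dict], max_chunk_bytes: int) -> Iterator[List[str]]:
    """Group serialized docs so each chunk stays at or under max_chunk_bytes.

    A single doc larger than the limit is still emitted, alone in its chunk.
    """
    chunk: List[str] = []
    size = 0
    for doc in docs:
        line = json.dumps(doc)
        line_size = len(line.encode("utf-8")) + 1  # +1 for a newline separator
        if chunk and size + line_size > max_chunk_bytes:
            yield chunk
            chunk, size = [], 0
        chunk.append(line)
        size += line_size
    if chunk:
        yield chunk
```

Because the function is a generator, only one chunk is held in memory at a time, which fits the constant-memory goal mentioned in the neighbouring entry.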
- Fix file extension handling in CDRMediaPipeline: now only the URL path is used (without query and fragment).
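A minimal sketch of deriving an extension from the URL path alone, using only the standard library. The function name url_extension is hypothetical and this is not the pipeline's own code; it only shows why dropping the query and fragment matters.

```python
import os
from urllib.parse import urlsplit


def url_extension(url: str) -> str:
    """Return the extension of the URL path, ignoring query and fragment."""
    path = urlsplit(url).path
    return os.path.splitext(path)[1]


# The ".png" in the query string is ignored; only the path is inspected.
print(url_extension("http://example.com/img/a.jpg?size=big.png#top"))  # .jpg
```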
- Support CDR v3.1 format (add response_headers)
- Add cdr-es-download script to download data from CDR
- CDRMediaPipeline: use "https://" URL for public media items on S3
- cdr-es-upload: log exceptions when reading data to upload
- Updated to CDR v3 (breaking change)
- Added cdr-v2-to-v3 script for CDR v2 -> v3 conversion
- Added cdr-es-upload script for Elasticsearch upload
- Added scrapy_cdr.media_pipeline.CDRMediaPipeline to help with media item downloading
- Do not fail on responses without content-type header
- Allow passing a custom item class
Initial release