Changes

  • A cdr-kafka-upload command for uploading CDR results to Kafka (use scrapy-cdr[kafka] to install dependencies).

Packaging fixes:

  • elasticsearch dependencies are optional now;
  • futures package is no longer installed in Python 3.

Better CDR 3.1 compatibility:

  • by default CDRMediaPipeline outputs relative URLs for stored files instead of absolute URLs;
  • new CDR_S3_RELATIVE_URLS option, which allows switching back to absolute URLs (set it to False);
  • version number of CDR items is set to 3.1 instead of 3.0.
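
As a minimal sketch of switching back to absolute URLs, the option named above could be set in a Scrapy project's settings module. The file name ``settings.py`` is the usual Scrapy convention and is assumed here; only the option name and the ``False`` value come from this changelog.

```python
# settings.py of a Scrapy project (assumed location for the option).
# CDR_S3_RELATIVE_URLS is the option described in the changelog entry above;
# setting it to False is stated there to restore absolute URLs
# for files stored by CDRMediaPipeline.
CDR_S3_RELATIVE_URLS = False
```

Leaving the option unset keeps the new default, in which stored files are referenced by relative URLs.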
  • cdr-es-upload: fix in --reverse-domains option, use logging, allow setting log file and log level
  • cdr-es-download: add filtering by --id
  • CDRMediaPipeline does not keep extensions in file names
  • CDRMediaPipeline limits downloader cache by default to 10k items
  • an option to put files in a reverse domain folder structure for cdr-es-upload (this also strips extensions)
  • cdr-es-upload fixes: run in constant memory, proceed after ES upload errors (e.g. exceeding upload size).
  • cdr-es-upload fixes: add --max-chunk-bytes and set it to 10 MB by default (was 100 MB before), proceed after indexing errors.
  • Fix file extension handling in CDRMediaPipeline: now only the URL path is used (without query and fragment).
  • Support CDR v3.1 format (add response_headers)
  • Add cdr-es-download script to download data from CDR
  • CDRMediaPipeline: use "https://" URL for public media items on S3
  • cdr-es-upload: log exceptions when reading data to upload
  • Updated to CDR v3 (breaking change)
  • Added cdr-v2-to-v3 script for CDR v2 -> v3 conversion
  • Added cdr-es-upload script for Elasticsearch upload
  • Added scrapy_cdr.media_pipeline.CDRMediaPipeline to help with media item downloading
  • Do not fail on responses without content-type header
  • Allow passing a custom item class
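
The extension-handling fix listed above (using only the URL path, without query and fragment) can be illustrated with a small standalone sketch. The helper name ``url_extension`` is hypothetical and is not taken from scrapy-cdr; the code only shows the general idea using the Python standard library, not the pipeline's actual implementation.

```python
import posixpath
from urllib.parse import urlsplit


def url_extension(url: str) -> str:
    """Return the file extension found in the URL path only.

    Hypothetical helper for illustration: the query string and the
    fragment are discarded before looking for an extension.
    """
    # urlsplit separates the path from the query and fragment parts.
    path = urlsplit(url).path
    # splitext returns (root, ext); ext includes the leading dot or is "".
    return posixpath.splitext(path)[1]


if __name__ == "__main__":
    print(url_extension("http://example.com/a/b.jpg?size=large#top"))
    print(url_extension("http://example.com/download?file=x.png"))
```

In the second call the ``.png`` appears only in the query string, so no extension is reported; taking the extension from the full URL string would get this case wrong.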

Initial release