W.X.Y.Z
- W - Package version
- X - Major version. Introduces backward incompatibilities.
- Y - Minor version. Introduces new features.
- Z - Micro version. Fixes and improvements.
Fixes and performance improvements are made continuously along the way and are not always listed here.
In general, you can catch most incompatibilities in advance by making generous use of typing hints and running mypy in your project.
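For instance, typing your overrides lets mypy flag signature drift before deployment. A minimal self-contained sketch (this is not shub-workflow code, just an illustration of the check):

```python
# mypy_demo.py: a self-contained illustration of the kind of override
# incompatibility that mypy catches before runtime.
class Base:
    def outcome(self, job_id: str) -> bool:
        return True


class Mine(Base):
    # mypy reports: Return type "int" of "outcome" incompatible with
    # return type "bool" in supertype "Base"
    def outcome(self, job_id: str) -> int:
        return 1
```

Running `mypy mypy_demo.py` reports the incompatibility without executing anything.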
1.9
Backward incompatibility issues:
- Python versions older than 3.8 are no longer supported
- The name attribute is now required for every subclass of WorkFlowManager, either hardcoded as a class attribute or passed via required command line arguments (see the sketch after this list)
- The delivery script has been refactored. The old code is deprecated and will be removed in future versions.
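A minimal sketch of the new name requirement (the import path below is an assumption for illustration):

```python
from shub_workflow.base import WorkFlowManager  # import path assumed


class MyWorkflowScript(WorkFlowManager):
    # Since 1.9, every subclass must define a name, either hardcoded
    # like this or supplied through the required command line arguments.
    name = "my-workflow"
```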
Minor version changes:
- 1.9.0 crawlmanager `bad_outcome_hook()` (see the sketch after this list)
- 1.9.1 crawlmanager ability to resume
- 1.9.2 implicit crawlmanager resumability via flow id (previously a command line option was required)
- 1.9.3 performance improvements
- 1.9.4 name attribute logic fixes
- 1.9.5 delivery script refactor, old delivery code deprecated.
- 1.9.6 diverse fixes and improvements in base script; added `get_jobs_with_tags()` method
- 1.9.7 refactors and new S3 helper method
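The `bad_outcome_hook()` added in 1.9.0 is the place to react to failed jobs. A hedged sketch: the hook name comes from this changelog, but the import path and parameters here are assumptions:

```python
import logging

from shub_workflow.crawl import CrawlManager  # import path assumed

logger = logging.getLogger(__name__)


class MyCrawlManager(CrawlManager):
    name = "my-crawlmanager"

    # Parameter names below are illustrative assumptions.
    def bad_outcome_hook(self, spider, outcome, spider_args_override, jobkey):
        # React to a job that finished badly, e.g. alert or adjust retry args.
        logger.warning("Job %s (%s) finished with outcome %r", jobkey, spider, outcome)
```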
1.10
Backward incompatibility issues:
- Backward incompatibilities may arise from the massive introduction of typing hints and fixes to type consistency, especially when you override some methods. Many of the incompatibilities you may encounter can be detected in advance by using mypy in your project and using typing abundantly in the classes that use shub-workflow.
Minor version changes:
- 1.10.1 continuation of typing hints massive adoption
- 1.10.2 continuation of typing hints massive adoption
- 1.10.3 new BaseLoopScript class
- 1.10.4 `script_args()` context manager
- 1.10.5 some refactor and improvements in new delivery script
- 1.10.6 crawl manager new async schedule mode
- 1.10.7 crawl manager extension of async scheduling
- 1.10.8 max running time for all scripts
- 1.10.9 performance improvements of resume feature
- 1.10.10 new method for async tagging of jobs
- 1.10.11 async tagging on delivery script
- 1.10.12 introduction of script stats
- 1.10.13 crawlmanager stats, delivery stats, AWS email utils.
- 1.10.14 graph manager ability to resume
- 1.10.15 graph manager `bad_outcome_hook()`
- 1.10.16 max running time feature on delivery script
- 1.10.17 spider loader object in base script
- 1.10.18 default implementation of `bad_outcome_hook()` on generator crawl manager
- 1.10.19 some refactoring of base script
- 1.10.20 base script: resolve project id before parsing options
- 1.10.21 minor improvement in base script spider loader object
- 1.10.22 crawlmanager `finished_ok_hook()` method
- 1.10.23 generator crawlmanager that handles multiple spiders in the same process (not a good experiment; will probably be deprecated in the future). A sketch of the generator crawl manager pattern follows this list.
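Several 1.10 entries build on the generator crawl manager. A hedged sketch of the pattern; the import path and the `set_parameters_gen()` method here are assumptions about the API:

```python
from shub_workflow.crawl import GeneratorCrawlManager  # import path assumed


class MyGeneratorCrawlManager(GeneratorCrawlManager):
    name = "my-generator-crawlmanager"

    # Assumed API: yield one dict of spider arguments per job to schedule;
    # the manager schedules jobs from this generator as slots free up.
    def set_parameters_gen(self):
        for country in ("us", "uk", "de"):
            yield {"country": country}
```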
1.11
Backward incompatibility issues:
- Definitively removed the old legacy delivery class
- Dupefilter classes moved from the delivery folder into utils
Minor version changes:
- 1.11.1 replaced the `bloom_filter` dependency with `bloom_filter2`
- 1.11.2 typing improvements
- 1.11.3 method for removing jobs tags
- 1.11.4 method for reading the log level from kumo and using it in the script `main()` function
- 1.11.5 added methods for working with Google Cloud Storage, with the same interface as the already existing ones for AWS S3 (see the parity sketch after this list)
- 1.11.6 generator crawlmanager: method for computing max jobs per spider
- 1.11.7 GCS additions
- 1.11.8 typing hint updates
- 1.11.9 added support for python 3.11
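Because the GCS helpers mirror the S3 interface (1.11.5), switching backends should amount to changing the URI scheme. A hypothetical sketch: `upload_file` is named later in this changelog, while the import path and URI-based signature are assumptions:

```python
from shub_workflow.deliver.futils import upload_file  # path and signature assumed

upload_file("items.jl", "s3://my-bucket/exports/items.jl")  # AWS S3
upload_file("items.jl", "gs://my-bucket/exports/items.jl")  # Google Cloud Storage
```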
1.12
Backward incompatibility issues:
- Made installation of S3 and GCS dependencies optional (with shub-workflow[with-s3-tools] and shub-workflow[with-gcs-tools]); see the example below
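For example:

```
pip install shub-workflow[with-s3-tools]
pip install shub-workflow[with-gcs-tools]
```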
Minor version changes:
- 1.12.1 more typing addition and improvements, and related refactoring
- 1.12.2 generator crawlmanager: some methods for conditioning scheduling of new spider jobs
- 1.12.3 reimplementation of the `upload_file` S3 util using boto3, for improved performance (see the sketch after this list)
- 1.12.4 improvements in the job cloner class
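As an illustration of 1.12.3, here is a minimal boto3-based `upload_file` sketch; it shows the boto3 API being adopted, not shub-workflow's actual implementation:

```python
import boto3


def upload_file(local_path: str, s3_uri: str) -> None:
    # Split "s3://bucket/key" into bucket and key.
    bucket, _, key = s3_uri[len("s3://"):].partition("/")
    # boto3's managed transfer handles multipart uploads and retries under
    # the hood, which is where the performance improvement comes from.
    boto3.client("s3").upload_file(local_path, bucket, key)
```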
1.13
Backward incompatibility issues:
- Some changes in delivery script interface
Minor version changes:
- 1.13.0 new configuration watchdog script
- 1.13.1 generator crawlmanager: added method for determining retry parameters when a job is retried from the default `bad_outcome_hook()` method
- 1.13.2 generator crawlmanager: additional features for handling multiple spiders
- 1.13.3 configurable retry logging via environment variables (a generic sketch follows this list)
- 1.13.4 base script: some handlers for scheduling methods
- 1.13.5 additions in GCS utils
- 1.13.6 additions in GCS utils
- 1.13.7 base script: print stats on close
- 1.13.8 avoid multiple warnings from the `kumo_settings()` function; additions in GCS utils
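A generic sketch of the environment-driven configuration mentioned in 1.13.3; both the variable name and the logger name here are hypothetical, since the changelog doesn't record the real ones:

```python
import logging
import os

# Hypothetical variable and logger names, used only for illustration.
level = os.environ.get("SHUB_RETRY_LOG_LEVEL", "INFO")
logging.getLogger("shub_workflow.retry").setLevel(getattr(logging, level, logging.INFO))
```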
1.14
Backward incompatibility issues:
- Moved filesystem utils (S3 and GCS utils) from the deliver folder into the utils folder
Minor version changes:
- 1.14.0 generator crawl manager: acquire all jobs if the crawlmanager doesn't have a flow id. Added `get_canonical_spidername()` and `get_project_running_spiders()` helper methods on base script. Added the fshelper base script attribute for ready access to this helper tool.
- 1.14.1 Added method for getting alive real time settings from ScrapyCloud
- 1.14.2 Added BaseMonitor class, which can monitor aggregated stats across the jobs of an entire workflow.
- 1.14.3 Mixin for provision of Sentry alert capabilities in monitor.
- 1.14.4 Monitor ratios
- 1.14.5 Finished job metadata hook
- 1.14.6 Allow using the project entry keyword in scrapinghub.yml as an alternative to the numeric project id when passing the --project-id command line option.
- 1.14.7 Mixin for provision of Slack alert capabilities in monitor.
- 1.14.8 Allow loading settings from SC when running a script in a local environment.
- 1.14.9 Added stats aggregation capabilities to crawlmanager
- 1.14.10 AlertSender class to allow combined Slack and Sentry alerts (a hedged sketch of the monitor pieces follows)
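Putting the 1.14 monitoring pieces together, a heavily hedged sketch: BaseMonitor is named above, but the mixin class names and import paths here are assumptions for illustration only.

```python
from shub_workflow.monitor import BaseMonitor  # import path assumed

# Hypothetical names standing in for the Sentry (1.14.3) and
# Slack (1.14.7) alert mixins described above.
from shub_workflow.monitor.alerts import SentryAlertMixin, SlackAlertMixin  # assumed


class MyWorkflowMonitor(SentryAlertMixin, SlackAlertMixin, BaseMonitor):
    name = "my-workflow-monitor"
```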