Releases: scrapinghub/frontera
Strategy worker hardening and bug fixes
From now on, the strategy worker continues to operate after an internal exception. There were also minor improvements.
Bug fix
The graphs import was removed from the frontera module, so SQLAlchemy is no longer required when it isn't used.
Crawling strategy improvements and native logging
Here is the change log:
- latest SQLAlchemy unicode-related crashes are fixed,
- a corporate-website-friendly canonical solver has been added,
- the crawling strategy concept evolved: an arbitrary URL can now be added to the queue (with a transparent state check), and `FrontierManager` is available on construction,
- strategy worker code was refactored,
- a default state was introduced for links generated during crawling strategy operation,
- got rid of Frontera logging in favor of Python native logging,
- the logging system can now be configured from a file by means of `logging.config` (see the sketch after this list),
- partitions can now be assigned to instances from the command line,
- improved test coverage from @Preetwinder.
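Here is a minimal sketch of file-based logging configuration via the standard library's `logging.config`; the file name and logger name are illustrative assumptions, not Frontera's shipped defaults.

```python
# a minimal sketch of file-based logging configuration; 'logging.conf' is an
# assumed file with standard [loggers]/[handlers]/[formatters] sections
import logging
import logging.config

logging.config.fileConfig('logging.conf', disable_existing_loggers=False)

logger = logging.getLogger('frontera')  # hypothetical logger name
logger.info('logging configured from file')
```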
Enjoy!
Kafka-python bug fix release
This release prevents installing kafka-python package versions newer than 0.9.5. Newer versions have significant architectural changes and require Frontera code adaptation and testing. If you are using the Kafka message bus, you are encouraged to install this update.
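In practice the pin amounts to a version constraint like the one below; this is a hedged sketch of a setuptools-based setup.py for a project depending on the Kafka message bus, not Frontera's actual file.

```python
# an illustrative dependency pin in a setuptools-based setup.py
from setuptools import setup

setup(
    name='my-crawler',  # hypothetical project using Frontera's Kafka message bus
    install_requires=[
        'kafka-python<=0.9.5',  # newer versions need adaptation and testing
    ],
)
```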
Bug fix release
- fixed API docs generation on RTD,
- added `body` field in Request objects, to support POST-type requests (see the sketch after this list),
- guidance on how to set `MAX_NEXT_REQUESTS`, and settings docs fixes,
- fixed colored logging.
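Here is a minimal sketch of a POST request carrying a payload in the new field; the keyword arguments are assumptions based on `frontera.core.models.Request`, and the endpoint is illustrative.

```python
# a minimal sketch of building a POST request with the new body field;
# the exact keyword arguments are assumptions
from frontera.core.models import Request

request = Request(
    url='https://example.com/api/search',  # hypothetical endpoint
    method='POST',
    headers={'Content-Type': 'application/x-www-form-urlencoded'},
    body='query=frontera&page=1',  # payload travels in the new body field
)
```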
Distributed and easy to use
A tremendous amount of work was done:
- `distributed-frontera` and `frontera` were merged into a single project, to make it easier to use and understand,
- Backend was completely redesigned: it now consists of `Queue`, `Metadata`, and `States` objects for low-level code, and higher-level `Backend` implementations for crawling policies (see the sketch after this list),
- added definitions of run modes: single process, distributed spiders, and distributed spiders and backend.
- The overall distributed concept is now integrated into Frontera, making the difference between using components in single-process and distributed spiders/backend run modes clearer.
- Significantly restructured and augmented documentation, addressing user needs in a more accessible way.
- Much smaller configuration footprint.
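To illustrate the new layout, here is a hedged sketch of a higher-level backend exposing the low-level objects; the base-class location and property names are assumptions based on the description above, and the component values are placeholders.

```python
# a hedged sketch of the redesigned Backend layout; the import path and
# property names are assumptions, and the components are placeholders
from frontera.core.components import Backend

class PolicyBackend(Backend):
    """Hypothetical crawling-policy backend built from low-level components."""

    def __init__(self, manager):
        # each component would be a storage-specific implementation
        self._queue = None     # Queue: requests ordered for fetching
        self._metadata = None  # Metadata: per-document information
        self._states = None    # States: per-URL crawl state

    @property
    def queue(self):
        return self._queue

    @property
    def metadata(self):
        return self._metadata

    @property
    def states(self):
        return self._states
```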
Enjoy this New Year release and let us know what you think!
Numerous bug fixes and improvements
- tldextract is no longer a minimum required dependency,
- SQLAlchemy backend now persists headers, cookies, and method; a `_create_page` method was also added to ease customization (see the sketch after this list),
- canonical solver code (needs documentation),
- other fixes and improvements.
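As an illustration of the customization hook, here is a hedged sketch of overriding `_create_page`; the backend import path and the method's signature are assumptions, not confirmed API.

```python
# a hedged sketch of customizing page creation via the _create_page hook;
# the import path and method signature are assumptions
from frontera.contrib.backends.sqlalchemy import SQLAlchemyBackend

class CustomSQLAlchemyBackend(SQLAlchemyBackend):
    def _create_page(self, obj):
        # let the base class build the page record, then decorate it
        page = super(CustomSQLAlchemyBackend, self)._create_page(obj)
        page.depth = obj.meta.get('depth', 0)  # hypothetical extra field
        return page
```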
Frontera configuration from Scrapy settings
Now it's possible to configure Frontera from Scrapy settings. The order of precedence for configuration sources is the following (a sketch follows the list):
- settings defined in the module pointed to by FRONTERA_SETTINGS (highest precedence),
- settings defined in the Scrapy settings,
- default frontier settings.
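Here is a minimal sketch of what this looks like in a Scrapy project's settings.py; the module path and the particular frontier options are illustrative assumptions.

```python
# a minimal sketch of a Scrapy settings.py configuring Frontera;
# module path and option values are illustrative
FRONTERA_SETTINGS = 'myproject.frontera_settings'  # highest precedence

# frontier settings may also be defined directly among Scrapy settings,
# at lower precedence than the FRONTERA_SETTINGS module
BACKEND = 'frontera.contrib.backends.memory.FIFO'
MAX_NEXT_REQUESTS = 10
```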
Better support of ordinary Scrapy spiders and cold start problem fix
The main issue solved in this version is that request callbacks and request.meta contents are now successfully serialized and deserialized in the SQLAlchemy-based backend. Therefore, the majority of Scrapy extensions shouldn't suffer from losing meta or callbacks when requests pass over Frontera anymore. Second, there is a hotfix for the cold start problem, where seeds are added and Scrapy quickly finishes with no further activity. A well-thought-out solution for this will be offered later.
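To illustrate, here is a minimal Scrapy spider sketch; with this fix, the callback and meta below should survive the trip through the SQLAlchemy-based backend. The spider and URLs are purely illustrative.

```python
# a minimal illustrative Scrapy spider; the request's callback and meta are
# now preserved when requests are serialized by the SQLAlchemy backend
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com/']

    def parse(self, response):
        yield scrapy.Request(
            response.urljoin('/section'),
            callback=self.parse_section,  # callback survives serialization
            meta={'category': 'news'},    # meta contents survive as well
        )

    def parse_section(self, response):
        self.logger.info('category=%s', response.meta.get('category'))
```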
New name, improved scheduling and other
- Frontera is the new name for Crawl Frontier.
- The signature of the get_next_requests method has changed; it now accepts arbitrary keyword arguments (see the sketch after this list).
- Added an overused buffer (subject to removal in the future in favor of the downloader's internal queue).
- Backend internals became more customizable.
- The scheduler now asks for new requests when there is free space in the Scrapy downloader queue, instead of waiting for it to be completely empty.
- Several Frontera middlewares are disabled by default.
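For illustration, here is a hedged sketch of the new signature in use; `downloader_info` is an assumed keyword, standing in for whatever key-value arguments a backend cares to receive.

```python
# a hedged sketch of the changed get_next_requests signature; the keyword
# argument shown is illustrative and is simply forwarded to the backend
from frontera.core.manager import FrontierManager

manager = FrontierManager.from_settings()
next_requests = manager.get_next_requests(
    max_next_requests=64,
    downloader_info={'free_slots': 16},  # arbitrary kwargs reach the backend
)
```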