
Releases: scrapinghub/frontera

Strategy worker hardening and bug fixes

21 Jun 15:41

From now on, the strategy worker continues to operate after an internal exception. There are also some minor improvements.

Bug fix

02 Jun 18:16

The graphs import was removed from the frontera module, so SQLAlchemy is no longer required when it isn't used.

Crawling strategy improvements and native logging

01 Jun 14:20

Here is the change log:

  • latest SQLAlchemy unicode-related crashes are fixed,
  • a corporate-website-friendly canonical solver has been added,
  • the crawling strategy concept evolved: an arbitrary URL can now be added to the queue (with a transparent state check), and FrontierManager is available at construction time (see the sketch after this list),
  • the strategy worker code was refactored,
  • a default state was introduced for links generated during crawling strategy operation,
  • Frontera's own logging was dropped in favor of native Python logging,
  • the logging system can be configured via logging.config using a file,
  • partitions can now be assigned to instances from the command line,
  • improved test coverage from @Preetwinder.
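
For illustration, here is a minimal sketch of a crawling strategy using the evolved concept: scheduling an arbitrary URL alongside discovered links, with a transparent state check. The base class path, method names and the schedule()/create_request() helpers are assumptions based on the notes above, not the exact Frontera API.

    # Sketch only; class path, method names and signatures are assumptions.
    from frontera.worker.strategies import BaseCrawlingStrategy  # assumed import path


    class ExampleStrategy(BaseCrawlingStrategy):
        def add_seeds(self, seeds):
            for seed in seeds:
                self.schedule(seed, score=1.0)  # highest priority for seeds

        def page_crawled(self, response, links):
            for link in links:
                # Links produced here get the new default state and a
                # transparent state check when scheduled.
                self.schedule(link, score=0.5)
            # Arbitrary URLs, not just extracted links, can now be queued too.
            extra = self.create_request("http://example.com/sitemap.xml")
            self.schedule(extra, score=0.9)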

Enjoy!

Kafka-python bug fix release

22 Apr 14:45

This release prevents installation of kafka-python versions newer than 0.9.5. Newer versions have significant architectural changes and require Frontera code adaptation and testing. If you are using the Kafka message bus, you are encouraged to install this update.

Bug fix release

18 Jan 10:30

  • fixed API docs generation on RTD,
  • added a body field to Request objects to support POST-type requests (see the sketch after this list),
  • added guidance on how to set MAX_NEXT_REQUESTS, along with settings docs fixes,
  • fixed colored logging.
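
As a quick illustration of the new body field, a POST-type request might be built like this; the exact Request constructor signature is assumed from the note above rather than copied from the docs.

    # Sketch only; the frontera.core.models.Request signature is assumed.
    from frontera.core.models import Request

    login = Request(
        url="http://example.com/login",
        method="POST",                                        # POST-type requests
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        body="user=alice&password=secret",                    # new body field
    )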

Distributed and easy to use

30 Dec 20:23

A tremendous amount of work was done:

  • distributed-frontera and frontera were merged into a single project, to make it easier to use and understand,
  • Backend was completely redesigned: it now consists of Queue, Metadata and States objects for low-level code, plus higher-level Backend implementations for crawling policies (see the sketch after this list),
  • run modes are now defined: single process, distributed spiders, and distributed spiders and backend,
  • the overall distributed concept is now integrated into Frontera, making the difference between using components in single-process and distributed spiders/backend run modes clearer,
  • the documentation was significantly restructured and augmented, addressing user needs in a more accessible way,
  • much smaller configuration footprint.
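
To make the redesigned Backend composition more concrete, here is a structural sketch only: in the real code the backend would subclass the corresponding Frontera interfaces, and the component class names below are made up for the example.

    # Standalone structural sketch; not the real Frontera interfaces.

    class SketchQueue(object):
        """Low-level queue component: stores and hands out requests."""
        def __init__(self):
            self._requests = []

        def schedule(self, requests):
            self._requests.extend(requests)

        def get_next_requests(self, max_n, **kwargs):
            batch, self._requests = self._requests[:max_n], self._requests[max_n:]
            return batch

    class SketchMetadata(object):
        """Low-level metadata component: records what was crawled."""
        def page_crawled(self, response, links):
            pass  # persist response metadata and discovered links

    class SketchStates(object):
        """Low-level states component: tracks per-document crawl state."""
        def update_cache(self, objs):
            pass  # cache states such as QUEUED, CRAWLED, ERROR

    class SketchBackend(object):
        """Higher-level backend: a crawling policy composed of the three parts."""
        def __init__(self):
            self.queue = SketchQueue()
            self.metadata = SketchMetadata()
            self.states = SketchStates()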

Enjoy this new year release and let us know what you think!

Numerous bug fixes and improvements

29 Sep 17:08

  • tldextract is no longer a minimum required dependency,
  • the SQLAlchemy backend now persists headers, cookies and the request method; a _create_page method was also added to ease customization (see the sketch after this list),
  • canonical solver code (needs documentation),
  • other fixes and improvements.
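
As a rough example of the _create_page customization hook, a subclass could override it to store an extra field; the base class import path, the hook's signature and the depth column are all assumptions made for illustration.

    # Sketch only; import path, signature and the extra column are assumptions.
    from frontera.contrib.backends.sqlalchemy import FIFO  # assumed base class

    class DepthAwareBackend(FIFO):
        def _create_page(self, obj):
            page = super(DepthAwareBackend, self)._create_page(obj)
            # Hypothetical extra column populated from request meta.
            page.depth = obj.meta.get('depth', 0)
            return page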

Frontera configuration from Scrapy settings

19 Jun 09:14

It is now possible to configure Frontera from Scrapy settings. The order of precedence for configuration sources is as follows (see the example after this list):

  1. settings defined in the module pointed to by FRONTERA_SETTINGS (highest precedence),
  2. settings defined in the Scrapy settings,
  3. default frontier settings (lowest precedence).
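
For example, the three sources might be combined like this; the module path and option values below are illustrative only.

    # myproject/settings.py -- Scrapy settings module
    FRONTERA_SETTINGS = 'myproject.frontera_settings'    # 1. highest-precedence source
    MAX_REQUESTS = 100                                    # 2. Frontera option set via Scrapy settings

    # myproject/frontera_settings.py -- Frontera settings module
    BACKEND = 'frontera.contrib.backends.memory.FIFO'     # overrides the Scrapy-level value, if any
    MAX_NEXT_REQUESTS = 256

    # Anything not set in either place falls back to 3. the default frontier settings.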

Better support for ordinary Scrapy spiders and a cold start problem fix

25 May 14:11

The main issue solved in this version is that request callbacks and request.meta contents are now successfully serialized and deserialized in the SQLAlchemy-based backend. Therefore, the majority of Scrapy extensions should no longer suffer from losing meta or callbacks when requests pass through Frontera (sketched below). Second, there is a hotfix for the cold start problem, where seeds are added and Scrapy quickly finishes with no further activity; a well-thought-out solution for this will be offered later.
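
In practice this means a plain Scrapy spider can keep relying on callbacks and meta as usual; the spider below is made up for illustration, the point being that callback and meta now survive the round trip through the SQLAlchemy-based backend.

    # Illustrative spider only; names are made up.
    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = 'example'
        start_urls = ['http://example.com/catalog']

        def parse(self, response):
            for href in response.css('a.product::attr(href)').extract():
                yield scrapy.Request(
                    response.urljoin(href),
                    callback=self.parse_product,   # preserved across the frontier
                    meta={'category': 'catalog'},  # preserved across the frontier
                )

        def parse_product(self, response):
            yield {'url': response.url, 'category': response.meta['category']}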

New name, improved scheduling and other

15 Apr 13:19

  • Frontera is the new name for Crawl Frontier.
  • The signature of the get_next_requests method has changed; it now accepts arbitrary key-value arguments (see the sketch after this list).
  • Overused buffer (subject to removal in the future in favor of the downloader's internal queue).
  • Backend internals became more customizable.
  • The scheduler now asks for new requests when there is free space in the Scrapy downloader queue, instead of waiting for it to be completely empty.
  • Several Frontera middlewares are disabled by default.
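
As a hedged sketch of what the changed get_next_requests signature enables, a backend can now receive extra hints as keyword arguments; the class below is a standalone example and the overused_keys argument name is hypothetical, not part of the documented API.

    # Standalone sketch; not the real Backend interface.

    class SketchBackend(object):
        def __init__(self, pending_requests):
            self._pending = list(pending_requests)

        def get_next_requests(self, max_next_requests, **kwargs):
            # Arbitrary key-value hints can travel from the scheduler to the
            # backend, e.g. hosts the downloader is already overloaded with.
            overused = set(kwargs.get('overused_keys', ()))
            batch = [r for r in self._pending if r.url not in overused][:max_next_requests]
            for request in batch:
                self._pending.remove(request)
            return batch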