
Managing Hubstorage Crawl Frontiers ‐ The Modern way

Martin Olveyra edited this page Nov 29, 2024 · 5 revisions

Handling crawl frontiers with the frontera and scrapy-frontera libraries has usually been painful in complex projects. Maintainability and traceability also suffer when the spider is the component that handles the frontier, whether for reading or for writing. scrapy-frontera itself has never been able to transparently replace Scrapy's memory and disk queues with the Hubstorage crawl frontier (HCF). There have always been limitations, and the effort to overcome them has led to quite complex underlying logic and, in some cases, to complex usage.

In this document we describe a different approach to relying on the HCF: fully leveraging shub-workflow in order to separate the frontier operations role from the spider. Under this approach, other workflow components, like the already described crawl managers, take care of reading seeds from the frontier and passing them via arguments to short-lived spider jobs. The spider doesn't necessarily need to receive a URL; it can build it from the received seeds. The frontier writing role is also separated from the spider: this task is taken over by a consumer that scans the items of all finished jobs and extracts new seeds from them.
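To make the separation of roles concrete, here is a minimal in-memory sketch of the flow described above. It does not use the shub-workflow or Hubstorage APIs: `InMemoryFrontier`, `spider_job`, `crawl_manager`, `consumer` and `run` are hypothetical names invented for illustration, the `example.com` URL pattern and the `related_ids` item field are assumptions, and the deduplicating FIFO queue is only a rough stand-in for a real frontier slot.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class InMemoryFrontier:
    """Hypothetical stand-in for a frontier slot: FIFO queue that skips seeds already seen."""
    queue: deque = field(default_factory=deque)
    seen: set = field(default_factory=set)

    def add(self, seeds):
        for seed in seeds:
            if seed not in self.seen:  # each seed is queued at most once
                self.seen.add(seed)
                self.queue.append(seed)

    def read_batch(self, size):
        batch = []
        while self.queue and len(batch) < size:
            batch.append(self.queue.popleft())
        return batch


def spider_job(seeds):
    """Hypothetical short-lived spider job: it receives seeds, not URLs,
    builds each URL itself, and never touches the frontier."""
    items = []
    for seed in seeds:
        url = f"https://example.com/product/{seed}"  # assumed URL pattern
        # Fake "scraped" data: small ids link to two further ids.
        related = [seed * 2, seed * 2 + 1] if seed < 4 else []
        items.append({"url": url, "related_ids": related})
    return items


def crawl_manager(frontier, batch_size):
    """Reads one batch of seeds and passes it as an argument to a spider job.
    Returns the items of the finished job, or None if the frontier is empty."""
    seeds = frontier.read_batch(batch_size)
    return spider_job(seeds) if seeds else None


def consumer(frontier, finished_job_items):
    """Scans the items of a finished job and writes new seeds to the frontier."""
    for item in finished_job_items:
        frontier.add(item["related_ids"])


def run(initial_seeds, batch_size=2):
    frontier = InMemoryFrontier()
    frontier.add(initial_seeds)
    crawled = []
    while True:
        items = crawl_manager(frontier, batch_size)
        if items is None:  # nothing left to read: the crawl is complete
            break
        crawled.extend(item["url"] for item in items)
        consumer(frontier, items)
    return crawled


if __name__ == "__main__":
    for url in run([1]):
        print(url)
```

Note that in this sketch only `crawl_manager` reads from the frontier and only `consumer` writes to it, so `spider_job` stays a plain function of its arguments. In a real deployment each of these roles would run as a separate job rather than inside a single loop.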

[WIP]