Can't find document/discussion/issue about how to resume the crawling after ctrl-c without erasing the pulled dataset #2149
-
If it does not work for you, you are probably setting it too late (it's the very first async call that touches some storage that triggers the clean-up). Try using the …
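The concrete suggestion after "Try using the" is lost in this export, so here is only a sketch of one way to make sure the option takes effect before the first storage access: pass the purgeOnStart flag in a Configuration instance directly to the crawler constructor (the handler and start URL below are placeholders):

```ts
import { CheerioCrawler, Configuration } from 'crawlee';

const crawler = new CheerioCrawler(
    {
        async requestHandler({ request, $, pushData }) {
            // Placeholder handler: store the page title for each crawled URL.
            await pushData({ url: request.url, title: $('title').text() });
        },
    },
    // Passing the Configuration here means it is in place before the
    // crawler opens any storage, so nothing gets purged on start.
    new Configuration({ purgeOnStart: false }),
);

await crawler.run(['https://example.com']);
```

Setting the `CRAWLEE_PURGE_ON_START=false` environment variable before launching the process should have the same effect without touching the code.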
-
@B4nan I have a similar question. If for any reason my crawler stops, I would like not to repeat the already-crawled URLs. I have read the docs trying to find something about this, but I had no luck. To clarify: I do not want to crawl or request URLs that are already in my Dataset.
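Not an official answer, but a rough sketch of one way to do this: Crawlee's RequestQueue already deduplicates requests by uniqueKey, so with purging disabled a restarted run should not re-handle requests it already processed. If the requirement is specifically to skip URLs that are already stored in the Dataset, the dataset can be scanned at startup (assuming each pushed item contains a `url` field):

```ts
import { CheerioCrawler, Dataset } from 'crawlee';

// Collect URLs that are already present in the default Dataset.
// Assumption: every item pushed in previous runs contains a `url` field.
const dataset = await Dataset.open();
const { items } = await dataset.getData();
const alreadyStored = new Set(items.map((item) => item.url));

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks, pushData }) {
        // Only store pages that were not saved in a previous run.
        if (!alreadyStored.has(request.url)) {
            await pushData({ url: request.url, title: $('title').text() });
        }
        // Links are still enqueued; the RequestQueue drops duplicates.
        await enqueueLinks();
    },
});

await crawler.run(['https://example.com']);
```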
-
Since crawling is time-consuming, it is common to stop the crawler, whether intentionally or accidentally. However, by default the storage is removed when a new job starts. Even with purgeOnStart set to false, the files in storage/datasets/default are removed. I believe pause/resume is a common feature of crawlers, so could anyone help with this question?
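In case it helps, a minimal resume-friendly setup might look like the sketch below. It assumes the default file-system storage and disables purging globally before the first storage access, so the request queue and dataset under ./storage survive a ctrl-c and the next run continues from the pending requests (the start URL and handler are placeholders; the `CRAWLEE_PURGE_ON_START=false` environment variable should be an equivalent, code-free alternative):

```ts
import { CheerioCrawler, Configuration } from 'crawlee';

// Must run before anything touches storage, otherwise the purge has
// already happened by the time the flag is flipped.
Configuration.getGlobalConfig().set('purgeOnStart', false);

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks, pushData }) {
        await pushData({ url: request.url, title: $('title').text() });
        await enqueueLinks();
    },
});

// The default RequestQueue is reused across runs, so an interrupted crawl
// resumes from its pending requests; re-adding the same start URL is
// harmless because requests are deduplicated by uniqueKey.
await crawler.run(['https://example.com']);
```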