Hi friends, let's say I am scraping three sites, a, b, and c:

await crawler.run([
  `https://www.a.com`,
  `https://www.b.com`,
  `https://www.c.com`,
]);

Based on how the data is stored, it seems that site C starts crawling only when site B is completed, and site B starts only when site A is completed.
Should I use parallelism if I want sites A, B, and C crawled at the same time? Perhaps I should make a separate crawler for each site:

await crawlerA.run([
  `https://www.a.com`,
]);
await crawlerB.run([
  `https://www.b.com`,
]);
await crawlerC.run([
  `https://www.c.com`,
]);

What would be my best approach?
The requests should be processed in the same order as you enqueue them, unless you explicitly change the queue ordering. Your second snippet would also run sequentially, since you await each run call before starting the next one. If you start all the runs first and only then wait for them (for example with Promise.all), the crawlers can run concurrently.
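To make the difference concrete, here is a minimal sketch using plain async functions as stand-ins for the crawler instances (the names and timings are invented for illustration; this is not Crawlee code):

```javascript
// fakeRun stands in for crawlerX.run(): it resolves with the site name
// after a simulated crawl duration in milliseconds.
const fakeRun = (site, ms) =>
  new Promise((resolve) => setTimeout(() => resolve(site), ms));

async function sequential() {
  // Each run only starts after the previous one resolves,
  // just like awaiting crawlerA.run(), then crawlerB.run(), etc.
  const order = [];
  order.push(await fakeRun('a', 30));
  order.push(await fakeRun('b', 20));
  order.push(await fakeRun('c', 10));
  return order; // total time ~ 30 + 20 + 10 ms
}

async function concurrent() {
  // All three runs start immediately; Promise.all waits for all of
  // them and preserves the input order in its result array.
  return Promise.all([
    fakeRun('a', 30),
    fakeRun('b', 20),
    fakeRun('c', 10),
  ]); // total time ~ 30 ms (the slowest run)
}
```

Both functions resolve to `['a', 'b', 'c']`, but the concurrent version finishes in roughly the time of the slowest crawl instead of the sum of all three.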
Yes, this is the same as using Promise.all afaik (and I would strongly prefer Promise.all over starting the promises and awaiting them later). Because they would all use the same default request queue, you need to create the queues explicitly; you can't use the default queue. A queue without a name will always be the same queue, regardless of how many instances you create.