Is it possible to use multiple crawler instances to process same request queue #2097

harshmaur · 2023-09-25T13:14:14Z

harshmaur
Sep 25, 2023

I recently started using apify storage with my own crawler instances and immediately realised I need something that allows me to have a single request queue which can be processed by always available crawler instances. Is that possible to do?

Answered by B4nan

Sep 25, 2023

We just merged support for v2 of the request queue API, which supports request locking, in other words, parallel runs. This is an experimental feature, so I can't suggest using it in production right now, but it's definitely happening soon.

Here is an example test using this feature:

https://github.com/apify/crawlee/blob/master/test/e2e/cheerio-request-queue-v2/actor/main.js

It will work only on the apify platform, and we know about some issues already, but we would appreciate any reports if you would try to test this yourself right now.

View full answer

B4nan · 2023-09-25T13:27:08Z

B4nan
Sep 25, 2023
Maintainer

We just merged support for v2 of the request queue API, which supports request locking, in other words, parallel runs. This is an experimental feature, so I can't suggest using it in production right now, but it's definitely happening soon.

Here is an example test using this feature:

https://github.com/apify/crawlee/blob/master/test/e2e/cheerio-request-queue-v2/actor/main.js

It will work only on the apify platform, and we know about some issues already, but we would appreciate any reports if you would try to test this yourself right now.

8 replies

B4nan Sep 25, 2023
Maintainer

You can try latest beta, its out on NPM (as every commit to master).

Btw I had to create a custom class to be able to use just apify storage with crawlee without using actors.

You can use Apify SDK and just call Actor.init() at the very beginning, or is that not suitable for you?

harshmaur Sep 25, 2023
Author

Wouldnt that also try to use actors on apify? I dont have my code on apify.

B4nan Sep 25, 2023
Maintainer

I am not sure what you mean by that, it only instructs crawlee to use the apify client as storage, in other words, working with RQ, KVS and DS will use the apify API.

B4nan Sep 25, 2023
Maintainer

Few guides around this topic:

https://crawlee.dev/docs/introduction/deployment
https://crawlee.dev/docs/deployment/apify-platform

harshmaur Sep 25, 2023
Author

Ah okay! I will try it then instead of implementing my own class Thank you!!

harshmaur · 2023-09-25T15:21:49Z

harshmaur
Sep 25, 2023
Author

@B4nan I just tried this like this way.

const getCrawler = ({
  keepAlive,
}: {
  keepAlive?: boolean
}) => {
  return new PlaywrightCrawler({
    experiments: {
      requestLocking: true,
    },
    keepAlive,
    requestHandler: router,
    
  })
}

const alwaysRunningCrawler = getCrawler({ keepAlive: true })
const alwaysRunningCrawler2 = getCrawler({ keepAlive: true })

app.get("/crawl", (req, res) => {
 const crawler = getCrawler()
 crawler.addRequests(...manyRequests)
})

await Promise.all([alwaysRunningCrawler.run(), alwaysRunningCrawler2.run()])

I am currently using my local file storage to store the requests to test this.

When the requests are added using the express endpoint, nothing happens. Is it because I am not using apify storage in this test?

10 replies

B4nan Sep 25, 2023
Maintainer

I believe you need to provide the same request queue ID, otherwise you work with completely different storages.

vladfrangu Sep 25, 2023
Maintainer

Yep, try opening a Request Queue manually and passing it to your crawlers (bonus points! you wont need to use getCrawler() per request, and can just call queue.addRequests())

harshmaur Sep 25, 2023
Author

YES! It works now. Here's the working example.

// For more information, see https://crawlee.dev/
import { PlaywrightCrawler, RequestQueueV2 } from 'crawlee';

import { router } from './routes.js';

const startUrls = ['https://crawlee.dev'];

const requestQueue = await RequestQueueV2.open('test-queue')

const getcrawler = (keepAlive?: boolean) => new PlaywrightCrawler({
    requestQueue,
    experiments:{
        requestLocking: true,
    },
    keepAlive,
    // proxyConfiguration: new ProxyConfiguration({ proxyUrls: ['...'] }),
    requestHandler: router,
});

const crawler1 = getcrawler(true)
const crawler2 = getcrawler(true)

await requestQueue.addRequests(startUrls.map(url => ({ url })))

await Promise.all([
    crawler1.run(),
    crawler2.run(),
])

The stats show crawler one crawling 9 pages and the second one crawling 5. Total 14 pages.

harshmaur Sep 25, 2023
Author

Interestingly, if I dont use the requestQueue.addRequests but use a new crawler instance, it doesnt work.

// this works
await requestQueue.addRequests(startUrls.map(url => ({ url })))

// this doesnt
await getCrawler().addRequests(startUrls)

They all are using the same request queue though. I can obviosuly open the request queue everywhere in my routes, but I am not sure if that would be nice since I already get the crawler instance inside the code. Maybe it would just work fine since it would use the same instance.

I will have to try that within my project tomorrow.

vladfrangu Sep 25, 2023
Maintainer

Weird, both of those should've worked. That said, it is much more efficient to just open the request queue once like you've done, and reference that in your express routes!

harshmaur · 2023-09-26T01:35:46Z

harshmaur
Sep 26, 2023
Author

Alright, so it seems to work partially now with the apify storage. I want to understand what happens if the client which is currently executing requests stops due to some reason and doesnt allow cleanup?

Do the requests lying in the storage get unlocked after some time? is it 15 minutes?

Can I configure it somehow?

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Is it possible to use multiple crawler instances to process same request queue #2097

{{title}}

Replies: 3 comments 18 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

Is it possible to use multiple crawler instances to process same request queue #2097

harshmaur Sep 25, 2023

Replies: 3 comments · 18 replies

B4nan Sep 25, 2023 Maintainer

B4nan Sep 25, 2023 Maintainer

harshmaur Sep 25, 2023 Author

B4nan Sep 25, 2023 Maintainer

B4nan Sep 25, 2023 Maintainer

harshmaur Sep 25, 2023 Author

harshmaur Sep 25, 2023 Author

B4nan Sep 25, 2023 Maintainer

vladfrangu Sep 25, 2023 Maintainer

harshmaur Sep 25, 2023 Author

harshmaur Sep 25, 2023 Author

vladfrangu Sep 25, 2023 Maintainer

harshmaur Sep 26, 2023 Author

harshmaur
Sep 25, 2023

Replies: 3 comments 18 replies

B4nan
Sep 25, 2023
Maintainer

B4nan Sep 25, 2023
Maintainer

harshmaur Sep 25, 2023
Author

B4nan Sep 25, 2023
Maintainer

B4nan Sep 25, 2023
Maintainer

harshmaur Sep 25, 2023
Author

harshmaur
Sep 25, 2023
Author

B4nan Sep 25, 2023
Maintainer

vladfrangu Sep 25, 2023
Maintainer

harshmaur Sep 25, 2023
Author

harshmaur Sep 25, 2023
Author

vladfrangu Sep 25, 2023
Maintainer

harshmaur
Sep 26, 2023
Author