Expire requests from request queue #2018
-
Hello, I have a use case where I need to handle request expiration in the `RequestQueue`. One possible approach is to set an epoch time in the request's `userData` and check for expiration inside the request handler. However, this approach may not be the most elegant solution, and it has the side effect of creating a page object, which in turn opens a browser and creates an empty tab, consuming unnecessary resources. Is there a more efficient and cleaner way to handle request expiration and avoid this overhead?
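For reference, the epoch-time check described above could be sketched as a small standalone helper. This is only an illustration; the `time` field stored in `userData` is an assumed convention, not a built-in Crawlee feature:

```typescript
// Hedged sketch of the epoch-time expiry check; the `time` field in
// userData is an assumed convention, not a built-in Crawlee feature.
interface TimedUserData {
    time?: number; // epoch milliseconds recorded when the request was enqueued
}

function isExpired(userData: TimedUserData, ttlSecs: number, now: number = Date.now()): boolean {
    if (userData.time === undefined) return false; // no timestamp: never expires
    return now > userData.time + ttlSecs * 1000;
}

// A request enqueued 5 s ago with a 2 s TTL has expired:
console.log(isExpired({ time: Date.now() - 5_000 }, 2)); // true
```

The drawback the question points out is exactly where this check runs: inside the request handler, a page (and therefore a browser tab) has already been created before the request can be discarded.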
Replies: 4 comments
-
@B4nan Not sure if https://crawlee.dev/api/core/class/RequestQueue#internalTimeoutMillis does the same thing.
-
@B4nan please let me know if you have any clue on this. Thanks.
-
Hi @abhisheksurve45 and thank you for your interest in the Crawlee project!

You are right - there is no internal way of setting a per-request timeout, as we haven't seen any interest in this feature (until now). Would you mind sharing your use case for this feature?

Here I hacked together a better solution that creates a custom `RequestQueue` class with request expiration. You can implement your own `RequestQueue` class that inherits most of the methods from the original Crawlee `RequestQueue` (imported as `RQ`) - you only have to add wrappers for the methods that a) put requests into the queue and b) take requests out of the queue.

You can see that in my implementation, the `addRequest(s)` / `addRequestsBatched` overrides stamp each request's `userData` with the current epoch time, and `fetchNextRequest` skips (and marks as handled) any request older than `requestTimeoutSecs`:

```typescript
import { CheerioCrawler, Dictionary, Request, RequestQueue as RQ } from "crawlee";
import { setTimeout } from "timers/promises";

class RequestQueue extends RQ {
    private requestTimeoutSecs = 2;

    override async addRequest(...args: Parameters<RQ["addRequest"]>) {
        // Stamp the request's userData with the enqueue time.
        return super.addRequest(
            { ...args[0], userData: { ...args[0].userData, time: Date.now() } },
            args[1],
        );
    }

    override async addRequests(...args: Parameters<RQ["addRequests"]>) {
        return super.addRequests(
            args[0].map((req) => ({ ...req, userData: { ...req.userData, time: Date.now() } })),
            args[1],
        );
    }

    override async addRequestsBatched(...args: Parameters<RQ["addRequestsBatched"]>) {
        return super.addRequestsBatched(
            args[0].map((req) => ({
                // Requests may be passed as plain URL strings; normalize them first.
                ...(typeof req === "string" ? { url: req } : req),
                userData: {
                    ...(typeof req === "string" ? {} : req.userData ?? {}),
                    time: Date.now(),
                },
            })),
            args[1],
        );
    }

    override async fetchNextRequest<T extends Dictionary = Dictionary>(): Promise<Request<T> | null> {
        let r = await super.fetchNextRequest<T>();
        while (r) {
            // Skip (and mark as handled) any request older than requestTimeoutSecs.
            if (r.userData?.time + this.requestTimeoutSecs * 1e3 < Date.now()) {
                console.log(`${r.url} expired, loading next request.`);
                await this.markRequestHandled(r);
                r = await super.fetchNextRequest<T>();
            } else {
                return r;
            }
        }
        return null;
    }
}

const rq = await RequestQueue.open();

const crawler = new CheerioCrawler({
    requestHandler: async ({ request, enqueueLinks }) => {
        console.log(`Crawling ${request.url}...`);
        await enqueueLinks();
        await setTimeout(2000);
    },
    maxConcurrency: 1,
    requestQueue: rq,
});

await crawler.run(['https://jindrich.bar']);
```

This example should give you the starting point for your implementation. As I said, I'd love to hear more about your use case - right now, we're not planning to add this feature to Crawlee, but that can change if there is enough push from the users :) Thanks!
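To see the skip logic of `fetchNextRequest` in isolation, here is a hedged, dependency-free sketch where the queue is mocked as a plain array (all names here are illustrative, not part of the Crawlee API):

```typescript
// Dependency-free illustration of the skip loop in fetchNextRequest;
// the queue is mocked as a plain array and all names are illustrative.
interface FakeRequest {
    url: string;
    userData: { time: number }; // epoch millis set at enqueue time
}

function fetchNextFresh(queue: FakeRequest[], ttlSecs: number, now: number = Date.now()): FakeRequest | null {
    while (queue.length > 0) {
        const r = queue.shift()!;
        if (r.userData.time + ttlSecs * 1000 < now) {
            // Expired: the real queue would call markRequestHandled(r) here.
            continue;
        }
        return r;
    }
    return null;
}

const now = Date.now();
console.log(
    fetchNextFresh(
        [
            { url: "https://example.com/old", userData: { time: now - 10_000 } },
            { url: "https://example.com/fresh", userData: { time: now } },
        ],
        2,
        now,
    )?.url,
); // https://example.com/fresh
```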