Expire requests from request queue #2018
-
Hello, I have a use case where I need to handle request expiration in the `RequestQueue`. One possible approach is to set an epoch time in the request's `userData` and check for expiration inside the request handler. However, this approach may not be the most elegant solution, and it has the side effect of creating a page object, which in turn opens a browser and creates an empty tab, consuming unnecessary resources. Is there a more efficient and cleaner way to handle request expiration and avoid this overhead?
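For reference, the epoch-time check described above could be sketched as a small standalone helper. This is only an illustration; the `time` field stored in `userData` is an assumed convention, not a built-in Crawlee feature:

```typescript
// Hedged sketch of the epoch-time expiry check; the `time` field in
// userData is an assumed convention, not a built-in Crawlee feature.
interface TimedUserData {
    time?: number; // epoch milliseconds recorded when the request was enqueued
}

function isExpired(userData: TimedUserData, ttlSecs: number, now: number = Date.now()): boolean {
    if (userData.time === undefined) return false; // no timestamp: never expires
    return now > userData.time + ttlSecs * 1000;
}

// A request enqueued 5 s ago with a 2 s TTL has expired:
console.log(isExpired({ time: Date.now() - 5_000 }, 2)); // true
```

The drawback the question points out is exactly where this check runs: inside the request handler, a page (and therefore a browser tab) has already been created before the request can be discarded.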
Replies: 4 comments
-
@B4nan Not sure if https://crawlee.dev/api/core/class/RequestQueue#internalTimeoutMillis does the same thing.
-
@B4nan please let me know if you have any clue on this. Thanks.
-
Hi @abhisheksurve45 and thank you for your interest in the Crawlee project!

You are right - there is no internal way of setting a per-request timeout, as we haven't seen any interest in this feature (until now). Would you mind sharing your use case for this feature?

Here I hacked together a better solution that creates a custom `RequestQueue` class with request expiration. You can implement your own `RequestQueue` class that inherits most of the methods from the original Crawlee `RequestQueue` (imported as `RQ`) - you only have to add wrappers for the methods that a) put requests into the queue and b) take requests out of the queue.

You can see that in my implementation, the `addRequest(s)` / `addRequestsBatched` overrides stamp each request's `userData` with the current epoch time, and `fetchNextRequest` skips (and marks as handled) any request older than `requestTimeoutSecs`:

```typescript
import { CheerioCrawler, Dictionary, Request, RequestQueue as RQ } from "crawlee";
import { setTimeout } from "timers/promises";

class RequestQueue extends RQ {
    private requestTimeoutSecs = 2;

    override async addRequest(...args: Parameters<RQ["addRequest"]>) {
        // Stamp the request's userData with the enqueue time.
        return super.addRequest(
            { ...args[0], userData: { ...args[0].userData, time: Date.now() } },
            args[1],
        );
    }

    override async addRequests(...args: Parameters<RQ["addRequests"]>) {
        return super.addRequests(
            args[0].map((req) => ({ ...req, userData: { ...req.userData, time: Date.now() } })),
            args[1],
        );
    }

    override async addRequestsBatched(...args: Parameters<RQ["addRequestsBatched"]>) {
        return super.addRequestsBatched(
            args[0].map((req) => ({
                // Requests may be passed as plain URL strings; normalize them first.
                ...(typeof req === "string" ? { url: req } : req),
                userData: {
                    ...(typeof req === "string" ? {} : req.userData ?? {}),
                    time: Date.now(),
                },
            })),
            args[1],
        );
    }

    override async fetchNextRequest<T extends Dictionary = Dictionary>(): Promise<Request<T> | null> {
        let r = await super.fetchNextRequest<T>();
        while (r) {
            // Skip (and mark as handled) any request older than requestTimeoutSecs.
            if (r.userData?.time + this.requestTimeoutSecs * 1e3 < Date.now()) {
                console.log(`${r.url} expired, loading next request.`);
                await this.markRequestHandled(r);
                r = await super.fetchNextRequest<T>();
            } else {
                return r;
            }
        }
        return null;
    }
}

const rq = await RequestQueue.open();

const crawler = new CheerioCrawler({
    requestHandler: async ({ request, enqueueLinks }) => {
        console.log(`Crawling ${request.url}...`);
        await enqueueLinks();
        await setTimeout(2000);
    },
    maxConcurrency: 1,
    requestQueue: rq,
});

await crawler.run(['https://jindrich.bar']);
```

This example should give you the starting point for your implementation. As I said, I'd love to hear more about your use case - right now, we're not planning to add this feature to Crawlee, but that can change if there is enough push from the users :) Thanks!
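To see the skip logic of `fetchNextRequest` in isolation, here is a hedged, dependency-free sketch where the queue is mocked as a plain array (all names here are illustrative, not part of the Crawlee API):

```typescript
// Dependency-free illustration of the skip loop in fetchNextRequest;
// the queue is mocked as a plain array and all names are illustrative.
interface FakeRequest {
    url: string;
    userData: { time: number }; // epoch millis set at enqueue time
}

function fetchNextFresh(queue: FakeRequest[], ttlSecs: number, now: number = Date.now()): FakeRequest | null {
    while (queue.length > 0) {
        const r = queue.shift()!;
        if (r.userData.time + ttlSecs * 1000 < now) {
            // Expired: the real queue would call markRequestHandled(r) here.
            continue;
        }
        return r;
    }
    return null;
}

const now = Date.now();
console.log(
    fetchNextFresh(
        [
            { url: "https://example.com/old", userData: { time: now - 10_000 } },
            { url: "https://example.com/fresh", userData: { time: now } },
        ],
        2,
        now,
    )?.url,
); // https://example.com/fresh
```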