-
I am working with an API that has rate limiting in place. The API gives me a timestamp, in seconds, of when the current rate limit will expire. I need to delay my next request by this many seconds, which is usually around 15 minutes. What I have tried so far:

1st Approach
Inside my requestHandler:

export function delay(seconds: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, seconds * 1000))
}

After this delay is finished, I add the next request to my crawler and return from the requestHandler:

const newRequest: RequestOptions = {...}
crawler.addRequests([newRequest])
return

2nd Approach
Inside my requestHandler:

setTimeout(() => {
  const newRequest: RequestOptions = {...}
  crawler.addRequests([newRequest])
}, 300000 * 2)

The Error
Both of these approaches give me the same error.
I am not familiar with the internals of Crawlee, so is my approach correct for having a delay between requests? Or is something like this even possible with Crawlee? I would appreciate any help or suggestions. Thanks!
-
Both of your examples are not awaiting the async calls, which is probably the reason you are getting that error. The easiest thing you can do is to limit the requests per minute: https://crawlee.dev/docs/guides/scaling-crawlers#maxrequestsperminute That page also lists other options for scaling that affect the speed. It's usually better not to overload the target site and be nice :]
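For reference, a minimal sketch of that option, assuming a BasicCrawler setup similar to the one discussed in this thread (the handler body and URL are placeholders, not from the original posts):

// Minimal sketch: throttle a crawler with maxRequestsPerMinute.
import { BasicCrawler } from "crawlee"

const crawler = new BasicCrawler({
  // Crawlee spaces requests out so that no more than 4 start within any one minute.
  maxRequestsPerMinute: 4,
  requestHandler: async ({ request, log }) => {
    log.info(`Processing ${request.url}`)
    // Any async calls here (e.g. crawler.addRequests) should be awaited.
  },
})

await crawler.run(["https://example.com"])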
-
Leaving my final solution here if anyone wants to implement a similar flow. I gave up on trying to make it work with HttpCrawler or BasicCrawler + sendRequest. I re-wrote my script to use BasicCrawler, and now it makes requests using Axios instead of sendRequest. It seems to have resolved the problem and lets me wait as long as I want right inside the requestHandler. Tested with a 30 minute delay between requests and there were no issues.

Here is a simple example.

// routes.ts
import axios from "axios"
import { createBasicRouter, sleep } from "crawlee"
const axiosClient = axios.create({
  baseURL: "https://jsonplaceholder.typicode.com",
})
export const basicRouter = createBasicRouter()
basicRouter.addHandler("/todos", async ({ request, log }) => {
  const userData = request.userData
  try {
    const response = await axiosClient.get(userData.endpoint)
    log.info(`Fetched ${response.data.length} todos`)
    // Wait before resolving this handler
    // This will stop further requests from executing, depending on how you have set up maxConcurrency
    await sleep(10000)
    // TODO: Do something after the sleep has completed, e.g. enqueue new requests.
  } catch (error) {
    if (axios.isAxiosError(error)) {
      // TODO: Handle errors
    }
  }
})

// main.ts
import { BasicCrawler, RequestOptions } from "crawlee"
import { basicRouter } from "./routes.js"
const basicCrawler = new BasicCrawler({
  requestHandler: basicRouter,
  maxConcurrency: 1,
  maxRequestRetries: 1,
  // Make sure this is longer than the maximum time your requestHandlers wait for, otherwise they will have timeout errors.
  requestHandlerTimeoutSecs: 86400,
  errorHandler: async function ({ log }, error) {
    log.error(`errorHandler ${error.message}`)
  },
  failedRequestHandler: async function ({ log }, error) {
    log.error(`failedRequestHandler ${error.message}`)
  },
})
// Passing url in my case is redundant since I am passing an endpoint with userData
// But it still needs to be a valid url for the validators to not fail
const request: RequestOptions = {
  uniqueKey: Date.now().toString(),
  url: "https://jsonplaceholder.typicode.com",
  userData: {
    endpoint: "/todos",
  },
  label: "/todos",
}
await basicCrawler.addRequests([request])
await basicCrawler.run()
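A closing note on the fixed sleep(10000) above: since the original question's API reports when the rate limit will expire, the delay could instead be computed from that timestamp. A hypothetical sketch, assuming the reset time arrives as an epoch value in seconds (the helper name and resetEpochSeconds parameter are illustrative, not from the API discussed above):

// Hypothetical helper: sleep until an epoch timestamp (in seconds) has passed.
import { sleep } from "crawlee"

export async function waitForRateLimitReset(resetEpochSeconds: number): Promise<void> {
  const waitMs = resetEpochSeconds * 1000 - Date.now()
  if (waitMs > 0) {
    // With maxConcurrency: 1 and a generous requestHandlerTimeoutSecs, this pauses the whole crawler.
    await sleep(waitMs)
  }
}

This would be called inside the handler in place of sleep(10000), with resetEpochSeconds read from the API's rate-limit response.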
-
@viconx98 Can you post your source code as an example?