Very slow crawling inside docker #2054
Which package is this bug report for? If unsure which one to select, leave blank

@crawlee/http (HttpCrawler)

Issue description

Good day everyone! I am running a web crawler inside a Docker container on a robust machine. The crawler starts fast, but its performance degrades significantly over time, sometimes dropping to as little as 1-2 requests per hour. The same setup works without issues on my MacBook (not in Docker).

Environment

Dockerfile:

```dockerfile
FROM apify/actor-node:16
WORKDIR /app
COPY ./ /app
RUN npm ci
CMD npm start --silent
```

I tried tweaking different settings and searching across the docs, issues and code to figure it out myself, but have been unsuccessful so far. Here is my crawler setup:

```ts
// Imports assumed from crawlee v3; the rest is the setup as posted.
import {
    Configuration,
    HttpCrawler,
    ProxyConfiguration,
    RequestQueue,
    createBasicRouter,
    log,
    BasicCrawlingContext,
} from "crawlee";

const config = Configuration.getGlobalConfig();
config.set("availableMemoryRatio", 0.9);
config.set("memoryMbytes", 22000); // setting enough memory
config.set("persistStorage", false);
const proxyUrls = await listProxies(); // returns a list of 100 fast proxies
const proxyConfiguration = new ProxyConfiguration({
proxyUrls,
});
const router = createBasicRouter();
export const queue = await RequestQueue.open();
queue.timeoutSecs = 10;
const crawler = new HttpCrawler({
useSessionPool: true,
maxConcurrency: 30,
sessionPoolOptions: { maxPoolSize: 100 }, // to match the number of proxies
autoscaledPoolOptions: {
snapshotterOptions: {
eventLoopSnapshotIntervalSecs: 2,
maxBlockedMillis: 100,
},
systemStatusOptions: {
maxEventLoopOverloadedRatio: 1.9, // tried different options, no effect, even when it's not specified
},
},
requestHandlerTimeoutSecs: 60 * 2,
requestQueue: queue,
keepAlive: true,
proxyConfiguration,
requestHandler: router,
failedRequestHandler({ request, proxyInfo }) {
log.debug(`Request ${request.url} failed.`);
log.debug("Proxy info: ", proxyInfo);
},
postNavigationHooks: [
async (context: BasicCrawlingContext) => {
const { url } = context.request;
await markUrlProcessed(url); // logic I need for my app
},
],
});
await crawler.run();
```

In the autoscaled pool logs I noticed that the event loop is often at a ratio of 1, which is why I tried increasing the limit:

```
DEBUG HttpCrawler:AutoscaledPool: scaling up {
"oldConcurrency": 28,
"newConcurrency": 30,
"systemStatus": {
"isSystemIdle": true,
"memInfo": {
"isOverloaded": false,
"limitRatio": 0.2,
"actualRatio": 0
},
"eventLoopInfo": {
"isOverloaded": false,
"limitRatio": 1.9,
"actualRatio": 1
},
"cpuInfo": {
"isOverloaded": false,
"limitRatio": 0.4,
"actualRatio": 0
},
"clientInfo": {
"isOverloaded": false,
"limitRatio": 0.3,
"actualRatio": 0
}
}
}
```

Questions

Code sample
No response

Package version
3.4.0

Node.js version
16.20.2

Operating system
No response

Apify platform
I have tested this on the
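A note on the autoscaled pool log above: as far as I understand the Crawlee docs, the `actualRatio` values are the share of recent snapshots in which a resource was overloaded (for the event loop, blocked for longer than `maxBlockedMillis`), so they always fall between 0 and 1. A ratio of 1 therefore means the event loop was blocked in every recent snapshot, and raising `maxEventLoopOverloadedRatio` above 1 only disables the check rather than fixing the blocking. A minimal sketch of that idea (illustrative only, not Crawlee's actual implementation):

```ts
// Toy model of how an "overloaded ratio" behaves: the share of recent
// snapshots that were overloaded, which by construction is between 0 and 1.
interface Snapshot {
    isOverloaded: boolean;
}

function overloadedRatio(snapshots: Snapshot[]): number {
    if (snapshots.length === 0) return 0;
    const overloaded = snapshots.filter((s) => s.isOverloaded).length;
    return overloaded / snapshots.length;
}

// If every snapshot saw the event loop blocked for longer than
// maxBlockedMillis (100 ms in the config above), the ratio is exactly 1.
const ratio = overloadedRatio([
    { isOverloaded: true },
    { isOverloaded: true },
    { isOverloaded: true },
]);
console.log(ratio); // 1 -- a limit of 1.9 can never be exceeded, so the
                    // event-loop check is effectively turned off.
```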
Replies: 2 comments 4 replies
Can you provide some logs? My guess is that the crawler is overscaled, which ends up degrading the performance over time.
Setting `memoryMbytes` like that is wrong and can probably have exactly this side effect. The option is in megabytes, and I doubt your container actually has 22 GB of RAM available. It could also be connected to how you run the Docker image: having 32 GB of RAM in your system does not mean the Docker container can access it, the defaults are usually much lower (maybe even less than 1 GB, I am not sure right now). Generally speaking, there are no known issues when it comes to running in Docker, we run things inside Docker images on a daily basis, as that is how the Apify platform works. You shouldn't need to alter any of the options for that. I would be very careful with adjusting the autoscaled pool options. If you see an overloaded event loop, it's not about raising the limits, it's the very opposite: it's about doing less work at once so the event loop can catch up. |
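A minimal sketch of the direction suggested above, assuming you control how the container is started (the 8 GB limit below is a made-up example value): give the container an explicit memory limit, drop the hard-coded `memoryMbytes`, and let Crawlee size itself from `availableMemoryRatio` against the memory the container actually has.

```ts
// Illustrative sketch, not taken from the reply. Assumes the container is
// started with an explicit limit, e.g. `docker run --memory=8g ...`.
import { Configuration } from "crawlee";

const config = Configuration.getGlobalConfig();

// Use 90% of whatever memory is really available inside the container,
// instead of hard-coding a memoryMbytes value the container cannot reach.
config.set("availableMemoryRatio", 0.9);
```

Checking `docker stats` while the crawler slows down should show whether the container is actually hitting its memory limit.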
After deeper debugging, I found out that the issue was related to how I was handling the data outside of Crawlee, so it is not related to Crawlee itself. Thanks for the support!