Very slow crawling inside docker #2054
Which package is this bug report for? If unsure which one to select, leave blank

@crawlee/http (HttpCrawler)

Issue description

Good day everyone! I am running a web crawler inside a Docker container on a robust machine. The crawler starts fast, but its performance degrades significantly over time, sometimes dropping to as little as 1-2 requests per hour. The same setup works without issues on my MacBook (not in Docker).

Environment

Dockerfile:

```dockerfile
FROM apify/actor-node:16
WORKDIR /app
COPY ./ /app
RUN npm ci
CMD npm start --silent
```

I tried tweaking different settings and searching across the docs, issues and code to figure it out myself, but have been unsuccessful so far. Here is my crawler setup:

```ts
// Imports assumed from crawlee v3; the rest is the setup as posted.
import {
    Configuration,
    HttpCrawler,
    ProxyConfiguration,
    RequestQueue,
    createBasicRouter,
    log,
    BasicCrawlingContext,
} from "crawlee";

const config = Configuration.getGlobalConfig();
config.set("availableMemoryRatio", 0.9);
config.set("memoryMbytes", 22000); // setting enough memory
config.set("persistStorage", false);
const proxyUrls = await listProxies(); // returns a list of 100 fast proxies
const proxyConfiguration = new ProxyConfiguration({
proxyUrls,
});
const router = createBasicRouter();
export const queue = await RequestQueue.open();
queue.timeoutSecs = 10;
const crawler = new HttpCrawler({
useSessionPool: true,
maxConcurrency: 30,
sessionPoolOptions: { maxPoolSize: 100 }, // to match the number of proxies
autoscaledPoolOptions: {
snapshotterOptions: {
eventLoopSnapshotIntervalSecs: 2,
maxBlockedMillis: 100,
},
systemStatusOptions: {
maxEventLoopOverloadedRatio: 1.9, // tried different options, no effect, even when it's not specified
},
},
requestHandlerTimeoutSecs: 60 * 2,
requestQueue: queue,
keepAlive: true,
proxyConfiguration,
requestHandler: router,
failedRequestHandler({ request, proxyInfo }) {
log.debug(`Request ${request.url} failed.`);
log.debug("Proxy info: ", proxyInfo);
},
postNavigationHooks: [
async (context: BasicCrawlingContext) => {
const { url } = context.request;
await markUrlProcessed(url); // logic I need for my app
},
],
});
await crawler.run();
```

In the autoscaled pool logs I noticed that the event loop is often at a ratio of 1, which is why I tried increasing the limit:

```
DEBUG HttpCrawler:AutoscaledPool: scaling up {
"oldConcurrency": 28,
"newConcurrency": 30,
"systemStatus": {
"isSystemIdle": true,
"memInfo": {
"isOverloaded": false,
"limitRatio": 0.2,
"actualRatio": 0
},
"eventLoopInfo": {
"isOverloaded": false,
"limitRatio": 1.9,
"actualRatio": 1
},
"cpuInfo": {
"isOverloaded": false,
"limitRatio": 0.4,
"actualRatio": 0
},
"clientInfo": {
"isOverloaded": false,
"limitRatio": 0.3,
"actualRatio": 0
}
}
}
```

Questions

Code sample
No response

Package version
3.4.0

Node.js version
16.20.2

Operating system
No response

Apify platform
I have tested this on the
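A note on the autoscaled pool log above: as far as I understand the Crawlee docs, the `actualRatio` values are the share of recent snapshots in which a resource was overloaded (for the event loop, blocked for longer than `maxBlockedMillis`), so they always fall between 0 and 1. A ratio of 1 therefore means the event loop was blocked in every recent snapshot, and raising `maxEventLoopOverloadedRatio` above 1 only disables the check rather than fixing the blocking. A minimal sketch of that idea (illustrative only, not Crawlee's actual implementation):

```ts
// Toy model of how an "overloaded ratio" behaves: the share of recent
// snapshots that were overloaded, which by construction is between 0 and 1.
interface Snapshot {
    isOverloaded: boolean;
}

function overloadedRatio(snapshots: Snapshot[]): number {
    if (snapshots.length === 0) return 0;
    const overloaded = snapshots.filter((s) => s.isOverloaded).length;
    return overloaded / snapshots.length;
}

// If every snapshot saw the event loop blocked for longer than
// maxBlockedMillis (100 ms in the config above), the ratio is exactly 1.
const ratio = overloadedRatio([
    { isOverloaded: true },
    { isOverloaded: true },
    { isOverloaded: true },
]);
console.log(ratio); // 1 -- a limit of 1.9 can never be exceeded, so the
                    // event-loop check is effectively turned off.
```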
Replies: 2 comments 4 replies
Can you provide some logs? My guess is that the crawler is overscaled, which ends up degrading the performance over time.
Setting `memoryMbytes` like that is wrong and can probably have exactly this side effect. The option is in megabytes, and I doubt your container actually has 22 GB of RAM available. It could also be connected to how you run the Docker image: having 32 GB of RAM in your system does not mean the Docker container can access it, the defaults are usually much lower (maybe even less than 1 GB, I am not sure right now). Generally speaking, there are no known issues when it comes to running in Docker, we run things inside Docker images on a daily basis, as that is how the Apify platform works. You shouldn't need to alter any of the options for that. I would be very careful with adjusting the autoscaled pool options. If you see an overloaded event loop, it's not about raising the limits, it's the very opposite: it's about doing less work at once so the event loop can catch up. |
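A minimal sketch of the direction suggested above, assuming you control how the container is started (the 8 GB limit below is a made-up example value): give the container an explicit memory limit, drop the hard-coded `memoryMbytes`, and let Crawlee size itself from `availableMemoryRatio` against the memory the container actually has.

```ts
// Illustrative sketch, not taken from the reply. Assumes the container is
// started with an explicit limit, e.g. `docker run --memory=8g ...`.
import { Configuration } from "crawlee";

const config = Configuration.getGlobalConfig();

// Use 90% of whatever memory is really available inside the container,
// instead of hard-coding a memoryMbytes value the container cannot reach.
config.set("availableMemoryRatio", 0.9);
```

Checking `docker stats` while the crawler slows down should show whether the container is actually hitting its memory limit.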
After deeper debugging, I found out that the issue was related to how I was handling the data outside of Crawlee, so it is not related to Crawlee itself. Thanks for the support!