
High CPU Usage #722

Closed

stijndehaes opened this issue Mar 9, 2023 · 12 comments
Labels
operational-excellence · performance (Issues relating to performance: memory usage, CPU usage, timing) · v1 (Issues requiring resolution by the v1 milestone)

Comments

@stijndehaes
Contributor

Version

Karpenter Version: v0.26.1

Kubernetes Version: v1.24.8-eks-ffeb93d

Expected Behavior

We expect CPU usage to stay at a reasonable level.

Actual Behavior

We currently see very high CPU usage:
[Screenshot 2023-03-09 at 10:48: Karpenter CPU usage graph]

The graph shows: rate(process_cpu_seconds_total{job="karpenter"}[2m])

Steps to Reproduce the Problem

We are currently running a cluster with 200 nodes, but at night it can peak to 700+ nodes.
There are batch jobs running on the cluster.

We are currently using 4 provisioners:

  • 3 provisioners use emptiness checking
  • 1 provisioner uses consolidation to downscale

We recently upgraded from karpenter 0.25.0, and have also just split our biggest provisioner into 2:

  • 1 running emptiness
  • 1 running consolidation

Resource Specs and Logs

We have currently assigned more resources to mitigate the issue:

  • cpu: 800m
  • memory: 3Gi

Not all of the memory is currently being used.
I also enabled pprof and captured a CPU profile and a memory profile (see the pprof sketch below, after the attachments).

The PDF of the CPU profile is attached below. In it I can see that a lot of time is spent in the sets.Union method, and I think also in garbage collection. At that moment only 1Gi of the 3Gi of memory was in use, so I don't think memory is the problem. The most problematic thing seems to be how often the reconcile method is triggered and how much time sets.Union takes inside that loop.

This PDF shows the visualisation of the CPU profile:
profile002.pdf

This zip contains the CPU profile, memory heap snapshot, and goroutine snapshot:
karpenter.zip
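
For reference, the CPU, heap, and goroutine profiles above come from Go's pprof tooling. Purely as a minimal sketch of how the standard net/http/pprof endpoint is typically exposed in a Go service (this is generic Go setup, not Karpenter's actual wiring, and the port is an arbitrary choice):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Serve the pprof endpoints on a side port so profiling traffic stays off the main listener.
	go func() {
		if err := http.ListenAndServe("localhost:6060", nil); err != nil {
			log.Printf("pprof server stopped: %v", err)
		}
	}()

	// ... the rest of the application would run here ...
	select {}
}
```

With an endpoint like that reachable (for example via kubectl port-forward), `go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30` collects a CPU profile, the /debug/pprof/heap and /debug/pprof/goroutine paths provide the other two snapshots in the zip, and `go tool pprof -pdf` renders the kind of call-graph PDF attached above.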

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@stijndehaes stijndehaes added the kind/bug (Categorizes issue or PR as related to a bug) label Mar 9, 2023
@jonathan-innis
Member

@stijndehaes Is there any degradation to anything on your cluster while you are seeing that spike in CPU usage? Karpenter is fairly CPU-intensive in general because of all the computation involved in scheduling simulations during provisioning and deprovisioning, so this is about what we expect.

I do think there is still some room to improve this usage moving forward, but for now the recommendation is to bump your resource requests to what Karpenter needs.

@stijndehaes
Contributor Author

@jonathan-innis there was some degradation 2 nights ago. We recently switched our largest provisioner group from consolidation checking to emptiness checking. All of a sudden we were getting OOMKilled and our CPU usage roughly doubled.

In the meantime we also upgraded from karpenter 0.25 to 0.26.1, but I don't think that caused the issue.

I am willing to help test new versions of karpenter and report CPU/Memory usage here. I could even build karpenter from a branch to test it out :)

@stijndehaes
Contributor Author

@jonathan-innis I made a PR to karpenter-core which reduces some CPU cycles in the hot code path :) It helped us reduce CPU load significantly:

[attached image: CPU usage graph showing the reduction]
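
To illustrate the general idea (a simplified sketch, not the actual diff from that PR): calling sets.Union repeatedly in a hot loop allocates and copies a fresh set on every call, whereas accumulating into a single set in place avoids most of those allocations. The function names below are made up for illustration:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/sets"
)

// unionPerIteration mirrors the pattern visible in the profile: every sets.Union
// call copies both operands into a brand-new set, so a loop over n sets performs
// n full copies and allocations.
func unionPerIteration(groups []sets.String) sets.String {
	result := sets.NewString()
	for _, g := range groups {
		result = result.Union(g) // allocates and fills a fresh set on every iteration
	}
	return result
}

// unionInPlace accumulates into a single set instead, replacing the repeated
// copies with inserts into one growing map.
func unionInPlace(groups []sets.String) sets.String {
	result := sets.NewString()
	for _, g := range groups {
		result.Insert(g.UnsortedList()...)
	}
	return result
}

func main() {
	groups := []sets.String{
		sets.NewString("a", "b"),
		sets.NewString("b", "c"),
		sets.NewString("c", "d"),
	}
	fmt.Println(unionPerIteration(groups).List()) // [a b c d]
	fmt.Println(unionInPlace(groups).List())      // [a b c d]
}
```

Since loops like this run on every reconcile, the saved allocations should also show up as less garbage-collection work, which would match what the profile suggested.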

@jonathan-innis
Member

@stijndehaes That PR is great! Any hypothesis as to why switching to ttlSecondsAfterEmpty increased CPU/memory usage? I would expect consolidation to be more CPU-intensive than the emptiness check.

@stijndehaes
Contributor Author

@stijndehaes That PR is great! Any hypothesis as to why switching to ttlSecondsAfterEmpty increased CPU/memory usage? I would expect consolidation to be more CPU-intensive than the emptiness check.

The only thing I can think of is that we have 2 provisioners with similar specifications, but one running emptiness and one running consolidation. The reason we do this is the following: we have some long-running pods for Airflow (web application + scheduler), and for the rest we run batch jobs that finish at some point in time.
The long-running pods run on a provisioner with consolidation enabled so they are bin-packed. The batch jobs can run both on the provisioner for the long-running pods and on an extra provisioner with emptiness enabled.

Not sure if this can affect the cost of calculating consolidation?

@stijndehaes
Contributor Author

@jonathan-innis I found the reason for the CPU changes we are seeing.
On most of our clusters CPU usage did indeed drop; however, on one cluster it increased. I was able to reproduce similar spikes in Karpenter's CPU usage by doing the following:

  • Create a provisioner with consolidation enabled
  • Launch enough pods to bring up around 20 nodes; these pods cannot be consolidated onto fewer nodes

Karpenter's CPU usage rises significantly during these steps.

What we did before was run several batch jobs at the same time on these nodes; those pods had the karpenter.sh/do-not-evict: true annotation set. That made deprovisioning skip these nodes as candidates, which actually reduced CPU usage for us quite a bit. Since these nodes are no longer skipped as candidates, we have a higher CPU load.

It is mostly the single-node consolidation that is taking up so much CPU. A way to improve performance might be to apply a utilisation threshold before counting these nodes as candidates for consolidation, WDYT? There might be other ways to achieve this; filtering out node candidates early is one option (rough sketch below).
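
To make that suggestion a bit more concrete, here is a rough sketch of what an early utilization filter could look like; the types, field names, and threshold value are hypothetical and not part of Karpenter's API:

```go
package main

import "fmt"

// candidate is a stand-in for a deprovisioning candidate; Karpenter's real code
// works with much richer node and pod state than this.
type candidate struct {
	Name           string
	CPURequested   float64 // sum of pod CPU requests on the node, in cores
	CPUAllocatable float64 // node allocatable CPU, in cores
}

// filterByUtilization drops nodes whose CPU utilization exceeds the threshold
// before the expensive consolidation simulation ever considers them.
func filterByUtilization(cands []candidate, threshold float64) []candidate {
	var kept []candidate
	for _, c := range cands {
		if c.CPUAllocatable == 0 {
			continue // avoid dividing by zero on malformed data
		}
		if c.CPURequested/c.CPUAllocatable <= threshold {
			kept = append(kept, c)
		}
	}
	return kept
}

func main() {
	nodes := []candidate{
		{Name: "node-a", CPURequested: 3.5, CPUAllocatable: 4}, // heavily used, skipped
		{Name: "node-b", CPURequested: 0.4, CPUAllocatable: 4}, // mostly empty, kept
	}
	for _, c := range filterByUtilization(nodes, 0.5) {
		fmt.Println("consolidation candidate:", c.Name)
	}
}
```

The point is only that a cheap per-node check up front could spare the much more expensive scheduling simulation for nodes that clearly will not be consolidated anyway.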

@billrayburn billrayburn added the v1 (Issues requiring resolution by the v1 milestone) label and removed the kind/bug (Categorizes issue or PR as related to a bug) label Mar 29, 2023
@github-actions

Labeled for closure due to inactivity in 10 days.

@github-actions github-actions bot added the lifecycle/stale (Denotes an issue or PR has remained open with no activity and has become stale) label Apr 19, 2023
@tzneal tzneal removed the lifecycle/stale (Denotes an issue or PR has remained open with no activity and has become stale) label Apr 19, 2023
@jonathan-innis jonathan-innis added the performance (Issues relating to performance: memory usage, CPU usage, timing) and operational-excellence labels Apr 19, 2023
@jonathan-innis
Member

@stijndehaes Sorry for accidentally labeling this as stale; we didn't have the stale bot set up properly here.

a way to improve performance might be to apply a utilisation threshold before counting these nodes as candidates for consolidation, WDYT

We've thought about doing this a little bit. There are a couple of issues tracking all the different knobs that everyone wants for deprovisioning (#3520 and #735) if you want to add your thoughts to either of those. As you called out, there's definitely a performance impact to not filtering early in the process that we need to consider.

@stijndehaes
Contributor Author

@stijndehaes Sorry for accidentally labeling this as stale; we didn't have the stale bot set up properly here.

a way to improve performance might be to apply a utilisation threshold before counting these nodes as candidates for consolidation, WDYT

We've thought about doing this a little bit. There are a couple of issues tracking all the different knobs that everyone wants for deprovisioning (#3520 and #735) if you want to add your thoughts to either of those. As you called out, there's definitely a performance impact to not filtering early in the process that we need to consider.

@jonathan-innis no problem, this can happen. I am just back from 3 weeks of holidays.

After deploying a new version with the improvement I made, we see a 50% reduction in CPU usage in our biggest cluster (with up to 600 nodes), and in a smaller 50-node cluster we see an 80% reduction in CPU usage. So my previous improvement helps a lot. However, in the biggest cluster we still see an average CPU usage of about 1 core continuously; in smaller clusters CPU usage is almost negligible at the moment.

I'll have a look at those issues later today :)

@jonathan-innis
Member

This is awesome to hear @stijndehaes! Huge win here! Separate question: Are you in the #karpenter-dev channel in the Karpenter slack? It would be cool to give you a shoutout and anecdote for the work you are doing here!

@stijndehaes
Contributor Author

This is awesome to hear @stijndehaes! Huge win here! Separate question: Are you in the #karpenter-dev channel in the Karpenter slack? It would be cool to give you a shoutout and anecdote for the work you are doing here!

I wasn't yet but I am now :)

@billrayburn

Closing because the remaining outstanding goal is not measurable. Please open new issues for specific performance improvements or findings. Thank you @stijndehaes for the contribution!

@njtran njtran transferred this issue from aws/karpenter-provider-aws Nov 2, 2023