High CPU Usage #722
Comments
@stijndehaes Is there any degradation to anything on your cluster while you are seeing that spike in CPU usage? Karpenter is fairly CPU intensive in general because of all the computation we do for scheduling simulations during provisioning and deprovisioning, so this is about what we expect. I do think there is still some room to improve this usage moving forward, but for now the recommendation is to bump your resource requests to what Karpenter needs.
@jonathan-innis there was some degradation 2 nights ago. We recently switched our largest provisioner group from consolidation checking to emptiness checking. All of a sudden we were getting OOMKilled and our CPU usage roughly doubled. In the meantime we also upgraded from Karpenter 0.25 to 0.26.1, but I don't think that was the issue. I am willing to help test new versions of Karpenter and report CPU/memory usage here. I could even build Karpenter from a branch to test it out :)
@jonathan-innis I made a PR to karpenter-core which reduces some CPU cycles in the hot code path :) It helped us reduce CPU load significantly:
@stijndehaes That PR is great! Any hypothesis as to why switching to ttlSecondsAfterEmpty increased CPU/memory usage? I would expect consolidation to be more CPU-intensive than the emptiness check.
The only thing I can think of is that we have 2 provisioners with similar specifications, but one is running emptiness checking and one is running consolidation. The reason we do this is the following: we have some long-running pods that run Airflow (web application + scheduler), and for the rest we run batch jobs that finish at a certain moment in time. Not sure if this can impact the cost of calculating consolidation?
@jonathan-innis I found the reason for the CPU changes we are seeing.
What we did before was run several batch jobs at the same time on these nodes; these pods had the
It is mostly the single-node provisioning that is taking up so much CPU. A way to improve performance might be to have a utilisation threshold before counting these nodes as candidates for consolidation, WDYT? There might be other ways to achieve this; filtering out node candidates early is one thing (a rough sketch follows below).
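To make the idea concrete, here is a minimal, hypothetical sketch of what early filtering on a utilisation threshold could look like. This is not Karpenter's actual deprovisioning code; the `NodeCandidate` type, the `utilisation` helper, and the 0.5 threshold are all assumptions made for illustration only.

```go
package main

import "fmt"

// NodeCandidate is a hypothetical stand-in for whatever structure the
// deprovisioner uses to track a node and the pods scheduled on it.
type NodeCandidate struct {
	Name        string
	CPURequests float64 // sum of pod CPU requests on the node, in cores
	CPUCapacity float64 // allocatable CPU of the node, in cores
}

// utilisation returns the fraction of the node's allocatable CPU that is
// requested by pods. A real implementation would likely consider memory too.
func utilisation(n NodeCandidate) float64 {
	if n.CPUCapacity == 0 {
		return 0
	}
	return n.CPURequests / n.CPUCapacity
}

// filterCandidates drops busy nodes before any expensive scheduling
// simulation runs, which is the early-filtering idea discussed above.
func filterCandidates(nodes []NodeCandidate, threshold float64) []NodeCandidate {
	out := make([]NodeCandidate, 0, len(nodes))
	for _, n := range nodes {
		if utilisation(n) <= threshold {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	nodes := []NodeCandidate{
		{Name: "busy", CPURequests: 7.5, CPUCapacity: 8},
		{Name: "mostly-empty", CPURequests: 0.5, CPUCapacity: 8},
	}
	// Only "mostly-empty" survives a 50% utilisation threshold.
	fmt.Println(filterCandidates(nodes, 0.5))
}
```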
Labeled for closure due to inactivity in 10 days.
@stijndehaes Sorry for accidentally labeling this as stale; we didn't have the stalebot set up properly here.
We've thought about doing this a little bit. There are a couple of issues right now that are tracking all the different knobs that everyone wants for deprovisioning (#3520 and #735) if you want to add your thoughts to either of those. There's definitely a performance impact to not filtering early on in the process that we need to consider, as you called out.
@jonathan-innis no problem, this can happen. I am just back from 3 weeks of holidays. After deploying a new version with the improvement I made, we see a 50% reduction in CPU usage in our biggest cluster (with up to 600 nodes), and in a smaller 50-node cluster we see an 80% reduction in CPU usage. So my previous improvement helps a lot. However, in the biggest cluster we still see an average CPU usage of about 1 core continuously; in smaller clusters CPU usage is almost negligible at the moment. I'll have a look at those issues later today :)
This is awesome to hear @stijndehaes! Huge win here! Separate question: are you in the #karpenter-dev channel in the Karpenter Slack? It would be cool to give you a shoutout and an anecdote for the work you are doing here!
I wasn't yet but I am now :)
Closing because the remaining outstanding goal is not measurable. Please open new issues for specific performance improvements or findings. Thank you @stijndehaes for the contribution!
Version
Karpenter Version: v0.26.1
Kubernetes Version: v1.24.8-eks-ffeb93d
Expected Behavior
We expect CPU usage to stay at a reasonable level.
Actual Behavior
We currently see very high CPU usage:
The graph shows:
rate(process_cpu_seconds_total{job="karpenter"}[2m])
Steps to Reproduce the Problem
We are currently running a cluster with 200 nodes, but at night it can peak to 700+ nodes.
There are batch jobs running on the cluster.
We are currently using 4 provisioners:
We recently upgraded from karpenter 0.25.0, and have also just split our biggest provisioner into 2:
Resource Specs and Logs
We have currently assigned more resources to mitigate the issue:
Not all the memory is currently being used.
I also enabled pprof and have created a CPU profile and a memory profile.
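For context, this is a minimal sketch of how a Go service typically exposes pprof endpoints via the standard library; Karpenter's actual configuration and ports may differ, and the `localhost:6060` address here is only an assumption for the example.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Once this is listening, a CPU profile can be captured with:
	//   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
	// and a heap snapshot with:
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```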
The PDF of the CPU profile is attached here. In it, I can see that a lot of time is spent in the `sets.Union` method, and also in garbage collection, I think? At that moment, however, only 1Gi of the 3Gi of memory was being used, so I don't think memory is the problem. The most problematic thing seems to be the number of times the reconcile method is triggered, and the amount of time `sets.Union` takes in that loop. The PDF shows the visualisation of the CPU profile.
profile002.pdf
This zip contains the CPU profile, memory heap snapshot, and goroutine snapshot:
karpenter.zip
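To illustrate why `sets.Union` can dominate a hot reconcile loop, as seen in the profile above: each `Union` call allocates a fresh set and copies both operands, so calling it repeatedly inside a loop generates a lot of garbage. The following is a small illustrative sketch of that pattern and a cheaper in-place alternative using `k8s.io/apimachinery`'s string sets; it is not the actual change from the PR mentioned earlier.

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/sets"
)

// unionPerIteration mirrors the expensive pattern: every Union call copies
// both operands into a freshly allocated set, so allocations and copying
// grow with every iteration of the loop.
func unionPerIteration(groups []sets.String) sets.String {
	result := sets.NewString()
	for _, g := range groups {
		result = result.Union(g) // new allocation + full copy each time
	}
	return result
}

// insertInPlace accumulates into a single set instead, doing the same
// logical work with far fewer allocations.
func insertInPlace(groups []sets.String) sets.String {
	result := sets.NewString()
	for _, g := range groups {
		result.Insert(g.UnsortedList()...) // mutate in place, no intermediate sets
	}
	return result
}

func main() {
	groups := []sets.String{
		sets.NewString("a", "b"),
		sets.NewString("b", "c"),
	}
	fmt.Println(unionPerIteration(groups).List())
	fmt.Println(insertInPlace(groups).List())
}
```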