
High CPU Usage #722

Closed

stijndehaes opened this issue Mar 9, 2023 · 12 comments
Labels
operational-excellence · performance (Issues relating to performance: memory usage, CPU usage, timing) · v1 (Issues requiring resolution by the v1 milestone)

Comments

@stijndehaes
Contributor

Version

Karpenter Version: v0.26.1

Kubernetes Version: v1.24.8-eks-ffeb93d

Expected Behavior

We expect CPU usage to stay at a reasonable level.

Actual Behavior

We currently see very high CPU usage:
[Screenshot 2023-03-09 at 10:48: Karpenter CPU usage graph]

The graph shows: rate(process_cpu_seconds_total{job="karpenter"}[2m])

Steps to Reproduce the Problem

We are currently running a cluster with 200 nodes, but at night it can peak to 700+ nodes.
There are batch jobs running on the cluster.

We are currently using 4 provisioners:

  • 3 provisioners use emptiness checking
  • 1 provisioner uses consolidation to downscale

We recently upgraded from karpenter 0.25.0, and have also just split our biggest provisioner into 2:

  • 1 running emptiness
  • 1 running consolidation

Resource Specs and Logs

We have currently assigned more resources to mitigate the issue:

  • cpu: 800m
  • memory: 3Gi

Not all of the memory is currently being used.
I also enabled pprof and captured a CPU profile and a memory profile (see the pprof sketch below, after the attachments).

The PDF of the CPU profile is attached below. In it I can see that a lot of time is spent in the sets.Union method, and I think also in garbage collection. At that moment only 1Gi of the 3Gi of memory was in use, so I don't think memory is the problem. The most problematic thing seems to be how often the reconcile method is triggered and how much time sets.Union takes inside that loop.

This PDF shows the visualisation of the CPU profile:
profile002.pdf

This zip contains the CPU profile, memory heap snapshot, and goroutine snapshot:
karpenter.zip
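
For reference, the CPU, heap, and goroutine profiles above come from Go's pprof tooling. Purely as a minimal sketch of how the standard net/http/pprof endpoint is typically exposed in a Go service (this is generic Go setup, not Karpenter's actual wiring, and the port is an arbitrary choice):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Serve the pprof endpoints on a side port so profiling traffic stays off the main listener.
	go func() {
		if err := http.ListenAndServe("localhost:6060", nil); err != nil {
			log.Printf("pprof server stopped: %v", err)
		}
	}()

	// ... the rest of the application would run here ...
	select {}
}
```

With an endpoint like that reachable (for example via kubectl port-forward), `go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30` collects a CPU profile, the /debug/pprof/heap and /debug/pprof/goroutine paths provide the other two snapshots in the zip, and `go tool pprof -pdf` renders the kind of call-graph PDF attached above.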

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@stijndehaes stijndehaes added the kind/bug (Categorizes issue or PR as related to a bug) label Mar 9, 2023
@jonathan-innis
Member

@stijndehaes Is there any degradation to anything on your cluster while you are seeing that spike in CPU usage? Karpenter is fairly CPU-intensive in general because of all the computation involved in scheduling simulations during provisioning and deprovisioning, so this is about what we expect.

I do think there is still some room to improve this usage moving forward, but for now the recommendation is to bump your resource requests to what Karpenter needs.

@stijndehaes
Contributor Author

@jonathan-innis there was some degradation 2 nights ago. We recently switched our largest provisioner group from consolidation checking to emptiness checking. All of a sudden we were getting OOMKilled and our CPU usage roughly doubled.

In the meantime we also upgraded from karpenter 0.25 to 0.26.1, but I don't think that caused the issue.

I am willing to help test new versions of karpenter and report CPU/Memory usage here. I could even build karpenter from a branch to test it out :)

@stijndehaes
Contributor Author

@jonathan-innis I made a PR to karpenter-core which reduces some CPU cycles in the hot code path :) It helped us reduce CPU load significantly:

[attached image: CPU usage graph showing the reduction]
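
To illustrate the general idea (a simplified sketch, not the actual diff from that PR): calling sets.Union repeatedly in a hot loop allocates and copies a fresh set on every call, whereas accumulating into a single set in place avoids most of those allocations. The function names below are made up for illustration:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/sets"
)

// unionPerIteration mirrors the pattern visible in the profile: every sets.Union
// call copies both operands into a brand-new set, so a loop over n sets performs
// n full copies and allocations.
func unionPerIteration(groups []sets.String) sets.String {
	result := sets.NewString()
	for _, g := range groups {
		result = result.Union(g) // allocates and fills a fresh set on every iteration
	}
	return result
}

// unionInPlace accumulates into a single set instead, replacing the repeated
// copies with inserts into one growing map.
func unionInPlace(groups []sets.String) sets.String {
	result := sets.NewString()
	for _, g := range groups {
		result.Insert(g.UnsortedList()...)
	}
	return result
}

func main() {
	groups := []sets.String{
		sets.NewString("a", "b"),
		sets.NewString("b", "c"),
		sets.NewString("c", "d"),
	}
	fmt.Println(unionPerIteration(groups).List()) // [a b c d]
	fmt.Println(unionInPlace(groups).List())      // [a b c d]
}
```

Since loops like this run on every reconcile, the saved allocations should also show up as less garbage-collection work, which would match what the profile suggested.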

@jonathan-innis
Member

@stijndehaes That PR is great! Any hypothesis as to why switching to ttlSecondsAfterEmpty increased CPU/memory usage? I would expect consolidation to be more CPU-intensive than the emptiness check.

@stijndehaes
Contributor Author

@stijndehaes That PR is great! Any hypothesis as to why switching to ttlSecondsAfterEmpty increased CPU/memory usage? I would expect consolidation to be more CPU-intensive than the emptiness check.

The only thing I can think of is that we have 2 provisioners with similar specifications, but one running emptiness and one running consolidation. The reason we do this is the following: we have some long-running pods for Airflow (web application + scheduler), and for the rest we run batch jobs that finish at some point in time.
The long-running pods run on a provisioner with consolidation enabled so they are bin-packed. The batch jobs can run both on the provisioner for the long-running pods and on an extra provisioner with emptiness enabled.

Not sure if this can affect the cost of calculating consolidation?

@stijndehaes
Contributor Author

@jonathan-innis I found the reason for the CPU changes we are seeing.
On most of our clusters CPU usage did indeed drop; however, on one cluster it increased. I was able to reproduce similar spikes in Karpenter's CPU usage by doing the following:

  • Create a provisioner with consolidation enabled
  • Launch enough pods to bring up around 20 nodes; these pods cannot be consolidated onto fewer nodes

Karpenter's CPU usage rises significantly during these steps.

What we did before was run several batch jobs at the same time on these nodes; those pods had the karpenter.sh/do-not-evict: true annotation set. That made deprovisioning skip these nodes as candidates, which actually reduced CPU usage for us quite a bit. Since these nodes are no longer skipped as candidates, we have a higher CPU load.

It is mostly the single-node consolidation that is taking up so much CPU. A way to improve performance might be to apply a utilisation threshold before counting these nodes as candidates for consolidation, WDYT? There might be other ways to achieve this; filtering out node candidates early is one option (rough sketch below).
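
To make that suggestion a bit more concrete, here is a rough sketch of what an early utilization filter could look like; the types, field names, and threshold value are hypothetical and not part of Karpenter's API:

```go
package main

import "fmt"

// candidate is a stand-in for a deprovisioning candidate; Karpenter's real code
// works with much richer node and pod state than this.
type candidate struct {
	Name           string
	CPURequested   float64 // sum of pod CPU requests on the node, in cores
	CPUAllocatable float64 // node allocatable CPU, in cores
}

// filterByUtilization drops nodes whose CPU utilization exceeds the threshold
// before the expensive consolidation simulation ever considers them.
func filterByUtilization(cands []candidate, threshold float64) []candidate {
	var kept []candidate
	for _, c := range cands {
		if c.CPUAllocatable == 0 {
			continue // avoid dividing by zero on malformed data
		}
		if c.CPURequested/c.CPUAllocatable <= threshold {
			kept = append(kept, c)
		}
	}
	return kept
}

func main() {
	nodes := []candidate{
		{Name: "node-a", CPURequested: 3.5, CPUAllocatable: 4}, // heavily used, skipped
		{Name: "node-b", CPURequested: 0.4, CPUAllocatable: 4}, // mostly empty, kept
	}
	for _, c := range filterByUtilization(nodes, 0.5) {
		fmt.Println("consolidation candidate:", c.Name)
	}
}
```

The point is only that a cheap per-node check up front could spare the much more expensive scheduling simulation for nodes that clearly will not be consolidated anyway.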

@billrayburn billrayburn added the v1 (Issues requiring resolution by the v1 milestone) label and removed the kind/bug (Categorizes issue or PR as related to a bug) label Mar 29, 2023
@github-actions

Labeled for closure due to inactivity in 10 days.

@github-actions github-actions bot added the lifecycle/stale (Denotes an issue or PR has remained open with no activity and has become stale) label Apr 19, 2023
@tzneal tzneal removed the lifecycle/stale (Denotes an issue or PR has remained open with no activity and has become stale) label Apr 19, 2023
@jonathan-innis jonathan-innis added the performance (Issues relating to performance: memory usage, CPU usage, timing) and operational-excellence labels Apr 19, 2023
@jonathan-innis
Member

@stijndehaes Sorry for accidentally labeling this as stale; we didn't have the stale bot set up properly here.

a way to improve performance might be to apply a utilisation threshold before counting these nodes as candidates for consolidation, WDYT

We've thought about doing this a little bit. There are a couple of issues tracking all the different knobs that everyone wants for deprovisioning (#3520 and #735) if you want to add your thoughts to either of those. As you called out, there's definitely a performance impact to not filtering early in the process that we need to consider.

@stijndehaes
Contributor Author

@stijndehaes Sorry for accidentally labeling this as stale; we didn't have the stale bot set up properly here.

a way to improve performance might be to apply a utilisation threshold before counting these nodes as candidates for consolidation, WDYT

We've thought about doing this a little bit. There are a couple of issues tracking all the different knobs that everyone wants for deprovisioning (#3520 and #735) if you want to add your thoughts to either of those. As you called out, there's definitely a performance impact to not filtering early in the process that we need to consider.

@jonathan-innis no problem, this can happen. I am just back from 3 weeks of holidays.

After deploying a new version with the improvement I made, we see a 50% reduction in CPU usage in our biggest cluster (with up to 600 nodes), and in a smaller 50-node cluster we see an 80% reduction in CPU usage. So my previous improvement helps a lot. However, in the biggest cluster we still see an average CPU usage of about 1 core continuously; in smaller clusters CPU usage is almost negligible at the moment.

I'll have a look at those issues later today :)

@jonathan-innis
Member

This is awesome to hear @stijndehaes! Huge win here! Separate question: Are you in the #karpenter-dev channel in the Karpenter slack? It would be cool to give you a shoutout and anecdote for the work you are doing here!

@stijndehaes
Contributor Author

This is awesome to hear @stijndehaes! Huge win here! Separate question: Are you in the #karpenter-dev channel in the Karpenter slack? It would be cool to give you a shoutout and anecdote for the work you are doing here!

I wasn't yet but I am now :)

@billrayburn

Closing because the remaining outstanding goal is not measurable. Please open new issues for specific performance improvements or findings. Thank you @stijndehaes for the contribution!

@njtran njtran transferred this issue from aws/karpenter-provider-aws Nov 2, 2023