perf: Improve the performance of the provisioner #235
Conversation
This is awesome! Just one thought on giving an explanation
Pull Request Test Coverage Report for Build 4402023601
💛 - Coveralls
Can you run the scheduling benchmark before and after? Would love to see the results.
I think I found it:

```
➜ go test -tags=test_performance -run=SchedulingProfile
scheduled 7610 against 21 nodes in total in 3.547897064s 2144.9325791375327 pods/sec
400 instances 10 pods 1 nodes 2.686218ms per scheduling 268.621µs per pod
400 instances 100 pods 1 nodes 27.049895ms per scheduling 270.498µs per pod
400 instances 500 pods 1 nodes 127.605296ms per scheduling 255.21µs per pod
400 instances 1000 pods 3 nodes 293.152625ms per scheduling 293.152µs per pod
400 instances 1500 pods 4 nodes 424.123055ms per scheduling 282.748µs per pod
400 instances 2000 pods 5 nodes 564.165291ms per scheduling 282.082µs per pod
400 instances 2500 pods 6 nodes 642.586604ms per scheduling 257.034µs per pod
PASS
ok      github.com/aws/karpenter-core/pkg/controllers/provisioning/scheduling  13.892s
```

after:

Looks rather similar to me.
This is what I get for reviewing code on my phone. Apologies -- this provisioner code is outside of the scope of the benchmark 🤦. I'm honestly shocked to see this amount of performance hit from this piece of code, given that it's essentially initialization logic. Do you have a massive number of provisioners, or something?
@ellistarn we only use 4 provisioners: 3 of them running emptiness, and one using consolidation. We run from 50 to 800 nodes and are constantly scheduling batch jobs. You can also see that most of the CPU time goes to sets.Union, because it copies data on every call. In general the code makes many copies, which costs a lot of CPU cycles. I realise you guys have been focusing on features, but I would like to help you improve the performance :) Maybe I can try to create some benchmarks that show the same performance issues I am seeing?
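For illustration, a minimal Go microbenchmark sketch of the kind of comparison described above, assuming `sets.String` from `k8s.io/apimachinery/pkg/util/sets` (the set sizes and value names are made up, not taken from the actual workload):

```go
package scheduling_test

import (
	"fmt"
	"testing"

	"k8s.io/apimachinery/pkg/util/sets"
)

// BenchmarkMergeWithUnion merges via sets.Union, which allocates a new
// set and copies every element of both inputs on each call.
func BenchmarkMergeWithUnion(b *testing.B) {
	base := sets.NewString()
	for i := 0; i < 1000; i++ {
		base.Insert(fmt.Sprintf("instance-type-%d", i))
	}
	extra := sets.NewString("a", "b", "c")
	b.ResetTimer()
	for n := 0; n < b.N; n++ {
		_ = base.Union(extra) // copies all ~1003 entries every iteration
	}
}

// BenchmarkMergeInPlace inserts the new values directly, touching only
// the handful of entries that actually change.
func BenchmarkMergeInPlace(b *testing.B) {
	base := sets.NewString()
	for i := 0; i < 1000; i++ {
		base.Insert(fmt.Sprintf("instance-type-%d", i))
	}
	b.ResetTimer()
	for n := 0; n < b.N; n++ {
		base.Insert("a", "b", "c") // no new set, no full copy
	}
}
```

Running it with `go test -bench=Merge` should make the allocation gap between the two patterns visible directly.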
Agree. We're shifting gears as we approach v1.
Fantastic. Can you work with @jonathan-innis to help define this? Are you thinking about code benchmarks or real test workloads? We need to build these benchmarks into our e2e suites, since GHA has such unreliable performance @spring1843. |
LGTM 🚀
In support of: #722
Description
This change improves the performance of the provisioner. We saw high CPU usage, and using pprof I found that a lot of CPU cycles were wasted in the Union function on the string set.
Instead of using Union, I now insert the new strings into the set in place, which decreased CPU usage by up to 50%. A before/after sketch of the pattern is shown below.
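The sketch below is illustrative, not the actual diff: `mergeWithUnion`, `mergeInPlace`, and the `groups` input are made-up names, and `sets` is the string set from `k8s.io/apimachinery/pkg/util/sets`.

```go
package provisioning

import "k8s.io/apimachinery/pkg/util/sets"

// mergeWithUnion is the old pattern: every Union call allocates a fresh
// set and copies all existing entries into it, so merging N groups does
// repeated full copies of the accumulated set.
func mergeWithUnion(groups [][]string) sets.String {
	merged := sets.NewString()
	for _, group := range groups {
		merged = merged.Union(sets.NewString(group...))
	}
	return merged
}

// mergeInPlace is the new pattern: insert the strings directly into the
// existing set, so each value is written exactly once.
func mergeInPlace(groups [][]string) sets.String {
	merged := sets.NewString()
	for _, group := range groups {
		merged.Insert(group...)
	}
	return merged
}
```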
How was this change tested?
I tested it by building karpenter from source and deploying it on a cluster with high CPU usage. You can see the result here; the green line is before the change and the yellow line is after:
I also have two pprof profiles. You can find the PDF of the one before the change here:
profile-old.pdf
And one after the change:
profile-improved.pdf
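For reference, one common way to capture such a CPU profile is via the standard `net/http/pprof` endpoints; the sketch below is generic, not how Karpenter itself is wired, and the port is an assumption:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof handlers on the default mux
)

func main() {
	// Illustrative only: exposes the standard pprof endpoints so a CPU
	// profile can be captured and rendered to PDF with, e.g.:
	//   go tool pprof -pdf http://localhost:6060/debug/pprof/profile?seconds=30
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```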
You can clearly see that the hot code path has changed.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.