
perf: Improve the performance of the provisioner #235

Merged: 1 commit merged into kubernetes-sigs:main from feature/optimize-performance-2 on Mar 13, 2023

Conversation

stijndehaes
Contributor

In support of: #722

Description

This change improves the performance of the provisioner. We saw high CPU usage, and profiling with pprof showed that a lot of CPU cycles were being spent in the Union function on the string set.

Instead of using Union, the new strings are now inserted into the set in place, which decreased CPU usage by up to 50%.
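For illustration, here is a minimal sketch of the pattern described above, using the standard k8s.io/apimachinery string set (the function names are illustrative, not the exact Karpenter code):

package provisioning

import "k8s.io/apimachinery/pkg/util/sets"

// Before: Union builds and returns a brand-new set on every call,
// copying every existing key plus the new ones.
func mergeWithUnion(dst sets.String, keys []string) sets.String {
	return dst.Union(sets.NewString(keys...))
}

// After: Insert adds the new keys into the existing set in place,
// so no full copy of the set is made.
func mergeInPlace(dst sets.String, keys []string) sets.String {
	dst.Insert(keys...)
	return dst
}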

How was this change tested?

I tested this by building Karpenter from source and deploying it on a cluster with high CPU usage. You can see the result in the screenshot below; the green line is before the change and the yellow line is after:
[CPU usage screenshot]

I also have two pprof profiles. The one from before the change is here:
profile-old.pdf

And the one from after the change:
profile-improved.pdf

You can clearly see that the hot code path has changed.
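For anyone wanting to reproduce this kind of profile, a minimal sketch of exposing the standard Go pprof endpoint in a binary follows (Karpenter's actual profiling setup may differ):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof handlers on the default mux
)

func main() {
	// With this running, a CPU profile can be fetched with:
	//   go tool pprof http://localhost:6060/debug/pprof/profile
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}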

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@stijndehaes stijndehaes requested a review from a team as a code owner March 10, 2023 14:05
@jonathan-innis (Member) left a comment

This is awesome! Just one thought on giving an explanation.

pkg/controllers/provisioning/provisioner.go (review comment, resolved)
@coveralls

coveralls commented Mar 10, 2023

Pull Request Test Coverage Report for Build 4402023601

  • 14 of 15 (93.33%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.04%) to 80.928%

Changes missing coverage:
  • pkg/controllers/provisioning/provisioner.go — 14 of 15 changed/added lines covered (93.33%)

Totals (coverage status):
  • Change from base Build 4377678707: +0.04%
  • Covered Lines: 6556
  • Relevant Lines: 8101

💛 - Coveralls

@ellistarn
Contributor

Can you run the scheduling benchmark before and after? Would love to see the results.

@stijndehaes
Contributor Author

stijndehaes commented Mar 13, 2023

Can you run the scheduling benchmark before and after? Would love to see the results.

I think I found it:
before:

➜  go test -tags=test_performance -run=SchedulingProfile
scheduled 7610 against 21 nodes in total in 3.547897064s 2144.9325791375327 pods/sec
400 instances 10 pods    1 nodes  2.686218ms per scheduling    268.621µs per pod
400 instances 100 pods   1 nodes  27.049895ms per scheduling   270.498µs per pod
400 instances 500 pods   1 nodes  127.605296ms per scheduling  255.21µs per pod
400 instances 1000 pods  3 nodes  293.152625ms per scheduling  293.152µs per pod
400 instances 1500 pods  4 nodes  424.123055ms per scheduling  282.748µs per pod
400 instances 2000 pods  5 nodes  564.165291ms per scheduling  282.082µs per pod
400 instances 2500 pods  6 nodes  642.586604ms per scheduling  257.034µs per pod
PASS
ok      github.com/aws/karpenter-core/pkg/controllers/provisioning/scheduling   13.892s

after:

➜  go test -tags=test_performance -run=SchedulingProfile
scheduled 7610 against 21 nodes in total in 3.513244782s 2166.0887504877533 pods/sec
400 instances 10 pods    1 nodes  2.719053ms per scheduling    271.905µs per pod
400 instances 100 pods   1 nodes  27.699721ms per scheduling   276.997µs per pod
400 instances 500 pods   1 nodes  126.389574ms per scheduling  252.779µs per pod
400 instances 1000 pods  3 nodes  282.285583ms per scheduling  282.285µs per pod
400 instances 1500 pods  4 nodes  426.077208ms per scheduling  284.051µs per pod
400 instances 2000 pods  5 nodes  558.808812ms per scheduling  279.404µs per pod
400 instances 2500 pods  6 nodes  636.097562ms per scheduling  254.439µs per pod
PASS
ok      github.com/aws/karpenter-core/pkg/controllers/provisioning/scheduling   13.712s

Looks rather similar to me

@ellistarn
Contributor

Looks rather similar to me

This is what I get for reviewing code on my phone. Apologies -- this provisioner code is outside of the scope of the benchmark 🤦 . I'm honestly shocked to see this amount of performance hit from this piece of code, given that it's essentially initialization logic. Do you have a massive number of provisioners, or something?

@stijndehaes
Contributor Author

Looks rather similar to me

This is what I get for reviewing code on my phone. Apologies -- this provisioner code is outside of the scope of the benchmark 🤦 . I'm honestly shocked to see this amount of performance hit from this piece of code, given that it's essentially initialization logic. Do you have a massive number of provisioners, or something?

@ellistarn we only use 4 provisioners: 3 of them use emptiness and one uses consolidation. We run from 50 to 800 nodes and are constantly scheduling batch jobs.
You can find more info here: #722

You can also see that most of the CPU time goes to sets.Union, because of the many data copies happening there. In general, the code makes many copies, which costs a lot of CPU cycles. I realise you have been focussing on features, but I would like to help you improve the performance :) Maybe I can try to create some benchmarks that show the same performance issues I am seeing?
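As a starting point, a rough micro-benchmark sketch of the Union-vs-in-place-insert difference could look like the following (the key counts and names are made up for illustration):

package setsbench

import (
	"fmt"
	"testing"

	"k8s.io/apimachinery/pkg/util/sets"
)

func testKeys(n int) []string {
	keys := make([]string, n)
	for i := range keys {
		keys[i] = fmt.Sprintf("key-%d", i)
	}
	return keys
}

// BenchmarkUnion allocates a fresh set on every merge, copying all keys.
func BenchmarkUnion(b *testing.B) {
	extra := testKeys(100)
	for i := 0; i < b.N; i++ {
		s := sets.NewString("a", "b", "c")
		s = s.Union(sets.NewString(extra...))
		_ = s
	}
}

// BenchmarkInsertInPlace mutates the existing set, avoiding the extra copy.
func BenchmarkInsertInPlace(b *testing.B) {
	extra := testKeys(100)
	for i := 0; i < b.N; i++ {
		s := sets.NewString("a", "b", "c")
		s.Insert(extra...)
	}
}

Running it with go test -bench=. -benchmem should show the allocation difference between the two approaches.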

@ellistarn
Contributor

ellistarn commented Mar 13, 2023

I realise you guys have been focussing on features

Agree. We're shifting gears as we approach v1.

but I would like to help you guys improve the performance :) Maybe I can try to create some benchmarks that show the same performance issues I am seeing?

Fantastic. Can you work with @jonathan-innis to help define this? Are you thinking about code benchmarks or real test workloads? We need to build these benchmarks into our e2e suites, since GHA has such unreliable performance @spring1843.

@jonathan-innis (Member) left a comment

LGTM 🚀

@jonathan-innis jonathan-innis merged commit 2159f72 into kubernetes-sigs:main Mar 13, 2023
@stijndehaes stijndehaes deleted the feature/optimize-performance-2 branch March 13, 2023 20:02