perf: Improve the performance of the provisioner #235
Conversation
This is awesome! Just one thought on giving an explanation
Pull Request Test Coverage Report for Build 4402023601
💛 - Coveralls
Can you run the scheduling benchmark before and after? Would love to see the results.
I think I found it:

```
➜ go test -tags=test_performance -run=SchedulingProfile
scheduled 7610 against 21 nodes in total in 3.547897064s 2144.9325791375327 pods/sec
400 instances 10 pods 1 nodes 2.686218ms per scheduling 268.621µs per pod
400 instances 100 pods 1 nodes 27.049895ms per scheduling 270.498µs per pod
400 instances 500 pods 1 nodes 127.605296ms per scheduling 255.21µs per pod
400 instances 1000 pods 3 nodes 293.152625ms per scheduling 293.152µs per pod
400 instances 1500 pods 4 nodes 424.123055ms per scheduling 282.748µs per pod
400 instances 2000 pods 5 nodes 564.165291ms per scheduling 282.082µs per pod
400 instances 2500 pods 6 nodes 642.586604ms per scheduling 257.034µs per pod
PASS
ok      github.com/aws/karpenter-core/pkg/controllers/provisioning/scheduling  13.892s
```

after:

Looks rather similar to me.
This is what I get for reviewing code on my phone. Apologies -- this provisioner code is outside of the scope of the benchmark 🤦. I'm honestly shocked to see this amount of performance hit from this piece of code, given that it's essentially initialization logic. Do you have a massive number of provisioners, or something?
@ellistarn we only use 4 provisioners: 3 of them running emptiness, and one using consolidation. We run from 50 to 800 nodes and are constantly scheduling batch jobs. You can also see that most of the CPU time goes to sets.Union, because it copies data on every call. In general the code makes many copies, which costs a lot of CPU cycles. I realise you guys have been focusing on features, but I would like to help you improve the performance :) Maybe I can try to create some benchmarks that show the same performance issues I am seeing?
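For illustration, a minimal Go microbenchmark sketch of the kind of comparison described above, assuming `sets.String` from `k8s.io/apimachinery/pkg/util/sets` (the set sizes and value names are made up, not taken from the actual workload):

```go
package scheduling_test

import (
	"fmt"
	"testing"

	"k8s.io/apimachinery/pkg/util/sets"
)

// BenchmarkMergeWithUnion merges via sets.Union, which allocates a new
// set and copies every element of both inputs on each call.
func BenchmarkMergeWithUnion(b *testing.B) {
	base := sets.NewString()
	for i := 0; i < 1000; i++ {
		base.Insert(fmt.Sprintf("instance-type-%d", i))
	}
	extra := sets.NewString("a", "b", "c")
	b.ResetTimer()
	for n := 0; n < b.N; n++ {
		_ = base.Union(extra) // copies all ~1003 entries every iteration
	}
}

// BenchmarkMergeInPlace inserts the new values directly, touching only
// the handful of entries that actually change.
func BenchmarkMergeInPlace(b *testing.B) {
	base := sets.NewString()
	for i := 0; i < 1000; i++ {
		base.Insert(fmt.Sprintf("instance-type-%d", i))
	}
	b.ResetTimer()
	for n := 0; n < b.N; n++ {
		base.Insert("a", "b", "c") // no new set, no full copy
	}
}
```

Running it with `go test -bench=Merge` should make the allocation gap between the two patterns visible directly.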
Agree. We're shifting gears as we approach v1.
Fantastic. Can you work with @jonathan-innis to help define this? Are you thinking about code benchmarks or real test workloads? We need to build these benchmarks into our e2e suites, since GHA has such unreliable performance @spring1843. |
LGTM 🚀
In support of: #722
Description
This change improves the performance of the provisioner. We saw high CPU usage, and using pprof I found that a lot of CPU cycles were wasted in the Union function on the string set.
Instead of using Union, I now insert the new strings into the set in place, which decreased CPU usage by up to 50%. A before/after sketch of the pattern is shown below.
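The sketch below is illustrative, not the actual diff: `mergeWithUnion`, `mergeInPlace`, and the `groups` input are made-up names, and `sets` is the string set from `k8s.io/apimachinery/pkg/util/sets`.

```go
package provisioning

import "k8s.io/apimachinery/pkg/util/sets"

// mergeWithUnion is the old pattern: every Union call allocates a fresh
// set and copies all existing entries into it, so merging N groups does
// repeated full copies of the accumulated set.
func mergeWithUnion(groups [][]string) sets.String {
	merged := sets.NewString()
	for _, group := range groups {
		merged = merged.Union(sets.NewString(group...))
	}
	return merged
}

// mergeInPlace is the new pattern: insert the strings directly into the
// existing set, so each value is written exactly once.
func mergeInPlace(groups [][]string) sets.String {
	merged := sets.NewString()
	for _, group := range groups {
		merged.Insert(group...)
	}
	return merged
}
```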
How was this change tested?
I tested it by building karpenter from source and deploying it on a cluster with high CPU usage. You can see the result here; the green line is before the change and the yellow line is after:
I also have two pprof profiles. You can find the PDF of the one before the change here:
profile-old.pdf
And one after the change:
profile-improved.pdf
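For reference, one common way to capture such a CPU profile is via the standard `net/http/pprof` endpoints; the sketch below is generic, not how Karpenter itself is wired, and the port is an assumption:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof handlers on the default mux
)

func main() {
	// Illustrative only: exposes the standard pprof endpoints so a CPU
	// profile can be captured and rendered to PDF with, e.g.:
	//   go tool pprof -pdf http://localhost:6060/debug/pprof/profile?seconds=30
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```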
You can clearly see that the hot code path has changed.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.