
Support maxSurge in rolling update #98

Merged · 3 commits merged into kubernetes-sigs:main on May 8, 2024

Conversation

kerthcet
Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it

Which issue(s) this PR fixes

Fixes #3

Special notes for your reviewer

It works like this (see the sketch after the list):

  • During a rolling update, lws will burst out up to maxSurge extra replicas.
  • When the rolling update enters its final stage, we update a new replica and reclaim a bursted replica at each step.
  • Finally, the rolling update succeeds.
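A minimal runnable sketch of the burst-and-reclaim idea above; the helper names (burstTarget, shouldReclaim) are hypothetical and only illustrate the idea, they are not the functions introduced by this PR:

```go
package main

import "fmt"

// burstTarget is a hypothetical helper: during a rolling update the leader
// StatefulSet is scaled up to replicas+maxSurge, with maxSurge clamped so the
// burst never exceeds the declared replicas.
func burstTarget(replicas, maxSurge int32) int32 {
	if maxSurge > replicas {
		maxSurge = replicas
	}
	return replicas + maxSurge
}

// shouldReclaim is a hypothetical helper for the final stage: once the number
// of not-yet-ready replicas is no larger than maxSurge, a bursted replica can
// be released so the remaining updates reuse its capacity.
func shouldReclaim(unreadyReplicas, maxSurge int32) bool {
	return unreadyReplicas <= maxSurge
}

func main() {
	fmt.Println(burstTarget(4, 2))   // 6: burst 2 extra replicas during the update
	fmt.Println(shouldReclaim(2, 2)) // true: start reclaiming the bursted replicas
}
```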

Does this PR introduce a user-facing change?

Support maxSurge in rolling update.

@k8s-ci-robot added the do-not-merge/work-in-progress and kind/feature labels on Apr 12, 2024
@k8s-ci-robot added the approved, cncf-cla: yes, and size/L labels on Apr 12, 2024
@kerthcet
Contributor Author

This is the general prototype, I need more tests.

@kerthcet
Contributor Author

So during a rolling update, lws will request hardware as well, no differently from an lws without maxSurge, so this can not help with your case, right? @liurupeng But when reclaiming the bursted replicas, it can help somewhat here ...

@k8s-ci-robot added the needs-rebase label on Apr 13, 2024
@k8s-ci-robot added the size/XL label and removed the needs-rebase and size/L labels on Apr 16, 2024
@kerthcet changed the title from "[WIP] Support maxSurge in rolling update" to "Support maxSurge in rolling update" on Apr 16, 2024
@k8s-ci-robot removed the do-not-merge/work-in-progress label on Apr 16, 2024
@kerthcet
Contributor Author

You can take a look if you want; only the webhook validation and e2e tests are not added yet. But it seems @ahg-g has questions about this feature. We can't achieve quite the same behavior as Deployment does.

@k8s-ci-robot added the needs-rebase label on Apr 16, 2024
@k8s-ci-robot removed the needs-rebase label on Apr 17, 2024
@kerthcet
Contributor Author

All set now. PTAL.

@kerthcet
Contributor Author

kerthcet commented Apr 18, 2024

Would you like to take a look at this PR? It changes a lot, and once other related PRs merge, lots of conflicts will have to be solved. @ahg-g @liurupeng

@ahg-g
Contributor

ahg-g commented Apr 18, 2024

I will try to find some time this Friday.

@liurupeng
Collaborator

Sorry for the delay. I added two comments about the semantics of the MaxSurge support; once that's finalized, I will review in more detail. Thanks, Kante!

@k8s-ci-robot added the needs-rebase label on Apr 19, 2024
@k8s-ci-robot added the size/XXL label and removed the needs-rebase and size/XL labels on Apr 30, 2024
@kerthcet
Contributor Author

Here is how the rolling update looks right now with maxSurge. Let's say we have an lws with replicas=4, maxUnavailable=2, maxSurge=2, so the rolling step = 2+2 = 4 (a sketch of the Partition arithmetic follows the table):

  • ✅ replica updated successfully
  • ❎ replica hasn't updated yet
  • ⏳ rolling update in progress
| Stage | Partition | Replicas | R-0 | R-1 | R-2 | R-3 | R-4 | R-5 | Note |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Stage1 | 0 | 4 | ✅ | ✅ | ✅ | ✅ | | | Before rolling update |
| Stage2 | 4 | 6 | ❎ | ❎ | ❎ | ❎ | ⏳ | ⏳ | Rolling update started, bursted 2 replicas, Partition=lws.replicas |
| Stage3 | 2 | 6 | ❎ | ❎ | ⏳ | ⏳ | ⏳ | ⏳ | Partition changes to 2 (6-4) |
| Stage4 | 2 | 6 | ❎ | ❎ | ⏳ | ⏳ | ✅ | ⏳ | Since the last replica is not ready, Partition will not change |
| Stage5 | 0 | 6 | ⏳ | ⏳ | ⏳ | ⏳ | ✅ | ✅ | Partition = 6-4-2 = 0 |
| Stage6 | 0 | 6 | ⏳ | ⏳ | ⏳ | ✅ | ✅ | ✅ | |
| Stage7 | 0 | 5 | ⏳ | ✅ | ⏳ | ✅ | ✅ | | Release a replica when unavailableReplicas == maxSurge |
| Stage8 | 0 | 4 | ✅ | ⏳ | ✅ | ✅ | | | Release another replica for accommodation |
| Stage9 | 0 | 4 | ✅ | ✅ | ✅ | ✅ | | | Rolling update finished |
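To connect the stages, here is a small self-contained sketch of one way to read the Partition column above; nextPartition is a hypothetical helper that only illustrates the table, not the controller's actual implementation:

```go
package main

import "fmt"

// nextPartition sketches one reading of the Partition column: it starts at
// lws.Spec.Replicas when the burst happens (Stage2), then steps down by
// maxUnavailable, but only while the replicas at or above the current
// partition are ready (so it stays put at Stage4), and never drops below zero.
func nextPartition(current, maxUnavailable int32, replicasAboveReady bool) int32 {
	if !replicasAboveReady {
		return current
	}
	next := current - maxUnavailable
	if next < 0 {
		next = 0
	}
	return next
}

func main() {
	maxUnavailable := int32(2)
	partition := int32(4)                                       // Stage2: Partition = lws.replicas
	partition = nextPartition(partition, maxUnavailable, true)  // Stage3: 2
	partition = nextPartition(partition, maxUnavailable, false) // Stage4: still 2, the last replica is not ready
	partition = nextPartition(partition, maxUnavailable, true)  // Stage5: 0
	fmt.Println(partition)
}
```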

@kerthcet
Contributor Author

/retest

@kerthcet
Contributor Author

/test pull-lws-test-integration-main

@kerthcet
Contributor Author

@ahg-g
Contributor

ahg-g commented Apr 30, 2024

Here is how the rolling update looks right now with maxSurge. Let's say we have an lws with replicas=4, maxUnavailable=2, maxSurge=2, so the rolling step = 2+2 = 4:

  • ✅ replica updated successfully
  • ❎ replica hasn't updated yet
  • ⏳ rolling update in progress

| Stage | Partition | Replicas | R-0 | R-1 | R-2 | R-3 | R-4 | R-5 | Note |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Stage1 | 0 | 4 | ✅ | ✅ | ✅ | ✅ | | | Before rolling update |
| Stage2 | 4 | 6 | ❎ | ❎ | ❎ | ❎ | ⏳ | ⏳ | Rolling update started, bursted 2 replicas, Partition=lws.replicas |
| Stage3 | 2 | 6 | ❎ | ❎ | ⏳ | ⏳ | ⏳ | ⏳ | Partition changes to 2 (6-4) |
| Stage4 | 2 | 6 | ❎ | ❎ | ⏳ | ⏳ | ✅ | ⏳ | Since the last replica is not ready, Partition will not change |
| Stage5 | 0 | 6 | ⏳ | ⏳ | ⏳ | ⏳ | ✅ | ✅ | Partition = 6-4-2 = 0 |
| Stage6 | 0 | 6 | ⏳ | ⏳ | ⏳ | ✅ | ✅ | ✅ | |
| Stage7 | 0 | 5 | ⏳ | ✅ | ⏳ | ✅ | ✅ | | Release a replica when unavailableReplicas == maxSurge |
| Stage8 | 0 | 4 | ✅ | ⏳ | ✅ | ✅ | | | Release another replica for accommodation |
| Stage9 | 0 | 4 | ✅ | ✅ | ✅ | ✅ | | | Rolling update finished |

This is fantastic and makes a lot of sense; we need to make sure this is well documented!

@k8s-ci-robot added the needs-rebase label on Apr 30, 2024
@@ -94,13 +94,13 @@ func (r *LeaderWorkerSetReconciler) Reconcile(ctx context.Context, req ctrl.Requ
 	log := ctrl.LoggerFrom(ctx).WithValues("leaderworkerset", klog.KObj(lws))
 	ctx = ctrl.LoggerInto(ctx, log)

-	partition, err := r.rollingUpdatePartition(ctx, lws)
+	partition, replicas, err := r.rollingUpdateParameters(ctx, lws)
Collaborator

Suggested change:
-	partition, replicas, err := r.rollingUpdateParameters(ctx, lws)
+	partition, desiredReplicasCount, err := r.rollingUpdateParameters(ctx, lws)

How do you feel about renaming it to "desiredReplicasCount", so that we know this will be set as the real replica count?

Contributor Author

Outside of rollingUpdateParameters there's only one replicas variable, so I guess it's OK.

@liurupeng
Collaborator

overall lgtm! Thanks Kante!

@ahg-g left a comment
Contributor

This is great, really impressive!

	}
	// No need to burst more than the replicas.
	if maxSurge > int(lwsReplicas) {
		maxSurge = int(lwsReplicas)
	}
Contributor

Should this just be part of the API validation?

Contributor Author

Replicas is elastic, and so is maxUnavailable, so we can't enforce maxSurge <= Replicas in validation.
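For illustration, a minimal sketch of why this is a reconcile-time clamp rather than an admission check, assuming maxSurge is an IntOrString resolved against the current replica count (as Deployments do) and that the k8s.io/apimachinery module is available; effectiveMaxSurge is a hypothetical helper, not this PR's code:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/intstr"
)

// effectiveMaxSurge is a hypothetical helper: it resolves maxSurge (an integer
// or a percentage) against the current replica count and clamps it so we never
// burst more than the declared replicas. Because Replicas can change at any
// time, this has to happen on every reconcile rather than once in a webhook.
func effectiveMaxSurge(maxSurge intstr.IntOrString, replicas int) (int, error) {
	surge, err := intstr.GetScaledValueFromIntOrPercent(&maxSurge, replicas, true)
	if err != nil {
		return 0, err
	}
	if surge > replicas {
		surge = replicas
	}
	return surge, nil
}

func main() {
	// With replicas scaled down to 1, a configured maxSurge of 2 is clamped to 1.
	surge, _ := effectiveMaxSurge(intstr.FromInt(2), 1)
	fmt.Println(surge)
}
```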

 	sts := &appsv1.StatefulSet{}
-	err := r.Get(ctx, types.NamespacedName{Name: lws.Name, Namespace: lws.Namespace}, sts)
+	err = r.Get(ctx, types.NamespacedName{Name: lws.Name, Namespace: lws.Namespace}, sts)
Contributor

I think getting the sts should be done outside this function, and the sts is passed as a parameter. It doesn't seem relevant here.

Contributor Author

If we move this out, we can't tell whether a failure is because the sts is not found or because of a request error, unless we put that logic outside as well, so I prefer to leave it here.


	defer func() {
		// Reclaim the bursted replicas gradually.
		if lwsUnreadyReplicas <= int32(maxSurge) {
Contributor

I find it hard to reason about the logic when it uses a defer function and named return variables; I suggest putting the value directly in the return statements below and removing the named return variable and the defer func.

Also, I am not sure the calculation here is correct; shouldn't it always be the following (without the if statement above)?

replicas = lwsReplicas + min(utils.NonZeroValue(lwsUnreadyReplicas-maxUnavailable), maxSurge)

Contributor Author

The general idea is: when we have 2 unready replicas and 2 bursted replicas, the 2 unready replicas may get stuck in Pending because of a stockout, so what I want to achieve here is that, at that moment, we reclaim one replica from the bursted ones to make room for the unready ones.

We could also reclaim more than one replica at a time; actually I think this is not related to maxUnavailable, it's about how many replicas we should reclaim each time.
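For reference, a runnable sketch of the formula suggested above, with local stand-ins for min and utils.NonZeroValue (assumed to clamp negative values to zero); it only illustrates the suggestion, not the code that was merged:

```go
package main

import "fmt"

// nonZeroValue is a local stand-in for utils.NonZeroValue, assumed to clamp
// negative values to zero.
func nonZeroValue(v int32) int32 {
	if v < 0 {
		return 0
	}
	return v
}

func min32(a, b int32) int32 {
	if a < b {
		return a
	}
	return b
}

// suggestedReplicas implements the formula proposed in the review:
// replicas = lwsReplicas + min(NonZeroValue(lwsUnreadyReplicas-maxUnavailable), maxSurge)
func suggestedReplicas(lwsReplicas, lwsUnreadyReplicas, maxUnavailable, maxSurge int32) int32 {
	return lwsReplicas + min32(nonZeroValue(lwsUnreadyReplicas-maxUnavailable), maxSurge)
}

func main() {
	// With replicas=4, maxUnavailable=2, maxSurge=2:
	fmt.Println(suggestedReplicas(4, 6, 2, 2)) // 6: all replicas unready, keep the full burst
	fmt.Println(suggestedReplicas(4, 2, 2, 2)) // 4: unready down to maxUnavailable, bursted replicas reclaimed
}
```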

@k8s-ci-robot removed the needs-rebase label on May 6, 2024
Signed-off-by: kerthcet <[email protected]>
@kerthcet
Contributor Author

kerthcet commented May 7, 2024

/retest

@kerthcet
Contributor Author

kerthcet commented May 8, 2024

Kindly pinging @ahg-g @liurupeng: can we push this forward? I hope not to have to rebase again. 😅

@liurupeng
Collaborator

/lgtm
/approve

@k8s-ci-robot added the lgtm label on May 8, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kerthcet, liurupeng

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot merged commit 70ca1c7 into kubernetes-sigs:main on May 8, 2024
7 checks passed
@liurupeng mentioned this pull request on Jun 4, 2024