-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support maxSurge in rolling update #98
Support maxSurge in rolling update #98
Conversation
This is the general prototype, I need more tests. |
5dadf9e
to
4f3a924
Compare
So when rolling update, lws will request for hardware as well, there's no difference with |
4f3a924
to
809fe16
Compare
ec3d3bc
to
ad84cad
Compare
You can take a look if you want, only webhook validation and e2e tests not added. But seems @ahg-g has question about this feature. We can't achieve the similar behavior as deployment does. |
ad84cad
to
ecd806e
Compare
All set now. PTAL. |
Would you like to take a look of this PR? Because it changes a lot and once other PR (I mean related ones) merges, lots of conflicts have to be solve. @ahg-g @liurupeng |
I will try to find sometime this Friday |
Sorry for the delay, added two comments about the semantics of the MaxSurge support, once that's finalized, will check more detailed, thanks Kante! |
ecd806e
to
229a677
Compare
How rolling update looks like right now with maxSurge, let's say we have a lws with replicas=4, maxUnavailable=2, maxSurge=2, rollinStep = 2+2 = 4:
|
/retest |
/test pull-lws-test-integration-main |
Have no idea why https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_lws/98/pull-lws-test-integration-main/1785199425849659392 is failed, maybe the controller is stuck. |
This is fantastics and makes a lot of sense, we need to make sure this is well documented! |
@@ -94,13 +94,13 @@ func (r *LeaderWorkerSetReconciler) Reconcile(ctx context.Context, req ctrl.Requ | |||
log := ctrl.LoggerFrom(ctx).WithValues("leaderworkerset", klog.KObj(lws)) | |||
ctx = ctrl.LoggerInto(ctx, log) | |||
|
|||
partition, err := r.rollingUpdatePartition(ctx, lws) | |||
partition, replicas, err := r.rollingUpdateParameters(ctx, lws) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
partition, replicas, err := r.rollingUpdateParameters(ctx, lws) | |
partition, desiredReplicasCount, err := r.rollingUpdateParameters(ctx, lws) |
How do you feel like renaming it as "desiredReplicasCount" so that we could know that this will be set as the real replica count
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Outside the rollingupdateParameters, there's only one replicas, I guess it's ok.
overall lgtm! Thanks Kante! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great, really impressive!
} | ||
// No need to burst more than the replicas. | ||
if maxSurge > int(lwsReplicas) { | ||
maxSurge = int(lwsReplicas) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this just be part of the api validation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replicas
is elastic, same to maxUnavailable, we can't force the maxSurge <= Replicas.
sts := &appsv1.StatefulSet{} | ||
err := r.Get(ctx, types.NamespacedName{Name: lws.Name, Namespace: lws.Namespace}, sts) | ||
err = r.Get(ctx, types.NamespacedName{Name: lws.Name, Namespace: lws.Namespace}, sts) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think getting the sts should be done outside this function, and the sts is passed as a parameter. It doesn't seem relevant here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move this out we can't tell whether it's because of sts not found or request error, unless we have the logics outside, prefer to leave here.
|
||
defer func() { | ||
// Reclaim the bursted replicas gradually. | ||
if lwsUnreadyReplicas <= int32(maxSurge) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find it hard to reason about when using a defer function and named return variables, I suggest to directly put the value in the return statements below and remove the named return variable and defer func.
Also, I am not sure the calculation here is correct, shouldn't it always be the following (without the if statement above):
replicas = lwsReplicas + min(utils.NonZeroValue(lwsUnreadyReplicas-maxUnavailable), maxSurge)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The general idea is like when we have 2 unready replicas, and 2 burst replicas, what can happen is the 2 unready replicas may stuck in pending as stockout, so what I want to achieve here is at this moment, let us reclaim one replica from the burst ones and make room for the unready ones.
We can also reclaim replicas >= 1 as well, actually I think this has nothing related to maxUnavailable. It's about how many replicas we should reclaim each time.
Signed-off-by: kerthcet <[email protected]>
229a677
to
8592a8a
Compare
Signed-off-by: kerthcet <[email protected]>
Signed-off-by: kerthcet <[email protected]>
8592a8a
to
0dcf3ce
Compare
/retest |
kindly ping @ahg-g @liurupeng can we push this forward. Hope not to rebase again. 😅 |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: kerthcet, liurupeng The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What type of PR is this?
/kind feature
What this PR does / why we need it
Which issue(s) this PR fixes
Fixes #3
Special notes for your reviewer
It works like:
Does this PR introduce a user-facing change?