Add support for rolling updates #3

Closed
ahg-g opened this issue Feb 28, 2024 · 13 comments · Fixed by #98

@ahg-g
Contributor

ahg-g commented Feb 28, 2024

As discussed in https://bit.ly/k8s-LWS

@kerthcet
Contributor

kerthcet commented Mar 5, 2024

/assign

@kerthcet
Contributor

kerthcet commented Mar 5, 2024

The overall design looks good to me; we can move the detailed technical discussions to the PR. I'll work on this then.

@kerthcet
Contributor

kerthcet commented Apr 4, 2024

About maxSurge: a Deployment is stateless, so we can burst out extra replicas for service availability, and there is no need to delete them at the end as long as the desired replica count is met. A StatefulSet doesn't have this capability because its replicas are indexed, so maxSurge doesn't seem to fit lws either. E.g. if we have an lws with 3 replicas and maxSurge is 2, the rolling update looks like:

  • replica-0 (old)
  • replica-1 (old)
  • replica-2 (rolling update)
  • replica-3 (new)
  • replica-4 (new)

Then when the rolling update finishes, we'll remove replica-3 and replica-4? This seems like a waste, especially for LLMs. @ahg-g @liurupeng any other opinions?
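
For illustration, here is a minimal sketch (not lws controller code) of the point above: with indexed replicas, surge capacity can only appear as extra indices beyond the stable range, so those indices have to be deleted once the rollout completes. The numbers mirror the example (3 replicas, maxSurge 2).

```go
package main

import "fmt"

// Illustrative only: with indexed replicas (replicas=3, maxSurge=2 as in the
// example above), surge capacity shows up as brand-new indices, unlike a
// Deployment whose surge pods are fungible and can simply stay.
func main() {
	replicas := 3 // lws.spec.replicas
	maxSurge := 2 // hypothetical surge budget

	// Stable identities live at indices [0, replicas) and are updated in place.
	for i := 0; i < replicas; i++ {
		fmt.Printf("replica-%d (stable identity)\n", i)
	}
	// Surge replicas occupy indices [replicas, replicas+maxSurge) and only
	// exist while the rollout is in progress; they must be removed afterwards.
	for i := replicas; i < replicas+maxSurge; i++ {
		fmt.Printf("replica-%d (temporary surge, deleted after rollout)\n", i)
	}
}
```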

@liurupeng
Collaborator

liurupeng commented Apr 4, 2024

@kerthcet we could add maxSurge support later, but we will likely need the feature. Right now, one of the most critical issues for GPU/TPU customers is hardware stockout, meaning the accelerators in the zone have all been used up. If the customer uses maxUnavailable with cluster-autoscaler/node-pool auto-provisioning enabled, it often happens during the rolling update that the hardware of the pod group being updated is released, but provisioning a new accelerator node is very hard (due to stockout), and this can cause problems if the total number of nodes is not large. With maxSurge this is mitigated, since we always ensure there are at least lws.spec.replicas pod groups.
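
As a rough sketch of the trade-off described above (illustrative arithmetic only, not the lws implementation): with maxUnavailable the serving capacity dips below spec.replicas and depends on re-acquiring accelerators that may be stocked out, while maxSurge keeps at least spec.replicas groups up at the cost of a temporarily larger footprint.

```go
package main

import "fmt"

// Back-of-the-envelope comparison of the two rollout policies; the numbers
// are illustrative and not taken from the lws controller.
func main() {
	replicas := 3 // lws.spec.replicas

	// maxUnavailable=1: an old pod group is torn down before its replacement
	// is ready, so capacity drops; if the freed accelerators are stocked out,
	// the replacement group may not schedule at all.
	maxUnavailable := 1
	fmt.Printf("maxUnavailable=%d: as few as %d/%d groups serving during the update\n",
		maxUnavailable, replicas-maxUnavailable, replicas)

	// maxSurge=1: the replacement group is created first on extra capacity,
	// so at least replicas groups keep serving, at a peak footprint of
	// replicas+maxSurge groups' worth of accelerators.
	maxSurge := 1
	fmt.Printf("maxSurge=%d: at least %d groups serving, peak footprint %d groups\n",
		maxSurge, replicas, replicas+maxSurge)
}
```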

@kerthcet
Contributor

kerthcet commented Apr 5, 2024

Thanks for the explanation, I get the idea of reserving capacity. Several concerns come to mind:

  • This would add cost to the rolling update, e.g. we have to clean up the extra replicas, and it may even trigger unnecessary autoscaling.
  • At the system/cluster level, will this intensify resource competition, since hardware is scarce and we require extra resources?
  • If maxSurge is set but resources only fit spec.Replicas, will this block the rolling update? We could scale maxSurge down to 0 to work around this.
  • Maybe we should support pod reservation in Kubernetes, or at least in kube-scheduler.
  • Would this fit StatefulSet? Do we have similar scenarios there?

Sorry, I need to think more deeply about this feature before I start the work.

@liurupeng
Collaborator

@kerthcet, I think these are all valid points. The major reason we want to support it is that there have been asks for the feature. It does seem to be an anti-pattern for StatefulSet, but running AI/ML workloads makes maxSurge more important, e.g. for customers without reservations, since it improves hardware obtainability.

@liurupeng
Collaborator

I could pick up this item if that's ok

@kerthcet
Contributor

Thanks @liurupeng for the feedback and kindness. I already started working on this this afternoon; I'm free until this weekend, so I think I can finish it on time.

@liurupeng
Collaborator

sounds good @kerthcet

@ahg-g
Contributor Author

ahg-g commented Apr 11, 2024

I suggest that we release 0.2.0 with what we have and leave maxSurge to the next release. I feel we have enough for a release, and that will help us de-risk 0.2.0 a little.

@ahg-g
Contributor Author

ahg-g commented Apr 16, 2024

Would we actually be able to support maxSurge given that replicas have stable identities that are baked into the pod names?
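
For context, a hedged sketch of why those identities matter, under an assumed naming scheme where the group index is baked into every pod name (leader: lws-name-group, worker: lws-name-group-worker; the exact scheme may differ):

```go
package main

import "fmt"

// Illustrative only, under an ASSUMED naming scheme (leader: <lws>-<group>,
// worker: <lws>-<group>-<worker>); the real lws naming may differ. The point:
// because the group index is part of every pod name, a surge replica can only
// exist as a new index with new names, not as a drop-in replacement.
func main() {
	lwsName := "vllm" // hypothetical LeaderWorkerSet name
	groupIndex := 0   // stable pod-group index
	workersPerGroup := 2

	fmt.Printf("leader: %s-%d\n", lwsName, groupIndex)
	for w := 1; w <= workersPerGroup; w++ {
		fmt.Printf("worker: %s-%d-%d\n", lwsName, groupIndex, w)
	}
}
```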

btw, we should remove the maxSurge parameter before releasing.

@kerthcet
Contributor

Would we actually be able to support maxSurge given that replicas have stable identities that are baked into the pod names?

Compared to a Deployment's maxSurge, no ...

btw, we should remove maxSurge

Makes sense.

@kerthcet
Contributor

See #107
