Add support for rolling updates #3

Closed
ahg-g opened this issue Feb 28, 2024 · 13 comments · Fixed by #98

@ahg-g
Contributor

ahg-g commented Feb 28, 2024

As discussed in https://bit.ly/k8s-LWS

@kerthcet
Contributor

kerthcet commented Mar 5, 2024

/assign

@kerthcet
Contributor

kerthcet commented Mar 5, 2024

The overall design looks good to me; we can move the detailed technical discussions to the PR. I'll work on this then.

@kerthcet
Contributor

kerthcet commented Apr 4, 2024

About maxSurge: a Deployment is stateless, so we can burst out extra replicas for service availability, and there is no need to delete them at the end as long as the desired replica count is met. A StatefulSet doesn't have this capability because its replicas are indexed, so maxSurge doesn't seem to fit lws either. E.g. if we have an lws with 3 replicas and maxSurge is 2, the rolling update looks like:

  • replica-0 (old)
  • replica-1 (old)
  • replica-2 (rolling update)
  • replica-3 (new)
  • replica-4 (new)

Then when the rolling update finishes, we'll remove replica-3 and replica-4? This seems like a waste, especially for LLMs. @ahg-g @liurupeng any other opinions?
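
For illustration, here is a minimal sketch (not lws controller code) of the point above: with indexed replicas, surge capacity can only appear as extra indices beyond the stable range, so those indices have to be deleted once the rollout completes. The numbers mirror the example (3 replicas, maxSurge 2).

```go
package main

import "fmt"

// Illustrative only: with indexed replicas (replicas=3, maxSurge=2 as in the
// example above), surge capacity shows up as brand-new indices, unlike a
// Deployment whose surge pods are fungible and can simply stay.
func main() {
	replicas := 3 // lws.spec.replicas
	maxSurge := 2 // hypothetical surge budget

	// Stable identities live at indices [0, replicas) and are updated in place.
	for i := 0; i < replicas; i++ {
		fmt.Printf("replica-%d (stable identity)\n", i)
	}
	// Surge replicas occupy indices [replicas, replicas+maxSurge) and only
	// exist while the rollout is in progress; they must be removed afterwards.
	for i := replicas; i < replicas+maxSurge; i++ {
		fmt.Printf("replica-%d (temporary surge, deleted after rollout)\n", i)
	}
}
```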

@liurupeng
Collaborator

liurupeng commented Apr 4, 2024

@kerthcet we could add maxSurge support later, but we will likely need the feature. Right now, one of the most critical issues for GPU/TPU customers is hardware stockout, meaning the accelerators in the zone have all been used up. If the customer uses maxUnavailable with cluster-autoscaler/node-pool auto-provisioning enabled, it often happens during the rolling update that the hardware of the pod group being updated is released, but provisioning a new accelerator node is very hard (due to stockout), and this can cause problems if the total number of nodes is not large. With maxSurge this is mitigated, since we always ensure there are at least lws.spec.replicas pod groups.
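
As a rough sketch of the trade-off described above (illustrative arithmetic only, not the lws implementation): with maxUnavailable the serving capacity dips below spec.replicas and depends on re-acquiring accelerators that may be stocked out, while maxSurge keeps at least spec.replicas groups up at the cost of a temporarily larger footprint.

```go
package main

import "fmt"

// Back-of-the-envelope comparison of the two rollout policies; the numbers
// are illustrative and not taken from the lws controller.
func main() {
	replicas := 3 // lws.spec.replicas

	// maxUnavailable=1: an old pod group is torn down before its replacement
	// is ready, so capacity drops; if the freed accelerators are stocked out,
	// the replacement group may not schedule at all.
	maxUnavailable := 1
	fmt.Printf("maxUnavailable=%d: as few as %d/%d groups serving during the update\n",
		maxUnavailable, replicas-maxUnavailable, replicas)

	// maxSurge=1: the replacement group is created first on extra capacity,
	// so at least replicas groups keep serving, at a peak footprint of
	// replicas+maxSurge groups' worth of accelerators.
	maxSurge := 1
	fmt.Printf("maxSurge=%d: at least %d groups serving, peak footprint %d groups\n",
		maxSurge, replicas, replicas+maxSurge)
}
```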

@kerthcet
Contributor

kerthcet commented Apr 5, 2024

Thanks for the explanation, I get the idea of reserving capacity. Several concerns come to mind:

  • This would add cost to the rolling update, e.g. we have to clean up the extra replicas, and it may even trigger unnecessary autoscaling.
  • At the system/cluster level, will this intensify resource competition, since hardware is scarce and we require extra resources?
  • If maxSurge is set but resources only fit spec.Replicas, will this block the rolling update? We could scale maxSurge down to 0 to work around this.
  • Maybe we should support pod reservation in Kubernetes, or at least in kube-scheduler.
  • Would this fit StatefulSet? Do we have similar scenarios there?

Sorry, I need to think more deeply about this feature before I start the work.

@liurupeng
Collaborator

@kerthcet, I think these are all valid points. The major reason we want to support it is that there have been asks for the feature. It does seem to be an anti-pattern for StatefulSet, but running AI/ML workloads makes maxSurge more important, e.g. for customers without reservations, since it improves hardware obtainability.

@liurupeng
Collaborator

I could pick up this item if that's ok

@kerthcet
Contributor

Thanks @liurupeng for the feedback and kindness. I already started working on this this afternoon; I'm free until this weekend, so I think I can finish it on time.

@liurupeng
Collaborator

sounds good @kerthcet

@ahg-g
Contributor Author

ahg-g commented Apr 11, 2024

I suggest that we release 0.2.0 with what we have and leave maxSurge to the next release. I feel we have enough for a release, and that will help us de-risk 0.2.0 a little.

@ahg-g
Contributor Author

ahg-g commented Apr 16, 2024

Would we actually be able to support maxSurge given that replicas have stable identities that are baked into the pod names?
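
For context, a hedged sketch of why those identities matter, under an assumed naming scheme where the group index is baked into every pod name (leader: lws-name-group, worker: lws-name-group-worker; the exact scheme may differ):

```go
package main

import "fmt"

// Illustrative only, under an ASSUMED naming scheme (leader: <lws>-<group>,
// worker: <lws>-<group>-<worker>); the real lws naming may differ. The point:
// because the group index is part of every pod name, a surge replica can only
// exist as a new index with new names, not as a drop-in replacement.
func main() {
	lwsName := "vllm" // hypothetical LeaderWorkerSet name
	groupIndex := 0   // stable pod-group index
	workersPerGroup := 2

	fmt.Printf("leader: %s-%d\n", lwsName, groupIndex)
	for w := 1; w <= workersPerGroup; w++ {
		fmt.Printf("worker: %s-%d-%d\n", lwsName, groupIndex, w)
	}
}
```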

btw, we should remove the maxSurge parameter before releasing.

@kerthcet
Contributor

Would we actually be able to support maxSurge given that replicas have stable identities that are baked into the pod names?

Compared to a Deployment's maxSurge, no ...

btw, we should remove maxSurge

Makes sense.

@kerthcet
Contributor

See #107
