Add support for rolling updates #3
Comments
/assign
The overall design looks good to me; we can move the detailed technical discussion to the PR. I'll work on this then.
About maxSurge: we know that a Deployment is stateless, so we can burst out extra replicas for service availability, and there's no need to delete them at the end as long as the replica count is met. But a StatefulSet doesn't have this capability because its replicas are indexed, so maxSurge doesn't seem to fit lws either. E.g. we have an lws with 3 replicas and maxSurge of 2; the rolling update looks like:
Then when the rolling update finishes, we'll remove replica-3 and replica-4? This seems like a waste, especially for LLMs. @ahg-g @liurupeng any other opinions?
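To make the concern concrete, here is a minimal sketch in plain Python (not lws controller code; `surge_rollout` and the ordinal scheme are illustrative assumptions) of why surge replicas of an indexed workload end up deleted:

```python
def surge_rollout(replicas: int, max_surge: int) -> list[int]:
    """Simulate a surge-based rolling update over indexed replicas.

    Indexed workloads (StatefulSet-style) may only end with ordinals
    0..replicas-1, so any surge ordinals must be torn down at the end.
    """
    # Surge replicas can only take the next ordinals:
    # replicas, replicas+1, ..., replicas+max_surge-1
    surge = list(range(replicas, replicas + max_surge))
    # The update walks the original ordinals (ordering details elided);
    # the final desired state is still exactly 0..replicas-1.
    final_state = list(range(replicas))
    # Everything outside the final state is wasted work.
    return [i for i in surge if i not in final_state]

# With 3 replicas and maxSurge=2, replica-3 and replica-4 are created
# only to be removed once the update completes.
print(surge_rollout(3, 2))  # [3, 4]
```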
@kerthcet we could add maxSurge support later, but we may need the maxSurge feature. Right now, for GPU/TPU customers, one of the most critical issues is hardware stockout, which means all the accelerators in the zone have been used up. If the customer uses maxUnavailable with cluster-autoscaler/nodepool-auto-provisioning enabled, it often happens during the rollingUpdate that the hardware of the updating pod group is released. But provisioning a new accelerator node will then be pretty hard (due to stockout), and this may cause issues if the total number of nodes is not large. With maxSurge, this could be mitigated since we always ensure there are at least lws.spec.replicas pod groups.
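A rough way to see the stockout argument (a hedged sketch; `min_holding_hardware` and its model are my own simplification, not the actual rollout logic):

```python
def min_holding_hardware(replicas: int, max_unavailable: int = 0,
                         max_surge: int = 0) -> int:
    """Lowest number of pod groups holding accelerator nodes at any
    point during a rollout, under a simplified model.

    With maxUnavailable, old groups are deleted before their
    replacements are ready, so the freed nodes can be scaled down and
    then lost to stockout. With maxSurge, new groups come up first, so
    held capacity never dips below spec.replicas.
    """
    if max_surge > 0:
        return replicas
    return replicas - max_unavailable

print(min_holding_hardware(4, max_unavailable=1))  # 3
print(min_holding_hardware(4, max_surge=1))        # 4
```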
Thanks for the explanation, got the idea about Reserve. Several concerns come to mind:
Sorry, I need to think more deeply about this feature before I start the work.
@kerthcet, I think these are all valid points. The major reason we want to support it is that there have been asks for the feature. It does seem to be an anti-pattern for StatefulSet, but running AI/ML workloads makes maxSurge more important, e.g. for customers without reservations it improves hardware obtainability.
I could pick up this item if that's ok |
Thanks @liurupeng for the feedback and kindness. I already started working on this this afternoon, and I'm free until this weekend, so I think I can finish it on time.
sounds good @kerthcet |
I suggest that we release 0.2.0 with what we have and leave maxSurge to the next release. I feel we have enough for a release, and that will help us de-risk 0.2.0 a little.
Would we actually be able to support maxSurge, given that replicas have stable identities that are baked into the pod names? BTW, we should remove the maxSurge parameter before releasing.
Compared to Deployment's maxSurge, no ...
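The stable-identity point can be sketched as follows (illustrative only: this assumes lws group identity follows the StatefulSet `"<name>-<ordinal>"` convention; the exact format is an assumption, not quoted from the lws codebase):

```python
def group_identity(name: str, ordinal: int) -> str:
    # Identity is derived deterministically from the ordinal,
    # StatefulSet-style, rather than from a random hash suffix.
    return f"{name}-{ordinal}"

# A Deployment surge pod gets a random suffix, so any extra pod is
# interchangeable with any other. An indexed surge replica must claim
# a brand-new ordinal, which has no place in the final 0..replicas-1
# set once the update completes.
print(group_identity("vllm", 3))  # "vllm-3"
```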
Makes sense.
See #107 |
As discussed in https://bit.ly/k8s-LWS