add kep-162 Colocated Placement #168
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: vie-serendipity The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Welcome @vie-serendipity! |
Hi @vie-serendipity. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
444b604 to f9a1557
/ok-to-test
Any suggestions?
I'll review this API later; I'm short on bandwidth right now, sorry about that.
@vie-serendipity sorry for the late reply, I was on vacation last week. Checking it now.
know that this has succeeded?
-->

- Add a new way of topology scheduling and allow multiple pod groups to land on the same domain
For one pod group (one multi-host inference replica), we need to colocate the pods. But for different pod groups, why do we need to put them in the same domain, since different replicas won't communicate with each other?
My wording may have been confusing. I mean allowing all leaders to be scheduled freely across any topology domain, rather than restricting each pod group to its own domain as exclusive placement does.
### SubGroup Policy Support
Colocated placement can support subgroup policy as well.
Compared with [exclusive support for subgroup policy](https://github.com/kubernetes-sigs/lws/assets/86417275/ff9fc93d-c738-4c09-abc8-50a7b16d49df), the workflow is almost identical, as follows: |
the subgroup feature will only have one leader pod for all the pods in the same topology
For the subgroup policy, it seems like it just needs to ensure that the leader and workers are in the same topology, so I think colocated placement should meet its requirements.
Overall lgtm, but one more thing to consider is the existing lws workloads that use exclusive placement. Do we want to keep two sets of APIs for the different placement strategies?
I will take a look after I come back from vacation, first week of August.
@vie-serendipity: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
@ahg-g could you help check this one when you get some time? Thanks!
const (
	ExclusiveTopologyPlacementPolicyType TopologyPlacementPolicyType = "Exclusive"
	ColocatedTopologyPlacementPolicyType TopologyPlacementPolicyType = "Colocated"
How will we implement this?
If we implement this using pod-affinity (i.e., same as exclusive placement, but removing the anti-affinity term), we will run into race conditions. Consider the case where two groups try to schedule on the same domain at the same time but there is capacity for only one: it is possible that both will partially schedule on the domain, and neither will schedule fully.
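To make the race concrete, here is a small toy model in Go. It is not the kube-scheduler: the `domain` and `group` types, the group sizes, and the strict round-robin interleaving are all assumptions picked for illustration. It only shows how pod-by-pod placement can leave two groups partially scheduled, while an all-or-nothing (gang) admission lets one group finish.

```go
package main

import "fmt"

// domain is a hypothetical topology domain with a fixed number of free pod slots.
type domain struct{ free int }

// group is a hypothetical pod group that needs `size` pods in the same domain.
type group struct {
	name   string
	size   int
	placed int
}

// schedulePodByPod places one pod per group per round, mimicking two groups
// racing for the same domain when only pod affinity is used.
func schedulePodByPod(d *domain, groups []*group) {
	for {
		progress := false
		for _, g := range groups {
			if g.placed < g.size && d.free > 0 {
				g.placed++
				d.free--
				progress = true
			}
		}
		if !progress {
			return
		}
	}
}

// scheduleAllOrNothing admits a group only if the whole group fits (gang semantics).
func scheduleAllOrNothing(d *domain, groups []*group) {
	for _, g := range groups {
		if d.free >= g.size {
			g.placed = g.size
			d.free -= g.size
		}
	}
}

// report prints how many pods of each group were placed.
func report(label string, groups []*group) {
	for _, g := range groups {
		fmt.Printf("%s: group %s placed %d/%d\n", label, g.name, g.placed, g.size)
	}
}

func main() {
	// Two groups of 4 pods compete for a domain that only has 6 free slots.
	racing := []*group{{name: "a", size: 4}, {name: "b", size: 4}}
	schedulePodByPod(&domain{free: 6}, racing)
	report("pod-by-pod", racing) // both groups end up at 3/4

	gang := []*group{{name: "a", size: 4}, {name: "b", size: 4}}
	scheduleAllOrNothing(&domain{free: 6}, gang)
	report("all-or-nothing", gang) // a is 4/4, b stays at 0/4
}
```

In the pod-by-pod run neither group is usable, and the slots they hold are wasted until something frees capacity or evicts pods. The all-or-nothing run leaves group `b` pending but keeps group `a` fully functional.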
That's a tricky case I didn't consider before. It seems like there's already an issue when scheduling multiple pod groups with limited resources. Refer to #167 (Co-scheduling plugin conflicts with LeaderReady startup policy).
In this situation, I think it has to rely on the autoscaler or on adding more resources manually.
I understand that a deadlock with the plugin only happens when the LeaderReady startup policy is adopted. We can do gang scheduling with the pod group declared in the plugin. This can fix the problem you mentioned.
It's worth noting that the colocated placement policy and the LeaderReady startup policy can't be set together, and we have to validate that in the webhook.
But I'm not sure whether relying on an out-of-tree scheduler plugin is appropriate.
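A rough sketch of what that webhook check could look like, in Go. The types below are local stand-ins, not the real lws API: `spec`, `validateSpec`, and the field names are hypothetical, and the constant names are assumed to follow the naming used in this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// Local stand-ins for the API enums; names are assumptions for this sketch.
type StartupPolicyType string
type TopologyPlacementPolicyType string

const (
	LeaderCreatedStartupPolicy StartupPolicyType = "LeaderCreated"
	LeaderReadyStartupPolicy   StartupPolicyType = "LeaderReady"

	ExclusiveTopologyPlacementPolicyType TopologyPlacementPolicyType = "Exclusive"
	ColocatedTopologyPlacementPolicyType TopologyPlacementPolicyType = "Colocated"
)

// spec is a hypothetical, trimmed-down view of the fields the webhook would inspect.
type spec struct {
	StartupPolicy   StartupPolicyType
	PlacementPolicy TopologyPlacementPolicyType
}

var errColocatedWithLeaderReady = errors.New(
	"colocated placement cannot be combined with the LeaderReady startup policy")

// validateSpec rejects the one combination discussed in this thread and
// accepts everything else.
func validateSpec(s spec) error {
	if s.PlacementPolicy == ColocatedTopologyPlacementPolicyType &&
		s.StartupPolicy == LeaderReadyStartupPolicy {
		return errColocatedWithLeaderReady
	}
	return nil
}

func main() {
	bad := spec{StartupPolicy: LeaderReadyStartupPolicy, PlacementPolicy: ColocatedTopologyPlacementPolicyType}
	ok := spec{StartupPolicy: LeaderCreatedStartupPolicy, PlacementPolicy: ColocatedTopologyPlacementPolicyType}
	fmt.Println("LeaderReady + Colocated:", validateSpec(bad))
	fmt.Println("LeaderCreated + Colocated:", validateSpec(ok))
}
```

In a real admission webhook the same check would run on both create and update, so an existing object cannot be edited into the rejected combination.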
I guess this is more related to scheduling than to the workload itself? This is an existing problem for all distributed workloads, unless we have a queueing system of our own, like WaitForPodsReady in Kueue.
Sorry @liurupeng, I was away from the community for a couple of days. I'll get back to this ASAP.
I think we should have a built-in feature gate system like Kubernetes has, or changes like this one will impact our API version a lot. I have no idea whether we're on the right path because we don't have much feedback, and without feature gates we'll have to bump the API version to avoid breaking backward compatibility. WDYT? @ahg-g @liurupeng
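For reference, a feature gate can be quite small. The Go sketch below is only an illustration under assumptions: the `gates` type and the `ColocatedPlacement` gate name are hypothetical, and it does not use the `k8s.io/component-base/featuregate` package that Kubernetes components rely on.

```go
package main

import "fmt"

// Feature names a gated capability. ColocatedPlacement is a hypothetical gate name.
type Feature string

const ColocatedPlacement Feature = "ColocatedPlacement"

// gates maps each known feature to whether it is enabled.
type gates map[Feature]bool

// newGates registers the known features with their defaults; an alpha
// feature such as this one would default to off.
func newGates() gates {
	return gates{ColocatedPlacement: false}
}

// Set flips a known gate and rejects unknown names so typos fail loudly.
func (g gates) Set(f Feature, enabled bool) error {
	if _, ok := g[f]; !ok {
		return fmt.Errorf("unknown feature gate %q", f)
	}
	g[f] = enabled
	return nil
}

// Enabled reports whether a feature is switched on; unknown features are off.
func (g gates) Enabled(f Feature) bool { return g[f] }

func main() {
	g := newGates()
	fmt.Println("default:", g.Enabled(ColocatedPlacement))
	if err := g.Set(ColocatedPlacement, true); err != nil {
		fmt.Println("error:", err)
	}
	fmt.Println("after enabling:", g.Enabled(ColocatedPlacement))
	fmt.Println("typo:", g.Set("ColocatedPlacment", true))
}
```

With a gate like this, the new field could ship in the existing API version and be rejected or ignored by the controller while the gate is off.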
	TopologyPlacementPolicy TopologyPlacementPolicy `json:"topologyPlacementPolicy",omitempty`
}

type TopologyPlacementPolicy struct {
Maybe GroupPlacementPolicy?
Sounds good.
type LeaderWorkerTemplate struct {
	// +optional
	TopologyPlacementPolicy TopologyPlacementPolicy `json:"topologyPlacementPolicy",omitempty`
Should this be a slice, in case we add more than one policy to the lws, with the policies ANDed together?
	// +kubebuilder:default=None
	// +kubebuilder:validation=Enum={ExclusiveTopologyPlacementPolicyType,ColocatedTopologyPlacementPolicyType,NoneTopologyPlacementPolicyType}
	type TopologyPlacementPolicyType `json:"type"`
	topologyKey *string `json:"topologyKey",omitempty`
Why a pointer?
If the policy is None, topologyKey should be nil; otherwise, it should be explicitly declared. So I think a pointer is more suitable.
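The pointer rule can be spelled out as a validation function. The Go sketch below is an assumption-laden illustration, not the PR's code: it uses exported field names and well-formed struct tags, assumes the `None` constant has the value `"None"` (taken from the kubebuilder default), and `validatePolicy` is a hypothetical helper. The node label used as a topology key is only an example.

```go
package main

import (
	"errors"
	"fmt"
)

type TopologyPlacementPolicyType string

const (
	NoneTopologyPlacementPolicyType      TopologyPlacementPolicyType = "None"
	ExclusiveTopologyPlacementPolicyType TopologyPlacementPolicyType = "Exclusive"
	ColocatedTopologyPlacementPolicyType TopologyPlacementPolicyType = "Colocated"
)

// TopologyPlacementPolicy is a sketch of the struct under review.
// TopologyKey is a pointer so that "unset" and "set to empty" stay distinguishable.
type TopologyPlacementPolicy struct {
	Type        TopologyPlacementPolicyType `json:"type"`
	TopologyKey *string                     `json:"topologyKey,omitempty"`
}

// validatePolicy enforces: None means no key, every other type needs a non-empty key.
func validatePolicy(p TopologyPlacementPolicy) error {
	switch p.Type {
	case NoneTopologyPlacementPolicyType:
		if p.TopologyKey != nil {
			return errors.New("topologyKey must be unset when type is None")
		}
	case ExclusiveTopologyPlacementPolicyType, ColocatedTopologyPlacementPolicyType:
		if p.TopologyKey == nil || *p.TopologyKey == "" {
			return errors.New("topologyKey is required when type is Exclusive or Colocated")
		}
	default:
		return fmt.Errorf("unsupported placement policy type %q", p.Type)
	}
	return nil
}

func main() {
	key := "cloud.google.com/gke-nodepool" // example topology key
	fmt.Println(validatePolicy(TopologyPlacementPolicy{Type: NoneTopologyPlacementPolicyType}))
	fmt.Println(validatePolicy(TopologyPlacementPolicy{Type: ColocatedTopologyPlacementPolicyType, TopologyKey: &key}))
	fmt.Println(validatePolicy(TopologyPlacementPolicy{Type: ColocatedTopologyPlacementPolicyType}))
}
```

A plain `string` with `omitempty` would also serialize cleanly, but it could not tell a missing key apart from an explicitly empty one, which is the distinction the pointer buys here.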
### User Stories (Optional)

#### Story 1
Each pod group (leader and workers) is colocated in one domain (i.e., more than one group could land on the same domain)
Can you think of a real case? This one is somewhat theoretical.
Based on my understanding, assuming a nodepool is a topology domain, each nodepool contains many nodes with the same GPU (easy to scale). So it naturally makes sense to schedule more than one group to a nodepool.
It's also reasonable to schedule a group of pods to a single nodepool, since its nodes have the same GPU, which ensures equal inference speeds.
type TopologyPlacementPolicy struct {
	// +kubebuilder:default=None
	// +kubebuilder:validation=Enum={ExclusiveTopologyPlacementPolicyType,ColocatedTopologyPlacementPolicyType,NoneTopologyPlacementPolicyType}
	type TopologyPlacementPolicyType `json:"type"`
Should we add a constraint mode, like preferred or required? This is not vital, though; we can add it later.
@vie-serendipity we are planning the v0.5.0 release; do you want to proceed with this KEP? Thanks!
The proposal must have a way to address #168 (comment)
@liurupeng I'm glad to push this work forward.
@ahg-g I agree with kerthcet. I think this is more about scheduling and is outside the scope lws should cover. What do you think?
But we can't offer an API that is not backed by an implementation.
What type of PR is this?
/kind documentation
What this PR does / why we need it
Support colocated placement of multiple pod groups
Which issue(s) this PR fixes
Fixes #162
Special notes for your reviewer
I don't know how to upload the image to AWS, so I just pushed the image to the repo. This should be fixed later.
Does this PR introduce a user-facing change?