Enhance scheduling capabilities for a group of pods #162
Comments
I'm curious: can this scenario be handled with node selectors or affinity? What are the shortcomings of that approach?
In this scenario, every group of pods uses the same template, so all pods of the lws would be scheduled to the same data center. However, because the GPUs are distributed across different data centers, different groups of pods under the lws should be schedulable to different data centers. So I don't think affinity or a node selector can fulfill this requirement.
Are the different data centers part of the same cluster, or are they separate clusters? If they are within the same cluster, it might still be feasible to use node selectors or affinity. However, if they are spread across different clusters, there might be concerns about communication overhead.
They belong to the same cluster. But how can affinity be used to schedule different lws groups to multiple data centers while keeping each group's pods in the same data center? I can't figure out a way to do this.
So, if I understand correctly, you want to schedule multiple pods under the workerTemplate to nodes in different data centers, is that correct? I can't tell whether this can be achieved with node selectors or affinity. I also can't judge whether it is a reasonable design for the higher-level workload to make scheduling decisions.
The higher-level workload doesn't need to make any scheduling decisions; it only needs to make sure the workers are scheduled in the same data center as the head. The head's scheduling is decided entirely by the scheduler. One possible implementation: wait until the head is scheduled to a node, then read a label from that node (the label's key is predefined in the lws YAML). The workers would then get a matching affinity, ensuring they are scheduled to nodes with the same label value.
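The idea in the comment above can be sketched as follows. This is only an illustration under stated assumptions, not the lws implementation: the function name `worker_node_affinity`, the label key `example.com/datacenter`, and the label values are hypothetical. Only the shape of the `nodeAffinity` stanza follows the standard Kubernetes pod spec.

```python
from typing import Any, Dict


def worker_node_affinity(head_node_labels: Dict[str, str], topology_key: str) -> Dict[str, Any]:
    """Build a required nodeAffinity that pins workers to the head's topology domain.

    head_node_labels: labels of the node the head pod was scheduled to.
    topology_key: label key predefined on the workload (hypothetical here).
    """
    if topology_key not in head_node_labels:
        raise KeyError(f"head node has no label {topology_key!r}")
    value = head_node_labels[topology_key]
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            # Workers may only land on nodes carrying the same label value.
                            {"key": topology_key, "operator": "In", "values": [value]}
                        ]
                    }
                ]
            }
        }
    }


# Hypothetical label key and values, for illustration only.
head_labels = {"example.com/datacenter": "dc-a", "kubernetes.io/hostname": "node-1"}
affinity = worker_node_affinity(head_labels, "example.com/datacenter")
```

The resulting dictionary would be merged into each worker pod's `spec.affinity` after the head has been bound to a node, so the head's placement stays entirely up to the scheduler.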
Thanks for bringing this to the community. Is this what you need? https://github.com/kubernetes-sigs/lws/blob/main/docs/examples/sample/README.md#exclusive-placement
But this is exclusive, which means two groups cannot be placed in the same topology domain.
Thanks, that's what I'm looking for.
But I have a question: the example has GPU and TPU accelerator islands. Does that mean I can only have two groups of pods? I think multiple groups of pods should be allowed within the same topology. Would it be more reasonable to just ensure that the head and workers have the same topology key?
The accelerator is only used for integrations with cloud providers, such as TPUs on Google Cloud. So what you need is slightly different from exclusive placement. Do you have a real use case on your side?
My use case: a k8s cluster has many nodes, some belonging to the user and some to a cloud vendor. To keep it simple, say there are two data centers, one for the user and one for the cloud provider. (This is not a standard k8s cluster, but I wonder whether there is also a need to schedule a group of pods to nodes of the same GPU type when a cluster has many GPU resource types.) I want to use lws to deploy inference services to both the user's data center and the cloud data center. Although the networks of the two data centers are interconnected, communication between them is more expensive and may go over the public network, which is unstable. So I want to make sure that each group of pods is scheduled into a single data center, so its pods can communicate easily. A single data center must also be able to host multiple groups of pods, since business peaks require scaling.
It makes sense to me that a group of Pods should be placed in the same topology, and the number of groups should not be limited. cc @ahg-g @liurupeng thoughts?
Yes, we can support that by simply removing the exclusive anti-affinity term that is currently added. But we need to come up with a proper API first, similar to kubernetes-sigs/jobset#75
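To make the difference concrete, here is a rough sketch of what "co-located but not exclusive" could look like at the pod affinity level. It is an assumption about the shape of the terms, not the code lws uses: the function name `group_affinity`, the label key `example.com/group-key`, and the topology key `example.com/datacenter` are placeholders. Only the `podAffinity` / `podAntiAffinity` structure follows the standard Kubernetes pod spec.

```python
from typing import Any, Dict


def group_affinity(group_label_key: str, group_id: str, topology_key: str,
                   exclusive: bool) -> Dict[str, Any]:
    """Build affinity terms that keep one group's pods in a single topology domain.

    With exclusive=True an anti-affinity term also keeps other groups out of
    that domain; with exclusive=False several groups may share the domain.
    """
    affinity: Dict[str, Any] = {
        "podAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    # Attract pods of this group to the domain where the group already runs.
                    "labelSelector": {
                        "matchExpressions": [
                            {"key": group_label_key, "operator": "In", "values": [group_id]}
                        ]
                    },
                    "topologyKey": topology_key,
                }
            ]
        }
    }
    if exclusive:
        affinity["podAntiAffinity"] = {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    # Repel pods that carry the group label with a different group id.
                    "labelSelector": {
                        "matchExpressions": [
                            {"key": group_label_key, "operator": "Exists"},
                            {"key": group_label_key, "operator": "NotIn", "values": [group_id]},
                        ]
                    },
                    "topologyKey": topology_key,
                }
            ]
        }
    return affinity


# Placeholder label and topology keys, for illustration only.
colocated = group_affinity("example.com/group-key", "group-0", "example.com/datacenter", exclusive=False)
exclusive = group_affinity("example.com/group-key", "group-0", "example.com/datacenter", exclusive=True)
```

Dropping the anti-affinity term is the only difference between the two results, which is why the remaining open question in the thread is the API for selecting the mode rather than the mechanism.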
@ahg-g I would like to contribute to this feature. I can propose a KEP later. Does that work for you?
@vie-serendipity could you start the KEP so that we could start the review?
@liurupeng Okay, I will propose a KEP soon.
/subscribe
What would you like to be added:
Add a field like `ScheduleMode`. Scheduling a set of pods may need stricter control. When deploying a distributed inference service, a set of pods, that is, a head plus several workers, should be scheduled to neighboring nodes to reduce the communication cost between them.
Why is this needed:
There are many types of resources in a k8s cluster, and if scheduling to nodes is constrained only by the `requests` field, the resulting distributed inference performance may be poor.
One example is that a group of pods should be scheduled to nodes that share a special interconnect, such as NVLink. Another example is that the nodes in a (non-standard) k8s cluster may span data centers, and a group of pods should be scheduled into the same data center.
Completion requirements:
I'm not sure what the eventual API changes would be; this is just an enhancement request. I'm also not sure whether this is a reasonable request, and I'd be happy to contribute if it is.
This enhancement requires the following artifacts:
The artifacts should be linked in subsequent comments.