Enhance scheduling capabilities for a group of pods #162

vie-serendipity · 2024-06-10T14:54:13Z

What would you like to be added:
Add a field such as ScheduleMode. Scheduling a set of pods may need stricter control. When deploying a distributed inference service, a set of pods (a head plus several workers) should be scheduled to neighboring nodes to reduce the communication cost between them.
Why is this needed:
There are many types of resources in a k8s cluster. If node placement is constrained only by the requests field, distributed inference may end up performing poorly.
One example: a group of pods should be dispatched to nodes that share a special interconnect, such as NVLink. Another example: the nodes in a (non-standard) k8s cluster may span data centers, and a group of pods should be dispatched into the same data center.
Completion requirements:
I'm not sure what the eventual API changes would be; this is just an enhancement request. I'm also not sure whether it is a reasonable one, but I'd be happy to contribute if it is.
This enhancement requires the following artifacts:

Design doc
API change
Docs update

The artifacts should be linked in subsequent comments.

googs1025 · 2024-06-10T15:05:21Z

I'm a bit curious, can such scenarios generally be resolved using node selectors or affinity? What are the shortcomings if this method is used?

vie-serendipity · 2024-06-10T15:13:03Z

@googs1025

One example: a group of pods should be dispatched to nodes that share a special interconnect, such as NVLink. Another example: the nodes in a (non-standard) k8s cluster may span data centers, and a group of pods should be dispatched into the same data center.

In this scenario, every group of pods uses the same template, so all pods of the lws would be dispatched to the same data center. However, because the GPUs are spread across data centers, different groups of pods under the lws should be schedulable to different data centers. So I don't think affinity or a selector can fulfill this requirement.

googs1025 · 2024-06-10T15:28:19Z

Are the different data centers part of the same cluster, or do they belong to different clusters? If they are within the same cluster, it might still be feasible to utilize node selectors or affinity methods. However, if they are distributed across different clusters, there might be concerns regarding communication overhead.

vie-serendipity · 2024-06-10T15:39:16Z

If they are within the same cluster, it might still be feasible to utilize node selectors or affinity methods.

They belong to the same cluster. But how would I use affinity to schedule different lws groups of pods across multiple data centers while keeping each group within a single data center? I can't figure out a way to do this.

googs1025 · 2024-06-10T15:55:48Z

So, if I understand correctly, you want to schedule multiple pods under the workerTemplate to nodes in different data centers, is that correct? I cannot determine whether it is possible to achieve the desired functionality using node selectors or affinity. Additionally, I cannot assess whether it is a reasonable design to have the higher-level workload make scheduling decisions.

vie-serendipity · 2024-06-11T02:11:00Z

This higher-level workload doesn't need to make any scheduling decisions; it just needs to make sure the workers are scheduled in the same data center as the head. The head's scheduling is decided entirely by the scheduler.

One possible implementation could be to wait until the head is scheduled to a node, then retrieve a label from that node (the label's key is predefined in the lws YAML). The workers would then add a matching affinity, ensuring they are scheduled to nodes that carry the same label value.
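To make the idea concrete, here is a minimal sketch of the step that turns the head's node label into a node affinity for the workers. It only builds the affinity stanza as a plain dictionary; it does not talk to the API server and is not how lws is implemented. The function name and the label key `example.com/datacenter` are assumptions made up for illustration.

```python
from typing import Dict, Optional


def worker_node_affinity(leader_node_labels: Dict[str, str], topology_key: str) -> Optional[dict]:
    """Build a nodeAffinity stanza pinning workers to the head's topology domain.

    Returns None when the head's node does not carry the label, so the caller
    can decide whether to wait or to fail.
    """
    domain = leader_node_labels.get(topology_key)
    if domain is None:
        return None
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {"key": topology_key, "operator": "In", "values": [domain]}
                        ]
                    }
                ]
            }
        }
    }


# Hypothetical labels of the node the head landed on (illustration only).
leader_node = {"example.com/datacenter": "dc-user", "kubernetes.io/hostname": "node-7"}
affinity = worker_node_affinity(leader_node, "example.com/datacenter")
```

The resulting dictionary has the shape of the `affinity` field in a pod spec, so a controller following this approach would merge it into each worker pod before creating it.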

kerthcet · 2024-06-11T04:54:21Z

Thanks for bringing this to the community. Is this what you need? https://github.com/kubernetes-sigs/lws/blob/main/docs/examples/sample/README.md#exclusive-placement

kerthcet · 2024-06-11T04:55:46Z

But this is exclusive, which means two groups cannot be located in the same topology.

vie-serendipity · 2024-06-11T06:22:04Z

Thanks, that's what I'm looking for.

But this is exclusive, which means two groups cannot be located in the same topology.

LeaderWorkerSet supports exclusive placement through pod affinity/anti-affinity where pods in the same group will be scheduled on the same accelerator island (such as a TPU slice or a GPU clique), but on different nodes. This ensures 1:1 LWS replica to accelerator island placement.

But I have a question: in the example, there are GPU and TPU accelerator islands. Does that mean I can only have two groups of pods? I feel like it should allow for multiple groups of pods within the same topology. Would it be more reasonable to just ensure that the head and workers have the same topology key?

kerthcet · 2024-06-11T06:42:46Z

in the example, there are GPU and TPU accelerator islands. Does that mean I can only have two groups of pods?

Accelerator is only used for integrations with cloud providers, like TPU with google cloud.

So what you need is slightly different from exclusive placement. Do you have a real use case on your side?

vie-serendipity · 2024-06-11T07:03:05Z

My usage scenario is that there are many nodes in a k8s cluster; some belong to the user and some to the cloud vendor. To keep it simple, there are two data centers, one for the user and one for the cloud provider. (This is not a standard k8s cluster, but I wonder whether there is also a need to schedule a group of pods to nodes of the same GPU type when a cluster has many GPU resource types.)

I want to use lws to deploy inference services to both the user's data center and the cloud data center. Although the networks of the two data centers are interconnected, communication between them is more expensive and may have to cross the public network, which is unstable.

So I want to make sure that each group of pods is dispatched to a single data center, so its pods can communicate easily. One data center should also be able to host multiple groups of pods, since business peaks require scaling.
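Stated as a check, the requirement is: every group stays inside one data center, while several groups may share a data center. A rough sketch of that check follows; the group names, node names, and data center names are all made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def groups_spanning_domains(
    placements: List[Tuple[str, str]], node_domain: Dict[str, str]
) -> Dict[str, Set[str]]:
    """Return the groups whose pods landed in more than one domain.

    placements: (group name, node name) for every scheduled pod.
    node_domain: node name -> data center (or any other topology domain).
    """
    domains_by_group = defaultdict(set)
    for group, node in placements:
        domains_by_group[group].add(node_domain[node])
    return {g: d for g, d in domains_by_group.items() if len(d) > 1}


node_domain = {"n1": "dc-user", "n2": "dc-user", "n3": "dc-cloud"}
# group-0 and group-1 share dc-user, which is fine.
# group-2 straddles both data centers, which violates the requirement.
placements = [
    ("group-0", "n1"), ("group-0", "n2"),
    ("group-1", "n2"), ("group-1", "n1"),
    ("group-2", "n2"), ("group-2", "n3"),
]
violations = groups_spanning_domains(placements, node_domain)
```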

kerthcet · 2024-06-11T07:32:56Z

Makes sense to me: a group of Pods should be located in the same topology, and the number of groups should not be limited.

cc @ahg-g @liurupeng thoughts?

ahg-g · 2024-06-12T07:27:34Z

yes, we can support that by simply removing the exclusive anti-affinity term that is currently getting added. But we need to come up with a proper API first, similar to kubernetes-sigs/jobset#75
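As a rough illustration of the difference between the two modes, the sketch below builds the pod affinity terms as plain dictionaries and adds the anti-affinity term only in the exclusive case. It is not the lws implementation: the function name, the `exclusive` flag, and the label key `example.com/group-key` are assumptions chosen for the example.

```python
from typing import Dict


def group_affinity(group_key: str, group_id: str, topology_key: str, exclusive: bool) -> Dict[str, dict]:
    """Build pod (anti-)affinity for one group of pods.

    group_key / group_id: label key and value shared by all pods of the group.
    topology_key: node label that defines the topology domain.
    """
    # Affinity: pull pods of the same group into one topology domain.
    affinity = {
        "podAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {
                        "matchExpressions": [
                            {"key": group_key, "operator": "In", "values": [group_id]}
                        ]
                    },
                    "topologyKey": topology_key,
                }
            ]
        }
    }
    if exclusive:
        # Anti-affinity: repel pods that belong to any other group.
        affinity["podAntiAffinity"] = {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {
                        "matchExpressions": [
                            {"key": group_key, "operator": "Exists"},
                            {"key": group_key, "operator": "NotIn", "values": [group_id]},
                        ]
                    },
                    "topologyKey": topology_key,
                }
            ]
        }
    return affinity


# Hypothetical label keys and group id, for illustration only.
colocated = group_affinity("example.com/group-key", "g0", "example.com/datacenter", exclusive=False)
exclusive = group_affinity("example.com/group-key", "g0", "example.com/datacenter", exclusive=True)
```

With `exclusive=False` only the affinity term remains, so several groups can share a topology domain while each group still stays together.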

vie-serendipity · 2024-06-12T07:45:57Z

@ahg-g I would like to contribute to this feature. I can propose a KEP later. Is this good for you?

liurupeng · 2024-06-25T06:22:32Z

@vie-serendipity could you start the KEP so that we could start the review?

vie-serendipity · 2024-06-25T08:40:54Z

@liurupeng Okay, I will propose a KEP soon.

dims · 2024-09-16T17:06:33Z

/subscribe

vie-serendipity added the kind/feature label Jun 10, 2024

vie-serendipity linked a pull request Jul 1, 2024 that will close this issue

add kep-162 Colocated Placement #168

Open
