What would you like to be added?
Kubeflow Training Operator V1 supports Volcano for gang scheduling, but Trainer V2 does not support it yet.
Volcano is a widely adopted scheduler for AI workloads, so integrating it into Trainer would add AI-specific scheduling capabilities and benefit users who want to schedule pods with Volcano on top of Kubeflow Trainer.
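For context, this is roughly what gang scheduling with Volcano looks like in V1 today (a minimal sketch with hypothetical names and a placeholder image; it assumes the operator is started with `--gang-scheduler-name=volcano`, in which case it creates a Volcano PodGroup for the job):

```yaml
# Sketch only: gang scheduling with Volcano in Training Operator V1.
# Assumes the operator runs with --gang-scheduler-name=volcano, so it
# creates a scheduling.volcano.sh/v1beta1 PodGroup for the job and the
# workers are scheduled only when all of them can be placed.
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: mnist-gang-demo            # hypothetical name
spec:
  pytorchReplicaSpecs:
    Worker:
      replicas: 4
      template:
        spec:
          schedulerName: volcano   # hand the pods to Volcano
          containers:
            - name: pytorch
              image: docker.io/kubeflow/pytorch-mnist:latest  # placeholder image
```

Trainer V2 has no equivalent of this today, which is what this issue proposes to add.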
/cc @kubeflow/wg-training-leads @saileshd1402 @astefanutti @juliusvonkohout @franciscojavierarceo @varodrig @rareddy @thesuperzapper @seanlaii @deepanker13 @helenxie-bit @Doris-xm @truc0 @mahdikhashan
Why is this needed?
In #2182, users requested richer Volcano support in Kubeflow Training Operator V1.
As far as I know, kubeedge/sedna is waiting for Volcano support to enable gang scheduling in edge-cloud environments (kubeedge/sedna#463). One of the reasons that effort was paused:
> All training workers must have the same parameters: the `PyTorchJob` CRD in training-operator assumes that all training workers (pods) share the same training parameters, while the `FederatedLearningJob` CRD in Sedna allows training workers to have different training parameters. So we assume that all training workers have the same training parameters, which will surely put many restrictions on the applicable scenarios of Sedna Federated Learning V2, but we have no choice.

In Kubeflow Trainer V2, we introduce `JobSet` as the low-level runtime for distributed training, which allows users to define different training parameters for different training workers (see the sketch below). That makes Kubeflow Trainer V2 a better fit for them than V1.

Based on the reasons above, supporting Volcano would bring users great value.
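To make the JobSet point concrete, here is a minimal sketch (hypothetical names, image, and flags) of a JobSet whose worker groups carry different training parameters, with pods handed to Volcano. How Trainer V2 would create the matching PodGroup is exactly the design question this issue raises:

```yaml
# Sketch only: one JobSet, two worker groups with different parameters.
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: federated-demo                  # hypothetical name
spec:
  replicatedJobs:
    - name: worker-a
      replicas: 1
      template:
        spec:
          template:
            spec:
              schedulerName: volcano    # would require the Volcano support proposed here
              restartPolicy: Never
              containers:
                - name: trainer
                  image: example.com/trainer:latest     # placeholder image
                  args: ["--lr=0.01", "--epochs=10"]    # parameters for group A
    - name: worker-b
      replicas: 1
      template:
        spec:
          template:
            spec:
              schedulerName: volcano
              restartPolicy: Never
              containers:
                - name: trainer
                  image: example.com/trainer:latest
                  args: ["--lr=0.001", "--epochs=20"]   # different parameters for group B
```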
Love this feature?
Give it a 👍. We prioritize the features with the most 👍.