How do you add other machines? This example just created replicas within a single machine? Is Kubeflow not capable of adding other machines? #3706

Open
warmbasket opened this issue Apr 2, 2024 · 3 comments

Comments

@warmbasket

How do you add other machines? This example only creates replicas within a single machine. Is Kubeflow not capable of adding other machines?

@warmbasket
Author

https://www.kubeflow.org/docs/components/training/overview/ says: "create a TFJob/PyTorchJob with required number PSs, workers, and GPUs using Training Operator Python SDK." But it doesn't seem possible to add workers or PSs from other machines using the Training Operator Python SDK, only replicas within a single machine?

@andreyvelich
Member

@akrupien Could you please explain what you mean by "add workers, or PSs from other machines"?
When you add more workers, Training Operator creates more Kubernetes pods, and those pods are scheduled onto the appropriate Kubernetes nodes.
You can also specify a Pod Node Selector if you want pods to be assigned to a specific Kubernetes node (machine).
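
To make the point above concrete, here is a minimal sketch of a TFJob manifest built as a plain Python dict, with several worker replicas and a node selector. It deliberately avoids the Training Operator SDK classes and uses only the standard library. The job name, image, and node label (`example.com/gpu-pool`) are hypothetical placeholders, not values from this thread.

```python
import json


def build_tfjob(name, image, num_workers, node_selector):
    """Return a TFJob manifest (kubeflow.org/v1) as a plain dict."""
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "metadata": {"name": name},
        "spec": {
            "tfReplicaSpecs": {
                "Worker": {
                    # One pod is created per replica; the Kubernetes
                    # scheduler decides which node each pod lands on.
                    "replicas": num_workers,
                    "restartPolicy": "OnFailure",
                    "template": {
                        "spec": {
                            # Only nodes carrying these labels are eligible.
                            "nodeSelector": node_selector,
                            "containers": [
                                {
                                    # TFJob expects the training container
                                    # to be named "tensorflow".
                                    "name": "tensorflow",
                                    "image": image,
                                }
                            ],
                        }
                    },
                }
            }
        },
    }


manifest = build_tfjob(
    name="multi-machine-example",                    # hypothetical job name
    image="example.com/my-training:latest",          # hypothetical image
    num_workers=3,
    node_selector={"example.com/gpu-pool": "true"},  # hypothetical node label
)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and submitted with `kubectl apply -f`. Note that a node selector only restricts which nodes are eligible; on its own it does not force each worker onto a different machine.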

@warmbasket
Author

warmbasket commented Apr 3, 2024

@andreyvelich Thank you, I think you sort of read my mind. I want each pod to be assigned to a single machine/computer; in my case, the pod should get the entire machine/computer.

By "add workers, or PSs from other machines", I mean that I have multiple computers/machines, each with multiple GPUs, and each computer/machine should be its own worker.

When I create a TFJob with the required number of workers using Training Operator, I'd expect it to match the TF config in my TensorFlow distributed training. So I should be able to add my individual computers/machines as workers?

I am using MultiWorkerMirroredStrategy in my TensorFlow distributed training with multiple computers/machines. Each computer/machine is its own worker.
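
For context, `MultiWorkerMirroredStrategy` discovers its peers through the `TF_CONFIG` environment variable, and as far as I understand, Training Operator sets that variable in every TFJob replica pod. The sketch below only parses the variable with the standard library, so it runs without TensorFlow installed. The worker addresses in the fallback value are hypothetical.

```python
import json
import os


def summarize_tf_config(raw):
    """Parse a TF_CONFIG JSON string.

    Returns (worker_addresses, task_type, task_index).
    """
    config = json.loads(raw)
    workers = config["cluster"].get("worker", [])
    task = config["task"]
    return workers, task["type"], task["index"]


# Hypothetical TF_CONFIG for the second worker of a 3-worker job.
# Inside a TFJob pod the real value comes from the environment.
example = json.dumps(
    {
        "cluster": {
            "worker": [
                "myjob-worker-0:2222",
                "myjob-worker-1:2222",
                "myjob-worker-2:2222",
            ]
        },
        "task": {"type": "worker", "index": 1},
    }
)

raw = os.environ.get("TF_CONFIG", example)
workers, task_type, task_index = summarize_tf_config(raw)
print(f"{task_type} {task_index} of {len(workers)} workers")
```

Each entry in the `worker` list corresponds to one pod, so three worker replicas scheduled onto three machines give the same one-worker-per-machine layout as a hand-written `TF_CONFIG`.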

https://www.kubeflow.org/docs/components/training/tftraining/
The TF replica spec in the Training Operator SDK doesn't seem to provide an option for adding individual machines as workers, only replicas within a single machine. But maybe I'm missing something under Spec, if the TF replica spec is not necessary for a TFJob.

Your link seems to assign a pod to a node. Is it possible in my situation to use pod affinity to add my multiple workers/computers/machines to a TFJob?
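
For reference, one Kubernetes mechanism that fits the "one worker per machine" layout is pod anti-affinity keyed on `kubernetes.io/hostname`. The sketch below is a hypothetical helper (`spread_across_nodes` is not part of any SDK) that adds such a rule to a pod template dict. It assumes labels set on the pod template are carried onto the worker pods; the label, image, and GPU count are placeholders.

```python
import copy


def spread_across_nodes(pod_template, label_key, label_value):
    """Return a copy of pod_template whose pods refuse to share a node
    with any other pod carrying the same label.

    Note: this overwrites any affinity already set on the template.
    """
    template = copy.deepcopy(pod_template)
    labels = template.setdefault("metadata", {}).setdefault("labels", {})
    labels[label_key] = label_value
    template.setdefault("spec", {})["affinity"] = {
        "podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {"matchLabels": {label_key: label_value}},
                    # Treat each node (machine) as its own topology domain.
                    "topologyKey": "kubernetes.io/hostname",
                }
            ]
        }
    }
    return template


base_template = {
    "spec": {
        "containers": [
            {
                "name": "tensorflow",
                "image": "example.com/my-training:latest",  # hypothetical image
                # Hypothetical GPU count; request all GPUs on the machine
                # so the single worker pod can use the whole node.
                "resources": {"limits": {"nvidia.com/gpu": 4}},
            }
        ]
    }
}

worker_template = spread_across_nodes(base_template, "app", "my-tf-worker")
```

The resulting `worker_template` would go under `spec.tfReplicaSpecs.Worker.template` of a TFJob. With a required anti-affinity rule, a worker stays pending if no free node is left, so the replica count should not exceed the number of eligible machines.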

In my situation, a node is a machine, which is an individual computer, which is its own single pod.

I am essentially asking how to use Training Operator to add my workers/computers/machines as pods, each on its own node.

Ideally, my entire cluster would be a single pod, but that doesn't seem possible.
