How do you add other machines? This example just created replicas within a single machine? Is Kubeflow not capable of adding other machines? #3706
Comments
The docs at https://www.kubeflow.org/docs/components/training/overview/ say: "create a TFJob/PyTorchJob with required number PSs, workers, and GPUs using Training Operator Python SDK." But it doesn't seem to be possible to add workers or PSs from other machines using the Training Operator Python SDK, only replicas within a single machine?
@akrupien Could you please explain what you mean by "add workers, or Ps from other machines"?
@andreyvelich Thank you, I think you sort of read my mind: I want each pod to be assigned to a single machine/computer, and in my case the pod should get the entire machine.

By "add workers, or Ps from other machines", I mean that I have multiple computers/machines, each has multiple [...] When I create a TFJob with the required number of workers using Training Operator, I'd expect it to match the TF config in my TensorFlow distributed training, so I should be able to add my individual machines as workers? I am using MultiWorkerMirroredStrategy for distributed training across multiple machines, and each machine is its own worker.

https://www.kubeflow.org/docs/components/training/tftraining/

Your link seems to assign a pod to a node. Is it possible in my situation to use pod affinity to add my multiple workers/machines to a TFJob? In my situation a node is a machine, which is an individual computer, which runs its own single pod. I am essentially asking how to use Training Operator to add my workers/machines as pods, each on its own node. Ideally, my entire cluster would be a single pod, but that doesn't seem possible.
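For illustration, here is a minimal sketch of what "one worker pod per machine" could look like, assuming the machines have already joined the same Kubernetes cluster as nodes. It builds a TFJob manifest as a plain Python dict and adds a pod anti-affinity rule on `kubernetes.io/hostname` so that no two pods of the job are scheduled onto the same node. The helper `build_tfjob`, the job name, and the image are hypothetical, and the `training.kubeflow.org/job-name` pod label is an assumption about how the operator labels its pods; check the labels on your own pods before relying on it.

```python
import json


def build_tfjob(name, image, num_workers):
    """Return a TFJob manifest (as a dict) that requests one worker pod per node."""
    # Assumption: the training operator labels every pod of a job with this key.
    job_label = {"training.kubeflow.org/job-name": name}

    # Pods carrying the same job label must not share a hostname (i.e. a node).
    anti_affinity = {
        "podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {"matchLabels": job_label},
                    "topologyKey": "kubernetes.io/hostname",
                }
            ]
        }
    }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "metadata": {"name": name},
        "spec": {
            "tfReplicaSpecs": {
                "Worker": {
                    "replicas": num_workers,
                    "restartPolicy": "OnFailure",
                    "template": {
                        "spec": {
                            "affinity": anti_affinity,
                            "containers": [
                                # TFJob expects the training container to be named "tensorflow".
                                {"name": "tensorflow", "image": image}
                            ],
                        }
                    },
                }
            }
        },
    }


if __name__ == "__main__":
    # Hypothetical job name and image, for illustration only.
    manifest = build_tfjob("multi-node-demo", "example.com/my-training:latest", 3)
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and submitted with `kubectl apply -f`. With three replicas and this rule, the scheduler needs at least three schedulable nodes, otherwise some worker pods stay pending. This does not give a pod the "entire machine" by itself; resource requests or node selectors would be a separate step.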