PyTorch relies on the number of CPUs on the physical host when determining the "local world size" if `nproc_per_node` is set to `auto` and the node is a CPU-only device.
In that configuration, which is used by the preset `torch-distributed` training runtime, the number of processes equals the number of CPUs on the host, which leads to the following problems:

- Out-of-memory issues for worker Pods scheduled on nodes with a large number of CPUs
- Deadlocks when the CPU limit set for the container is lower than the actual number of CPUs

To mitigate these issues, when the PyTorch ML policy defines `numProcPerNode: auto`, `nproc_per_node` should default to the container's CPU limit when one is set, and fall back to 1 otherwise.
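For illustration, here is a minimal sketch of that resolution logic in Python, assuming a cgroup v2 container; `resolve_nproc_per_node` is a hypothetical helper, not the actual Trainer implementation:

```python
import math


def resolve_nproc_per_node(num_proc_per_node: str = "auto") -> int:
    """Resolve nproc_per_node without expanding "auto" to the host CPU count.

    Prefers the container's CPU limit (cgroup v2) and falls back to 1
    when no limit is set.
    """
    if num_proc_per_node != "auto":
        return int(num_proc_per_node)

    # cgroup v2 exposes the container CPU limit in cpu.max as
    # "<quota> <period>", or "max <period>" when no limit is set.
    try:
        with open("/sys/fs/cgroup/cpu.max") as f:
            quota, period = f.read().split()
        if quota != "max":
            return max(1, math.floor(int(quota) / int(period)))
    except (OSError, ValueError):
        pass

    # No CPU limit available: fall back to 1 rather than os.cpu_count(),
    # which reports the physical host's CPUs, not the container's share.
    return 1
```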
This has been discussed in detail in #2387 (comment).