Cap nproc_per_node based on the CPU resources of the node for PyTorch TrainJob #2407

Open
astefanutti (Contributor) opened this issue Jan 31, 2025 · 0 comments


PyTorch uses the number of CPUs on the physical host to determine the "local world size" when nproc_per_node is set to auto and the node is a CPU-only device.

In that configuration, which is used by the preset torch-distributed training runtime, the number of processes equals the number of CPUs on the host, which leads to the following problems:

  • Out-of-memory issues for worker Pods scheduled on nodes with a large number of CPUs
  • Deadlocks when the CPU limit set on the container is lower than the actual number of CPUs on the node

To mitigate these issues, when the PyTorch ML policy defines numProcPerNode: auto, nproc_per_node should default to the container CPU limit when one is set, and fall back to 1 otherwise.
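
A minimal sketch of that resolution logic, assuming the standard k8s.io/api and k8s.io/apimachinery types; the helper name and the rounding choice are illustrative, not actual Kubeflow Trainer code:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// numProcPerNodeFromLimit resolves numProcPerNode: auto for a CPU-only node:
// it returns the container CPU limit rounded up to whole CPUs, or 1 when no
// CPU limit is set.
func numProcPerNodeFromLimit(container corev1.Container) int64 {
	limit, ok := container.Resources.Limits[corev1.ResourceCPU]
	if !ok || limit.IsZero() {
		// No CPU limit on the container: fall back to a single process
		// instead of letting torchrun use the host CPU count.
		return 1
	}
	// Quantity.Value() rounds up, so a fractional limit such as "2500m"
	// yields 3 processes rather than 2.
	return limit.Value()
}

func main() {
	container := corev1.Container{
		Name: "node", // illustrative container name
		Resources: corev1.ResourceRequirements{
			Limits: corev1.ResourceList{
				corev1.ResourceCPU: resource.MustParse("2500m"),
			},
		},
	}
	fmt.Println(numProcPerNodeFromLimit(container)) // prints 3
}
```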

This has been discussed in detail in #2387 (comment).
