mpirun all_reduce_perf hang with multi-device test #278

Open
kubepopeye opened this issue Dec 21, 2024 · 2 comments

Comments

@kubepopeye

I've successfully run it on a single node and on two nodes, but after scaling to more than four nodes, it has never worked again. I hope to find a solution.

mpirun -d --mca plm_base_verbose 5 --mca oob_base_verbose 10 --mca btl_base_verbose 30 -v --allow-run-as-root --mca orte_base_help_aggregate 0 \
  -bind-to none \
  -H 100.81.5.17:8,100.97.136.145:8,100.112.180.16:8,100.86.248.209:8 \
  -np 32 \
  -x NCCL_DEBUG=INFO \
  -x NCCL_IB_HCA=mlx5_0:1,mlx5_2:1,mlx5_5:1,mlx5_6:1,mlx5_7:1,mlx5_8:1,mlx5_9:1 \
  -x NCCL_SOCKET_IFNAME=eth0 \
  -x LD_LIBRARY_PATH \
  -x NCCL_ALGO=ring \
  /opt/nccl_tests/build/all_reduce_perf -b 512 -e 18G -f 2 -g 1

mpirun version 4.1.7a1

@kubepopeye
Author

This looks similar to #26, but I could not find a solution there.

@sjeaugey
Member

sjeaugey commented Jan 6, 2025

I've successfully run it on a single node and on two nodes, but after scaling to more than four nodes, it has never worked again.

Did you try running on 2 nodes while selecting different pairs of nodes (e.g. -H 100.81.5.17:8,100.112.180.16:8)? That way you can check that every single node works before adding it to the larger set.
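
The pairwise check suggested above could be scripted roughly as follows. This is only a sketch, not something from the thread: it assumes the four hosts, 8 slots per node, and the same flags as the original command, and the helper name `pair_host_lists` is made up for illustration. It prints one mpirun command per pair of nodes (a dry run) instead of launching anything.

```shell
#!/bin/sh
# Hosts and slot count copied from the original command (adjust as needed).
HOSTS="100.81.5.17 100.97.136.145 100.112.180.16 100.86.248.209"
SLOTS=8

# Hypothetical helper: print one "-H" host list per unordered pair of hosts.
pair_host_lists() {
  while [ "$#" -gt 1 ]; do
    first=$1
    shift
    # Pair the current host with every host that comes after it.
    for other in "$@"; do
      echo "${first}:${SLOTS},${other}:${SLOTS}"
    done
  done
}

# Dry run: print a 2-node command for each pair, reusing the original flags.
for hostlist in $(pair_host_lists $HOSTS); do
  echo "mpirun --allow-run-as-root -bind-to none -H ${hostlist} -np $((2 * SLOTS))" \
       "-x NCCL_DEBUG=INFO" \
       "-x NCCL_IB_HCA=mlx5_0:1,mlx5_2:1,mlx5_5:1,mlx5_6:1,mlx5_7:1,mlx5_8:1,mlx5_9:1" \
       "-x NCCL_SOCKET_IFNAME=eth0 -x LD_LIBRARY_PATH -x NCCL_ALGO=ring" \
       "/opt/nccl_tests/build/all_reduce_perf -b 512 -e 18G -f 2 -g 1"
done
```

With four hosts this prints six commands. Running them one at a time (for example by pasting each printed line into a shell) should show whether one particular node appears in every pair that hangs, which would point at that node rather than at the scale of the job.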
