I've successfully run it on a single node and on two nodes, but after scaling to more than four nodes, it has never worked again. I hope to find a solution.
Did you try running on 2 nodes at a time, selecting different pairs of nodes (e.g. -H 100.81.5.17:8,100.112.180.16:8)? That way you can check that every single node works before adding it to the larger set.
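A rough sketch of that pairwise check, in case it helps. It only prints one mpirun command per pair of hosts; it does not run anything. The host list, slot count, and flags are copied from the command in this issue, and the function name `pair_commands` is just an illustrative helper, not part of Open MPI or nccl-tests.

```shell
#!/bin/sh
# Hosts and slots per host, taken from the -H list in this issue.
HOSTS="100.81.5.17 100.97.136.145 100.112.180.16 100.86.248.209"
SLOTS=8

# pair_commands (hypothetical helper): print one mpirun command for every
# unordered pair of hosts, so each pair can be tested in isolation.
pair_commands() {
  seen=""
  for a in $HOSTS; do
    # Pair the current host with every host that came before it.
    for b in $seen; do
      echo "mpirun --allow-run-as-root -bind-to none" \
           "-H $b:$SLOTS,$a:$SLOTS -np $((2 * SLOTS))" \
           "-x NCCL_DEBUG=INFO" \
           "-x NCCL_IB_HCA=mlx5_0:1,mlx5_2:1,mlx5_5:1,mlx5_6:1,mlx5_7:1,mlx5_8:1,mlx5_9:1" \
           "-x NCCL_SOCKET_IFNAME=eth0 -x LD_LIBRARY_PATH -x NCCL_ALGO=ring" \
           "/opt/nccl_tests/build/all_reduce_perf -b 512 -e 18G -f 2 -g 1"
    done
    seen="$seen $a"
  done
}

pair_commands
```

With four hosts this prints six commands. Running them one at a time and noting which pairs fail should show whether a single node (or a single link between two nodes) is the problem.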
This is the command that fails on four nodes:
mpirun -d --mca plm_base_verbose 5 --mca oob_base_verbose 10 --mca btl_base_verbose 30 -v --allow-run-as-root --mca orte_base_help_aggregate 0 \
  -bind-to none \
  -H 100.81.5.17:8,100.97.136.145:8,100.112.180.16:8,100.86.248.209:8 \
  -np 32 \
  -x NCCL_DEBUG=INFO \
  -x NCCL_IB_HCA=mlx5_0:1,mlx5_2:1,mlx5_5:1,mlx5_6:1,mlx5_7:1,mlx5_8:1,mlx5_9:1 \
  -x NCCL_SOCKET_IFNAME=eth0 \
  -x LD_LIBRARY_PATH \
  -x NCCL_ALGO=ring \
  /opt/nccl_tests/build/all_reduce_perf -b 512 -e 18G -f 2 -g 1
mpirun version 4.1.7a1