Enabling P2P capability on 8 RTX 4090 GPUs results in significantly lower performance in NCCL alltoall_perf tests compared to when P2P capability is disabled. #17
Comments
I have 8 RTX 4090 GPUs working fine with the 550.90.07-p2p driver, and some of them are on NVMe x4 PCIe adapter cards. |
I am using the 550.90.07-p2p driver and can still reproduce the issue where enabling P2P capability results in a performance drop in the alltoall_perf test. Could you please share your test data? |
Maybe it doesn't like 8 cards :-)
NCCL_P2P_LEVEL=SYS ./alltoall_perf -b 8 -e 128M -f 2 -g 8
# nThread 1 nGpus 8 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 370722 on ubuntu11 device 0 [0x01] NVIDIA GeForce RTX 4090
# Rank 1 Group 0 Pid 370722 on ubuntu11 device 1 [0x02] NVIDIA GeForce RTX 4090
# Rank 2 Group 0 Pid 370722 on ubuntu11 device 2 [0x2a] NVIDIA GeForce RTX 4090
# Rank 3 Group 0 Pid 370722 on ubuntu11 device 3 [0x2c] NVIDIA GeForce RTX 4090
# Rank 4 Group 0 Pid 370722 on ubuntu11 device 4 [0x41] NVIDIA GeForce RTX 4090
# Rank 5 Group 0 Pid 370722 on ubuntu11 device 5 [0x42] NVIDIA GeForce RTX 4090
# Rank 6 Group 0 Pid 370722 on ubuntu11 device 6 [0x61] NVIDIA GeForce RTX 4090
# Rank 7 Group 0 Pid 370722 on ubuntu11 device 7 [0x62] NVIDIA GeForce RTX 4090
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
0 0 float none -1 49.16 0.00 0.00 0 48.85 0.00 0.00 N/A
0 0 float none -1 49.42 0.00 0.00 0 48.91 0.00 0.00 N/A
32 1 float none -1 50.95 0.00 0.00 0 50.04 0.00 0.00 N/A
64 2 float none -1 50.87 0.00 0.00 0 50.45 0.00 0.00 N/A
128 4 float none -1 50.24 0.00 0.00 0 50.41 0.00 0.00 N/A
256 8 float none -1 50.94 0.01 0.00 0 89.63 0.00 0.00 N/A
512 16 float none -1 52.09 0.01 0.01 0 51.60 0.01 0.01 N/A
1024 32 float none -1 51.65 0.02 0.02 0 66.24 0.02 0.01 N/A
2048 64 float none -1 54.57 0.04 0.03 0 51.28 0.04 0.03 N/A
4096 128 float none -1 51.27 0.08 0.07 0 51.27 0.08 0.07 N/A
8192 256 float none -1 53.11 0.15 0.13 0 51.23 0.16 0.14 N/A
16384 512 float none -1 60.79 0.27 0.24 0 56.50 0.29 0.25 N/A
32768 1024 float none -1 113.8 0.29 0.25 0 113.5 0.29 0.25 N/A
65536 2048 float none -1 301.3 0.22 0.19 0 229.7 0.29 0.25 N/A
131072 4096 float none -1 426.3 0.31 0.27 0 427.6 0.31 0.27 N/A
262144 8192 float none -1 774.7 0.34 0.30 0 687.1 0.38 0.33 N/A
524288 16384 float none -1 599.9 0.87 0.76 0 474.4 1.11 0.97 N/A
1048576 32768 float none -1 1159.5 0.90 0.79 0 1056.1 0.99 0.87 N/A
2097152 65536 float none -1 2342.7 0.90 0.78 0 2205.7 0.95 0.83 N/A
4194304 131072 float none -1 4459.3 0.94 0.82 0 4234.0 0.99 0.87 N/A
8388608 262144 float none -1 8618.4 0.97 0.85 0 8355.6 1.00 0.88 N/A
16777216 524288 float none -1 16657 1.01 0.88 0 16446 1.02 0.89 N/A
33554432 1048576 float none -1 32761 1.02 0.90 0 32640 1.03 0.90 N/A
67108864 2097152 float none -1 65425 1.03 0.90 0 65250 1.03 0.90 N/A
134217728 4194304 float none -1 132129 1.02 0.89 0 130992 1.02 0.90 N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth : 0.374518
#
myles@ubuntu11:~/nccl-tests/build$ ./alltoall_perf -b 8 -e 128M -f 2 -g 8
# nThread 1 nGpus 8 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 370879 on ubuntu11 device 0 [0x01] NVIDIA GeForce RTX 4090
# Rank 1 Group 0 Pid 370879 on ubuntu11 device 1 [0x02] NVIDIA GeForce RTX 4090
# Rank 2 Group 0 Pid 370879 on ubuntu11 device 2 [0x2a] NVIDIA GeForce RTX 4090
# Rank 3 Group 0 Pid 370879 on ubuntu11 device 3 [0x2c] NVIDIA GeForce RTX 4090
# Rank 4 Group 0 Pid 370879 on ubuntu11 device 4 [0x41] NVIDIA GeForce RTX 4090
# Rank 5 Group 0 Pid 370879 on ubuntu11 device 5 [0x42] NVIDIA GeForce RTX 4090
# Rank 6 Group 0 Pid 370879 on ubuntu11 device 6 [0x61] NVIDIA GeForce RTX 4090
# Rank 7 Group 0 Pid 370879 on ubuntu11 device 7 [0x62] NVIDIA GeForce RTX 4090
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
0 0 float none -1 66.31 0.00 0.00 0 49.37 0.00 0.00 N/A
0 0 float none -1 50.01 0.00 0.00 0 49.44 0.00 0.00 N/A
32 1 float none -1 50.88 0.00 0.00 0 51.28 0.00 0.00 N/A
64 2 float none -1 64.19 0.00 0.00 0 51.21 0.00 0.00 N/A
128 4 float none -1 51.64 0.00 0.00 0 51.11 0.00 0.00 N/A
256 8 float none -1 51.55 0.00 0.00 0 50.90 0.01 0.00 N/A
512 16 float none -1 51.44 0.01 0.01 0 50.86 0.01 0.01 N/A
1024 32 float none -1 50.94 0.02 0.02 0 115.4 0.01 0.01 N/A
2048 64 float none -1 51.44 0.04 0.03 0 50.84 0.04 0.04 N/A
4096 128 float none -1 51.01 0.08 0.07 0 51.03 0.08 0.07 N/A
8192 256 float none -1 51.89 0.16 0.14 0 51.40 0.16 0.14 N/A
16384 512 float none -1 63.17 0.26 0.23 0 56.80 0.29 0.25 N/A
32768 1024 float none -1 110.4 0.30 0.26 0 112.7 0.29 0.25 N/A
65536 2048 float none -1 229.7 0.29 0.25 0 226.8 0.29 0.25 N/A
131072 4096 float none -1 431.9 0.30 0.27 0 427.4 0.31 0.27 N/A
262144 8192 float none -1 790.9 0.33 0.29 0 699.8 0.37 0.33 N/A
524288 16384 float none -1 580.2 0.90 0.79 0 523.1 1.00 0.88 N/A
1048576 32768 float none -1 1197.2 0.88 0.77 0 1132.2 0.93 0.81 N/A
2097152 65536 float none -1 2355.8 0.89 0.78 0 2166.9 0.97 0.85 N/A
4194304 131072 float none -1 4450.6 0.94 0.82 0 4277.7 0.98 0.86 N/A
8388608 262144 float none -1 8625.1 0.97 0.85 0 8311.6 1.01 0.88 N/A
16777216 524288 float none -1 16530 1.01 0.89 0 16491 1.02 0.89 N/A
33554432 1048576 float none -1 32384 1.04 0.91 0 32438 1.03 0.91 N/A
67108864 2097152 float none -1 65691 1.02 0.89 0 65766 1.02 0.89 N/A
134217728 4194304 float none -1 132444 1.01 0.89 0 130140 1.03 0.90 N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth : 0.372955
#
myles@ubuntu11:~/nccl-tests/build$ nvidia-smi topo -p2p rw
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
GPU0 X OK OK OK OK OK OK OK
GPU1 OK X OK OK OK OK OK OK
GPU2 OK OK X OK OK OK OK OK
GPU3 OK OK OK X OK OK OK OK
GPU4 OK OK OK OK X OK OK OK
GPU5 OK OK OK OK OK X OK OK
GPU6 OK OK OK OK OK OK X OK
GPU7 OK OK OK OK OK OK OK X
Legend:
X = Self
OK = Status Ok
CNS = Chipset not supported
GNS = GPU not supported
TNS = Topology not supported
NS = Not supported
U = Unknown
myles@ubuntu11:~/nccl-tests/build$ |
I think this might be a normal result, because 8 cards may be saturating the bandwidth. I wonder why the results differ so much between our setups. What motherboard do you use, and how do you connect the 8 cards? I use PCIe 5.0 riser cables for 7 of them and an NVMe adapter for the 8th one. My Avg bus bandwidth is 0.372955 versus your 0.492673, but I have all the cards power-limited to 400 W. |
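For anyone comparing the tables above, here is a minimal sketch of how the algbw and busbw columns relate, assuming the conventions described in the nccl-tests performance notes (busbw = algbw * (n-1)/n for alltoall, and algbw * 2(n-1)/n for all_reduce). The function names are my own and are not part of nccl-tests.

```python
def algbw_gbps(size_bytes: int, time_us: float) -> float:
    """Algorithm bandwidth in GB/s: bytes moved divided by elapsed time."""
    return size_bytes / (time_us * 1e-6) / 1e9


def alltoall_busbw(algbw: float, n_ranks: int) -> float:
    """Bus bandwidth for alltoall: each rank keeps 1/n of its data locally."""
    return algbw * (n_ranks - 1) / n_ranks


def allreduce_busbw(algbw: float, n_ranks: int) -> float:
    """Bus bandwidth for all_reduce, using the 2(n-1)/n correction factor."""
    return algbw * 2 * (n_ranks - 1) / n_ranks


if __name__ == "__main__":
    # Last out-of-place row of the first alltoall table above:
    # 134217728 bytes in 132129 us on 8 ranks.
    a = algbw_gbps(134217728, 132129)
    print(f"{a:.2f} {alltoall_busbw(a, 8):.2f}")  # prints "1.02 0.89"
```

The printed values match the 1.02 and 0.89 GB/s shown in that row, so the low figure comes from the measured time rather than from a reporting quirk.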
My GPUs are not power-limited, and I’m using an Intel SPR motherboard with all 8 GPUs directly connected to the CPU, using PCIe 4.0. I don't think it’s related to bandwidth saturation, because even with just 3 GPUs, the performance significantly degrades when P2P is enabled. |
Have a look: it actually says "NCCL INFO Channel 03 : 1[1] -> 0[0] via SHM/direct/direct", which means the traffic is going via the CPU when more than two devices are listed.
export NCCL_P2P_DISABLE=0
myles@ubuntu11:~/nccl-tests/build$ NCCL_DEBUG=INFO ./all_reduce_perf -g 3
# nThread 1 nGpus 3 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 103631 on ubuntu11 device 0 [0x01] NVIDIA GeForce RTX 4090
# Rank 1 Group 0 Pid 103631 on ubuntu11 device 1 [0x02] NVIDIA GeForce RTX 4090
# Rank 2 Group 0 Pid 103631 on ubuntu11 device 2 [0x2b] NVIDIA GeForce RTX 4090
ubuntu11:103631:103631 [0] NCCL INFO Bootstrap : Using enp37s0f0:192.168.1.32<0>
ubuntu11:103631:103631 [0] NCCL INFO cudaDriverVersion 12060
ubuntu11:103631:103631 [0] NCCL INFO NCCL version 2.23.4+cuda12.4
ubuntu11:103631:103642 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
ubuntu11:103631:103642 [0] NCCL INFO NET/IB : No device found.
ubuntu11:103631:103642 [0] NCCL INFO NET/Socket : Using [0]enp37s0f0:192.168.1.32<0> [1]enp37s0f1:192.168.1.47<0>
ubuntu11:103631:103642 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
ubuntu11:103631:103642 [0] NCCL INFO Using network Socket
ubuntu11:103631:103644 [2] NCCL INFO Using network Socket
ubuntu11:103631:103643 [1] NCCL INFO Using network Socket
ubuntu11:103631:103644 [2] NCCL INFO ncclCommInitAll comm 0x650203e649b0 rank 2 nranks 3 cudaDev 2 nvmlDev 2 busId 2b000 commId 0xfcf1664b18e49f00 - Init START
ubuntu11:103631:103643 [1] NCCL INFO ncclCommInitAll comm 0x650203e24070 rank 1 nranks 3 cudaDev 1 nvmlDev 1 busId 2000 commId 0xfcf1664b18e49f00 - Init START
ubuntu11:103631:103642 [0] NCCL INFO ncclCommInitAll comm 0x650203de37d0 rank 0 nranks 3 cudaDev 0 nvmlDev 0 busId 1000 commId 0xfcf1664b18e49f00 - Init START
ubuntu11:103631:103643 [1] NCCL INFO Bootstrap timings total 0.001324 (create 0.000039, send 0.000161, recv 0.000395, ring 0.000121, delay 0.000000)
ubuntu11:103631:103642 [0] NCCL INFO Bootstrap timings total 0.001267 (create 0.000048, send 0.000227, recv 0.000700, ring 0.000082, delay 0.000000)
ubuntu11:103631:103644 [2] NCCL INFO Bootstrap timings total 0.001342 (create 0.000042, send 0.000175, recv 0.000675, ring 0.000234, delay 0.000001)
ubuntu11:103631:103643 [1] NCCL INFO NVLS multicast support is not available on dev 1
ubuntu11:103631:103644 [2] NCCL INFO NVLS multicast support is not available on dev 2
ubuntu11:103631:103642 [0] NCCL INFO NVLS multicast support is not available on dev 0
ubuntu11:103631:103644 [2] NCCL INFO comm 0x650203e649b0 rank 2 nRanks 3 nNodes 1 localRanks 3 localRank 2 MNNVL 0
ubuntu11:103631:103643 [1] NCCL INFO comm 0x650203e24070 rank 1 nRanks 3 nNodes 1 localRanks 3 localRank 1 MNNVL 0
ubuntu11:103631:103642 [0] NCCL INFO comm 0x650203de37d0 rank 0 nRanks 3 nNodes 1 localRanks 3 localRank 0 MNNVL 0
ubuntu11:103631:103642 [0] NCCL INFO Channel 00/04 : 0 1 2
ubuntu11:103631:103642 [0] NCCL INFO Channel 01/04 : 0 1 2
ubuntu11:103631:103642 [0] NCCL INFO Channel 02/04 : 0 1 2
ubuntu11:103631:103644 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1 [2] -1/-1/-1->2->1 [3] -1/-1/-1->2->1
ubuntu11:103631:103643 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0
ubuntu11:103631:103643 [1] NCCL INFO P2P Chunksize set to 131072
ubuntu11:103631:103642 [0] NCCL INFO Channel 03/04 : 0 1 2
ubuntu11:103631:103642 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1
ubuntu11:103631:103642 [0] NCCL INFO P2P Chunksize set to 131072
ubuntu11:103631:103644 [2] NCCL INFO P2P Chunksize set to 131072
ubuntu11:103631:103647 [1] NCCL INFO [Proxy Service] Device 1 CPU core 118
ubuntu11:103631:103649 [2] NCCL INFO [Proxy Service] Device 2 CPU core 59
ubuntu11:103631:103652 [2] NCCL INFO [Proxy Service UDS] Device 2 CPU core 7
ubuntu11:103631:103650 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 65
ubuntu11:103631:103648 [0] NCCL INFO [Proxy Service] Device 0 CPU core 39
ubuntu11:103631:103651 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 105
ubuntu11:103631:103643 [1] NCCL INFO Channel 00 : 1[1] -> 2[2] via SHM/direct/direct
ubuntu11:103631:103644 [2] NCCL INFO Channel 00 : 2[2] -> 0[0] via SHM/direct/direct
ubuntu11:103631:103642 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
ubuntu11:103631:103643 [1] NCCL INFO Channel 01 : 1[1] -> 2[2] via SHM/direct/direct
ubuntu11:103631:103644 [2] NCCL INFO Channel 01 : 2[2] -> 0[0] via SHM/direct/direct
ubuntu11:103631:103642 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
ubuntu11:103631:103643 [1] NCCL INFO Channel 02 : 1[1] -> 2[2] via SHM/direct/direct
ubuntu11:103631:103644 [2] NCCL INFO Channel 02 : 2[2] -> 0[0] via SHM/direct/direct
ubuntu11:103631:103642 [0] NCCL INFO Channel 02 : 0[0] -> 1[1] via SHM/direct/direct
ubuntu11:103631:103643 [1] NCCL INFO Channel 03 : 1[1] -> 2[2] via SHM/direct/direct
ubuntu11:103631:103644 [2] NCCL INFO Channel 03 : 2[2] -> 0[0] via SHM/direct/direct
ubuntu11:103631:103642 [0] NCCL INFO Channel 03 : 0[0] -> 1[1] via SHM/direct/direct
ubuntu11:103631:103643 [1] NCCL INFO Connected all rings
ubuntu11:103631:103644 [2] NCCL INFO Connected all rings
ubuntu11:103631:103642 [0] NCCL INFO Connected all rings
ubuntu11:103631:103644 [2] NCCL INFO Channel 00 : 2[2] -> 1[1] via SHM/direct/direct
ubuntu11:103631:103644 [2] NCCL INFO Channel 01 : 2[2] -> 1[1] via SHM/direct/direct
ubuntu11:103631:103644 [2] NCCL INFO Channel 02 : 2[2] -> 1[1] via SHM/direct/direct
ubuntu11:103631:103644 [2] NCCL INFO Channel 03 : 2[2] -> 1[1] via SHM/direct/direct
ubuntu11:103631:103643 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
ubuntu11:103631:103643 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
ubuntu11:103631:103643 [1] NCCL INFO Channel 02 : 1[1] -> 0[0] via SHM/direct/direct
ubuntu11:103631:103643 [1] NCCL INFO Channel 03 : 1[1] -> 0[0] via SHM/direct/direct
ubuntu11:103631:103642 [0] NCCL INFO Connected all trees
ubuntu11:103631:103644 [2] NCCL INFO Connected all trees
ubuntu11:103631:103643 [1] NCCL INFO Connected all trees
ubuntu11:103631:103658 [0] NCCL INFO [Proxy Progress] Device 0 CPU core 80
ubuntu11:103631:103659 [1] NCCL INFO [Proxy Progress] Device 1 CPU core 121
ubuntu11:103631:103643 [1] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512
ubuntu11:103631:103643 [1] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
ubuntu11:103631:103642 [0] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512
ubuntu11:103631:103642 [0] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
ubuntu11:103631:103642 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
ubuntu11:103631:103660 [2] NCCL INFO [Proxy Progress] Device 2 CPU core 97
ubuntu11:103631:103644 [2] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512
ubuntu11:103631:103644 [2] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
ubuntu11:103631:103642 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
ubuntu11:103631:103642 [0] NCCL INFO ncclCommInitAll comm 0x650203de37d0 rank 0 nranks 3 cudaDev 0 nvmlDev 0 busId 1000 commId 0xfcf1664b18e49f00 - Init COMPLETE
ubuntu11:103631:103642 [0] NCCL INFO Init timings - ncclCommInitAll: rank 0 nranks 3 total 0.56 (kernels 0.26, alloc 0.06, bootstrap 0.00, allgathers 0.00, topo 0.02, graphs 0.00, connections 0.21, rest 0.01)
ubuntu11:103631:103643 [1] NCCL INFO ncclCommInitAll comm 0x650203e24070 rank 1 nranks 3 cudaDev 1 nvmlDev 1 busId 2000 commId 0xfcf1664b18e49f00 - Init COMPLETE
ubuntu11:103631:103644 [2] NCCL INFO ncclCommInitAll comm 0x650203e649b0 rank 2 nranks 3 cudaDev 2 nvmlDev 2 busId 2b000 commId 0xfcf1664b18e49f00 - Init COMPLETE
ubuntu11:103631:103643 [1] NCCL INFO Init timings - ncclCommInitAll: rank 1 nranks 3 total 0.56 (kernels 0.26, alloc 0.06, bootstrap 0.00, allgathers 0.00, topo 0.01, graphs 0.00, connections 0.21, rest 0.01)
ubuntu11:103631:103644 [2] NCCL INFO Init timings - ncclCommInitAll: rank 2 nranks 3 total 0.56 (kernels 0.26, alloc 0.06, bootstrap 0.00, allgathers 0.00, topo 0.02, graphs 0.00, connections 0.22, rest 0.00)
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
33554432 8388608 float sum -1 12835 2.61 3.49 0 12855 2.61 3.48 0
ubuntu11:103631:103631 [0] NCCL INFO comm 0x650203de37d0 rank 0 nranks 3 cudaDev 0 busId 1000 - Destroy COMPLETE
ubuntu11:103631:103631 [2] NCCL INFO comm 0x650203e649b0 rank 2 nranks 3 cudaDev 2 busId 2b000 - Destroy COMPLETE
ubuntu11:103631:103631 [1] NCCL INFO comm 0x650203e24070 rank 1 nranks 3 cudaDev 1 busId 2000 - Destroy COMPLETE
# Out of bounds values : 0 OK
# Avg bus bandwidth : 3.48293
#
myles@ubuntu11:~/nccl-tests/build$ NCCL_DEBUG=INFO ./all_reduce_perf -g 2
# nThread 1 nGpus 2 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 104198 on ubuntu11 device 0 [0x01] NVIDIA GeForce RTX 4090
# Rank 1 Group 0 Pid 104198 on ubuntu11 device 1 [0x02] NVIDIA GeForce RTX 4090
ubuntu11:104198:104198 [0] NCCL INFO Bootstrap : Using enp37s0f0:192.168.1.32<0>
ubuntu11:104198:104198 [0] NCCL INFO cudaDriverVersion 12060
ubuntu11:104198:104198 [0] NCCL INFO NCCL version 2.23.4+cuda12.4
ubuntu11:104198:104212 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
ubuntu11:104198:104212 [0] NCCL INFO NET/IB : No device found.
ubuntu11:104198:104212 [0] NCCL INFO NET/Socket : Using [0]enp37s0f0:192.168.1.32<0> [1]enp37s0f1:192.168.1.47<0>
ubuntu11:104198:104212 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
ubuntu11:104198:104212 [0] NCCL INFO Using network Socket
ubuntu11:104198:104213 [1] NCCL INFO Using network Socket
ubuntu11:104198:104213 [1] NCCL INFO ncclCommInitAll comm 0x64b9672f4710 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 2000 commId 0x91379a7e94fa7192 - Init START
ubuntu11:104198:104212 [0] NCCL INFO ncclCommInitAll comm 0x64b9672b4af0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 1000 commId 0x91379a7e94fa7192 - Init START
ubuntu11:104198:104213 [1] NCCL INFO Bootstrap timings total 0.001072 (create 0.000054, send 0.000212, recv 0.000337, ring 0.000068, delay 0.000000)
ubuntu11:104198:104212 [0] NCCL INFO Bootstrap timings total 0.001006 (create 0.000046, send 0.000226, recv 0.000458, ring 0.000054, delay 0.000000)
ubuntu11:104198:104213 [1] NCCL INFO comm 0x64b9672f4710 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
ubuntu11:104198:104213 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
ubuntu11:104198:104213 [1] NCCL INFO P2P Chunksize set to 131072
ubuntu11:104198:104212 [0] NCCL INFO comm 0x64b9672b4af0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
ubuntu11:104198:104212 [0] NCCL INFO Channel 00/04 : 0 1
ubuntu11:104198:104212 [0] NCCL INFO Channel 01/04 : 0 1
ubuntu11:104198:104212 [0] NCCL INFO Channel 02/04 : 0 1
ubuntu11:104198:104212 [0] NCCL INFO Channel 03/04 : 0 1
ubuntu11:104198:104212 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
ubuntu11:104198:104212 [0] NCCL INFO P2P Chunksize set to 131072
ubuntu11:104198:104216 [1] NCCL INFO [Proxy Service] Device 1 CPU core 80
ubuntu11:104198:104217 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 43
ubuntu11:104198:104218 [0] NCCL INFO [Proxy Service] Device 0 CPU core 49
ubuntu11:104198:104219 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 22
ubuntu11:104198:104212 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/direct pointer
ubuntu11:104198:104212 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/direct pointer
ubuntu11:104198:104213 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/direct pointer
ubuntu11:104198:104212 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/direct pointer
ubuntu11:104198:104213 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/direct pointer
ubuntu11:104198:104212 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/direct pointer
ubuntu11:104198:104213 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/direct pointer
ubuntu11:104198:104213 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/direct pointer
ubuntu11:104198:104213 [1] NCCL INFO Connected all rings
ubuntu11:104198:104213 [1] NCCL INFO Connected all trees
ubuntu11:104198:104212 [0] NCCL INFO Connected all rings
ubuntu11:104198:104212 [0] NCCL INFO Connected all trees
ubuntu11:104198:104220 [1] NCCL INFO [Proxy Progress] Device 1 CPU core 15
ubuntu11:104198:104221 [0] NCCL INFO [Proxy Progress] Device 0 CPU core 115
ubuntu11:104198:104212 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
ubuntu11:104198:104212 [0] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
ubuntu11:104198:104212 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
ubuntu11:104198:104213 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
ubuntu11:104198:104213 [1] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
ubuntu11:104198:104212 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
ubuntu11:104198:104212 [0] NCCL INFO ncclCommInitAll comm 0x64b9672b4af0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 1000 commId 0x91379a7e94fa7192 - Init COMPLETE
ubuntu11:104198:104212 [0] NCCL INFO Init timings - ncclCommInitAll: rank 0 nranks 2 total 0.29 (kernels 0.19, alloc 0.07, bootstrap 0.00, allgathers 0.00, topo 0.01, graphs 0.00, connections 0.02, rest 0.00)
ubuntu11:104198:104213 [1] NCCL INFO ncclCommInitAll comm 0x64b9672f4710 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 2000 commId 0x91379a7e94fa7192 - Init COMPLETE
ubuntu11:104198:104213 [1] NCCL INFO Init timings - ncclCommInitAll: rank 1 nranks 2 total 0.29 (kernels 0.20, alloc 0.06, bootstrap 0.00, allgathers 0.00, topo 0.01, graphs 0.00, connections 0.02, rest 0.00)
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
33554432 8388608 float sum -1 1398.2 24.00 24.00 0 1395.1 24.05 24.05 0
ubuntu11:104198:104198 [0] NCCL INFO comm 0x64b9672b4af0 rank 0 nranks 2 cudaDev 0 busId 1000 - Destroy COMPLETE
ubuntu11:104198:104198 [1] NCCL INFO comm 0x64b9672f4710 rank 1 nranks 2 cudaDev 1 busId 2000 - Destroy COMPLETE
# Out of bounds values : 0 OK
# Avg bus bandwidth : 24.0255
# |
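To make the comparison between the two runs above less error-prone than reading the log by eye, here is a small sketch that counts which transport each "Channel ... via ..." line reports. The regular expression is based only on the two line shapes visible in these logs (NCCL 2.23.4); other NCCL versions may format these lines differently, so treat the pattern as an assumption.

```python
import re
from collections import Counter

# Matches connection lines such as:
#   "... NCCL INFO Channel 00 : 1[1] -> 2[2] via SHM/direct/direct"
#   "... NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/direct pointer"
# Ring-order lines such as "Channel 00/04 : 0 1 2" do not match.
CHANNEL_RE = re.compile(
    r"NCCL INFO Channel \d+(?:/\d+)? : (\d+)\[\w+\] -> (\d+)\[\w+\] via (\S+)"
)


def count_transports(log_text: str) -> Counter:
    """Count connections per transport (the part before the first '/')."""
    counts = Counter()
    for match in CHANNEL_RE.finditer(log_text):
        transport = match.group(3).split("/")[0]
        counts[transport] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "ubuntu11:103631:103643 [1] NCCL INFO Channel 00 : 1[1] -> 2[2] via SHM/direct/direct\n"
        "ubuntu11:104198:104212 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/direct pointer\n"
    )
    print(dict(count_transports(sample)))
```

A run where every connection is counted under SHM is going through host memory; a run counted entirely under P2P is using direct GPU-to-GPU copies.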
NCCL_P2P_LEVEL=PHB |
@ZP-AlwaysWin hey, I found a solution. NCCL chooses CPU transfer if you set NCCL_P2P_LEVEL=SYS and more than 2 GPUs are selected. Export the variable above and try again. I tested on CUDA 12.6 with NVIDIA driver 560.35.03, and it used P2P for all the transfers with 7 GPUs. I'm going to test with 8 GPUs in a few minutes, once I plug another GPU back in. With NCCL_P2P_LEVEL=PHB, 7 GPUs show very high bandwidth, about 24 GB/s. |
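For context on what the levels mean, here is a simplified model of NCCL_P2P_LEVEL as I understand it from the NCCL environment variable documentation: the value sets the maximum topology distance at which NCCL will use the P2P transport, ordered LOC < NVL < PIX < PXB < PHB < SYS. The function below is my own illustration, not NCCL code, and it ignores the other checks NCCL performs before enabling P2P.

```python
# Documented NCCL_P2P_LEVEL values, from nearest to farthest path type.
# LOC means "never use P2P"; SYS allows P2P even across NUMA nodes.
LEVELS = ["LOC", "NVL", "PIX", "PXB", "PHB", "SYS"]


def p2p_allowed(level: str, path: str) -> bool:
    """Return True if a GPU pair at distance `path` may use P2P at `level`.

    Simplified assumption: P2P is permitted when the path between two GPUs
    is no farther than the configured level, and never when level is LOC.
    """
    if level == "LOC":
        return False
    return LEVELS.index(path) <= LEVELS.index(level)


if __name__ == "__main__":
    for level in ("PHB", "SYS"):
        allowed = [p for p in LEVELS[1:] if p2p_allowed(level, p)]
        print(level, "->", allowed)
```

Under this model SYS is at least as permissive as PHB, so a fallback to SHM that appears only with SYS would have to come from something other than the level itself. That is worth keeping in mind when reading the results in this thread.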
Here is the log output for the command NCCL_DEBUG=INFO NCCL_P2P_LEVEL=SYS ./alltoall_perf -b 8 -e 128M -f 2 -g 8 |
Please provide detailed test logs, or leave contact information for further communication. @mylesgoose |
I just tested the P2P capability using version 560.35.03, and the result is the same as the previous version, with no improvement. |
Did you try NCCL_P2P_LEVEL=PHB? When I tested my setup with SYS, it fell back to the CPU. When I set PHB, it used P2P with 7 cards, where before it was falling back to CPU transfer. However, if I enable a card that does not have PCIe x16 bandwidth, it falls back to CPU copy. You can see this if you run the command with NCCL_DEBUG=INFO. My point is that if one device does not have the same bandwidth as the rest, NCCL falls back to the CPU. I cannot get my PC to boot with 8 cards at the moment due to PCIe 4.0 x16 issues on the ASUS WRX80E motherboard. Can you try running with that setting, and are all your devices on PCIe 4.0 x16? @ZP-AlwaysWin try using CUDA_VISIBLE_DEVICES to exclude any devices that are not on PCIe x16. |
NCCL_P2P_LEVEL=PHB I tested it, and the result is the same as without enabling P2P capability, which means P2P is effectively not enabled. @mylesgoose |
Yeah, but did you try disabling the non-x16 cards? For example, if device 2 is on PCIe x8 and the rest are on x16, export CUDA_VISIBLE_DEVICES=0,1,3,4,5,6,7 and NCCL_P2P_LEVEL=PHB, then check whether P2P is still enabled at that level. As long as all devices are on the same NUMA node, I have seen it work just fine with 7 cards at x16. When I ran that test with 7 cards, excluding the 8th card on PCIe x4, the system worked with full P2P. When I ran the same test with 7 cards including the x4 card, everything went via the CPU, as shown in the NCCL INFO output. With 8 cards and P2P enabled, one of them being x4, I would expect CPU transfer, because it already falls back to the CPU with 7 cards when one is at x4 and the rest are at x16. Yet it works fine with P2P at PHB with 7 cards as long as none of them is the x4 card. This leads me to think the problem lies in the NCCL software, which we have the source code for, so we could find out why. Perhaps it does a quick check to see whether all devices are PCIe 4.0 x16 equivalent. I can't get my motherboard working, so it's being returned, and I can't prove that it works as long as all the cards are at the same speed. Hopefully I can order a new motherboard soon; that ASUS Sage WRX80E is playing up with that many cards. My point is this: it is not a driver issue, it is an issue with how NCCL handles that specific request depending on your hardware configuration. Maybe you have 8 cards at PCIe 4.0 x16. If you do, and you have tried NCCL_P2P_LEVEL=PHB and P2P still does not show up in the INFO output, then I may be wrong. But I can replicate this issue with 5, 6, or 7 devices, depending on whether one of them has lower PCIe bandwidth than the other 4, 5, or 6 cards, and I feel that points to the culprit. I think we should investigate that path, so it would be good if you could test that idea. |
Above I showed the command with NCCL_DEBUG=INFO and NCCL_P2P_LEVEL=PHB. Can you try that using only your cards that are at full PCIe 4.0 x16 bandwidth? It does not work if one card has lower bandwidth than the others; all cards must be at the same speed or NCCL disables P2P. Run the command above and see whether the terminal shows "via SHM/direct/direct". If it does, you have the answer: NCCL is forcing CPU transfer on those devices instead of P2P, which is why performance degrades. All 8 cards are trying to copy to CPU RAM at once and then back to each other. @ZP-AlwaysWin |
Here I found the outputs showing that the transfers use P2P PROVIDED all the cards have PCIe x16 bandwidth. If any of the cards are PCIe x4 or x8, NCCL defaults to the CPU. If the cards are all on the same device and speed, it works. As you can see, the speeds are excellent and do not vary as I go right up to the maximum number of PCIe x16 slots I have. When I change CUDA_VISIBLE_DEVICES to enable the x4 card and disable one of the x16 cards, NCCL says nope, not happy, let's go to the CPU.
# nThread 1 nGpus 3 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
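To try the "x16 cards only" test described above without checking each slot by hand, something like the following sketch could build the CUDA_VISIBLE_DEVICES value from the link widths that nvidia-smi reports. It relies on the `pcie.link.gen.current` and `pcie.link.width.current` query fields; the helper names and the sample data are hypothetical. Note that this only selects devices; it does not confirm that link width is what makes NCCL fall back to SHM.

```python
import csv
import io
import subprocess

QUERY_CMD = [
    "nvidia-smi",
    "--query-gpu=index,pcie.link.gen.current,pcie.link.width.current",
    "--format=csv,noheader",
]


def parse_links(csv_text: str) -> list[tuple[int, int, int]]:
    """Parse 'index, gen, width' rows into integer tuples."""
    rows = []
    for record in csv.reader(io.StringIO(csv_text)):
        if len(record) != 3:
            continue  # skip blank or malformed lines
        index, gen, width = (field.strip() for field in record)
        rows.append((int(index), int(gen), int(width)))
    return rows


def query_links() -> list[tuple[int, int, int]]:
    """Run nvidia-smi and parse its output (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY_CMD, capture_output=True, text=True, check=True)
    return parse_links(result.stdout)


def visible_devices(rows: list[tuple[int, int, int]], min_width: int = 16) -> str:
    """Comma-separated indices of GPUs whose current link width is >= min_width."""
    return ",".join(str(index) for index, _gen, width in rows if width >= min_width)


if __name__ == "__main__":
    # Hypothetical sample: GPU 2 sits on an x4 adapter, the others on x16.
    sample = "0, 4, 16\n1, 4, 16\n2, 4, 4\n3, 4, 16\n"
    print("CUDA_VISIBLE_DEVICES=" + visible_devices(parse_links(sample)))
```

The filter uses only the width, since the reported current link generation can drop while a GPU is idle. The resulting value can be exported before rerunning the test with NCCL_DEBUG=INFO and NCCL_P2P_LEVEL=PHB.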
#
# Using devices
# Rank 0 Group 0 Pid 11800 on myles-System-Product-Name device 0 [0x01] NVIDIA GeForce RTX 4090
# Rank 1 Group 0 Pid 11800 on myles-System-Product-Name device 1 [0x02] NVIDIA GeForce RTX 4090
# Rank 2 Group 0 Pid 11800 on myles-System-Product-Name device 2 [0x2b] NVIDIA GeForce RTX 4090
myles-System-Product-Name:11800:11800 [0] NCCL INFO Bootstrap : Using enp37s0f0:192.168.1.32<0>
myles-System-Product-Name:11800:11800 [0] NCCL INFO cudaDriverVersion 12060
myles-System-Product-Name:11800:11800 [0] NCCL INFO NCCL version 2.23.4+cuda12.6
myles-System-Product-Name:11800:11832 [1] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
myles-System-Product-Name:11800:11832 [1] NCCL INFO Failed to open libibverbs.so[.1]
myles-System-Product-Name:11800:11832 [1] NCCL INFO NET/Socket : Using [0]enp37s0f0:192.168.1.32<0> [1]enp37s0f1:192.168.1.47<0>
myles-System-Product-Name:11800:11832 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
myles-System-Product-Name:11800:11832 [1] NCCL INFO Using network Socket
myles-System-Product-Name:11800:11833 [2] NCCL INFO Using network Socket
myles-System-Product-Name:11800:11831 [0] NCCL INFO Using network Socket
myles-System-Product-Name:11800:11833 [2] NCCL INFO ncclCommInitAll comm 0x5b56b2a08a00 rank 2 nranks 3 cudaDev 2 nvmlDev 2 busId 2b000 commId 0x47a48d37963338d8 - Init START
myles-System-Product-Name:11800:11831 [0] NCCL INFO ncclCommInitAll comm 0x5b56b2987820 rank 0 nranks 3 cudaDev 0 nvmlDev 0 busId 1000 commId 0x47a48d37963338d8 - Init START
myles-System-Product-Name:11800:11832 [1] NCCL INFO ncclCommInitAll comm 0x5b56b29c80c0 rank 1 nranks 3 cudaDev 1 nvmlDev 1 busId 2000 commId 0x47a48d37963338d8 - Init START
myles-System-Product-Name:11800:11831 [0] NCCL INFO Bootstrap timings total 0.001027 (create 0.000063, send 0.000146, recv 0.000453, ring 0.000187, delay 0.000000)
myles-System-Product-Name:11800:11832 [1] NCCL INFO Bootstrap timings total 0.000949 (create 0.000048, send 0.000169, recv 0.000485, ring 0.000058, delay 0.000000)
myles-System-Product-Name:11800:11833 [2] NCCL INFO Bootstrap timings total 0.001082 (create 0.000049, send 0.000179, recv 0.000312, ring 0.000073, delay 0.000000)
myles-System-Product-Name:11800:11831 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to SYS
myles-System-Product-Name:11800:11831 [0] NCCL INFO NVLS multicast support is not available on dev 0
myles-System-Product-Name:11800:11833 [2] NCCL INFO NVLS multicast support is not available on dev 2
myles-System-Product-Name:11800:11832 [1] NCCL INFO NVLS multicast support is not available on dev 1
myles-System-Product-Name:11800:11831 [0] NCCL INFO comm 0x5b56b2987820 rank 0 nRanks 3 nNodes 1 localRanks 3 localRank 0 MNNVL 0
myles-System-Product-Name:11800:11832 [1] NCCL INFO comm 0x5b56b29c80c0 rank 1 nRanks 3 nNodes 1 localRanks 3 localRank 1 MNNVL 0
myles-System-Product-Name:11800:11831 [0] NCCL INFO Channel 00/04 : 0 1 2
myles-System-Product-Name:11800:11833 [2] NCCL INFO comm 0x5b56b2a08a00 rank 2 nRanks 3 nNodes 1 localRanks 3 localRank 2 MNNVL 0
myles-System-Product-Name:11800:11831 [0] NCCL INFO Channel 01/04 : 0 1 2
myles-System-Product-Name:11800:11833 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1 [2] -1/-1/-1->2->1 [3] -1/-1/-1->2->1
myles-System-Product-Name:11800:11833 [2] NCCL INFO P2P Chunksize set to 131072
myles-System-Product-Name:11800:11831 [0] NCCL INFO Channel 02/04 : 0 1 2
myles-System-Product-Name:11800:11831 [0] NCCL INFO Channel 03/04 : 0 1 2
myles-System-Product-Name:11800:11832 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0
myles-System-Product-Name:11800:11832 [1] NCCL INFO P2P Chunksize set to 131072
myles-System-Product-Name:11800:11831 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1
myles-System-Product-Name:11800:11831 [0] NCCL INFO P2P Chunksize set to 131072
myles-System-Product-Name:11800:11836 [2] NCCL INFO [Proxy Service] Device 2 CPU core 32
myles-System-Product-Name:11800:11838 [0] NCCL INFO [Proxy Service] Device 0 CPU core 118
myles-System-Product-Name:11800:11837 [1] NCCL INFO [Proxy Service] Device 1 CPU core 41
myles-System-Product-Name:11800:11839 [2] NCCL INFO [Proxy Service UDS] Device 2 CPU core 57
myles-System-Product-Name:11800:11840 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 124
myles-System-Product-Name:11800:11841 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 41
myles-System-Product-Name:11800:11833 [2] NCCL INFO Channel 00/0 : 2[2] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:11800:11831 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:11800:11832 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/direct pointer
myles-System-Product-Name:11800:11833 [2] NCCL INFO Channel 01/0 : 2[2] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:11800:11831 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:11800:11832 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/direct pointer
myles-System-Product-Name:11800:11833 [2] NCCL INFO Channel 02/0 : 2[2] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:11800:11831 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:11800:11832 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/direct pointer
myles-System-Product-Name:11800:11831 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:11800:11833 [2] NCCL INFO Channel 03/0 : 2[2] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:11800:11832 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/direct pointer
myles-System-Product-Name:11800:11831 [0] NCCL INFO Connected all rings
myles-System-Product-Name:11800:11833 [2] NCCL INFO Connected all rings
myles-System-Product-Name:11800:11832 [1] NCCL INFO Connected all rings
myles-System-Product-Name:11800:11833 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:11800:11833 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:11800:11833 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:11800:11833 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:11800:11832 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:11800:11832 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:11800:11832 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:11800:11832 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:11800:11831 [0] NCCL INFO Connected all trees
myles-System-Product-Name:11800:11833 [2] NCCL INFO Connected all trees
myles-System-Product-Name:11800:11832 [1] NCCL INFO Connected all trees
myles-System-Product-Name:11800:11842 [0] NCCL INFO [Proxy Progress] Device 0 CPU core 118
myles-System-Product-Name:11800:11843 [1] NCCL INFO [Proxy Progress] Device 1 CPU core 97
myles-System-Product-Name:11800:11844 [2] NCCL INFO [Proxy Progress] Device 2 CPU core 30
myles-System-Product-Name:11800:11831 [0] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512
myles-System-Product-Name:11800:11831 [0] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
myles-System-Product-Name:11800:11832 [1] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512
myles-System-Product-Name:11800:11832 [1] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
myles-System-Product-Name:11800:11833 [2] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512
myles-System-Product-Name:11800:11833 [2] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
myles-System-Product-Name:11800:11831 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
myles-System-Product-Name:11800:11831 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
myles-System-Product-Name:11800:11831 [0] NCCL INFO ncclCommInitAll comm 0x5b56b2987820 rank 0 nranks 3 cudaDev 0 nvmlDev 0 busId 1000 commId 0x47a48d37963338d8 - Init COMPLETE
myles-System-Product-Name:11800:11831 [0] NCCL INFO Init timings - ncclCommInitAll: rank 0 nranks 3 total 0.36 (kernels 0.25, alloc 0.06, bootstrap 0.00, allgathers 0.00, topo 0.01, graphs 0.00, connections 0.03, rest 0.00)
myles-System-Product-Name:11800:11832 [1] NCCL INFO ncclCommInitAll comm 0x5b56b29c80c0 rank 1 nranks 3 cudaDev 1 nvmlDev 1 busId 2000 commId 0x47a48d37963338d8 - Init COMPLETE
myles-System-Product-Name:11800:11832 [1] NCCL INFO Init timings - ncclCommInitAll: rank 1 nranks 3 total 0.36 (kernels 0.25, alloc 0.06, bootstrap 0.00, allgathers 0.00, topo 0.02, graphs 0.00, connections 0.03, rest 0.00)
myles-System-Product-Name:11800:11833 [2] NCCL INFO ncclCommInitAll comm 0x5b56b2a08a00 rank 2 nranks 3 cudaDev 2 nvmlDev 2 busId 2b000 commId 0x47a48d37963338d8 - Init COMPLETE
myles-System-Product-Name:11800:11833 [2] NCCL INFO Init timings - ncclCommInitAll: rank 2 nranks 3 total 0.36 (kernels 0.25, alloc 0.06, bootstrap 0.00, allgathers 0.00, topo 0.02, graphs 0.00, connections 0.03, rest 0.00)
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
33554432 8388608 float sum -1 1988.0 16.88 22.50 0 1933.6 17.35 23.14 0
myles-System-Product-Name:11800:11800 [0] NCCL INFO comm 0x5b56b2987820 rank 0 nranks 3 cudaDev 0 busId 1000 - Destroy COMPLETE
myles-System-Product-Name:11800:11800 [2] NCCL INFO comm 0x5b56b2a08a00 rank 2 nranks 3 cudaDev 2 busId 2b000 - Destroy COMPLETE
myles-System-Product-Name:11800:11800 [1] NCCL INFO comm 0x5b56b29c80c0 rank 1 nranks 3 cudaDev 1 busId 2000 - Destroy COMPLETE
# Out of bounds values : 0 OK
# Avg bus bandwidth : 22.8207
#
myles@myles-System-Product-Name:~/nccl-tests/build$ NCCL_DEBUG=INFO NCCL_P2P_LEVEL=PHB ./all_reduce_perf -g 3
# nThread 1 nGpus 3 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 11860 on myles-System-Product-Name device 0 [0x01] NVIDIA GeForce RTX 4090
# Rank 1 Group 0 Pid 11860 on myles-System-Product-Name device 1 [0x02] NVIDIA GeForce RTX 4090
# Rank 2 Group 0 Pid 11860 on myles-System-Product-Name device 2 [0x2b] NVIDIA GeForce RTX 4090
myles-System-Product-Name:11860:11860 [0] NCCL INFO Bootstrap : Using enp37s0f0:192.168.1.32<0>
myles-System-Product-Name:11860:11860 [0] NCCL INFO cudaDriverVersion 12060
myles-System-Product-Name:11860:11860 [0] NCCL INFO NCCL version 2.23.4+cuda12.6
myles-System-Product-Name:11860:11892 [1] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
myles-System-Product-Name:11860:11892 [1] NCCL INFO Failed to open libibverbs.so[.1]
myles-System-Product-Name:11860:11892 [1] NCCL INFO NET/Socket : Using [0]enp37s0f0:192.168.1.32<0> [1]enp37s0f1:192.168.1.47<0>
myles-System-Product-Name:11860:11892 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
myles-System-Product-Name:11860:11892 [1] NCCL INFO Using network Socket
myles-System-Product-Name:11860:11891 [0] NCCL INFO Using network Socket
myles-System-Product-Name:11860:11893 [2] NCCL INFO Using network Socket
myles-System-Product-Name:11860:11891 [0] NCCL INFO ncclCommInitAll comm 0x5d1f738e9820 rank 0 nranks 3 cudaDev 0 nvmlDev 0 busId 1000 commId 0x64d31cd9129b6291 - Init START
myles-System-Product-Name:11860:11893 [2] NCCL INFO ncclCommInitAll comm 0x5d1f7396aa00 rank 2 nranks 3 cudaDev 2 nvmlDev 2 busId 2b000 commId 0x64d31cd9129b6291 - Init START
myles-System-Product-Name:11860:11892 [1] NCCL INFO ncclCommInitAll comm 0x5d1f7392a0c0 rank 1 nranks 3 cudaDev 1 nvmlDev 1 busId 2000 commId 0x64d31cd9129b6291 - Init START
myles-System-Product-Name:11860:11892 [1] NCCL INFO Bootstrap timings total 0.000900 (create 0.000051, send 0.000168, recv 0.000419, ring 0.000062, delay 0.000000)
myles-System-Product-Name:11860:11893 [2] NCCL INFO Bootstrap timings total 0.001019 (create 0.000048, send 0.000149, recv 0.000293, ring 0.000105, delay 0.000000)
myles-System-Product-Name:11860:11891 [0] NCCL INFO Bootstrap timings total 0.001061 (create 0.000040, send 0.000136, recv 0.000531, ring 0.000144, delay 0.000000)
myles-System-Product-Name:11860:11893 [2] NCCL INFO NCCL_P2P_LEVEL set by environment to PHB
myles-System-Product-Name:11860:11893 [2] NCCL INFO NVLS multicast support is not available on dev 2
myles-System-Product-Name:11860:11891 [0] NCCL INFO NVLS multicast support is not available on dev 0
myles-System-Product-Name:11860:11892 [1] NCCL INFO NVLS multicast support is not available on dev 1
myles-System-Product-Name:11860:11893 [2] NCCL INFO comm 0x5d1f7396aa00 rank 2 nRanks 3 nNodes 1 localRanks 3 localRank 2 MNNVL 0
myles-System-Product-Name:11860:11892 [1] NCCL INFO comm 0x5d1f7392a0c0 rank 1 nRanks 3 nNodes 1 localRanks 3 localRank 1 MNNVL 0
myles-System-Product-Name:11860:11893 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1 [2] -1/-1/-1->2->1 [3] -1/-1/-1->2->1
myles-System-Product-Name:11860:11893 [2] NCCL INFO P2P Chunksize set to 131072
myles-System-Product-Name:11860:11891 [0] NCCL INFO comm 0x5d1f738e9820 rank 0 nRanks 3 nNodes 1 localRanks 3 localRank 0 MNNVL 0
myles-System-Product-Name:11860:11891 [0] NCCL INFO Channel 00/04 : 0 1 2
myles-System-Product-Name:11860:11891 [0] NCCL INFO Channel 01/04 : 0 1 2
myles-System-Product-Name:11860:11891 [0] NCCL INFO Channel 02/04 : 0 1 2
myles-System-Product-Name:11860:11891 [0] NCCL INFO Channel 03/04 : 0 1 2
myles-System-Product-Name:11860:11892 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0
myles-System-Product-Name:11860:11892 [1] NCCL INFO P2P Chunksize set to 131072
myles-System-Product-Name:11860:11891 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1
myles-System-Product-Name:11860:11891 [0] NCCL INFO P2P Chunksize set to 131072
myles-System-Product-Name:11860:11898 [0] NCCL INFO [Proxy Service] Device 0 CPU core 41
myles-System-Product-Name:11860:11897 [2] NCCL INFO [Proxy Service UDS] Device 2 CPU core 78
myles-System-Product-Name:11860:11899 [1] NCCL INFO [Proxy Service] Device 1 CPU core 40
myles-System-Product-Name:11860:11900 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 114
myles-System-Product-Name:11860:11896 [2] NCCL INFO [Proxy Service] Device 2 CPU core 4
myles-System-Product-Name:11860:11901 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 118
myles-System-Product-Name:11860:11893 [2] NCCL INFO Channel 00/0 : 2[2] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:11860:11892 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/direct pointer
myles-System-Product-Name:11860:11891 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:11860:11892 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/direct pointer
myles-System-Product-Name:11860:11893 [2] NCCL INFO Channel 01/0 : 2[2] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:11860:11891 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:11860:11893 [2] NCCL INFO Channel 02/0 : 2[2] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:11860:11892 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/direct pointer
myles-System-Product-Name:11860:11891 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:11860:11893 [2] NCCL INFO Channel 03/0 : 2[2] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:11860:11892 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/direct pointer
myles-System-Product-Name:11860:11891 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:11860:11893 [2] NCCL INFO Connected all rings
myles-System-Product-Name:11860:11893 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:11860:11892 [1] NCCL INFO Connected all rings
myles-System-Product-Name:11860:11891 [0] NCCL INFO Connected all rings
myles-System-Product-Name:11860:11893 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:11860:11893 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:11860:11893 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:11860:11892 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:11860:11892 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:11860:11892 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:11860:11892 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:11860:11891 [0] NCCL INFO Connected all trees
myles-System-Product-Name:11860:11893 [2] NCCL INFO Connected all trees
myles-System-Product-Name:11860:11892 [1] NCCL INFO Connected all trees
myles-System-Product-Name:11860:11902 [0] NCCL INFO [Proxy Progress] Device 0 CPU core 40
myles-System-Product-Name:11860:11903 [2] NCCL INFO [Proxy Progress] Device 2 CPU core 5
myles-System-Product-Name:11860:11891 [0] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512
myles-System-Product-Name:11860:11891 [0] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
myles-System-Product-Name:11860:11904 [1] NCCL INFO [Proxy Progress] Device 1 CPU core 57
myles-System-Product-Name:11860:11893 [2] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512
myles-System-Product-Name:11860:11893 [2] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
myles-System-Product-Name:11860:11892 [1] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512
myles-System-Product-Name:11860:11892 [1] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
myles-System-Product-Name:11860:11891 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
myles-System-Product-Name:11860:11892 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
myles-System-Product-Name:11860:11892 [1] NCCL INFO ncclCommInitAll comm 0x5d1f7392a0c0 rank 1 nranks 3 cudaDev 1 nvmlDev 1 busId 2000 commId 0x64d31cd9129b6291 - Init COMPLETE
myles-System-Product-Name:11860:11893 [2] NCCL INFO ncclCommInitAll comm 0x5d1f7396aa00 rank 2 nranks 3 cudaDev 2 nvmlDev 2 busId 2b000 commId 0x64d31cd9129b6291 - Init COMPLETE
myles-System-Product-Name:11860:11892 [1] NCCL INFO Init timings - ncclCommInitAll: rank 1 nranks 3 total 0.36 (kernels 0.25, alloc 0.06, bootstrap 0.00, allgathers 0.00, topo 0.02, graphs 0.00, connections 0.03, rest 0.00)
myles-System-Product-Name:11860:11893 [2] NCCL INFO Init timings - ncclCommInitAll: rank 2 nranks 3 total 0.36 (kernels 0.25, alloc 0.06, bootstrap 0.00, allgathers 0.00, topo 0.01, graphs 0.00, connections 0.03, rest 0.00)
myles-System-Product-Name:11860:11891 [0] NCCL INFO ncclCommInitAll comm 0x5d1f738e9820 rank 0 nranks 3 cudaDev 0 nvmlDev 0 busId 1000 commId 0x64d31cd9129b6291 - Init COMPLETE
myles-System-Product-Name:11860:11891 [0] NCCL INFO Init timings - ncclCommInitAll: rank 0 nranks 3 total 0.36 (kernels 0.25, alloc 0.06, bootstrap 0.00, allgathers 0.00, topo 0.02, graphs 0.00, connections 0.03, rest 0.00)
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
33554432 8388608 float sum -1 1825.5 18.38 24.51 0 1820.7 18.43 24.57 0
myles-System-Product-Name:11860:11860 [0] NCCL INFO comm 0x5d1f738e9820 rank 0 nranks 3 cudaDev 0 busId 1000 - Destroy COMPLETE
myles-System-Product-Name:11860:11860 [2] NCCL INFO comm 0x5d1f7396aa00 rank 2 nranks 3 cudaDev 2 busId 2b000 - Destroy COMPLETE
myles-System-Product-Name:11860:11860 [1] NCCL INFO comm 0x5d1f7392a0c0 rank 1 nranks 3 cudaDev 1 busId 2000 - Destroy COMPLETE
# Out of bounds values : 0 OK
# Avg bus bandwidth : 24.5402
#
myles@myles-System-Product-Name:~/nccl-tests/build$ NCCL_DEBUG=INFO NCCL_P2P_LEVEL=PHB ./all_reduce_perf -g 4
# nThread 1 nGpus 4 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 11905 on myles-System-Product-Name device 0 [0x01] NVIDIA GeForce RTX 4090
# Rank 1 Group 0 Pid 11905 on myles-System-Product-Name device 1 [0x02] NVIDIA GeForce RTX 4090
# Rank 2 Group 0 Pid 11905 on myles-System-Product-Name device 2 [0x2b] NVIDIA GeForce RTX 4090
# Rank 3 Group 0 Pid 11905 on myles-System-Product-Name device 3 [0x41] NVIDIA GeForce RTX 4090
myles-System-Product-Name:11905:11905 [0] NCCL INFO Bootstrap : Using enp37s0f0:192.168.1.32<0>
myles-System-Product-Name:11905:11905 [0] NCCL INFO cudaDriverVersion 12060
myles-System-Product-Name:11905:11905 [0] NCCL INFO NCCL version 2.23.4+cuda12.6
myles-System-Product-Name:11905:11941 [2] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
myles-System-Product-Name:11905:11941 [2] NCCL INFO Failed to open libibverbs.so[.1]
myles-System-Product-Name:11905:11941 [2] NCCL INFO NET/Socket : Using [0]enp37s0f0:192.168.1.32<0> [1]enp37s0f1:192.168.1.47<0>
myles-System-Product-Name:11905:11941 [2] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
myles-System-Product-Name:11905:11941 [2] NCCL INFO Using network Socket
myles-System-Product-Name:11905:11939 [0] NCCL INFO Using network Socket
myles-System-Product-Name:11905:11942 [3] NCCL INFO Using network Socket
myles-System-Product-Name:11905:11940 [1] NCCL INFO Using network Socket
myles-System-Product-Name:11905:11939 [0] NCCL INFO ncclCommInitAll comm 0x5feb380d49d0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0x753b93d1e4296bfc - Init START
myles-System-Product-Name:11905:11940 [1] NCCL INFO ncclCommInitAll comm 0x5feb38115ef0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 2000 commId 0x753b93d1e4296bfc - Init START
myles-System-Product-Name:11905:11941 [2] NCCL INFO ncclCommInitAll comm 0x5feb381574b0 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 2b000 commId 0x753b93d1e4296bfc - Init START
myles-System-Product-Name:11905:11942 [3] NCCL INFO ncclCommInitAll comm 0x5feb38198a70 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 41000 commId 0x753b93d1e4296bfc - Init START
myles-System-Product-Name:11905:11940 [1] NCCL INFO Bootstrap timings total 0.001192 (create 0.000050, send 0.000150, recv 0.000544, ring 0.000327, delay 0.000000)
myles-System-Product-Name:11905:11941 [2] NCCL INFO Bootstrap timings total 0.001156 (create 0.000049, send 0.000164, recv 0.000599, ring 0.000143, delay 0.000000)
myles-System-Product-Name:11905:11939 [0] NCCL INFO Bootstrap timings total 0.001247 (create 0.000056, send 0.000168, recv 0.000387, ring 0.000103, delay 0.000000)
myles-System-Product-Name:11905:11942 [3] NCCL INFO Bootstrap timings total 0.001135 (create 0.000055, send 0.000164, recv 0.000632, ring 0.000090, delay 0.000000)
myles-System-Product-Name:11905:11940 [1] NCCL INFO NCCL_P2P_LEVEL set by environment to PHB
myles-System-Product-Name:11905:11940 [1] NCCL INFO NVLS multicast support is not available on dev 1
myles-System-Product-Name:11905:11942 [3] NCCL INFO NVLS multicast support is not available on dev 3
myles-System-Product-Name:11905:11941 [2] NCCL INFO NVLS multicast support is not available on dev 2
myles-System-Product-Name:11905:11939 [0] NCCL INFO NVLS multicast support is not available on dev 0
myles-System-Product-Name:11905:11940 [1] NCCL INFO comm 0x5feb38115ef0 rank 1 nRanks 4 nNodes 1 localRanks 4 localRank 1 MNNVL 0
myles-System-Product-Name:11905:11942 [3] NCCL INFO comm 0x5feb38198a70 rank 3 nRanks 4 nNodes 1 localRanks 4 localRank 3 MNNVL 0
myles-System-Product-Name:11905:11940 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0
myles-System-Product-Name:11905:11940 [1] NCCL INFO P2P Chunksize set to 131072
myles-System-Product-Name:11905:11941 [2] NCCL INFO comm 0x5feb381574b0 rank 2 nRanks 4 nNodes 1 localRanks 4 localRank 2 MNNVL 0
myles-System-Product-Name:11905:11941 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1
myles-System-Product-Name:11905:11941 [2] NCCL INFO P2P Chunksize set to 131072
myles-System-Product-Name:11905:11942 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2 [2] -1/-1/-1->3->2 [3] -1/-1/-1->3->2
myles-System-Product-Name:11905:11942 [3] NCCL INFO P2P Chunksize set to 131072
myles-System-Product-Name:11905:11939 [0] NCCL INFO comm 0x5feb380d49d0 rank 0 nRanks 4 nNodes 1 localRanks 4 localRank 0 MNNVL 0
myles-System-Product-Name:11905:11939 [0] NCCL INFO Channel 00/04 : 0 1 2 3
myles-System-Product-Name:11905:11939 [0] NCCL INFO Channel 01/04 : 0 1 2 3
myles-System-Product-Name:11905:11939 [0] NCCL INFO Channel 02/04 : 0 1 2 3
myles-System-Product-Name:11905:11939 [0] NCCL INFO Channel 03/04 : 0 1 2 3
myles-System-Product-Name:11905:11939 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1
myles-System-Product-Name:11905:11939 [0] NCCL INFO P2P Chunksize set to 131072
myles-System-Product-Name:11905:11945 [1] NCCL INFO [Proxy Service] Device 1 CPU core 118
myles-System-Product-Name:11905:11948 [0] NCCL INFO [Proxy Service] Device 0 CPU core 76
myles-System-Product-Name:11905:11950 [2] NCCL INFO [Proxy Service UDS] Device 2 CPU core 27
myles-System-Product-Name:11905:11946 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 125
myles-System-Product-Name:11905:11947 [2] NCCL INFO [Proxy Service] Device 2 CPU core 10
myles-System-Product-Name:11905:11949 [3] NCCL INFO [Proxy Service] Device 3 CPU core 85
myles-System-Product-Name:11905:11951 [3] NCCL INFO [Proxy Service UDS] Device 3 CPU core 32
myles-System-Product-Name:11905:11952 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 47
myles-System-Product-Name:11905:11940 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/direct pointer
myles-System-Product-Name:11905:11942 [3] NCCL INFO Channel 00/0 : 3[3] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:11905:11941 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/direct pointer
myles-System-Product-Name:11905:11940 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/direct pointer
myles-System-Product-Name:11905:11939 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:11905:11942 [3] NCCL INFO Channel 01/0 : 3[3] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:11905:11941 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/direct pointer
myles-System-Product-Name:11905:11940 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/direct pointer
myles-System-Product-Name:11905:11939 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:11905:11942 [3] NCCL INFO Channel 02/0 : 3[3] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:11905:11941 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/direct pointer
myles-System-Product-Name:11905:11939 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:11905:11942 [3] NCCL INFO Channel 03/0 : 3[3] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:11905:11940 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/direct pointer
myles-System-Product-Name:11905:11941 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/direct pointer
myles-System-Product-Name:11905:11939 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:11905:11940 [1] NCCL INFO Connected all rings
myles-System-Product-Name:11905:11939 [0] NCCL INFO Connected all rings
myles-System-Product-Name:11905:11942 [3] NCCL INFO Connected all rings
myles-System-Product-Name:11905:11941 [2] NCCL INFO Connected all rings
myles-System-Product-Name:11905:11942 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/direct pointer
myles-System-Product-Name:11905:11942 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/direct pointer
myles-System-Product-Name:11905:11942 [3] NCCL INFO Channel 02/0 : 3[3] -> 2[2] via P2P/direct pointer
myles-System-Product-Name:11905:11942 [3] NCCL INFO Channel 03/0 : 3[3] -> 2[2] via P2P/direct pointer
myles-System-Product-Name:11905:11940 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:11905:11940 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:11905:11940 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:11905:11941 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:11905:11940 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:11905:11941 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:11905:11941 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:11905:11941 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:11905:11939 [0] NCCL INFO Connected all trees
myles-System-Product-Name:11905:11940 [1] NCCL INFO Connected all trees
myles-System-Product-Name:11905:11942 [3] NCCL INFO Connected all trees
myles-System-Product-Name:11905:11941 [2] NCCL INFO Connected all trees
myles-System-Product-Name:11905:11953 [1] NCCL INFO [Proxy Progress] Device 1 CPU core 53
myles-System-Product-Name:11905:11940 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
myles-System-Product-Name:11905:11940 [1] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
myles-System-Product-Name:11905:11954 [0] NCCL INFO [Proxy Progress] Device 0 CPU core 75
myles-System-Product-Name:11905:11955 [3] NCCL INFO [Proxy Progress] Device 3 CPU core 76
myles-System-Product-Name:11905:11939 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
myles-System-Product-Name:11905:11939 [0] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
myles-System-Product-Name:11905:11956 [2] NCCL INFO [Proxy Progress] Device 2 CPU core 84
myles-System-Product-Name:11905:11939 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
myles-System-Product-Name:11905:11942 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
myles-System-Product-Name:11905:11942 [3] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
myles-System-Product-Name:11905:11941 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
myles-System-Product-Name:11905:11941 [2] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
myles-System-Product-Name:11905:11940 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
myles-System-Product-Name:11905:11940 [1] NCCL INFO ncclCommInitAll comm 0x5feb38115ef0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 2000 commId 0x753b93d1e4296bfc - Init COMPLETE
myles-System-Product-Name:11905:11940 [1] NCCL INFO Init timings - ncclCommInitAll: rank 1 nranks 4 total 0.45 (kernels 0.32, alloc 0.06, bootstrap 0.00, allgathers 0.00, topo 0.02, graphs 0.00, connections 0.04, rest 0.01)
myles-System-Product-Name:11905:11942 [3] NCCL INFO ncclCommInitAll comm 0x5feb38198a70 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 41000 commId 0x753b93d1e4296bfc - Init COMPLETE
myles-System-Product-Name:11905:11942 [3] NCCL INFO Init timings - ncclCommInitAll: rank 3 nranks 4 total 0.45 (kernels 0.31, alloc 0.06, bootstrap 0.00, allgathers 0.00, topo 0.02, graphs 0.00, connections 0.05, rest 0.00)
myles-System-Product-Name:11905:11941 [2] NCCL INFO ncclCommInitAll comm 0x5feb381574b0 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 2b000 commId 0x753b93d1e4296bfc - Init COMPLETE
myles-System-Product-Name:11905:11939 [0] NCCL INFO ncclCommInitAll comm 0x5feb380d49d0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0x753b93d1e4296bfc - Init COMPLETE
myles-System-Product-Name:11905:11939 [0] NCCL INFO Init timings - ncclCommInitAll: rank 0 nranks 4 total 0.45 (kernels 0.31, alloc 0.06, bootstrap 0.00, allgathers 0.00, topo 0.02, graphs 0.00, connections 0.05, rest 0.00)
myles-System-Product-Name:11905:11941 [2] NCCL INFO Init timings - ncclCommInitAll: rank 2 nranks 4 total 0.45 (kernels 0.31, alloc 0.06, bootstrap 0.00, allgathers 0.00, topo 0.02, graphs 0.00, connections 0.05, rest 0.00)
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
33554432 8388608 float sum -1 2037.7 16.47 24.70 0 2063.9 16.26 24.39 0
myles-System-Product-Name:11905:11905 [0] NCCL INFO comm 0x5feb380d49d0 rank 0 nranks 4 cudaDev 0 busId 1000 - Destroy COMPLETE
myles-System-Product-Name:11905:11905 [3] NCCL INFO comm 0x5feb38198a70 rank 3 nranks 4 cudaDev 3 busId 41000 - Destroy COMPLETE
myles-System-Product-Name:11905:11905 [2] NCCL INFO comm 0x5feb381574b0 rank 2 nranks 4 cudaDev 2 busId 2b000 - Destroy COMPLETE
myles-System-Product-Name:11905:11905 [1] NCCL INFO comm 0x5feb38115ef0 rank 1 nranks 4 cudaDev 1 busId 2000 - Destroy COMPLETE
# Out of bounds values : 0 OK
# Avg bus bandwidth : 24.543
#
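For anyone comparing these runs: the `algbw` and `busbw` columns can be reproduced from the reported time. A minimal sketch, assuming the nccl-tests convention for all-reduce that algbw is size divided by time and busbw is algbw scaled by 2(n-1)/n. The helper name `allreduce_bandwidth` is made up for illustration; the inputs are the out-of-place times from the three result rows above.

```python
def allreduce_bandwidth(size_bytes, time_us, n_ranks):
    """Return (algbw, busbw) in GB/s for one all-reduce measurement."""
    # bytes per microsecond equals MB/s; divide by 1e3 to get GB/s
    algbw = size_bytes / time_us / 1e3
    # all-reduce bus bandwidth correction factor: 2 * (n - 1) / n
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks
    return algbw, busbw


SIZE_BYTES = 33554432  # message size used in every run above

# (number of GPUs, out-of-place time in microseconds) copied from the tables
rows = [
    (3, 1988.0),
    (3, 1825.5),
    (4, 2037.7),
]

for n_ranks, time_us in rows:
    algbw, busbw = allreduce_bandwidth(SIZE_BYTES, time_us, n_ranks)
    print(f"{n_ranks} GPUs: algbw {algbw:.2f} GB/s, busbw {busbw:.2f} GB/s")
```

Because busbw already folds in the rank count, it is the column to compare between the 3-GPU and 4-GPU runs; algbw alone drops as ranks are added even when the links perform the same.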
myles@myles-System-Product-Name:~/nccl-tests/build$ NCCL_DEBUG=INFO NCCL_P2P_LEVEL=PHB ./all_reduce_perf -g 5
# nThread 1 nGpus 5 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 11960 on myles-System-Product-Name device 0 [0x01] NVIDIA GeForce RTX 4090
# Rank 1 Group 0 Pid 11960 on myles-System-Product-Name device 1 [0x02] NVIDIA GeForce RTX 4090
# Rank 2 Group 0 Pid 11960 on myles-System-Product-Name device 2 [0x2b] NVIDIA GeForce RTX 4090
# Rank 3 Group 0 Pid 11960 on myles-System-Product-Name device 3 [0x41] NVIDIA GeForce RTX 4090
# Rank 4 Group 0 Pid 11960 on myles-System-Product-Name device 4 [0x42] NVIDIA GeForce RTX 4090
myles-System-Product-Name:11960:11960 [0] NCCL INFO Bootstrap : Using enp37s0f0:192.168.1.32<0>
myles-System-Product-Name:11960:11960 [0] NCCL INFO cudaDriverVersion 12060
myles-System-Product-Name:11960:11960 [0] NCCL INFO NCCL version 2.23.4+cuda12.6
myles-System-Product-Name:11960:11995 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
myles-System-Product-Name:11960:11995 [0] NCCL INFO Failed to open libibverbs.so[.1]
myles-System-Product-Name:11960:11995 [0] NCCL INFO NET/Socket : Using [0]enp37s0f0:192.168.1.32<0> [1]enp37s0f1:192.168.1.47<0>
myles-System-Product-Name:11960:11995 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
myles-System-Product-Name:11960:11995 [0] NCCL INFO Using network Socket
myles-System-Product-Name:11960:11998 [3] NCCL INFO Using network Socket
myles-System-Product-Name:11960:11996 [1] NCCL INFO Using network Socket
myles-System-Product-Name:11960:11997 [2] NCCL INFO Using network Socket
myles-System-Product-Name:11960:11999 [4] NCCL INFO Using network Socket
myles-System-Product-Name:11960:11997 [2] NCCL INFO ncclCommInitAll comm 0x558ba9221be0 rank 2 nranks 5 cudaDev 2 nvmlDev 2 busId 2b000 commId 0x668cd3327447b9bd - Init START
myles-System-Product-Name:11960:11998 [3] NCCL INFO ncclCommInitAll comm 0x558ba9263e20 rank 3 nranks 5 cudaDev 3 nvmlDev 3 busId 41000 commId 0x668cd3327447b9bd - Init START
myles-System-Product-Name:11960:11995 [0] NCCL INFO ncclCommInitAll comm 0x558ba919d800 rank 0 nranks 5 cudaDev 0 nvmlDev 0 busId 1000 commId 0x668cd3327447b9bd - Init START
myles-System-Product-Name:11960:11996 [1] NCCL INFO ncclCommInitAll comm 0x558ba91df9a0 rank 1 nranks 5 cudaDev 1 nvmlDev 1 busId 2000 commId 0x668cd3327447b9bd - Init START
myles-System-Product-Name:11960:11999 [4] NCCL INFO ncclCommInitAll comm 0x558ba92a6060 rank 4 nranks 5 cudaDev 4 nvmlDev 4 busId 42000 commId 0x668cd3327447b9bd - Init START
myles-System-Product-Name:11960:11997 [2] NCCL INFO Bootstrap timings total 0.001402 (create 0.000057, send 0.000163, recv 0.000489, ring 0.000305, delay 0.000000)
myles-System-Product-Name:11960:11998 [3] NCCL INFO Bootstrap timings total 0.001350 (create 0.000054, send 0.000167, recv 0.000764, ring 0.000161, delay 0.000000)
myles-System-Product-Name:11960:11999 [4] NCCL INFO Bootstrap timings total 0.001252 (create 0.000056, send 0.000167, recv 0.000707, ring 0.000123, delay 0.000000)
myles-System-Product-Name:11960:11996 [1] NCCL INFO Bootstrap timings total 0.001282 (create 0.000054, send 0.000162, recv 0.000612, ring 0.000265, delay 0.000000)
myles-System-Product-Name:11960:11995 [0] NCCL INFO Bootstrap timings total 0.001315 (create 0.000047, send 0.000164, recv 0.000581, ring 0.000144, delay 0.000000)
myles-System-Product-Name:11960:11997 [2] NCCL INFO NCCL_P2P_LEVEL set by environment to PHB
myles-System-Product-Name:11960:11997 [2] NCCL INFO NVLS multicast support is not available on dev 2
myles-System-Product-Name:11960:11995 [0] NCCL INFO NVLS multicast support is not available on dev 0
myles-System-Product-Name:11960:11998 [3] NCCL INFO NVLS multicast support is not available on dev 3
myles-System-Product-Name:11960:11999 [4] NCCL INFO NVLS multicast support is not available on dev 4
myles-System-Product-Name:11960:11996 [1] NCCL INFO NVLS multicast support is not available on dev 1
myles-System-Product-Name:11960:11997 [2] NCCL INFO comm 0x558ba9221be0 rank 2 nRanks 5 nNodes 1 localRanks 5 localRank 2 MNNVL 0
myles-System-Product-Name:11960:11996 [1] NCCL INFO comm 0x558ba91df9a0 rank 1 nRanks 5 nNodes 1 localRanks 5 localRank 1 MNNVL 0
myles-System-Product-Name:11960:11999 [4] NCCL INFO comm 0x558ba92a6060 rank 4 nRanks 5 nNodes 1 localRanks 5 localRank 4 MNNVL 0
myles-System-Product-Name:11960:11996 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0
myles-System-Product-Name:11960:11998 [3] NCCL INFO comm 0x558ba9263e20 rank 3 nRanks 5 nNodes 1 localRanks 5 localRank 3 MNNVL 0
myles-System-Product-Name:11960:11995 [0] NCCL INFO comm 0x558ba919d800 rank 0 nRanks 5 nNodes 1 localRanks 5 localRank 0 MNNVL 0
myles-System-Product-Name:11960:11998 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2
myles-System-Product-Name:11960:11998 [3] NCCL INFO P2P Chunksize set to 131072
myles-System-Product-Name:11960:11999 [4] NCCL INFO Trees [0] -1/-1/-1->4->3 [1] -1/-1/-1->4->3 [2] -1/-1/-1->4->3 [3] -1/-1/-1->4->3
myles-System-Product-Name:11960:11999 [4] NCCL INFO P2P Chunksize set to 131072
myles-System-Product-Name:11960:11996 [1] NCCL INFO P2P Chunksize set to 131072
myles-System-Product-Name:11960:11997 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1
myles-System-Product-Name:11960:11997 [2] NCCL INFO P2P Chunksize set to 131072
myles-System-Product-Name:11960:11995 [0] NCCL INFO Channel 00/04 : 0 1 2 3 4
myles-System-Product-Name:11960:11995 [0] NCCL INFO Channel 01/04 : 0 1 2 3 4
myles-System-Product-Name:11960:11995 [0] NCCL INFO Channel 02/04 : 0 1 2 3 4
myles-System-Product-Name:11960:11995 [0] NCCL INFO Channel 03/04 : 0 1 2 3 4
myles-System-Product-Name:11960:11995 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1
myles-System-Product-Name:11960:11995 [0] NCCL INFO P2P Chunksize set to 131072
myles-System-Product-Name:11960:12002 [3] NCCL INFO [Proxy Service] Device 3 CPU core 4
myles-System-Product-Name:11960:12009 [3] NCCL INFO [Proxy Service UDS] Device 3 CPU core 32
myles-System-Product-Name:11960:12010 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 39
myles-System-Product-Name:11960:12005 [1] NCCL INFO [Proxy Service] Device 1 CPU core 84
myles-System-Product-Name:11960:12006 [4] NCCL INFO [Proxy Service] Device 4 CPU core 85
myles-System-Product-Name:11960:12007 [2] NCCL INFO [Proxy Service UDS] Device 2 CPU core 9
myles-System-Product-Name:11960:12004 [0] NCCL INFO [Proxy Service] Device 0 CPU core 12
myles-System-Product-Name:11960:12003 [2] NCCL INFO [Proxy Service] Device 2 CPU core 79
myles-System-Product-Name:11960:12011 [4] NCCL INFO [Proxy Service UDS] Device 4 CPU core 47
myles-System-Product-Name:11960:12008 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 24
myles-System-Product-Name:11960:11997 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/direct pointer
myles-System-Product-Name:11960:11999 [4] NCCL INFO Channel 00/0 : 4[4] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:11960:11995 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:11960:11999 [4] NCCL INFO Channel 01/0 : 4[4] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:11960:11998 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/direct pointer
myles-System-Product-Name:11960:11996 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/direct pointer
myles-System-Product-Name:11960:11997 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/direct pointer
myles-System-Product-Name:11960:11995 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:11960:11998 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/direct pointer
myles-System-Product-Name:11960:11999 [4] NCCL INFO Channel 02/0 : 4[4] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:11960:11996 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/direct pointer
myles-System-Product-Name:11960:11995 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:11960:11997 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/direct pointer
myles-System-Product-Name:11960:11996 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/direct pointer
myles-System-Product-Name:11960:11998 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/direct pointer
myles-System-Product-Name:11960:11995 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:11960:11999 [4] NCCL INFO Channel 03/0 : 4[4] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:11960:11997 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/direct pointer
myles-System-Product-Name:11960:11996 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/direct pointer
myles-System-Product-Name:11960:11998 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/direct pointer
myles-System-Product-Name:11960:11997 [2] NCCL INFO Connected all rings
myles-System-Product-Name:11960:11995 [0] NCCL INFO Connected all rings
myles-System-Product-Name:11960:11999 [4] NCCL INFO Connected all rings
myles-System-Product-Name:11960:11998 [3] NCCL INFO Connected all rings
myles-System-Product-Name:11960:11999 [4] NCCL INFO Channel 00/0 : 4[4] -> 3[3] via P2P/direct pointer
myles-System-Product-Name:11960:11996 [1] NCCL INFO Connected all rings
myles-System-Product-Name:11960:11999 [4] NCCL INFO Channel 01/0 : 4[4] -> 3[3] via P2P/direct pointer
myles-System-Product-Name:11960:11999 [4] NCCL INFO Channel 02/0 : 4[4] -> 3[3] via P2P/direct pointer
myles-System-Product-Name:11960:11999 [4] NCCL INFO Channel 03/0 : 4[4] -> 3[3] via P2P/direct pointer
myles-System-Product-Name:11960:11997 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:11960:11998 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/direct pointer
myles-System-Product-Name:11960:11996 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:11960:11997 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:11960:11998 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/direct pointer
myles-System-Product-Name:11960:11996 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:11960:11997 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:11960:11998 [3] NCCL INFO Channel 02/0 : 3[3] -> 2[2] via P2P/direct pointer
myles-System-Product-Name:11960:11996 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:11960:11997 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:11960:11998 [3] NCCL INFO Channel 03/0 : 3[3] -> 2[2] via P2P/direct pointer
myles-System-Product-Name:11960:11996 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:11960:11999 [4] NCCL INFO Connected all trees
myles-System-Product-Name:11960:11998 [3] NCCL INFO Connected all trees
myles-System-Product-Name:11960:11995 [0] NCCL INFO Connected all trees
myles-System-Product-Name:11960:11997 [2] NCCL INFO Connected all trees
myles-System-Product-Name:11960:11996 [1] NCCL INFO Connected all trees
myles-System-Product-Name:11960:12012 [3] NCCL INFO [Proxy Progress] Device 3 CPU core 15
myles-System-Product-Name:11960:12013 [0] NCCL INFO [Proxy Progress] Device 0 CPU core 84
myles-System-Product-Name:11960:12014 [2] NCCL INFO [Proxy Progress] Device 2 CPU core 3
myles-System-Product-Name:11960:12015 [4] NCCL INFO [Proxy Progress] Device 4 CPU core 28
myles-System-Product-Name:11960:12016 [1] NCCL INFO [Proxy Progress] Device 1 CPU core 81
myles-System-Product-Name:11960:11998 [3] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 512 | 512
myles-System-Product-Name:11960:11998 [3] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
myles-System-Product-Name:11960:11997 [2] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 512 | 512
myles-System-Product-Name:11960:11997 [2] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
myles-System-Product-Name:11960:11995 [0] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 512 | 512
myles-System-Product-Name:11960:11995 [0] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
myles-System-Product-Name:11960:11999 [4] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 512 | 512
myles-System-Product-Name:11960:11999 [4] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
myles-System-Product-Name:11960:11996 [1] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 512 | 512
myles-System-Product-Name:11960:11996 [1] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
myles-System-Product-Name:11960:11995 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
myles-System-Product-Name:11960:11998 [3] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
myles-System-Product-Name:11960:11998 [3] NCCL INFO ncclCommInitAll comm 0x558ba9263e20 rank 3 nranks 5 cudaDev 3 nvmlDev 3 busId 41000 commId 0x668cd3327447b9bd - Init COMPLETE
myles-System-Product-Name:11960:11998 [3] NCCL INFO Init timings - ncclCommInitAll: rank 3 nranks 5 total 0.53 (kernels 0.38, alloc 0.06, bootstrap 0.00, allgathers 0.01, topo 0.03, graphs 0.00, connections 0.06, rest 0.00)
myles-System-Product-Name:11960:11996 [1] NCCL INFO ncclCommInitAll comm 0x558ba91df9a0 rank 1 nranks 5 cudaDev 1 nvmlDev 1 busId 2000 commId 0x668cd3327447b9bd - Init COMPLETE
myles-System-Product-Name:11960:11997 [2] NCCL INFO ncclCommInitAll comm 0x558ba9221be0 rank 2 nranks 5 cudaDev 2 nvmlDev 2 busId 2b000 commId 0x668cd3327447b9bd - Init COMPLETE
myles-System-Product-Name:11960:11999 [4] NCCL INFO ncclCommInitAll comm 0x558ba92a6060 rank 4 nranks 5 cudaDev 4 nvmlDev 4 busId 42000 commId 0x668cd3327447b9bd - Init COMPLETE
myles-System-Product-Name:11960:11996 [1] NCCL INFO Init timings - ncclCommInitAll: rank 1 nranks 5 total 0.53 (kernels 0.38, alloc 0.06, bootstrap 0.00, allgathers 0.00, topo 0.03, graphs 0.00, connections 0.06, rest 0.00)
myles-System-Product-Name:11960:11999 [4] NCCL INFO Init timings - ncclCommInitAll: rank 4 nranks 5 total 0.53 (kernels 0.38, alloc 0.05, bootstrap 0.00, allgathers 0.00, topo 0.03, graphs 0.00, connections 0.06, rest 0.00)
myles-System-Product-Name:11960:11995 [0] NCCL INFO ncclCommInitAll comm 0x558ba919d800 rank 0 nranks 5 cudaDev 0 nvmlDev 0 busId 1000 commId 0x668cd3327447b9bd - Init COMPLETE
myles-System-Product-Name:11960:11995 [0] NCCL INFO Init timings - ncclCommInitAll: rank 0 nranks 5 total 0.53 (kernels 0.37, alloc 0.06, bootstrap 0.00, allgathers 0.01, topo 0.03, graphs 0.00, connections 0.06, rest 0.00)
myles-System-Product-Name:11960:11997 [2] NCCL INFO Init timings - ncclCommInitAll: rank 2 nranks 5 total 0.53 (kernels 0.38, alloc 0.06, bootstrap 0.00, allgathers 0.01, topo 0.03, graphs 0.00, connections 0.06, rest 0.00)
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
33554432 8388608 float sum -1 2174.8 15.43 24.69 0 2170.2 15.46 24.74 0
myles-System-Product-Name:11960:11960 [0] NCCL INFO comm 0x558ba919d800 rank 0 nranks 5 cudaDev 0 busId 1000 - Destroy COMPLETE
myles-System-Product-Name:11960:11960 [4] NCCL INFO comm 0x558ba92a6060 rank 4 nranks 5 cudaDev 4 busId 42000 - Destroy COMPLETE
myles-System-Product-Name:11960:11960 [3] NCCL INFO comm 0x558ba9263e20 rank 3 nranks 5 cudaDev 3 busId 41000 - Destroy COMPLETE
myles-System-Product-Name:11960:11960 [2] NCCL INFO comm 0x558ba9221be0 rank 2 nranks 5 cudaDev 2 busId 2b000 - Destroy COMPLETE
myles-System-Product-Name:11960:11960 [1] NCCL INFO comm 0x558ba91df9a0 rank 1 nranks 5 cudaDev 1 busId 2000 - Destroy COMPLETE
# Out of bounds values : 0 OK
# Avg bus bandwidth : 24.7118
#
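For anyone comparing runs: the `via P2P/direct pointer` suffix on the `Channel` lines above is what shows that the rings and trees really use P2P in this run. Below is a rough sketch for tallying those transports from a saved `NCCL_DEBUG=INFO` log. The helper name `count_transports` and the regex are my own assumptions based on the line format in this paste, not anything NCCL ships.

```python
import re
from collections import Counter

# Matches per-connection lines such as
#   "... NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/direct pointer"
# but not ring-order lines such as "Channel 00/04 : 0 1 2 3 4".
# The pattern is an assumption derived from the log format in this thread.
CHANNEL_RE = re.compile(r"Channel \d+/\d+ : \d+\[\d+\] -> \d+\[\d+\] via (.+)")


def count_transports(log_text):
    """Count how many channel connections use each transport string."""
    counts = Counter()
    for line in log_text.splitlines():
        match = CHANNEL_RE.search(line)
        if match:
            counts[match.group(1).strip()] += 1
    return counts


# Three lines copied from the 5-GPU run above; the last one is a ring-order
# line and should be ignored.
sample = """\
myles-System-Product-Name:11960:11997 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/direct pointer
myles-System-Product-Name:11960:11999 [4] NCCL INFO Channel 00/0 : 4[4] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:11960:11995 [0] NCCL INFO Channel 00/04 : 0 1 2 3 4
"""

if __name__ == "__main__":
    for transport, n in count_transports(sample).items():
        print(f"{transport}: {n}")
```

Running the same tally on a log captured with P2P disabled should make it easy to see whether the transport actually changed between the two configurations.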
myles@myles-System-Product-Name:~/nccl-tests/build$ NCCL_DEBUG=INFO NCCL_P2P_LEVEL=PHB ./all_reduce_perf -g 6
# nThread 1 nGpus 6 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 12019 on myles-System-Product-Name device 0 [0x01] NVIDIA GeForce RTX 4090
# Rank 1 Group 0 Pid 12019 on myles-System-Product-Name device 1 [0x02] NVIDIA GeForce RTX 4090
# Rank 2 Group 0 Pid 12019 on myles-System-Product-Name device 2 [0x2b] NVIDIA GeForce RTX 4090
# Rank 3 Group 0 Pid 12019 on myles-System-Product-Name device 3 [0x41] NVIDIA GeForce RTX 4090
# Rank 4 Group 0 Pid 12019 on myles-System-Product-Name device 4 [0x42] NVIDIA GeForce RTX 4090
# Rank 5 Group 0 Pid 12019 on myles-System-Product-Name device 5 [0x61] NVIDIA GeForce RTX 4090
myles-System-Product-Name:12019:12019 [0] NCCL INFO Bootstrap : Using enp37s0f0:192.168.1.32<0>
myles-System-Product-Name:12019:12019 [0] NCCL INFO cudaDriverVersion 12060
myles-System-Product-Name:12019:12019 [0] NCCL INFO NCCL version 2.23.4+cuda12.6
myles-System-Product-Name:12019:12056 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
myles-System-Product-Name:12019:12056 [0] NCCL INFO Failed to open libibverbs.so[.1]
myles-System-Product-Name:12019:12056 [0] NCCL INFO NET/Socket : Using [0]enp37s0f0:192.168.1.32<0> [1]enp37s0f1:192.168.1.47<0>
myles-System-Product-Name:12019:12056 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
myles-System-Product-Name:12019:12056 [0] NCCL INFO Using network Socket
myles-System-Product-Name:12019:12057 [1] NCCL INFO Using network Socket
myles-System-Product-Name:12019:12061 [5] NCCL INFO Using network Socket
myles-System-Product-Name:12019:12060 [4] NCCL INFO Using network Socket
myles-System-Product-Name:12019:12059 [3] NCCL INFO Using network Socket
myles-System-Product-Name:12019:12058 [2] NCCL INFO Using network Socket
myles-System-Product-Name:12019:12060 [4] NCCL INFO ncclCommInitAll comm 0x63acc7870060 rank 4 nranks 6 cudaDev 4 nvmlDev 4 busId 42000 commId 0xbc91574eaba2750d - Init START
myles-System-Product-Name:12019:12058 [2] NCCL INFO ncclCommInitAll comm 0x63acc77ea2e0 rank 2 nranks 6 cudaDev 2 nvmlDev 2 busId 2b000 commId 0xbc91574eaba2750d - Init START
myles-System-Product-Name:12019:12057 [1] NCCL INFO ncclCommInitAll comm 0x63acc77a7420 rank 1 nranks 6 cudaDev 1 nvmlDev 1 busId 2000 commId 0xbc91574eaba2750d - Init START
myles-System-Product-Name:12019:12056 [0] NCCL INFO ncclCommInitAll comm 0x63acc7764600 rank 0 nranks 6 cudaDev 0 nvmlDev 0 busId 1000 commId 0xbc91574eaba2750d - Init START
myles-System-Product-Name:12019:12061 [5] NCCL INFO ncclCommInitAll comm 0x63acc78b2f20 rank 5 nranks 6 cudaDev 5 nvmlDev 5 busId 61000 commId 0xbc91574eaba2750d - Init START
myles-System-Product-Name:12019:12059 [3] NCCL INFO ncclCommInitAll comm 0x63acc782d1a0 rank 3 nranks 6 cudaDev 3 nvmlDev 3 busId 41000 commId 0xbc91574eaba2750d - Init START
myles-System-Product-Name:12019:12056 [0] NCCL INFO Bootstrap timings total 0.001487 (create 0.000049, send 0.000171, recv 0.000441, ring 0.000413, delay 0.000000)
myles-System-Product-Name:12019:12057 [1] NCCL INFO Bootstrap timings total 0.001549 (create 0.000063, send 0.000173, recv 0.000346, ring 0.000623, delay 0.000000)
myles-System-Product-Name:12019:12061 [5] NCCL INFO Bootstrap timings total 0.001454 (create 0.000057, send 0.000170, recv 0.000636, ring 0.000397, delay 0.000000)
myles-System-Product-Name:12019:12058 [2] NCCL INFO Bootstrap timings total 0.001582 (create 0.000060, send 0.000179, recv 0.000907, ring 0.000228, delay 0.000000)
myles-System-Product-Name:12019:12059 [3] NCCL INFO Bootstrap timings total 0.001435 (create 0.000058, send 0.000176, recv 0.000865, ring 0.000142, delay 0.000000)
myles-System-Product-Name:12019:12060 [4] NCCL INFO Bootstrap timings total 0.001640 (create 0.000060, send 0.000176, recv 0.000699, ring 0.000186, delay 0.000001)
myles-System-Product-Name:12019:12060 [4] NCCL INFO NCCL_P2P_LEVEL set by environment to PHB
myles-System-Product-Name:12019:12060 [4] NCCL INFO NVLS multicast support is not available on dev 4
myles-System-Product-Name:12019:12056 [0] NCCL INFO NVLS multicast support is not available on dev 0
myles-System-Product-Name:12019:12059 [3] NCCL INFO NVLS multicast support is not available on dev 3
myles-System-Product-Name:12019:12061 [5] NCCL INFO NVLS multicast support is not available on dev 5
myles-System-Product-Name:12019:12057 [1] NCCL INFO NVLS multicast support is not available on dev 1
myles-System-Product-Name:12019:12058 [2] NCCL INFO NVLS multicast support is not available on dev 2
myles-System-Product-Name:12019:12056 [0] NCCL INFO comm 0x63acc7764600 rank 0 nRanks 6 nNodes 1 localRanks 6 localRank 0 MNNVL 0
myles-System-Product-Name:12019:12060 [4] NCCL INFO comm 0x63acc7870060 rank 4 nRanks 6 nNodes 1 localRanks 6 localRank 4 MNNVL 0
myles-System-Product-Name:12019:12058 [2] NCCL INFO comm 0x63acc77ea2e0 rank 2 nRanks 6 nNodes 1 localRanks 6 localRank 2 MNNVL 0
myles-System-Product-Name:12019:12060 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3
myles-System-Product-Name:12019:12060 [4] NCCL INFO P2P Chunksize set to 131072
myles-System-Product-Name:12019:12056 [0] NCCL INFO Channel 00/04 : 0 1 2 3 4 5
myles-System-Product-Name:12019:12058 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1
myles-System-Product-Name:12019:12058 [2] NCCL INFO P2P Chunksize set to 131072
myles-System-Product-Name:12019:12061 [5] NCCL INFO comm 0x63acc78b2f20 rank 5 nRanks 6 nNodes 1 localRanks 6 localRank 5 MNNVL 0
myles-System-Product-Name:12019:12057 [1] NCCL INFO comm 0x63acc77a7420 rank 1 nRanks 6 nNodes 1 localRanks 6 localRank 1 MNNVL 0
myles-System-Product-Name:12019:12059 [3] NCCL INFO comm 0x63acc782d1a0 rank 3 nRanks 6 nNodes 1 localRanks 6 localRank 3 MNNVL 0
myles-System-Product-Name:12019:12056 [0] NCCL INFO Channel 01/04 : 0 1 2 3 4 5
myles-System-Product-Name:12019:12056 [0] NCCL INFO Channel 02/04 : 0 1 2 3 4 5
myles-System-Product-Name:12019:12059 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2
myles-System-Product-Name:12019:12061 [5] NCCL INFO Trees [0] -1/-1/-1->5->4 [1] -1/-1/-1->5->4 [2] -1/-1/-1->5->4 [3] -1/-1/-1->5->4
myles-System-Product-Name:12019:12061 [5] NCCL INFO P2P Chunksize set to 131072
myles-System-Product-Name:12019:12059 [3] NCCL INFO P2P Chunksize set to 131072
myles-System-Product-Name:12019:12057 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0
myles-System-Product-Name:12019:12057 [1] NCCL INFO P2P Chunksize set to 131072
myles-System-Product-Name:12019:12056 [0] NCCL INFO Channel 03/04 : 0 1 2 3 4 5
myles-System-Product-Name:12019:12056 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1
myles-System-Product-Name:12019:12056 [0] NCCL INFO P2P Chunksize set to 131072
myles-System-Product-Name:12019:12066 [4] NCCL INFO [Proxy Service UDS] Device 4 CPU core 30
myles-System-Product-Name:12019:12065 [2] NCCL INFO [Proxy Service] Device 2 CPU core 61
myles-System-Product-Name:12019:12064 [4] NCCL INFO [Proxy Service] Device 4 CPU core 25
myles-System-Product-Name:12019:12071 [3] NCCL INFO [Proxy Service UDS] Device 3 CPU core 44
myles-System-Product-Name:12019:12074 [5] NCCL INFO [Proxy Service UDS] Device 5 CPU core 118
myles-System-Product-Name:12019:12069 [0] NCCL INFO [Proxy Service] Device 0 CPU core 115
myles-System-Product-Name:12019:12068 [5] NCCL INFO [Proxy Service] Device 5 CPU core 48
myles-System-Product-Name:12019:12072 [1] NCCL INFO [Proxy Service] Device 1 CPU core 19
myles-System-Product-Name:12019:12067 [3] NCCL INFO [Proxy Service] Device 3 CPU core 32
myles-System-Product-Name:12019:12073 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 9
myles-System-Product-Name:12019:12070 [2] NCCL INFO [Proxy Service UDS] Device 2 CPU core 103
myles-System-Product-Name:12019:12075 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 4
myles-System-Product-Name:12019:12056 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:12019:12060 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/direct pointer
myles-System-Product-Name:12019:12056 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:12019:12060 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/direct pointer
myles-System-Product-Name:12019:12058 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/direct pointer
myles-System-Product-Name:12019:12056 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:12019:12060 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/direct pointer
myles-System-Product-Name:12019:12059 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/direct pointer
myles-System-Product-Name:12019:12056 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:12019:12058 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/direct pointer
myles-System-Product-Name:12019:12057 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/direct pointer
myles-System-Product-Name:12019:12059 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/direct pointer
myles-System-Product-Name:12019:12061 [5] NCCL INFO Channel 00/0 : 5[5] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:12019:12060 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/direct pointer
myles-System-Product-Name:12019:12058 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/direct pointer
myles-System-Product-Name:12019:12061 [5] NCCL INFO Channel 01/0 : 5[5] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:12019:12057 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/direct pointer
myles-System-Product-Name:12019:12059 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/direct pointer
myles-System-Product-Name:12019:12058 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/direct pointer
myles-System-Product-Name:12019:12059 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/direct pointer
myles-System-Product-Name:12019:12061 [5] NCCL INFO Channel 02/0 : 5[5] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:12019:12057 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/direct pointer
myles-System-Product-Name:12019:12057 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/direct pointer
myles-System-Product-Name:12019:12061 [5] NCCL INFO Channel 03/0 : 5[5] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:12019:12058 [2] NCCL INFO Connected all rings
myles-System-Product-Name:12019:12059 [3] NCCL INFO Connected all rings
myles-System-Product-Name:12019:12060 [4] NCCL INFO Connected all rings
myles-System-Product-Name:12019:12057 [1] NCCL INFO Connected all rings
myles-System-Product-Name:12019:12061 [5] NCCL INFO Connected all rings
myles-System-Product-Name:12019:12056 [0] NCCL INFO Connected all rings
myles-System-Product-Name:12019:12061 [5] NCCL INFO Channel 00/0 : 5[5] -> 4[4] via P2P/direct pointer
myles-System-Product-Name:12019:12061 [5] NCCL INFO Channel 01/0 : 5[5] -> 4[4] via P2P/direct pointer
myles-System-Product-Name:12019:12061 [5] NCCL INFO Channel 02/0 : 5[5] -> 4[4] via P2P/direct pointer
myles-System-Product-Name:12019:12058 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:12019:12061 [5] NCCL INFO Channel 03/0 : 5[5] -> 4[4] via P2P/direct pointer
myles-System-Product-Name:12019:12059 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/direct pointer
myles-System-Product-Name:12019:12058 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:12019:12060 [4] NCCL INFO Channel 00/0 : 4[4] -> 3[3] via P2P/direct pointer
myles-System-Product-Name:12019:12057 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:12019:12059 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/direct pointer
myles-System-Product-Name:12019:12058 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:12019:12060 [4] NCCL INFO Channel 01/0 : 4[4] -> 3[3] via P2P/direct pointer
myles-System-Product-Name:12019:12057 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:12019:12059 [3] NCCL INFO Channel 02/0 : 3[3] -> 2[2] via P2P/direct pointer
myles-System-Product-Name:12019:12058 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:12019:12060 [4] NCCL INFO Channel 02/0 : 4[4] -> 3[3] via P2P/direct pointer
myles-System-Product-Name:12019:12057 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:12019:12059 [3] NCCL INFO Channel 03/0 : 3[3] -> 2[2] via P2P/direct pointer
myles-System-Product-Name:12019:12060 [4] NCCL INFO Channel 03/0 : 4[4] -> 3[3] via P2P/direct pointer
myles-System-Product-Name:12019:12057 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:12019:12061 [5] NCCL INFO Connected all trees
myles-System-Product-Name:12019:12060 [4] NCCL INFO Connected all trees
myles-System-Product-Name:12019:12059 [3] NCCL INFO Connected all trees
myles-System-Product-Name:12019:12056 [0] NCCL INFO Connected all trees
myles-System-Product-Name:12019:12058 [2] NCCL INFO Connected all trees
myles-System-Product-Name:12019:12057 [1] NCCL INFO Connected all trees
myles-System-Product-Name:12019:12076 [5] NCCL INFO [Proxy Progress] Device 5 CPU core 25
myles-System-Product-Name:12019:12077 [4] NCCL INFO [Proxy Progress] Device 4 CPU core 32
myles-System-Product-Name:12019:12078 [2] NCCL INFO [Proxy Progress] Device 2 CPU core 115
myles-System-Product-Name:12019:12079 [1] NCCL INFO [Proxy Progress] Device 1 CPU core 124
myles-System-Product-Name:12019:12080 [0] NCCL INFO [Proxy Progress] Device 0 CPU core 42
myles-System-Product-Name:12019:12081 [3] NCCL INFO [Proxy Progress] Device 3 CPU core 5
myles-System-Product-Name:12019:12061 [5] NCCL INFO threadThresholds 8/8/64 | 48/8/64 | 512 | 512
myles-System-Product-Name:12019:12061 [5] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
myles-System-Product-Name:12019:12060 [4] NCCL INFO threadThresholds 8/8/64 | 48/8/64 | 512 | 512
myles-System-Product-Name:12019:12060 [4] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
myles-System-Product-Name:12019:12058 [2] NCCL INFO threadThresholds 8/8/64 | 48/8/64 | 512 | 512
myles-System-Product-Name:12019:12058 [2] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
myles-System-Product-Name:12019:12057 [1] NCCL INFO threadThresholds 8/8/64 | 48/8/64 | 512 | 512
myles-System-Product-Name:12019:12057 [1] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
myles-System-Product-Name:12019:12056 [0] NCCL INFO threadThresholds 8/8/64 | 48/8/64 | 512 | 512
myles-System-Product-Name:12019:12056 [0] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
myles-System-Product-Name:12019:12059 [3] NCCL INFO threadThresholds 8/8/64 | 48/8/64 | 512 | 512
myles-System-Product-Name:12019:12059 [3] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
myles-System-Product-Name:12019:12056 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
myles-System-Product-Name:12019:12060 [4] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
myles-System-Product-Name:12019:12060 [4] NCCL INFO ncclCommInitAll comm 0x63acc7870060 rank 4 nranks 6 cudaDev 4 nvmlDev 4 busId 42000 commId 0xbc91574eaba2750d - Init COMPLETE
myles-System-Product-Name:12019:12060 [4] NCCL INFO Init timings - ncclCommInitAll: rank 4 nranks 6 total 0.62 (kernels 0.44, alloc 0.06, bootstrap 0.00, allgathers 0.01, topo 0.04, graphs 0.00, connections 0.07, rest 0.00)
myles-System-Product-Name:12019:12057 [1] NCCL INFO ncclCommInitAll comm 0x63acc77a7420 rank 1 nranks 6 cudaDev 1 nvmlDev 1 busId 2000 commId 0xbc91574eaba2750d - Init COMPLETE
myles-System-Product-Name:12019:12061 [5] NCCL INFO ncclCommInitAll comm 0x63acc78b2f20 rank 5 nranks 6 cudaDev 5 nvmlDev 5 busId 61000 commId 0xbc91574eaba2750d - Init COMPLETE
myles-System-Product-Name:12019:12056 [0] NCCL INFO ncclCommInitAll comm 0x63acc7764600 rank 0 nranks 6 cudaDev 0 nvmlDev 0 busId 1000 commId 0xbc91574eaba2750d - Init COMPLETE
myles-System-Product-Name:12019:12059 [3] NCCL INFO ncclCommInitAll comm 0x63acc782d1a0 rank 3 nranks 6 cudaDev 3 nvmlDev 3 busId 41000 commId 0xbc91574eaba2750d - Init COMPLETE
myles-System-Product-Name:12019:12056 [0] NCCL INFO Init timings - ncclCommInitAll: rank 0 nranks 6 total 0.62 (kernels 0.43, alloc 0.06, bootstrap 0.00, allgathers 0.01, topo 0.04, graphs 0.00, connections 0.07, rest 0.00)
myles-System-Product-Name:12019:12061 [5] NCCL INFO Init timings - ncclCommInitAll: rank 5 nranks 6 total 0.62 (kernels 0.44, alloc 0.06, bootstrap 0.00, allgathers 0.00, topo 0.05, graphs 0.01, connections 0.07, rest 0.00)
myles-System-Product-Name:12019:12058 [2] NCCL INFO ncclCommInitAll comm 0x63acc77ea2e0 rank 2 nranks 6 cudaDev 2 nvmlDev 2 busId 2b000 commId 0xbc91574eaba2750d - Init COMPLETE
myles-System-Product-Name:12019:12057 [1] NCCL INFO Init timings - ncclCommInitAll: rank 1 nranks 6 total 0.62 (kernels 0.44, alloc 0.06, bootstrap 0.00, allgathers 0.00, topo 0.05, graphs 0.01, connections 0.07, rest 0.00)
myles-System-Product-Name:12019:12058 [2] NCCL INFO Init timings - ncclCommInitAll: rank 2 nranks 6 total 0.62 (kernels 0.45, alloc 0.05, bootstrap 0.00, allgathers 0.00, topo 0.05, graphs 0.01, connections 0.07, rest 0.00)
myles-System-Product-Name:12019:12059 [3] NCCL INFO Init timings - ncclCommInitAll: rank 3 nranks 6 total 0.62 (kernels 0.44, alloc 0.06, bootstrap 0.00, allgathers 0.01, topo 0.04, graphs 0.01, connections 0.07, rest 0.00)
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
33554432 8388608 float sum -1 2274.8 14.75 24.58 0 2279.9 14.72 24.53 0
myles-System-Product-Name:12019:12019 [0] NCCL INFO comm 0x63acc7764600 rank 0 nranks 6 cudaDev 0 busId 1000 - Destroy COMPLETE
myles-System-Product-Name:12019:12019 [5] NCCL INFO comm 0x63acc78b2f20 rank 5 nranks 6 cudaDev 5 busId 61000 - Destroy COMPLETE
myles-System-Product-Name:12019:12019 [4] NCCL INFO comm 0x63acc7870060 rank 4 nranks 6 cudaDev 4 busId 42000 - Destroy COMPLETE
myles-System-Product-Name:12019:12019 [3] NCCL INFO comm 0x63acc782d1a0 rank 3 nranks 6 cudaDev 3 busId 41000 - Destroy COMPLETE
myles-System-Product-Name:12019:12019 [2] NCCL INFO comm 0x63acc77ea2e0 rank 2 nranks 6 cudaDev 2 busId 2b000 - Destroy COMPLETE
myles-System-Product-Name:12019:12019 [1] NCCL INFO comm 0x63acc77a7420 rank 1 nranks 6 cudaDev 1 busId 2000 - Destroy COMPLETE
# Out of bounds values : 0 OK
# Avg bus bandwidth : 24.557
#
myles@myles-System-Product-Name:~/nccl-tests/build$ NCCL_DEBUG=INFO NCCL_P2P_LEVEL=PHB ./all_reduce_perf -g 7
# nThread 1 nGpus 7 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 12085 on myles-System-Product-Name device 0 [0x01] NVIDIA GeForce RTX 4090
# Rank 1 Group 0 Pid 12085 on myles-System-Product-Name device 1 [0x02] NVIDIA GeForce RTX 4090
# Rank 2 Group 0 Pid 12085 on myles-System-Product-Name device 2 [0x2b] NVIDIA GeForce RTX 4090
# Rank 3 Group 0 Pid 12085 on myles-System-Product-Name device 3 [0x41] NVIDIA GeForce RTX 4090
# Rank 4 Group 0 Pid 12085 on myles-System-Product-Name device 4 [0x42] NVIDIA GeForce RTX 4090
# Rank 5 Group 0 Pid 12085 on myles-System-Product-Name device 5 [0x61] NVIDIA GeForce RTX 4090
# Rank 6 Group 0 Pid 12085 on myles-System-Product-Name device 6 [0x62] NVIDIA GeForce RTX 4090
myles-System-Product-Name:12085:12085 [0] NCCL INFO Bootstrap : Using enp37s0f0:192.168.1.32<0>
myles-System-Product-Name:12085:12085 [0] NCCL INFO cudaDriverVersion 12060
myles-System-Product-Name:12085:12085 [0] NCCL INFO NCCL version 2.23.4+cuda12.6
myles-System-Product-Name:12085:12128 [4] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
myles-System-Product-Name:12085:12128 [4] NCCL INFO Failed to open libibverbs.so[.1]
myles-System-Product-Name:12085:12128 [4] NCCL INFO NET/Socket : Using [0]enp37s0f0:192.168.1.32<0> [1]enp37s0f1:192.168.1.47<0>
myles-System-Product-Name:12085:12128 [4] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
myles-System-Product-Name:12085:12128 [4] NCCL INFO Using network Socket
myles-System-Product-Name:12085:12129 [5] NCCL INFO Using network Socket
myles-System-Product-Name:12085:12124 [0] NCCL INFO Using network Socket
myles-System-Product-Name:12085:12126 [2] NCCL INFO Using network Socket
myles-System-Product-Name:12085:12127 [3] NCCL INFO Using network Socket
myles-System-Product-Name:12085:12125 [1] NCCL INFO Using network Socket
myles-System-Product-Name:12085:12130 [6] NCCL INFO Using network Socket
myles-System-Product-Name:12085:12129 [5] NCCL INFO ncclCommInitAll comm 0x6074a4688650 rank 5 nranks 7 cudaDev 5 nvmlDev 5 busId 61000 commId 0x278945f61c095e7c - Init START
myles-System-Product-Name:12085:12124 [0] NCCL INFO ncclCommInitAll comm 0x6074a4535eb0 rank 0 nranks 7 cudaDev 0 nvmlDev 0 busId 1000 commId 0x278945f61c095e7c - Init START
myles-System-Product-Name:12085:12128 [4] NCCL INFO ncclCommInitAll comm 0x6074a4644b10 rank 4 nranks 7 cudaDev 4 nvmlDev 4 busId 42000 commId 0x278945f61c095e7c - Init START
myles-System-Product-Name:12085:12130 [6] NCCL INFO ncclCommInitAll comm 0x6074a46cc190 rank 6 nranks 7 cudaDev 6 nvmlDev 6 busId 62000 commId 0x278945f61c095e7c - Init START
myles-System-Product-Name:12085:12125 [1] NCCL INFO ncclCommInitAll comm 0x6074a4579950 rank 1 nranks 7 cudaDev 1 nvmlDev 1 busId 2000 commId 0x278945f61c095e7c - Init START
myles-System-Product-Name:12085:12126 [2] NCCL INFO ncclCommInitAll comm 0x6074a45bd490 rank 2 nranks 7 cudaDev 2 nvmlDev 2 busId 2b000 commId 0x278945f61c095e7c - Init START
myles-System-Product-Name:12085:12127 [3] NCCL INFO ncclCommInitAll comm 0x6074a4600fd0 rank 3 nranks 7 cudaDev 3 nvmlDev 3 busId 41000 commId 0x278945f61c095e7c - Init START
myles-System-Product-Name:12085:12129 [5] NCCL INFO Bootstrap timings total 0.001611 (create 0.000054, send 0.000183, recv 0.000613, ring 0.000557, delay 0.000000)
myles-System-Product-Name:12085:12128 [4] NCCL INFO Bootstrap timings total 0.001527 (create 0.000050, send 0.000184, recv 0.000371, ring 0.000196, delay 0.000000)
myles-System-Product-Name:12085:12130 [6] NCCL INFO Bootstrap timings total 0.001496 (create 0.000057, send 0.000187, recv 0.000555, ring 0.000491, delay 0.000000)
myles-System-Product-Name:12085:12124 [0] NCCL INFO Bootstrap timings total 0.001573 (create 0.000057, send 0.000184, recv 0.000725, ring 0.000396, delay 0.000000)
myles-System-Product-Name:12085:12126 [2] NCCL INFO Bootstrap timings total 0.001412 (create 0.000060, send 0.000148, recv 0.000775, ring 0.000209, delay 0.000000)
myles-System-Product-Name:12085:12125 [1] NCCL INFO Bootstrap timings total 0.001473 (create 0.000054, send 0.000159, recv 0.000745, ring 0.000299, delay 0.000000)
myles-System-Product-Name:12085:12127 [3] NCCL INFO Bootstrap timings total 0.001377 (create 0.000060, send 0.000170, recv 0.000776, ring 0.000176, delay 0.000000)
myles-System-Product-Name:12085:12125 [1] NCCL INFO NCCL_P2P_LEVEL set by environment to PHB
myles-System-Product-Name:12085:12125 [1] NCCL INFO NVLS multicast support is not available on dev 1
myles-System-Product-Name:12085:12130 [6] NCCL INFO NVLS multicast support is not available on dev 6
myles-System-Product-Name:12085:12128 [4] NCCL INFO NVLS multicast support is not available on dev 4
myles-System-Product-Name:12085:12129 [5] NCCL INFO NVLS multicast support is not available on dev 5
myles-System-Product-Name:12085:12124 [0] NCCL INFO NVLS multicast support is not available on dev 0
myles-System-Product-Name:12085:12127 [3] NCCL INFO NVLS multicast support is not available on dev 3
myles-System-Product-Name:12085:12126 [2] NCCL INFO NVLS multicast support is not available on dev 2
myles-System-Product-Name:12085:12125 [1] NCCL INFO comm 0x6074a4579950 rank 1 nRanks 7 nNodes 1 localRanks 7 localRank 1 MNNVL 0
myles-System-Product-Name:12085:12126 [2] NCCL INFO comm 0x6074a45bd490 rank 2 nRanks 7 nNodes 1 localRanks 7 localRank 2 MNNVL 0
myles-System-Product-Name:12085:12129 [5] NCCL INFO comm 0x6074a4688650 rank 5 nRanks 7 nNodes 1 localRanks 7 localRank 5 MNNVL 0
myles-System-Product-Name:12085:12125 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0
myles-System-Product-Name:12085:12127 [3] NCCL INFO comm 0x6074a4600fd0 rank 3 nRanks 7 nNodes 1 localRanks 7 localRank 3 MNNVL 0
myles-System-Product-Name:12085:12126 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1
myles-System-Product-Name:12085:12130 [6] NCCL INFO comm 0x6074a46cc190 rank 6 nRanks 7 nNodes 1 localRanks 7 localRank 6 MNNVL 0
myles-System-Product-Name:12085:12127 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2
myles-System-Product-Name:12085:12130 [6] NCCL INFO Trees [0] -1/-1/-1->6->5 [1] -1/-1/-1->6->5 [2] -1/-1/-1->6->5 [3] -1/-1/-1->6->5
myles-System-Product-Name:12085:12129 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4
myles-System-Product-Name:12085:12128 [4] NCCL INFO comm 0x6074a4644b10 rank 4 nRanks 7 nNodes 1 localRanks 7 localRank 4 MNNVL 0
myles-System-Product-Name:12085:12124 [0] NCCL INFO comm 0x6074a4535eb0 rank 0 nRanks 7 nNodes 1 localRanks 7 localRank 0 MNNVL 0
myles-System-Product-Name:12085:12125 [1] NCCL INFO P2P Chunksize set to 131072
myles-System-Product-Name:12085:12127 [3] NCCL INFO P2P Chunksize set to 131072
myles-System-Product-Name:12085:12130 [6] NCCL INFO P2P Chunksize set to 131072
myles-System-Product-Name:12085:12129 [5] NCCL INFO P2P Chunksize set to 131072
myles-System-Product-Name:12085:12128 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3
myles-System-Product-Name:12085:12128 [4] NCCL INFO P2P Chunksize set to 131072
myles-System-Product-Name:12085:12126 [2] NCCL INFO P2P Chunksize set to 131072
myles-System-Product-Name:12085:12124 [0] NCCL INFO Channel 00/04 : 0 1 2 3 4 5 6
myles-System-Product-Name:12085:12124 [0] NCCL INFO Channel 01/04 : 0 1 2 3 4 5 6
myles-System-Product-Name:12085:12124 [0] NCCL INFO Channel 02/04 : 0 1 2 3 4 5 6
myles-System-Product-Name:12085:12124 [0] NCCL INFO Channel 03/04 : 0 1 2 3 4 5 6
myles-System-Product-Name:12085:12124 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1
myles-System-Product-Name:12085:12124 [0] NCCL INFO P2P Chunksize set to 131072
myles-System-Product-Name:12085:12133 [1] NCCL INFO [Proxy Service] Device 1 CPU core 117
myles-System-Product-Name:12085:12136 [5] NCCL INFO [Proxy Service] Device 5 CPU core 85
myles-System-Product-Name:12085:12139 [3] NCCL INFO [Proxy Service] Device 3 CPU core 28
myles-System-Product-Name:12085:12137 [0] NCCL INFO [Proxy Service] Device 0 CPU core 91
myles-System-Product-Name:12085:12134 [6] NCCL INFO [Proxy Service] Device 6 CPU core 107
myles-System-Product-Name:12085:12138 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 100
myles-System-Product-Name:12085:12144 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 52
myles-System-Product-Name:12085:12135 [4] NCCL INFO [Proxy Service] Device 4 CPU core 20
myles-System-Product-Name:12085:12141 [5] NCCL INFO [Proxy Service UDS] Device 5 CPU core 81
myles-System-Product-Name:12085:12143 [2] NCCL INFO [Proxy Service] Device 2 CPU core 105
myles-System-Product-Name:12085:12140 [6] NCCL INFO [Proxy Service UDS] Device 6 CPU core 47
myles-System-Product-Name:12085:12142 [3] NCCL INFO [Proxy Service UDS] Device 3 CPU core 67
myles-System-Product-Name:12085:12145 [4] NCCL INFO [Proxy Service UDS] Device 4 CPU core 91
myles-System-Product-Name:12085:12146 [2] NCCL INFO [Proxy Service UDS] Device 2 CPU core 29
myles-System-Product-Name:12085:12126 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/direct pointer
myles-System-Product-Name:12085:12129 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/direct pointer
myles-System-Product-Name:12085:12127 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/direct pointer
myles-System-Product-Name:12085:12130 [6] NCCL INFO Channel 00/0 : 6[6] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:12085:12126 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/direct pointer
myles-System-Product-Name:12085:12125 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/direct pointer
myles-System-Product-Name:12085:12127 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/direct pointer
myles-System-Product-Name:12085:12124 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:12085:12128 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/direct pointer
myles-System-Product-Name:12085:12125 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/direct pointer
myles-System-Product-Name:12085:12130 [6] NCCL INFO Channel 01/0 : 6[6] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:12085:12129 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/direct pointer
myles-System-Product-Name:12085:12126 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/direct pointer
myles-System-Product-Name:12085:12124 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:12085:12127 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/direct pointer
myles-System-Product-Name:12085:12128 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/direct pointer
myles-System-Product-Name:12085:12124 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:12085:12127 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/direct pointer
myles-System-Product-Name:12085:12126 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/direct pointer
myles-System-Product-Name:12085:12129 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/direct pointer
myles-System-Product-Name:12085:12130 [6] NCCL INFO Channel 02/0 : 6[6] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:12085:12125 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/direct pointer
myles-System-Product-Name:12085:12128 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/direct pointer
myles-System-Product-Name:12085:12130 [6] NCCL INFO Channel 03/0 : 6[6] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:12085:12125 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/direct pointer
myles-System-Product-Name:12085:12129 [5] NCCL INFO Channel 03/0 : 5[5] -> 6[6] via P2P/direct pointer
myles-System-Product-Name:12085:12124 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:12085:12128 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/direct pointer
myles-System-Product-Name:12085:12130 [6] NCCL INFO Connected all rings
myles-System-Product-Name:12085:12130 [6] NCCL INFO Channel 00/0 : 6[6] -> 5[5] via P2P/direct pointer
myles-System-Product-Name:12085:12128 [4] NCCL INFO Connected all rings
myles-System-Product-Name:12085:12129 [5] NCCL INFO Connected all rings
myles-System-Product-Name:12085:12124 [0] NCCL INFO Connected all rings
myles-System-Product-Name:12085:12125 [1] NCCL INFO Connected all rings
myles-System-Product-Name:12085:12127 [3] NCCL INFO Connected all rings
myles-System-Product-Name:12085:12126 [2] NCCL INFO Connected all rings
myles-System-Product-Name:12085:12130 [6] NCCL INFO Channel 01/0 : 6[6] -> 5[5] via P2P/direct pointer
myles-System-Product-Name:12085:12130 [6] NCCL INFO Channel 02/0 : 6[6] -> 5[5] via P2P/direct pointer
myles-System-Product-Name:12085:12130 [6] NCCL INFO Channel 03/0 : 6[6] -> 5[5] via P2P/direct pointer
myles-System-Product-Name:12085:12128 [4] NCCL INFO Channel 00/0 : 4[4] -> 3[3] via P2P/direct pointer
myles-System-Product-Name:12085:12129 [5] NCCL INFO Channel 00/0 : 5[5] -> 4[4] via P2P/direct pointer
myles-System-Product-Name:12085:12125 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:12085:12126 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:12085:12127 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/direct pointer
myles-System-Product-Name:12085:12128 [4] NCCL INFO Channel 01/0 : 4[4] -> 3[3] via P2P/direct pointer
myles-System-Product-Name:12085:12129 [5] NCCL INFO Channel 01/0 : 5[5] -> 4[4] via P2P/direct pointer
myles-System-Product-Name:12085:12126 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:12085:12125 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:12085:12127 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/direct pointer
myles-System-Product-Name:12085:12128 [4] NCCL INFO Channel 02/0 : 4[4] -> 3[3] via P2P/direct pointer
myles-System-Product-Name:12085:12129 [5] NCCL INFO Channel 02/0 : 5[5] -> 4[4] via P2P/direct pointer
myles-System-Product-Name:12085:12126 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:12085:12125 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:12085:12127 [3] NCCL INFO Channel 02/0 : 3[3] -> 2[2] via P2P/direct pointer
myles-System-Product-Name:12085:12128 [4] NCCL INFO Channel 03/0 : 4[4] -> 3[3] via P2P/direct pointer
myles-System-Product-Name:12085:12129 [5] NCCL INFO Channel 03/0 : 5[5] -> 4[4] via P2P/direct pointer
myles-System-Product-Name:12085:12126 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/direct pointer
myles-System-Product-Name:12085:12125 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/direct pointer
myles-System-Product-Name:12085:12127 [3] NCCL INFO Channel 03/0 : 3[3] -> 2[2] via P2P/direct pointer
myles-System-Product-Name:12085:12129 [5] NCCL INFO Connected all trees
myles-System-Product-Name:12085:12125 [1] NCCL INFO Connected all trees
myles-System-Product-Name:12085:12124 [0] NCCL INFO Connected all trees
myles-System-Product-Name:12085:12126 [2] NCCL INFO Connected all trees
myles-System-Product-Name:12085:12128 [4] NCCL INFO Connected all trees
myles-System-Product-Name:12085:12130 [6] NCCL INFO Connected all trees
myles-System-Product-Name:12085:12127 [3] NCCL INFO Connected all trees
myles-System-Product-Name:12085:12148 [2] NCCL INFO [Proxy Progress] Device 2 CPU core 106
myles-System-Product-Name:12085:12153 [4] NCCL INFO [Proxy Progress] Device 4 CPU core 30
myles-System-Product-Name:12085:12147 [6] NCCL INFO [Proxy Progress] Device 6 CPU core 86
myles-System-Product-Name:12085:12150 [3] NCCL INFO [Proxy Progress] Device 3 CPU core 8
myles-System-Product-Name:12085:12149 [0] NCCL INFO [Proxy Progress] Device 0 CPU core 90
myles-System-Product-Name:12085:12151 [5] NCCL INFO [Proxy Progress] Device 5 CPU core 119
myles-System-Product-Name:12085:12152 [1] NCCL INFO [Proxy Progress] Device 1 CPU core 28
myles-System-Product-Name:12085:12127 [3] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512
myles-System-Product-Name:12085:12127 [3] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
myles-System-Product-Name:12085:12130 [6] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512
myles-System-Product-Name:12085:12130 [6] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
myles-System-Product-Name:12085:12129 [5] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512
myles-System-Product-Name:12085:12129 [5] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
myles-System-Product-Name:12085:12125 [1] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512
myles-System-Product-Name:12085:12125 [1] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
myles-System-Product-Name:12085:12128 [4] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512
myles-System-Product-Name:12085:12128 [4] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
myles-System-Product-Name:12085:12126 [2] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512
myles-System-Product-Name:12085:12126 [2] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
myles-System-Product-Name:12085:12124 [0] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512
myles-System-Product-Name:12085:12124 [0] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
myles-System-Product-Name:12085:12124 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
myles-System-Product-Name:12085:12128 [4] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
myles-System-Product-Name:12085:12128 [4] NCCL INFO ncclCommInitAll comm 0x6074a4644b10 rank 4 nranks 7 cudaDev 4 nvmlDev 4 busId 42000 commId 0x278945f61c095e7c - Init COMPLETE
myles-System-Product-Name:12085:12128 [4] NCCL INFO Init timings - ncclCommInitAll: rank 4 nranks 7 total 0.72 (kernels 0.50, alloc 0.06, bootstrap 0.00, allgathers 0.02, topo 0.05, graphs 0.01, connections 0.08, rest 0.00)
myles-System-Product-Name:12085:12130 [6] NCCL INFO ncclCommInitAll comm 0x6074a46cc190 rank 6 nranks 7 cudaDev 6 nvmlDev 6 busId 62000 commId 0x278945f61c095e7c - Init COMPLETE
myles-System-Product-Name:12085:12130 [6] NCCL INFO Init timings - ncclCommInitAll: rank 6 nranks 7 total 0.72 (kernels 0.52, alloc 0.05, bootstrap 0.00, allgathers 0.01, topo 0.05, graphs 0.01, connections 0.08, rest 0.00)
myles-System-Product-Name:12085:12125 [1] NCCL INFO ncclCommInitAll comm 0x6074a4579950 rank 1 nranks 7 cudaDev 1 nvmlDev 1 busId 2000 commId 0x278945f61c095e7c - Init COMPLETE
myles-System-Product-Name:12085:12125 [1] NCCL INFO Init timings - ncclCommInitAll: rank 1 nranks 7 total 0.74 (kernels 0.52, alloc 0.05, bootstrap 0.00, allgathers 0.02, topo 0.05, graphs 0.01, connections 0.08, rest 0.02)
myles-System-Product-Name:12085:12127 [3] NCCL INFO ncclCommInitAll comm 0x6074a4600fd0 rank 3 nranks 7 cudaDev 3 nvmlDev 3 busId 41000 commId 0x278945f61c095e7c - Init COMPLETE
myles-System-Product-Name:12085:12127 [3] NCCL INFO Init timings - ncclCommInitAll: rank 3 nranks 7 total 0.74 (kernels 0.52, alloc 0.05, bootstrap 0.00, allgathers 0.00, topo 0.06, graphs 0.01, connections 0.08, rest 0.02)
myles-System-Product-Name:12085:12129 [5] NCCL INFO ncclCommInitAll comm 0x6074a4688650 rank 5 nranks 7 cudaDev 5 nvmlDev 5 busId 61000 commId 0x278945f61c095e7c - Init COMPLETE
myles-System-Product-Name:12085:12129 [5] NCCL INFO Init timings - ncclCommInitAll: rank 5 nranks 7 total 0.74 (kernels 0.51, alloc 0.06, bootstrap 0.00, allgathers 0.01, topo 0.05, graphs 0.01, connections 0.08, rest 0.02)
myles-System-Product-Name:12085:12124 [0] NCCL INFO ncclCommInitAll comm 0x6074a4535eb0 rank 0 nranks 7 cudaDev 0 nvmlDev 0 busId 1000 commId 0x278945f61c095e7c - Init COMPLETE
myles-System-Product-Name:12085:12124 [0] NCCL INFO Init timings - ncclCommInitAll: rank 0 nranks 7 total 0.74 (kernels 0.51, alloc 0.06, bootstrap 0.00, allgathers 0.01, topo 0.05, graphs 0.01, connections 0.08, rest 0.02)
myles-System-Product-Name:12085:12126 [2] NCCL INFO ncclCommInitAll comm 0x6074a45bd490 rank 2 nranks 7 cudaDev 2 nvmlDev 2 busId 2b000 commId 0x278945f61c095e7c - Init COMPLETE
myles-System-Product-Name:12085:12126 [2] NCCL INFO Init timings - ncclCommInitAll: rank 2 nranks 7 total 0.74 (kernels 0.51, alloc 0.06, bootstrap 0.00, allgathers 0.00, topo 0.06, graphs 0.01, connections 0.08, rest 0.02)
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
33554432 8388608 float sum -1 2343.2 14.32 24.55 0 2338.6 14.35 24.60 0
myles-System-Product-Name:12085:12085 [0] NCCL INFO comm 0x6074a4535eb0 rank 0 nranks 7 cudaDev 0 busId 1000 - Destroy COMPLETE
myles-System-Product-Name:12085:12085 [6] NCCL INFO comm 0x6074a46cc190 rank 6 nranks 7 cudaDev 6 busId 62000 - Destroy COMPLETE
myles-System-Product-Name:12085:12085 [5] NCCL INFO comm 0x6074a4688650 rank 5 nranks 7 cudaDev 5 busId 61000 - Destroy COMPLETE
myles-System-Product-Name:12085:12085 [4] NCCL INFO comm 0x6074a4644b10 rank 4 nranks 7 cudaDev 4 busId 42000 - Destroy COMPLETE
myles-System-Product-Name:12085:12085 [3] NCCL INFO comm 0x6074a4600fd0 rank 3 nranks 7 cudaDev 3 busId 41000 - Destroy COMPLETE
myles-System-Product-Name:12085:12085 [2] NCCL INFO comm 0x6074a45bd490 rank 2 nranks 7 cudaDev 2 busId 2b000 - Destroy COMPLETE
myles-System-Product-Name:12085:12085 [1] NCCL INFO comm 0x6074a4579950 rank 1 nranks 7 cudaDev 1 busId 2000 - Destroy COMPLETE
# Out of bounds values : 0 OK
# Avg bus ``` |
I have 8 PCIe 16x 4.0 GPU cards, but as long as P2P capability is enabled, the data transfer speed drastically decreases when using more than two cards. I tested with |
Hello. I am not Chinese and don't have WeChat, but I think it is helpful to resolve the issue here anyway. Also, which motherboard are you using? I don't know of many motherboards that offer PCIe x16 slots at full x16 speed with 8 cards. A physical x16 slot can still run at x8 or x4 speed, or have its lanes bifurcated. My theory is that when you run NCCL it checks how many lanes each device has, and if they don't all have the same lane count it falls back to the CPU. If you run this command you can see whether NCCL is routing transfers through the CPU or over P2P: NCCL_DEBUG=INFO ./all_reduce_perf -g 3
The relevant section is here: 1[1] -> 0[0] via P2P/direct pointer
or here: 1[1] -> 2[2] via SHM/direct/direct
If SHM is showing, then to make it work on your system you will have to modify the NCCL source code, which is available on GitHub, so that it always uses P2P instead of the CPU. From my test above you can see there is no issue running P2P with many RTX 4090 GPUs on the tinygrad driver, provided the cards are all on the same bandwidth. An x16 card running at x4 or x8 speed does not work; if all devices run at x16, it works.
I think you may be confused about the bandwidth of your cards. Obviously it is an x16 card, but the motherboard must be seriously good to have 128 PCIe lanes available to the CPU, excluding other devices. Even if the transfer happens via P2P, there must be enough balanced lanes to each GPU. If you look here https://tinygrad.org/#tinybox you can see he is selling a computer with 8 GPUs in it. So why does he need such a powerful motherboard with 2x AMD GENOA? This is why: 128 lanes of PCIe 4.0 support per CPU. With a single AMD 3995x Threadripper you can only handle 8 cards if nothing else is using the other lanes. You are asking NCCL to use 128 lanes, 16 per GPU, for PCIe transfers. This will fail and fall back to the CPU if you don't have a CPU with enough lanes.
Some lanes may also be used for other things, like NVMe drives, each using 4. So I believe your issue is that you are either mixing and matching 8 devices, with the CPU allocating fewer lanes to some of the x16 cards, or you don't have enough lanes to serve that many PCIe devices at one time. I have shown that the driver on CUDA 12.6 with NVIDIA 560 works with 7 GPUs at full P2P, provided the GPUs are all on x16 lanes with a supported CPU. Even though CPU RAM is not used for the transfer, the CPU still mediates the process, and I believe each lane must still be active. Would it work if we take 8 PCIe 4.0 x16 GPUs and bifurcate to put them all on x8 lanes? Maybe we have to try this. Moral of the story: this is a hardware issue or an NCCL issue, not a driver issue. |
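To make the P2P-versus-SHM check described above less manual, here is a minimal Python sketch that tallies which transport each connection line in an `NCCL_DEBUG=INFO` log reports. The function name `count_transports` and the sample log lines are illustrative assumptions. The regular expression is based only on the line shapes quoted in this thread, so other NCCL versions may format these lines differently.

```python
import re
from collections import Counter

# Matches connection lines such as:
#   "... NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/direct pointer"
# The ring summary lines ("Channel 00/04 : 0 1 2 3 ...") do not match.
CHANNEL_RE = re.compile(
    r"Channel\s+\d+/\d+\s*:\s*(\d+)\[\d+\]\s*->\s*(\d+)\[\d+\]\s+via\s+(\w+)"
)

def count_transports(log_text):
    """Count connections per transport name (e.g. P2P, SHM) in an NCCL INFO log."""
    counts = Counter()
    for line in log_text.splitlines():
        match = CHANNEL_RE.search(line)
        if match:
            counts[match.group(3)] += 1
    return counts

# Hypothetical sample, shaped like the lines quoted in this thread.
sample = """\
host:1:2 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/direct pointer
host:1:3 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via SHM/direct/direct
host:1:4 [0] NCCL INFO Channel 00/04 : 0 1 2 3 4 5 6
"""
print(dict(count_transports(sample)))
```

Saving the benchmark output to a file and passing its contents to `count_transports` shows at a glance whether any connection fell back to SHM.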
Also, I notice your bus bandwidth here:
# Out of bounds values : 0 OK
# Avg bus bandwidth : 4.85887
#
Running on only 2 cards, this proves you're not using x16 lanes. It should be at least 24 GB/s with x16 lanes. |
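For reference, the bandwidth columns in the all_reduce results earlier in this thread can be reproduced by hand. The sketch below assumes the definitions used by nccl-tests for all_reduce (algbw = size / time, busbw = algbw × 2(n−1)/n) and the usual 128b/130b encoding for the theoretical PCIe figure. The function names are illustrative, and the PCIe number is a one-direction upper bound, not something a benchmark will reach.

```python
def algbw_gbps(size_bytes, time_us):
    """Algorithm bandwidth in GB/s (1 GB = 1e9 bytes): message size over time."""
    return size_bytes / (time_us * 1e-6) / 1e9

def allreduce_busbw(algbw, n_ranks):
    """Bus bandwidth for all_reduce: algbw scaled by 2*(n-1)/n."""
    return algbw * 2 * (n_ranks - 1) / n_ranks

def pcie_gbps(gt_per_s, lanes):
    """Theoretical one-direction PCIe bandwidth in GB/s with 128b/130b encoding."""
    return gt_per_s * lanes * (128 / 130) / 8

# Values taken from the 7-GPU all_reduce_perf row quoted in this thread:
# size 33554432 B, out-of-place time 2343.2 us.
alg = algbw_gbps(33554432, 2343.2)
bus = allreduce_busbw(alg, 7)
print(f"algbw {alg:.2f} GB/s, busbw {bus:.2f} GB/s")

# PCIe 4.0 runs at 16 GT/s per lane.
for lanes in (4, 8, 16):
    print(f"PCIe 4.0 x{lanes}: {pcie_gbps(16, lanes):.1f} GB/s")
```

The computed values match the 14.32 and 24.55 GB/s shown in that row, which sit below the roughly 31.5 GB/s ceiling of a PCIe 4.0 x16 link.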
p2p.txt |
I’m using an Intel SPR motherboard with all 8 GPUs directly connected to the CPU, each over PCIe 4.0 x16. Sapphire Rapids provides enough lanes for more than eight PCIe 4.0 x16 links. |
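One way to confirm the negotiated link width of each GPU, rather than relying on the slot specification, is to query `nvidia-smi`. The sketch below assumes `nvidia-smi` is on the PATH and uses its `pcie.link.gen.current`, `pcie.link.width.current` and `pcie.link.width.max` query fields. The helper names and sample text are hypothetical. Note that on some systems the reported generation can drop while the GPU is idle, so it is safer to read it under load.

```python
import csv
import io
import subprocess

QUERY = "index,pcie.link.gen.current,pcie.link.width.current,pcie.link.width.max"

def parse_link_report(csv_text):
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader` output into dicts."""
    rows = []
    for fields in csv.reader(io.StringIO(csv_text)):
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        try:
            index, gen, width, width_max = (int(f.strip()) for f in fields)
        except ValueError:
            continue  # skip rows with non-numeric values such as "[N/A]"
        rows.append({"index": index, "gen": gen, "width": width, "width_max": width_max})
    return rows

def downgraded(rows):
    """Return GPU indices whose negotiated width is below the maximum they support."""
    return [r["index"] for r in rows if r["width"] < r["width_max"]]

def query_nvidia_smi():
    """Run nvidia-smi and return its CSV output (not called in this example)."""
    return subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    ).stdout

# Hypothetical output for two GPUs; GPU 1 has negotiated only x8.
sample = "0, 4, 16, 16\n1, 4, 8, 16\n"
print(downgraded(parse_link_report(sample)))
```

On a real machine, `downgraded(parse_link_report(query_nvidia_smi()))` would list any GPU running below its maximum width.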
@ZP-AlwaysWin looks good now. What did you change? I see the bandwidth has gone up to 10, above the 3 where it was, and the log shows it is using P2P. |
Is it this one: 1 PCIe 5.0 x8, 2 PCIe 5.0 x16, 4 PCIe 5.0 x8 MCIO? |
I haven’t changed anything, but my test results still didn’t meet expectations, so I’ve decided to temporarily abandon this P2P capability. |
NVIDIA Open GPU Kernel Modules Version
NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4
Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.
Operating System and Version
Description: Ubuntu 22.04.1 LTS
Kernel Release
5.15.0-60-generic #66-Ubuntu SMP Fri Jan 20 14:29:49 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.
Hardware: GPU
NVIDIA GeForce RTX 4090
Describe the bug
Enabling P2P capability on 8 RTX 4090 GPUs results in significantly lower performance in NCCL alltoall_perf tests compared to when P2P capability is disabled.
To Reproduce
Enabling P2P capability on two RTX 4090 GPUs significantly improves performance in the NCCL alltoall_perf tests compared to when P2P is disabled. However, when testing with eight GPUs, the performance gap between enabling and disabling P2P is much larger, with a severe performance drop when P2P is enabled. The relevant test data is as follows:
simpleP2P test passes.
The alltoall_perf test data for two GPUs with P2P disabled:
The alltoall_perf test data for two GPUs with P2P enabled:
From the two-GPU test, it's evident that enabling P2P results in a significant performance boost.
The alltoall_perf test data for eight GPUs with P2P disabled:
The alltoall_perf test data for eight GPUs with P2P enabled:
From the eight-GPU test, it's clear that enabling P2P causes a severe performance drop.
Does anyone have experience in addressing this performance degradation when enabling P2P for eight GPUs?
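A comparison like the one described above can be scripted so that both runs use identical arguments. The following is a minimal sketch, assuming the benchmark binary is `./alltoall_perf` and that P2P is toggled with NCCL's `NCCL_P2P_DISABLE` environment variable. The helper names are hypothetical, and the sample strings only demonstrate parsing of the summary line format seen in this thread.

```python
import os
import re
import subprocess

AVG_BUSBW_RE = re.compile(r"#\s*Avg bus bandwidth\s*:\s*([0-9.]+)")

def avg_bus_bandwidth(output):
    """Extract the '# Avg bus bandwidth' value (GB/s) from nccl-tests output."""
    match = AVG_BUSBW_RE.search(output)
    return float(match.group(1)) if match else None

def p2p_env(disable_p2p):
    """Copy the current environment, optionally setting NCCL_P2P_DISABLE=1."""
    env = dict(os.environ)
    if disable_p2p:
        env["NCCL_P2P_DISABLE"] = "1"
    else:
        env.pop("NCCL_P2P_DISABLE", None)
    return env

def run_benchmark(n_gpus, disable_p2p, binary="./alltoall_perf"):
    """Run the benchmark once and return its average bus bandwidth (not called here)."""
    result = subprocess.run(
        [binary, "-b", "8", "-e", "128M", "-f", "2", "-g", str(n_gpus)],
        env=p2p_env(disable_p2p), check=True, capture_output=True, text=True,
    )
    return avg_bus_bandwidth(result.stdout)

# Parsing check on summary lines in the format quoted in this thread.
print(avg_bus_bandwidth("# Avg bus bandwidth : 24.557"))
print(avg_bus_bandwidth("# Avg bus bandwidth : 4.85887"))
```

Calling `run_benchmark(8, disable_p2p=True)` and `run_benchmark(8, disable_p2p=False)` back to back would give the two figures to compare.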
Bug Incidence
Always
nvidia-bug-report.log.gz
~
More Info
If more information is needed, I can provide it at any time.