It is not possible to determine whether this bandwidth value is normal. #293
Why are you setting those variables?
What is the performance you get if you unset those two variables?
NCCL_SHM_DISABLE=1 <- Setting this variable doesn't make much difference compared to not setting it; someone recommended it to me.
NCCL_NET_GDR_LEVEL=SYS <- Without this variable, the bandwidth tops out at about 1.1 GB/s.
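As a sketch of that comparison, the two runs could look like the following; the hostnames, binary path, and message-size flags are placeholders, not taken from this thread:

```shell
# Baseline: run nccl-tests with both variables unset (hypothetical hosts/paths)
unset NCCL_SHM_DISABLE NCCL_NET_GDR_LEVEL
mpirun -np 2 -H node1,node2 ./build/all_reduce_perf -b 8 -e 1G -f 2 -g 1

# Same run, forcing GPUDirect RDMA regardless of topology distance;
# -x propagates the variable to the remote rank (Open MPI syntax)
mpirun -np 2 -H node1,node2 -x NCCL_NET_GDR_LEVEL=SYS \
    ./build/all_reduce_perf -b 8 -e 1G -f 2 -g 1
```

Comparing the busbw column of the two runs shows directly how much GDR contributes on this topology.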
Oh, I missed that you were running with 2 GPUs on different nodes, so that setting indeed makes sense. Regarding the 1.1 GB/s performance without GDR, that is usually due to your CPU being configured with one NUMA domain per socket (NPS=1) in the BIOS. You should set NPS to 4; hopefully performance will be much better with default settings.
Thank you. When I set NUMA Nodes Per Socket to 4 in the Dell BIOS processor settings as instructed and reran the test, I got up to 9 GB/s.
The topology is still SYS. The performance has improved, but it still doesn't look like a good figure. Shouldn't it be at least 20 GB/s in a PCIe Gen4 environment?
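For context on that 20 GB/s expectation, a back-of-envelope calculation of the relevant link ceilings; the numbers below come from the public PCIe Gen4 and HDR InfiniBand specifications, not from this thread:

```shell
# Theoretical per-direction link ceilings, computed with awk
awk 'BEGIN {
  # PCIe Gen4 x16: 16 GT/s per lane, 16 lanes, 128b/130b encoding, bits -> bytes
  pcie = 16 * 16 * 128 / 130 / 8
  # HDR InfiniBand: 200 Gb/s per port, bits -> bytes
  hdr = 200 / 8
  printf "PCIe Gen4 x16 ceiling: %.1f GB/s\n", pcie
  printf "HDR IB port ceiling:   %.1f GB/s\n", hdr
}'
# -> PCIe Gen4 x16 ceiling: 31.5 GB/s
# -> HDR IB port ceiling:   25.0 GB/s
```

With a single HDR port the wire itself caps the transfer at about 25 GB/s, so a well-functioning setup landing somewhat below that makes 20+ GB/s a plausible target and 9 GB/s a sign of a remaining topology or GDR issue.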
Hello.
Let me share the equipment information first.
Server: Dell PowerEdge R7525
CPU: AMD EPYC 7352 24-Core Processor x2
MEM: 256 GB
GPU: A100 40GB PCIe x1
Compute network: ConnectX-6 HDR InfiniBand adapter x1
PCIe: Gen4
OS: Ubuntu 22.04
Kernel: 5.15.0-134-generic
NVIDIA driver: 550.54.14
CUDA Toolkit: 12.4
OFED: MLNX_OFED_LINUX-24.10-1.1.4.0
NCCL: 2.21.5-1+cuda12.4
HPC-X: hpcx-v2.18-gcc-mlnx_ofed-ubuntu22.04-cuda12-x86_64
We are running nccl-tests between two machines with the configuration above, but the bandwidth is lower than we expected.
Is there a way to get better bandwidth?
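A few standard diagnostics usually help narrow this kind of problem down; the hostnames and the bus ID below are placeholders to fill in for the actual system:

```shell
# How are the GPU and the ConnectX-6 NIC connected? PIX/PXB is ideal for GDR; SYS
# means traffic crosses the inter-socket link
nvidia-smi topo -m

# Did the GPU/NIC actually negotiate PCIe Gen4 x16? (check the LnkSta line)
sudo lspci -vv -s <gpu_or_nic_bus_id> | grep LnkSta

# Is the InfiniBand port up at HDR rate (200 Gb/s)?
ibstat

# Re-run the test with NCCL debug output to confirm whether GDR is being used
mpirun -np 2 -H node1,node2 -x NCCL_DEBUG=INFO \
    ./build/all_reduce_perf -b 8 -e 1G -f 2 -g 1
```

If `nvidia-smi topo -m` still reports SYS between the GPU and the NIC, moving one of the cards so both sit under the same PCIe root complex is typically the biggest single win.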