Known issues with NCCL. #11154

trivialfis · 2025-01-09T16:28:43Z

NCCL 2.24.x and 2.25.x might crash XGBoost due to the RAS module.
Workaround: export NCCL_RAS_ENABLE=0
- xgboost/ops/pipeline/test-python-wheel-impl.sh
  
  Line 48 in 3a2a85d
  
  export NCCL_RAS_ENABLE=0
- xgboost/ops/docker_run.py
  
  Line 73 in 461d27c
  
  docker_run_cli_args.extend(["-e", "NCCL_RAS_ENABLE=0"])
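For Python users who cannot easily change the shell environment, the same workaround can be sketched in-process. This is a minimal sketch, assuming NCCL reads `NCCL_RAS_ENABLE` when it initializes, so the variable has to be set before XGBoost starts any distributed communication. The helper name `disable_nccl_ras` is made up for illustration.

```python
import os


def disable_nccl_ras(env=None):
    """Set NCCL_RAS_ENABLE=0 in the given mapping (os.environ by default)."""
    env = os.environ if env is None else env
    env["NCCL_RAS_ENABLE"] = "0"
    return env


# Call this first, then import xgboost and start training.
disable_nccl_ras()
```

Setting the variable at the very top of the entry-point script keeps it ahead of any library that might load NCCL.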
Linux GPU driver 560 might hang P2P communication in NCCL, or make it extremely slow.
Workaround:
- ~~Update to 570~~ (this did not solve the problem). Disable P2P for now with NCCL_P2P_DISABLE=1.
- Use cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest to check that your P2P channel is working correctly.
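If you prefer not to export the variable globally, one option is to pass it only to the process that runs training. The following is a rough sketch; `env_without_p2p` and `run_without_p2p` are hypothetical helper names, and the command you launch is up to you.

```python
import os
import subprocess


def env_without_p2p(base=None):
    """Return a copy of the environment with NCCL_P2P_DISABLE=1 added."""
    env = dict(os.environ if base is None else base)
    env["NCCL_P2P_DISABLE"] = "1"
    return env


def run_without_p2p(cmd):
    """Run cmd in a child process that has NCCL P2P transport disabled."""
    return subprocess.run(
        cmd, env=env_without_p2p(), check=True, capture_output=True, text=True
    )
```

For example, `run_without_p2p([sys.executable, "train.py"])` would launch a (hypothetical) `train.py` with P2P disabled while leaving the parent environment untouched.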
With NCCL >= 2.24, you might run into an unhandled CUDA error reported by NCCL if your system doesn't support P2P communication.
Workaround:
Check that CUDA_VISIBLE_DEVICES is set correctly. You should not split GPUs into different partitions:
- good: CUDA_VISIBLE_DEVICES=1,0 to use the GPU with ordinal 1 as default (instead of 0)
- bad: CUDA_VISIBLE_DEVICES=1, which exposes only a single GPU to the process
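The good/bad distinction above can be checked from inside a worker. This is only a heuristic sketch with made-up helper names: it flags the case where exactly one device is listed, which is also what a genuine single-GPU setup looks like, so treat a positive result as a hint rather than an error.

```python
import os


def visible_devices(env=None):
    """Return the entries of CUDA_VISIBLE_DEVICES, or None if it is unset."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None  # unset: CUDA exposes every GPU on the machine
    return [item.strip() for item in raw.split(",") if item.strip()]


def looks_partitioned(env=None):
    """Heuristic: True when exactly one device is visible to this process."""
    devices = visible_devices(env)
    return devices is not None and len(devices) == 1
```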

trivialfis · 2025-01-25T06:54:11Z

This should not affect the conda build.

jakirkham · 2025-01-30T19:57:48Z

How do things look with NCCL 2.25.1-1?

hcho3 · 2025-02-04T19:50:09Z

@jakirkham Just tried 2.25.1-1 (as part of #11202). I get the same error. I had to set the env var NCCL_RAS_ENABLE=0.

jakirkham · 2025-02-04T20:12:21Z

Thanks Hyunsu! 🙏

This is with conda, pip, or both?

hcho3 · 2025-02-04T20:53:08Z

@jakirkham The issue only arises if NCCL was installed from pip. The issue does not arise if:

- NCCL is installed from Conda
- XGBoost was built with CMake flags: -DUSE_DLOPEN_NCCL=OFF (don't use dlopen for NCCL)

So this issue won't arise for the Conda package of XGBoost.
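To tell whether a given environment has a pip-installed NCCL in an affected release, something like the sketch below may help. It assumes the pip wheels are named `nvidia-nccl-cu12` or `nvidia-nccl-cu11`, and it takes the affected range (2.24.x and 2.25.x) from the issue description; a Conda-installed NCCL is a native library and is not expected to show up here.

```python
from importlib import metadata

# Assumption: names of the NCCL wheels published on PyPI.
PIP_NCCL_PACKAGES = ("nvidia-nccl-cu12", "nvidia-nccl-cu11")


def pip_nccl_version():
    """Return (package, version) for a pip-installed NCCL, or None."""
    for name in PIP_NCCL_PACKAGES:
        try:
            return name, metadata.version(name)
        except metadata.PackageNotFoundError:
            continue
    return None


def ras_affected(version):
    """True for 2.24.x and 2.25.x, the releases named in this issue."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) in {(2, 24), (2, 25)}
```

If `pip_nccl_version()` returns a version for which `ras_affected` is true, setting NCCL_RAS_ENABLE=0 is the workaround reported in this thread.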

trivialfis · 2025-02-12T11:43:43Z

We need to remove the CI workarounds once the new NCCL is released.

trivialfis · 2025-02-20T07:36:25Z

Updated the issue to reflect the current status of NCCL.

This was referenced Jan 9, 2025

[dask] Fix LTR with empty partition and NCCL error. #11152

Merged

More sklearn tag support. #11162

Merged

hcho3 added a commit to hcho3/xgboost that referenced this issue Feb 4, 2025

[CI] Work around dmlc#11154

c7e2d7f

hcho3 added a commit to hcho3/xgboost that referenced this issue Feb 5, 2025

[CI] Work around dmlc#11154

93d54ca

trivialfis changed the title ~~Latest NCCL 2.24.3 might crash XGBoost.~~ Known issue with NCCL. Feb 20, 2025

trivialfis changed the title ~~Known issue with NCCL.~~ Known issues with NCCL. Feb 20, 2025