Known issues with NCCL. #11154

Open
trivialfis opened this issue Jan 9, 2025 · 7 comments

@trivialfis
Member

trivialfis commented Jan 9, 2025

  • NCCL 2.24.x and 2.25.x might crash XGBoost due to the RAS module.
    Workaround: export NCCL_RAS_ENABLE=0

  • Linux GPU driver 560 might hang P2P communication in NCCL, or make it extremely slow.
    Workaround:

    • Updating to driver 570 does not solve this. For now, disable P2P with NCCL_P2P_DISABLE=1.
    • Use cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest to check that your P2P channel is working correctly.
  • With NCCL >= 2.24, you might run into an unhandled CUDA error reported by NCCL if your system doesn't support P2P communication.
    Workaround:
    Check that CUDA_VISIBLE_DEVICES is set correctly. You should not split GPUs into different partitions:

    • good: CUDA_VISIBLE_DEVICES=1,0 to use the GPU with ordinal 1 as default (instead of 0)
    • bad: CUDA_VISIBLE_DEVICES=1
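
The workarounds above can also be applied from Python, as long as it happens before NCCL is initialized in that process. The following is a minimal sketch, not part of XGBoost: the helper names `apply_nccl_workarounds` and `visible_devices_look_partitioned` are hypothetical, and the single-device check is only a heuristic that matters for multi-GPU runs.

```python
import os


def apply_nccl_workarounds(env=None, disable_ras=True, disable_p2p=False):
    """Set the workaround variables from this issue in `env`.

    `env` defaults to os.environ. Call this before NCCL is initialized,
    and in every worker process, since each worker reads its own environment.
    """
    env = os.environ if env is None else env
    if disable_ras:
        # Workaround for crashes from the RAS module (NCCL 2.24.x / 2.25.x).
        env["NCCL_RAS_ENABLE"] = "0"
    if disable_p2p:
        # Workaround for P2P hangs or slowdowns with Linux GPU driver 560.
        env["NCCL_P2P_DISABLE"] = "1"
    return env


def visible_devices_look_partitioned(env=None):
    """Return True when CUDA_VISIBLE_DEVICES exposes exactly one device.

    This mirrors the "bad" example above (CUDA_VISIBLE_DEVICES=1).
    An unset variable means all GPUs are visible, so it returns False.
    """
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return False
    devices = [d for d in value.split(",") if d.strip()]
    return len(devices) == 1
```

Passing a plain dict as `env` makes it possible to build the environment for a subprocess or worker without touching the current process.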
@trivialfis
Member Author

This should not affect conda build.

@jakirkham
Contributor

How do things look with NCCL 2.25.1-1?

@hcho3
Collaborator

hcho3 commented Feb 4, 2025

@jakirkham Just tried 2.25.1-1 (as part of #11202). I get the same error. I had to set the env var NCCL_RAS_ENABLE=0.

@jakirkham
Contributor

Thanks Hyunsu! 🙏

This is with conda, pip, or both?

hcho3 added a commit to hcho3/xgboost that referenced this issue Feb 4, 2025
@hcho3
Collaborator

hcho3 commented Feb 4, 2025

@jakirkham The issue only arises if NCCL was installed from pip. The issue does not arise if:

  1. NCCL is installed from Conda
  2. XGBoost was built with the CMake flag -DUSE_DLOPEN_NCCL=OFF (don't use dlopen for NCCL)

So this issue won't arise for the Conda package of XGBoost.
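
A rough way to tell whether an environment falls into the affected case is to look for a pip-installed NCCL wheel and compare its version with the 2.24.x and 2.25.x range from the top of this issue. The sketch below assumes the wheel is published as `nvidia-nccl-cu12` or `nvidia-nccl-cu11` (an assumption; adjust for your CUDA version), and both helper names are hypothetical. It does not tell you whether your XGBoost build loads NCCL through dlopen.

```python
from importlib import metadata


def pip_nccl_version(candidates=("nvidia-nccl-cu12", "nvidia-nccl-cu11")):
    """Return (package_name, version) for the first installed candidate, else None.

    The candidate names are assumptions about how the NCCL wheel is published.
    """
    for name in candidates:
        try:
            return name, metadata.version(name)
        except metadata.PackageNotFoundError:
            continue
    return None


def is_affected_nccl_version(version):
    """Return True for NCCL 2.24.x and 2.25.x, the range named in this issue."""
    parts = version.split(".")
    try:
        major, minor = int(parts[0]), int(parts[1])
    except (IndexError, ValueError):
        # Unparseable version string: make no claim about it.
        return False
    return (major, minor) in {(2, 24), (2, 25)}


if __name__ == "__main__":
    found = pip_nccl_version()
    if found is None:
        print("No pip-installed NCCL wheel found under the assumed names.")
    else:
        name, version = found
        if is_affected_nccl_version(version):
            print(f"{name} {version}: consider setting NCCL_RAS_ENABLE=0")
        else:
            print(f"{name} {version}: outside the 2.24.x / 2.25.x range")
```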

hcho3 added a commit to hcho3/xgboost that referenced this issue Feb 5, 2025
@trivialfis
Member Author

We need to remove the CI workarounds once the new NCCL is released.

@trivialfis changed the title from "Latest NCCL 2.24.3 might crash XGBoost." to "Known issue with NCCL." Feb 20, 2025
@trivialfis changed the title from "Known issue with NCCL." to "Known issues with NCCL." Feb 20, 2025
@trivialfis
Member Author

Updated the issue to reflect the current status of NCCL.
