Known issues with NCCL. #11154
Comments
This should not affect the conda build.
How do things look with NCCL 2.25.1-1?
@jakirkham Just tried 2.25.1-1 (as part of #11202). I get the same error. I had to set the env var NCCL_RAS_ENABLE=0.
Thanks Hyunsu! 🙏 This is with conda, pip, or both?
@jakirkham The issue only arises if NCCL was installed from pip. So this issue won't arise for the Conda package of XGBoost.
Need to remove the CI workarounds once the new NCCL is released.
Updated the issue to reflect the current status of NCCL. |
NCCL 2.24.x and 2.25.x might crash XGBoost due to the RAS module.
Workaround (see the Python sketch below):
export NCCL_RAS_ENABLE=0
The CI currently applies this workaround in:
- xgboost/ops/pipeline/test-python-wheel-impl.sh (line 48, commit 3a2a85d)
- xgboost/ops/docker_run.py (line 73, commit 461d27c)
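A minimal sketch of applying this workaround from Python instead of the shell. The variable name comes from the workaround above; the training code itself is omitted and assumed to be the usual multi-GPU XGBoost setup.

```python
import os

# Workaround for the RAS-module crash seen with NCCL 2.24.x / 2.25.x.
# The variable must be set before NCCL is initialized, i.e. before the
# first multi-GPU communicator is created.
os.environ["NCCL_RAS_ENABLE"] = "0"

import xgboost as xgb  # imported after setting the variable, to be safe

# ... construct a DMatrix / QuantileDMatrix and train with device="cuda" as usual ...
```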
Linux GPU driver 560 might hang P2P communication in NCCL, or make it extremely slow.
Workaround:
- Update to driver 570 (not solved).
- Disable P2P for now: NCCL_P2P_DISABLE=1
- Use cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest to check that your P2P channel is working correctly (a Python alternative is sketched below).
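As a quick programmatic alternative to building the cuda-samples test, a sketch like the following reports peer-access support between GPU pairs. It assumes the cuda-python package, which is not mentioned above.

```python
# Report whether each GPU pair supports peer (P2P) access.
# Assumes the cuda-python bindings: pip install cuda-python
from cuda import cudart

err, count = cudart.cudaGetDeviceCount()
assert err == cudart.cudaError_t.cudaSuccess, err

for dev in range(count):
    for peer in range(count):
        if dev == peer:
            continue
        err, can_access = cudart.cudaDeviceCanAccessPeer(dev, peer)
        assert err == cudart.cudaError_t.cudaSuccess, err
        status = "supported" if can_access else "NOT supported"
        print(f"P2P access GPU {dev} -> GPU {peer}: {status}")
```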
With NCCL >= 2.24, you might run into an unhandled CUDA error reported by NCCL if your system doesn't support P2P communication.
Workaround:
Check that CUDA_VISIBLE_DEVICES is correctly set. You should not split the GPUs into different partitions: use CUDA_VISIBLE_DEVICES=1,0 to make the GPU with ordinal 1 the default (instead of 0), rather than CUDA_VISIBLE_DEVICES=1.
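A small illustration of the device-ordering point, assuming a two-GPU machine; only the environment variable is set here, before any CUDA context is created.

```python
import os

# Keep both GPUs visible and make the GPU with ordinal 1 the default device.
# This is the recommended form from the workaround above.
os.environ["CUDA_VISIBLE_DEVICES"] = "1,0"

# By contrast, CUDA_VISIBLE_DEVICES="1" would hide GPU 0 entirely,
# splitting the GPUs into different partitions, which the workaround
# above advises against.
```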