nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing #120
That diagnostic is harmless. It is related to a mitigation for issue N.4272659; see https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-535-129-03/index.html. The full fix for issue N.4272659 is present in R560TRD1 and newer. With those drivers, that diagnostic should never appear.
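To check whether a node is still affected, one option is to count how often the diagnostic appears in the kernel log and compare the installed driver against the fixed branch. The following is only a minimal sketch: the helper names `count_diagnostics` and `driver_has_full_fix` are hypothetical, and it assumes that the "R560" branch corresponds to driver versions 560.x and above.

```python
import re

# Substring taken from the log line quoted in the issue title.
DIAGNOSTIC = "nv_get_p2p_free_callback"

# Assumption: the "R560" release branch maps to driver versions 560.x.
FIRST_FIXED_MAJOR = 560


def count_diagnostics(log_text: str) -> int:
    """Count kernel log lines that contain the nvidia-peermem diagnostic."""
    return sum(1 for line in log_text.splitlines() if DIAGNOSTIC in line)


def driver_has_full_fix(version: str) -> bool:
    """Return True if the driver major version is at or above the assumed fixed branch."""
    match = re.match(r"\s*(\d+)\.", version)
    if match is None:
        raise ValueError(f"unrecognized driver version: {version!r}")
    return int(match.group(1)) >= FIRST_FIXED_MAJOR


if __name__ == "__main__":
    # Sample input built from the message in the issue title; in practice,
    # pass in the text of the kernel log and the installed driver version.
    sample_log = (
        "nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected "
        "invalid context, skipping further processing\n"
    )
    print("diagnostic lines:", count_diagnostics(sample_log))
    print("535.129.03 has full fix:", driver_has_full_fix("535.129.03"))
```

The log text can come from `dmesg`, and the version string from `nvidia-smi --query-gpu=driver_version --format=csv,noheader`. Both functions take plain strings, so they can be tried without a GPU present.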
Hi @drossetti, we've had NVIDIA driver crashes, and we saw a lot of these kernel log messages before the crash. Our second concern is the quality of the results computed by the GPUs: with so many errors (sometimes as many as 20k messages in the logs), we also wondered whether this affected the quality of the generated models. Anyway, we'll try an update again and I'll keep you posted. Thanks again.
as the latest driver fixes the issue ref. Mellanox/nv_peer_memory#120 Signed-off-by: Gyuho Lee <[email protected]>
Hello,
On my HGX GPU cluster, the following error occurs when training jobs start to run.
This causes problems with the reliability of the AI models.
Have you ever had this error? Do you have any ideas?
Thanks for your help.
Best
cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.2 LTS"
ofed_info -s
MLNX_OFED_LINUX-5.8-5.1.1.2:
uname -r
5.19.0-45-generic