nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing #120

Open
hmeScaler opened this issue Oct 10, 2024 · 2 comments

hmeScaler commented Oct 10, 2024

Hello,

On my HGX GPU cluster, the following error appears as soon as training runs start.
It makes us worry about the reliability of our AI models.

Have you ever had this error? Do you have any ideas?

Thanks for your help.

Best

[25033.266922] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.280589] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.300821] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.320988] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.342081] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.360507] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.380740] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.400553] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.420777] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.440911] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.461063] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.481198] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.501350] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
modinfo nvidia_peermem
filename:       /lib/modules/5.19.0-45-generic/kernel/drivers/video/nvidia-peermem.ko
version:        550.90.07
license:        Linux-OpenIB
description:    NVIDIA GPU memory plug-in
author:         Yishai Hadas
srcversion:     4F8B460B3801C5451579324
depends:        nvidia,ib_core
retpoline:      Y
name:           nvidia_peermem
vermagic:       5.19.0-45-generic SMP preempt mod_unload modversions 
parm:           peerdirect_support:Set level of support for Peer-direct, 0 [default] or 1 [legacy, for example MLNX_OFED 4.9 LTS] (int)
nvidia-smi 
Thu Oct 10 11:39:22 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:04:00.0 Off |                    0 |
| N/A   26C    P0             69W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  |   00000000:23:00.0 Off |                    0 |
| N/A   22C    P0             69W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  |   00000000:43:00.0 Off |                    0 |
| N/A   24C    P0             69W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  |   00000000:64:00.0 Off |                    0 |
| N/A   23C    P0             67W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  |   00000000:84:00.0 Off |                    0 |
| N/A   23C    P0             68W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  |   00000000:A3:00.0 Off |                    0 |
| N/A   23C    P0             68W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          On  |   00000000:C3:00.0 Off |                    0 |
| N/A   24C    P0             68W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          On  |   00000000:E4:00.0 Off |                    0 |
| N/A   23C    P0             69W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.2 LTS"

ofed_info -s
MLNX_OFED_LINUX-5.8-5.1.1.2:

uname -r
5.19.0-45-generic
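
For reference, here is a minimal sketch (not from the original report; it assumes dmesg is readable by the current user and reuses the pattern string from the log above) to count how many of these messages a node has accumulated:

# Minimal sketch: count nvidia-peermem "invalid context" messages in the kernel log.
# Assumption: `dmesg` is readable by the current user (otherwise use sudo or journalctl -k).
import subprocess

PATTERN = "nv_get_p2p_free_callback"  # string copied from the log lines above

def count_peermem_errors() -> int:
    out = subprocess.run(["dmesg"], capture_output=True, text=True, check=True).stdout
    return sum(
        1 for line in out.splitlines()
        if PATTERN in line and "ERROR detected invalid context" in line
    )

if __name__ == "__main__":
    print(f"nvidia-peermem invalid-context messages: {count_peermem_errors()}")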

drossetti commented Nov 15, 2024

[25033.266922] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing

That diagnostic is harmless.

It is related to a mitigation for issue N.4272659, see https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-535-129-03/index.html.

The full fix for issue N.4272659 is present in R560TRD1 and newer. With those drivers, that diagnostic should never appear.
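
As a quick way to check whether a node is already on a driver branch that should carry the full fix (560 or newer, per the comment above), here is a minimal sketch assuming nvidia-smi is on PATH; the 560 threshold is taken from that comment, not from NVIDIA documentation:

# Minimal sketch: report whether the installed driver branch is >= 560,
# the branch mentioned above as carrying the full fix for issue N.4272659.
# Assumption: nvidia-smi is on PATH.
import subprocess

def driver_branch() -> int:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    version = out.splitlines()[0].strip()  # e.g. "550.90.07"
    return int(version.split(".")[0])

if __name__ == "__main__":
    branch = driver_branch()
    if branch >= 560:
        print(f"Driver branch {branch}: full fix should be included.")
    else:
        print(f"Driver branch {branch}: mitigation only; the message may still appear.")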


hmeScaler commented Nov 15, 2024

Hi @drossetti

We've had NVIDIA driver crashes, and we saw a lot of these kernel log messages before the crashes.

Our second concern is the quality of the results computed by the GPUs: with so many errors (sometimes as many as 20k messages in the logs), we also wondered whether this was degrading the quality of the generated models.

Anyway, we'll try updating again.

I'll keep you posted.

Thanks again.

gyuho added a commit to leptonai/gpud that referenced this issue Jan 6, 2025

as the latest driver fixes the issue

ref. Mellanox/nv_peer_memory#120

Signed-off-by: Gyuho Lee <[email protected]>

gyuho added a commit to leptonai/gpud that referenced this issue Jan 7, 2025

as the latest driver fixes the issue

ref. Mellanox/nv_peer_memory#120

Signed-off-by: Gyuho Lee <[email protected]>

gyuho added a commit to leptonai/gpud that referenced this issue Jan 8, 2025

as the latest driver fixes the issue

ref. Mellanox/nv_peer_memory#120

Signed-off-by: Gyuho Lee <[email protected]>