nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing #120

Open
hmeScaler opened this issue Oct 10, 2024 · 2 comments

hmeScaler commented Oct 10, 2024

Hello,

On my HGX GPU cluster, the following error appears as soon as training runs start.
It makes us worry about the reliability of our AI models.

Have you ever had this error? Do you have any ideas?

Thanks for your help.

Best

[25033.266922] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.280589] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.300821] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.320988] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.342081] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.360507] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.380740] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.400553] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.420777] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.440911] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.461063] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.481198] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.501350] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
modinfo nvidia_peermem
filename:       /lib/modules/5.19.0-45-generic/kernel/drivers/video/nvidia-peermem.ko
version:        550.90.07
license:        Linux-OpenIB
description:    NVIDIA GPU memory plug-in
author:         Yishai Hadas
srcversion:     4F8B460B3801C5451579324
depends:        nvidia,ib_core
retpoline:      Y
name:           nvidia_peermem
vermagic:       5.19.0-45-generic SMP preempt mod_unload modversions 
parm:           peerdirect_support:Set level of support for Peer-direct, 0 [default] or 1 [legacy, for example MLNX_OFED 4.9 LTS] (int)
nvidia-smi 
Thu Oct 10 11:39:22 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:04:00.0 Off |                    0 |
| N/A   26C    P0             69W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  |   00000000:23:00.0 Off |                    0 |
| N/A   22C    P0             69W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  |   00000000:43:00.0 Off |                    0 |
| N/A   24C    P0             69W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  |   00000000:64:00.0 Off |                    0 |
| N/A   23C    P0             67W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  |   00000000:84:00.0 Off |                    0 |
| N/A   23C    P0             68W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  |   00000000:A3:00.0 Off |                    0 |
| N/A   23C    P0             68W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          On  |   00000000:C3:00.0 Off |                    0 |
| N/A   24C    P0             68W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          On  |   00000000:E4:00.0 Off |                    0 |
| N/A   23C    P0             69W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.2 LTS"

ofed_info -s
MLNX_OFED_LINUX-5.8-5.1.1.2:

uname -r
5.19.0-45-generic
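
For reference, here is a minimal sketch (not from the original report; it assumes dmesg is readable by the current user and reuses the pattern string from the log above) to count how many of these messages a node has accumulated:

# Minimal sketch: count nvidia-peermem "invalid context" messages in the kernel log.
# Assumption: `dmesg` is readable by the current user (otherwise use sudo or journalctl -k).
import subprocess

PATTERN = "nv_get_p2p_free_callback"  # string copied from the log lines above

def count_peermem_errors() -> int:
    out = subprocess.run(["dmesg"], capture_output=True, text=True, check=True).stdout
    return sum(
        1 for line in out.splitlines()
        if PATTERN in line and "ERROR detected invalid context" in line
    )

if __name__ == "__main__":
    print(f"nvidia-peermem invalid-context messages: {count_peermem_errors()}")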

drossetti commented Nov 15, 2024

[25033.266922] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing

That diagnostic is harmless.

It is related to a mitigation for issue N.4272659, see https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-535-129-03/index.html.

The full fix for issue N.4272659 is present in R560TRD1 and newer. With those drivers, that diagnostic should never appear.
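
As a quick way to check whether a node is already on a driver branch that should carry the full fix (560 or newer, per the comment above), here is a minimal sketch assuming nvidia-smi is on PATH; the 560 threshold is taken from that comment, not from NVIDIA documentation:

# Minimal sketch: report whether the installed driver branch is >= 560,
# the branch mentioned above as carrying the full fix for issue N.4272659.
# Assumption: nvidia-smi is on PATH.
import subprocess

def driver_branch() -> int:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    version = out.splitlines()[0].strip()  # e.g. "550.90.07"
    return int(version.split(".")[0])

if __name__ == "__main__":
    branch = driver_branch()
    if branch >= 560:
        print(f"Driver branch {branch}: full fix should be included.")
    else:
        print(f"Driver branch {branch}: mitigation only; the message may still appear.")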


hmeScaler commented Nov 15, 2024

Hi @drossetti

We've had NVIDIA driver crashes, and we saw a lot of these kernel log messages before the crashes.

Our second concern is the quality of the results computed by the GPUs: with so many errors (sometimes as many as 20k messages in the logs), we also wondered whether this was degrading the quality of the generated models.

Anyway, we'll try updating again.

I'll keep you posted.

Thanks again.

gyuho added a commit to leptonai/gpud that referenced this issue Jan 6, 2025

as the latest driver fixes the issue

ref. Mellanox/nv_peer_memory#120

Signed-off-by: Gyuho Lee <[email protected]>

gyuho added a commit to leptonai/gpud that referenced this issue Jan 7, 2025

as the latest driver fixes the issue

ref. Mellanox/nv_peer_memory#120

Signed-off-by: Gyuho Lee <[email protected]>

gyuho added a commit to leptonai/gpud that referenced this issue Jan 8, 2025

as the latest driver fixes the issue

ref. Mellanox/nv_peer_memory#120

Signed-off-by: Gyuho Lee <[email protected]>