Failure after suspend/resume? #253

bmartin427 · 2023-10-11T02:05:12Z

I have acceleration working fine on my media PC, as long as I try it soon after boot. However, I suspend this PC between uses, and acceleration never works after such a cycle until I reboot. Every other GPU function I've tested keeps working after the failure: OpenGL, VDPAU, etc. are all fine. Hardware is a GeForce GT 1030, OS is Ubuntu 22.04, NVIDIA driver version is 535.113.01, and nvidia-vaapi-driver version is git 0a924c.

The first time I try running vainfo after a resume, I get:

$ NVD_LOG=1 NVD_BACKEND=egl vainfo
libva info: VA-API version 1.14.0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/nvidia_drv_video.so
libva info: Found init function __vaDriverInit_1_0
      4007.609815912 [1538-1538] ../src/vabackend.c:2171       __vaDriverInit_1_0 Initialising NVIDIA VA-API Driver: 10
      4007.609902233 [1538-1538] ../src/vabackend.c:2180       __vaDriverInit_1_0 Now have 0 (0 max) instances
      4007.609961457 [1538-1538] ../src/vabackend.c:2203       __vaDriverInit_1_0 Selecting EGL backend
      4007.624392478 [1538-1538] ../src/export-buf.c: 132       findGPUIndexFromFd Defaulting to CUDA GPU ID 0. Use NVD_GPU to select a specific CUDA GPU
      4007.624415595 [1538-1538] ../src/export-buf.c: 149       findGPUIndexFromFd Looking for GPU index: 0
      4007.627540148 [1538-1538] ../src/export-buf.c: 161       findGPUIndexFromFd Found 3 EGL devices
      4007.628336459 [1538-1538] ../src/export-buf.c: 170       findGPUIndexFromFd Got EGL_CUDA_DEVICE_NV value '0' for EGLDevice 0
      4007.628348471 [1538-1538] ../src/export-buf.c: 191       findGPUIndexFromFd Selecting EGLDevice 0
      4007.630274926 [1538-1538] ../src/export-buf.c: 260         egl_initExporter Driver supports 16-bit surfaces
      4007.631365261 [1538-1538] ../src/vabackend.c:2236       __vaDriverInit_1_0 CUDA ERROR 'unknown error' (999)

      4007.631377762 [1538-1538] ../src/export-buf.c:  61      egl_releaseExporter Releasing exporter, 0 outstanding frames
      4007.631391172 [1538-1538] ../src/export-buf.c:  78      egl_releaseExporter Done releasing frames
libva error: /usr/lib/x86_64-linux-gnu/dri/nvidia_drv_video.so init failed

Also, the following lines appear in dmesg during that first vainfo query:

[ 4007.631181] NVRM: GPU at PCI:0000:01:00: GPU-cd29aa0b-44a2-8266-14a3-1f03d08167a1
[ 4007.631188] NVRM: Xid (PCI:0000:01:00): 31, pid=538, name=modprobe, Ch 00000002, intr 10000000. MMU Fault: ENGINE HOST6 HUBCLIENT_HOST faulted @ 0x1_01011000. Fault is of type FAULT_PDE ACCESS_TYPE_READ
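
For anyone comparing their own kernel log against the lines above, here is a minimal sketch that pulls the Xid code out of NVRM lines. It assumes the `NVRM: Xid (PCI:...): <code>,` layout shown in this report; the function name `xid_codes` is made up for illustration.

```shell
#!/bin/sh
# Print the Xid code from each NVRM Xid line read on stdin.
# Assumes the "NVRM: Xid (PCI:...): <code>," layout seen in the dmesg lines above.
# Lines that do not match (such as the "GPU at PCI" line) print nothing.
xid_codes() {
    sed -n 's/.*NVRM: Xid ([^)]*): \([0-9][0-9]*\),.*/\1/p'
}

# Typical use (reading the kernel log may need root on some systems):
#   dmesg | xid_codes | sort | uniq -c
```

Counting codes per boot makes it easier to tell whether the Xid 31 MMU fault shows up only once after a resume, as described here, or repeatedly.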

Subsequent calls to vainfo produce no more dmesg output, and the console output changes somewhat:

$ NVD_LOG=1 NVD_BACKEND=egl vainfo
libva info: VA-API version 1.14.0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/nvidia_drv_video.so
      4812.162229425 [2037-2037] ../src/vabackend.c: 138                     init CUDA ERROR 'unknown error' (999)

libva info: Found init function __vaDriverInit_1_0
      4812.162304641 [2037-2037] ../src/vabackend.c:2171       __vaDriverInit_1_0 Initialising NVIDIA VA-API Driver: 10
      4812.162318470 [2037-2037] ../src/vabackend.c:2180       __vaDriverInit_1_0 Now have 0 (0 max) instances
      4812.162330552 [2037-2037] ../src/vabackend.c:2203       __vaDriverInit_1_0 Selecting EGL backend
      4812.175101148 [2037-2037] ../src/export-buf.c: 132       findGPUIndexFromFd Defaulting to CUDA GPU ID 0. Use NVD_GPU to select a specific CUDA GPU
      4812.175124754 [2037-2037] ../src/export-buf.c: 149       findGPUIndexFromFd Looking for GPU index: 0
      4812.178137619 [2037-2037] ../src/export-buf.c: 161       findGPUIndexFromFd Found 3 EGL devices
      4812.180277494 [2037-2037] ../src/export-buf.c: 196       findGPUIndexFromFd No EGL_CUDA_DEVICE_NV support for EGLDevice 0
      4812.180296001 [2037-2037] ../src/export-buf.c: 196       findGPUIndexFromFd No EGL_CUDA_DEVICE_NV support for EGLDevice 1
      4812.180308433 [2037-2037] ../src/export-buf.c: 199       findGPUIndexFromFd No DRM device file for EGLDevice 2
      4812.180317372 [2037-2037] ../src/export-buf.c: 202       findGPUIndexFromFd No match found, falling back to default device
      4812.180326521 [2037-2037] ../src/vabackend.c:2231       __vaDriverInit_1_0 Exporter failed
libva error: /usr/lib/x86_64-linux-gnu/dri/nvidia_drv_video.so init failed
libva info: va_openDriver() returns 1
vaInitialize failed with error code 1 (operation failed),exit

I have tried direct backend instead of egl, and get no different results, aside from some slightly different error text.

I'm not 100% certain the suspend and resume is the cause. I attempted a quick suspend/resume cycle to troubleshoot this problem and was unable to reproduce it, but it always happens if I leave the machine suspended for a normal amount of time (hours). So possibly something about the elapsed time is involved.

I have also tried leaving Firefox running during a suspend/resume, thinking that acceleration might continue to function if the initialization process did not have to be repeated. However, Firefox seems to explode immediately upon resume, so this is not an option.

bmartin427 · 2023-10-11T17:01:35Z

For reference here's a session using the direct backend. The first query was before a suspend/resume, the latter two were after.

brad@fx2:~$ NVD_LOG=1 NVD_BACKEND=direct vainfo
libva info: VA-API version 1.14.0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/nvidia_drv_video.so
libva info: Found init function __vaDriverInit_1_0
      4089.149695354 [3287-3287] ../src/vabackend.c:2171       __vaDriverInit_1_0 Initialising NVIDIA VA-API Driver: 10
      4089.149724484 [3287-3287] ../src/vabackend.c:2180       __vaDriverInit_1_0 Now have 0 (0 max) instances
      4089.149746525 [3287-3287] ../src/vabackend.c:2206       __vaDriverInit_1_0 Selecting Direct backend
      4089.163510502 [3287-3287] ../src/direct/direct-export-buf.c:  85      direct_initExporter Found NVIDIA GPU 0 at /dev/dri/renderD128
      4089.163532980 [3287-3287] ../src/direct/nv-driver.c: 223            init_nvdriver Initing nvdriver...
      4089.163541389 [3287-3287] ../src/direct/nv-driver.c: 228            init_nvdriver Got dev info: 100 1 0 fe
      4089.163612291 [3287-3287] ../src/direct/nv-driver.c: 246            init_nvdriver NVIDIA kernel driver version: 535.113.01, major version: 535
libva info: va_openDriver() returns 0
vainfo: VA-API version: 1.14 (libva 2.12.0)
vainfo: Driver version: VA-API NVDEC driver [direct backend]
vainfo: Supported profile and entrypoints
      VAProfileMPEG2Simple            :	VAEntrypointVLD
      VAProfileMPEG2Main              :	VAEntrypointVLD
      VAProfileVC1Simple              :	VAEntrypointVLD
      VAProfileVC1Main                :	VAEntrypointVLD
      VAProfileVC1Advanced            :	VAEntrypointVLD
      VAProfileH264Main               :	VAEntrypointVLD
      VAProfileH264High               :	VAEntrypointVLD
      VAProfileH264ConstrainedBaseline:	VAEntrypointVLD
      VAProfileHEVCMain               :	VAEntrypointVLD
      VAProfileVP9Profile0            :	VAEntrypointVLD
      VAProfileHEVCMain10             :	VAEntrypointVLD
      VAProfileHEVCMain12             :	VAEntrypointVLD
      VAProfileVP9Profile2            :	VAEntrypointVLD
      4089.308220963 [3287-3287] ../src/vabackend.c:2081              nvTerminate Terminating 0x55933e7e4d40
      4089.308325527 [3287-3287] ../src/vabackend.c:2095              nvTerminate Now have 0 (0 max) instances
brad@fx2:~$ NVD_LOG=1 NVD_BACKEND=direct vainfo
libva info: VA-API version 1.14.0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/nvidia_drv_video.so
libva info: Found init function __vaDriverInit_1_0
      4221.457787068 [3540-3540] ../src/vabackend.c:2171       __vaDriverInit_1_0 Initialising NVIDIA VA-API Driver: 10
      4221.457808648 [3540-3540] ../src/vabackend.c:2180       __vaDriverInit_1_0 Now have 0 (0 max) instances
      4221.457820940 [3540-3540] ../src/vabackend.c:2206       __vaDriverInit_1_0 Selecting Direct backend
      4221.472699819 [3540-3540] ../src/direct/direct-export-buf.c:  85      direct_initExporter Found NVIDIA GPU 0 at /dev/dri/renderD128
      4221.472724892 [3540-3540] ../src/direct/nv-driver.c: 223            init_nvdriver Initing nvdriver...
      4221.472737114 [3540-3540] ../src/direct/nv-driver.c: 228            init_nvdriver Got dev info: 100 1 0 fe
      4221.472851581 [3540-3540] ../src/direct/nv-driver.c: 246            init_nvdriver NVIDIA kernel driver version: 535.113.01, major version: 535
      4221.474599881 [3540-3540] ../src/vabackend.c:2236       __vaDriverInit_1_0 CUDA ERROR 'unknown error' (999)

libva error: /usr/lib/x86_64-linux-gnu/dri/nvidia_drv_video.so init failed
libva info: va_openDriver() returns 1
vaInitialize failed with error code 1 (operation failed),exit
brad@fx2:~$ NVD_LOG=1 NVD_BACKEND=direct vainfo
libva info: VA-API version 1.14.0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/nvidia_drv_video.so
      4226.566012274 [3543-3543] ../src/vabackend.c: 138                     init CUDA ERROR 'unknown error' (999)

libva info: Found init function __vaDriverInit_1_0
      4226.566085396 [3543-3543] ../src/vabackend.c:2171       __vaDriverInit_1_0 Initialising NVIDIA VA-API Driver: 10
      4226.566098805 [3543-3543] ../src/vabackend.c:2180       __vaDriverInit_1_0 Now have 0 (0 max) instances
      4226.566110469 [3543-3543] ../src/vabackend.c:2206       __vaDriverInit_1_0 Selecting Direct backend
      4226.578729192 [3543-3543] ../src/direct/direct-export-buf.c:  85      direct_initExporter Found NVIDIA GPU 0 at /dev/dri/renderD128
      4226.578750354 [3543-3543] ../src/direct/nv-driver.c: 223            init_nvdriver Initing nvdriver...
      4226.578759782 [3543-3543] ../src/direct/nv-driver.c: 228            init_nvdriver Got dev info: 100 1 0 fe
      4226.578826339 [3543-3543] ../src/direct/nv-driver.c: 246            init_nvdriver NVIDIA kernel driver version: 535.113.01, major version: 535
      4226.578960222 [3543-3543] ../src/direct/direct-export-buf.c:  23       findGPUIndexFromFd CUDA ERROR 'initialization error' (3)

      4226.578971746 [3543-3543] ../src/vabackend.c:2236       __vaDriverInit_1_0 CUDA ERROR 'initialization error' (3)

libva error: /usr/lib/x86_64-linux-gnu/dri/nvidia_drv_video.so init failed
libva info: va_openDriver() returns 1
vaInitialize failed with error code 1 (operation failed),exit

I also have the same two dmesg lines as before.
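
The three runs above differ mainly in which CUDA error is logged. A small sketch for summarising that from a log, assuming the `CUDA ERROR '<message>' (<code>)` layout that `NVD_LOG=1` prints in this thread; the function name `cuda_errors` is hypothetical. In the CUDA driver API, 999 is `CUDA_ERROR_UNKNOWN` and 3 is `CUDA_ERROR_NOT_INITIALIZED`.

```shell
#!/bin/sh
# Print "<code> <message>" for each CUDA error line read on stdin,
# assuming the "CUDA ERROR '<message>' (<code>)" layout shown in the logs above.
cuda_errors() {
    sed -n "s/.*CUDA ERROR '\([^']*\)' (\([0-9][0-9]*\)).*/\2 \1/p"
}

# Typical use (2>&1 so the log is captured whichever stream it is written to):
#   NVD_LOG=1 vainfo 2>&1 | cuda_errors | sort | uniq -c
```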

rcoacci · 2023-10-19T20:17:18Z

I'm seeing something related to this, but in my case Firefox crashes upon resuming. I've just disabled nvidia-vaapi-driver completely and will see if the crashes continue.
I've also tried setting up NVIDIA's PreserveVideoMemoryAllocations, but it made gnome-shell impossible to use after resume (which is even worse...)

elFarto · 2023-10-29T10:27:04Z

Unfortunately this is an issue with the NVIDIA driver, and there's not much I can do about it. The driver really doesn't like having any sort of NVDEC context left active over a suspend/resume; that breaks the driver until a reboot is done.

bmartin427 · 2023-10-29T16:50:38Z

Hmm. If firefox is closed before I suspend, then is there anything else I can do to prevent NVDEC context from being left active? Is there something else I need to explicitly kill, or is it really just that I've ever used it at all?

hhfeuer · 2023-11-10T09:26:48Z

Known issue with the nvidia driver. After suspend/resume, the nvidia-uvm module is defunct even if not used. The workaround is to unload and reload it.

mikejaques · 2023-11-22T21:26:27Z

Can confirm this. I wrote up a specific "how to" for Pop!_OS users just yesterday, but after resume from suspend HW acceleration in Firefox is broken. Only a reboot fixes it. I haven't tried unloading/reloading but that's not really a solution for the average user.

Question: it's a "known issue" with the NVIDIA driver, but is there any actual confirmation or bug tracking within NVIDIA as a company? Does this bug affect Wayland or only X11 windowing systems? I ask because I'm only moderately knowledgeable about Linux, with nearly ZERO experience with Wayland, so I don't know if Wayland even requires a VA-API layer for hardware acceleration of video codecs.

elFarto · 2023-12-17T10:01:16Z

I'm not sure if there's an actual NVIDIA bug for it. I've bumped the issue[1] in the NVIDIA forums and we'll see if we get a response.

[1] https://forums.developer.nvidia.com/t/xid-31-after-wakeup-from-sleep/139870/6

MageSlayer · 2024-01-09T18:01:30Z

Having the same issue on a laptop with a secondary NVIDIA card in a PRIME configuration.
Hardware acceleration fails after resume from suspend.

$ NVD_LOG=1 NVD_BACKEND=direct vainfo
libva info: VA-API version 1.20.0
libva error: vaGetDriverNames() failed with unknown libva error
libva info: User environment variable requested driver 'nvidia'
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/nvidia_drv_video.so
    135775.283643377 [30120-30120] ../src/vabackend.c: 130                     init CUDA ERROR 'unknown error' (999)

libva info: Found init function __vaDriverInit_1_0
    135775.283662988 [30120-30120] ../src/vabackend.c:2145       __vaDriverInit_1_0 Initialising NVIDIA VA-API Driver: 10
    135775.283665133 [30120-30120] ../src/vabackend.c:2154       __vaDriverInit_1_0 Now have 0 (0 max) instances
    135775.283667649 [30120-30120] ../src/vabackend.c:2180       __vaDriverInit_1_0 Selecting Direct backend
    135775.286633777 [30120-30120] ../src/backend-common.c:  31            isNvidiaDrmFd Invalid driver for DRM device: i915
    135775.286665005 [30120-30120] ../src/direct/direct-export-buf.c:  85      direct_initExporter Found NVIDIA GPU 0 at /dev/dri/renderD129
    135775.286668121 [30120-30120] ../src/direct/nv-driver.c: 246            init_nvdriver Initing nvdriver...
    135775.286683125 [30120-30120] ../src/direct/nv-driver.c: 264            init_nvdriver NVIDIA kernel driver version: , major version: 0, minor version: 0
    135775.286685882 [30120-30120] ../src/direct/nv-driver.c: 271            init_nvdriver Got dev info: 100 1 2 6
    135775.286771896 [30120-30120] ../src/direct/direct-export-buf.c:  23       findGPUIndexFromFd CUDA ERROR 'initialization error' (3)

    135775.286774654 [30120-30120] ../src/vabackend.c:2210       __vaDriverInit_1_0 CUDA ERROR 'initialization error' (3)

libva error: /usr/lib/x86_64-linux-gnu/dri/nvidia_drv_video.so init failed
libva info: va_openDriver() returns 1
vaInitialize failed with error code 1 (operation failed),exit

Reloading nvidia-uvm solves the issue:

# rmmod nvidia-uvm
# modprobe nvidia-uvm
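
The two commands above can be wrapped so the reload is skipped while something still holds the module. This is only a sketch under stated assumptions: `reload_nvidia_uvm`, `DRY_RUN`, and `SYSFS_MODULE_DIR` are names invented for illustration, and the use count is read from the module's `refcnt` file in sysfs.

```shell
#!/bin/sh
# Reload nvidia_uvm, following the workaround above. Needs root unless DRY_RUN=1.
# DRY_RUN=1 only prints the commands. SYSFS_MODULE_DIR is a hypothetical override
# for testing; by default the real /sys/module is used.
reload_nvidia_uvm() {
    moddir="${SYSFS_MODULE_DIR:-/sys/module}/nvidia_uvm"
    run=""
    [ "${DRY_RUN:-0}" = "1" ] && run="echo"

    # Only try to unload if the module is currently loaded.
    if [ -r "$moddir/refcnt" ]; then
        users=$(cat "$moddir/refcnt")
        if [ "$users" -gt 0 ]; then
            echo "nvidia_uvm is in use by $users user(s); close CUDA/NVDEC clients first" >&2
            return 1
        fi
        $run rmmod nvidia_uvm || return 1
    fi
    $run modprobe nvidia_uvm
}

# Typical use, as root after a resume:
#   reload_nvidia_uvm
```

One possible place to call this is a script under /usr/lib/systemd/system-sleep/, which systemd runs with `post` as its first argument after resume. Whether that timing is early enough to help has not been verified in this thread.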

mirh · 2024-01-28T21:40:16Z

~~Aren't standby problems related to the stuff discussed in #182? And isn't it all fixed in 545+?~~

MageSlayer · 2024-01-29T08:41:00Z

Last time I tried some 535 driver, it refused to decrease cooler speed after some video playback. My laptop sounded like a jet-plane & never stopped unless rebooted.

I'll try 545 this time. Thanks for the suggestion.

MageSlayer · 2024-01-30T08:58:50Z

I checked version 545.23.08, and it looks like they've fixed both the cooler speed issue and the hw acceleration issue after suspend/resume.

I think the issue might be closed now.

MageSlayer · 2024-02-01T21:06:42Z

> I checked version 545.23.08, and it looks like they've fixed both the cooler speed issue and the hw acceleration issue after suspend/resume.
>
> I think the issue might be closed now.

Looks like I was too quick.
The suspend/resume hw acceleration bug is still there in driver 545.23.08.
vainfo emits an error and Firefox acceleration is missing after the third or fourth resume from suspend.

strahe · 2024-05-30T06:08:59Z

This bug is still there in driver 550.78

strahe · 2024-06-13T05:10:04Z

I am using Arch Linux. The instructions here solved my problem; I hope they will be useful to you.

tashrifbillah · 2024-09-27T14:02:54Z

My NVIDIA driver is 550.54.14. I am on a Redhat 9 environment. I have the same issue. Does anyone actually have a solution or workaround? I know @strahe posted something but it is unclear in his link what the instruction was.

MageSlayer · 2024-09-27T18:03:56Z

NVidia driver 535.183.06-1
Linux 6.6.41

Uncommenting

options nvidia-current NVreg_PreserveVideoMemoryAllocations=1

... in /etc/modprobe.d/nvidia-options.conf results in errors in syslog and my laptop just stops suspending at all :)
Perhaps some other magic is required.

Commenting that line back brings back Firefox crashes, but suspend starts working.
I guess I'll stick to suspend for now :)

mirh · 2024-09-27T19:52:17Z

The instructions clearly mention that you have to enable the services too.
On top of that, I'm not sure the modprobe option is reliable 100% of the time, so try setting nvidia.NVreg_PreserveVideoMemoryAllocations=1 directly on the kernel command line.
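
Both points can be checked after a reboot. A minimal sketch, assuming the loaded module lists its parameters in /proc/driver/nvidia/params as `Name: value` lines and that the driver package ships the nvidia-suspend, nvidia-resume, and nvidia-hibernate systemd units; the function name `preserve_vram_enabled` is made up for illustration.

```shell
#!/bin/sh
# Succeed if the loaded nvidia module reports PreserveVideoMemoryAllocations: 1.
# The default path is where the NVIDIA kernel module exposes its parameters;
# a different file can be passed as $1 (used here only for testing).
preserve_vram_enabled() {
    params="${1:-/proc/driver/nvidia/params}"
    grep -Eq '^PreserveVideoMemoryAllocations:[[:space:]]*1[[:space:]]*$' "$params" 2>/dev/null
}

# Typical use:
#   preserve_vram_enabled && echo "enabled" || echo "not enabled"
#   systemctl is-enabled nvidia-suspend.service nvidia-resume.service nvidia-hibernate.service
```

If the function reports "not enabled" even though the modprobe.d option is set, the option is probably not reaching the module, which is the case the kernel command line form is meant to cover.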

MageSlayer · 2024-09-27T20:03:42Z

That's the Arch wiki, and I am on Devuan, so I don't have any of those *.service daemons.
Starting nvidia-persistenced just fails with some strange error.

hhfeuer · 2024-09-27T22:47:33Z

The Nvidia suspend/resume mechanism relies on systemd hacks. Since Devuan promises to stay systemd-free, they should get in contact with the Gentoo devs who maintain elogind for the same purpose, incorporating the needed Nvidia hooks. Please support your distro.

MageSlayer · 2024-09-28T08:08:26Z

> The Nvidia suspend/resume mechanism relies on systemd hacks. Since Devuan promises to stay systemd-free, they should get in contact with the Gentoo devs who maintain elogind for the same purpose, incorporating the needed Nvidia hooks. Please support your distro.

https://dev1galaxy.org/viewtopic.php?id=6860

mirh · 2024-09-28T12:47:37Z

See gentoo/gentoo#38482

igravious · 2024-10-04T17:06:15Z

> Known issue with the nvidia driver. After suspend/resume, the nvidia-uvm module is defunct even if not used. The workaround is to unload and reload it.

this works! :)

MageSlayer · 2024-10-14T17:41:43Z

> > The Nvidia suspend/resume mechanism relies on systemd hacks. Since Devuan promises to stay systemd-free, they should get in contact with the Gentoo devs who maintain elogind for the same purpose, incorporating the needed Nvidia hooks. Please support your distro.
>
> https://dev1galaxy.org/viewtopic.php?id=6860

Devuan can be fixed quite easily.
See https://dev1galaxy.org/viewtopic.php?pid=52640#p52640

mirh · 2024-10-25T12:45:36Z

@elFarto it would be really nice if you could mention this detail about suspending in the readme

nerijus · 2024-10-25T14:50:23Z

You could do a PR for this :)

Certainty1396 · 2024-10-27T03:05:45Z

On Fedora 40 I found a workaround: shut down any process using NVIDIA decoding before suspend or hibernate.
vim /etc/systemd/system/systemd-suspend.service.wants/nvidia-suspend.service

[Unit]
Description=NVIDIA system suspend actions
Before=systemd-suspend.service

[Service]
Type=oneshot
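# systemd does not run ExecStart= through a shell, so each pkill gets its own line.
# The leading "-" ignores the non-zero exit when no matching process exists.
ExecStart=-/usr/bin/pkill -f firefox
ExecStart=-/usr/bin/pkill -f VLC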
ExecStart=/usr/bin/logger -t suspend -s "nvidia-suspend.service"
ExecStart=/usr/bin/nvidia-sleep.sh "suspend"

[Install]
WantedBy=systemd-suspend.service

elFarto added the nvidia-issue label (This is an issue with the NVIDIA GPU driver) on Oct 29, 2023

mirh mentioned this issue on Jan 30, 2024: this driver breaks hibernation #236 (open)

mirh mentioned this issue on Jul 14, 2024: libva error: init failed #299 (closed)
