Failure after suspend/resume? #253

Open
bmartin427 opened this issue Oct 11, 2023 · 26 comments
Labels
nvidia-issue This is an issue with the NVIDIA GPU driver

Comments

@bmartin427

I have acceleration working fine on my media PC, as long as I try it soon after boot. However I suspend this PC in between uses, and acceleration never works following such a cycle until I reboot. Every other GPU function I've tested continues working after the failure: OpenGL, VDPAU, etc are all fine. Hardware is a GeForce GT 1030, OS is Ubuntu 22.04, nvidia driver version is 535.113.01, and nvidia-vaapi-driver version is git 0a924c.

The first time I try running vainfo after a resume, I get:

$ NVD_LOG=1 NVD_BACKEND=egl vainfo
libva info: VA-API version 1.14.0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/nvidia_drv_video.so
libva info: Found init function __vaDriverInit_1_0
      4007.609815912 [1538-1538] ../src/vabackend.c:2171       __vaDriverInit_1_0 Initialising NVIDIA VA-API Driver: 10
      4007.609902233 [1538-1538] ../src/vabackend.c:2180       __vaDriverInit_1_0 Now have 0 (0 max) instances
      4007.609961457 [1538-1538] ../src/vabackend.c:2203       __vaDriverInit_1_0 Selecting EGL backend
      4007.624392478 [1538-1538] ../src/export-buf.c: 132       findGPUIndexFromFd Defaulting to CUDA GPU ID 0. Use NVD_GPU to select a specific CUDA GPU
      4007.624415595 [1538-1538] ../src/export-buf.c: 149       findGPUIndexFromFd Looking for GPU index: 0
      4007.627540148 [1538-1538] ../src/export-buf.c: 161       findGPUIndexFromFd Found 3 EGL devices
      4007.628336459 [1538-1538] ../src/export-buf.c: 170       findGPUIndexFromFd Got EGL_CUDA_DEVICE_NV value '0' for EGLDevice 0
      4007.628348471 [1538-1538] ../src/export-buf.c: 191       findGPUIndexFromFd Selecting EGLDevice 0
      4007.630274926 [1538-1538] ../src/export-buf.c: 260         egl_initExporter Driver supports 16-bit surfaces
      4007.631365261 [1538-1538] ../src/vabackend.c:2236       __vaDriverInit_1_0 CUDA ERROR 'unknown error' (999)

      4007.631377762 [1538-1538] ../src/export-buf.c:  61      egl_releaseExporter Releasing exporter, 0 outstanding frames
      4007.631391172 [1538-1538] ../src/export-buf.c:  78      egl_releaseExporter Done releasing frames
libva error: /usr/lib/x86_64-linux-gnu/dri/nvidia_drv_video.so init failed

Also, the following lines appear in dmesg during that first vainfo query:

[ 4007.631181] NVRM: GPU at PCI:0000:01:00: GPU-cd29aa0b-44a2-8266-14a3-1f03d08167a1
[ 4007.631188] NVRM: Xid (PCI:0000:01:00): 31, pid=538, name=modprobe, Ch 00000002, intr 10000000. MMU Fault: ENGINE HOST6 HUBCLIENT_HOST faulted @ 0x1_01011000. Fault is of type FAULT_PDE ACCESS_TYPE_READ

Subsequent calls to vainfo produce no more dmesg output, and the console output changes somewhat:

$ NVD_LOG=1 NVD_BACKEND=egl vainfo
libva info: VA-API version 1.14.0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/nvidia_drv_video.so
      4812.162229425 [2037-2037] ../src/vabackend.c: 138                     init CUDA ERROR 'unknown error' (999)

libva info: Found init function __vaDriverInit_1_0
      4812.162304641 [2037-2037] ../src/vabackend.c:2171       __vaDriverInit_1_0 Initialising NVIDIA VA-API Driver: 10
      4812.162318470 [2037-2037] ../src/vabackend.c:2180       __vaDriverInit_1_0 Now have 0 (0 max) instances
      4812.162330552 [2037-2037] ../src/vabackend.c:2203       __vaDriverInit_1_0 Selecting EGL backend
      4812.175101148 [2037-2037] ../src/export-buf.c: 132       findGPUIndexFromFd Defaulting to CUDA GPU ID 0. Use NVD_GPU to select a specific CUDA GPU
      4812.175124754 [2037-2037] ../src/export-buf.c: 149       findGPUIndexFromFd Looking for GPU index: 0
      4812.178137619 [2037-2037] ../src/export-buf.c: 161       findGPUIndexFromFd Found 3 EGL devices
      4812.180277494 [2037-2037] ../src/export-buf.c: 196       findGPUIndexFromFd No EGL_CUDA_DEVICE_NV support for EGLDevice 0
      4812.180296001 [2037-2037] ../src/export-buf.c: 196       findGPUIndexFromFd No EGL_CUDA_DEVICE_NV support for EGLDevice 1
      4812.180308433 [2037-2037] ../src/export-buf.c: 199       findGPUIndexFromFd No DRM device file for EGLDevice 2
      4812.180317372 [2037-2037] ../src/export-buf.c: 202       findGPUIndexFromFd No match found, falling back to default device
      4812.180326521 [2037-2037] ../src/vabackend.c:2231       __vaDriverInit_1_0 Exporter failed
libva error: /usr/lib/x86_64-linux-gnu/dri/nvidia_drv_video.so init failed
libva info: va_openDriver() returns 1
vaInitialize failed with error code 1 (operation failed),exit

I have also tried the direct backend instead of egl and get the same result, aside from slightly different error text.
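When comparing runs like the ones in this report, it can help to reduce each log to the distinct CUDA errors it contains. The following is a minimal sketch, not part of the driver: the function name `cuda_errors` is made up for illustration, and the pattern assumes the `CUDA ERROR '<text>' (<code>)` message format seen in the `NVD_LOG=1` output quoted in this thread.

```shell
#!/bin/sh
# Print each distinct "CUDA ERROR '<text>' (<code>)" message found on stdin.
# The message format is assumed from the NVD_LOG=1 output quoted in this thread.
cuda_errors() {
    grep -o "CUDA ERROR '[^']*' ([0-9]*)" | sort -u
}

# Example use (needs the driver installed):
#   NVD_LOG=1 vainfo 2>&1 | cuda_errors
```

Running it before and after a suspend/resume cycle gives a compact way to see whether the failure is the 999 "unknown error", the later "initialization error" (3), or something else.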

I'm not 100% certain the suspend and resume is the cause. I have attempted a quick suspend/resume cycle in order to troubleshoot this problem and been unable to reproduce; but it always happens if I leave it suspended for a normal amount of time (hours). So possibly something else about the elapsed time is involved.

I also have tried to leave firefox running during a suspend/resume, thinking that acceleration might continue to function if I just didn't have to repeat the initialization process, however firefox seems to explode immediately upon resume, so this is not an option.

@bmartin427
Author

For reference here's a session using the direct backend. The first query was before a suspend/resume, the latter two were after.

brad@fx2:~$ NVD_LOG=1 NVD_BACKEND=direct vainfo
libva info: VA-API version 1.14.0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/nvidia_drv_video.so
libva info: Found init function __vaDriverInit_1_0
      4089.149695354 [3287-3287] ../src/vabackend.c:2171       __vaDriverInit_1_0 Initialising NVIDIA VA-API Driver: 10
      4089.149724484 [3287-3287] ../src/vabackend.c:2180       __vaDriverInit_1_0 Now have 0 (0 max) instances
      4089.149746525 [3287-3287] ../src/vabackend.c:2206       __vaDriverInit_1_0 Selecting Direct backend
      4089.163510502 [3287-3287] ../src/direct/direct-export-buf.c:  85      direct_initExporter Found NVIDIA GPU 0 at /dev/dri/renderD128
      4089.163532980 [3287-3287] ../src/direct/nv-driver.c: 223            init_nvdriver Initing nvdriver...
      4089.163541389 [3287-3287] ../src/direct/nv-driver.c: 228            init_nvdriver Got dev info: 100 1 0 fe
      4089.163612291 [3287-3287] ../src/direct/nv-driver.c: 246            init_nvdriver NVIDIA kernel driver version: 535.113.01, major version: 535
libva info: va_openDriver() returns 0
vainfo: VA-API version: 1.14 (libva 2.12.0)
vainfo: Driver version: VA-API NVDEC driver [direct backend]
vainfo: Supported profile and entrypoints
      VAProfileMPEG2Simple            :	VAEntrypointVLD
      VAProfileMPEG2Main              :	VAEntrypointVLD
      VAProfileVC1Simple              :	VAEntrypointVLD
      VAProfileVC1Main                :	VAEntrypointVLD
      VAProfileVC1Advanced            :	VAEntrypointVLD
      VAProfileH264Main               :	VAEntrypointVLD
      VAProfileH264High               :	VAEntrypointVLD
      VAProfileH264ConstrainedBaseline:	VAEntrypointVLD
      VAProfileHEVCMain               :	VAEntrypointVLD
      VAProfileVP9Profile0            :	VAEntrypointVLD
      VAProfileHEVCMain10             :	VAEntrypointVLD
      VAProfileHEVCMain12             :	VAEntrypointVLD
      VAProfileVP9Profile2            :	VAEntrypointVLD
      4089.308220963 [3287-3287] ../src/vabackend.c:2081              nvTerminate Terminating 0x55933e7e4d40
      4089.308325527 [3287-3287] ../src/vabackend.c:2095              nvTerminate Now have 0 (0 max) instances
brad@fx2:~$ NVD_LOG=1 NVD_BACKEND=direct vainfo
libva info: VA-API version 1.14.0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/nvidia_drv_video.so
libva info: Found init function __vaDriverInit_1_0
      4221.457787068 [3540-3540] ../src/vabackend.c:2171       __vaDriverInit_1_0 Initialising NVIDIA VA-API Driver: 10
      4221.457808648 [3540-3540] ../src/vabackend.c:2180       __vaDriverInit_1_0 Now have 0 (0 max) instances
      4221.457820940 [3540-3540] ../src/vabackend.c:2206       __vaDriverInit_1_0 Selecting Direct backend
      4221.472699819 [3540-3540] ../src/direct/direct-export-buf.c:  85      direct_initExporter Found NVIDIA GPU 0 at /dev/dri/renderD128
      4221.472724892 [3540-3540] ../src/direct/nv-driver.c: 223            init_nvdriver Initing nvdriver...
      4221.472737114 [3540-3540] ../src/direct/nv-driver.c: 228            init_nvdriver Got dev info: 100 1 0 fe
      4221.472851581 [3540-3540] ../src/direct/nv-driver.c: 246            init_nvdriver NVIDIA kernel driver version: 535.113.01, major version: 535
      4221.474599881 [3540-3540] ../src/vabackend.c:2236       __vaDriverInit_1_0 CUDA ERROR 'unknown error' (999)

libva error: /usr/lib/x86_64-linux-gnu/dri/nvidia_drv_video.so init failed
libva info: va_openDriver() returns 1
vaInitialize failed with error code 1 (operation failed),exit
brad@fx2:~$ NVD_LOG=1 NVD_BACKEND=direct vainfo
libva info: VA-API version 1.14.0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/nvidia_drv_video.so
      4226.566012274 [3543-3543] ../src/vabackend.c: 138                     init CUDA ERROR 'unknown error' (999)

libva info: Found init function __vaDriverInit_1_0
      4226.566085396 [3543-3543] ../src/vabackend.c:2171       __vaDriverInit_1_0 Initialising NVIDIA VA-API Driver: 10
      4226.566098805 [3543-3543] ../src/vabackend.c:2180       __vaDriverInit_1_0 Now have 0 (0 max) instances
      4226.566110469 [3543-3543] ../src/vabackend.c:2206       __vaDriverInit_1_0 Selecting Direct backend
      4226.578729192 [3543-3543] ../src/direct/direct-export-buf.c:  85      direct_initExporter Found NVIDIA GPU 0 at /dev/dri/renderD128
      4226.578750354 [3543-3543] ../src/direct/nv-driver.c: 223            init_nvdriver Initing nvdriver...
      4226.578759782 [3543-3543] ../src/direct/nv-driver.c: 228            init_nvdriver Got dev info: 100 1 0 fe
      4226.578826339 [3543-3543] ../src/direct/nv-driver.c: 246            init_nvdriver NVIDIA kernel driver version: 535.113.01, major version: 535
      4226.578960222 [3543-3543] ../src/direct/direct-export-buf.c:  23       findGPUIndexFromFd CUDA ERROR 'initialization error' (3)

      4226.578971746 [3543-3543] ../src/vabackend.c:2236       __vaDriverInit_1_0 CUDA ERROR 'initialization error' (3)

libva error: /usr/lib/x86_64-linux-gnu/dri/nvidia_drv_video.so init failed
libva info: va_openDriver() returns 1
vaInitialize failed with error code 1 (operation failed),exit

I also have the same two dmesg lines as before.
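To check whether a given machine logs the same Xid event after resume, the kernel log can be filtered for NVRM Xid lines. This is a small sketch under the assumption that the lines look like the two quoted above; `xid_codes` is a hypothetical helper name, not an NVIDIA tool.

```shell
#!/bin/sh
# Print the numeric Xid code from every "NVRM: Xid (...): <code>," line on stdin.
# The line format is assumed from the dmesg excerpt quoted in this thread.
xid_codes() {
    sed -n 's/.*NVRM: Xid ([^)]*): \([0-9][0-9]*\),.*/\1/p'
}

# Example use (reading the kernel log may require root):
#   dmesg | xid_codes | sort | uniq -c
```

A count of code 31 appearing only after resume would match the behaviour described in this report.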

@rcoacci

rcoacci commented Oct 19, 2023

I'm seeing something related to this, but in my case Firefox crashes upon resuming. I've just disabled nvidia-vaapi-driver completely and will see if the crashes continue.
I've also tried setting up NVIDIA's PreserveVideoMemoryAllocations, but it made gnome-shell impossible to use after resume (which is even worse...)

@elFarto
Owner

elFarto commented Oct 29, 2023

Unfortunately this is an issue with the NVIDIA driver, and there's not much I can do about it. The driver really doesn't like having any sort of NVDEC context left active across a suspend/resume; it breaks the driver until a reboot is done.

@elFarto added the nvidia-issue label Oct 29, 2023
@bmartin427
Author

Hmm. If firefox is closed before I suspend, then is there anything else I can do to prevent NVDEC context from being left active? Is there something else I need to explicitly kill, or is it really just that I've ever used it at all?
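One way to see what else might still be holding the GPU before suspending is to list processes that have NVIDIA device nodes open. The sketch below is an illustration only: `holders` is a made-up helper, it assumes a Linux `/proc` filesystem and device nodes named `/dev/nvidia*`, and nothing in this thread confirms that closing every holder is enough to avoid the failure.

```shell
#!/bin/sh
# List "PID command" for every process that has a file open whose path matches
# the given shell pattern (default: NVIDIA device nodes). Only processes whose
# /proc/<pid>/fd directory is readable are seen, so run as root for a full view.
holders() {
    pattern=${1:-/dev/nvidia*}
    for fd in /proc/[0-9]*/fd/*; do
        # Entries can vanish between globbing and readlink; skip those.
        target=$(readlink "$fd" 2>/dev/null) || continue
        case $target in
            $pattern)
                pid=${fd#/proc/}
                pid=${pid%%/*}
                printf '%s %s\n' "$pid" "$(cat "/proc/$pid/comm" 2>/dev/null)"
                ;;
        esac
    done | sort -u
}

# Example use:
#   holders                      # anything holding /dev/nvidia*
#   holders '/dev/nvidia-uvm*'   # only the UVM device nodes
```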

@hhfeuer

hhfeuer commented Nov 10, 2023

Known issue of the nvidia driver. After suspend/resume, the nvidia-uvm module is defunct even if not used. The workaround is to unload and reload it.

@mikejaques

Can confirm this. I wrote up a specific "how to" for Pop!_OS users just yesterday, but after resume from suspend HW acceleration in Firefox is broken. Only a reboot fixes it. I haven't tried unloading/reloading but that's not really a solution for the average user.

Question: it's a "known issue" with the NVIDIA driver, but is there any actual confirmation or bug tracking within NVIDIA as a company? Does this bug affect Wayland, or only X11? I ask because I'm only moderately knowledgeable about Linux, with nearly ZERO experience with Wayland, so I don't know if Wayland even requires a VA-API layer for hardware acceleration of video codecs.

@elFarto
Owner

elFarto commented Dec 17, 2023

I'm not sure if there's an actual NVIDIA bug for it. I've bumped the issue[1] in the NVIDIA forums and we'll see if we get a response.

[1] https://forums.developer.nvidia.com/t/xid-31-after-wakeup-from-sleep/139870/6

@MageSlayer

MageSlayer commented Jan 9, 2024

Having the same issue on a laptop with a secondary NVIDIA card in a PRIME configuration.
Hardware acceleration fails after resume from suspend.

$ NVD_LOG=1 NVD_BACKEND=direct vainfo
libva info: VA-API version 1.20.0
libva error: vaGetDriverNames() failed with unknown libva error
libva info: User environment variable requested driver 'nvidia'
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/nvidia_drv_video.so
    135775.283643377 [30120-30120] ../src/vabackend.c: 130                     init CUDA ERROR 'unknown error' (999)

libva info: Found init function __vaDriverInit_1_0
    135775.283662988 [30120-30120] ../src/vabackend.c:2145       __vaDriverInit_1_0 Initialising NVIDIA VA-API Driver: 10
    135775.283665133 [30120-30120] ../src/vabackend.c:2154       __vaDriverInit_1_0 Now have 0 (0 max) instances
    135775.283667649 [30120-30120] ../src/vabackend.c:2180       __vaDriverInit_1_0 Selecting Direct backend
    135775.286633777 [30120-30120] ../src/backend-common.c:  31            isNvidiaDrmFd Invalid driver for DRM device: i915
    135775.286665005 [30120-30120] ../src/direct/direct-export-buf.c:  85      direct_initExporter Found NVIDIA GPU 0 at /dev/dri/renderD129
    135775.286668121 [30120-30120] ../src/direct/nv-driver.c: 246            init_nvdriver Initing nvdriver...
    135775.286683125 [30120-30120] ../src/direct/nv-driver.c: 264            init_nvdriver NVIDIA kernel driver version: , major version: 0, minor version: 0
    135775.286685882 [30120-30120] ../src/direct/nv-driver.c: 271            init_nvdriver Got dev info: 100 1 2 6
    135775.286771896 [30120-30120] ../src/direct/direct-export-buf.c:  23       findGPUIndexFromFd CUDA ERROR 'initialization error' (3)

    135775.286774654 [30120-30120] ../src/vabackend.c:2210       __vaDriverInit_1_0 CUDA ERROR 'initialization error' (3)

libva error: /usr/lib/x86_64-linux-gnu/dri/nvidia_drv_video.so init failed
libva info: va_openDriver() returns 1
vaInitialize failed with error code 1 (operation failed),exit

Reloading nvidia-uvm solves the issue:

# rmmod nvidia-uvm
# modprobe nvidia-uvm
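The reload could in principle be automated with a systemd sleep hook, which systemd calls with `pre` or `post` as the first argument and the sleep type as the second. The following is a rough sketch, not a tested fix: the hook file name and the function names are invented for illustration, `DRY_RUN` is a convenience added here, and unloading will fail while any process still uses the module.

```shell
#!/bin/sh
# Hypothetical hook, e.g. /usr/lib/systemd/system-sleep/nvidia-uvm-reload
# (the file name is an assumption). systemd runs hooks in that directory with
# $1 = pre|post and $2 = suspend|hibernate|hybrid-sleep|suspend-then-hibernate.

# With DRY_RUN set, print the command instead of running it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

reload_uvm_after_resume() {
    case "${1:-}" in
        post)
            # Removal fails while any process still uses the module.
            run modprobe -r nvidia_uvm && run modprobe nvidia_uvm
            ;;
    esac
}

# Does nothing unless systemd passes "post" as the first argument.
reload_uvm_after_resume "$@"
```

Calling `DRY_RUN=1 reload_uvm_after_resume post suspend` prints the two modprobe commands without touching the module, which is a safe way to try the hook before installing it.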

@mirh

mirh commented Jan 28, 2024

Aren't standby problems related to the stuff discussed in #182? And isn't it all fixed in 545+?

@MageSlayer

Last time I tried a 535 driver, it refused to decrease the cooler speed after some video playback. My laptop sounded like a jet plane and never stopped until rebooted.

I'll try 545 this time. Thanks for the suggestion.

@MageSlayer

I checked version 545.23.08 and it looks like they've fixed both the cooler speed and the hw acceleration after suspend/resume issues.

I think the issue might be closed now.

@MageSlayer

MageSlayer commented Feb 1, 2024

> I checked version 545.23.08 and it looks like they've fixed both the cooler speed and the hw acceleration after suspend/resume issues.
>
> I think the issue might be closed now.

Looks like I was too quick.
The suspend/resume hw acceleration bug is still there in driver 545.23.08.
vainfo emits an error and Firefox acceleration is missing after the 3rd or 4th resume from suspend.

@strahe

strahe commented May 30, 2024

This bug is still there in driver 550.78

@strahe

strahe commented Jun 13, 2024

I am using Arch Linux. The instructions here solved my problem; I hope they will be useful to you.

@tashrifbillah

My NVIDIA driver is 550.54.14. I am on a Redhat 9 environment. I have the same issue. Does anyone actually have a solution or workaround? I know @strahe posted something but it is unclear in his link what the instruction was.

@MageSlayer

NVidia driver 535.183.06-1
Linux 6.6.41

Uncommenting

options nvidia-current NVreg_PreserveVideoMemoryAllocations=1

... in /etc/modprobe.d/nvidia-options.conf results in errors in syslog and my laptop just stops suspending at all :)
Perhaps some other magic is required.

Commenting that line out again brings back the Firefox crashes, but suspend starts working.
I guess I'll stick to suspend for now :)

@mirh

mirh commented Sep 27, 2024

The instructions clearly mention that you have to enable the services too.
On top of that, I'm not sure modprobe is reliable 100% of the time, so try nvidia.NVreg_PreserveVideoMemoryAllocations=1 directly on the kernel command line.
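Whether the option actually took effect can be checked after boot instead of guessed. The sketch below assumes the driver exposes its settings as `Key: value` lines in `/proc/driver/nvidia/params`, and that the services in question are the `nvidia-suspend`, `nvidia-hibernate` and `nvidia-resume` units shipped with the driver packaging; `preserve_enabled` is a hypothetical helper name.

```shell
#!/bin/sh
# Succeeds if the parameter dump on stdin reports
# PreserveVideoMemoryAllocations as 1 (assumed "Key: value" line format).
preserve_enabled() {
    grep -q '^PreserveVideoMemoryAllocations: 1$'
}

# Example use on a machine with the NVIDIA driver loaded:
#   preserve_enabled < /proc/driver/nvidia/params && echo "parameter is active"
#   systemctl is-enabled nvidia-suspend.service nvidia-hibernate.service nvidia-resume.service
```

If the parameter shows as 0 even though it is set in a modprobe configuration file, that would support trying the kernel command line instead.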

@MageSlayer

That's the Arch wiki and I am on Devuan, so I don't have any of those *.service daemons.
Starting nvidia-persistenced just fails with some strange error.

@hhfeuer

hhfeuer commented Sep 27, 2024

The Nvidia suspend/resume mechanism relies on systemd hacks. Since Devuan promises to stay systemd-free, they should get in contact with the Gentoo devs who maintain elogind for the same purpose, incorporating the needed Nvidia hooks. Please support your distro.

@MageSlayer

> The Nvidia suspend/resume mechanism relies on systemd hacks. Since Devuan promises to stay systemd-free, they should get in contact with the Gentoo devs who maintain elogind for the same purpose, incorporating the needed Nvidia hooks. Please support your distro.

https://dev1galaxy.org/viewtopic.php?id=6860

@mirh

mirh commented Sep 28, 2024

See gentoo/gentoo#38482

@igravious

> Known issue of the nvidia driver. After suspend/resume, the nvidia-uvm module is defunct even if not used. The workaround is to unload and reload it.

this works! :)

@MageSlayer

> > The Nvidia suspend/resume mechanism relies on systemd hacks. Since Devuan promises to stay systemd-free, they should get in contact with the Gentoo devs who maintain elogind for the same purpose, incorporating the needed Nvidia hooks. Please support your distro.
>
> https://dev1galaxy.org/viewtopic.php?id=6860

Devuan can be fixed quite easily.
See https://dev1galaxy.org/viewtopic.php?pid=52640#p52640

@mirh

mirh commented Oct 25, 2024

@elFarto it would be really nice if you could mention this detail about suspending in the readme

@nerijus
Contributor

nerijus commented Oct 25, 2024

You could do a PR for this :)

@Certainty1396

On Fedora 40 I found a workaround: shut down any process using NVIDIA's decoding function before suspend or hibernate.
vim /etc/systemd/system/systemd-suspend.service.wants/nvidia-suspend.service

[Unit]
Description=NVIDIA system suspend actions
Before=systemd-suspend.service

[Service]
Type=oneshot
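# systemd does not run ExecStart= through a shell, so each command gets its own line.
# The leading "-" ignores pkill's non-zero exit status when no process matches.
ExecStart=-/usr/bin/pkill -f firefox
ExecStart=-/usr/bin/pkill -f VLC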
ExecStart=/usr/bin/logger -t suspend -s "nvidia-suspend.service"
ExecStart=/usr/bin/nvidia-sleep.sh "suspend"

[Install]
WantedBy=systemd-suspend.service
