
[Debian 13] vainfo fails: can't open /dev/dri/render128 - invalid argument #351

Open
itsmeknt opened this issue Jan 15, 2025 · 9 comments

@itsmeknt

itsmeknt commented Jan 15, 2025

I am running X11 on Debian Trixie (kernel 6.12.9) on my desktop with an RTX 3090 GPU and a very old AMD Ryzen Threadripper 1950X CPU.

lsb_release -a
No LSB modules are available.
Distributor ID:	Debian
Description:	Debian GNU/Linux trixie/sid
Release:	n/a
Codename:	trixie

uname -a
Linux debian 6.12.9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.12.9-1 (2025-01-10) x86_64 GNU/Linux

echo $XDG_SESSION_TYPE
x11

I have installed Nvidia 565 drivers according to the following guides:

Nvidia driver installation: https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/
Nvidia cuda installation: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/

The gist of the installation process is:

sudo apt-get install -V nvidia-open
sudo apt-get install -V cuda-drivers
sudo apt-get install -V cuda-toolkit

My Nvidia drivers seem to be working: nvidia-smi prints normally, and here are my graphics settings:

inxi -Ga
Graphics:
  Device-1: NVIDIA GA102 [GeForce RTX 3090] vendor: eVga.com. driver: nvidia
    v: 565.57.01 alternate: nouveau,nvidia_drm non-free: 550.xx+ status: current
    (as of 2024-09; EOL~2026-12-xx) arch: Ampere code: GAxxx
    process: TSMC n7 (7nm) built: 2020-2023 pcie: gen: 1 speed: 2.5 GT/s
    lanes: 8 link-max: gen: 4 speed: 16 GT/s lanes: 16 ports: active: none
    off: HDMI-A-1 empty: DP-1,DP-2,DP-3 bus-ID: 42:00.0 chip-ID: 10de:2204
    class-ID: 0300
  Device-2: Logitech C922 Pro Stream Webcam driver: snd-usb-audio,uvcvideo
    type: USB rev: 2.0 speed: 480 Mb/s lanes: 1 mode: 2.0 bus-ID: 3-4:3
    chip-ID: 046d:085c class-ID: 0102 serial: DD741E8F
  Display: x11 server: X.Org v: 21.1.15 with: Xwayland v: 24.1.4
    compositor: gnome-shell v: 47.2 driver: X: loaded: fbdev,nouveau
    unloaded: modesetting,vesa alternate: nv dri: swrast
    gpu: nvidia,nvidia-nvswitch display-ID: :1 screens: 1
  Screen-1: 0 s-res: 3840x2160 s-dpi: 96 s-size: 1016x572mm (40.00x22.52")
    s-diag: 1166mm (45.9")
  Monitor-1: HDMI-A-1 mapped: default note: disabled model: Dell S2817Q
    serial: MTKT17AK960I built: 2017 res: 3840x2160 gamma: 1.2
    diag: 708mm (27.9") ratio: 16:9 modes: max: 3840x2160 min: 640x480
  API: EGL v: 1.5 hw: drv: nvidia platforms: device: 0 drv: nvidia device: 2
    drv: swrast surfaceless: drv: nvidia x11: drv: swrast
    inactive: gbm,wayland,device-1
  API: OpenGL v: 4.6.0 compat-v: 4.5 vendor: mesa v: 24.3.3-1 glx-v: 1.4
    direct-render: yes renderer: llvmpipe (LLVM 19.1.6 256 bits)
    device-ID: ffffffff:ffffffff memory: 45.92 GiB unified: yes
ffmpeg -encoders 2>/dev/null | grep nvenc
 V....D av1_nvenc            NVIDIA NVENC av1 encoder (codec av1)
 V....D h264_nvenc           NVIDIA NVENC H.264 encoder (codec h264)
 V....D hevc_nvenc           NVIDIA NVENC hevc encoder (codec hevc)

One weird thing about my Nvidia installation: even though I only have one GPU, the drivers seem to be using Optimus-style PRIME management. At least, I can't launch Firefox on the Nvidia GPU without setting __NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia; otherwise it defaults to software CPU rendering with Mesa and the llvmpipe driver. The 3090 is the only GPU I am using, so ideally I want everything to run on it (including X11), but I can't get that to work for some reason.

Setting __NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia also changes the inxi -Ga output so that OpenGL reports the Nvidia driver instead of Mesa/llvmpipe:

__NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia inxi -Ga
Graphics:
  Device-1: NVIDIA GA102 [GeForce RTX 3090] vendor: eVga.com. driver: nvidia
    v: 565.57.01 alternate: nouveau,nvidia_drm non-free: 550.xx+ status: current
    (as of 2024-09; EOL~2026-12-xx) arch: Ampere code: GAxxx
    process: TSMC n7 (7nm) built: 2020-2023 pcie: gen: 1 speed: 2.5 GT/s
    lanes: 8 link-max: gen: 4 speed: 16 GT/s lanes: 16 ports: active: none
    off: HDMI-A-1 empty: DP-1,DP-2,DP-3 bus-ID: 42:00.0 chip-ID: 10de:2204
    class-ID: 0300
  Device-2: Logitech C922 Pro Stream Webcam driver: snd-usb-audio,uvcvideo
    type: USB rev: 2.0 speed: 480 Mb/s lanes: 1 mode: 2.0 bus-ID: 3-4:3
    chip-ID: 046d:085c class-ID: 0102 serial: DD741E8F
  Display: x11 server: X.Org v: 21.1.15 with: Xwayland v: 24.1.4
    compositor: gnome-shell v: 47.2 driver: X: loaded: fbdev,nouveau
    unloaded: modesetting,vesa alternate: nv gpu: nvidia,nvidia-nvswitch
    display-ID: :1 screens: 1
  Screen-1: 0 s-res: 3840x2160 s-dpi: 96 s-size: 1016x572mm (40.00x22.52")
    s-diag: 1166mm (45.9")
  Monitor-1: HDMI-A-1 mapped: default note: disabled model: Dell S2817Q
    serial: MTKT17AK960I built: 2017 res: 3840x2160 gamma: 1.2
    diag: 708mm (27.9") ratio: 16:9 modes: max: 3840x2160 min: 640x480
  API: EGL v: 1.5 hw: drv: nvidia platforms: device: 0 drv: nvidia device: 2
    drv: swrast surfaceless: drv: nvidia x11: drv: swrast
    inactive: gbm,wayland,device-1
  API: OpenGL v: 4.6.0 compat-v: 4.5 vendor: nvidia mesa v: 565.57.01
    glx-v: 1.4 direct-render: yes renderer: NVIDIA GeForce RTX 3090/PCIe/SSE2
    memory: 23.44 GiB

Anyway, I was trying to get Firefox hardware-accelerated video decoding to work. I installed nvidia-vaapi-driver v0.0.13, but performance is still poor. Launching Firefox with the recommended about:config settings and some additional env vars I found online gives an EGL error and a VA-API error:

__NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia __VK_LAYER_NV_optimus=NVIDIA_only __EGL_VENDOR_LIBRARY_FILENAMES=/usr/share/glvnd/egl_vendor.d/10_nvidia.json LIBVA_DRIVER_NAME=nvidia LIBVA_DEVICE=/dev/dri/renderD128 VDPAU_DRIVER=nvidia NVD_BACKEND=direct NVD_GPU="/dev/dri/renderD128" MOZ_ENABLE_WAYLAND=0 MOZ_DISABLE_RDD_SANDBOX=1 MOZ_DRM_DEVICE=/dev/dri/renderD128 NVD_LOG=1 firefox
[GFX1-]: glxtest: libEGL no display
[GFX1-]: vaapitest: ERROR
[GFX1-]: vaapitest: VA-API test failed: failed to open renderDeviceFD.

So I don't think VA-API is working correctly. When I watch a video in Firefox, playback is choppy, and while nvtop does list Firefox, it shows no decoder (dec) utilization for it (which it does show when I watch a local video with VLC).

So I installed vainfo and tried to debug it, but vainfo doesn't work either:

__NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia NVD_LOG=1 LIBVA_DRIVER_NAME=nvidia VDPAU_DRIVER=nvidia vainfo
Trying display: wayland
Trying display: x11
libva info: VA-API version 1.22.0
libva error: vaGetDriverNames() failed with unknown libva error
libva info: User environment variable requested driver 'nvidia'
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/nvidia_drv_video.so
libva info: Found init function __vaDriverInit_1_0
     45673.974193197 [68826-68826] ../src/vabackend.c:2187       __vaDriverInit_1_0 Initialising NVIDIA VA-API Driver: 10
     45673.974200250 [68826-68826] ../src/vabackend.c:2196       __vaDriverInit_1_0 Now have 0 (0 max) instances
     45673.974210940 [68826-68826] ../src/vabackend.c:2222       __vaDriverInit_1_0 Selecting Direct backend
     45674.006006810 [68826-68826] ../src/direct/direct-export-buf.c:  68      direct_initExporter Searching for GPU: 0 0 128
     45674.006151956 [68826-68826] ../src/direct/direct-export-buf.c:  72      direct_initExporter Unable to find NVIDIA GPU 0

     45674.006161053 [68826-68826] ../src/vabackend.c:2247       __vaDriverInit_1_0 Exporter failed
libva error: /usr/lib/x86_64-linux-gnu/dri/nvidia_drv_video.so init failed
libva info: va_openDriver() returns 1
vaInitialize failed with error code 1 (operation failed),exit

I modified the source code of direct-export-buf.c:72 to dig further, and can confirm a few things:

  • node=/dev/dri/renderD128 which is the correct file path
  • the file does exist. stat() returns the following struct: st_dev=6, st_ino=944, st_mode=8624, st_uid=0, st_gid=105, st_rdev=57984
  • open() returns fd == -1 with errno = 22 (EINVAL, Invalid argument)
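To isolate whether the EINVAL comes from the open(2) call itself, rather than from anything the VA-API driver does before or after, a standalone probe that reproduces only the failing call can help. This is a minimal sketch, not code from the driver; probe_open is a hypothetical helper:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Try to open a path read-write and report errno on failure.
   Returns 0 on success, otherwise the errno value from open(2). */
static int probe_open(const char *path)
{
    int fd = open(path, O_RDWR);
    if (fd < 0) {
        int err = errno;
        fprintf(stderr, "open(%s) failed: %s (errno=%d)\n",
                path, strerror(err), err);
        return err;
    }
    printf("open(%s) succeeded (fd=%d)\n", path, fd);
    close(fd);
    return 0;
}
```

Calling probe_open("/dev/dri/renderD128") from a fresh process rules out any state the driver or its host application sets up earlier (prior fds, CUDA initialisation, sandboxing) as a contributing factor.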

Here are the file properties of /dev/dri/renderD128:

ls -last /dev/dri/renderD128
0 crw-rw----+ 1 root render 226, 128 Jan 14 04:39 /dev/dri/renderD128

I searched online for the Invalid argument error and found this link: https://stackoverflow.com/questions/11055060/possible-reasons-of-linux-open-call-returning-einval
I believe what is happening is that the /dev/dri/renderD128 node is created with some special async behavior, and the glibc packaged with Debian 13 (kernel 6.12.9-amd64) is not able to open it. So even though the file exists and can be stat()ed, it cannot be opened via open(). This might also explain why I can't get X11 to start on the Nvidia driver either (its log complained that it couldn't open some device properly).

I'm not sure where to go from here. How can we debug this behavior further? Any help would be appreciated. Firefox and Chrome are basically unusable for any video-related content, and I've probably spent over 40 hours on this already lol.

@elFarto
Owner

elFarto commented Jan 15, 2025

The common issue here would be that your user isn't in the render group. You could also just give everything access to that device node to see if that fixes it (chmod o+rwx /dev/dri/renderD128), although I wouldn't recommend this as a permanent solution. Another option would be to run vainfo as root.

There is a + on the file mode, so there might be some ACLs on that node; you could take a look at those with getfacl /dev/dri/renderD128.

@itsmeknt
Author

I appreciate you taking a look into this.

I can confirm I am in the render group:

groups kevin
kevin : kevin cdrom floppy sudo audio dip video plugdev users render netdev bluetooth lpadmin scanner bumblebee

I just gave everything access to renderD128, but the issue still persists.
I also just tried running vainfo as root, but same thing.

Here is the output of getfacl:

getfacl renderD128 
# file: renderD128
# owner: root
# group: render
user::rw-
user:kevin:rw-
group::rw-
mask::rw-
other::---

For what it's worth, I don't believe this is a permission issue, since errno did not return EACCES (Permission denied) but specifically EINVAL (Invalid argument). I'm not sure what this error means exactly -- my only clue was the stackoverflow link in my post, which seems to indicate some sort of fsync issue?

@thesword53
Contributor

Hello @itsmeknt,

What are the outputs of glxinfo and eglinfo?

@itsmeknt
Author

Hello @thesword53

Here is the output of eglinfo -B:

eglinfo -B
GBM platform:
eglinfo: eglInitialize failed

Wayland platform:
eglinfo: eglInitialize failed

X11 platform:
EGL API version: 1.5
EGL vendor string: Mesa Project
EGL version string: 1.5
EGL client APIs: OpenGL OpenGL_ES 
OpenGL core profile vendor: Mesa
OpenGL core profile renderer: llvmpipe (LLVM 19.1.6, 256 bits)
OpenGL core profile version: 4.5 (Core Profile) Mesa 24.3.3-1
OpenGL core profile shading language version: 4.50
OpenGL compatibility profile vendor: Mesa
OpenGL compatibility profile renderer: llvmpipe (LLVM 19.1.6, 256 bits)
OpenGL compatibility profile version: 4.5 (Compatibility Profile) Mesa 24.3.3-1
OpenGL compatibility profile shading language version: 4.50
OpenGL ES profile vendor: Mesa
OpenGL ES profile renderer: llvmpipe (LLVM 19.1.6, 256 bits)
OpenGL ES profile version: OpenGL ES 3.2 Mesa 24.3.3-1
OpenGL ES profile shading language version: OpenGL ES GLSL ES 3.20

Surfaceless platform:
EGL API version: 1.5
EGL vendor string: NVIDIA
EGL version string: 1.5
EGL client APIs: OpenGL_ES OpenGL
OpenGL core profile vendor: NVIDIA Corporation
OpenGL core profile renderer: NVIDIA GeForce RTX 3090/PCIe/SSE2
OpenGL core profile version: 4.6.0 NVIDIA 565.57.01
OpenGL core profile shading language version: 4.60 NVIDIA
OpenGL compatibility profile vendor: NVIDIA Corporation
OpenGL compatibility profile renderer: NVIDIA GeForce RTX 3090/PCIe/SSE2
OpenGL compatibility profile version: 4.6.0 NVIDIA 565.57.01
OpenGL compatibility profile shading language version: 4.60 NVIDIA
OpenGL ES profile vendor: NVIDIA Corporation
OpenGL ES profile renderer: NVIDIA GeForce RTX 3090/PCIe/SSE2
OpenGL ES profile version: OpenGL ES 3.2 NVIDIA 565.57.01
OpenGL ES profile shading language version: OpenGL ES GLSL ES 3.20

Device platform:
Device #0:

Platform Device platform:
EGL API version: 1.5
EGL vendor string: NVIDIA
EGL version string: 1.5
EGL client APIs: OpenGL_ES OpenGL
OpenGL core profile vendor: NVIDIA Corporation
OpenGL core profile renderer: NVIDIA GeForce RTX 3090/PCIe/SSE2
OpenGL core profile version: 4.6.0 NVIDIA 565.57.01
OpenGL core profile shading language version: 4.60 NVIDIA
OpenGL compatibility profile vendor: NVIDIA Corporation
OpenGL compatibility profile renderer: NVIDIA GeForce RTX 3090/PCIe/SSE2
OpenGL compatibility profile version: 4.6.0 NVIDIA 565.57.01
OpenGL compatibility profile shading language version: 4.60 NVIDIA
OpenGL ES profile vendor: NVIDIA Corporation
OpenGL ES profile renderer: NVIDIA GeForce RTX 3090/PCIe/SSE2
OpenGL ES profile version: OpenGL ES 3.2 NVIDIA 565.57.01
OpenGL ES profile shading language version: OpenGL ES GLSL ES 3.20

Device #1:

Platform Device platform:
eglinfo: eglInitialize failed

Device #2:

Platform Device platform:
EGL API version: 1.5
EGL vendor string: Mesa Project
EGL version string: 1.5
EGL client APIs: OpenGL OpenGL_ES 
OpenGL core profile vendor: Mesa
OpenGL core profile renderer: llvmpipe (LLVM 19.1.6, 256 bits)
OpenGL core profile version: 4.5 (Core Profile) Mesa 24.3.3-1
OpenGL core profile shading language version: 4.50
OpenGL compatibility profile vendor: Mesa
OpenGL compatibility profile renderer: llvmpipe (LLVM 19.1.6, 256 bits)
OpenGL compatibility profile version: 4.5 (Compatibility Profile) Mesa 24.3.3-1
OpenGL compatibility profile shading language version: 4.50
OpenGL ES profile vendor: Mesa
OpenGL ES profile renderer: llvmpipe (LLVM 19.1.6, 256 bits)
OpenGL ES profile version: OpenGL ES 3.2 Mesa 24.3.3-1
OpenGL ES profile shading language version: OpenGL ES GLSL ES 3.20

For glxinfo, the outputs differ depending on the Nvidia environment variables as mentioned in my post:

glxinfo -B
name of display: :1
display: :1  screen: 0
direct rendering: Yes
Extended renderer info (GLX_MESA_query_renderer):
    Vendor: Mesa (0xffffffff)
    Device: llvmpipe (LLVM 19.1.6, 256 bits) (0xffffffff)
    Version: 24.3.3
    Accelerated: no
    Video memory: 48152MB
    Unified memory: yes
    Preferred profile: core (0x1)
    Max core profile version: 4.5
    Max compat profile version: 4.5
    Max GLES1 profile version: 1.1
    Max GLES[23] profile version: 3.2
Memory info (GL_ATI_meminfo):
    VBO free memory - total: 31 MB, largest block: 31 MB
    VBO free aux. memory - total: 20947 MB, largest block: 20947 MB
    Texture free memory - total: 31 MB, largest block: 31 MB
    Texture free aux. memory - total: 20947 MB, largest block: 20947 MB
    Renderbuffer free memory - total: 31 MB, largest block: 31 MB
    Renderbuffer free aux. memory - total: 20947 MB, largest block: 20947 MB
Memory info (GL_NVX_gpu_memory_info):
    Dedicated video memory: 880577 MB
    Total available memory: 928729 MB
    Currently available dedicated video memory: 31 MB
OpenGL vendor string: Mesa
OpenGL renderer string: llvmpipe (LLVM 19.1.6, 256 bits)
OpenGL core profile version string: 4.5 (Core Profile) Mesa 24.3.3-1
OpenGL core profile shading language version string: 4.50
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile

OpenGL version string: 4.5 (Compatibility Profile) Mesa 24.3.3-1
OpenGL shading language version string: 4.50
OpenGL context flags: (none)
OpenGL profile mask: compatibility profile

OpenGL ES profile version string: OpenGL ES 3.2 Mesa 24.3.3-1
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.20
__NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia glxinfo -B
name of display: :1
display: :1  screen: 0
direct rendering: Yes
Memory info (GL_NVX_gpu_memory_info):
    Dedicated video memory: 24576 MB
    Total available memory: 24576 MB
    Currently available dedicated video memory: 5751 MB
OpenGL vendor string: NVIDIA Corporation
OpenGL renderer string: NVIDIA GeForce RTX 3090/PCIe/SSE2
OpenGL core profile version string: 4.6.0 NVIDIA 565.57.01
OpenGL core profile shading language version string: 4.60 NVIDIA
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile

OpenGL version string: 4.6.0 NVIDIA 565.57.01
OpenGL shading language version string: 4.60 NVIDIA
OpenGL context flags: (none)
OpenGL profile mask: (none)

OpenGL ES profile version string: OpenGL ES 3.2 NVIDIA 565.57.01
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.20

@itsmeknt
Author

itsmeknt commented Jan 16, 2025

Also, not sure if this is helpful, but here is the Xorg.0.log from when I tried to get the X server to run directly on the Nvidia GPU with the Nvidia driver. I think it's relevant because Xorg also tried and failed to open /dev/dri/card0, which might be the same failure as vainfo trying to open /dev/dri/renderD128. Note that in both Xorg and vainfo, the open on the respective /dev/dri/ node failed with Invalid argument.

cat /var/log/Xorg.0.log

[    11.084] (--) Log file renamed from "/var/log/Xorg.pid-1538.log" to "/var/log/Xorg.0.log"
[    11.085] 
X.Org X Server 1.21.1.15
X Protocol Version 11, Revision 0
[    11.085] Current Operating System: Linux debian 6.12.9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.12.9-1 (2025-01-10) x86_64
[    11.085] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-6.12.9-amd64 root=UUID=b426f26c-1c7f-4d2a-a22e-ebf8b5d70339 ro nvidia-drm.modeset=1 pcie_aspm=off pcie_port_pm=off quiet
[    11.085] xorg-server 2:21.1.15-2 (https://www.debian.org/support) 
[    11.085] Current version of pixman: 0.44.0
[    11.085] 	Before reporting problems, check http://wiki.x.org
	to make sure that you have the latest version.
[    11.085] Markers: (--) probed, (**) from config file, (==) default setting,
	(++) from command line, (!!) notice, (II) informational,
	(WW) warning, (EE) error, (NI) not implemented, (??) unknown.
[    11.085] (==) Log file: "/var/log/Xorg.0.log", Time: Tue Jan 14 00:42:51 2025
[    11.085] (==) Using config file: "/etc/X11/xorg.conf"
[    11.085] (==) Using config directory: "/etc/X11/xorg.conf.d"
[    11.085] (==) Using system config directory "/usr/share/X11/xorg.conf.d"
[    11.085] (==) ServerLayout "layout"
[    11.085] (**) |-->Screen "nvidia" (0)
[    11.085] (**) |   |-->Monitor "<default monitor>"
[    11.086] (**) |   |-->Device "nvidia"
[    11.086] (==) No monitor specified for screen "nvidia".
	Using a default monitor configuration.
[    11.086] (**) Allowing byte-swapped clients
[    11.086] (==) Automatically adding devices
[    11.086] (==) Automatically enabling devices
[    11.086] (==) Automatically adding GPU devices
[    11.086] (==) Automatically binding GPU devices
[    11.086] (==) Max clients allowed: 256, resource mask: 0x1fffff
[    11.086] (WW) The directory "/usr/share/fonts/X11/cyrillic" does not exist.
[    11.086] 	Entry deleted from font path.
[    11.086] (==) FontPath set to:
	/usr/share/fonts/X11/misc,
	/usr/share/fonts/X11/100dpi/:unscaled,
	/usr/share/fonts/X11/75dpi/:unscaled,
	/usr/share/fonts/X11/Type1,
	/usr/share/fonts/X11/100dpi,
	/usr/share/fonts/X11/75dpi,
	built-ins
[    11.086] (==) ModulePath set to "/usr/lib/xorg/modules"
[    11.086] (II) The server relies on udev to provide the list of input devices.
	If no devices become available, reconfigure udev or disable AutoAddDevices.
[    11.086] (II) Loader magic: 0x557cfcca2f20
[    11.086] (II) Module ABI versions:
[    11.086] 	X.Org ANSI C Emulation: 0.4
[    11.086] 	X.Org Video Driver: 25.2
[    11.086] 	X.Org XInput driver : 24.4
[    11.086] 	X.Org Server Extension : 10.0
[    11.087] (++) using VT number 1

[    11.088] (II) systemd-logind: took control of session /org/freedesktop/login1/session/c6
[    11.090] (II) xfree86: Adding drm device (/dev/dri/card0)
[    11.090] (II) Platform probe for /sys/devices/pci0000:40/0000:40:01.3/0000:42:00.0/drm/card0
[    11.091] (EE) systemd-logind: failed to take device /dev/dri/card0: Invalid argument
[    11.102] (--) PCI:*(66@0:0:0) 10de:2204:3842:3987 rev 161, Mem @ 0x9e000000/16777216, 0x80000000/268435456, 0x90000000/33554432, I/O @ 0x00003000/128, BIOS @ 0x????????/131072
[    11.102] (II) LoadModule: "glx"
[    11.102] (II) Loading /usr/lib/xorg/modules/extensions/libglx.so
[    11.103] (II) Module glx: vendor="X.Org Foundation"
[    11.103] 	compiled for 1.21.1.15, module version = 1.0.0
[    11.103] 	ABI class: X.Org Server Extension, version 10.0
[    11.103] (II) LoadModule: "nvidia"
[    11.103] (II) Loading /usr/lib/xorg/modules/drivers/nvidia_drv.so
[    11.104] (II) Module nvidia: vendor="NVIDIA Corporation"
[    11.104] 	compiled for 1.6.99.901, module version = 1.0.0
[    11.104] 	Module class: X.Org Video Driver
[    11.104] (II) NVIDIA dlloader X Driver  565.57.01  Thu Oct 10 12:05:50 UTC 2024
[    11.104] (II) NVIDIA Unified Driver for all Supported NVIDIA GPUs
[    11.104] (EE) No devices detected.
[    11.104] (EE) 
Fatal server error:
[    11.104] (EE) no screens found(EE) 
[    11.104] (EE) 
Please consult the The X.Org Foundation support 
	 at http://wiki.x.org
 for help. 
[    11.104] (EE) Please also check the log file at "/var/log/Xorg.0.log" for additional information.
[    11.104] (EE) 
[    11.201] (EE) Server terminated with error (1). Closing log file.

And now that I think about it, this might be why Wayland is failing on my system too. I would ideally use Wayland, but it keeps crashing since I installed the 565 Nvidia drivers, which is why I am resorting to X11. I tried to find out why Wayland was crashing, but neither dmesg nor journalctl logged any helpful error messages, so I gave up on that.

In the chance that this is a Debian 13 bug and not an Nvidia Vaapi driver bug, where should I go to get more help?

@itsmeknt
Author

itsmeknt commented Jan 16, 2025

One other piece of info about Debian 13 that may or may not be relevant to this:

This is not Debian stable. The current stable Debian release is Debian 12 (bookworm).

The current status of trixie is that it is the testing branch. It is expected to become stable in mid 2025, as Debian 13.

I didn't realize it when I picked this OS. Maybe all of this is in fact due to an OS-level bug?

@itsmeknt
Author

/dev/dri/renderD128 and /dev/dri/card0 are both character device files, and both fail to open via C's open() function with EINVAL (Invalid argument) on Debian 13, kernel 6.12.9.
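As a sanity check on the device nodes themselves, the major/minor numbers can be read back with stat(2); on Linux the DRM subsystem uses character-device major 226, with render nodes starting at minor 128 (matching the "226, 128" in the ls output above). A minimal sketch (char_dev_major is a hypothetical helper):

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>

/* If path is a character device, print and return its major number;
   return -1 if stat fails or the path is not a character device. */
static int char_dev_major(const char *path)
{
    struct stat st;
    if (stat(path, &st) != 0 || !S_ISCHR(st.st_mode))
        return -1;
    printf("%s: character device %u:%u\n",
           path, major(st.st_rdev), minor(st.st_rdev));
    return (int)major(st.st_rdev);
}
```

If char_dev_major("/dev/dri/renderD128") reports anything other than 226:128, udev created an unexpected node, which would point away from the open() path and toward device setup.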

What I am confused about is: how are the Nvidia 565 drivers able to read and write these files (assuming they require them)? The drivers themselves work just fine -- nvidia-smi, nvtop, and GPU applications all work (VLC, training neural networks on the GPU, launching programs with __NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia, etc.) -- and I'm assuming these character device files are required for that.

Maybe nvidia-vaapi-driver could be made more robust by using the same file I/O mechanism as the Nvidia 565 drivers for the /dev/dri/renderD128 and /dev/dri/card0 character device files? The Nvidia 565 drivers are built on top of nvidia-open, so I am assuming they're open source.

@elFarto
Owner

elFarto commented Jan 18, 2025

Honestly, I'm not sure there's much we can help you with. You might be able to get more assistance on the NVIDIA forums. The issue you have isn't directly related to this driver; something is fundamentally wrong with the driver install or system configuration.

@itsmeknt
Author

itsmeknt commented Jan 19, 2025

Thanks @elFarto @thesword53 and team, I appreciate the time invested. I'll move this issue to the NVIDIA forums.

Would you like me to keep this issue open and post updates, in case the correct solution (starting with the 565 feature branch and onward) involves opening /dev/dri/renderD128 some way other than C's open()?

The reason I ask is because I saw this comment in the Nvidia open Linux GPU kernel source code:

        /*
         * On T234, the 'fd' provided is allocated outside of RM whereas on
         * dGPU it is allocated by RM. So we check whether the fd is associated
         * with an nvidia character device, and if it is, then we consider that
         * it belongs to RM. Based on whether it belongs to RM or not we need
         * to call different mechanisms to import it.
         */

and I don't know the source code well, but perhaps it's related? I wonder if the ideal solution might be as simple as calling an Nvidia-provided open function instead of glibc's open().

Also, the Nvidia 565 driver installation seems to work fine for most Nvidia clients (CUDA, VLC, etc.); only some clients (X, Wayland) are having issues, specifically with opening the character device files, so maybe Nvidia's solution might be relevant to this library somehow.
