startup service failing due to not being able to read from sysfs #119

Closed
dorianbrown opened this issue Jun 20, 2024 · 7 comments

dorianbrown commented Jun 20, 2024

I've been a happy user for a while, but recently the egpu.service systemd unit has started failing on boot, with these logs:

> journalctl -u egpu.service

Jun 20 11:25:28 z13 systemd[1]: Starting egpu.service - EGPU Service...
Jun 20 11:25:28 z13 egpu-switcher[1333]: [error] unable to read pci information from sysfs: got error while scanning device '0000:03:00.0': the pci 'config' file has an invalid format
Jun 20 11:25:28 z13 egpu-switcher[1333]: panic: unable to read pci information from sysfs
Jun 20 11:25:28 z13 egpu-switcher[1333]: goroutine 1 [running]:
Jun 20 11:25:28 z13 egpu-switcher[1333]: github.com/hertg/egpu-switcher/internal/pci.ReadGPUs()
Jun 20 11:25:28 z13 egpu-switcher[1333]:         /home/runner/work/egpu-switcher/egpu-switcher/internal/pci/pci.go:98 +0x1be
Jun 20 11:25:28 z13 egpu-switcher[1333]: github.com/hertg/egpu-switcher/internal/pci.Find(0x10de1e041458400f)
Jun 20 11:25:28 z13 egpu-switcher[1333]:         /home/runner/work/egpu-switcher/egpu-switcher/internal/pci/pci.go:108 +0x25
Jun 20 11:25:28 z13 egpu-switcher[1333]: github.com/hertg/egpu-switcher/cmd.glob..func5(0xac7400?, {0xc000193ff0?, 0x1?, 0x1?})
Jun 20 11:25:28 z13 egpu-switcher[1333]:         /home/runner/work/egpu-switcher/egpu-switcher/cmd/switch.go:72 +0x17c
Jun 20 11:25:28 z13 egpu-switcher[1333]: github.com/spf13/cobra.(*Command).execute(0xac7400, {0xc000193fc0, 0x1, 0x1})
Jun 20 11:25:28 z13 egpu-switcher[1333]:         /home/runner/go/pkg/mod/github.com/spf13/[email protected]/command.go:872 +0x694
Jun 20 11:25:28 z13 egpu-switcher[1333]: github.com/spf13/cobra.(*Command).ExecuteC(0xac6780)
Jun 20 11:25:28 z13 egpu-switcher[1333]:         /home/runner/go/pkg/mod/github.com/spf13/[email protected]/command.go:990 +0x3bd
Jun 20 11:25:28 z13 egpu-switcher[1333]: github.com/spf13/cobra.(*Command).Execute(...)
Jun 20 11:25:28 z13 egpu-switcher[1333]:         /home/runner/go/pkg/mod/github.com/spf13/[email protected]/command.go:918
Jun 20 11:25:28 z13 egpu-switcher[1333]: github.com/hertg/egpu-switcher/cmd.Execute()
Jun 20 11:25:28 z13 egpu-switcher[1333]:         /home/runner/work/egpu-switcher/egpu-switcher/cmd/root.go:79 +0x25
Jun 20 11:25:28 z13 egpu-switcher[1333]: main.main()
Jun 20 11:25:28 z13 egpu-switcher[1333]:         /home/runner/work/egpu-switcher/egpu-switcher/main.go:8 +0x17
Jun 20 11:25:28 z13 systemd[1]: egpu.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jun 20 11:25:28 z13 systemd[1]: egpu.service: Failed with result 'exit-code'.
Jun 20 11:25:28 z13 systemd[1]: Failed to start egpu.service - EGPU Service.

extra info:

  • egpu-switcher 0.19.0
  • Running Fedora 40 (6.9.4-200) on X11
  • Lenovo Z13 G1 laptop (Ryzen CPU/GPU) + RTX 2080Ti EGPU (550.90.07)

Not sure what the cause might be, or how to fix this. So far the tips from the troubleshooting guide haven't helped.

@dorianbrown
Author

I just saw this related issue, but whatever fixed it for them doesn't seem to work for me: #91

@hertg
Owner

hertg commented Jun 20, 2024

Hi Dorian, there appears to be an issue parsing the config of a certain PCI device in your computer. It's not the same cause as in #91. This might be something that has to be addressed in the library I created for parsing PCI information (https://github.com/hertg/gopci).

What kind of PCI device is at 0000:03:00.0?

You can print more detailed information by running the following command:

lspci -s 0000:03:00.0 -vv

If you could provide me with the original config file for this device, I could try to reproduce the issue. This is a binary file that you cannot open in a text editor. I'm fairly certain it doesn't include sensitive information, but I can't guarantee it. You can find this file at /sys/bus/pci/devices/0000\:03\:00.0/config.

You could send me the file in encrypted form if you like, by downloading and using my PGP key directly from Github.

# download my public pgp key
curl https://github.com/hertg.gpg > hertg.key

# encrypt the file with the downloaded key and print the encrypted message in ASCII form to stdout
sudo gpg --armor --encrypt --recipient-file hertg.key -o- /sys/bus/pci/devices/0000\:03\:00.0/config
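
Before sending the file, it can be useful to sanity-check that the copy really is a PCI config dump. The sketch below is not part of egpu-switcher or gopci; `pci_config_summary` is a hypothetical helper that relies only on the standard PCI header layout, where the first four bytes hold the vendor ID and device ID as 16-bit little-endian values.

```shell
# Hypothetical helper (not part of egpu-switcher): print how many bytes can
# be read from a PCI config dump, plus the vendor and device ID stored in
# its first four bytes (two 16-bit little-endian values).
pci_config_summary() {
    file=$1
    # Count the bytes actually read rather than trusting the reported size.
    readable=$(cat "$file" | wc -c | tr -d ' ')
    # od prints the first four bytes as two-digit hex values: b0 b1 b2 b3
    set -- $(od -An -tx1 -N4 "$file")
    if [ "$#" -lt 4 ]; then
        echo "readable=$readable (too short for a PCI header)"
        return 1
    fi
    # Swap each byte pair to turn little-endian bytes into the usual notation.
    echo "readable=$readable vendor=0x$2$1 device=0x$4$3"
}
```

One way to use it is to copy the file first with `sudo cat /sys/bus/pci/devices/0000\:03\:00.0/config > config.bin` and then run `pci_config_summary config.bin`. The IDs should match what `lspci -nn -s 03:00.0` reports. Note that unprivileged reads of this sysfs file are typically limited to the first 64 bytes, so the readable size can differ depending on who reads it.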

@dorianbrown
Author

For the output of lspci:

03:00.0 PCI bridge: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016] (rev 02) (prog-if 00 [Normal decode])
	Subsystem: Gigabyte Technology Co., Ltd Device 0003

There's a lot more information in the output, but I thought this might be enough.

And for the config, here's the encrypted message. I hope this contains the contents of that config file, but let me know if not:

-----BEGIN PGP MESSAGE-----

hF4Dt8HaY/7x4X8SAQdAqzz8KEjhm3kyyR3GU18+aocSvBDbvbR7HJW2ZRlySi0w
JkxlwqRCTzsZzZEQamzs2/cZdQ7ciuO1UD3Y7jL7/l5kpFNR0H+F7nYngCy7O8kt
0sC1AQ+DIXT4TZwPzz72E5urgAL+w9tkc79J2Nk8k2fqc5SGU1zQeNP8rft8EcTD
Bw9H1v2qs4uGnsQ2/IPiiVvCETT+HU3+Z6bW0E/7Aana6sBKcNiyFNHw1TT6GjtX
Vv+2v2FPzq7SyGeg+ZIDylsYGKmFAD7S7HXGVxhQqgxSOUzX1vbgmVq7pgIaxO22
MY27LhGtTMFCuheCrKZzYIk8RUfVqBbgPwhatVlQANkWyM6xAWZpg355s663oxMb
UCLBydebsL8Xrl3YiXxkQDx/QBtiJjdHyisKlWOBgiLK3zTSttbyQAhaARtRL0KP
t/Y1O1xvMUT88I/WDQHRwxGYVoCegamHao+BagB/F3pdXY4UNh1Wp2LO3kFHCX81
mONF51nbE8nG6nkS/aEUwkyuB8Ixqqh3W6ZQ+O8y87sxATykFClpH5ZvLn0xJk41
Fzc/v8vRPHvpK39GcVJ6vU59swyl+6G4He7GO0HIZKuGKI/UeRfn9Q==
=nDKV
-----END PGP MESSAGE-----

@hertg
Owner

hertg commented Jun 21, 2024

Thanks for providing the config file. Unfortunately, I am unable to reproduce this issue; the file parses just fine in my tests. Could you check whether it still prints the error for the device at 03:00.0, or did the address change? (PCI devices don't always keep the same bus number; it can change between reboots if the device topology changes.)

Or does the issue only occur when the systemd service runs? If you manually run egpu-switcher switch auto, do you also get the error? If not, then I'm not sure what's going on.
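
Because the bus number can change, one way to track the same device across reboots is to look it up by vendor and device ID instead of by address. The following is a minimal sketch, assuming the usual sysfs layout in which every device directory contains `vendor` and `device` files holding values such as `0x8086`. The function name `find_pci_by_id` is made up for this example.

```shell
# Hypothetical helper: list the PCI addresses below a sysfs-style devices
# directory whose vendor and device IDs match the given values.
#   $1 = devices directory (e.g. /sys/bus/pci/devices)
#   $2 = vendor ID as written in sysfs (e.g. 0x8086)
#   $3 = device ID as written in sysfs
find_pci_by_id() {
    for dev in "$1"/*; do
        # Skip entries that do not look like PCI device directories.
        [ -r "$dev/vendor" ] && [ -r "$dev/device" ] || continue
        if [ "$(cat "$dev/vendor")" = "$2" ] && [ "$(cat "$dev/device")" = "$3" ]; then
            basename "$dev"
        fi
    done
}
```

The numeric IDs for the device in question can be read from `lspci -nn -s 03:00.0` while it still sits at that address, and then passed as `find_pci_by_id /sys/bus/pci/devices <vendor> <device>` after a reboot. Several devices can share the same IDs, so the function may print more than one address.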

@dorianbrown
Author

dorianbrown commented Jun 21, 2024

Running it myself with sudo egpu-switcher switch auto works just fine, i.e.:

[info] looking for eGPU...
[info] the egpu is connected
[info] egpu has been added to X.Org config
[ok] switch completed

If I check journalctl -u egpu.service, it shows the same error (same 03:00.0 PCI device) for the current boot, so that's a bit strange.

@hertg
Owner

hertg commented Jun 21, 2024

Hmm, that is really strange. It looks like the config file may be different at the time egpu.service is running... That's a hard one to debug...

One hacky way to do it might be to temporarily change egpu.service so that it doesn't run egpu-switcher, but instead just copies the problematic config file to a location where you can retrieve it after boot has completed.

sudo vim /etc/systemd/system/egpu.service

And then set the ExecStart to something like

ExecStart=/bin/sh -c 'cp /sys/bus/pci/devices/0000\:03\:00.0/config /home/<your-username>/config.bin'

I am not totally sure whether that will work, though.
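
If the copy step works, the boot-time dump can be compared with one taken later from the running system. The sketch below assumes both dumps exist as regular files; `compare_config_dumps` and the file names in the usage note are hypothetical. It reports the number of bytes in each file and, when they differ, lists the differing bytes with `cmp -l` (byte offset in decimal, counted from 1, followed by the two byte values in octal).

```shell
# Hypothetical helper: compare a config dump captured at boot ($1) with a
# dump taken later from the running system ($2).
compare_config_dumps() {
    boot=$1
    live=$2
    boot_size=$(cat "$boot" | wc -c | tr -d ' ')
    live_size=$(cat "$live" | wc -c | tr -d ' ')
    echo "boot: $boot_size bytes, live: $live_size bytes"
    if cmp -s "$boot" "$live"; then
        echo "identical"
    else
        echo "different"
        # Show at most the first 20 differing bytes to keep the output short.
        cmp -l "$boot" "$live" | head -n 20
        return 1
    fi
}
```

For example, after a reboot with the modified unit, `sudo cat /sys/bus/pci/devices/0000\:03\:00.0/config > live.bin` followed by `compare_config_dumps config.bin live.bin` would show whether the file really looked different while the service was running.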

@dorianbrown
Author

Yeah, seems like an issue outside of the domain of egpu-switcher, but thanks for taking a look!
