Quasar Unreachable #45

Open
morey-tech opened this issue Aug 24, 2024 · 14 comments
Labels: bug, hardware

Comments

@morey-tech
Owner

While streaming episode v0.5.0 of StruggleOps, I found that host quasar (192.168.1.30:8006) was unreachable.

@morey-tech added the bug and hardware labels on Aug 24, 2024
@morey-tech
Owner Author

I could not access the host terminal, and the display would not turn on. Pressing the power button momentarily did not shut down the host, but pressing and holding it did. I then powered it back on, and it was available on the network again.

@morey-tech
Owner Author

The last journal entry was at Aug 23 14:01:16, almost 7 hours before the reboot.

Aug 23 14:01:16 quasar pvestatd[1247]: status update time (7.190 seconds)
-- Boot 670a31232df344d8b371cc771df5ac38 --
Aug 23 20:46:06 quasar kernel: Linux version 6.8.4-2-pve (build@proxmox)
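A minimal sketch of how the last entry before each reboot could be pulled out of a longer journal dump. It assumes the "-- Boot <id> --" separator format shown above; the function name last_line_before_boot is hypothetical.

```shell
# Print the last line logged before each "-- Boot <id> --" separator.
# Reads journalctl text output on stdin.
last_line_before_boot() {
  awk '
    /^-- Boot [0-9a-f]+ --$/ { if (prev != "") print prev; prev = ""; next }
    { prev = $0 }
  '
}
```

Possible usage: journalctl --no-pager | last_line_before_boot. Alternatively, journalctl -b -1 -n 20 shows the last 20 lines of the previous boot directly.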

@morey-tech
Owner Author

The system stats in Proxmox stop around the same time.
[Screenshot: 2024-08-23-20-56-22]

@morey-tech
Owner Author

morey-tech commented Aug 24, 2024

These are all the journalctl error entries around the incident. Nothing notable was logged near the last entry at 14:01.

Aug 19 04:39:38 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 19 04:39:45 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 19 04:39:55 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 19 04:40:12 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 20 04:30:03 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 20 04:39:31 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 20 04:39:43 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 20 04:39:51 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 20 04:40:01 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 20 04:40:17 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 21 04:30:00 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 21 04:39:32 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 21 04:39:45 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 21 04:39:52 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 21 04:40:02 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 21 04:40:18 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 22 04:30:01 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 22 04:39:35 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 22 04:39:48 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 22 04:39:55 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 22 04:40:05 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 22 04:40:22 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 23 04:30:03 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 23 04:39:38 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 23 04:39:51 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 23 04:39:58 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 23 04:40:08 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 23 04:40:25 quasar kernel: EXT4-fs (dm-20): write access unavailable, skipping orphan cleanup
Aug 23 12:00:30 quasar pveproxy[4183702]: problem with client ::ffff:172.16.0.2; Connection reset by peer
Aug 23 12:33:38 quasar pveproxy[2150496]: problem with client ::ffff:172.16.0.2; Connection reset by peer
Aug 23 13:44:07 quasar pveproxy[2169430]: problem with client ::ffff:172.16.0.2; Connection reset by peer
-- Boot 670a31232df344d8b371cc771df5ac38 --
Aug 23 20:46:06 quasar kernel: pci 0000:00:07.2: DPC: RP PIO log size 0 is invalid
Aug 23 20:46:06 quasar kernel: Bluetooth: hci0: Failed to load firmware file (-2)
Aug 23 20:46:06 quasar kernel: Bluetooth: hci0: Failed to set up firmware (-2)

@morey-tech
Owner Author

morey-tech commented Aug 24, 2024

The timing of the EXT4-fs errors corresponds with the backup schedule for the guests on Proxmox (04:30 daily), so they are likely caused by the backups and unrelated to this issue.
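One way to check that a message clusters around a scheduled job is to count matches per hour. This is a rough sketch that assumes the default journal format, where the third field is HH:MM:SS; count_by_hour is a hypothetical helper name.

```shell
# Count journal lines matching a pattern, grouped by hour of day.
# $1: extended regex to match; journal text (default short format) on stdin.
# Output: one "HH count" line per hour that had matches.
count_by_hour() {
  awk -v pat="$1" '
    $0 ~ pat { split($3, t, ":"); n[t[1]]++ }
    END { for (h in n) print h, n[h] }
  ' | sort
}
```

Possible usage: journalctl -p err --no-pager | count_by_hour 'EXT4-fs'. If every match lands in the 04 bucket, that supports the link to the 04:30 backups.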

@morey-tech
Owner Author

morey-tech commented Sep 2, 2024

It happened again. I found Plex unavailable today at 16:00. The last log entry was at Sep 02 01:20:50.

Sep 02 01:04:50 quasar postfix/qmgr[1205]: E8BAC201162: from=<[email protected]>, size=69186, nrcpt=1 (queue active)
Sep 02 01:05:20 quasar postfix/smtp[265130]: connect to smtp.google.com[142.250.31.27]:25: Connection timed out
Sep 02 01:05:50 quasar postfix/smtp[265130]: connect to smtp.google.com[142.250.31.26]:25: Connection timed out
Sep 02 01:06:20 quasar postfix/smtp[265130]: connect to smtp.google.com[142.251.111.27]:25: Connection timed out
Sep 02 01:06:20 quasar postfix/smtp[265130]: connect to smtp.google.com[2607:f8b0:4004:c19::1b]:25: Network is unreachable
Sep 02 01:06:20 quasar postfix/smtp[265130]: connect to smtp.google.com[2607:f8b0:4004:c19::1a]:25: Network is unreachable
Sep 02 01:06:20 quasar postfix/smtp[265130]: E8BAC201162: to=<[email protected]>, relay=none, delay=246342, delays=246252/0.01/90/0, dsn=4.4.1, status=deferred (connect to smtp.google.com[26>
Sep 02 01:09:50 quasar postfix/qmgr[1205]: 185D320111D: from=<[email protected]>, size=69180, nrcpt=1 (queue active)
Sep 02 01:10:20 quasar postfix/smtp[266934]: connect to smtp.google.com[142.250.31.26]:25: Connection timed out
Sep 02 01:10:50 quasar postfix/smtp[266934]: connect to smtp.google.com[142.251.111.27]:25: Connection timed out
Sep 02 01:11:20 quasar postfix/smtp[266934]: connect to smtp.google.com[142.251.111.26]:25: Connection timed out
Sep 02 01:11:20 quasar postfix/smtp[266934]: connect to smtp.google.com[2607:f8b0:4004:c19::1a]:25: Network is unreachable
Sep 02 01:11:20 quasar postfix/smtp[266934]: connect to smtp.google.com[2607:f8b0:4004:c0b::1a]:25: Network is unreachable
Sep 02 01:11:20 quasar postfix/smtp[266934]: 185D320111D: to=<[email protected]>, relay=none, delay=73828, delays=73738/0.01/90/0, dsn=4.4.1, status=deferred (connect to smtp.google.com[2607>
Sep 02 01:17:01 quasar CRON[269595]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Sep 02 01:17:01 quasar CRON[269597]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Sep 02 01:17:01 quasar CRON[269595]: pam_unix(cron:session): session closed for user root
Sep 02 01:19:50 quasar postfix/qmgr[1205]: 63BE22010C3: from=<[email protected]>, size=2818, nrcpt=1 (queue active)
Sep 02 01:19:50 quasar postfix/qmgr[1205]: 661372010CF: from=<[email protected]>, size=2838, nrcpt=1 (queue active)
Sep 02 01:19:50 quasar postfix/smtp[270612]: connect to smtp.google.com[2607:f8b0:4004:c19::1b]:25: Network is unreachable
Sep 02 01:19:50 quasar postfix/smtp[270613]: connect to smtp.google.com[2607:f8b0:4004:c19::1a]:25: Network is unreachable
Sep 02 01:19:50 quasar postfix/smtp[270613]: connect to smtp.google.com[2607:f8b0:4004:c0b::1b]:25: Network is unreachable
Sep 02 01:19:50 quasar postfix/smtp[270613]: connect to smtp.google.com[2607:f8b0:4004:c19::1b]:25: Network is unreachable
Sep 02 01:20:20 quasar postfix/smtp[270612]: connect to smtp.google.com[142.250.31.26]:25: Connection timed out
Sep 02 01:20:20 quasar postfix/smtp[270613]: connect to smtp.google.com[142.251.111.27]:25: Connection timed out
Sep 02 01:20:50 quasar postfix/smtp[270612]: connect to smtp.google.com[142.250.31.27]:25: Connection timed out
Sep 02 01:20:50 quasar postfix/smtp[270613]: connect to smtp.google.com[142.250.31.27]:25: Connection timed out
Sep 02 01:20:50 quasar postfix/smtp[270612]: connect to smtp.google.com[2607:f8b0:4004:c0b::1b]:25: Network is unreachable
Sep 02 01:20:50 quasar postfix/smtp[270612]: connect to smtp.google.com[2607:f8b0:4004:c0b::1a]:25: Network is unreachable
Sep 02 01:20:50 quasar postfix/smtp[270613]: 661372010CF: to=<[email protected]>, relay=none, delay=330289, delays=330229/0.01/60/0, dsn=4.4.1, status=deferred (connect to smtp.google.com[14>
Sep 02 01:20:50 quasar postfix/smtp[270612]: 63BE22010C3: to=<[email protected]>, relay=none, delay=418494, delays=418434/0.01/60/0, dsn=4.4.1, status=deferred (connect to smtp.google.com[26>
-- Boot cbc1ece1b7034071bb3e8a33f88f6034 --
Sep 02 15:55:57 quasar kernel: Linux version 6.8.4-2-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.4-2 (2024-04-10T17:36Z) ()
Sep 02 15:55:57 quasar kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.4-2-pve root=/dev/mapper/pve-root ro quiet
Sep 02 15:55:57 quasar kernel: KERNEL supported cpus:
Sep 02 15:55:57 quasar kernel:   Intel GenuineIntel
Sep 02 15:55:57 quasar kernel:   AMD AuthenticAMD
Sep 02 15:55:57 quasar kernel:   Hygon HygonGenuine
Sep 02 15:55:57 quasar kernel:   Centaur CentaurHauls
Sep 02 15:55:57 quasar kernel:   zhaoxin   Shanghai  
Sep 02 15:55:57 quasar kernel: x86/tme: not enabled by BIOS
Sep 02 15:55:57 quasar kernel: x86/split lock detection: #AC: crashing the kernel on kernel split_locks and warning on user-space split_locks
Sep 02 15:55:57 quasar kernel: BIOS-provided physical RAM map:
[Screenshot: Xnapper-2024-09-02-16 00 27]

@morey-tech
Owner Author

[Image: IMG_4376]

The Call Trace on the screen indicates a kernel panic.

https://pve.proxmox.com/wiki/Kernel_Crash_Trace_Log

@morey-tech
Owner Author

One of the LXCs was being OOM-killed because it had too little memory allocated. This may be the cause of the system halting, as another user reported the same behaviour on Reddit. I've increased the memory on the container and will monitor for another OOM kill followed by a system halt.
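A small sketch for monitoring further OOM kills. It assumes the usual kernel wording "... out of memory: Killed process <pid> (<name>) ...", which is not quoted in this issue; oom_victims is a hypothetical helper name.

```shell
# List "PID name" for every OOM-killed process found in kernel log text on stdin.
# Matches both global ("Out of memory:") and cgroup ("Memory cgroup out of memory:") kills.
oom_victims() {
  sed -nE 's/.*[Oo]ut of memory: Killed process ([0-9]+) \(([^)]*)\).*/\1 \2/p'
}
```

Possible usage: journalctl -k --no-pager | oom_victims for the current boot, or journalctl -k -b -1 --no-pager | oom_victims for the boot before a halt.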

@morey-tech
Owner Author

morey-tech commented Sep 7, 2024

Since fixing the OOM issues with the LXC container, the host has run into another kernel panic around Sep 07 03:38:03 while still running the 2x 48GB DIMMs.

root@quasar:~# journalctl -p err
Sep 07 03:38:03 quasar kernel: BUG: unable to handle page fault for address: 00000000000359e0
Sep 07 03:38:03 quasar kernel: #PF: supervisor write access in kernel mode
Sep 07 03:38:03 quasar kernel: #PF: error_code(0x0002) - not-present page
Detailed Logs

root@quasar:~# journalctl
Sep 07 03:38:03 quasar kernel: BUG: unable to handle page fault for address: 00000000000359e0
Sep 07 03:38:03 quasar kernel: #PF: supervisor write access in kernel mode
Sep 07 03:38:03 quasar kernel: #PF: error_code(0x0002) - not-present page
Sep 07 03:38:03 quasar kernel: PGD 0 P4D 0 
Sep 07 03:38:03 quasar kernel: Oops: 0002 [#1] PREEMPT SMP NOPTI
Sep 07 03:38:03 quasar kernel: CPU: 13 PID: 1984 Comm: kvm Tainted: P           O       6.8.4-2-pve #1
Sep 07 03:38:03 quasar kernel: Hardware name: Micro Computer (HK) Tech Limited Venus Series/AHWSA, BIOS AHWSA.1.22 03/12/2024
Sep 07 03:38:03 quasar kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x276/0x2d0
Sep 07 03:38:03 quasar kernel: Code: 90 49 8b 14 24 48 85 d2 74 f5 eb e7 c1 ea 12 83 e0 03 83 ea 01 48 c1 e0 05 48 63 d2 48 05 c0 59 03 00>
Sep 07 03:38:03 quasar kernel: RSP: 0018:ffffb52fc33db998 EFLAGS: 00010002
Sep 07 03:38:03 quasar kernel: RAX: 00000000000359e0 RBX: ffff8d2d9719c1c8 RCX: 0000000000380000
Sep 07 03:38:03 quasar kernel: RDX: 00000000000020c5 RSI: 0000000083198319 RDI: ffff8d2d9719c1c8
Sep 07 03:38:03 quasar kernel: RBP: ffffb52fc33db9b8 R08: 0000000000000000 R09: 0000000000000000
Sep 07 03:38:03 quasar kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8d44cf8b59c0
Sep 07 03:38:03 quasar kernel: R13: 0000000000000000 R14: 000000000000000d R15: 0000000000000002
Sep 07 03:38:03 quasar kernel: FS:  00007ed9470ee4c0(0000) GS:ffff8d44cf880000(0000) knlGS:0000000000000000
Sep 07 03:38:03 quasar kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 07 03:38:03 quasar kernel: CR2: 00000000000359e0 CR3: 00000001288d6000 CR4: 0000000000f52ef0
Sep 07 03:38:03 quasar kernel: PKRU: 55555554
Sep 07 03:38:03 quasar kernel: Call Trace:
Sep 07 03:38:03 quasar kernel:  <TASK>
Sep 07 03:38:03 quasar kernel:  ? show_regs+0x6d/0x80
Sep 07 03:38:03 quasar kernel:  ? __die+0x24/0x80
Sep 07 03:38:03 quasar kernel:  ? page_fault_oops+0x176/0x500
Sep 07 03:38:03 quasar kernel:  ? update_cfs_group+0xcf/0xf0
Sep 07 03:38:03 quasar kernel:  ? psi_group_change+0x1fb/0x460
Sep 07 03:38:03 quasar kernel:  ? do_user_addr_fault+0x2f9/0x6b0
Sep 07 03:38:03 quasar kernel:  ? exc_page_fault+0x83/0x1b0
Sep 07 03:38:03 quasar kernel:  ? asm_exc_page_fault+0x27/0x30
Sep 07 03:38:03 quasar kernel:  ? native_queued_spin_lock_slowpath+0x276/0x2d0
Sep 07 03:38:03 quasar kernel:  _raw_spin_lock_irqsave+0x5c/0x80
Sep 07 03:38:03 quasar kernel:  remove_wait_queue+0x17/0x60
Sep 07 03:38:03 quasar kernel:  poll_freewait+0x42/0xb0
Sep 07 03:38:03 quasar kernel:  do_sys_poll+0x3a9/0x610
Sep 07 03:38:03 quasar kernel:  ? __pfx_pollwake+0x10/0x10
Sep 07 03:38:03 quasar kernel:  ? __pfx_pollwake+0x10/0x10
Sep 07 03:38:03 quasar kernel:  ? __pfx_pollwake+0x10/0x10
Sep 07 03:38:03 quasar kernel:  ? __pfx_pollwake+0x10/0x10
Sep 07 03:38:03 quasar kernel:  ? __pfx_pollwake+0x10/0x10
Sep 07 03:38:03 quasar kernel:  ? __pfx_pollwake+0x10/0x10
Sep 07 03:38:03 quasar kernel:  ? __pfx_pollwake+0x10/0x10
Sep 07 03:38:03 quasar kernel:  ? __pfx_pollwake+0x10/0x10
Sep 07 03:38:03 quasar kernel:  ? __pfx_pollwake+0x10/0x10
Sep 07 03:38:03 quasar kernel:  __x64_sys_ppoll+0xde/0x170
Sep 07 03:38:03 quasar kernel:  do_syscall_64+0x84/0x180
Sep 07 03:38:03 quasar kernel:  ? do_syscall_64+0x93/0x180
Sep 07 03:38:03 quasar kernel:  ? do_syscall_64+0x93/0x180
Sep 07 03:38:03 quasar kernel:  ? do_syscall_64+0x93/0x180
Sep 07 03:38:03 quasar kernel:  ? do_syscall_64+0x93/0x180
Sep 07 03:38:03 quasar kernel:  ? irqentry_exit+0x43/0x50
Sep 07 03:38:03 quasar kernel:  entry_SYSCALL_64_after_hwframe+0x73/0x7b
Sep 07 03:38:03 quasar kernel: RIP: 0033:0x7ed949b55256
Sep 07 03:38:03 quasar kernel: Code: 7c 24 08 e8 6c 95 f8 ff 4c 8b 54 24 18 48 8b 74 24 10 41 b8 08 00 00 00 41 89 c1 48 8b 7c 24 08 4c 89>
Sep 07 03:38:03 quasar kernel: RSP: 002b:00007ffc0fd32c90 EFLAGS: 00000293 ORIG_RAX: 000000000000010f
Sep 07 03:38:03 quasar kernel: RAX: ffffffffffffffda RBX: 000063d00f351ce0 RCX: 00007ed949b55256
Sep 07 03:38:03 quasar kernel: RDX: 00007ffc0fd32cb0 RSI: 0000000000000010 RDI: 000063d0105775c0
Sep 07 03:38:03 quasar kernel: RBP: 00007ffc0fd32d1c R08: 0000000000000008 R09: 0000000000000000
Sep 07 03:38:03 quasar kernel: R10: 0000000000000000 R11: 0000000000000293 R12: 00007ffc0fd32cb0
Sep 07 03:38:03 quasar kernel: R13: 000063d00f351ce0 R14: 000063d00e3cdee8 R15: 00007ffc0fd32d20
Sep 07 03:38:03 quasar kernel:  </TASK>
Sep 07 03:38:03 quasar kernel: Modules linked in: dm_snapshot tcp_diag inet_diag 8021q garp mrp veth ebtable_filter ebtables ip_set ip6tab>
Sep 07 03:38:03 quasar kernel:  mt792x_lib i915 snd_intel_sdw_acpi polyval_generic mt76_connac_lib ghash_clmulni_intel snd_hda_codec mt76 >
Sep 07 03:38:03 quasar kernel: CR2: 00000000000359e0
Sep 07 03:38:03 quasar kernel: ---[ end trace 0000000000000000 ]---
Sep 07 03:38:03 quasar kernel: general protection fault, maybe for address 0x0: 0000 [#2] PREEMPT SMP NOPTI
Sep 07 03:38:03 quasar kernel: CPU: 18 PID: 1558222 Comm: .NET ThreadPool Tainted: P      D    O       6.8.4-2-pve #1
Sep 07 03:38:03 quasar kernel: Hardware name: Micro Computer (HK) Tech Limited Venus Series/AHWSA, BIOS AHWSA.1.22 03/12/2024
Sep 07 03:38:03 quasar kernel: RIP: 0010:futex_wait+0xc1/0x120
Sep 07 03:38:03 quasar kernel: Code: 48 8b 45 d0 65 48 2b 04 25 28 00 00 00 75 6d 48 83 c4 60 89 d0 5b 41 5c 41 5d 41 5e 41 5f 5d 31 d2 31>
Sep 07 03:38:03 quasar kernel: RSP: 0018:ffffb52fee48fe50 EFLAGS: 00010246
Sep 07 03:38:03 quasar kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
Sep 07 03:38:03 quasar kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Sep 07 03:38:03 quasar kernel: RBP: 70cd60e5b6d5b6d5 R08: 0000000000000000 R09: 0000000000000000
Sep 07 03:38:03 quasar kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000089
Sep 07 03:38:03 quasar kernel: R13: 0000000000000009 R14: 0000000000000089 R15: e68ae68a19d919d9
Sep 07 03:38:03 quasar kernel: FS:  00007278e7400700(0000) GS:ffff8d44cfb00000(0000) knlGS:0000000000000000
Sep 07 03:38:03 quasar kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 07 03:38:03 quasar kernel: CR2: 000076e62b3e0380 CR3: 00000002a6874000 CR4: 0000000000f52ef0
Sep 07 03:38:03 quasar kernel: PKRU: 55555554
Sep 07 03:38:03 quasar kernel: Call Trace:
Sep 07 03:38:03 quasar kernel:  <TASK>
Sep 07 03:38:03 quasar kernel:  ? show_regs+0x6d/0x80
Sep 07 03:38:03 quasar kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x276/0x2d0
Sep 07 03:38:03 quasar kernel:  ? die_addr+0x37/0xa0
Sep 07 03:38:03 quasar kernel: Code: 90 49 8b 14 24 48 85 d2 74 f5 eb e7 c1 ea 12 83 e0 03 83 ea 01 48 c1 e0 05 48 63 d2 48 05 c0 59 03 00>
Sep 07 03:38:03 quasar kernel:  ? exc_general_protection+0x1db/0x480
Sep 07 03:38:03 quasar kernel: RSP: 0018:ffffb52fc33db998 EFLAGS: 00010002
Sep 07 03:38:03 quasar kernel:  ? asm_exc_general_protection+0x27/0x30
Sep 07 03:38:03 quasar kernel: RAX: 00000000000359e0 RBX: ffff8d2d9719c1c8 RCX: 0000000000380000
Sep 07 03:38:03 quasar kernel:  ? futex_wait+0xc1/0x120
Sep 07 03:38:03 quasar kernel: RDX: 00000000000020c5 RSI: 0000000083198319 RDI: ffff8d2d9719c1c8
Sep 07 03:38:03 quasar kernel:  ? do_syscall_64+0x84/0x180
Sep 07 03:38:03 quasar kernel: RBP: ffffb52fc33db9b8 R08: 0000000000000000 R09: 0000000000000000
Sep 07 03:38:03 quasar kernel:  ? do_syscall_64+0x93/0x180
Sep 07 03:38:03 quasar kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8d44cf8b59c0
Sep 07 03:38:03 quasar kernel:  ? do_syscall_64+0x93/0x180
Sep 07 03:38:03 quasar kernel: R13: 0000000000000000 R14: 000000000000000d R15: 0000000000000002
Sep 07 03:38:03 quasar kernel:  ? do_syscall_64+0x93/0x180
Sep 07 03:38:03 quasar kernel: FS:  00007ed9470ee4c0(0000) GS:ffff8d44cf880000(0000) knlGS:0000000000000000
Sep 07 03:38:03 quasar kernel:  ? entry_SYSCALL_64_after_hwframe+0x73/0x7b
Sep 07 03:38:03 quasar kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 07 03:38:03 quasar kernel:  </TASK>
Sep 07 03:38:03 quasar kernel: CR2: 00000000000359e0 CR3: 00000001288d6000 CR4: 0000000000f52ef0
Sep 07 03:38:03 quasar kernel: Modules linked in:
Sep 07 03:38:03 quasar kernel: PKRU: 55555554
Sep 07 03:38:03 quasar kernel:  dm_snapshot tcp_diag
Sep 07 03:38:03 quasar kernel: note: kvm[1984] exited with irqs disabled
Sep 07 03:38:03 quasar kernel:  inet_diag
Sep 07 03:38:03 quasar kernel: note: kvm[1984] exited with preempt_count 1
Sep 07 03:38:03 quasar kernel:  8021q garp mrp veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables ipt>
Sep 07 03:38:03 quasar kernel:  mt76_connac_lib ghash_clmulni_intel snd_hda_codec mt76 btusb sha256_ssse3 btrtl snd_hda_core sha1_ssse3 dr>
Sep 07 03:38:03 quasar kernel: ---[ end trace 0000000000000000 ]---
Sep 07 03:38:03 quasar kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x276/0x2d0
Sep 07 03:38:03 quasar kernel: Code: 90 49 8b 14 24 48 85 d2 74 f5 eb e7 c1 ea 12 83 e0 03 83 ea 01 48 c1 e0 05 48 63 d2 48 05 c0 59 03 00>
Sep 07 03:38:03 quasar kernel: RSP: 0018:ffffb52fc33db998 EFLAGS: 00010002
Sep 07 03:38:03 quasar kernel: RAX: 00000000000359e0 RBX: ffff8d2d9719c1c8 RCX: 0000000000380000
Sep 07 03:38:03 quasar kernel: RDX: 00000000000020c5 RSI: 0000000083198319 RDI: ffff8d2d9719c1c8
Sep 07 03:38:03 quasar kernel: RBP: ffffb52fc33db9b8 R08: 0000000000000000 R09: 0000000000000000
Sep 07 03:38:03 quasar kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8d44cf8b59c0
Sep 07 03:38:03 quasar kernel: R13: 0000000000000000 R14: 000000000000000d R15: 0000000000000002
Sep 07 03:38:03 quasar kernel: FS:  00007278e7400700(0000) GS:ffff8d44cfb00000(0000) knlGS:0000000000000000
Sep 07 03:38:03 quasar kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 07 03:38:03 quasar kernel: CR2: 000076e62b3e0380 CR3: 00000002a6874000 CR4: 0000000000f52ef0
Sep 07 03:38:03 quasar kernel: PKRU: 55555554
-- Boot c8ebc2afcd4343c697dad254aaad978d --
Sep 07 12:05:23 quasar kernel: Linux version 6.8.4-2-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) >
Sep 07 12:05:23 quasar kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.4-2-pve root=/dev/mapper/pve-root ro quiet
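The trace above interleaves output from two CPUs, which makes it hard to see where each fault occurred. A rough sketch for listing the distinct kernel-mode faulting locations, assuming the "RIP: 0010:<symbol>" line format shown above (0010 is the kernel code segment; the 0033 lines are user-space return addresses and are skipped). kernel_rips is a hypothetical helper name.

```shell
# Print the distinct kernel-mode RIP symbols found in journal text on stdin.
kernel_rips() {
  sed -n 's/.*kernel: RIP: 0010:\([^ ]*\).*/\1/p' | sort -u
}
```

Possible usage: journalctl -k -b -1 --no-pager | kernel_rips.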

@morey-tech
Owner Author

I've swapped the supported 2x 32GB DIMMs back in and ran upgrades on the host, which included a newer proxmox-kernel.

Upgrade Details

The following NEW packages will be installed:
  proxmox-kernel-6.8.12-1-pve-signed
The following packages will be upgraded:
  base-files bash bind9-dnsutils bind9-host bind9-libs ceph-common ceph-fuse curl
  distro-info-data gnutls-bin ifupdown2 initramfs-tools initramfs-tools-core
  krb5-locales less libarchive13 libc-bin libc-l10n libc6 libcephfs2 libcurl3-gnutls
  libcurl4 libfreetype6 libglib2.0-0 libgnutls-dane0 libgnutls30 libgnutlsxx30
  libgssapi-krb5-2 libgstreamer-plugins-base1.0-0 libk5crypto3 libkrb5-3
  libkrb5support0 libnss-systemd libnvpair3linux libopeniscsiusr libpam-systemd
  libproxmox-acme-perl libproxmox-acme-plugins libpve-cluster-api-perl
  libpve-cluster-perl libpve-common-perl libpve-guest-common-perl libpve-notify-perl
  libpve-rs-perl libpve-storage-perl libpython3.11-minimal libpython3.11-stdlib
  libqt5core5a libqt5dbus5 libqt5network5 librados2 libradosstriper1 librbd1 librgw2
  libseccomp2 libssl3 libsystemd-shared libsystemd0 libudev1 libuutil3linux
  libzfs4linux libzpool5linux locales nano open-iscsi openssh-client openssh-server
  openssh-sftp-server openssl postfix proxmox-backup-client
  proxmox-backup-file-restore proxmox-firewall proxmox-kernel-6.8
  proxmox-secure-boot-support proxmox-termproxy proxmox-widget-toolkit pve-cluster
  pve-container pve-docs pve-esxi-import-tools pve-firewall pve-firmware
  pve-ha-manager pve-manager pve-qemu-kvm python3-ceph-argparse python3-ceph-common
  python3-cephfs python3-idna python3-rados python3-rbd python3-rgw python3.11
  python3.11-minimal qemu-server shim-helpers-amd64-signed shim-signed
  shim-signed-common shim-unsigned spl ssh systemd systemd-boot systemd-boot-efi
  systemd-sysv udev zfs-initramfs zfs-zed zfsutils-linux

@morey-tech
Owner Author

Set up remote syslog (via netconsole) to capture the kernel panic output next time.

https://pve.proxmox.com/wiki/Kernel_Crash_Trace_Log

root@quasar:$ cat /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet netconsole=5555@192.168.1.30/enp2s0f0np0,5555@192.168.1.31/dc:a6:32:01:cf:0d loglevel=7"
morey-tech@raspberrypi:~ $ cat /etc/rsyslog.d/01-netconsole-collector.conf 
# Start UDP server on port 5555
$ModLoad imudp
$UDPServerRun 5555

# Define templates
$template NetconsoleFile,"/var/log/netconsole/%fromhost-ip%.log"
$template NetconsoleFormat,"%rawmsg%"

# Accept endline characters (unfortunately these options are global)
$EscapeControlCharactersOnReceive off
$DropTrailingLFOnReception off

# Store collected logs using templates without local ones
:fromhost-ip, !isequal, "127.0.0.1"     ?NetconsoleFile;NetconsoleFormat

# Discard logs that match the rule above
& ~

@morey-tech
Owner Author

With vmbr0, netconsole can't start because the bridge doesn't exist yet when the kernel parses the parameter.

Sep 07 22:56:55 quasar kernel: i40e 0000:02:00.1 enp2s0f1np1: renamed from eth1
Sep 07 22:56:55 quasar kernel: i40e 0000:02:00.0 enp2s0f0np0: renamed from eth0
Sep 07 22:56:55 quasar kernel: netpoll: netconsole: local port 5555
Sep 07 22:56:55 quasar kernel: netpoll: netconsole: local IPv4 address 192.168.1.30
Sep 07 22:56:55 quasar kernel: netpoll: netconsole: interface 'vmbr0'
Sep 07 22:56:55 quasar kernel: netpoll: netconsole: remote port 5555
Sep 07 22:56:55 quasar kernel: netpoll: netconsole: remote IPv4 address 192.168.1.31
Sep 07 22:56:55 quasar kernel: netpoll: netconsole: remote ethernet address dc:a6:32:01:cf:0d
Sep 07 22:56:55 quasar kernel: netpoll: netconsole: vmbr0 doesn't exist, aborting
Sep 07 22:56:55 quasar kernel: netconsole: cleaning up

With enp2s0f0np0, it works until vmbr0 is set up, at which point netconsole stops logging because the interface joins the bridge.

Sep 07 22:59:07 quasar kernel: netpoll: netconsole: local port 5555
Sep 07 22:59:07 quasar kernel: netpoll: netconsole: local IPv4 address 192.168.1.30
Sep 07 22:59:07 quasar kernel: netpoll: netconsole: interface 'enp2s0f0np0'
Sep 07 22:59:07 quasar kernel: netpoll: netconsole: remote port 5555
Sep 07 22:59:07 quasar kernel: netpoll: netconsole: remote IPv4 address 192.168.1.31
Sep 07 22:59:07 quasar kernel: netpoll: netconsole: remote ethernet address dc:a6:32:01:cf:0d
Sep 07 22:59:07 quasar kernel: netpoll: netconsole: device enp2s0f0np0 not up yet, forcing it
Sep 07 22:59:07 quasar kernel: printk: legacy console [netcon0] enabled
Sep 07 22:59:07 quasar kernel: netconsole: network logging started
...
Sep 07 22:59:09 quasar kernel: vmbr0: port 1(enp2s0f0np0) entered blocking state
Sep 07 22:59:09 quasar kernel: vmbr0: port 1(enp2s0f0np0) entered disabled state
Sep 07 22:59:09 quasar kernel: netconsole: network logging stopped on interface enp2s0f0np0 as it is joining a master device
Sep 07 22:59:09 quasar kernel: i40e 0000:02:00.0 enp2s0f0np0: entered allmulticast mode
Sep 07 22:59:09 quasar kernel: i40e 0000:02:00.0 enp2s0f0np0: entered promiscuous mode
Sep 07 22:59:09 quasar kernel: vmbr0: port 1(enp2s0f0np0) entered blocking state
Sep 07 22:59:09 quasar kernel: i40e 0000:02:00.0: entering allmulti mode.
Sep 07 22:59:09 quasar kernel: vmbr0: port 1(enp2s0f0np0) entered forwarding state
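The netpoll lines above spell out each field of the netconsole parameter: local port, local IPv4 address, interface, remote port, remote IPv4 address, and remote ethernet address. As a sketch, the value can be assembled from those parts; netconsole_param is a hypothetical helper name.

```shell
# Build a netconsole= value from its parts:
#   <src-port>@<src-ip>/<dev>,<tgt-port>@<tgt-ip>/<tgt-mac>
# Arguments, in order: src-port src-ip dev tgt-port tgt-ip tgt-mac
netconsole_param() {
  printf 'netconsole=%s@%s/%s,%s@%s/%s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}
```

For example, netconsole_param 5555 192.168.1.30 enp2s0f0np0 5555 192.168.1.31 dc:a6:32:01:cf:0d reproduces the settings reported in the second log above.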

@morey-tech
Owner Author

morey-tech commented Sep 8, 2024

I can set it to vmbr0 after the system is running:

modprobe netconsole netconsole=5555@192.168.1.30/vmbr0,5555@192.168.1.31/dc:a6:32:01:cf:0d

FYI: leaving out the interface name (e.g. 5555@192.168.1.30/) defaults to eth0.

To update settings, first run

rmmod netconsole

If you get an error:

modprobe: ERROR: could not insert 'netconsole': No such device

Then run this first without any parameters:

modprobe netconsole
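The unload-then-reload sequence could be wrapped as below. This is only a sketch under assumptions: reload_netconsole and the DRY_RUN switch are hypothetical, and the real commands need root.

```shell
# Reload netconsole with a new parameter string.
# $1: full "netconsole=..." value to pass to modprobe.
# Set DRY_RUN=1 to print the commands instead of running them.
reload_netconsole() {
  run=${DRY_RUN:+echo}
  # Unload first; ignore the error if the module is not currently loaded.
  $run rmmod netconsole 2>/dev/null || true
  $run modprobe netconsole "$1"
}
```

Running it with DRY_RUN=1 first shows the exact commands before anything is changed on the host.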

https://www.apalrd.net/posts/2024/pve_netconsole/

Logs are successfully being sent to raspberrypi.
[Screenshot from 2024-09-08 08-53-19]

@morey-tech
Owner Author

morey-tech commented Sep 8, 2024

Set up a BMC interface with a static IP on enp89s0.

modprobe netconsole netconsole=5555@192.168.1.32/enp89s0,5555@192.168.1.31/dc:a6:32:01:cf:0d

Confirmed working (receiving logs on raspberrypi).
[Screenshot from 2024-09-08 09-00-53]

Updated grub defaults:

root@quasar:~# cat /etc/default/grub
# ...
GRUB_CMDLINE_LINUX_DEFAULT="quiet netconsole=5555@192.168.1.32/enp89s0,5555@192.168.1.31/dc:a6:32:01:cf:0d loglevel=7"

root@quasar:~# update-grub
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-6.8.12-1-pve
Found initrd image: /boot/initrd.img-6.8.12-1-pve
Found linux image: /boot/vmlinuz-6.8.4-2-pve
Found initrd image: /boot/initrd.img-6.8.4-2-pve
Found memtest86+ 64bit EFI image: /boot/memtest86+x64.efi
Adding boot menu entry for UEFI Firmware Settings ...
done

Shut down all the guests on the host, then triggered a kernel panic:

root@quasar:~# echo c > /proc/sysrq-trigger

Which was received on raspberrypi:

morey-tech@raspberrypi:~ $ cat /var/log/netconsole/192.168.1.32.log
# ...
[36313.874812] sysrq: Trigger a crash
[36313.875632] Kernel panic - not syncing: sysrq triggered crash
[36313.876420] CPU: 8 PID: 283605 Comm: bash Tainted: P           O       6.8.12-1-pve #1
[36313.877204] Hardware name: Micro Computer (HK) Tech Limited Venus Series/AHWSA, BIOS AHWSA.1.22 03/12/2024
[36313.877903] Call Trace:

[Screenshot from 2024-09-08 09-04-57]

On boot, netconsole was configured correctly:

Sep 08 09:08:36 quasar kernel: netpoll: netconsole: local port 5555
Sep 08 09:08:36 quasar kernel: netpoll: netconsole: local IPv4 address 192.168.1.32
Sep 08 09:08:36 quasar kernel: netpoll: netconsole: interface 'enp89s0'
Sep 08 09:08:36 quasar kernel: netpoll: netconsole: remote port 5555
Sep 08 09:08:36 quasar kernel: netpoll: netconsole: remote IPv4 address 192.168.1.31
Sep 08 09:08:36 quasar kernel: netpoll: netconsole: remote ethernet address dc:a6:32:01:cf:0d
Sep 08 09:08:36 quasar kernel: netpoll: netconsole: device enp89s0 not up yet, forcing it
Sep 08 09:08:36 quasar kernel: igc 0000:59:00.0 enp89s0: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
Sep 08 09:08:36 quasar kernel: printk: legacy console [netcon0] enabled
Sep 08 09:08:36 quasar kernel: netconsole: network logging started

After the reboot, the first log lines reach raspberrypi about 9.1 seconds (at 9.095761) into the kernel boot.

[36313.957987] ---[ end Kernel panic - not syncing: sysrq triggered crash ]---
[    9.095761] vmbr0: port 1(enp2s0f0np0) entered blocking state
[    9.096704] vmbr0: port 1(enp2s0f0np0) entered forwarding state
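To check the collected netconsole logs for a real panic later, something like the following sketch could be used. It assumes the "Kernel panic - not syncing: <reason>" wording shown above; panic_reasons is a hypothetical helper name.

```shell
# Print the reason text of each kernel panic found in log text on stdin.
# The closing "---[ end Kernel panic ... ]---" marker is skipped so each panic is listed once.
panic_reasons() {
  grep -v 'end Kernel panic' | sed -n 's/.*Kernel panic - not syncing: \(.*\)$/\1/p'
}
```

Possible usage: panic_reasons < /var/log/netconsole/192.168.1.32.log.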
