
IRQ handler failure results in hung enclave and broken system #464

Open
gmarkey opened this issue Mar 24, 2023 · 2 comments
gmarkey commented Mar 24, 2023

Kernel version: 5.15.0-1031-aws
Nitro CLI version: 1.2.2
Instance type: m6i.2xlarge

Kernel module info:

filename:       /lib/modules/5.15.0-1031-aws/kernel/drivers/virt/nitro_enclaves/nitro_enclaves.ko
license:        GPL v2
description:    Nitro Enclaves Driver
author:         Amazon.com, Inc. or its affiliates
srcversion:     55EA5FE324256DEAE204522
alias:          pci:v00001D0Fd0000E4C1sv*sd*bc*sc*i*
depends:
retpoline:      Y
intree:         Y
name:           nitro_enclaves
vermagic:       5.15.0-1031-aws SMP mod_unload modversions
sig_id:         PKCS#7
signer:         Build time autogenerated kernel key
sig_key:        49:A9:55:87:90:5B:33:41:AF:C0:A7:BE:2A:71:6C:D2:CA:34:E0:AE
sig_hashalgo:   sha512
signature:      37:40:0C:76:41:C3:A6:E4:52:3F:0F:3F:5C:BB:09:5F:22:C2:45:0C:
                8A:3E:44:0C:B2:96:BB:69:42:70:40:C5:46:FC:1F:7E:6F:23:67:B5:
                C1:C0:5F:7F:7D:25:15:F8:8A:20:FC:27:50:4D:57:47:E6:AB:F6:DE:
                71:8E:70:FE:06:2B:DD:0B:2F:5A:18:03:B7:98:35:70:D0:F3:AC:A6:
                39:16:C5:5C:26:E1:4A:13:07:46:FE:54:9D:A3:43:11:16:4B:C5:77:
                E7:99:56:9F:DF:C0:61:14:4D:6E:EB:A9:6E:18:2B:3A:29:CF:D0:AC:
                18:46:85:7E:90:2E:B0:A3:2B:F0:7F:54:C3:0D:72:10:30:31:D4:53:
                33:11:6E:A4:D2:C5:8D:4F:31:AA:5C:9E:8E:75:C7:06:0C:8C:87:A4:
                91:42:81:E3:DC:7B:E8:75:F5:F1:15:D7:17:8E:FE:C0:8B:7C:24:05:
                3F:0C:A7:D1:47:8F:E7:85:65:DB:B9:7F:E4:C7:11:3F:A3:AE:A9:ED:
                F4:DF:C6:16:E6:13:12:B4:40:37:97:40:3E:87:9E:1F:5E:DC:49:7E:
                05:8C:57:87:2C:74:25:50:35:A0:75:C4:BB:1E:6B:B2:F9:33:05:0C:
                6C:05:FC:F5:55:5D:17:B2:05:B1:18:93:D7:C5:E1:39:4B:B5:B4:E8:
                24:BC:B2:18:ED:37:8D:68:91:9B:0D:91:E3:77:8F:7C:92:30:77:CA:
                E0:52:C4:CD:1F:09:08:9E:1A:B8:F5:04:B6:2F:94:23:44:50:D2:F7:
                04:07:D9:84:D7:95:88:9B:27:E8:B5:6B:ED:A1:2E:77:25:A4:83:88:
                EA:D1:4F:6C:FB:94:7C:24:4E:DD:99:62:36:60:1D:A1:7E:36:E7:BF:
                1A:32:4A:02:48:21:1A:72:5F:96:A2:25:6E:F0:74:B1:9E:92:B0:0B:
                DB:7A:BE:95:29:C2:7A:7F:91:5F:74:03:7C:6A:D7:9B:85:DB:FB:62:
                29:9B:B5:A3:9A:5D:91:CE:24:D7:AD:04:F5:69:0B:97:2B:EB:C8:94:
                CB:EB:18:AC:19:0B:43:53:3C:25:56:F8:EF:65:36:29:F3:82:35:D6:
                34:FB:41:1C:F3:21:11:97:AA:41:DE:25:74:84:61:3A:BF:33:EC:8F:
                C4:57:7F:8E:62:BE:BE:8C:47:F1:D1:25:38:B7:86:C3:AE:BA:8C:AB:
                94:D1:5E:B2:78:BC:38:E6:26:D6:E8:13:B0:01:4A:73:58:5D:D1:87:
                51:D3:BD:D0:34:2D:99:98:85:07:EA:CE:00:DF:C0:AD:7B:13:5B:81:
                0B:30:D8:89:E5:DD:FD:F6:AE:D2:3A:D8
parm:           ne_cpus:<cpu-list> - CPU pool used for Nitro Enclaves

Allocator config:

---
# Enclave configuration file.
#
# How much memory to allocate for enclaves (in MiB).
memory_mib: 22108
#
# How many CPUs to reserve for enclaves.
cpu_count: 6
#
# Alternatively, the exact CPUs to be reserved for the enclave can be explicitly
# configured by using `cpu_pool` (like below), instead of `cpu_count`.
# Note: cpu_count and cpu_pool conflict with each other. Only use exactly one of them.
# Example of reserving CPUs 2, 3, and 6 through 9:
# cpu_pool: 2,3,6-9

Enclave config:

{
  "EnclaveName": "...",
  "EnclaveID": "...",
  "ProcessID": 3189409,
  "EnclaveCID": 13,
  "NumberOfCPUs": 2,
  "CPUIDs": [
    1,
    5
  ],
  "MemoryMiB": 5120
}

What is happening?
Previously, I found that enclaves running with debugging enabled and a console attached would eventually hang once they had produced large amounts of stdout/stderr, and that attempts to restart them would fail with the following error:

Start allocating memory...
Started enclave with enclave-cid: 13, memory: 5120 MiB, cpu-ids: [1, 5]
[ E36 ] Enclave boot failure. Such error appears when attempting to receive the `ready` signal from a freshly booted enclave. It arises in several contexts, for instance, when the enclave is booted from an invalid EIF file and the enclave process immediately exits, failing to submit the `ready` signal. In this case, the error backtrace provides detailed information on what specifically failed during the enclave boot process.

From the log file:

  Action: Run Enclave
  Subactions:
    Failed to execute command `Run`
    Failed to trigger enclave run
    Failed to run enclave
    Failed to create enclave
    Waiting on enclave to boot failed with error VsockReceivingError
  Root error file: src/enclave_proc/resource_manager.rs
  Root error line: 605
  Version: 1.2.2

Meanwhile, the kernel would report errors like this:

Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.443240] irq 41: nobody cared (try booting with the "irqpoll" option)
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446594] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.15.0-1031-aws #35-Ubuntu
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446598] Hardware name: Amazon EC2 m6i.2xlarge/, BIOS 1.0 10/16/2017
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446599] Call Trace:
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446602]  <IRQ>
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446604]  show_stack+0x52/0x5c
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446610]  dump_stack_lvl+0x4a/0x63
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446613]  dump_stack+0x10/0x16
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446615]  __report_bad_irq+0x3a/0xb3
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446618]  note_interrupt.cold+0xb/0x60
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446620]  handle_irq_event+0xa8/0xb0
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446623]  handle_edge_irq+0x8c/0x230
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446625]  __common_interrupt+0x4f/0xe0
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446628]  common_interrupt+0x89/0xa0
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446632]  </IRQ>
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446632]  <TASK>
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446633]  asm_common_interrupt+0x27/0x40
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446637] RIP: 0010:cpu_idle_poll.isra.0+0x33/0xd0
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446639] Code: 53 65 8b 15 47 9e 85 7d 66 90 e8 08 cd 39 ff fb 66 0f 1f 44 00 00 65 48 8b 1c 25 c0 fb 01 00 48 8b 03 a8 08 74 0b eb 1c f3 90 <48> 8b 03 a8 08 75 13 8b 05 00 c3 47 01 85 c0 75 ed e8 d7 68 3c ff
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446641] RSP: 0018:ffffffff83803e68 EFLAGS: 00000202
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446644] RAX: 0000000000000001 RBX: ffffffff8381b440 RCX: 0000000000000001
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446645] RDX: 000000005f5a1209 RSI: 0000000000000000 RDI: 000000005f5a120a
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446646] RBP: ffffffff83803e70 R08: 00048be10300d8c4 R09: 00000000000003ed
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446648] R10: 0000000000000001 R11: 0000000000000001 R12: ffffffff8381b440
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446649] R13: 0000000000000027 R14: 0000000000000000 R15: 0000000000000000
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446651]  ? cpu_idle_poll.isra.0+0x18/0xd0
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446653]  do_idle+0x45/0xf0
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446656]  cpu_startup_entry+0x20/0x30
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446658]  rest_init+0xd3/0x100
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446659]  ? acpi_enable_subsystem+0x20b/0x217
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446664]  arch_call_rest_init+0xe/0x23
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446669]  start_kernel+0x4a9/0x4ca
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446671]  x86_64_start_reservations+0x24/0x2a
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446673]  x86_64_start_kernel+0xfb/0x106
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446675]  secondary_startup_64_no_verify+0xc2/0xcb
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446680]  </TASK>
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446680] handlers:
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.447978] [<0000000043f4e91b>] vring_interrupt
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.450356] Disabling IRQ #41

The only apparent way to allow the system to run any enclaves at this point was a complete reboot; the EIF file is unchanged, so there's nothing wrong with it. Since the handler on the disabled IRQ line is vring_interrupt, it seems plausible that vsock signalling on the host stops working entirely, which would explain why no freshly booted enclave can deliver the ready signal afterwards.

I suspected that the issue was caused by either a high rate or a high volume of stdout/stderr, so my solution was to disable debugging and create a log-shipping pipeline that uses a separate vsock connection to get meaningful data out of the enclave (sketched below). This appeared to be working fine until recently, when we ran into the exact same problem after a few weeks of uptime.

It's worth noting that we still attach to the console as a means of blocking until the enclave exits; however, there is no output (as expected). It appears that at some point the enclave crashed (RC=4) after running OK for some time, and after that it was no longer able to start due to the ready-signal error. The enclaved application appears to have been in a hung state from about 02:21+08:00 (when the kernel IRQ error occurred), with the enclave itself crashing at about 17:47+08:00.
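For context, the parent-side half of that log-shipping pipeline looks roughly like the sketch below. This is a minimal illustration rather than our production code; the port number and log-file path are placeholder values, and it assumes Python's AF_VSOCK support on the parent instance.

# Minimal sketch (placeholder values): parent-side vsock listener that
# receives log data from the enclave and appends it to a local file.
import socket

LOG_PORT = 9000  # assumed port; must match the enclave-side sender

def ship_logs():
    srv = socket.socket(socket.AF_VSOCK, socket.SOCK_STREAM)
    srv.bind((socket.VMADDR_CID_ANY, LOG_PORT))
    srv.listen(1)
    conn, (peer_cid, _peer_port) = srv.accept()
    with open(f"enclave-{peer_cid}.log", "ab") as log:
        while True:
            chunk = conn.recv(4096)
            if not chunk:  # enclave closed the connection or exited
                break
            log.write(chunk)

if __name__ == "__main__":
    ship_logs()

The enclave side simply connects to that port and writes its log output there, instead of relying on the debug console.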

@eugkoira eugkoira self-assigned this Mar 24, 2023
eugkoira (Contributor) commented:

This is a known issue on our side; we are busy with a fix for it. It is reproducible especially when one of the vsock directions is overloaded while the other is relatively unused.
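For anyone trying to reproduce this, a minimal sketch of that asymmetric-load condition, run inside the enclave (all values assumed: port 9001 with a slow or idle listener on the parent; the parent instance of a Nitro Enclave is reachable at CID 3):

# Hypothetical load generator: saturate the guest -> host vsock
# direction while the host -> guest direction stays idle.
import socket

PARENT_CID = 3   # CID of the parent instance for Nitro Enclaves
PORT = 9001      # assumed port; pair with a listener that reads slowly or not at all

s = socket.socket(socket.AF_VSOCK, socket.SOCK_STREAM)
s.connect((PARENT_CID, PORT))
payload = b"x" * 65536
while True:
    s.sendall(payload)  # this direction is overloaded; the reverse carries nothing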

eugkoira commented Nov 7, 2024

@gmarkey Is it working on your side? The fix for it was deployed quite some time ago.
