Kernel module info:
filename: /lib/modules/5.15.0-1031-aws/kernel/drivers/virt/nitro_enclaves/nitro_enclaves.ko
license: GPL v2
description: Nitro Enclaves Driver
author: Amazon.com, Inc. or its affiliates
srcversion: 55EA5FE324256DEAE204522
alias: pci:v00001D0Fd0000E4C1sv*sd*bc*sc*i*
depends:
retpoline: Y
intree: Y
name: nitro_enclaves
vermagic: 5.15.0-1031-aws SMP mod_unload modversions
sig_id: PKCS#7
signer: Build time autogenerated kernel key
sig_key: 49:A9:55:87:90:5B:33:41:AF:C0:A7:BE:2A:71:6C:D2:CA:34:E0:AE
sig_hashalgo: sha512
signature: 37:40:0C:76:41:C3:A6:E4:52:3F:0F:3F:5C:BB:09:5F:22:C2:45:0C:
8A:3E:44:0C:B2:96:BB:69:42:70:40:C5:46:FC:1F:7E:6F:23:67:B5:
C1:C0:5F:7F:7D:25:15:F8:8A:20:FC:27:50:4D:57:47:E6:AB:F6:DE:
71:8E:70:FE:06:2B:DD:0B:2F:5A:18:03:B7:98:35:70:D0:F3:AC:A6:
39:16:C5:5C:26:E1:4A:13:07:46:FE:54:9D:A3:43:11:16:4B:C5:77:
E7:99:56:9F:DF:C0:61:14:4D:6E:EB:A9:6E:18:2B:3A:29:CF:D0:AC:
18:46:85:7E:90:2E:B0:A3:2B:F0:7F:54:C3:0D:72:10:30:31:D4:53:
33:11:6E:A4:D2:C5:8D:4F:31:AA:5C:9E:8E:75:C7:06:0C:8C:87:A4:
91:42:81:E3:DC:7B:E8:75:F5:F1:15:D7:17:8E:FE:C0:8B:7C:24:05:
3F:0C:A7:D1:47:8F:E7:85:65:DB:B9:7F:E4:C7:11:3F:A3:AE:A9:ED:
F4:DF:C6:16:E6:13:12:B4:40:37:97:40:3E:87:9E:1F:5E:DC:49:7E:
05:8C:57:87:2C:74:25:50:35:A0:75:C4:BB:1E:6B:B2:F9:33:05:0C:
6C:05:FC:F5:55:5D:17:B2:05:B1:18:93:D7:C5:E1:39:4B:B5:B4:E8:
24:BC:B2:18:ED:37:8D:68:91:9B:0D:91:E3:77:8F:7C:92:30:77:CA:
E0:52:C4:CD:1F:09:08:9E:1A:B8:F5:04:B6:2F:94:23:44:50:D2:F7:
04:07:D9:84:D7:95:88:9B:27:E8:B5:6B:ED:A1:2E:77:25:A4:83:88:
EA:D1:4F:6C:FB:94:7C:24:4E:DD:99:62:36:60:1D:A1:7E:36:E7:BF:
1A:32:4A:02:48:21:1A:72:5F:96:A2:25:6E:F0:74:B1:9E:92:B0:0B:
DB:7A:BE:95:29:C2:7A:7F:91:5F:74:03:7C:6A:D7:9B:85:DB:FB:62:
29:9B:B5:A3:9A:5D:91:CE:24:D7:AD:04:F5:69:0B:97:2B:EB:C8:94:
CB:EB:18:AC:19:0B:43:53:3C:25:56:F8:EF:65:36:29:F3:82:35:D6:
34:FB:41:1C:F3:21:11:97:AA:41:DE:25:74:84:61:3A:BF:33:EC:8F:
C4:57:7F:8E:62:BE:BE:8C:47:F1:D1:25:38:B7:86:C3:AE:BA:8C:AB:
94:D1:5E:B2:78:BC:38:E6:26:D6:E8:13:B0:01:4A:73:58:5D:D1:87:
51:D3:BD:D0:34:2D:99:98:85:07:EA:CE:00:DF:C0:AD:7B:13:5B:81:
0B:30:D8:89:E5:DD:FD:F6:AE:D2:3A:D8
parm: ne_cpus:<cpu-list> - CPU pool used for Nitro Enclaves
Allocator config:
---
# Enclave configuration file.
#
# How much memory to allocate for enclaves (in MiB).
memory_mib: 22108
#
# How many CPUs to reserve for enclaves.
cpu_count: 6
#
# Alternatively, the exact CPUs to be reserved for the enclave can be explicitly
# configured by using `cpu_pool` (like below) instead of `cpu_count`.
# Note: cpu_count and cpu_pool conflict with each other; use exactly one of them.
# Example of reserving CPUs 2, 3, and 6 through 9:
# cpu_pool: 2,3,6-9
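For reference, the `cpu_pool` value uses the kernel's cpulist syntax (comma-separated ids and ranges). A minimal sketch of how such a spec expands (the function name is mine, not part of nitro-cli):

```python
def parse_cpu_pool(spec: str) -> list[int]:
    """Expand a kernel-style CPU list such as "2,3,6-9" into
    the individual CPU ids it covers."""
    cpus = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus

print(parse_cpu_pool("2,3,6-9"))  # [2, 3, 6, 7, 8, 9]
```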
What is happening?
Previously, I found that enclaves running with debugging enabled and a console attached, producing large amounts of stdout/stderr, would eventually hang, and that attempts to restart them would fail with the following error:
Start allocating memory...
Started enclave with enclave-cid: 13, memory: 5120 MiB, cpu-ids: [1, 5]
[ E36 ] Enclave boot failure. Such error appears when attempting to receive the `ready` signal from a freshly booted enclave. It arises in several contexts, for instance, when the enclave is booted from an invalid EIF file and the enclave process immediately exits, failing to submit the `ready` signal. In this case, the error backtrace provides detailed information on what specifically failed during the enclave boot process.
From the log file:
Action: Run Enclave
Subactions:
Failed to execute command `Run`
Failed to trigger enclave run
Failed to run enclave
Failed to create enclave
Waiting on enclave to boot failed with error VsockReceivingError
Root error file: src/enclave_proc/resource_manager.rs
Root error line: 605
Version: 1.2.2
The kernel meanwhile would report errors like this:
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.443240] irq 41: nobody cared (try booting with the "irqpoll" option)
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446594] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.15.0-1031-aws #35-Ubuntu
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446598] Hardware name: Amazon EC2 m6i.2xlarge/, BIOS 1.0 10/16/2017
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446599] Call Trace:
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446602] <IRQ>
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446604] show_stack+0x52/0x5c
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446610] dump_stack_lvl+0x4a/0x63
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446613] dump_stack+0x10/0x16
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446615] __report_bad_irq+0x3a/0xb3
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446618] note_interrupt.cold+0xb/0x60
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446620] handle_irq_event+0xa8/0xb0
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446623] handle_edge_irq+0x8c/0x230
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446625] __common_interrupt+0x4f/0xe0
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446628] common_interrupt+0x89/0xa0
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446632] </IRQ>
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446632] <TASK>
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446633] asm_common_interrupt+0x27/0x40
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446637] RIP: 0010:cpu_idle_poll.isra.0+0x33/0xd0
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446639] Code: 53 65 8b 15 47 9e 85 7d 66 90 e8 08 cd 39 ff fb 66 0f 1f 44 00 00 65 48 8b 1c 25 c0 fb 01 00 48 8b 03 a8 08 74 0b eb 1c f3 90 <48> 8b 03 a8 08 75 13 8b 05 00 c3 47 01 85 c0 75 ed e8 d7 68 3c ff
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446641] RSP: 0018:ffffffff83803e68 EFLAGS: 00000202
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446644] RAX: 0000000000000001 RBX: ffffffff8381b440 RCX: 0000000000000001
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446645] RDX: 000000005f5a1209 RSI: 0000000000000000 RDI: 000000005f5a120a
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446646] RBP: ffffffff83803e70 R08: 00048be10300d8c4 R09: 00000000000003ed
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446648] R10: 0000000000000001 R11: 0000000000000001 R12: ffffffff8381b440
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446649] R13: 0000000000000027 R14: 0000000000000000 R15: 0000000000000000
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446651] ? cpu_idle_poll.isra.0+0x18/0xd0
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446653] do_idle+0x45/0xf0
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446656] cpu_startup_entry+0x20/0x30
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446658] rest_init+0xd3/0x100
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446659] ? acpi_enable_subsystem+0x20b/0x217
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446664] arch_call_rest_init+0xe/0x23
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446669] start_kernel+0x4a9/0x4ca
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446671] x86_64_start_reservations+0x24/0x2a
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446673] x86_64_start_kernel+0xfb/0x106
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446675] secondary_startup_64_no_verify+0xc2/0xcb
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446680] </TASK>
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.446680] handlers:
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.447978] [<0000000043f4e91b>] vring_interrupt
Mar 23 02:21:40 prod-vrg-trade1 kernel: [1279698.450356] Disabling IRQ #41
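The "nobody cared" message means the kernel saw a large number of interrupts on IRQ 41 that no handler claimed, so it masked the line ("Disabling IRQ #41"); once that happens the virtio ring behind the vsock device stops making progress, which matches the hang. One way to spot a frozen line is to compare the per-IRQ totals in `/proc/interrupts` over time. The helper below just parses that file's format; the sample text (and the device labels in it) is illustrative, not taken from this host:

```python
def irq_counts(proc_interrupts: str) -> dict:
    """Map IRQ number -> total interrupt count across CPUs,
    given text in /proc/interrupts format."""
    counts = {}
    lines = proc_interrupts.strip().splitlines()
    ncpus = len(lines[0].split())          # header row: CPU0 CPU1 ...
    for line in lines[1:]:
        fields = line.split()
        irq = fields[0].rstrip(":")
        if not irq.isdigit():
            continue                       # skip NMI/LOC/etc. summary rows
        counts[int(irq)] = sum(int(f) for f in fields[1:1 + ncpus])
    return counts

sample = """\
           CPU0       CPU1
 41:     299701          0   PCI-MSI 81921-edge      nitro_enclaves
LOC:    1234567    2345678   Local timer interrupts
"""
print(irq_counts(sample))  # {41: 299701}
```

Sampling the real file twice (e.g. a few seconds apart) and diffing the counts shows whether the vring IRQ is still firing or has been masked.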
The only apparent way to get the system to run any enclaves again at this point was a complete reboot; the EIF file is unchanged, so the image itself is not the problem.

I suspected the issue was caused by either a high rate or a high volume of stdout/stderr, so my workaround was to disable debugging and build a log-shipping pipeline that uses a separate vsock connection to get meaningful data out of the enclave. This worked fine until recently, when we hit the exact same problem after a few weeks of uptime. It's worth noting that we still attach to the console as a means of blocking until the enclave exits; however, there is no output (as expected).

It does appear that at some point the enclave crashed (RC=4) after running OK for a while, after which it could no longer start due to the `ready` signal error. The enclaved application appears to have been hung from about 02:21+08:00 (when the kernel IRQ error occurred), with the enclave itself crashing at about 17:47+08:00.
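For anyone wanting to replicate the log-shipping workaround: the parent-side listener can be quite small. A minimal sketch, assuming the enclave writes 4-byte big-endian length-prefixed log lines to a vsock port of your choosing (the port number, framing, and function names here are my own choices, not anything nitro-cli mandates):

```python
import socket
import struct

LOG_PORT = 5005          # arbitrary vsock port; pick any unused one

def frame(line: bytes) -> bytes:
    """Prefix a log line with its 4-byte big-endian length."""
    return struct.pack(">I", len(line)) + line

def unframe(buf: bytes):
    """Extract complete log lines from a buffer; returns (lines, leftover)."""
    lines = []
    while len(buf) >= 4:
        (n,) = struct.unpack(">I", buf[:4])
        if len(buf) < 4 + n:
            break                          # partial line, wait for more data
        lines.append(buf[4:4 + n])
        buf = buf[4 + n:]
    return lines, buf

def serve():
    """Parent-side listener: accept one enclave connection, print its lines."""
    with socket.socket(socket.AF_VSOCK, socket.SOCK_STREAM) as srv:
        srv.bind((socket.VMADDR_CID_ANY, LOG_PORT))
        srv.listen(1)
        conn, _ = srv.accept()
        buf = b""
        while chunk := conn.recv(4096):
            lines, buf = unframe(buf + chunk)
            for line in lines:
                print(line.decode(errors="replace"))
```

On the parent you would run `serve()`; inside the enclave, the application connects out to the parent's CID (3, per the Nitro Enclaves docs) on the same port and writes `frame(...)`-encoded lines.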
This is a known issue on our side, and we are working on a fix. It is reproducible especially when one of the vsock directions is overloaded while the other is relatively unused.
Kernel version: 5.15.0-1031-aws
Nitro CLI version: 1.2.2
Instance type: m6i.2xlarge