test failed in CI: helios/deploy CLI SendError #6771

Open
jgallagher opened this issue Oct 4, 2024 · 12 comments
Labels: Test Flake (Tests that work. Wait, no. Actually yes. Hang on. Something is broken.)

Comments

@jgallagher (Contributor)

This test failed on a CI run on "main":

https://github.com/oxidecomputer/omicron/runs/31088964323

Log showing the specific test failure:

https://buildomat.eng.oxide.computer/wg/0/details/01J9C1NYC3BNWDE017JZ1NT99K/A3qZeEFGXoXmvc5zGw6ECtKYko93V56qQ9uL79n9NYzLyoQK/01J9C1PBV7PARZAD3391DHDKRQ

Excerpt from the log showing the failure:

976	2024-10-04T15:58:09.854Z	+ /usr/oxide/oxide --resolve recovery.sys.oxide.test:443:10.151.2.166 --cacert /opt/oxide/sled-agent/pkg/initial-tls-cert.pem disk import --path debian-11-genericcloud-amd64.raw --disk debian11-boot --project images --description 'debian 11 cloud image from distros' --snapshot debian11-snapshot --image debian11 --image-description 'debian 11 original base image' --image-os debian --image-version 11
977	2024-10-04T15:59:47.203Z	sending chunk to thread failed with SendError { .. }
978	2024-10-04T15:59:48.324Z	channel closed
jgallagher added the Test Flake label on Oct 4, 2024
jgallagher changed the title from "test failed in CI: NAME_OF_TEST" to "test failed in CI: helios/deploy CLI SendError" on Oct 4, 2024
@davepacheco (Collaborator)

Figuring that kind of failure might result from a Nexus crash, I took a look at the Nexus logs. I don't see any crashes but I do see some related-looking request failures. All the import-related request log entries seem to be in this log:
https://buildomat.eng.oxide.computer/wg/0/artefact/01J9C1NYC3BNWDE017JZ1NT99K/A3qZeEFGXoXmvc5zGw6ECtKYko93V56qQ9uL79n9NYzLyoQK/01J9C1PBV7PARZAD3391DHDKRQ/01J9C46T7F5WYNKPWE02126CE3/oxide-nexus:default.log?format=x-bunyan

and I see a bunch of messages like this one:
https://buildomat.eng.oxide.computer/wg/0/artefact/01J9C1NYC3BNWDE017JZ1NT99K/A3qZeEFGXoXmvc5zGw6ECtKYko93V56qQ9uL79n9NYzLyoQK/01J9C1PBV7PARZAD3391DHDKRQ/01J9C46T7F5WYNKPWE02126CE3/oxide-nexus:default.log?format=x-bunyan#L35948

2024-10-04T15:59:47.315Z	INFO	nexus (dropshot_external): request completed
    error_message_external = cannot import blocks with a bulk write for disk in state ImportReady
    error_message_internal = cannot import blocks with a bulk write for disk in state ImportReady
    file = /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.12.0/src/server.rs:938
    latency_us = 451297
    local_addr = 172.30.2.5:443
    method = POST
    remote_addr = 10.151.2.100:52890
    req_id = e74f85b8-5e08-495e-a56f-2239902eeaab
    response_code = 400
    uri = /v1/disks/debian11-boot/bulk-write?project=images

and a couple where the state is Finalizing or Detached.

@jmpesp (Contributor) commented Oct 4, 2024

I'm unable to reproduce this now, of course, though I noticed that the oxide-v0.1.0 binary that is grabbed during the deploy job is quite old:

james@atrium ~ $ curl -sSfL -I http://catacomb.eng.oxide.computer:12346/oxide-v0.1.0
...
Last-Modified: Tue, 05 Dec 2023 21:41:37 GMT
...

@jmpesp (Contributor) commented Oct 4, 2024

I was finally able to reproduce this using the old CLI binary:

[00:00:16] [█████████████████████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░] 954.50 MiB/2.00 GiB (75.49 MiB/s, 14s)
sending chunk to thread failed with SendError { .. }
channel closed

The trick was to set a --timeout to something very small:

./oxide-v0.1.0 --timeout 2 ...

causing the .disk_bulk_write_import call to time out, which in turn causes the upload thread to drop the receiver. This makes me believe it's related to I/O speed being very slow for this setup, as the normal timeout without the explicit argument is probably the progenitor default of 15 seconds!

In contrast, the latest CLI responds to a timeout differently:

[00:00:17] [███████████████████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░] 907.50 MiB/2.00 GiB (52.85 MiB/s, 22s
Communication Error: error sending request for url (http://fancyfeast:8080/v1/disks/crucible-tester-duster-disk?project=myproj): operation timed out: sending chunk to thread failed: channel closed
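
For context, the SendError itself is ordinary Rust channel behavior: once the receiving half of a channel is dropped, every later send fails. Below is a minimal sketch of that failure mode using std::sync::mpsc and made-up names; the CLI's actual channel type and thread layout may differ (which would also explain the "SendError { .. }" Debug formatting in the log rather than std's "SendError(..)"):

use std::sync::mpsc;
use std::thread;
use std::time::Duration;

fn main() {
    let (tx, rx) = mpsc::channel::<Vec<u8>>();

    // Hypothetical upload worker: pretend its first bulk-write request hits
    // the client-side timeout, so it bails out, dropping `rx` on the way.
    let uploader = thread::spawn(move || {
        let _first_chunk = rx.recv();
        // "request timed out" -> return early, which drops the receiver
    });

    // Reader side: keeps producing 512 KiB chunks. Once the receiver is
    // gone, send() returns an Err carrying a SendError.
    for i in 0.. {
        if let Err(e) = tx.send(vec![0u8; 512 * 1024]) {
            eprintln!("sending chunk to thread failed with {:?}", e);
            break;
        }
        println!("sent chunk {i}");
        thread::sleep(Duration::from_millis(10));
    }

    uploader.join().unwrap();
}

In other words, anything that keeps the receiving side alive past a single slow request (such as a longer client-side timeout) avoids this error, which matches the --timeout 60 workaround reported later in this thread.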

@jmpesp (Contributor) commented Oct 4, 2024

This is almost certainly the case: Nexus reports that a client disconnected during a request to the bulk write endpoint:

nexus (dropshot_external): request handling cancelled (client disconnected)
    file = /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.12.0/src/server.rs:887
    latency_us = 1546327
    local_addr = 172.30.2.5:443
    method = POST
    remote_addr = 10.151.2.100:36568
    req_id = 6f78ea08-7945-43f1-8574-54f7b2b86ab8
    uri = /v1/disks/debian11-boot/bulk-write?project=images

@iliana (Contributor) commented Oct 8, 2024

Do we think updating the CLI binary is useful here, or will it similarly time out?
And/or should we increase the timeout to handle the CI setup not having Gimlet-grade SSDs?

@jmpesp (Contributor) commented Oct 8, 2024

> Do we think updating the CLI binary is useful here, or will it similarly time out?

We probably should update it so we're testing the latest binary in general, but I suspect it will similarly time out.

> should we increase the timeout to handle the CI setup not having Gimlet-grade SSDs?

I'm not sure: .github/buildomat/jobs/deploy.sh says that the target is lab-2.0-opte-0.33; doesn't that mean this is running on our hardware? Those machines shouldn't have disks so bad that it takes 15 seconds to write 512 KiB (!).

If I'm misreading this and it's running on AWS, are these t2.nano instances or something haha? Again, 15 seconds to write 512 KiB is huge, unless it's something like we don't have TRIM enabled and there's a big reclaim going on?

@iliana (Contributor) commented Oct 8, 2024

It is running on lab hardware; I'm just not sure what the SSDs in the machine are. (Also keep in mind that the ZFS pools used by the deployed control plane are file-backed.)

@leftwo (Contributor) commented Oct 9, 2024

I'm using stock omicron bits on my bench gimlet and I can get this to happen 100% of the time:

EVT22200005 # oxide disk import --project alan --description "test2" --path /alan/cli/jammy-server-cloudimg-amd64.raw --disk test2
[00:00:55] [████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░] 377.50 MiB/2.20 GiB (3.53 MiB/s, 9m)sending chunk to thread failed with SendError { .. }
[00:00:55] [████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░] 378.50 MiB/2.20 GiB (3.53 MiB/s, 9m)
channel closed

I'm on omicron commit: 0640bb2
My bench gimlet is:

EVT22200005 # uname -a
SunOS EVT22200005 5.11 helios-2.0.22861 oxide i386 oxide

And, the cli:

EVT22200005 # oxide version
Oxide CLI 0.6.1+20240710.0
Built from commit: 4a896d13efa760c0e9120e5d82fb2b6b7f5a27da 
Oxide API: 20240821.0

I added a --timeout 60 to my import line, and I did get an upload to work.

@leftwo (Contributor) commented Oct 9, 2024

I updated my bench gimlet to:

EVT22200005 # uname -a
SunOS EVT22200005 5.11 helios-2.0.22957 oxide i386 oxide

And, it still fails the same way.

@andrewjstone (Contributor)

I'm unclear if this is the same error, but it seems like it might be related:
https://github.com/oxidecomputer/omicron/pull/6810/checks?check_run_id=31332921093

702	2024-10-10T05:40:53.995Z	+ /usr/oxide/oxide --resolve recovery.sys.oxide.test:443:10.151.2.166 --cacert /opt/oxide/sled-agent/pkg/initial-tls-cert.pem disk import --path debian-11-genericcloud-amd64.raw --disk debian11-boot --project images --description 'debian 11 cloud image from distros' --snapshot debian11-snapshot --image debian11 --image-description 'debian 11 original base image' --image-os debian --image-version 11
1703	2024-10-10T05:44:03.700Z	one of the upload threads failed
1704	2024-10-10T05:44:04.618Z	one of the upload threads failed

@jgallagher (Contributor, Author)

I saw the same variant Andrew did on https://buildomat.eng.oxide.computer/wg/0/details/01J9Y2BWQWFB0HPRRFCWC6GHAN/HzxHMKpHk2lzsmDBBvStwfjrwTZfnljEjTH4TLjd2l3KTopv/01J9Y2CEDVND9FA5DK9G0AX5FY. That run is on #6822, which does modify the pantry code path, but I think the modifications it makes (changing how Nexus chooses a pantry) would result in a different kind of failure if they were wrong.

871     2024-10-11T16:02:24.918Z        + /usr/oxide/oxide --resolve recovery.sys.oxide.test:443:10.151.1.166 --cacert /opt/oxide/sled-agent/pkg/initial-tls-cert.pem disk import --path debian-11-genericcloud-amd64.raw --disk debian11-boot --project images --description 'debian 11 cloud image from distros' --snapshot debian11-snapshot --image debian11 --image-description 'debian 11 original base image' --image-os debian --image-version 11
872     2024-10-11T16:05:33.357Z        one of the upload threads failed
873     2024-10-11T16:05:34.447Z        one of the upload threads failed
