test failed in CI: helios/deploy CLI SendError
#6771
Comments
SendError
Figuring that kind of failure might result from a Nexus crash, I took a look at the Nexus logs. I don't see any crashes but I do see some related-looking request failures. All the import-related request log entries seem to be in this log: and I see a bunch of messages like this one:
and a couple where the state is |
I'm unable to reproduce this now, of course, though I noticed that the
|
I was finally able to reproduce this using the old CLI binary:
The trick was to set a
causing the In contrast, the latest CLI responds to a timeout differently:
|
This is almost certainly the case: Nexus reports that a client disconnected during a request to the bulk write endpoint:
|
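To illustrate the mechanism discussed above (a client-side timeout firing while the server is still busy with a bulk write, which the server then observes as a client disconnect), here is a minimal, self-contained simulation. It is only a sketch: it does not use the real CLI or Nexus code, and the names `slow_bulk_write`, `upload_chunk`, and `simulate`, as well as the durations, are hypothetical.

```python
import asyncio


async def slow_bulk_write(chunk: bytes, write_seconds: float, log: list) -> int:
    """Hypothetical stand-in for a server-side bulk write handler."""
    try:
        # Pretend the backing storage takes `write_seconds` to accept the chunk.
        await asyncio.sleep(write_seconds)
        log.append("write completed")
        return len(chunk)
    except asyncio.CancelledError:
        # The caller gave up: from the handler's side this looks like a disconnect.
        log.append("client disconnected")
        raise


async def upload_chunk(chunk: bytes, write_seconds: float,
                       timeout_seconds: float, log: list) -> str:
    """Hypothetical client call with a per-request timeout."""
    try:
        await asyncio.wait_for(slow_bulk_write(chunk, write_seconds, log),
                               timeout_seconds)
        return "ok"
    except asyncio.TimeoutError:
        return "timed out"


def simulate(write_seconds: float, timeout_seconds: float):
    """Run one simulated upload and return (client outcome, server-side log)."""
    log = []
    outcome = asyncio.run(
        upload_chunk(b"\0" * 1024, write_seconds, timeout_seconds, log)
    )
    return outcome, log


if __name__ == "__main__":
    print(simulate(write_seconds=0.2, timeout_seconds=0.05))
    print(simulate(write_seconds=0.01, timeout_seconds=0.5))
```

When the simulated write is slower than the timeout, the client reports a timeout and the handler records a disconnect. A newer client binary with the same timeout would hit the same path, which is consistent with the suspicion later in this thread that it would similarly time out.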
Do we think updating the CLI binary is useful here, or will it similarly time out? |
We probably should update it so we're testing the latest binary in general, but I suspect it will similarly time out.
I'm not sure: If I'm misreading this and it's running on AWS, are these t2.nano instances or something haha? Again, 15 seconds to write 512kb is huge, unless it's something like we don't have TRIM enabled and there's a big reclaim going on? |
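For a sense of scale, the figure quoted above (15 seconds to write 512kb) can be turned into an effective throughput. This is a rough back-of-the-envelope sketch that assumes "512kb" means 512 KiB (524,288 bytes); the helper name `throughput_kib_per_s` is hypothetical.

```python
def throughput_kib_per_s(num_bytes: int, seconds: float) -> float:
    """Effective throughput in KiB/s for a write of num_bytes taking `seconds`."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return num_bytes / 1024 / seconds


if __name__ == "__main__":
    chunk_bytes = 512 * 1024  # assumption: "512kb" means 512 KiB
    rate = throughput_kib_per_s(chunk_bytes, 15.0)
    print(f"{rate:.1f} KiB/s")  # prints "34.1 KiB/s"
```

Roughly 34 KiB/s is far below what any SSD should sustain, which is why the write latency itself looks suspicious here rather than the size of the request.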
It is running on lab hardware, I'm just not sure what the SSDs in the machine are. (Also keep in mind that the ZFS pools used by the deployed control plane are file-backed.) |
I'm using stock omicron bits on my bench gimlet and I can get this to happen 100% of the time:
I'm on omicron commit: 0640bb2
And, the cli:
I added a |
I got another one of these: https://github.com/oxidecomputer/omicron/pull/6810/checks?check_run_id=31321074093 |
I updated my bench gimlet to:
And, it still fails the same way. |
I'm unclear if this is the same error, but it seems like it might be related:
|
I saw the same variant Andrew did on https://buildomat.eng.oxide.computer/wg/0/details/01J9Y2BWQWFB0HPRRFCWC6GHAN/HzxHMKpHk2lzsmDBBvStwfjrwTZfnljEjTH4TLjd2l3KTopv/01J9Y2CEDVND9FA5DK9G0AX5FY. That run is on #6822 which does modify the pantry code path, but I think the modifications it makes (changing how Nexus chooses a pantry) would result in a different kind of failure if they were wrong.
|
This test failed on a CI run on "main":
https://github.com/oxidecomputer/omicron/runs/31088964323
Log showing the specific test failure:
https://buildomat.eng.oxide.computer/wg/0/details/01J9C1NYC3BNWDE017JZ1NT99K/A3qZeEFGXoXmvc5zGw6ECtKYko93V56qQ9uL79n9NYzLyoQK/01J9C1PBV7PARZAD3391DHDKRQ
Excerpt from the log showing the failure: