server: failed migrations are reported as having completed before the source VM is fully resumed #811
Seen again in this run from #816: https://github.com/oxidecomputer/propolis/pull/816/checks?check_run_id=34568508831
Looking a little more closely at the logs from the most recent failure, I believe the "VM was asked to stop too early" explanation from above is incorrect. The real problem is that the source is rejecting the second target's request to migrate out because the source believes the first attempt to migrate out hasn't resolved yet:
The migration ID in the first message lines up with the migration IDs in the PHD logs. The reason this can happen is that when a migration fails, Propolis publishes two separate state updates. First, the migration runner publishes that the migration failed (while leaving the VM itself in the "Migrating" state):
Then it resumes the VM's devices and moves it back to "Running":
A new migration is only allowed to start after the second transition, but PHD treats the migration as resolved as soon as it sees the first. The control plane behaves similarly to PHD here (the instance update saga will retire a migration attempt and clear the instance's migration ID as soon as the migration itself is in a terminal state, even if the source instance hasn't moved back to the "running" state yet). That makes me lean a bit toward trying to fix this in Propolis (i.e., only publish the final transition to the "Error" migration state once the source VM has resumed and is ready to go back to "Running").
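To make the ordering concrete, here is a minimal Rust sketch of the race, assuming a simplified state-update channel; the type names and the channel are illustrative, not Propolis's actual API:

```rust
// Hypothetical, simplified model of the ordering problem described above; the
// enum, struct, and channel names are illustrative, not Propolis's real types.
use std::sync::mpsc;

#[derive(Debug, Clone, Copy, PartialEq)]
enum MigrationState {
    InProgress,
    Error,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum VmState {
    Migrating,
    Running,
}

#[derive(Debug, Clone, Copy)]
struct StateUpdate {
    migration: MigrationState,
    vm: VmState,
}

fn main() {
    let (tx, rx) = mpsc::channel::<StateUpdate>();

    // Current ordering: the terminal migration state is published first, while
    // the VM is still `Migrating`; only a later update reports `Running`.
    tx.send(StateUpdate { migration: MigrationState::InProgress, vm: VmState::Migrating }).unwrap();
    tx.send(StateUpdate { migration: MigrationState::Error, vm: VmState::Migrating }).unwrap();
    // ... the source's devices are resumed between these two publishes ...
    tx.send(StateUpdate { migration: MigrationState::Error, vm: VmState::Running }).unwrap();

    // An observer (PHD, or the control plane's instance-update saga) that
    // retires the migration as soon as it sees a terminal migration state acts
    // on the middle update, before the source is ready to migrate again.
    let first_terminal = rx
        .iter()
        .find(|u| u.migration == MigrationState::Error)
        .unwrap();
    assert_eq!(first_terminal.vm, VmState::Migrating); // source not yet resumed

    // The proposed fix: defer publishing `Error` until after the VM has
    // resumed, so the first terminal update already carries `vm == Running`.
}
```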
Note that this probably has very little impact on a production deployment:
Nevertheless, this flake is annoying and needs to be fixed.
Originally seen in #809; see excerpt below.
The PHD failure here is weird. The failure is in the `running_process::import_failure` test, which is what I'd expect this change to break if it broke anything. But the failure mode is odd: the first migration fails as expected, but then the source VM immediately gets a stop request, which it immediately handles, rendering it unable to serve as a migration source for the second migration:

We log state changes requested through the framework's VM state change functions, as well as attempts to stop a VM when it's being torn down, and I don't see anything especially suspicious preceding the VM stopping. My best guess is that there's a synchronization bug in `phd_framework::Framework::wait_for_cleanup_tasks` that's allowing a VM teardown task from a previous test to run and affect this one, but I haven't yet found any evidence of this. I also can't reproduce this locally, so I've queued a rerun to see if it happens again. Could probably use some more instrumentation in the PHD VM cleanup logic in any case.
Originally posted by @gjcolombo in #809 (comment)
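As a purely hypothetical illustration of the synchronization the quoted excerpt is worried about (this is not the actual `phd_framework` API), the harness needs to await every teardown task it spawned before the next test begins; a stray, un-awaited task is exactly the kind of thing that could stop a VM out from under a later test:

```rust
// Hypothetical sketch of cleanup-task synchronization; requires the `tokio`
// crate. Names like `CleanupTracker` are invented for illustration only.
use tokio::task::JoinHandle;

struct CleanupTracker {
    handles: Vec<JoinHandle<()>>,
}

impl CleanupTracker {
    fn new() -> Self {
        Self { handles: Vec::new() }
    }

    /// Record a spawned teardown task so it can be awaited later.
    fn track(&mut self, handle: JoinHandle<()>) {
        self.handles.push(handle);
    }

    /// Await every outstanding teardown task. Any task that is spawned but
    /// never tracked (or tracked after this runs) can outlive the test
    /// boundary, which is the sort of race suspected above.
    async fn wait_for_cleanup_tasks(&mut self) {
        for handle in self.handles.drain(..) {
            // Ignore join errors here for brevity; a real harness would log them.
            let _ = handle.await;
        }
    }
}

#[tokio::main]
async fn main() {
    let mut tracker = CleanupTracker::new();
    tracker.track(tokio::spawn(async {
        // Simulated VM teardown work.
    }));
    tracker.wait_for_cleanup_tasks().await;
    // Only after this point is it safe to start the next test.
}
```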