1.9.2 upgrade results in wiped state #24411
This is absolutely appalling. Unless you call it alpha or beta, this is not acceptable. You may still be shocked by the outcome of the elections, but you need to pay closer attention, even if this is the community edition.
I can confirm the token issue (the same message appeared that @hynek posted).
Hey folks, I saw this issue pop up and did some quick testing to confirm and reproduce. It looks like this slipped past upgrade testing because it happens as part of the Raft snapshot restore -- so far I can't reproduce it unless there's a Raft snapshot in play (which is pretty much always the case for production clusters but isn't typical in short-lived test clusters). Apologies for this... that's a pretty bad miss. Obviously this is top priority. That being said, let's all keep in mind the HashiCorp Community Guidelines here.
@tgross I apologise, but this is the second time in a few months that something like this has happened. Early this morning I found the cluster wiped, I started my weekend working, and I was disoriented and frustrated, to say the least.
When we removed the time table in #24112 we introduced a bug where if a previous version of Nomad had written a time table entry, we'd return from the restore loop early and never load the rest of the FSM. This will result in a mostly or partially wiped state for that Nomad node, which would then be out of sync with its peers (which would also have the same problem on upgrade). The bug only occurs when the FSM is being restored from snapshot, which isn't the case if you test with a server that's only written Raft logs and not snapshotted them. While fixing this bug, we still need to ensure we're reading the time table entries even if we're throwing them away, so that we move the snapshot reader along to the next full entry. Fixes: #24411
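To make the failure mode concrete, here is a minimal Go sketch (not the actual Nomad source; the entry types and state handling are stand-ins) of a restore loop with this class of bug, and the fix:

```go
// Hypothetical sketch of the failure mode described above, not the
// actual Nomad source: entry kinds and the state store are stand-ins.
package main

import "fmt"

type snapType byte

const (
	timeTableSnapshot snapType = iota // written by older servers
	jobSnapshot                       // ...followed by the real state entries
)

type entry struct {
	kind    snapType
	payload []byte
}

// restore replays snapshot entries into the state store.
func restore(entries []entry) error {
	for _, e := range entries {
		switch e.kind {
		case timeTableSnapshot:
			// Buggy version: with the time table removed, the loop
			// returned here, so everything after this entry was never
			// loaded -- a mostly wiped state:
			//
			//   return nil
			//
			// Fixed version: decode and discard the entry anyway, so the
			// snapshot reader advances to the next full entry.
			continue
		case jobSnapshot:
			fmt.Printf("restored job entry (%d bytes)\n", len(e.payload))
		default:
			return fmt.Errorf("unknown snapshot entry type %d", e.kind)
		}
	}
	return nil
}

func main() {
	// A snapshot written by an older server: a time table entry first,
	// then the actual cluster state.
	_ = restore([]entry{
		{kind: timeTableSnapshot},
		{kind: jobSnapshot, payload: []byte("example job")},
	})
}
```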
Fix is up here: #24412. We'll get that reviewed on Monday and get it shipped out ASAP. Again, apologies for this serious bug. For what it's worth, we've recognized that upgrade testing has been our Achilles' heel for some time, and I'm helping lead one of the major projects for the 1.10 cycle to greatly improve the automation we can do around that. Thanks for your patience.
Is there any way to extract state after it's been lost? ACL policies and SSO configuration are painful enough the first time around and I'd prefer to not set them up from scratch again.
Thank you for the quick replies on a weekend @tgross, much appreciated. While I understand your frustration @maxadamo, please yell at your enterprise contacts if you are a paying customer; if not, let's try to be constructive and figure out ways to ensure this doesn't happen in the future (and to make this abundantly clear: I am not a HashiCorp employee, so yelling at me will not do anything ;)). With that being said, I think there are a few areas for improvement:
I'm very glad to hear that, because right now, out of the three 1.9 releases, two had bugs severe enough to take down a whole cluster. I guess I was lucky this time: my update script exercises the API before moving on to the clients, and since that failed, I rolled back before the wiping could commence.
@apollo13 I'm definitely not here to argue, or to bash anyone, and I am not an enterprise customer.
Hi @maxadamo, thanks, and sorry if my previous comments came across too aggressively; that wasn't my intention. I am not an enterprise customer either, so the following is mostly educated guessing.
I would love for that to be true, but I fear that is expecting too much. I'm not speaking about Nomad or HashiCorp specifically here, but unless a project is open source at its core, I do not think the "free" variant of it will ever receive the same attention. At the end of the day there is a business to run, and the attention the free project gets is based on a cost-versus-reward estimate (at least in my experience). And many businesses don't or can't calculate the benefits they get from a healthy community (which, if we are honest, is hard at best and impossible at worst to quantify).
Sure, but Kubernetes is actually open source; HashiCorp products are not open source in the same sense. In this specific case (again, not an enterprise customer, so only guessing) I think even enterprise customers would have gotten that fix with 1.9.2+ent and would have been wiping their clusters (or maybe currently still are), so I don't think enterprise would have been safe either. As for paywalls and beta testing, I think it depends on the product; many companies are doing exactly that. Take Proxmox (a hypervisor) as an example: they provide the full feature set for free (and open source), but you basically run from the tip/main branch. If you want stability, you have to pay for access to the enterprise repositories, where packages are pushed after a testing phase (in which the community participates by using the main branch). HashiCorp is different here in the sense that Nomad Enterprise offers additional features. I guess there are ups and downs to either approach.
That is not good; maybe this issue helps establish a process for yanking releases, because, as this shows, one is clearly needed.
Absolutely, and an issue like this has the potential to draw many eyes, so it is even more important that we all offer our ideas on how to improve :)
Sunday morning thoughts, in no particular order:
Any servers that haven't been upgraded will still have the state, as will any backups you've taken via `nomad operator snapshot save` (see the sketch below).
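For anyone who wants that safety net before future upgrades, here is a minimal sketch that shells out to the `nomad operator snapshot save` CLI; the file naming and environment assumptions are illustrative only:

```go
// Pre-upgrade backup sketch: shell out to the Nomad CLI to save a Raft
// snapshot before touching any servers. Assumes the nomad binary is on
// PATH and NOMAD_ADDR/NOMAD_TOKEN are set in the environment.
package main

import (
	"fmt"
	"log"
	"os/exec"
	"time"
)

func main() {
	// Timestamped output file, e.g. nomad-2024-11-16T10-00-00.snap
	out := fmt.Sprintf("nomad-%s.snap", time.Now().Format("2006-01-02T15-04-05"))

	// Equivalent to running: nomad operator snapshot save <file>
	cmd := exec.Command("nomad", "operator", "snapshot", "save", out)
	if output, err := cmd.CombinedOutput(); err != nil {
		log.Fatalf("snapshot save failed: %v\n%s", err, output)
	}
	log.Printf("saved %s (restore with: nomad operator snapshot restore %s)", out, out)
}
```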
I do want to make clear that we in no way consider the community involuntary beta testers for changes. (I also dug through the issue history here on GitHub and don't see anywhere that anyone might have given you that impression, but I could have missed something.) We run 1000s of compute-hours of integration testing and 100s of compute-hours of end-to-end testing a week, and virtually all of that is run against the current tip. Nomad Enterprise 1.9.2 also got the same backported bug fix and will have hit the same issues (I've got at least one internal report on this). Yet our tests still failed to detect this bug! The difference is that enterprise customers will not have done an unattended production upgrade on a Friday night, so while they'll be filing support tickets against their pre-prod environments, your banking app didn't have an outage.
As of right now, there's no process in place to yank releases. That was a surprise to me the last time this came up, but it's definitely going to be a topic of discussion in our internal retro on this. We don't own the release infrastructure, so that involves engaging our friends in another part of the organization.
Yeah. We normally try to release on Tuesday or Wednesday. In this case, we pulled the release schedule forward a few days to Friday to free up time for other team activities happening this coming week. But of course now we're going to be releasing anyways, so that really didn't help much, did it? Definitely a topic for the retro.
In this case we violated the backport policy in favor of shipping it to fewer builds. The time table change was a bug fix and we didn't backport it to the LTS version. We did so out of extra caution and not wanting to ship this change for the first time in the 1.10 LTS. This was an important bug fix reported by a community member running large-scale batch workloads and needing observability into their work that runs over weekends. Had this been reported by an Enterprise customer we definitely wouldn't have been able to get away with not backporting it to the LTS. At a higher level though, the big picture business policies for backports and LTS are defined at a corporate level. As a public company, individual product teams hear about policy changes at the same time y'all do. We've pushed back on this and other changes (like the licensing, which makes the backport policy more painful b/c it limits the ability for the community to maintain an independent backport release if they want). A couple of us have been loud enough about it to be veering into CLM territory. 😁 So not something I have much control over, but I do have a lot of sympathy.
Although this bug absolutely sucks, at least it gave me a chance to test my disaster recovery plan :) Either way, I just wanted to compliment @tgross on being transparent, honest, and critical of "your" (as in, HashiCorp's) way of working. That is absolutely appreciated, and it makes me feel less anxious about the next release. Thank you, and please keep this attitude going :)
I want to look at the bright side of things. Open discussions like this are only possible within a vibrant community, and @tgross is one of the most professional and kind developers I've encountered online.
Thank you for your active engagement @tgross. It is appreciated 😎
Would it be possible to link to or reference this issue in the v1.9.2 release on GitHub, for the time being? https://github.com/hashicorp/nomad/releases/tag/v1.9.2 That might help some (namely those who look at the release notes) make an informed decision on whether or not to upgrade.
Done.
We've merged #24412 and are in the process of validating the release branch so we can kick off the release.
Release binaries for 1.9.3 and 1.9.3+ent have been posted. Packages (deb/rpm/docker) take a little longer to move through the pipeline, so those will trickle out over the next couple hours. We also had a chat internally about yanking releases, and apparently there is now infrastructure to do that, but we on the Nomad team just didn't know about it or have a process to engage it. That's in motion now as well.
Upgraded my cluster to 1.9.3.
My 0.02: while upgrading OSS Nomad 1.9.1 to Nomad 1.9.2, I too was hit by this. This is a "brand new fear unlocked" scenario for me! 😨 My "verification workflow" is to run a series of checks while I am upgrading the servers Consul/Nomad is running on.
As "only the state was wiped" 🙄 but the cluster server quorum was fine, I kept updating the next server and the next and eventually ended with empty state! 😨 I am now thinking of writing a while/watch loop of counting the no. of parent jobs and watching that counter while upgrading the servers. Also dumping the list of jobs before kicking off an upgrade. Until 1.9.3 landed, I thought it was a case of "PEBKAC" and that I upgraded the servers too soon or something like that. |
What APIs were you using to confirm? |
Superstitious Observation: The |
I'm not using it to confirm anything; I just query |
The upgrade to Nomad 1.9.3 worked for me. If any of you are using Puppet, I am working on a solution to apply an idempotent configuration for Nomad variables and ACLs. It's already in the works: voxpupuli/puppet-nomad#87. That way I'll be able to restore the data. For the sake of clarity, since I mentioned that I had a similar issue a while back: I now recall it wasn't Nomad, it was Consul: hashicorp/consul#21336
Nomad version
1.9.2
Operating system and Environment details
Ubuntu Jammy
Issue
After an update of the servers from 1.9.1 to 1.9.2, our existing tokens are rejected as invalid, both in the UI and via the API. Downgrading to 1.9.1 fixes that.
Reproduction steps
Upgrade Nomad servers from 1.9.1 to 1.9.2.
Expected Result
Tokens still work.
Actual Result
Tokens don't work anymore.
Nomad Server logs (if appropriate)
There is a warning:
The rest is just raft chatter.
Nomad Client logs (if appropriate)
Clients are not involved.
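For anyone who wants to script the token check described above, here is a minimal sketch against the `/v1/acl/token/self` endpoint; the address and environment variable are assumptions:

```go
// Token check sketch: ask the server to resolve the caller's own ACL
// token. A 200 means the token resolved; a 403 matches the "token
// rejected" behaviour reported in this issue.
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	req, err := http.NewRequest("GET", "http://127.0.0.1:4646/v1/acl/token/self", nil)
	if err != nil {
		log.Fatal(err)
	}
	// The token to verify, taken from the standard NOMAD_TOKEN env var.
	req.Header.Set("X-Nomad-Token", os.Getenv("NOMAD_TOKEN"))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	log.Printf("status %d: %s", resp.StatusCode, body)
}
```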