1.9.2 upgrade results in wiped state #24411

Closed
hynek opened this issue Nov 9, 2024 · 23 comments · Fixed by #24412

@hynek
Contributor

hynek commented Nov 9, 2024

Nomad version

1.9.2

Operating system and Environment details

Ubuntu Jammy

Issue

After updating the servers from 1.9.1 to 1.9.2, our existing tokens are rejected as invalid, both in the UI and via the API. Downgrading to 1.9.1 fixes it.

Reproduction steps

Upgrade Nomad servers from 1.9.1 to 1.9.2.

Expected Result

Tokens still work.

Actual Result

Tokens don't work anymore.

Nomad Server logs (if appropriate)

There is a warning:

Setting a Vault token in the agent configuration is deprecated and will be removed in Nomad 1.9. Migrate your Vault configuration to use workload identity

The rest is just Raft chatter.

Nomad Client logs (if appropriate)

Clients are not involved.

@hynek hynek added the type/bug label Nov 9, 2024
@maxadamo

maxadamo commented Nov 9, 2024

This is absolutely appalling.
It's not only about tokens: the Nomad cluster turned into a blank slate.
Blank slate means everything was gone: all the jobs gone (*), all variables deleted, namespaces destroyed. I had to bootstrap and reapply ACLs, recreate namespaces, and re-run all the jobs.

Unless you call it alpha or beta, this is not acceptable. You may still be shocked by the outcome of the elections, but you need to pay closer attention, even if this is the community edition.

(*) docker ps was showing that the containers had been killed.

@allroundtechie

I can confirm the issue with the tokens (the same message appeared as @hynek posted).
The jobs were not all gone; a few "survived" the update to 1.9.2.

@tgross tgross changed the title from "1.9.2 stopped accepting existing tokens" to "1.9.2 upgrade results in wiped state" Nov 9, 2024
@tgross tgross pinned this issue Nov 9, 2024
@tgross tgross self-assigned this Nov 9, 2024
@tgross tgross added this to the 1.9.x milestone Nov 9, 2024
@tgross
Member

tgross commented Nov 9, 2024

Hey folks, I saw this issue pop up and did some quick testing to confirm and reproduce. It looks like this slipped past upgrade testing because it happens as part of the Raft snapshot restore -- so far I can't reproduce it unless there's a Raft snapshot in play (which is pretty much always the case for production clusters but isn't typical in short-lived test clusters). Apologies for this... that's a pretty bad miss. Obviously this is top priority.

That being said, let's all keep in mind the HashiCorp Community Guidelines here.

@maxadamo

maxadamo commented Nov 9, 2024

@tgross I apologise, but it's the second time in a few months that something like this has happened. Early this morning I found the cluster wiped, I started my weekend working, and I was, to say the least, disoriented and frustrated.

tgross added a commit that referenced this issue Nov 9, 2024
When we removed the time table in #24112 we introduced a bug where if a previous
version of Nomad had written a time table entry, we'd return from the restore
loop early and never load the rest of the FSM. This will result in a mostly or
partially wiped state for that Nomad node, which would then be out of sync with
its peers (which would also have the same problem on upgrade).

The bug only occurs when the FSM is being restored from snapshot, which isn't
the case if you test with a server that's only written Raft logs and not
snapshotted them.

While fixing this bug, we still need to ensure we're reading the time table
entries even if we're throwing them away, so that we move the snapshot reader
along to the next full entry.

Fixes: #24411
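
For anyone curious about the failure mode, below is a minimal Go sketch of the control flow described in that commit message. The names (entryType, timeTableEntry, restore) are hypothetical stand-ins, not Nomad's actual FSM code; the point is only that an early return on a deprecated entry type abandons everything that follows it in the snapshot, while the fix consumes and discards the entry and keeps going.

package main

import "fmt"

// entryType tags each record in a snapshot stream (hypothetical names,
// not Nomad's actual internals).
type entryType int

const (
    timeTableEntry entryType = iota // deprecated record type
    jobEntry
    aclTokenEntry
)

type entry struct {
    Type    entryType
    Payload string
}

// restore replays snapshot entries into state. The bug pattern: returning
// as soon as a deprecated entry type is seen, which silently drops every
// entry that follows it in the snapshot.
func restore(entries []entry, buggy bool) []string {
    var state []string
    for _, e := range entries {
        switch e.Type {
        case timeTableEntry:
            if buggy {
                // Early return: the rest of the snapshot is never loaded.
                return state
            }
            // Fixed: consume the deprecated entry and keep going, so the
            // reader stays aligned with the next full record.
            continue
        default:
            state = append(state, e.Payload)
        }
    }
    return state
}

func main() {
    snap := []entry{
        {timeTableEntry, "written by an older Nomad"},
        {jobEntry, "job: web"},
        {aclTokenEntry, "token: bootstrap"},
    }
    fmt.Println("buggy restore:", restore(snap, true))  // empty -- state looks wiped
    fmt.Println("fixed restore:", restore(snap, false)) // full state loaded
}

Running it prints an empty state for the buggy path and the full state for the fixed one, which matches the "mostly or partially wiped state" symptom reported here.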
@tgross
Member

tgross commented Nov 9, 2024

Fix is up here: #24412. We'll get that reviewed on Monday and get it shipped out ASAP.

Again, apologies for this serious bug. For what it's worth, we've recognized that upgrade testing has been our Achilles' heel for some time, and I'm helping lead one of the major projects for the 1.10 cycle to greatly improve the automation we can do around that. Thanks for your patience.

@jinnatar

Is there any way to extract state after it's been lost? ACL policies and SSO configuration are painful enough the first time around and I'd prefer not to set them up from scratch again.

@apollo13
Contributor

Thank you for the quick replies on a weekend @tgross, much appreciated. While I understand your frustration @maxadamo, please yell at your enterprise contacts if you are a paying customer; if not, let's try to be constructive and figure out ways to make sure this doesn't happen in the future (and to make this abundantly clear: I am not a HashiCorp employee, so yelling at me will not do anything ;)).

With that being said, I think there are a few areas for improvement:

  • 1.9.2 should have been yanked (I realize it is the weekend, but HashiCorp can probably mobilize enough people for an issue like this); in the worst case, users are upgrading over the weekend and breaking even more clusters. I understand that this is also a security release, which makes it a bit of a pain -- which brings me to my next point.
  • Don't ever release on Friday unless it is a security-relevant release that is being actively exploited. I cannot stress this enough: since we started following this policy for Django there has been much less yelling, because when we release on a Monday, people deploy on Tuesday and get to fix stuff on Wednesday -- no weekend broken. As a side note, the same should apply to major holidays (Christmas Eve, Chinese New Year, etc. -- think global, not just US).
  • Revisit the backport policy. I'd argue that the time table commit should maybe not have been backported in the first place and instead deferred to 1.10. In general I'd be stricter about backports and only allow a) docs backports, b) backports of (critical) bug fixes, and c) feature backports only if the feature is off by default and doesn't touch too much code. On the upside, this would also mean the commit would be part of a beta release, with the hope that more users test it -- I am pretty sure not many people test a security release that thoroughly, especially not one released on a Friday when they want to go home.

@hynek
Contributor Author

hynek commented Nov 10, 2024

Again, apologies for this serious bug. For what it's worth, we've recognized that upgrade testing has been our Achilles' heel for some time, and I'm helping lead one of the major projects for the 1.10 cycle to greatly improve the automation we can do around that.

I’m very glad to hear that because right now, out of the three 1.9 releases, two had taking-whole-cluster-down-level bugs.

I guess I was lucky this time because my update script uses the API before moving on to the clients and since that failed, I rolled back before the wiping could commence.

@maxadamo

maxadamo commented Nov 10, 2024

While I understand your frustration @maxadamo, please yell at your enterprise contacts if you are a paying customer; if not, let's try to be constructive and figure out ways to make sure this doesn't happen in the future (and to make this abundantly clear: I am not a HashiCorp employee, so yelling at me will not do anything ;)).

@apollo13 I'm definitely not here to argue or to bash anyone, and I am not an enterprise customer.
I was a bit frustrated, I apologized, I can apologize again and again, and I meant to say that community projects deserve special, or at least equal, attention. What is Kubernetes if not an open-source, community project? An assertion that only Enterprise products are safe to use has no place in the open-source world. We have paywalls to differentiate paying customers from others, but a paywall does not mean beta testing.
Having said that, from memory (I still can't find it in my email), a few months ago something similar happened. I asked the developer to delete the release that had been pushed to the repositories, because it was equally dangerous. He did not want to listen, and I was left with the impression that users were perceived as guinea pigs, or beta testers, whose clusters could break at any time.
My comment is still very wrong, but I hope it can be seen as a spur to go further and make it better ... always better than Kubernetes 😉

  • BTW, your proposals look good to me.

@apollo13
Contributor

Hi @maxadamo, thanks, and sorry if my previous comment came across as too aggressive; that wasn't my intention. I am not an enterprise customer either, so the following is mostly educated guessing.

and I meant to say that community projects deserve special, or at least equal, attention.

I would love for that to be true, but I fear that is expecting too much. Not speaking about Nomad or HashiCorp specifically here, but unless a project is open source at its core, I do not think the "free" variant of it will ever receive the same attention. At the end of the day there is a business to run, and the attention the free project gets is based on a cost-vs-reward estimate (at least in my experience). And many businesses don't or can't calculate the benefits they get from a healthy community (which is hard at best, or impossible, to quantify if we want to be honest).

What is Kubernetes if not an open-source, community project? An assertion that only Enterprise products are safe to use has no place in the open-source world. We have paywalls to differentiate paying customers from others, but a paywall does not mean beta testing.

Sure, but Kubernetes is actually open source; HashiCorp products are not open source in the same sense. In this specific case (again, not an enterprise customer, so only guessing) I think even enterprise customers would have gotten that fix with 1.9.2+ent and would have been wiping their clusters (or maybe currently still are), so I don't think Enterprise would have been safe to use either. As for paywalls and beta testing, I think it depends on the product; many companies are doing exactly that. Take Proxmox (a hypervisor) as an example: they provide the full feature set for free (and open source), but you basically run from the tip/main branch. If you want stability, you have to pay for access to the enterprise repositories, where packages are pushed after a testing phase (in which the community participates by using the main branch). HashiCorp is different here in the sense that Nomad Enterprise offers additional features. I guess there are ups and downs to either approach.

Having said that, from memory (I still can't find it in my email), a few months ago something similar happened. I asked the developer to delete the release that had been pushed to the repositories, because it was equally dangerous. He did not want to listen, and I was left with the impression that users were perceived as guinea pigs, or beta testers, whose clusters could break at any time.

That is not good. Maybe this issue helps establish a process for yanking releases, because as this shows, one is clearly needed.

My comment is still very wrong, but I hope it can be seen as a spur to go further and make it better ... always better than Kubernetes 😉

Absolutely, and an issue like this has the potential to draw many eyes, so it is even more important that we all offer our ideas on how to improve :)

@tgross
Member

tgross commented Nov 10, 2024

Sunday morning thoughts, in no particular order:

Is there any way to extract state after it's been lost?

Any servers that haven't been upgraded will still have the state, as will any backups you've taken via nomad operator snapshot save. Servers that have been upgraded might still have the state on disk, if they haven't re-snapshotted; in that case you can roll back the binary and it'll all come back on its own. I haven't been able to verify that yet.
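
For anyone scripting those backups, here is a minimal Go sketch that saves a snapshot over the HTTP API; it assumes the /v1/operator/snapshot endpoint (the API behind that CLI command) and the NOMAD_ADDR/NOMAD_TOKEN environment variables, and keeps error handling to a minimum.

package main

import (
    "fmt"
    "io"
    "net/http"
    "os"
)

// saveSnapshot streams a Raft snapshot from a Nomad server to a local file.
// It assumes the /v1/operator/snapshot endpoint and, if ACLs are enabled,
// a management token in NOMAD_TOKEN.
func saveSnapshot(addr, path string) error {
    req, err := http.NewRequest(http.MethodGet, addr+"/v1/operator/snapshot", nil)
    if err != nil {
        return err
    }
    if token := os.Getenv("NOMAD_TOKEN"); token != "" {
        req.Header.Set("X-Nomad-Token", token)
    }

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return fmt.Errorf("unexpected status: %s", resp.Status)
    }

    out, err := os.Create(path)
    if err != nil {
        return err
    }
    defer out.Close()

    _, err = io.Copy(out, resp.Body)
    return err
}

func main() {
    addr := os.Getenv("NOMAD_ADDR")
    if addr == "" {
        addr = "http://127.0.0.1:4646"
    }
    if err := saveSnapshot(addr, "nomad-backup.snap"); err != nil {
        fmt.Fprintln(os.Stderr, "snapshot save failed:", err)
        os.Exit(1)
    }
    fmt.Println("snapshot written to nomad-backup.snap")
}

Treat it as a starting point for pre-upgrade backups, not a hardened backup tool.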

paywall does not mean beta testing

I do want to make clear that we in no way consider the community involuntary beta testers for changes. (I also dug through the issue history here on GitHub and don't see anywhere that anyone might have given you that impression, but I could have missed something.) We run 1000s of compute-hours of integration testing and 100s of compute-hours of end-to-end testing a week, and virtually all of that is run against the current tip. Nomad Enterprise 1.9.2 also got the same backported bug fix, and will have hit the same issues (I've got at least one internal report on this). Yet our tests still failed to detect this bug! The difference is that enterprise customers will not have done an unattended production upgrade on a Friday night, so while they'll be filing support tickets against their pre-prod environments, your banking app didn't have an outage.

1.9.2 should have been yanked

As of right now, there's no process in place to yank releases. That was a surprise to me the last time this came up, but it's definitely going to be a topic of discussion in our internal retro on this. We don't own the release infrastructure, so that involves engaging our friends in another part of the organization.

Don't ever release on Friday

Yeah. We normally try to release on Tuesday or Wednesday. In this case, we pulled the release schedule forward a few days to Friday to free up time for other team activities happening this coming week. But of course now we're going to be releasing anyways, so that really didn't help much, did it? Definitely a topic for the retro.

Revisit the backport policy

In this case we actually deviated from the backport policy in favor of shipping the change to fewer builds: the time table change was a bug fix, but we didn't backport it to the LTS version. We did so out of extra caution, not wanting to ship this change for the first time in the 1.10 LTS. This was an important bug fix reported by a community member running large-scale batch workloads and needing observability into work that runs over weekends. Had this been reported by an Enterprise customer, we definitely wouldn't have been able to get away with not backporting it to the LTS.

At a higher level though, the big-picture business policies for backports and LTS are defined at a corporate level. Because we're a public company, individual product teams hear about policy changes at the same time y'all do. We've pushed back on this and other changes (like the licensing, which makes the backport policy more painful because it limits the community's ability to maintain an independent backport release if they want). A couple of us have been loud enough about it to be veering into CLM territory. 😁 So it's not something I have much control over, but I do have a lot of sympathy.

@TrueBrain
Contributor

Although this bug absolutely sucks, at least it gave me a chance to test my disaster recovery plan :)

Either way, I just wanted to compliment @tgross on being transparent, honest, and critical of "your" (as in, HashiCorp's) way of working. That is absolutely appreciated, and makes me feel less anxious about the next release. Thank you, and please keep this attitude going :)

@maxadamo

I want to look at the bright side of things. Open discussions like this are only possible within a vibrant community, and @tgross is one of the most professional and kind developers I’ve encountered online.

@EtienneBruines
Contributor

Thank you for your active engagement @tgross. It is appreciated 😎

As of right now, there's no process in place to yank releases. We don't own the release infrastructure, so that involves engaging our friends in another part of the organization.

Would it be possible to link to or reference this issue in the v1.9.2 release in GitHub, for the time being? https://github.com/hashicorp/nomad/releases/tag/v1.9.2

That might help some (namely those that look at the release notes) make an informed decision on whether or not to upgrade.

@tgross
Member

tgross commented Nov 11, 2024

Would it be possible to link to or reference this issue in the v1.9.2 release in GitHub, for the time being?

Done.

tgross added a commit that referenced this issue Nov 11, 2024
@tgross
Member

tgross commented Nov 11, 2024

We've merged #24412 and are in the process of validating the release branch so we can kick off the release.

@tgross
Member

tgross commented Nov 11, 2024

Release binaries for 1.9.3 and 1.9.3+ent have been posted. Packages (deb/rpm/docker) take a little longer to move through the pipeline, so those will trickle out over the next couple hours.

We also had a chat internally about yanking releases, and apparently there is now infrastructure to do that; we on the Nomad team just didn't know about it or have a process to engage it. That's in motion now as well.

@lopcode

lopcode commented Nov 11, 2024

Upgraded my cluster to 1.9.3 successfully, having previously had to restore with 1.9.2 - thanks 🤞

@shantanugadgil
Contributor

My 0.02:

While upgrading OSS Nomad 1.9.1 to Nomad 1.9.2 I too was hit by this.
(I do not have any ACLs)

This is a "brand new fear unlocked" scenario for me! 😨

My "verification workflow" is to run the following while I am upgrading the servers Consul/Nomad is

consul_watch()
{
    watch -n 3 "consul operator raft list-peers; echo; consul members -wan"
}

nomad_watch()
{
    watch -n 3 "nomad operator raft list-peers; echo; nomad server members"
}

As "only the state was wiped" 🙄 but the cluster server quorum was fine, I kept updating the next server and the next and eventually ended with empty state! 😨

I am now thinking of writing a while/watch loop that counts the number of parent jobs and watches that counter while upgrading the servers (something like the sketch below).

Also dumping the list of jobs before kicking off an upgrade.

Until 1.9.3 landed, I thought it was a case of "PEBKAC" and that I had upgraded the servers too soon or something like that.
(in spite of doing the same steps that I do every time)
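
A minimal Go sketch of that kind of watch loop, polling the /v1/jobs endpoint and printing the job count every few seconds (ACL token handling and retries are omitted; treat it as a starting point):

package main

import (
    "encoding/json"
    "fmt"
    "net/http"
    "os"
    "time"
)

// countJobs returns how many jobs the cluster currently reports via the
// /v1/jobs endpoint (ACL token support omitted for brevity).
func countJobs(addr string) (int, error) {
    resp, err := http.Get(addr + "/v1/jobs")
    if err != nil {
        return 0, err
    }
    defer resp.Body.Close()

    var jobs []map[string]interface{}
    if err := json.NewDecoder(resp.Body).Decode(&jobs); err != nil {
        return 0, err
    }
    return len(jobs), nil
}

func main() {
    addr := os.Getenv("NOMAD_ADDR")
    if addr == "" {
        addr = "http://127.0.0.1:4646"
    }
    // Poll every 3 seconds; a sudden drop to zero during a server upgrade
    // is the signal that state has gone missing.
    for {
        n, err := countJobs(addr)
        if err != nil {
            fmt.Println("error:", err)
        } else {
            fmt.Printf("%s  jobs: %d\n", time.Now().Format(time.RFC3339), n)
        }
        time.Sleep(3 * time.Second)
    }
}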

@shantanugadgil
Contributor

@hynek

I guess I was lucky this time because my update script uses the API before moving on to the clients and since that failed, I rolled back before the wiping could commence.

What APIs were you using to confirm?

@shantanugadgil
Contributor

Superstitious observation: the .9 releases are probably jinxed. The 0.9.x RC releases had weird issues like reporting double service names, etc.

@hynek
Contributor Author

hynek commented Nov 12, 2024

I guess I was lucky this time because my update script uses the API before moving on to the clients and since that failed, I rolled back before the wiping could commence.

What APIs were you using to confirm?

I'm not using it to confirm anything; I just query agent/members while waiting for the updated servers to come up with the correct version.
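
For reference, a minimal Go sketch of that check, polling /v1/agent/members and printing each gossip member with its reported build; it assumes the serf "build" tag carries the server version and omits ACL token handling:

package main

import (
    "encoding/json"
    "fmt"
    "net/http"
    "os"
)

// membersResponse models only the fields we need from /v1/agent/members;
// the "build" tag is assumed to carry the server's Nomad version.
type membersResponse struct {
    Members []struct {
        Name string            `json:"Name"`
        Tags map[string]string `json:"Tags"`
    } `json:"Members"`
}

func main() {
    addr := os.Getenv("NOMAD_ADDR")
    if addr == "" {
        addr = "http://127.0.0.1:4646"
    }

    resp, err := http.Get(addr + "/v1/agent/members")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    defer resp.Body.Close()

    var members membersResponse
    if err := json.NewDecoder(resp.Body).Decode(&members); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }

    // Print each member with its reported build, so you can wait until
    // every server shows the expected version before moving on.
    for _, m := range members.Members {
        fmt.Printf("%-30s %s\n", m.Name, m.Tags["build"])
    }
}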

@maxadamo

maxadamo commented Nov 12, 2024

The upgrade to Nomad 1.9.3 worked for me.

If any of you are using Puppet, I am working on a solution to apply an idempotent configuration for Nomad variables and ACLs. It's already in the works: voxpupuli/puppet-nomad#87. That way I'll be able to restore the data.

For the sake of clarity, since I mentioned that a while back I had a similar issue: I now recall that it wasn't Nomad but Consul: hashicorp/consul#21336
Consul started replying to any tag. For instance, if I had a record primary.postgres.service.consul, even the standby nodes were being associated with the same record.
Same situation: I asked if he could yank that release, but as you can see, I didn't receive any feedback, and that created more problems, putting the organization in a bad light and letting everyone believe that the organization doesn't care too much about its users.
Now I see things in a different light, and that is very good; it reinforces trust in the organization.
