Invalid Vault token (403) in a Nomad client after recycling Nomad servers #24256
Summarizing our internal discussion so far:
We noticed that Nomad also reaches out to newly created Vault servers when they are still joining the cluster and aren't ready for requests.
Can you clarify what "they" is here? Are you saying Nomad clients/servers(?) aren't ready for requests, or the Vault servers aren't ready for requests?
Sorry Tim, I shouldn't have mentioned Vault here, it's adding confusion. What we noticed is that Nomad servers (and this is happening to our Vault servers too, but it's a different issue) will join Consul and report themselves there before they are ready, and the nomad-clients are able to talk to this new server. Does this make sense @tgross?
@tgross is there anything else we can do externally to avoid issues?
Unfortunately, even if you could get the Consul health check to work as you expect, that wouldn't help here. Consul is only used for discovery on client start, or if the client loses all servers somehow and has to start over. Once a client is connected to the cluster, it gets the list of servers from the server's response to heartbeats, not from Consul. That list consists of the local Raft peers. The client periodically reshuffles its copy of the list (every 5m) to spread load.

Something that comes to mind in terms of fixing this, and that might be smaller in scope than reworking server bring-up, is to have the list of servers we return to the client be not just the local peers but only those that autopilot says are ready. That would need some investigation to verify feasibility. But in any case, short of net-splitting the new servers when they come up, no, there's no workaround currently. Using Workload Identity will help specifically for Vault, because then we don't go to the server for Vault tokens, but it doesn't help the general problem.

This overall issue is a problem with all the Raft-based HashiCorp products, as it turns out, but Nomad is probably impacted the worst because of how much the client gets canonical status from the servers.
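To make that proposed direction a bit more concrete, here is a minimal Go sketch of the idea: when building the server list returned in heartbeat responses, keep only the Raft peers that autopilot currently reports as healthy, rather than every local peer. The function and variable names are hypothetical and are not Nomad's actual internals; this only illustrates the shape of the change, assuming autopilot health can be looked up per server address.

```go
package main

import "fmt"

// filterReadyServers is a hypothetical sketch: instead of returning every
// local Raft peer to the client, return only the peers that autopilot
// currently reports as healthy/ready.
func filterReadyServers(raftPeers []string, autopilotHealthy map[string]bool) []string {
	ready := make([]string, 0, len(raftPeers))
	for _, addr := range raftPeers {
		if autopilotHealthy[addr] {
			ready = append(ready, addr)
		}
	}
	// Defensive fallback for the sketch: if autopilot momentarily reports
	// nothing as healthy, return the full peer list rather than leaving the
	// client with no servers at all.
	if len(ready) == 0 {
		return raftPeers
	}
	return ready
}

func main() {
	peers := []string{"10.0.0.1:4647", "10.0.0.2:4647", "10.0.0.3:4647"}
	healthy := map[string]bool{"10.0.0.1:4647": true, "10.0.0.2:4647": true}
	fmt.Println(filterReadyServers(peers, healthy)) // [10.0.0.1:4647 10.0.0.2:4647]
}
```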
Nomad version
v1.5.15+ent
Operating system and Environment details
Ubuntu 22.04 - AWS EC2 instances
Issue
It looks like we've hit a bug where a Nomad client starts receiving 403s from Vault while we're in the middle of recycling the Nomad servers (3-node cluster: we spin up 3 new servers and then slowly shut the old ones down one by one).
This has happened twice already in our Production systems recently.
Reproduction steps
Client logs:
Servers:
the "Promoting server" message I don't think means leader election since the rest of the logs indicate that other node acquires leadership later in the recycling process (5min later)
After that, the client is rejected by Vault with 403s for all requests for 8+ minutes (so, even after the re-election has happened).
New servers finish registering in Consul
After the 3 old servers have left the cluster, the client no longer receives 403s from Vault.
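As a side note for observing this window while reproducing, a small program against the Consul health API can show which Nomad server instances are registered and whether their checks are passing while the old and new servers overlap (keeping in mind that, per the discussion above, an already-connected client doesn't consult Consul anyway). The service name "nomad" and the default local agent address are assumptions; adjust them to how the servers register in your cluster.

```go
package main

import (
	"fmt"
	"log"

	consul "github.com/hashicorp/consul/api"
)

func main() {
	// Connect to the local Consul agent using default config (CONSUL_HTTP_ADDR etc.).
	client, err := consul.NewClient(consul.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// passingOnly=false returns all registered instances, including ones whose
	// health checks have not passed yet (e.g. servers still joining).
	entries, _, err := client.Health().Service("nomad", "", false, nil)
	if err != nil {
		log.Fatal(err)
	}

	for _, e := range entries {
		fmt.Printf("%s %s:%d checks=%s\n",
			e.Node.Node, e.Service.Address, e.Service.Port, e.Checks.AggregatedStatus())
	}
}
```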
Expected Result
Client should continue to operate normally when rolling Nomad servers
Actual Result
Client is interrupted and receives 403s from Vault