disconnected client that reboots can't recover allocs w/ Vault tokens #24236
@Alexsandr-Random creating the bridge and iptables chains at startup has been something we've wanted to do for a while; see #6618. But in the case you're describing here, the tasks are all dead when the client comes up, so it needs to restart them. And because the host has power-cycled, all the allocation-level hooks (like the CNI plugins) will have lost their local state as well. So to cover this case, the client needs to not just restart the tasks but restart the whole allocation. The tricky thing is that the current behavior is intentional -- the client has been offline, so it can't know whether the server has rescheduled the workload (or maybe the user has stopped it!). So if the tasks are gone we fail the "restore" and then wait until the client contacts the server to attempt to create a new allocation. You may want to look at the …
Hi!
and then we set up the following policy in the job spec:
So we are trying to have the job marked as lost only after 12h; before that, we need it to be marked as unknown.
2.2 Also, with that policy, even after the alloc is restarted from the failed state, the CNI rules for this job are not created, and we end up in a situation where the alloc is not listening on the port it should be (dynamic port mapping).
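(For reference: a group-level `disconnect` policy along these lines, written as a hypothetical sketch rather than the exact policy from this report and assuming Nomad 1.8+, might look like the following.)

```hcl
group "app" {
  disconnect {
    # Keep allocations from a disconnected client in the "unknown" state
    # for up to 12h before the servers mark them "lost".
    lost_after = "12h"

    # Don't schedule replacement allocations while the client is disconnected.
    replace = false
  }
}
```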
tl;dr The root cause of the problem here is that the task's Vault token is written to the secrets dir, which is a tmpfs that gets destroyed on client reboot. Read on for more... 😀
Can you describe this? There's a good bit of difference between a host reboot and restarting the Nomad agent, so I want to make sure that's well-covered. But assuming that's ok, I'm pretty sure I understand why you're seeing the sequence of events you've described. If a task dies while a client is stopped, what happens is the following:
Where things are going wrong for the reboot case is step (6). That's because the host has been rebooted, so the tmpfs has been cleared! So we need to get the Vault token again, and with the Legacy Workflow that happens by contacting the Nomad server. We should _not_ be seeing subsequent requests to Vault for the …, because of this check:

```go
// Check if this is a server side error
if structs.IsServerSide(err) {
	h.logger.Error("failed to derive Vault token", "error", err, "server_side", true)
	h.lifecycle.Kill(h.ctx,
		structs.NewTaskEvent(structs.TaskKilling).
			SetFailsTask().
			SetDisplayMessage(fmt.Sprintf("Vault: server failed to derive vault token: %v", err)))
	return "", true
}
```

But for some reason we're seeing subsequent Vault API requests from what looks like a … However, as noted in the tl;dr, the real problem you're facing is actually about not being able to restore from a tmpfs! The only way to fix this would be to write the Vault token (and Workload Identity token, if using that workflow) to durable storage. This would be less secure on the client but maybe useful for folks who anticipate having clients completely disconnected for long periods of time. Let me bring that back to the team for discussion and see if there's a way we could be comfortable with that. It probably doesn't help with …
One more thing: are you getting the task event "failed to restore task; will not run until server is contacted" for these failed tasks when we try to restore them?
I mean a reboot of the host machine via the "reboot" command, which simulates loss of power.
Maybe it's a good option to start with. Can we do it via the Nomad config files?
I searched for these keywords and phrases in the logs (not only the syslog files) and did not find anything.
Not currently; that's something we need to discuss as a team, to see whether it can be done in a way that won't break our security model.
It should be in the Task Events (visible via …).
@tgross
@Alexsandr-Random that set of task events doesn't really reflect the problem at hand. It looks like the task died and then was still pending restart when the host was rebooted. You can see "task restarting in 3m39s" and then there's only a gap of 2m32s before the client reconnected. At which point we identify that the sibling is already gone and halt this task too. But that's ok, it's pretty clear what the overall problem is here with the tmpfs. I'm going to mark this for roadmapping.
@tgross Thanks a lot for the quick responses!
Hi @Alexsandr-Random. The issue has not been assigned to be worked on yet, and therefore we cannot provide an approximate time frame for if/when a potential fix will be released. When this issue is assigned to an engineer, they will update it accordingly.
@tgross I thought about the phrase for a long time
So if we upgrade to the new workflow, should this bug disappear, since we would no longer depend on the tmpfs?
One more question: can we somehow cache secrets on a remote server, say using the nomad/vault binary?
No, because in the new workflow you still would need to contact Vault for the new token. The new workflow just doesn't hop through the Nomad server to do so.
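(For context, a minimal sketch of what the workload-identity-based "new" workflow looks like in a job spec, assuming Nomad 1.7+ with Vault configured for JWT auth; the identity name, audience, and role below are illustrative assumptions, not values from this report.)

```hcl
task "app" {
  # With workload identity, the task authenticates to Vault directly
  # using a Nomad-signed JWT instead of a server-derived Vault token.
  identity {
    name = "vault_default"   # assumed: identity for the default Vault cluster
    aud  = ["vault.io"]      # assumed: audience expected by Vault's JWT auth method
    ttl  = "1h"
  }

  vault {
    role = "nomad-workloads" # assumed: Vault role bound to this workload
  }
}
```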
If they're static secrets without a TTL, you could copy them out of the tmpfs secrets dir to somewhere durable. But that wouldn't help the problem that the …
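(One possible way to do that, sketched under the assumption that the static secret is rendered by a `template` block: point the template's destination at the task's `local/` directory, which is disk-backed, instead of `secrets/`, which is the tmpfs. The secret path and file name below are made up for illustration, and this trades some security for durability.)

```hcl
task "app" {
  template {
    # Hypothetical example: render a static secret into local/ (on disk,
    # survives a host reboot) rather than secrets/ (tmpfs, wiped on reboot).
    data        = <<-EOT
      {{ with secret "kv/data/myapp" }}{{ .Data.data.api_key }}{{ end }}
    EOT
    destination = "local/api_key"
    change_mode = "noop"
  }
}
```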
Nomad version
Nomad v1.8.4
BuildDate 2024-09-17T20:18:34Z
Revision 22ab32e
Operating system and Environment details
OS: Ubuntu 22.04.5 LTS
Kernel: Linux a784.tso.net.ua 6.8.0-47-generic #47~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Oct 2 16:16:55 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
Issue
Background:
Under conditions of power outages and unstable internet, we decided to set the parameter heartbeat_grace = 12h in the server configuration. This lets us ride out unstable internet during power outages and shelling.
With the default heartbeat_grace = 60s and a bad connection, jobs regularly went into the "lost" status, after which they were recreated by the servers, and we ended up in an endless loop of job restarts on the client node whenever the (mobile 3G) connection came and went.
To solve this problem we used heartbeat_grace = 12h, so with unstable internet there is no longer constant restarting of jobs and changing of statuses.
However, this gave rise to a strange problem.
When the client host is restarted (whether manually or by a power failure), the CNI plugins are not invoked: the nomad bridge interface is not created, and the iptables rules and chains that the CNI plugins should create are not loaded, while the Docker iptables rules (and the containers themselves) are created. The problem is at the boundary between Nomad and the CNI plugins.
This can only be fixed with a node drain/undrain.
It turns out that the agent does not track its own state and relies entirely on tracking by the servers, which is exactly what we changed in order to work around the problem described above.
Reproduction steps
Set heartbeat_grace = 12h in all server configs, like this:
```hcl
server {
  heartbeat_grace = "12h"
}
```
Then reboot one of the client nodes that runs jobs requiring CNI plugins and Docker.
After the reboot, log in and check:
Expected Result
Better integration with CNI plugins: iptables rules and chains should be created when jobs start after an agent restart, the same way the Docker rules are. I expect the CNI rules to be created, just like the Docker ones, if the job requires CNI to run.
Actual Result
CNI plugins are not invoked: the nomad interface is not created, and the iptables rules and chains that should be created by the CNI plugins are not loaded.
Nomad Client logs (if appropriate)
Here are the Nomad agent logs from when we set heartbeat_grace = 12h: