
disconnected client that reboots can't recover allocs w/ Vault tokens #24236

Open
Alexsandr-Random opened this issue Oct 17, 2024 · 13 comments
Labels
hcc/jira, stage/accepted (Confirmed, and intend to work on. No timeline commitment though.), theme/client-restart, type/bug

Comments

@Alexsandr-Random

Nomad version

Nomad v1.8.4
BuildDate 2024-09-17T20:18:34Z
Revision 22ab32e

Operating system and Environment details

OS: Ubuntu 22.04.5 LTS
Kernel: Linux a784.tso.net.ua 6.8.0-47-generic #47~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Oct 2 16:16:55 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

Issue

Background:
Because of power outages and unstable internet, we decided to set heartbeat_grace = "12h" in the server configuration. This lets us ride out the connectivity loss we get during power outages and shelling.
With the standard heartbeat_grace = "60s" and a poor connection (mobile 3G that comes and goes), jobs regularly went to "lost" status, after which the servers recreated them, and we ended up in an endless loop of job restarts on the client server.

Setting heartbeat_grace = "12h" solved that: with unstable internet there are no longer constant job restarts and status changes.
However, it gave rise to a strange problem.
When the client host restarts (it does not matter whether manually or from a power failure), the CNI plugins are not invoked: the nomad interface is not created, and the iptables rules and chains that the CNI plugins should create are missing, while the Docker iptables rules, like the containers, are created. The problem sits at the boundary between Nomad and the CNI plugins.
The only fix is a node drain followed by an undrain, as shown below.
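For reference, the drain/undrain cycle is the standard Nomad CLI, run on the affected client (-self targets the local node; a node ID can be passed instead):

nomad node drain -enable -yes -self
nomad node drain -disable -yes -self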

It turns out that the agent does not track its own state and relies entirely on tracking from the servers, which is exactly what we changed in order to solve the problem described above.

Reproduction steps

Set heartbeat_grace = "12h" in all server configs, like this:
server {
  heartbeat_grace = "12h"
}

Then reboot one of the client nodes that runs jobs requiring CNI plugins and the Docker driver.
After the reboot, log in and check:

  1. Does an interface named "nomad" exist on the agent node (check with ip a)?
  2. Are the CNI iptables rules created after the reboot, once the agent starts the jobs? (See the commands below.)
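A minimal way to check both (the bridge name "nomad" is the default; the exact CNI chain names depend on the CNI plugin version, so treat the grep as a rough filter):

ip link show nomad
sudo iptables -S | grep -i cni
sudo iptables -t nat -S | grep -i cni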

Expected Result

Better integration with the CNI plugins: the CNI iptables rules and chains should be created when jobs start after an agent restart, just as the Docker rules are. If a job requires CNI to run, I expect its CNI rules to be created the same way the Docker ones are.

Actual Result

The CNI plugins are not invoked: the nomad interface is not created, and the iptables rules and chains that should be created by the CNI plugins are missing.

Nomad Client logs (if appropriate)

Here are the logs of the Nomad agent with heartbeat_grace = "12h":

Oct 17 11:23:04 client systemd[1]: Stopping Nomad...
Oct 17 11:23:04 client nomad[1947]: ==> Caught signal: interrupt
Oct 17 11:23:04 client nomad[1947]:     2024-10-17T11:23:04.573+0300 [INFO]  agent: requesting shutdown
Oct 17 11:23:04 client nomad[1947]:     2024-10-17T11:23:04.574+0300 [INFO]  client: shutting down
Oct 17 11:23:04 client nomad[1947]:     2024-10-17T11:23:04.589+0300 [INFO]  client.plugin: shutting down plugin manager: plugin-type=device
Oct 17 11:23:04 client nomad[1947]:     2024-10-17T11:23:04.589+0300 [INFO]  client.plugin: plugin manager finished: plugin-type=device
Oct 17 11:23:04 client nomad[1947]:     2024-10-17T11:23:04.589+0300 [INFO]  client.plugin: shutting down plugin manager: plugin-type=driver
Oct 17 11:23:04 client nomad[1947]:     2024-10-17T11:23:04.603+0300 [INFO]  client.plugin: plugin manager finished: plugin-type=driver
Oct 17 11:23:04 client nomad[1947]:     2024-10-17T11:23:04.603+0300 [INFO]  client.plugin: shutting down plugin manager: plugin-type=csi
Oct 17 11:23:04 client nomad[1947]:     2024-10-17T11:23:04.603+0300 [INFO]  client.plugin: plugin manager finished: plugin-type=csi
Oct 17 11:23:04 client consul[1073]: 2024-10-17T11:23:04.715+0300 [INFO]  agent: Synced service: service=_nomad-task-f8ebc7da-cf91-f55d-a6fd-3cb0437d40e4-group-traefik-traefik-ingress-http
Oct 17 11:23:04 client consul[1073]: agent: Synced service: service=_nomad-task-f8ebc7da-cf91-f55d-a6fd-3cb0437d40e4-group-traefik-traefik-ingress-http
Oct 17 11:23:04 client consul[1073]: 2024-10-17T11:23:04.882+0300 [INFO]  agent: Deregistered service: service=_nomad-client-ne3isvaq5zltnpoffewy3hm6xifswcvs
Oct 17 11:23:04 client consul[1073]: agent: Deregistered service: service=_nomad-client-ne3isvaq5zltnpoffewy3hm6xifswcvs
Oct 17 11:23:04 client nomad[1947]:     2024-10-17T11:23:04.883+0300 [INFO]  agent: shutdown complete
Oct 17 11:23:04 client systemd[1]: nomad.service: Main process exited, code=exited, status=1/FAILURE
Oct 17 11:23:04 client systemd[1]: nomad.service: Failed with result 'exit-code'.
Oct 17 11:23:04 client systemd[1]: nomad.service: Unit process 2274 (nomad) remains running after unit stopped.
Oct 17 11:23:04 client systemd[1]: nomad.service: Unit process 2275 (nomad) remains running after unit stopped.
Oct 17 11:23:04 client systemd[1]: nomad.service: Unit process 2285 (nomad) remains running after unit stopped.
Oct 17 11:23:04 client systemd[1]: nomad.service: Unit process 2293 (nomad) remains running after unit stopped.
Oct 17 11:23:04 client systemd[1]: nomad.service: Unit process 2307 (nomad) remains running after unit stopped.
Oct 17 11:23:04 client systemd[1]: nomad.service: Unit process 2308 (nomad) remains running after unit stopped.
Oct 17 11:23:04 client systemd[1]: nomad.service: Unit process 2309 (nomad) remains running after unit stopped.
Oct 17 11:23:04 client systemd[1]: nomad.service: Unit process 2310 (nomad) remains running after unit stopped.
Oct 17 11:23:04 client systemd[1]: nomad.service: Unit process 2702 (nomad) remains running after unit stopped.
Oct 17 11:23:04 client systemd[1]: nomad.service: Unit process 2841 (nomad) remains running after unit stopped.
Oct 17 11:23:04 client systemd[1]: nomad.service: Unit process 2918 (nomad) remains running after unit stopped.
Oct 17 11:23:04 client systemd[1]: nomad.service: Unit process 3148 (nomad) remains running after unit stopped.
Oct 17 11:23:04 client systemd[1]: nomad.service: Unit process 3519 (nomad) remains running after unit stopped.
Oct 17 11:23:04 client systemd[1]: nomad.service: Unit process 3622 (nomad) remains running after unit stopped.
Oct 17 11:23:04 client systemd[1]: nomad.service: Unit process 3695 (nomad) remains running after unit stopped.
Oct 17 11:23:04 client systemd[1]: nomad.service: Unit process 3768 (nomad) remains running after unit stopped.
Oct 17 11:23:04 client systemd[1]: Stopped Nomad.
Oct 17 11:23:04 client systemd[1]: nomad.service: Consumed 15.103s CPU time.
Oct 17 11:23:04 client systemd[1]: nomad.service: Found left-over process 2274 (nomad) in control group while starting unit. Ignoring.
Oct 17 11:23:04 client systemd[1]: nomad.service: Found left-over process 2275 (nomad) in control group while starting unit. Ignoring.
Oct 17 11:23:04 client systemd[1]: nomad.service: Found left-over process 2285 (nomad) in control group while starting unit. Ignoring.
Oct 17 11:23:04 client systemd[1]: nomad.service: Found left-over process 2293 (nomad) in control group while starting unit. Ignoring.
Oct 17 11:23:04 client systemd[1]: nomad.service: Found left-over process 2307 (nomad) in control group while starting unit. Ignoring.
Oct 17 11:23:04 client systemd[1]: nomad.service: Found left-over process 2308 (nomad) in control group while starting unit. Ignoring.
Oct 17 11:23:04 client systemd[1]: nomad.service: Found left-over process 2309 (nomad) in control group while starting unit. Ignoring.
Oct 17 11:23:04 client systemd[1]: nomad.service: Found left-over process 2310 (nomad) in control group while starting unit. Ignoring.
Oct 17 11:23:04 client systemd[1]: nomad.service: Found left-over process 2702 (nomad) in control group while starting unit. Ignoring.
Oct 17 11:23:04 client systemd[1]: nomad.service: Found left-over process 2841 (nomad) in control group while starting unit. Ignoring.
Oct 17 11:23:04 client systemd[1]: nomad.service: Found left-over process 2918 (nomad) in control group while starting unit. Ignoring.
Oct 17 11:23:04 client systemd[1]: nomad.service: Found left-over process 3148 (nomad) in control group while starting unit. Ignoring.
Oct 17 11:23:04 client systemd[1]: nomad.service: Found left-over process 3519 (nomad) in control group while starting unit. Ignoring.
Oct 17 11:23:04 client systemd[1]: nomad.service: Found left-over process 3622 (nomad) in control group while starting unit. Ignoring.
Oct 17 11:23:04 client systemd[1]: nomad.service: Found left-over process 3695 (nomad) in control group while starting unit. Ignoring.
Oct 17 11:23:04 client systemd[1]: nomad.service: Found left-over process 3768 (nomad) in control group while starting unit. Ignoring.
Oct 17 11:23:04 client systemd[1]: Starting Nomad...
Oct 17 11:23:04 client nomad[5669]: ==> Loaded configuration from /usr/local/etc/nomad.d/base.hcl, /usr/local/etc/nomad.d/client.hcl, /usr/local/etc/nomad.d/mtls.hcl
Oct 17 11:23:04 client nomad[5669]: ==> Starting Nomad agent...
Oct 17 11:23:12 client consul[1073]: 2024-10-17T11:23:12.971+0300 [WARN]  agent: Check is now critical: check=_nomad-check-70969a4c677959b7b48fbc69fbad5f30a2e1a5eb
Oct 17 11:23:12 client consul[1073]: agent: Check is now critical: check=_nomad-check-70969a4c677959b7b48fbc69fbad5f30a2e1a5eb
Oct 17 11:23:12 client consul[1073]: 2024-10-17T11:23:12.974+0300 [WARN]  agent: Check is now critical: check=_nomad-check-370a1f8dbd09fe9ce51cdceed8f2f84c0c99e9b8
Oct 17 11:23:12 client consul[1073]: agent: Check is now critical: check=_nomad-check-370a1f8dbd09fe9ce51cdceed8f2f84c0c99e9b8
Oct 17 11:23:14 client consul[1073]: 2024-10-17T11:23:14.361+0300 [WARN]  agent: Check is now critical: check=_nomad-check-67b7f231e55133b7383914af1d40a06c2285835c
Oct 17 11:23:14 client consul[1073]: agent: Check is now critical: check=_nomad-check-67b7f231e55133b7383914af1d40a06c2285835c
Oct 17 11:23:14 client consul[1073]: 2024-10-17T11:23:14.456+0300 [INFO]  agent: Synced check: check=_nomad-check-67b7f231e55133b7383914af1d40a06c2285835c
Oct 17 11:23:14 client consul[1073]: agent: Synced check: check=_nomad-check-67b7f231e55133b7383914af1d40a06c2285835c
Oct 17 11:23:15 client nomad[5669]: 2024-10-17T11:23:15.412+0300 [TRACE] plugin.stdio: waiting for stdio data
Oct 17 11:23:15 client nomad[5669]: 2024-10-17T11:23:15.419+0300 [TRACE] plugin.stdio: waiting for stdio data
Oct 17 11:23:15 client nomad[5669]: 2024-10-17T11:23:15.429+0300 [TRACE] plugin.stdio: waiting for stdio data
Oct 17 11:23:15 client nomad[5669]: 2024-10-17T11:23:15.435+0300 [TRACE] plugin.stdio: waiting for stdio data
Oct 17 11:23:15 client nomad[5669]: 2024-10-17T11:23:15.445+0300 [TRACE] plugin.stdio: waiting for stdio data
Oct 17 11:23:15 client nomad[5669]: 2024-10-17T11:23:15.456+0300 [TRACE] plugin.stdio: waiting for stdio data
Oct 17 11:23:15 client nomad[5669]: 2024-10-17T11:23:15.463+0300 [TRACE] plugin.stdio: waiting for stdio data
Oct 17 11:23:15 client nomad[5669]: 2024-10-17T11:23:15.472+0300 [TRACE] plugin.stdio: waiting for stdio data
Oct 17 11:23:15 client nomad[5669]: ==> Nomad agent configuration:
Oct 17 11:23:15 client nomad[5669]:        Advertise Addrs: HTTP: 10.100.20.37:4646
Oct 17 11:23:15 client nomad[5669]:             Bind Addrs: HTTP: [0.0.0.0:4646]
Oct 17 11:23:15 client nomad[5669]:                 Client: true
Oct 17 11:23:15 client nomad[5669]:              Log Level: INFO
Oct 17 11:23:15 client nomad[5669]:                 Region: test (DC: test)
Oct 17 11:23:15 client nomad[5669]:                 Server: false
Oct 17 11:23:15 client nomad[5669]:                Version: 1.8.4
Oct 17 11:23:15 client nomad[5669]: ==> Nomad agent started! Log data will stream in below:
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:04.935+0300 [WARN]  agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/var/local/nomad/plugins
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:04.936+0300 [INFO]  agent: detected plugin: name=java type=driver plugin_version=0.1.0
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:04.936+0300 [INFO]  agent: detected plugin: name=docker type=driver plugin_version=0.1.0
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:04.936+0300 [INFO]  agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:04.936+0300 [INFO]  agent: detected plugin: name=exec type=driver plugin_version=0.1.0
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:04.936+0300 [INFO]  agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:04.936+0300 [INFO]  client: using state directory: state_dir=/var/local/nomad/client
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:04.936+0300 [INFO]  client: using alloc directory: alloc_dir=/var/local/nomad/alloc
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:04.936+0300 [INFO]  client: using dynamic ports: min=20000 max=32000 reserved=""
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:04.942+0300 [INFO]  client.fingerprint_mgr.consul: consul agent is available: cluster=default
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:04.948+0300 [WARN]  client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=lo
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:05.386+0300 [INFO]  client.fingerprint_mgr.vault: Vault is available: cluster=default
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.389+0300 [INFO]  client.proclib.cg2: initializing nomad cgroups: cores=0-11
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.390+0300 [INFO]  client.plugin: starting plugin manager: plugin-type=csi
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.390+0300 [INFO]  client.plugin: starting plugin manager: plugin-type=driver
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.390+0300 [INFO]  client.plugin: starting plugin manager: plugin-type=device
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.402+0300 [WARN]  client.server_mgr: no servers available
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.405+0300 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=53ccf457-ddce-2d3b-3f56-48b5a384f6cf task=node-red type=Received msg="Task received by client" failed=false
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.405+0300 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=53ccf457-ddce-2d3b-3f56-48b5a384f6cf task=promtail type=Received msg="Task received by client" failed=false
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.422+0300 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=594865ac-39a6-8b24-013a-10266c3eed7c task=node-web type=Received msg="Task received by client" failed=false
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.422+0300 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=594865ac-39a6-8b24-013a-10266c3eed7c task=promtail type=Received msg="Task received by client" failed=false
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.438+0300 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=6cd39838-d4d4-0e1a-ff3d-f7cc5f9887d5 task=nginx-helper type=Received msg="Task received by client" failed=false
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.449+0300 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=b4d43287-27a2-218a-864c-b21efdd8752a task=redis type=Received msg="Task received by client" failed=false
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.449+0300 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=b4d43287-27a2-218a-864c-b21efdd8752a task=redis-commander type=Received msg="Task received by client" failed=false
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.465+0300 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=f8ebc7da-cf91-f55d-a6fd-3cb0437d40e4 task=traefik type=Received msg="Task received by client" failed=false
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.474+0300 [INFO]  client: started client: node_id=9716bd4b-95b6-53fb-645f-030f2f0fbdea
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.474+0300 [WARN]  client.server_mgr: no servers available
Oct 17 11:23:15 client systemd[1]: Started Nomad.
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.485+0300 [INFO]  agent: (runner) creating new runner (dry: false, once: false)
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.486+0300 [INFO]  agent: (runner) creating watcher
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.487+0300 [INFO]  agent: (runner) starting
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.514+0300 [WARN]  client.server_mgr: no servers available
Oct 17 11:23:15 client consul[1073]: 2024-10-17T11:23:15.550+0300 [INFO]  agent: Synced service: service=_nomad-client-ne3isvaq5zltnpoffewy3hm6xifswcvs
Oct 17 11:23:15 client consul[1073]: agent: Synced service: service=_nomad-client-ne3isvaq5zltnpoffewy3hm6xifswcvs
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.554+0300 [INFO]  client.consul: discovered following servers: servers=[10.100.10.11:4647, 10.100.10.12:4647, 10.100.10.13:4647]
Oct 17 11:23:15 client consul[1073]: 2024-10-17T11:23:15.616+0300 [INFO]  agent: Synced check: check=_nomad-check-61d7fb6fe8c1dc5d8d63715e58d5506ac318a5a7
Oct 17 11:23:15 client consul[1073]: agent: Synced check: check=_nomad-check-61d7fb6fe8c1dc5d8d63715e58d5506ac318a5a7
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.717+0300 [INFO]  agent: (runner) creating new runner (dry: false, once: false)
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.717+0300 [INFO]  agent: (runner) creating watcher
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.717+0300 [INFO]  agent: (runner) starting
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.795+0300 [INFO]  agent: (runner) creating new runner (dry: false, once: false)
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.795+0300 [INFO]  agent: (runner) creating watcher
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.795+0300 [INFO]  agent: (runner) starting
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.863+0300 [INFO]  agent: (runner) creating new runner (dry: false, once: false)
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.863+0300 [INFO]  agent: (runner) creating watcher
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.863+0300 [INFO]  agent: (runner) starting
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.888+0300 [INFO]  client: node registration complete
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.924+0300 [INFO]  agent: (runner) creating new runner (dry: false, once: false)
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.924+0300 [INFO]  agent: (runner) creating watcher
Oct 17 11:23:15 client nomad[5669]:     2024-10-17T11:23:15.924+0300 [INFO]  agent: (runner) starting
Oct 17 11:23:16 client nomad[5669]:     2024-10-17T11:23:16.016+0300 [INFO]  agent: (runner) creating new runner (dry: false, once: false)
Oct 17 11:23:16 client nomad[5669]:     2024-10-17T11:23:16.017+0300 [INFO]  agent: (runner) creating watcher
Oct 17 11:23:16 client nomad[5669]:     2024-10-17T11:23:16.017+0300 [INFO]  agent: (runner) starting
Oct 17 11:23:16 client nomad[5669]:     2024-10-17T11:23:16.078+0300 [INFO]  agent: (runner) creating new runner (dry: false, once: false)
Oct 17 11:23:16 client nomad[5669]:     2024-10-17T11:23:16.078+0300 [INFO]  agent: (runner) creating watcher
Oct 17 11:23:16 client nomad[5669]:     2024-10-17T11:23:16.078+0300 [INFO]  agent: (runner) starting
Oct 17 11:23:16 client nomad[5669]:     2024-10-17T11:23:16.170+0300 [INFO]  agent: (runner) creating new runner (dry: false, once: false)
Oct 17 11:23:16 client nomad[5669]:     2024-10-17T11:23:16.170+0300 [INFO]  agent: (runner) creating watcher
Oct 17 11:23:16 client nomad[5669]:     2024-10-17T11:23:16.170+0300 [INFO]  agent: (runner) starting
Oct 17 11:23:22 client consul[1073]: 2024-10-17T11:23:22.972+0300 [WARN]  agent: Check is now critical: check=_nomad-check-70969a4c677959b7b48fbc69fbad5f30a2e1a5eb
Oct 17 11:23:22 client consul[1073]: agent: Check is now critical: check=_nomad-check-70969a4c677959b7b48fbc69fbad5f30a2e1a5eb
Oct 17 11:23:22 client consul[1073]: 2024-10-17T11:23:22.975+0300 [WARN]  agent: Check is now critical: check=_nomad-check-370a1f8dbd09fe9ce51cdceed8f2f84c0c99e9b8
Oct 17 11:23:22 client consul[1073]: agent: Check is now critical: check=_nomad-check-370a1f8dbd09fe9ce51cdceed8f2f84c0c99e9b8
Oct 17 11:23:23 client nomad[5669]:     2024-10-17T11:23:23.319+0300 [INFO]  client: node registration complete
Oct 17 11:23:24 client consul[1073]: 2024-10-17T11:23:24.331+0300 [INFO]  agent: Synced check: check=_nomad-check-61d7fb6fe8c1dc5d8d63715e58d5506ac318a5a7
Oct 17 11:23:24 client consul[1073]: agent: Synced check: check=_nomad-check-61d7fb6fe8c1dc5d8d63715e58d5506ac318a5a7
Oct 17 11:23:24 client consul[1073]: 2024-10-17T11:23:24.454+0300 [INFO]  agent: Synced check: check=_nomad-check-67b7f231e55133b7383914af1d40a06c2285835c
Oct 17 11:23:24 client consul[1073]: agent: Synced check: check=_nomad-check-67b7f231e55133b7383914af1d40a06c2285835c
Oct 17 11:23:32 client consul[1073]: 2024-10-17T11:23:32.974+0300 [WARN]  agent: Check is now critical: check=_nomad-check-70969a4c677959b7b48fbc69fbad5f30a2e1a5eb
Oct 17 11:23:32 client consul[1073]: agent: Check is now critical: check=_nomad-check-70969a4c677959b7b48fbc69fbad5f30a2e1a5eb
Oct 17 11:23:32 client consul[1073]: 2024-10-17T11:23:32.975+0300 [WARN]  agent: Check is now critical: check=_nomad-check-370a1f8dbd09fe9ce51cdceed8f2f84c0c99e9b8
Oct 17 11:23:32 client consul[1073]: agent: Check is now critical: check=_nomad-check-370a1f8dbd09fe9ce51cdceed8f2f84c0c99e9b8
Oct 17 11:23:42 client consul[1073]: 2024-10-17T11:23:42.974+0300 [WARN]  agent: Check is now critical: check=_nomad-check-70969a4c677959b7b48fbc69fbad5f30a2e1a5eb
Oct 17 11:23:42 client consul[1073]: agent: Check is now critical: check=_nomad-check-70969a4c677959b7b48fbc69fbad5f30a2e1a5eb
Oct 17 11:23:42 client consul[1073]: 2024-10-17T11:23:42.976+0300 [WARN]  agent: Check is now critical: check=_nomad-check-370a1f8dbd09fe9ce51cdceed8f2f84c0c99e9b8
Oct 17 11:23:42 client consul[1073]: agent: Check is now critical: check=_nomad-check-370a1f8dbd09fe9ce51cdceed8f2f84c0c99e9b8
Oct 17 11:23:45 client nomad[5669]: ==> Newer Nomad version available: 1.9.0 (currently running: 1.8.4)
Oct 17 11:23:52 client consul[1073]: 2024-10-17T11:23:52.976+0300 [WARN]  agent: Check is now critical: check=_nomad-check-70969a4c677959b7b48fbc69fbad5f30a2e1a5eb
Oct 17 11:23:52 client consul[1073]: agent: Check is now critical: check=_nomad-check-70969a4c677959b7b48fbc69fbad5f30a2e1a5eb
Oct 17 11:23:52 client consul[1073]: 2024-10-17T11:23:52.977+0300 [WARN]  agent: Check is now critical: check=_nomad-check-370a1f8dbd09fe9ce51cdceed8f2f84c0c99e9b8
Oct 17 11:23:52 client consul[1073]: agent: Check is now critical: check=_nomad-check-370a1f8dbd09fe9ce51cdceed8f2f84c0c99e9b8
Oct 17 11:24:02 client consul[1073]: 2024-10-17T11:24:02.977+0300 [WARN]  agent: Check is now critical: check=_nomad-check-70969a4c677959b7b48fbc69fbad5f30a2e1a5eb
....
@tgross tgross changed the title Nomad does not use CNI plugins on restart Nomad does not use CNI plugins on host restart Oct 17, 2024
@tgross
Member

tgross commented Oct 17, 2024

@Alexsandr-Random creating the bridge and iptables chains at startup has been something we've wanted to do for a while. See #6618.

But in the case you're talking about here, the tasks are all dead when the client comes up so it needs to restart the tasks. But because the host has power-cycled, all the allocation-level hooks (like CNI plugins) will have lost local state as well. So what we need to cover this case is for the client to not just restart the tasks but restart the allocation.

The tricky thing is that we've done this intentionally -- the client has been offline so it can't know whether the server has rescheduled the workload (or maybe the user has stopped it!). So if the tasks are gone we fail the "restore" and then we wait until the client contacts the server to attempt to create a new allocation.

You may want to look at the disconnect block to get closer to the behavior you're looking for here.

@Alexsandr-Random
Author

Hi!
Thanks for the quick reply!
What we changed (all servers in the cluster were restarted and configs updated):

server {
  heartbeat_grace = "60s"
}

and then we set the following policy in the job spec:

    restart {
      interval = "24h"
      delay    = "3m"
      mode     = "delay"
    }

    disconnect {
      lost_after = "12h"
      replace    = false
      reconcile  = "keep_original"
    }

So we are trying to have the job marked as lost only after 12h; before that it should be marked as unknown.
Then we ran some tests:

  1. Simulated 5-10 minutes of network downtime for the host: the cluster marks the node as disconnected, the job goes to "unknown", and then it recovers properly afterwards.
  2. Simulated a reboot (the internet connection is back once the host comes up), and here we hit some problems:
    2.1 The nomad interface is created and the general CNI rules (the chains) are created, but
    we cannot retrieve Vault secrets from storage for the first few seconds (Nomad starts after Consul via systemd), so the task fails and is then restarted. Meanwhile I saw a significant delay (~2m) while Nomad tried to recover the allocation state. I also think that if the host reboots and the internet connection is lost, the job will not be recovered with this policy (without it, Nomad could do so).
    That is very strange, because all templates with secrets have change_mode = "noop".
    Here are some logs:
Oct 18 12:44:59 a784 nomad[1771]:     2024-10-18T12:44:49.329+0300 [INFO]  client.fingerprint_mgr.vault: Vault is available: cluster=default
Oct 18 12:45:00 a784 nomad[1771]:     2024-10-18T12:45:00.475+0300 [ERROR] client.rpc: error performing RPC to server: error="rpc error: EOF" rpc=Node.DeriveVaultToken server=10.100.10.11:4647
Oct 18 12:45:00 a784 nomad[1771]:     2024-10-18T12:45:00.475+0300 [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: EOF" rpc=Node.DeriveVaultToken server=10.100.10.11:4647
Oct 18 12:45:00 a784 nomad[1771]:     2024-10-18T12:45:00.475+0300 [ERROR] client.vault: error making derive token RPC: error="rpc error: EOF"
Oct 18 12:45:00 a784 nomad[1771]:     2024-10-18T12:45:00.475+0300 [ERROR] client.vault: error deriving token: name=default error="DeriveVaultToken RPC failed: rpc error: EOF" alloc_id=9905abcd-c65b-9940-d2c1-093797184f77 task_names=["node-web"]
Oct 18 12:45:00 a784 nomad[1771]:     2024-10-18T12:45:00.475+0300 [ERROR] client.alloc_runner.task_runner.task_hook.vault: failed to derive Vault token: alloc_id=9905abcd-c65b-9940-d2c1-093797184f77 task=node-web error="DeriveVaultToken RPC failed: rpc error: EOF" recoverable=false
Oct 18 12:45:00 a784 nomad[1771]:     2024-10-18T12:45:00.476+0300 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=9905abcd-c65b-9940-d2c1-093797184f77 task=node-web type=Killing msg="Vault: failed to derive vault token: DeriveVaultToken RPC failed: rpc error: EOF" failed=true
Oct 18 12:45:00 a784 nomad[1771]:     2024-10-18T12:45:00.881+0300 [WARN]  agent: (view) vault.read(test/data/registry-auth): vault.read(test/data/registry-auth): Error making API request.
Oct 18 12:45:00 a784 nomad[1771]: URL: GET http://vault-proxy.service.consul:8100/v1/test/data/registry-auth
Oct 18 12:45:00 a784 nomad[1771]:     2024-10-18T12:45:00.881+0300 [WARN]  agent: (view) vault.read(test/data/loki-auth): vault.read(test/data/loki-auth): Error making API request.
Oct 18 12:45:00 a784 nomad[1771]: URL: GET http://vault-proxy.service.consul:8100/v1/test/data/loki-auth
Oct 18 12:45:01 a784 nomad[1771]:     2024-10-18T12:45:01.184+0300 [WARN]  agent: (view) vault.read(test/data/loki-auth): vault.read(test/data/loki-auth): Error making API request.
Oct 18 12:45:01 a784 nomad[1771]: URL: GET http://vault-proxy.service.consul:8100/v1/test/data/loki-auth
Oct 18 12:45:01 a784 nomad[1771]:     2024-10-18T12:45:01.184+0300 [WARN]  agent: (view) vault.read(test/data/registry-auth): vault.read(test/data/registry-auth): Error making API request.
Oct 18 12:45:01 a784 nomad[1771]: URL: GET http://vault-proxy.service.consul:8100/v1/test/data/registry-auth
Oct 18 12:45:01 a784 nomad[1771]:     2024-10-18T12:45:01.747+0300 [WARN]  agent: (view) vault.read(test/data/loki-auth): vault.read(test/data/loki-auth): Error making API request.
Oct 18 12:45:01 a784 nomad[1771]: URL: GET http://vault-proxy.service.consul:8100/v1/test/data/loki-auth
Oct 18 12:45:01 a784 nomad[1771]:     2024-10-18T12:45:01.777+0300 [WARN]  agent: (view) vault.read(test/data/registry-auth): vault.read(test/data/registry-auth): Error making API request.
Oct 18 12:45:01 a784 nomad[1771]: URL: GET http://vault-proxy.service.consul:8100/v1/test/data/registry-auth
Oct 18 12:45:02 a784 nomad[1771]:     2024-10-18T12:45:02.811+0300 [WARN]  agent: (view) vault.read(test/data/loki-auth): vault.read(test/data/loki-auth): Error making API request.
Oct 18 12:45:02 a784 nomad[1771]: URL: GET http://vault-proxy.service.consul:8100/v1/test/data/loki-auth
Oct 18 12:45:02 a784 nomad[1771]:     2024-10-18T12:45:02.865+0300 [WARN]  agent: (view) vault.read(test/data/registry-auth): vault.read(test/data/registry-auth): Error making API request.
Oct 18 12:45:02 a784 nomad[1771]: URL: GET http://vault-proxy.service.consul:8100/v1/test/data/registry-auth
Oct 18 12:45:03 a784 nomad[1771]:     2024-10-18T12:45:03.480+0300 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=9905abcd-c65b-9940-d2c1-093797184f77 task=node-web type=Template msg="Missing: vault.read(test/data/registry-auth)" failed=false
Oct 18 12:45:03 a784 nomad[1771]:     2024-10-18T12:45:03.503+0300 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=9905abcd-c65b-9940-d2c1-093797184f77 task=promtail type=Template msg="Missing: vault.read(test/data/loki-auth)" failed=false

2.2 Also, with that policy, even after the alloc restarts from the failed state, the CNI rules for this job are not created, so we end up with the alloc not listening on the port it should be (dynamic port mapping).
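For example, this is one way to check (the alloc ID and port are placeholders; the assigned dynamic port is shown in the alloc status output):

nomad alloc status <alloc_id>        # shows the dynamically assigned port
sudo ss -lnt | grep <assigned_port>  # no output means nothing is listening on it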

@tgross
Member

tgross commented Oct 18, 2024

tl;dr The root cause of the problem here is that the task's Vault token is written to the secrets dir, which is a tmpfs and so gets destroyed on client reboot. Read on for more... 😀

Simulate reboot

Can you describe this? There's a good bit of difference between a host reboot and restarting the Nomad agent, so I want to make sure that's well-covered.

But assuming that's ok, I'm pretty sure I understand why you're seeing the sequence of events you've described.

If a task dies while a client is stopped, what happens is the following:

  1. The client starts and attempts to restore the allocation client.go#L1286-L1294.
  2. The allocrunner attempts to restore the task alloc_runner.go#L485-L487.
  3. The taskrunner fails to restore the handle to the task task_runner.go#L1247-L1271.
  • This should be writing an event for the task: "failed to restore task; will not run until server is contacted". Are you seeing that?
  • This does not return an error back up to ar.Restore(), so the client will then attempt to run the allocrunner.
  4. The allocrunner runs (firing the CNI hooks) and completes the "prestart" phase.
  5. The taskrunner runs, and hits the vault_hook.
  6. The vault_hook finds the Vault token in the NOMAD_SECRETS_DIR (a tmpfs) and restores that into its own state.
  7. Any template blocks that read from Vault use this token.

Where things go wrong for the reboot case is step (6): the host has been rebooted, so the tmpfs has been cleared! So we need to get the Vault token again, and with the Legacy Workflow that happens by contacting the Nomad server.

We should _not_ be seeing subsequent requests to Vault for the template blocks though. We might have a bug in how we handle the failure case. At vault_hook.go#L384-L392 we write a task event and kill the task:

// Check if this is a server side error
if structs.IsServerSide(err) {
    h.logger.Error("failed to derive Vault token", "error", err, "server_side", true)
    h.lifecycle.Kill(h.ctx,
        structs.NewTaskEvent(structs.TaskKilling).
            SetFailsTask().
            SetDisplayMessage(fmt.Sprintf("Vault: server failed to derive vault token: %v", err)))
    return "", true
}

But for some reason we're seeing subsequent Vault API requests from what looks like a template block. Unfortunately the logs you've provided are at INFO-level so I'm not able to dig in much further than that as to why that is.

However, as noted in the tl;dr, the real problem you're facing is actually about not being able to restore from a tmpfs! The only way to fix this would be to write the Vault token (and Workload Identity token, if using that workflow) to durable storage. This would be less secure on the client but maybe useful for folks who anticipate having clients completely disconnected for long periods of time.

Let me bring that back to the team for discussion and see if there's a way we could be comfortable with that. It probably doesn't help with template blocks though, as those still need to get data from their data sources to render (or even check if anything has changed while the client was down!)

@tgross tgross changed the title Nomad does not use CNI plugins on host restart tasks fail to recover on host restart until connection to servers restored Oct 18, 2024
@tgross
Member

tgross commented Oct 18, 2024

One more thing: are you getting the task event "failed to restore task; will not run until server is contacted" for these failed tasks when we try to restore them?

@Alexsandr-Random
Author

Can you describe this? There's a good bit of difference between a host reboot and restarting the Nomad agent, so I want to make sure that's well-covered.

I mean a reboot of the host machine via the reboot command, which simulates a loss of power.

The only way to fix this would be to write the Vault token (and Workload Identity token, if using that workflow) to durable storage.

Maybe that's a good option to start with. Can we do it via the Nomad config files?

One more thing: are you getting the task event "failed to restore task; will not run until server is contacted" for these failed tasks when we try to restore them?

I searched for these keywords and phrases in the logs (not just the syslog files) and did not find anything.

@tgross
Member

tgross commented Oct 21, 2024

Maybe that's a good option to start with. Can we do it via the Nomad config files?

Not currently; that's something we need to discuss as a team, to see whether it can be done in a way that won't break our security model.

I searched for these keywords and phrases in the logs (not just the syslog files) and did not find anything.

It should be in the Task Events (visible via nomad alloc status $allocid).
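For example (the alloc ID is a placeholder):

nomad alloc status <alloc_id> | grep -i "failed to restore"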

@Alexsandr-Random
Author

@tgross
For now we are back to the original scheme, without any changes to heartbeat_grace and without the disconnect block, but I still have the results of the tests with the disconnect block and heartbeat_grace = 60s.
This is what the Task Events look like in the Nomad GUI:
[screenshot: Task Events as shown in the Nomad web UI]

@tgross
Member

tgross commented Oct 23, 2024

@Alexsandr-Random that set of task events doesn't really reflect the problem at hand. It looks like the task died and then was still pending restart when the host was rebooted. You can see "task restarting in 3m39s" and then there's only a gap of 2m32s before the client reconnected. At which point we identify that the sibling is already gone and halt this task too.

But that's ok, it's pretty clear what the overall problem is here with the tmpfs. I'm going to mark this for roadmapping.

@tgross tgross changed the title tasks fail to recover on host restart until connection to servers restored tasks with Vault tokens fail to recover on host restart until connection to servers restored Oct 23, 2024
@tgross tgross added the stage/accepted label (Confirmed, and intend to work on. No timeline commitment though.) and removed the theme/cni label Oct 23, 2024
@tgross tgross moved this from Triaging to Needs Roadmapping in Nomad - Community Issues Triage Oct 23, 2024
@tgross tgross removed their assignment Oct 23, 2024
@tgross tgross changed the title tasks with Vault tokens fail to recover on host restart until connection to servers restored disconnected client that reboots can't recover allocs w/ Vault tokens Oct 23, 2024
@Alexsandr-Random
Author

@tgross Thanks a lot for the quick responses!
How quickly can we expect this to be resolved, in theory? I assume you classify it as "low priority", but if possible, could you give us an approximate time frame for when a fix could be released?

@jrasell
Member

jrasell commented Oct 24, 2024

Hi @Alexsandr-Random. The issue has not been assigned to be worked on yet, and therefore we cannot provide an approximate time frame for if/when a potential fix will be released. When this issue is assigned to an engineer, they will update it accordingly.

@Alexsandr-Random
Author

@tgross I thought about this phrase for a long time:

Where things go wrong for the reboot case is step (6): the host has been rebooted, so the tmpfs has been cleared! So we need to get the Vault token again, and with the Legacy Workflow that happens by contacting the Nomad server.

So if we upgrade to the new workflow, should this bug disappear, since we would no longer depend on the tmpfs?

@Alexsandr-Random
Author

One more question: can we somehow cache secrets on the remote server, say using the nomad or vault binary, so that the secrets remain on the server side even after a reboot? Because jobs cannot be started if there is no internet and the server was rebooted.

@tgross
Member

tgross commented Nov 15, 2024

So if we upgrade to the new workflow, should this bug disappear, since we would no longer depend on the tmpfs?

No, because in the new workflow you still would need to contact Vault for the new token. The new workflow just doesn't hop through the Nomad server to do so.

One more question: can we somehow cache secrets on the remote server, say using the nomad or vault binary, so that the secrets remain on the server side even after a reboot? Because jobs cannot be started if there is no internet and the server was rebooted.

If they're static secrets without a TTL, you could copy them out of the tmpdir to somewhere durable. But that wouldn't help the problem that the template block expects to be able to run. In that scenario, #16272 would help you but that hasn't been implemented yet.
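For illustration only, the "copy out of the tmpdir" idea could look roughly like this; the alloc_dir path is taken from the client logs above, the backup directory is a made-up placeholder, and Nomad will not automatically read these files back on restore:

ALLOC_DIR=/var/local/nomad/alloc             # alloc_dir from this client's config
BACKUP_DIR=/var/local/nomad/secrets-backup   # hypothetical durable location

# snapshot each task's secrets dir (a tmpfs) before an expected shutdown
for secrets in "$ALLOC_DIR"/*/*/secrets; do
  [ -d "$secrets" ] || continue
  dest="$BACKUP_DIR/${secrets#"$ALLOC_DIR"/}"
  mkdir -p "$dest"
  cp -a "$secrets"/. "$dest"/
done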
