Allow migrating the stored value of the agent checkin long polling timeout #2597
I started implementing this yesterday. The basic idea is to have a set of migrations that are applied in order (recorded in the upgrade marker) and rolled back in reverse order if the agent upgrade is rolled back. Looking through the code, I found that the watcher process is launched from the "old" binary version before the agent is restarted to spin up the "new" version: this means the decision to roll back and the cleanup are performed by the old binary image acting as watcher. My initial idea was to roll the migrations back in reverse order to undo the changes, but if the process handling the rollback is launched from the old agent binary, it may not have all the applied migrations defined and therefore could not roll them back properly. There are a few ways we could solve this:
None of the solutions I came up with is 100% safe, but I think option 1 has a better chance, provided we keep the watcher simple and bug-free. I am open to suggestions :) @cmacknz @michalpristas any ideas?
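For illustration, here is a minimal sketch of what an ordered migration set with reverse-order rollback could look like. The names (`Migration`, `applyAll`, `rollback`) and the shape of the bookkeeping are assumptions made for this sketch, not the actual agent code:

```go
package migration

import "fmt"

// Migration is a hypothetical single upgrade step that knows how to undo itself.
type Migration struct {
	Name     string
	Apply    func() error
	Rollback func() error
}

// applyAll runs migrations in order and records each applied name
// (e.g. in the upgrade marker) so a later rollback knows what to undo.
func applyAll(migrations []Migration) (applied []string, err error) {
	for _, m := range migrations {
		if err := m.Apply(); err != nil {
			return applied, fmt.Errorf("applying %s: %w", m.Name, err)
		}
		applied = append(applied, m.Name)
	}
	return applied, nil
}

// rollback undoes the applied migrations in reverse order. This only works if
// the process performing the rollback knows about every applied migration,
// which is exactly the problem when the old watcher is in charge of it.
func rollback(migrations []Migration, applied []string) error {
	byName := make(map[string]Migration, len(migrations))
	for _, m := range migrations {
		byName[m.Name] = m
	}
	for i := len(applied) - 1; i >= 0; i-- {
		m, ok := byName[applied[i]]
		if !ok {
			return fmt.Errorf("unknown migration %q: cannot roll back", applied[i])
		}
		if err := m.Rollback(); err != nil {
			return fmt.Errorf("rolling back %s: %w", m.Name, err)
		}
	}
	return nil
}
```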
The goal here is to ensure that the agent can always roll back to a working state. For that reason I do not like option 1, because it relies on the new version of the agent, which could have a critical bug. What if the rollback itself was triggered by a bug in the migration logic of the new agent, a bug that would also be present in the new agent's watcher?
Making a configuration backup is the safest approach: it always gets us back to the state we were in before the upgrade, regardless of what the new agent version we are attempting to upgrade to does. I think we can live with an unused configuration file if that is the only negative impact. As long as the old agent version ignores it, there is no harm.
We should absolutely do this, and I have talked about this before. We have been stuck with issues because of this; we should be running the newer watcher so that we get the watcher fixes on upgrade. Otherwise, if there is a bug in the watcher in the current version, we cannot upgrade to the new version. Example: going to 8.6, where the watcher didn't support the V2 daemon protocol, so we had to support both V1 and V2 or the upgrade would not succeed.
I see the reasons for both @cmacknz's and @blakerouse's choices, but there are risks in both solutions, so I would like to propose a compromise:
What do you think? Would this give us the ability to selectively apply migrations/rollbacks while still having the escape hatch of restoring the agent (manually if necessary)?
That mostly works for me. Do we need to try to roll back the migrations at all? Could we always restore the config backup instead of relying on the migration to fail before we attempt this? I see the attempt to roll back migrations as additional complexity, and we need the config backup path to exist and be well tested regardless.
@cmacknz following our talk yesterday, I am going with backing up the config before migration and restoring that backup during an upgrade rollback.
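As a rough sketch of that approach, something along these lines could back up and restore the config around the migration step. The function names and paths are hypothetical, not the actual implementation:

```go
package migration

import (
	"fmt"
	"os"
)

// backupConfig copies the persisted config to a backup path before any
// migration runs. Paths and naming here are illustrative only.
func backupConfig(configPath, backupPath string) error {
	data, err := os.ReadFile(configPath)
	if err != nil {
		return fmt.Errorf("reading %s: %w", configPath, err)
	}
	// 0600 keeps the backup as private as the original, which may contain secrets.
	return os.WriteFile(backupPath, data, 0o600)
}

// restoreConfig puts the pre-migration config back in place during an
// upgrade rollback, regardless of which migrations actually ran.
func restoreConfig(configPath, backupPath string) error {
	data, err := os.ReadFile(backupPath)
	if err != nil {
		return fmt.Errorf("reading backup %s: %w", backupPath, err)
	}
	return os.WriteFile(configPath, data, 0o600)
}
```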
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane) |
The changes in #2408 update the agent to report its configured checkin timeout to Fleet Server. When this parameter is set, Fleet Server will dynamically set the server-side timeout to the agent's checkin timeout minus 2 minutes. This ensures the agent always gets a complete response from Fleet Server before its request times out. See elastic/fleet-server#2491.
To work with this new Fleet Server feature, #2408 must change the default checkin timeout to 7m so that we keep our existing 5m total checkin duration.
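To make the arithmetic explicit, a small illustration (the constant names are invented for this example and are not taken from the agent code):

```go
package main

import (
	"fmt"
	"time"
)

const (
	fleetServerMargin = 2 * time.Minute // Fleet Server responds this much before the client deadline
	newCheckinTimeout = 7 * time.Minute // proposed default the agent reports to Fleet Server
)

func main() {
	// 7m client timeout - 2m server margin = 5m, the same effective
	// long-poll duration agents have today.
	fmt.Println("effective long poll:", newCheckinTimeout-fleetServerMargin)
}
```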
The problem is that for already installed agents, the checkin timeout is persisted to disk at 10m and we currently have no way to override it. We need to add the ability to migrate these saved 10m timeouts to 7m timeouts on upgrade. This must be done prior to merging #2408.
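A migration for that could look roughly like the sketch below; the settings struct and field names are hypothetical and do not reflect the real configuration schema:

```go
package migration

import "time"

// checkinSettings is a hypothetical view of the persisted agent settings.
type checkinSettings struct {
	Timeout time.Duration `yaml:"timeout"`
}

// migrateCheckinTimeout rewrites a previously persisted 10m default to the
// new 7m default on upgrade. Values other than the old default are treated
// as deliberate customisations and left untouched.
func migrateCheckinTimeout(s *checkinSettings) (changed bool) {
	const (
		oldDefault = 10 * time.Minute
		newDefault = 7 * time.Minute
	)
	if s.Timeout == oldDefault {
		s.Timeout = newDefault
		return true
	}
	return false
}
```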