Allow migrating the stored value of the agent checkin long polling timeout #2597

Open
cmacknz opened this issue May 3, 2023 · 7 comments
Labels: 8.9-candidate, Team:Elastic-Agent, Team:Elastic-Agent-Control-Plane

@cmacknz (Member) commented May 3, 2023

The changes in #2408 update the agent to report its configured checkin timeout to Fleet Server. When this parameter is set, Fleet Server will dynamically set the server-side timeout to the agent's checkin timeout minus 2 minutes. This ensures the agent always gets a complete response from Fleet Server before its request times out. See elastic/fleet-server#2491.

To work with this new Fleet Server feature, #2408 must change the default checkin timeout to 7m so that the resulting server-side timeout (7m minus 2m) keeps our existing 5m total checkin duration.

The problem is that for already installed agents, the checkin timeout is persisted to disk as 10m and we currently have no way to override it. We need to add the ability to migrate these saved 10m timeouts to 7m timeouts on upgrade. This must be done prior to merging #2408.
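
For illustration, a minimal sketch of what such a value migration could look like, assuming a plain YAML file with a top-level `timeout` key. The file name, key, and on-disk format here are placeholders, not the agent's actual persisted Fleet configuration (which includes encrypted files):

```go
package migrations

import (
	"os"
	"time"

	"gopkg.in/yaml.v3"
)

const (
	oldDefaultTimeout = 10 * time.Minute // value persisted by existing installs
	newDefaultTimeout = 7 * time.Minute  // new default introduced by #2408
)

// migrateCheckinTimeout rewrites a stored 10m checkin timeout to 7m, leaving
// any user-tuned value untouched. Path and key name are hypothetical.
func migrateCheckinTimeout(path string) error {
	raw, err := os.ReadFile(path)
	if err != nil {
		return err
	}

	var cfg map[string]any
	if err := yaml.Unmarshal(raw, &cfg); err != nil {
		return err
	}

	v, ok := cfg["timeout"].(string)
	if !ok {
		return nil // nothing persisted, nothing to migrate
	}
	d, err := time.ParseDuration(v)
	if err != nil || d != oldDefaultTimeout {
		return nil // unparsable or explicitly customized: leave it alone
	}

	cfg["timeout"] = newDefaultTimeout.String()
	out, err := yaml.Marshal(cfg)
	if err != nil {
		return err
	}
	return os.WriteFile(path, out, 0600)
}
```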

@pchila (Member) commented May 10, 2023

I started implementing this yesterday: the basic idea is to have a set of migrations that are applied in order (recorded in the upgrade marker) and rolled back in reverse order if the agent upgrade is rolled back.

Looking through the code I found out that the watcher process is launched from the "old" binary version before restarting the agent to spin up the "new" version: this means that the decision to roll back and the cleanup are performed by the old binary image acting as the watcher.

My initial idea was to have the migrations rolled back in reverse order to undo the changes when rolling back. However, if the process taking care of the rollback is launched from the old agent binary, it may not have all the applied migrations defined, so it could not roll them back properly.
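
A rough sketch of that idea, assuming the list of applied migration names is recorded in the upgrade marker; the types and function names are placeholders, not the agent's actual code. It also shows exactly where the old-binary watcher gets stuck: it cannot undo a migration it does not know about.

```go
package migrations

// Migration is one reversible change to the agent's persisted state.
type Migration struct {
	Name string
	Up   func() error
	Down func() error
}

// Apply runs the migrations in order and returns the names of those that
// succeeded, which would be recorded in the upgrade marker.
func Apply(all []Migration) ([]string, error) {
	applied := make([]string, 0, len(all))
	for _, m := range all {
		if err := m.Up(); err != nil {
			return applied, err
		}
		applied = append(applied, m.Name)
	}
	return applied, nil
}

// Rollback undoes the applied migrations in reverse order. This is the step
// that breaks when the watcher runs from the old binary: a migration added by
// the new version is simply not present in `known` and cannot be undone here.
func Rollback(known []Migration, applied []string) error {
	byName := make(map[string]Migration, len(known))
	for _, m := range known {
		byName[m.Name] = m
	}
	for i := len(applied) - 1; i >= 0; i-- {
		m, ok := byName[applied[i]]
		if !ok {
			continue // old binary doesn't know this migration; skip (or fail)
		}
		if err := m.Down(); err != nil {
			return err
		}
	}
	return nil
}
```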

There are a few ways we could solve this:

  1. we could launch the watcher process from the new agent binary, but that also poses risks: the new version could be buggy and crash often, impeding the rollback mechanism
  2. we could introduce migrate/rollback config subcommands in the agent that work only when an upgrade marker is present, and the watcher could call them (this would work except for the first upgrade, where the watcher wouldn't know it needs to call a rollback subcommand)
  3. we could make backups of the config files and blindly restore them during rollback (they could contain stale data, though, and any config file added by a migration would stay behind, littering the installation directory)
  4. we give up on the idea of undoing migrations during rollback (I don't like this very much)

None of the solutions I came up with is 100% safe, but I think option 1 has the best chance, provided we keep the watcher simple and bug-free. I am open to suggestions :)

@cmacknz @michalpristas any ideas?

@cmacknz (Member, Author) commented May 10, 2023

The goal here is to ensure that the agent can always roll back to a working state. For that reason I do not like option 1, because we would be relying on a new version of the agent that could have a critical bug. What if the rollback was triggered by a bug in the migration logic of the new agent, a bug that would also be present in the new agent's watcher?

> we could be making backups of the config files and during rollback we blindly restore them (they could contain stale data though and if any config file has been added by migration it would just stay there littering the installation directory)

Making a configuration backup is the safest approach: it always gets us back to the state we were in before the upgrade, regardless of what the new version of the agent we are attempting to upgrade to does.

I think we can live with an unused configuration file if that is the only negative impact. As long as the old agent version ignores it there is no harm.

@blakerouse (Contributor) commented

> 1. we could be launching the watcher process from the new agent binary but that also poses risks as the new version could be buggy and crash often impeding the rollback mechanism

We should absolutely do this; I have talked about it before. We have been stuck with issues because of this: we should be running the newer watcher so we get its fixes on upgrade. Otherwise, if there is a bug in the watcher in the current version, we cannot upgrade to the new version.

Example: upgrading to 8.6, where the watcher didn't support the V2 daemon protocol, so we had to support both V1 and V2 or the upgrade would not succeed.

@pchila (Member) commented May 11, 2023

I see the reasons behind both @cmacknz's and @blakerouse's choices, but there are risks with both solutions, so I would like to propose a compromise:

  • the agent will always create a backup of the config before starting the migrations
  • the migrations will be performed by the new version of the agent
  • the watcher should be spawned from the new agent binary
  • if we need to roll back the upgrade and the migration rollback fails, the watcher will restore the saved config as a fallback
  • we can decide to leave the backup in the agent directory, to be cleaned up at the next upgrade, as an escape hatch in case things go REALLY wrong and a manual restore of the config is necessary (maybe after an uninstall/reinstall of the agent, if that's the only way to get a working agent again)

What do you think? Would this give us the ability to selectively apply migrations/rollbacks while still having the escape hatch of restoring the agent config (manually if necessary)?
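
To make the ordering concrete, a small sketch of the proposed flow under these assumptions; all the callbacks are placeholders for the real backup, migration, and restore logic rather than existing agent functions:

```go
package upgrade

// Steps bundles the hypothetical operations discussed above.
type Steps struct {
	BackupConfig       func() (backupPath string, err error)
	ApplyMigrations    func() error
	RollbackMigrations func() error
	RestoreConfig      func(backupPath string) error
}

// Upgrade always takes the backup first, then lets the new binary apply its
// migrations. A failed migration immediately restores the backup and surfaces
// the original error.
func Upgrade(s Steps) (backupPath string, err error) {
	backupPath, err = s.BackupConfig()
	if err != nil {
		return "", err
	}
	if merr := s.ApplyMigrations(); merr != nil {
		if rerr := s.RestoreConfig(backupPath); rerr != nil {
			return backupPath, rerr
		}
		return backupPath, merr
	}
	return backupPath, nil
}

// Rollback tries the per-migration rollback first and falls back to restoring
// the saved config if that fails.
func Rollback(s Steps, backupPath string) error {
	if err := s.RollbackMigrations(); err != nil {
		return s.RestoreConfig(backupPath)
	}
	return nil
}
```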

@cmacknz (Member, Author) commented May 16, 2023

That mostly works for me.

Do we need to try to roll back the migrations at all? Could we instead always restore the config backup, rather than relying on the migration rollback to fail before we attempt it? I see the attempt to roll back migrations as additional complexity, and we need the config backup path to exist and be well tested regardless.

@pchila (Member) commented May 18, 2023

@cmacknz following our talk yesterday, I am going with backing up the config before migration and restoring that config during upgrade rollback.
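
A minimal sketch of the backup/restore half of that decision, assuming a single plaintext config file; the agent actually persists several files (some encrypted), so the paths and names here are placeholders:

```go
package backup

import (
	"io"
	"os"
)

// BackupConfig copies the config file next to itself before any migration runs.
func BackupConfig(path string) (string, error) {
	backupPath := path + ".bak"
	if err := copyFile(path, backupPath); err != nil {
		return "", err
	}
	return backupPath, nil
}

// RestoreConfig puts the pre-upgrade config back in place during upgrade rollback.
func RestoreConfig(backupPath, path string) error {
	return copyFile(backupPath, path)
}

func copyFile(src, dst string) error {
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()

	out, err := os.OpenFile(dst, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0600)
	if err != nil {
		return err
	}
	defer out.Close()

	_, err = io.Copy(out, in)
	return err
}
```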

@pierrehilbert added the Team:Elastic-Agent-Control-Plane label on Jun 3, 2024
@elasticmachine (Contributor) commented
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
