Allow migrating the stored value of the agent checkin long polling timeout #2597
I started implementing this yesterday. The basic idea is to have a set of migrations that are applied in order (recorded in the upgrade marker) and rolled back in reverse order if the agent upgrade is rolled back. Looking through the code, I found that the watcher process is launched from the "old" binary version before the agent is restarted to spin up the "new" version: this means the decision to roll back and the cleanup are performed by the old binary image acting as watcher. My initial idea was to roll the migrations back in reverse order to undo the changes, but if the process handling the rollback is launched from the old agent binary, it may not have all the applied migrations defined and therefore could not roll them back properly. There are a few ways we could solve this:
None of the solutions I came up with is 100% safe, but I think option 1 has a better chance, provided we keep the watcher simple and bug-free. I am open to suggestions :) @cmacknz @michalpristas any ideas?
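For illustration, here is a minimal sketch of what an ordered migration set with reverse-order rollback could look like. The names (`Migration`, `applyAll`, `rollback`) and the shape of the bookkeeping are assumptions made for this sketch, not the actual agent code:

```go
package migration

import "fmt"

// Migration is a hypothetical single upgrade step that knows how to undo itself.
type Migration struct {
	Name     string
	Apply    func() error
	Rollback func() error
}

// applyAll runs migrations in order and records each applied name
// (e.g. in the upgrade marker) so a later rollback knows what to undo.
func applyAll(migrations []Migration) (applied []string, err error) {
	for _, m := range migrations {
		if err := m.Apply(); err != nil {
			return applied, fmt.Errorf("applying %s: %w", m.Name, err)
		}
		applied = append(applied, m.Name)
	}
	return applied, nil
}

// rollback undoes the applied migrations in reverse order. This only works if
// the process performing the rollback knows about every applied migration,
// which is exactly the problem when the old watcher is in charge of it.
func rollback(migrations []Migration, applied []string) error {
	byName := make(map[string]Migration, len(migrations))
	for _, m := range migrations {
		byName[m.Name] = m
	}
	for i := len(applied) - 1; i >= 0; i-- {
		m, ok := byName[applied[i]]
		if !ok {
			return fmt.Errorf("unknown migration %q: cannot roll back", applied[i])
		}
		if err := m.Rollback(); err != nil {
			return fmt.Errorf("rolling back %s: %w", m.Name, err)
		}
	}
	return nil
}
```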
The goal here is to ensure that the agent can always roll back to a working state. For that reason I do not like option 1, because it relies on the new version of the agent, which could have a critical bug. What if the rollback itself was triggered by a bug in the migration logic of the new agent, a bug that would also be present in the new agent's watcher?
Making a configuration backup is the safest approach: it always gets us back to the state we were in before the upgrade, regardless of what the new agent version we are attempting to upgrade to does. I think we can live with an unused configuration file if that is the only negative impact. As long as the old agent version ignores it, there is no harm.
We should absolutely do this, and I have talked about this before. We have been stuck with issues because of this; we should be running the newer watcher so that we get the watcher fixes on upgrade. Otherwise, if there is a bug in the watcher in the current version, we cannot upgrade to the new version. Example: going to 8.6, where the watcher didn't support the V2 daemon protocol, so we had to support both V1 and V2 or the upgrade would not succeed.
I see the reasons for both @cmacknz's and @blakerouse's choices, but there are risks in both solutions, so I would like to propose a compromise:
What do you think? Would this give us the ability to selectively apply migrations/rollbacks while still having the escape hatch of restoring the agent (manually if necessary)?
That mostly works for me. Do we need to try to roll back the migrations at all? Could we always restore the config backup instead of relying on the migration to fail before we attempt this? I see the attempt to roll back migrations as additional complexity, and we need the config backup path to exist and be well tested regardless.
@cmacknz following our talk yesterday, I am going with backing up the config before migration and restoring that backup during an upgrade rollback.
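As a rough sketch of that approach, something along these lines could back up and restore the config around the migration step. The function names and paths are hypothetical, not the actual implementation:

```go
package migration

import (
	"fmt"
	"os"
)

// backupConfig copies the persisted config to a backup path before any
// migration runs. Paths and naming here are illustrative only.
func backupConfig(configPath, backupPath string) error {
	data, err := os.ReadFile(configPath)
	if err != nil {
		return fmt.Errorf("reading %s: %w", configPath, err)
	}
	// 0600 keeps the backup as private as the original, which may contain secrets.
	return os.WriteFile(backupPath, data, 0o600)
}

// restoreConfig puts the pre-migration config back in place during an
// upgrade rollback, regardless of which migrations actually ran.
func restoreConfig(configPath, backupPath string) error {
	data, err := os.ReadFile(backupPath)
	if err != nil {
		return fmt.Errorf("reading backup %s: %w", backupPath, err)
	}
	return os.WriteFile(configPath, data, 0o600)
}
```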
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane) |
The changes in #2408 update the agent to report its configured checkin timeout to Fleet Server. When this parameter is set, Fleet Server will dynamically set the server-side timeout to the agent's checkin timeout minus 2 minutes. This ensures the agent always gets a complete response from Fleet Server before its request times out. See elastic/fleet-server#2491.
To work with this new Fleet Server feature, #2408 must change the default checkin timeout to 7m so that we keep our existing 5m total checkin duration.
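To make the arithmetic explicit, a small illustration (the constant names are invented for this example and are not taken from the agent code):

```go
package main

import (
	"fmt"
	"time"
)

const (
	fleetServerMargin = 2 * time.Minute // Fleet Server responds this much before the client deadline
	newCheckinTimeout = 7 * time.Minute // proposed default the agent reports to Fleet Server
)

func main() {
	// 7m client timeout - 2m server margin = 5m, the same effective
	// long-poll duration agents have today.
	fmt.Println("effective long poll:", newCheckinTimeout-fleetServerMargin)
}
```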
The problem is that for already installed agents, the checkin timeout is persisted to disk at 10m and we currently have no way to override it. We need to add the ability to migrate these saved 10m timeouts to 7m timeouts on upgrade. This must be done prior to merging #2408.
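A migration for that could look roughly like the sketch below; the settings struct and field names are hypothetical and do not reflect the real configuration schema:

```go
package migration

import "time"

// checkinSettings is a hypothetical view of the persisted agent settings.
type checkinSettings struct {
	Timeout time.Duration `yaml:"timeout"`
}

// migrateCheckinTimeout rewrites a previously persisted 10m default to the
// new 7m default on upgrade. Values other than the old default are treated
// as deliberate customisations and left untouched.
func migrateCheckinTimeout(s *checkinSettings) (changed bool) {
	const (
		oldDefault = 10 * time.Minute
		newDefault = 7 * time.Minute
	)
	if s.Timeout == oldDefault {
		s.Timeout = newDefault
		return true
	}
	return false
}
```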