Change Elastic Agent behaviour when getting 404 on checkin calls #2414
Comments
cc @pierrehilbert @cmacknz this came out of our cloud telemetry. |
I don't think disabling the agent makes sense here. I'm actually not sure that shutting down on a 403 makes sense. The agent should never shut itself down unless explicitly instructed to. This feels like it goes against the tamper protection work we have going on. Why is retrying forever on a 404 undesirable? We could make the backoff more aggressive to retry less frequently, like every 15 minutes. |
As of today the agent retries forever and calls the API every second, which at scale generates a lot of traffic on our end if we host Fleet Server. If we do not shut down the agent, this can continue forever even with an aggressive backoff in place. |
Why are we getting 404s at all? Is that an expected result? What put the system into this state? |
If the agent document is removed, due to either a manual deletion or a rollback to a previous ES snapshot, then the agent will call a deployment that doesn't know it and end up getting a 404. |
Right, my preferred outcome would be to make the backoff on a 404 more aggressive, with a longer maximum duration. Retrying every 15 minutes seems reasonable, and definitely not every second, which is what we do today. The agent has no way to know whether the 404 is an intentional error, so I don't think shutting down or unenrolling is a good way to handle it. As far as I can tell we don't shut down or do any special handling for 403s. We used to unenroll automatically when getting an invalid API key response, but this was removed (likely because of predictably bad results when the API key error was the result of a bug). Here is this code in v8.3.3 for reference: elastic-agent/internal/pkg/agent/application/gateway/fleet/fleet_gateway.go Lines 337 to 349 in 0ffbedf
|
I will update the issue description then💪 |
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane) |
@jlind23 is it still an issue? |
It hasn't shown up in a while, so I'm fine closing it for now. |
We observed that when an agent gets an HTTP 404 while performing a checkin, it retries forever. We should change this behavior by putting in place a mechanism that retries every 15 minutes rather than every second, as it does today.