Change Elastic Agent behaviour when getting 404 on checkin calls #2414

Closed
jlind23 opened this issue Mar 29, 2023 · 10 comments
Labels
Team:Elastic-Agent, Team:Elastic-Agent-Control-Plane

Comments

@jlind23
Contributor

jlind23 commented Mar 29, 2023

We observed that when an agent gets an HTTP 404 while performing a checkin, it retries forever.

We should change this behavior by putting in place a mechanism that retries every 15 minutes rather than every second, as it does today.

jlind23 added the Team:Elastic-Agent label Mar 29, 2023
@jlind23
Contributor Author

jlind23 commented Mar 29, 2023

cc @pierrehilbert @cmacknz this came out of our cloud telemetry.

@cmacknz
Member

cmacknz commented Mar 29, 2023

> The best path forward here will be to shut down the agent after X attempts in order to avoid infinite retries.

I don't think disabling the agent makes sense here. I'm actually not sure that shutting down on a 403 makes sense. The agent should never shut itself down unless explicitly instructed to. This feels like it goes against the tamper protection work we have going on.

Why is retrying forever on a 404 undesirable? We could make the backoff more aggressive to retry less frequently, like every 15 minutes.

@jlind23
Contributor Author

jlind23 commented Mar 29, 2023

As of today the agent retries forever and calls the API every second, which at scale generates a lot of traffic on our end if we host Fleet Server. If we do not shut down the agent, this can continue forever even with an aggressive backoff in place.

@cmacknz
Member

cmacknz commented Mar 29, 2023

Why are we getting 404s at all? Is that an expected result? What put the system into this state?

@jlind23
Contributor Author

jlind23 commented Mar 29, 2023

If the agent document is removed, due to either a manual deletion or a rollback to a previous ES snapshot, then the agent will call a deployment that doesn't know it and end up getting a 404.

@cmacknz
Member

cmacknz commented Mar 29, 2023

Right, my preferred outcome would be to make the backoff on a 404 more aggressive with a longer maximum duration. Retrying every 15 minutes seems reasonable, definitely not every second which is what we do today.

The agent has no way to know whether the 404 is intentional, so I don't think shutting down or unenrolling is a good way to handle this error.

As far as I can tell we don't shut down or do any special handling for 403s. We used to unenroll automatically when getting an invalid API key response but this was removed (likely because of predictably bad results when the API key error was a result of a bug). Here is this code in v8.3.3 for reference:

```go
resp, err := cmd.Execute(ctx, req)
if isUnauth(err) {
	f.unauthCounter++
	if f.shouldUnenroll() {
		f.log.Warnf("retrieved an invalid api key error '%d' times. Starting to unenroll the elastic agent.", f.unauthCounter)
		return &fleetapi.CheckinResponse{
			Actions: []fleetapi.Action{&fleetapi.ActionUnenroll{ActionID: "", ActionType: "UNENROLL", IsDetected: true}},
		}, nil
	}
	return nil, err
}
```

@jlind23
Contributor Author

jlind23 commented Mar 29, 2023

I will update the issue description then💪

pierrehilbert added the Team:Elastic-Agent-Control-Plane label Jun 3, 2024
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@pierrehilbert
Contributor

@jlind23 is it still an issue?

@jlind23
Contributor Author

jlind23 commented Jun 4, 2024

It hasn't shown up in a while, so I'm fine closing it for now.

jlind23 closed this as not planned Jun 4, 2024