Change Elastic Agent behaviour when getting 404 on checkin calls #2414

jlind23 · 2023-03-29T12:59:07Z

We observed that when an agent gets an HTTP 404 while performing a checkin, it retries forever.

We should change this behavior by putting in place a mechanism that retries every 15 minutes instead of every second, as it does today.


jlind23 · 2023-03-29T12:59:49Z

cc @pierrehilbert @cmacknz this came out of our cloud telemetry.

cmacknz · 2023-03-29T14:42:53Z

The best path forward here will be to shut down the agent after X attempts in order to avoid infinite retry.

I don't think disabling the agent makes sense here. I'm actually not sure that shutting down on a 403 makes sense. The agent should never shut itself down unless explicitly instructed to. This feels like it goes against the tamper protection work we have going on.

Why is retrying forever on a 404 undesirable? We could make the backoff more aggressive to retry less frequently, like every 15 minutes.

jlind23 · 2023-03-29T14:59:27Z

As of today, the Agent retries forever and calls the API every second, which at scale generates a lot of traffic on our end if we host Fleet Server. If we do not shut down the agent, this can continue forever, even with an aggressive backoff in place.

cmacknz · 2023-03-29T15:31:22Z

Why are we getting 404s at all? Is that an expected result? What put the system into this state?

jlind23 · 2023-03-29T17:25:30Z

If the agent document is removed, either through manual deletion or a rollback to a previous ES snapshot, the agent will call a deployment that doesn't know it and end up getting a 404.

cmacknz · 2023-03-29T18:50:44Z

Right, my preferred outcome would be to make the backoff on a 404 more aggressive with a longer maximum duration. Retrying every 15 minutes seems reasonable, definitely not every second which is what we do today.

The agent has no way to know if the 404 is an intentional error, so I don't think shutting down or unenrolling is a good way to handle it.

As far as I can tell we don't shut down or do any special handling for 403s. We used to unenroll automatically when getting an invalid API key response but this was removed (likely because of predictably bad results when the API key error was a result of a bug). Here is this code in v8.3.3 for reference:

elastic-agent/internal/pkg/agent/application/gateway/fleet/fleet_gateway.go

Lines 337 to 349 in 0ffbedf

    
    resp, err := cmd.Execute(ctx, req)
    if isUnauth(err) {
    	f.unauthCounter++
    	if f.shouldUnenroll() {
    		f.log.Warnf("retrieved an invalid api key error '%d' times. Starting to unenroll the elastic agent.", f.unauthCounter)
    		return &fleetapi.CheckinResponse{
    			Actions: []fleetapi.Action{&fleetapi.ActionUnenroll{ActionID: "", ActionType: "UNENROLL", IsDetected: true}},
    		}, nil
    	}
    	return nil, err
    }

jlind23 · 2023-03-29T19:27:19Z

I will update the issue description then💪

elasticmachine · 2024-06-03T15:42:49Z

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

pierrehilbert · 2024-06-03T18:40:18Z

@jlind23 is it still an issue?

jlind23 · 2024-06-04T05:23:23Z

It hasn't shown up in a while, so I'm fine closing it for now.

jlind23 added the Team:Elastic-Agent label Mar 29, 2023

pierrehilbert added the Team:Elastic-Agent-Control-Plane label Jun 3, 2024

jlind23 closed this as not planned Jun 4, 2024
