Indefinite retries on delayed enrollment retry too quickly #6761

Open
cmacknz opened this issue Feb 7, 2025 · 1 comment
Labels
bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team


cmacknz commented Feb 7, 2025

#4727 made us retry indefinitely when delayed enrollment is used. This works and does include an exponential backoff, but the backoff durations are very short.

The backoff for delayed enrollment is implemented in the following call, backoff constructor, and constants:

err = c.enrollWithBackoff(ctx, persistentConfig)

backExp := backoff.NewExpBackoff(signal, enrollBackoffInit, enrollBackoffMax)

enrollBackoffInit = time.Second
enrollBackoffMax = 10 * time.Second

The initial delay is 1s, ramping up to a maximum of 10s. We have a user report that when many agent VM images start before Fleet Server is available and ready to accept connections, delayed enrollment can effectively DDoS their network infrastructure.

We should make the following changes:

  1. There must be a random delay added before the first connection attempt to avoid each agent making its initial request concurrently. The Fleet Gateway uses 500ms for this jitter duration.
    var defaultGatewaySettings = &fleetGatewaySettings{
        Duration: 1 * time.Second,        // time between successful calls
        Jitter:   500 * time.Millisecond, // used as a jitter for duration
        Backoff: backoffSettings{ // time after a failed call
            Init: 60 * time.Second,
            Max:  10 * time.Minute,
        },
    }
  2. The maximum backoff duration when using delayed enrollment should be increased. The Fleet Gateway uses a maximum of 10 minutes for checkin requests.

Possibly, the delayed enrollment and fleet gateway checkins should use the same backoff algorithm since they are both critical operations that reach out to Fleet Server indefinitely.

@elasticmachine

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
