[8.14](backport #5034) Fix indefinite memory and CPU consumption when waiting for Fleet Server to be ready #5040

Merged

Merged 1 commit into 8.14 from mergify/bp/8.14/pr-5034 on Jul 3, 2024

Conversation

mergify[bot]
Contributor

@mergify mergify bot commented Jul 2, 2024

What does this PR do?

Fixes the wait for Fleet Server to be ready

Why is it important?

When waiting for Fleet Server to start, the Elastic Agent does not honour the configured timeout while checking whether Fleet Server is ready.

Currently, when the timeout is reached, the operation isn't interrupted: the goroutine waiting for Fleet Server to be ready gets stuck in an infinite loop with no delay between iterations, continually printing a log entry such as:

{"log.level":"info","@timestamp":"2024-07-02T13:18:59.354Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":812},"message":"Waiting for Elastic Agent to start: rpc error: code = Canceled desc = context canceled","ecs.version":"1.6.0"}

This causes a spike in memory and CPU consumption until the agent is killed by the OS, potentially jeopardising the normal operation of the host.
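The commit messages below indicate that the fix makes the wait loop exit once the timeout is reached and back off between retries. The following Go sketch illustrates that general pattern only; the names waitForFleetServer, waitReady, and the delay values are hypothetical and not the actual elastic-agent implementation:

```go
// Illustrative sketch only: waitForFleetServer, waitReady, initialDelay and
// maxDelay are hypothetical names, not the elastic-agent implementation.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// waitForFleetServer polls waitReady until it succeeds, the parent context is
// cancelled, or the timeout elapses. The delay doubles after each failed
// attempt (capped at maxDelay), so a failing check cannot spin the CPU.
func waitForFleetServer(ctx context.Context, timeout time.Duration, waitReady func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	const (
		initialDelay = 500 * time.Millisecond
		maxDelay     = 30 * time.Second
	)
	delay := initialDelay

	for {
		err := waitReady(ctx)
		if err == nil {
			return nil
		}
		if ctx.Err() != nil {
			// Timeout or cancellation: stop instead of looping forever.
			return fmt.Errorf("fleet server not ready: %w", ctx.Err())
		}
		fmt.Printf("Waiting for Fleet Server to start: %v\n", err)

		// Sleep with backoff, but wake up immediately if the deadline passes.
		select {
		case <-ctx.Done():
			return fmt.Errorf("fleet server not ready: %w", ctx.Err())
		case <-time.After(delay):
		}
		if delay *= 2; delay > maxDelay {
			delay = maxDelay
		}
	}
}

func main() {
	// Usage example: a check that never succeeds, so the call returns with an
	// error once the 3-second timeout is reached rather than looping forever.
	err := waitForFleetServer(context.Background(), 3*time.Second, func(ctx context.Context) error {
		return errors.New("not ready yet")
	})
	fmt.Println("result:", err)
}
```

With this shape, a failing readiness check sleeps between attempts and the loop returns as soon as the deadline passes, instead of spinning and logging continuously.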

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • [ ] I have made corresponding changes to the default configuration files
  • [ ] I have added tests that prove my fix is effective or that my feature works
  • [ ] I have added an entry in ./changelog/fragments using the changelog tool
  • [ ] I have added an integration test or an E2E test

Disruptive User Impact

How to test this PR locally

Try to reproduce #5033; the issue should not be reproducible with this fix.

Related issues

Questions to ask yourself

  • How are we going to support this in production?
  • How are we going to measure its adoption?
  • How are we going to debug this?
  • What are the metrics I should take care of?
  • ...

This is an automatic backport of pull request #5034 done by [Mergify](https://mergify.com).

Fix indefinite memory and CPU consumption when waiting for Fleet Server to be ready (#5034)

* exit if timeout is reached while waiting for fleet server to start

* clarify exponential backoff behaviour

* add test

* add changelog

* fix changelog

(cherry picked from commit 8aa3477)
@mergify mergify bot requested a review from a team as a code owner July 2, 2024 19:28
@mergify mergify bot added the backport label Jul 2, 2024
@mergify mergify bot requested review from michel-laterman and pchila and removed request for a team July 2, 2024 19:28
@pierrehilbert pierrehilbert added the Team:Elastic-Agent-Data-Plane label Jul 2, 2024
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@AndersonQ AndersonQ merged commit c131275 into 8.14 Jul 3, 2024
15 checks passed
@AndersonQ AndersonQ deleted the mergify/bp/8.14/pr-5034 branch July 3, 2024 14:21
Labels
backport, Team:Elastic-Agent-Data-Plane