
[Self-Managed]: Fleet Server permanently goes offline and memory consumption increases on changing logging level to debug. #3485

Closed · amolnater-qasource opened this issue Apr 23, 2024 · 12 comments

Labels: bug (Something isn't working) · impact:high (Short-term priority; add to current release, or definitely next) · QA:Validated (Validated by the QA Team) · Team:Elastic-Agent-Control-Plane · Team:Fleet

amolnater-qasource (Collaborator) commented Apr 23, 2024

Kibana Build details:

VERSION: 8.14.0 BC1
BUILD: 73520
COMMIT: c1513cd7e5a00eab209ba02d30cafd6945d75470

Artifact Link: https://staging.elastic.co/8.14.0-a40d088a/summary-8.14.0.html

Host OS: All

Preconditions:

  1. 8.14.0-BC1 Kibana self-managed environment should be available.
  2. Fleet Server should be installed.

Steps to reproduce:

  1. Navigate to Fleet > Agents > Agent logs tab.
  2. Update the logging level to debug (the same change can also be made via the Fleet API; see the sketch after this list).
  3. Observe that the Fleet Server goes offline permanently and its memory consumption increases.
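
For reference, step 2 can be triggered programmatically through Kibana's Fleet agent-actions API. A minimal Go sketch, assuming Kibana at https://localhost:5601 and a placeholder API key; the agent ID is the example from the diagnostics below, and the payload shape follows the SETTINGS action that the UI sends:

    package main

    import (
        "bytes"
        "fmt"
        "net/http"
    )

    func main() {
        agentID := "f9489d84-c941-40ef-84eb-e07adcf4b37c" // example agent ID from this issue
        url := fmt.Sprintf("https://localhost:5601/api/fleet/agents/%s/actions", agentID)

        // Fleet agent action of type SETTINGS carrying the new log level.
        body := []byte(`{"action":{"type":"SETTINGS","data":{"log_level":"debug"}}}`)

        req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
        if err != nil {
            panic(err)
        }
        req.Header.Set("Content-Type", "application/json")
        req.Header.Set("kbn-xsrf", "true")                   // required by Kibana HTTP APIs
        req.Header.Set("Authorization", "ApiKey <redacted>") // placeholder credential

        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()
        fmt.Println("status:", resp.Status)
    }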

Expected Result:
Fleet Server should remain Healthy after changing the logging level to debug.

Logs:
elastic-agent-diagnostics-2024-04-23T04-48-12Z-00.zip

Screenshot: [attached in the original issue]

Note:

  • The issue is consistently reproducible at our end.
amolnater-qasource added the bug (Something isn't working), Team:Fleet, and impact:high (Short-term priority; add to current release, or definitely next) labels on Apr 23, 2024
amolnater-qasource (Collaborator, Author) commented:

@manishgupta-qasource Please review.

manishgupta-qasource (Collaborator) commented:

Secondary review for this ticket is done.

cmacknz (Member) commented Apr 23, 2024

From the attached diagnostics, the agent reports the following state:

components:
    - id: fleet-server-default
      state:
        component:
            apmconfig: null
            limits:
                gomaxprocs: 0
                source:
                    fields:
                        go_max_procs:
                            kind:
                                numbervalue: 0
        component_idx: 2
        features_idx: 2
        message: 'Healthy: communicating with pid ''6060'''
        state: 2
        units:
            input-fleet-server-default-fleet-server-fleet_server-a4eeee2f-bf68-436c-8c3f-f860be6f8299:
                message: 'Error - could not start the HTTP server for the API: failed to listen on the named pipe \\.\pipe\UwGGXFL1il700DVAc6q-T-1Z9J1UjGMU.sock: open \\.\pipe\UwGGXFL1il700DVAc6q-T-1Z9J1UjGMU.sock: Access is denied.'
                state: 4
            output-fleet-server-default:
                message: 'Error - could not start the HTTP server for the API: failed to listen on the named pipe \\.\pipe\UwGGXFL1il700DVAc6q-T-1Z9J1UjGMU.sock: open \\.\pipe\UwGGXFL1il700DVAc6q-T-1Z9J1UjGMU.sock: Access is denied.'
                state: 4
        version_info:
            build_hash: "11861004"
            meta:
                build_time: 2024-04-18 09:05:58 +0000 UTC
                commit: "11861004"
            name: fleet-server

fleet_message: |+
    fail to checkin to fleet-server: all hosts failed: 1 error occurred:
    	* requester 0/1 to host https://localhost:8221/ errored: Post "https://localhost:8221/api/fleet/agents/f9489d84-c941-40ef-84eb-e07adcf4b37c/checkin?": dial tcp 127.0.0.1:8221: connectex: No connection could be made because the target machine actively refused it.

fleet_state: 4
log_level: debug
message: 1 or more components/units in a failed state
state: 3
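
For reference, the numeric state values above follow the elastic-agent control-protocol state enum (as defined in the elastic-agent-client cproto package); a sketch of the mapping, assuming the enum is unchanged in 8.14:

    // Component/unit states reported over the elastic-agent control protocol.
    const (
        StateStarting    = 0 // starting up
        StateConfiguring = 1 // applying configuration
        StateHealthy     = 2 // running normally
        StateDegraded    = 3 // running with problems
        StateFailed      = 4 // not running; see the accompanying message
        StateStopping    = 5 // shutting down
        StateStopped     = 6 // stopped
    )

Read against this mapping, the dump says the fleet-server process itself is Healthy (state: 2), both of its units are Failed (state: 4) on the named-pipe error, checkins to Fleet are Failed (fleet_state: 4), and the agent overall is Degraded (state: 3).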

cmacknz (Member) commented Apr 23, 2024

I see logs like this frequently repeating:

{"log.level":"info","@timestamp":"2024-04-23T04:47:39.249Z","message":"Error - could not start the HTTP server for the API: failed to listen on the named pipe \\\\.\\pipe\\UwGGXFL1il700DVAc6q-T-1Z9J1UjGMU.sock: open \\\\.\\pipe\\UwGGXFL1il700DVAc6q-T-1Z9J1UjGMU.sock: Access is denied.","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"service.name":"fleet-server","service.type":"fleet-server","state":"FAILED","ecs.version":"1.6.0","ecs.version":"1.6.0"}

michel-laterman (Contributor) commented:

This error occurs when fleet-server tries to start the local metrics server, specifically in github.com/elastic/elastic-agent-libs/api; see https://github.com/elastic/elastic-agent-libs/blob/main/api/routes.go#L39. A minimal sketch of this failure mode follows.
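
On Windows the metrics endpoint listens on a named pipe. A Windows-only sketch using github.com/Microsoft/go-winio (the pipe path here is illustrative, and this is not the library's exact code; the agent derives a unique pipe name per endpoint):

    package main

    import (
        "fmt"
        "net/http"

        "github.com/Microsoft/go-winio"
    )

    func main() {
        // Illustrative pipe path, echoing the one in the error above.
        pipePath := `\\.\pipe\example-metrics.sock`

        // ListenPipe fails with "Access is denied" if the pipe already
        // exists and is held by another process (or by a previous listener
        // that was never released) -- consistent with the symptom after the
        // debug log-level change restarted the endpoint.
        l, err := winio.ListenPipe(pipePath, nil)
        if err != nil {
            fmt.Println("could not start the HTTP server for the API:", err)
            return
        }
        defer l.Close()

        mux := http.NewServeMux()
        mux.HandleFunc("/stats", func(w http.ResponseWriter, r *http.Request) {
            w.Write([]byte(`{"ok":true}`)) // placeholder metrics payload
        })
        _ = http.Serve(l, mux) // serve the monitoring API over the pipe
    }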

michel-laterman (Contributor) commented:

The changes in elastic-agent-api are:

michel-laterman (Contributor) commented:

Is this reproducible on any other OS, or is it just on Windows?

amolnater-qasource (Collaborator, Author) commented:

Hi @michel-laterman

Thank you for looking into this issue.

We have revalidated this issue for a Linux Fleet Server on an 8.14.0 BC1 Kibana cloud environment, with these results:

Observations:

  • The Linux Fleet Server goes offline for some time after the logging level is set to debug.
  • However, it returns to Healthy, and memory consumption does not increase as it does on the Windows Fleet Server.

Logs for Linux fleet-server:
elastic-agent-diagnostics-2024-04-24T06-12-49Z-00 (1).zip

Build details:
VERSION: 8.14.0 BC1
BUILD: 73520
COMMIT: c1513cd7e5a00eab209ba02d30cafd6945d75470

Screenshot: [attached in the original issue]

Please let us know if anything else is required from our end.
Thanks!

michel-laterman (Contributor) commented:

From what I can see, this could have been caused by the policy output reload work we tried to add; the PRs have been reverted in 8.15 and 8.14 as of this morning.

kpollich (Member) commented:

Thanks Michel. @amolnater-qasource, can we retest when the next BC is available? There should be one built tomorrow, April 25.

ycombinator added the Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team) label on Apr 30, 2024
ycombinator (Contributor) commented:

Hi @amolnater-qasource, did you get a chance to retest this one? Thanks!

amolnater-qasource (Collaborator, Author) commented:

Hi Team,

We have revalidated this issue on the latest 8.14.0 BC4 Kibana self-managed environment and found it fixed:

Observations:

  • Fleet Server remains Healthy after changing the logging level to debug.

Logs:
elastic-agent-diagnostics-2024-05-13T06-30-55Z-00.zip

Screenshot: [attached in the original issue]

Build details:
VERSION: 8.14.0 BC4
BUILD: 73836
COMMIT: 23ed1207772b3ae958cb05bc4cdbe39b83507707

Hence, we are closing this issue and marking it as QA:Validated.

Thanks!

amolnater-qasource added the QA:Validated (Validated by the QA Team) label and removed the QA:Ready For Testing (Code is merged and ready for QA to validate) label on May 13, 2024