
Support for remote ES output #3051

Merged
juliaElastic merged 99 commits into elastic:main on Nov 29, 2023

Conversation


@juliaElastic commented Oct 24, 2023

Changes in this PR:

  • Fleet Server handles `remote_elasticsearch` outputs: it creates a new child bulker for each remote output and uses that bulker to read, create, and update API keys on the remote ES cluster (see the sketch after this list)

  • if the remote output configuration changes, the existing child bulker is stopped and a new one is started

  • API keys are invalidated when a remote ES output is removed from the agent policy

  • added unit and integration tests to cover the changes

  • to be added in a follow-up PR: report an error state if the remote ES cluster is not accessible
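
A rough sketch of the per-output child bulker bookkeeping described above. All names here (`childBulker`, `remoteBulkers`, `ensure`) are illustrative, not the actual fleet-server types:

```go
package bulk

import "context"

// childBulker tracks one child bulker talking to a remote ES cluster.
// Concurrent access to the map below needs guarding; see the review
// discussion about the bulker map further down.
type childBulker struct {
	cancel  context.CancelFunc // stops the child bulker's run loop
	cfgHash string             // hash of the output config, used to detect changes
}

// remoteBulkers keeps one child bulker per remote output, keyed by output name.
type remoteBulkers map[string]*childBulker

// ensure reports whether a (re)start is needed: either no child bulker exists
// for the output yet, or its configuration has changed. The caller starts a
// new child bulker on the returned context when restart is true.
func (m remoteBulkers) ensure(ctx context.Context, outputName, cfgHash string) (childCtx context.Context, restart bool) {
	if cur, ok := m[outputName]; ok {
		if cur.cfgHash == cfgHash {
			return nil, false // config unchanged: keep the existing child bulker
		}
		cur.cancel() // config changed: stop the old child bulker before replacing it
	}

	childCtx, cancel := context.WithCancel(ctx)
	m[outputName] = &childBulker{cancel: cancel, cfgHash: cfgHash}
	return childCtx, true
}
```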

What is the problem this PR solves?

Remote ES output support in Fleet Server

How does this PR solve the problem?

It generates API keys for agents against the remote ES output's host, using the service_token created for the `fleet-server-remote` service account.
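
For context, the remote connection amounts to an Elasticsearch client pointed at the remote output's host and authenticated with that service token. A minimal go-elasticsearch sketch, assuming the client's `ServiceToken` config field; host and token are placeholders, and this is not the code path fleet-server actually uses:

```go
package main

import (
	"log"

	"github.com/elastic/go-elasticsearch/v8"
)

func main() {
	// Sketch: connect to the remote monitoring cluster using the service token
	// created for the fleet-server-remote service account.
	es, err := elasticsearch.NewClient(elasticsearch.Config{
		Addresses:    []string{"https://remote-es.example.com:443"},      // placeholder remote host
		ServiceToken: "<service token generated on the remote cluster>", // placeholder
	})
	if err != nil {
		log.Fatalf("creating remote ES client: %v", err)
	}

	res, err := es.Info() // simple reachability check against the remote cluster
	if err != nil {
		log.Fatalf("remote ES not reachable: %v", err)
	}
	defer res.Body.Close()
	log.Println(res.Status())
}
```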

How to test this PR locally

  1. Start Kibana locally with the changes in this PR
  2. Create a cloud deployment to use as the remote monitoring cluster; tested with a PR deployment, but any 8.12-SNAPSHOT deployment works
  3. Start fleet-server locally with the changes in this PR (standalone mode was used)
  4. Create a remote ES output with a valid service token generated in the remote cluster:
// create service token in remote kibana console
POST kbn:/api/fleet/service_tokens
{
  "remote": true
}
  5. Create an agent policy that uses this output as the monitoring output
  6. Enroll an agent locally or from a VM into this policy
  7. Verify that the agent API key is created on the remote cluster and that monitoring data arrives from the enrolled agent (a verification sketch follows below)
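
A sketch of how step 7 could be checked programmatically against the remote cluster (the index pattern is an assumption about the default agent monitoring data streams; a plain `GET _security/api_key` in the remote Kibana console works just as well):

```go
package remotecheck

import (
	"log"

	"github.com/elastic/go-elasticsearch/v8"
)

// verifyRemote lists API keys and counts agent monitoring documents on the
// remote cluster. Sketch only; requires manage_api_key and read privileges.
func verifyRemote(es *elasticsearch.Client) {
	keys, err := es.Security.GetAPIKey() // the agent's output API key should appear here
	if err != nil {
		log.Fatalf("listing API keys: %v", err)
	}
	defer keys.Body.Close()
	log.Println("api keys:", keys.Status())

	docs, err := es.Count(es.Count.WithIndex("metrics-elastic_agent.*")) // assumed index pattern
	if err != nil {
		log.Fatalf("counting monitoring docs: %v", err)
	}
	defer docs.Body.Close()
	log.Println("monitoring docs:", docs.Status())
}
```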

Design Checklist

  • I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
  • I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
  • I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

Related issues

Closes https://github.com/elastic/ingest-dev/issues/1017

@juliaElastic added the enhancement label on Oct 24, 2023
@juliaElastic self-assigned this on Oct 24, 2023

@michel-laterman left a comment


I need to reacquaint myself with output policy generation in order to better review the actual work being done

@juliaElastic

Thanks, I had a look at it and talked to the team; there seems to be a known issue with Beats not properly reporting or updating their status when an output unit is failing :/

Is there an open issue that we can track?


@michel-laterman left a comment


Looks good, great test coverage!

I just think we need to warn if we are going to orphan keys, and we should decide on how much concurrency control we need for the bulker map
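
On the concurrency question: one option would be an RWMutex-guarded registry around the child-bulker map, roughly like this (sketch only, with `Bulk` standing in as a placeholder for the bulker interface):

```go
package bulk

import "sync"

// Bulk is a placeholder for the bulker interface in this sketch.
type Bulk = any

// bulkerRegistry takes a shared lock for lookups on the hot path and an
// exclusive lock only when a child bulker is created or replaced.
// Not the actual implementation.
type bulkerRegistry struct {
	mu      sync.RWMutex
	bulkers map[string]Bulk
}

func (r *bulkerRegistry) get(outputName string) (Bulk, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	b, ok := r.bulkers[outputName]
	return b, ok
}

func (r *bulkerRegistry) put(outputName string, b Bulk) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.bulkers[outputName] = b
}
```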

// read output config from .fleet-policies, not filtering by policy id as agent could be reassigned
policy, err := dl.QueryOutputFromPolicy(ctx, ack.bulk, outputName)
if err != nil || policy == nil {
zlog.Debug().Str("outputName", outputName).Msg("Output policy not found")
Contributor

If we can't find the policy associated with an output and need to invalidate the API key, that means it's an orphaned key, right?
Should we emit a WARN log about the key for the agent being orphaned?

Contributor Author

Changed to a warning log.
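
For reference, the change amounts to something along these lines, using the `zlog` and `outputName` from the snippet above (sketch; the exact message may differ in the actual commit):

```go
// Sketch: warn instead of debug when the output policy backing the API key
// cannot be found, since the key is likely orphaned at that point.
zlog.Warn().Str("outputName", outputName).Msg("Output policy not found, API key is orphaned")
```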

remoteBulker2.AssertExpectations(t)
}

func TestInvalidateAPIKeysRemoteOutputReadFromPolicies(t *testing.T) {
Contributor

👍
great test cases

go func() {
defer wg.Done()

_, err := childBulker.APIKeyAuth(ctx, apikey.APIKey{})
Contributor

I think that APIKeyAuth calls are synchronous; they don't actually get bulked or use the bulkCtx that is created for the child context.

Contributor Author

I see, I might be able to test it with APIKeyUpdate

Contributor Author
@juliaElastic Nov 27, 2023

It doesn't seem to work with APIKeyUpdate either; now the test times out after 10m. I tried to Run the bulker, but that doesn't seem to help.
I'm still getting the error with the invalid ES host; do you know how we can mock the ES client in the bulker for this test?

dial tcp: lookup remote-es: no such host","message":"Error sending bulk API Key update request to Elasticsearch

EDIT: I managed to mock the ES client, and now I'm not getting any errors. Shouldn't APIKeyUpdate use bulkCtx? It calls waitBulkAction.

    engine.go:256: {"level":"info","name":"remote","message":"remote output configuration has changed"}
    engine.go:187: {"level":"debug","outputName":"remote","message":"Bulk context done"}
    engine.go:176: {"level":"debug","outputName":"remote","message":"Bulker started"}
    opApiKey.go:231: {"level":"debug","IDs":[""],"role":,"message":"API Keys updated."}
    bulk_test.go:372: <nil>
    bulk_test.go:375: Expected context cancel err:  <nil>
--- FAIL: TestCancelCtxChildBulkerReplaced (5.00s)

Maybe there is a way to validate that "Bulk context done" is logged?
I feel like I'm spending too much time on this test.

I'm also seeing a data race warning in buildkite for this test.

Contributor Author

@michel-laterman I removed this test for now, as I couldn't get it working.

toRetireAPIKeys = &model.ToRetireAPIKeyIdsItems{
ID: agentOutput.APIKeyID,
RetiredAt: time.Now().UTC().Format(time.RFC3339),
Output: agentOutputName,
Contributor

This seems like a good solution to get this out, thanks for all the effort!

@joshdover commented Nov 27, 2023

> How complex would it be to report this via the control protocol? Since we aren't supporting this on Serverless for now, we may want to just do that option first so we don't have to block this feature or only release it as beta. It could still be a follow up to this PR though.
>
> I'm not familiar with the control protocol, how can Fleet Server send a UnitObserved message? Is it the same as the component unit state in the agent doc?

This is done here in Fleet Server's "agent" module for interacting with the control protocol:

func (a *Agent) UpdateState(state client.UnitState, message string, payload map[string]interface{}) error {

EDIT: I think the state is actually updated in the fleet module, see these examples:

f.reporter.UpdateState(client.UnitStateFailed, fmt.Sprintf("Error - %s", err), nil) //nolint:errcheck // unclear on what should we do if updating the status fails?

I believe we could modify the output unit's state to degraded with an informative message whenever one of the remote outputs is not accessible.
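
For example, reusing the reporter pattern shown above (sketch only; where this would be called from, and for which unit, is exactly the open question):

```go
// Sketch: mark the output unit degraded when a remote output is unreachable.
// outputName and err come from whatever health check ends up being used.
f.reporter.UpdateState(client.UnitStateDegraded,
	fmt.Sprintf("remote ES output %s is not accessible: %v", outputName, err),
	nil) //nolint:errcheck // same caveat as the existing call above
```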

@juliaElastic commented Nov 27, 2023

> How complex would it be to report this via the control protocol? Since we aren't supporting this on Serverless for now, we may want to just do that option first so we don't have to block this feature or only release it as beta. It could still be a follow up to this PR though.
>
> I'm not familiar with the control protocol, how can Fleet Server send a UnitObserved message? Is it the same as the component unit state in the agent doc?
>
> This is done here in Fleet Server's "agent" module for interacting with the control protocol:
>
> func (a *Agent) UpdateState(state client.UnitState, message string, payload map[string]interface{}) error {
>
> EDIT: I think the state is actually updated in the fleet module, see these examples:
>
> f.reporter.UpdateState(client.UnitStateFailed, fmt.Sprintf("Error - %s", err), nil) //nolint:errcheck // unclear on what should we do if updating the status fails?
>
> I believe we could modify the output unit's state to degraded with an informative message whenever one of the remote outputs is not accessible.

Yes, Craig gave some pointers last week, though it's not as easy as it sounds: the outputs are not assigned to the Fleet Server policy, so these units would have to be added as extra units to the state reporter. I'll have to dig into how this works, which I'll do in a follow-up PR.

@joshdover

Yeah we might have to hack it somehow and just use the single output Fleet Server is assigned to report on all remote outputs it connects to. Agree on a follow up 👍

@@ -748,6 +748,13 @@ func processPolicy(ctx context.Context, zlog zerolog.Logger, bulker bulk.Bulk, a
}

data := model.ClonePolicyData(pp.Policy.Data)
for policyName, policyOutput := range data.Outputs {
err := policy.ProcessOutputSecret(ctx, policyOutput, bulker)
@juliaElastic Nov 27, 2023

This is needed to read the remote ES output's service_token from .fleet-secrets before calling Prepare.
Tested together with the Kibana change that saves the service_token as a secret: elastic/kibana#171875
This function is a noop if output secrets are not enabled.
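
Roughly the idea, with hypothetical field and helper names (not the actual implementation):

```go
// Sketch: if the output's service_token is stored as a secret reference,
// resolve it from .fleet-secrets and inject the plain value before Prepare
// runs. readSecret and the field names here are hypothetical.
if ref := output.Secrets.ServiceToken.ID; ref != "" {
	token, err := readSecret(ctx, bulker, ref) // hypothetical lookup in .fleet-secrets
	if err != nil {
		return err
	}
	output.ServiceToken = token
}
```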

@juliaElastic commented Nov 28, 2023

> Yeah we might have to hack it somehow and just use the single output Fleet Server is assigned to report on all remote outputs it connects to. Agree on a follow up 👍

@joshdover I'm coming back to this: if we start reporting a degraded state on at least one unit, Fleet Server goes to an overall unhealthy state and shows degraded on the /api/status API. We want to avoid this, as discussed before, so I'm not sure there is any option other than reporting output state separately from the fleet-server component state.

EDIT: I'm planning to continue with the data stream output health reporting instead, assuming it is required for 8.12: #3116

jpdjere pushed a commit to jpdjere/kibana that referenced this pull request Nov 28, 2023
## Summary

Related to elastic#104986

Making remote ES output's service_token a secret.

fleet-server change here:
elastic/fleet-server#3051 (comment)

Steps to verify:
- Enable remote ES output and output secrets locally in `kibana.dev.yml`:
```
xpack.fleet.enableExperimental: ['remoteESOutput', 'outputSecretsStorage']
```
- Start ES, Kibana, and fleet-server locally, and start a second ES locally
  - see detailed steps here: elastic/fleet-server#3051
- Create a remote ES output, verify that the service_token is stored as a secret reference
```
GET .kibana_ingest/_search?q=type:ingest-outputs
```
- Verify that the enrolled agent sends data to the remote ES successfully

<img width="561" alt="image" src="https://github.com/elastic/kibana/assets/90178898/122d9800-a2ec-47f8-97a7-acf64b87172a">
<img width="549" alt="image" src="https://github.com/elastic/kibana/assets/90178898/e1751bdd-5aaf-4f68-9f92-7076b306cdfe">



### Checklist

- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios

@juliaElastic merged commit 992c2dc into elastic:main on Nov 29, 2023
9 checks passed
Labels: enhancement
5 participants