
Support for remote ES output #3051

Merged
juliaElastic merged 99 commits into elastic:main on Nov 29, 2023

Conversation


@juliaElastic commented Oct 24, 2023

Changes in this PR:

  • Fleet Server handles `remote_elasticsearch` outputs: it creates a new child bulker for each remote output and uses that bulker to read, create, and update API keys on the remote ES cluster (see the sketch after this list)

  • if the remote output configuration changes, the existing child bulker is stopped and a new one is started

  • API keys are invalidated when a remote ES output is removed from the agent policy

  • added unit and integration tests to cover the changes

  • to be added in a follow-up PR: report an error state if the remote ES cluster is not accessible
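
A rough sketch of the per-output child bulker bookkeeping described above. All names here (`childBulker`, `remoteBulkers`, `ensure`) are illustrative, not the actual fleet-server types:

```go
package bulk

import "context"

// childBulker tracks one child bulker talking to a remote ES cluster.
// Concurrent access to the map below needs guarding; see the review
// discussion about the bulker map further down.
type childBulker struct {
	cancel  context.CancelFunc // stops the child bulker's run loop
	cfgHash string             // hash of the output config, used to detect changes
}

// remoteBulkers keeps one child bulker per remote output, keyed by output name.
type remoteBulkers map[string]*childBulker

// ensure reports whether a (re)start is needed: either no child bulker exists
// for the output yet, or its configuration has changed. The caller starts a
// new child bulker on the returned context when restart is true.
func (m remoteBulkers) ensure(ctx context.Context, outputName, cfgHash string) (childCtx context.Context, restart bool) {
	if cur, ok := m[outputName]; ok {
		if cur.cfgHash == cfgHash {
			return nil, false // config unchanged: keep the existing child bulker
		}
		cur.cancel() // config changed: stop the old child bulker before replacing it
	}

	childCtx, cancel := context.WithCancel(ctx)
	m[outputName] = &childBulker{cancel: cancel, cfgHash: cfgHash}
	return childCtx, true
}
```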

What is the problem this PR solves?

Remote ES output support in Fleet Server

How does this PR solve the problem?

It generates API keys for agents against the remote ES output's host, using the service_token created for the `fleet-server-remote` service account.
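
For context, the remote connection amounts to an Elasticsearch client pointed at the remote output's host and authenticated with that service token. A minimal go-elasticsearch sketch, assuming the client's `ServiceToken` config field; host and token are placeholders, and this is not the code path fleet-server actually uses:

```go
package main

import (
	"log"

	"github.com/elastic/go-elasticsearch/v8"
)

func main() {
	// Sketch: connect to the remote monitoring cluster using the service token
	// created for the fleet-server-remote service account.
	es, err := elasticsearch.NewClient(elasticsearch.Config{
		Addresses:    []string{"https://remote-es.example.com:443"},      // placeholder remote host
		ServiceToken: "<service token generated on the remote cluster>", // placeholder
	})
	if err != nil {
		log.Fatalf("creating remote ES client: %v", err)
	}

	res, err := es.Info() // simple reachability check against the remote cluster
	if err != nil {
		log.Fatalf("remote ES not reachable: %v", err)
	}
	defer res.Body.Close()
	log.Println(res.Status())
}
```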

How to test this PR locally

  1. Start Kibana locally with the changes in this PR
  2. Create a cloud deployment to use as the remote monitoring cluster; tested with a PR deployment, but any 8.12-SNAPSHOT deployment works
  3. Start fleet-server locally with the changes in this PR (standalone mode was used)
  4. Create a remote ES output with a valid service token generated in the remote cluster:
// create service token in remote kibana console
POST kbn:/api/fleet/service_tokens
{
  "remote": true
}
  5. Create an agent policy that uses this output as the monitoring output
  6. Enroll an agent locally or from a VM into this policy
  7. Verify that the agent API key is created on the remote cluster and that monitoring data arrives from the enrolled agent (a verification sketch follows below)
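
A sketch of how step 7 could be checked programmatically against the remote cluster (the index pattern is an assumption about the default agent monitoring data streams; a plain `GET _security/api_key` in the remote Kibana console works just as well):

```go
package remotecheck

import (
	"log"

	"github.com/elastic/go-elasticsearch/v8"
)

// verifyRemote lists API keys and counts agent monitoring documents on the
// remote cluster. Sketch only; requires manage_api_key and read privileges.
func verifyRemote(es *elasticsearch.Client) {
	keys, err := es.Security.GetAPIKey() // the agent's output API key should appear here
	if err != nil {
		log.Fatalf("listing API keys: %v", err)
	}
	defer keys.Body.Close()
	log.Println("api keys:", keys.Status())

	docs, err := es.Count(es.Count.WithIndex("metrics-elastic_agent.*")) // assumed index pattern
	if err != nil {
		log.Fatalf("counting monitoring docs: %v", err)
	}
	defer docs.Body.Close()
	log.Println("monitoring docs:", docs.Status())
}
```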

Design Checklist

  • I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
  • I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
  • I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

Related issues

Closes https://github.com/elastic/ingest-dev/issues/1017

@juliaElastic added the enhancement label on Oct 24, 2023
@juliaElastic self-assigned this on Oct 24, 2023

@michel-laterman left a comment


I need to reacquaint myself with output policy generation in order to better review the actual work being done

@juliaElastic

Thanks, I had a look at it and talked to the team; there seems to be a known issue with Beats not properly reporting or updating their status when an output unit is failing :/

Is there an open issue that we can track?


@michel-laterman left a comment


Looks good, great test coverage!

I just think we need to warn if we are going to orphan keys, and we should decide on how much concurrency control we need for the bulker map
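
On the concurrency question: one option would be an RWMutex-guarded registry around the child-bulker map, roughly like this (sketch only, with `Bulk` standing in as a placeholder for the bulker interface):

```go
package bulk

import "sync"

// Bulk is a placeholder for the bulker interface in this sketch.
type Bulk = any

// bulkerRegistry takes a shared lock for lookups on the hot path and an
// exclusive lock only when a child bulker is created or replaced.
// Not the actual implementation.
type bulkerRegistry struct {
	mu      sync.RWMutex
	bulkers map[string]Bulk
}

func (r *bulkerRegistry) get(outputName string) (Bulk, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	b, ok := r.bulkers[outputName]
	return b, ok
}

func (r *bulkerRegistry) put(outputName string, b Bulk) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.bulkers[outputName] = b
}
```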

// read output config from .fleet-policies, not filtering by policy id as agent could be reassigned
policy, err := dl.QueryOutputFromPolicy(ctx, ack.bulk, outputName)
if err != nil || policy == nil {
zlog.Debug().Str("outputName", outputName).Msg("Output policy not found")
Contributor

If we can't find the policy associated with an output and need to invalidate the API key, that means it's an orphaned key, right?
Should we emit a WARN log about the key for the agent being orphaned?

Contributor Author

Changed to a warning log.
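
For reference, the change amounts to something along these lines, using the `zlog` and `outputName` from the snippet above (sketch; the exact message may differ in the actual commit):

```go
// Sketch: warn instead of debug when the output policy backing the API key
// cannot be found, since the key is likely orphaned at that point.
zlog.Warn().Str("outputName", outputName).Msg("Output policy not found, API key is orphaned")
```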

remoteBulker2.AssertExpectations(t)
}

func TestInvalidateAPIKeysRemoteOutputReadFromPolicies(t *testing.T) {
Contributor

👍
great test cases

go func() {
defer wg.Done()

_, err := childBulker.APIKeyAuth(ctx, apikey.APIKey{})
Contributor

I think that APIKeyAuth calls are synchronous; they don't actually get bulked or use the bulkCtx that is created for the child context.

Contributor Author

I see, I might be able to test it with APIKeyUpdate

Contributor Author
@juliaElastic Nov 27, 2023

It doesn't seem to work with APIKeyUpdate either; now the test times out after 10m. I tried to Run the bulker, but that doesn't seem to help.
I'm still getting the error with the invalid ES host; do you know how we can mock the ES client in the bulker for this test?

dial tcp: lookup remote-es: no such host","message":"Error sending bulk API Key update request to Elasticsearch

EDIT: I managed to mock the ES client, and now I'm not getting any errors. Shouldn't APIKeyUpdate use bulkCtx? It calls waitBulkAction.

    engine.go:256: {"level":"info","name":"remote","message":"remote output configuration has changed"}
    engine.go:187: {"level":"debug","outputName":"remote","message":"Bulk context done"}
    engine.go:176: {"level":"debug","outputName":"remote","message":"Bulker started"}
    opApiKey.go:231: {"level":"debug","IDs":[""],"role":,"message":"API Keys updated."}
    bulk_test.go:372: <nil>
    bulk_test.go:375: Expected context cancel err:  <nil>
--- FAIL: TestCancelCtxChildBulkerReplaced (5.00s)

Maybe there is a way to validate that "Bulk context done" is logged?
I feel like I'm spending too much time on this test.

I'm also seeing a data race warning in buildkite for this test.

Contributor Author

@michel-laterman I removed this test for now, as I couldn't get it working.

toRetireAPIKeys = &model.ToRetireAPIKeyIdsItems{
ID: agentOutput.APIKeyID,
RetiredAt: time.Now().UTC().Format(time.RFC3339),
Output: agentOutputName,
Contributor

This seems like a good solution to get this out, thanks for all the effort!

@joshdover commented Nov 27, 2023

> How complex would it be to report this via the control protocol? Since we aren't supporting this on Serverless for now, we may want to just do that option first so we don't have to block this feature or only release it as beta. It could still be a follow up to this PR though.
>
> I'm not familiar with the control protocol, how can Fleet Server send a UnitObserved message? Is it the same as the component unit state in the agent doc?

This is done here in Fleet Server's "agent" module for interacting with the control protocol:

func (a *Agent) UpdateState(state client.UnitState, message string, payload map[string]interface{}) error {

EDIT: I think the state is actually updated in the fleet module, see these examples:

f.reporter.UpdateState(client.UnitStateFailed, fmt.Sprintf("Error - %s", err), nil) //nolint:errcheck // unclear on what should we do if updating the status fails?

I believe we could modify the output unit's state to degraded with an informative message whenever one of the remote outputs is not accessible.
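
For example, reusing the reporter pattern shown above (sketch only; where this would be called from, and for which unit, is exactly the open question):

```go
// Sketch: mark the output unit degraded when a remote output is unreachable.
// outputName and err come from whatever health check ends up being used.
f.reporter.UpdateState(client.UnitStateDegraded,
	fmt.Sprintf("remote ES output %s is not accessible: %v", outputName, err),
	nil) //nolint:errcheck // same caveat as the existing call above
```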

@juliaElastic commented Nov 27, 2023

> How complex would it be to report this via the control protocol? Since we aren't supporting this on Serverless for now, we may want to just do that option first so we don't have to block this feature or only release it as beta. It could still be a follow up to this PR though.
>
> I'm not familiar with the control protocol, how can Fleet Server send a UnitObserved message? Is it the same as the component unit state in the agent doc?
>
> This is done here in Fleet Server's "agent" module for interacting with the control protocol:
>
> func (a *Agent) UpdateState(state client.UnitState, message string, payload map[string]interface{}) error {
>
> EDIT: I think the state is actually updated in the fleet module, see these examples:
>
> f.reporter.UpdateState(client.UnitStateFailed, fmt.Sprintf("Error - %s", err), nil) //nolint:errcheck // unclear on what should we do if updating the status fails?
>
> I believe we could modify the output unit's state to degraded with an informative message whenever one of the remote outputs is not accessible.

Yes, Craig gave some pointers last week, though it's not as easy as it sounds: the outputs are not assigned to the Fleet Server policy, so these units would have to be added as extra units to the state reporter. I'll have to dig into how this works, which I'll do in a follow-up PR.

@joshdover

Yeah we might have to hack it somehow and just use the single output Fleet Server is assigned to report on all remote outputs it connects to. Agree on a follow up 👍

@@ -748,6 +748,13 @@ func processPolicy(ctx context.Context, zlog zerolog.Logger, bulker bulk.Bulk, a
}

data := model.ClonePolicyData(pp.Policy.Data)
for policyName, policyOutput := range data.Outputs {
err := policy.ProcessOutputSecret(ctx, policyOutput, bulker)
@juliaElastic Nov 27, 2023

This is needed to read the remote ES output's service_token from .fleet-secrets before calling Prepare.
Tested together with the Kibana change that saves the service_token as a secret: elastic/kibana#171875
This function is a noop if output secrets are not enabled.
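
Roughly the idea, with hypothetical field and helper names (not the actual implementation):

```go
// Sketch: if the output's service_token is stored as a secret reference,
// resolve it from .fleet-secrets and inject the plain value before Prepare
// runs. readSecret and the field names here are hypothetical.
if ref := output.Secrets.ServiceToken.ID; ref != "" {
	token, err := readSecret(ctx, bulker, ref) // hypothetical lookup in .fleet-secrets
	if err != nil {
		return err
	}
	output.ServiceToken = token
}
```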

@juliaElastic commented Nov 28, 2023

> Yeah we might have to hack it somehow and just use the single output Fleet Server is assigned to report on all remote outputs it connects to. Agree on a follow up 👍

@joshdover I'm coming back to this: if we start reporting a degraded state on at least one unit, Fleet Server goes to an overall unhealthy state and shows degraded on the /api/status API. We want to avoid this, as discussed before, so I'm not sure there is any option other than reporting output state separately from the fleet-server component state.

EDIT: I'm planning to continue with the data stream output health reporting instead, assuming it is required for 8.12: #3116

jpdjere pushed a commit to jpdjere/kibana that referenced this pull request Nov 28, 2023
## Summary

Related to elastic#104986

Making remote ES output's service_token a secret.

fleet-server change here:
elastic/fleet-server#3051 (comment)

Steps to verify:
- Enable remote ES output and output secrets locally in `kibana.dev.yml`:
```
xpack.fleet.enableExperimental: ['remoteESOutput', 'outputSecretsStorage']
```
- Start ES, Kibana, and fleet-server locally, and start a second ES locally
  - see detailed steps here: elastic/fleet-server#3051
- Create a remote ES output, verify that the service_token is stored as a secret reference
```
GET .kibana_ingest/_search?q=type:ingest-outputs
```
- Verify that the enrolled agent sends data to the remote ES successfully

<img width="561" alt="image" src="https://github.com/elastic/kibana/assets/90178898/122d9800-a2ec-47f8-97a7-acf64b87172a">
<img width="549" alt="image" src="https://github.com/elastic/kibana/assets/90178898/e1751bdd-5aaf-4f68-9f92-7076b306cdfe">



### Checklist

- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios

@juliaElastic merged commit 992c2dc into elastic:main on Nov 29, 2023
9 checks passed
Labels: enhancement
5 participants