Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support serverless #36649

Merged
merged 48 commits into from
Oct 4, 2023
Merged

Support serverless #36649

merged 48 commits into from
Oct 4, 2023

Conversation

fearful-symmetry
Copy link
Contributor

@fearful-symmetry fearful-symmetry commented Sep 21, 2023

This is a rather large PR that updates beats to provide DSL support and setup on serverless. Some of the parts are:

  • Update the config templates to add a setup.dsl.* section
  • Renaming a considerable number of files in libbeat/idxmgmt, either to remote "ILM" or for the sake of clarity
  • Refactor of the lifecycle managers and index manages to support serverless awareness as part of the setup process
  • Major refactor of the setup client handlers, moving large parts of the index management process to the client handlers, to enable simpler switching between ILM/DSL
  • Changes to the index template process to enable the DSL setup
  • Additional tests

As an added note: the index management code is pretty complex and labyrinthine, and while I tried to simplify some things, it's still (needlessly) complex. I'm fairly certain that it would be possible to significantly simplify a lot of this code, but due to the tight deadlines of serverless support, it probably won't happen as part of this PR.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

When I started working on this, my hope was that I could just glue on DSL support without touching the existing ILM code, but that wasn't the case. This means we'll need to test new DSL support, and existing ILM support.

Here's a high-level overview of suggested tests, which will need to be performed against both serverless and stateful instances

  • Test [beat] setup --index-management against a fresh ES instance, with all default settings
  • Test [beat] setup --index-management against a fresh ES instance, with a custom policy file
  • Test [beat] setup --index-management against a fresh ES instance, with both ILM and DSL enabled in the config, ensure beats fails
  • Test [beat] setup --index-management against an existing ES instance with existing datastream config, all default settings, insure we don't overwrite
  • Test [beat] setup --index-management against an existing ES instance with existing datastream config, with overwrite enabled, ensure we have overwritten data.
  • Test [beat] setup --index-management with no setup.* flags set in the config, ensure defaults are properly set up.
  • Run the beat against a new ES instance
  • Run the beat against an existing ES instance that has existing setup data from an older beat version
  • Test the setup process twice with overwrite enabled and a custom policy on the second setup. Ensure the policy is successfully updated.

Tests to run over serverless only:

  • set setup.template.name and setup.dsl.policy_name to different values, run setup --index-management. A warning message should be printed.
  • set setup.template.name and setup.dsl.policy_name to the same value, run setup --index-management, then run again with a custom policy, and setup.dsl.overwrite set to true

To create a custom policy for testing:

For DSL: create a file with {"data_retention": "71d"}, point to it with the setup.dsl.policy_file value

For ILM: ILM policies are a little more complex, see here for an example, point to it with the setup.ilm.policy_file value

Validating setup

To validate the setup --index-management, there are two ES endpoints to check: _data_stream points to the data stream, and contains a lifecycle section that should contain a valid lifecycle policy (see the above section for an example). The other endpoint is _index_template, which should contain a matching index template that links to the data stream.

Known issues

  • Due to the weird behavior of the config library in elastic-agent-libs (issue forthcoming), if a user has both setup.ilm.* and setup.dsl.* config values, one will need to be explicitly set to disabled, even if the other is explicitly set to enabled

Config principles

Beats now has to care about both ILM and DSL config, as well as the upstream elasticsearch. Beats should behave based on these principles:

  • If a user has enabled one lifecycle type but the upstream ES is the other type, return an error
  • If a user has enabled both lifecycle types, return an error
  • If the user has configured nothing, create the lifecycle policy based on the connected ES type
  • During initial setup, create the lifecycle policy by injecting it directly into the template. To update the policy, perform a separate PUT call without modifying the template.

Undefined behavior

Right now, there's some edge cases where I'm not sure about the correct behavior.

  • If the user defines a custom template name but not a custom DSL policy name, then initial setup will work, but a second setup with overwrite enabled will fail, This is because the initial index management setup injects the lifecycle policy directly into the template policy. However, to update the DSL policy, we must make a separate REST call to an endpoint that includes the template name. If the user has set a custom template name but not a policy name, the code will revert to a default (and incorrect) template name. Should the user be expected to set template.name and policy_name? Should the code silently defer to template.name? Should the initial setup fail and tell the user to correct their config if one is set but not the other?

@fearful-symmetry fearful-symmetry self-assigned this Sep 21, 2023
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Sep 21, 2023
@mergify
Copy link
Contributor

mergify bot commented Sep 21, 2023

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @fearful-symmetry? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

@fearful-symmetry fearful-symmetry added the Team:Elastic-Agent Label for the Agent team label Sep 21, 2023
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Sep 21, 2023
@elasticmachine
Copy link
Collaborator

elasticmachine commented Sep 21, 2023

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2023-10-03T23:22:25.660+0000

  • Duration: 28 min 57 sec

❕ Flaky test report

No test was executed to be analysed.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@joshdover
Copy link
Contributor

joshdover commented Sep 22, 2023

Excited to see this! I will review more next week, but I first wanted to just test it out but I'm having some trouble. After building metricbeat with this PR, I'm running setup against a Serverless instance on production and the process is hanging on the initial ping. I tried running regular setup and with the --index-management flag and got the same results.

$ ./metricbeat -e -d '*' -E output.elasticsearch.hosts=https://playground-o11y-f45408.es.us-east-1.aws.elastic.cloud -E output.elasticsearch.api_key=XXXX setup

{"log.level":"info","@timestamp":"2023-09-22T19:00:48.004+0200","log.origin":{"file.name":"instance/beat.go","file.line":808},"message":"Home path: [/Users/jdover/src/beats/metricbeat] Config path: [/Users/jdover/src/beats/metricbeat] Data path: [/Users/jdover/src/beats/metricbeat/data] Logs path: [/Users/jdover/src/beats/metricbeat/logs]","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-09-22T19:00:48.005+0200","log.logger":"beat","log.origin":{"file.name":"instance/beat.go","file.line":899},"message":"Beat metadata path: /Users/jdover/src/beats/metricbeat/data/meta.json","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-09-22T19:00:48.006+0200","log.origin":{"file.name":"instance/beat.go","file.line":816},"message":"Beat ID: faab8154-7ceb-470f-b2ca-0aee636951c3","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-09-22T19:00:48.053+0200","log.logger":"esclientleg","log.origin":{"file.name":"eslegclient/connection.go","file.line":122},"message":"elasticsearch url: https://playground-o11y-f45408.es.us-east-1.aws.elastic.cloud:9200","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-09-22T19:00:48.055+0200","log.logger":"esclientleg","log.origin":{"file.name":"eslegclient/connection.go","file.line":284},"message":"ES Ping(url=https://playground-o11y-f45408.es.us-east-1.aws.elastic.cloud:9200)","service.name":"metricbeat","ecs.version":"1.6.0"}

I also tried completely omitting the API key to see if I got an auth error and adding the port number explicitly to the ES host URL but it all hangs on this initial ping.

Anything I might be missing?

@fearful-symmetry
Copy link
Contributor Author

fearful-symmetry commented Sep 22, 2023

@joshdover So, I've run into this problem a few times. The URL that beats is trying to connect to is

https://playground-o11y-f45408.es.us-east-1.aws.elastic.cloud:9200

and I did a curl myself, no response. I'm guessing you entered https://playground-o11y-f45408.es.us-east-1.aws.elastic.cloud and beats "helpfully" added :9200 when the port is just vanilla TLS 443. Just manually specify the port and it should be fine.

@cmacknz cmacknz added the QA:Needs Validation Needs validation by the QA Team label Oct 3, 2023
@cmacknz
Copy link
Member

cmacknz commented Oct 3, 2023

Yah, agreed, we need to set aside time and decide what various edge cases are, since for some things like template name versus policy name I'm not entirely sure what the correct behavior should be.

@fearful-symmetry since you are closest to this you can create this list :)

In particular create issues for any parts of this that are unclear to you so that we can discuss them.

@cmacknz
Copy link
Member

cmacknz commented Oct 3, 2023

Test [beat] setup --index-management against a fresh ES instance, with all default settings

./filebeat setup --index-management
Overwriting lifecycle policy is disabled. Set `setup.ilm.overwrite: true` or `setup.dsl.overwrite: true` to overwrite.
Index setup finished.

This error message should be special cased to either ILM or DSL depending on which type we detect we are connected to. Remember this error is going to be seen by users who have never used Filebeat before, if we don't tell them exactly what to do they likely won't know what the solution is themselves.

#setup.dsl.check_exists: true

# Overwrite the lifecycle policy at startup. The default is false.
#setup.dsl.overwrite: false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need a newline at the end of the file otherwise there is a missing space between the end of this section and the next one:

# Overwrite the lifecycle policy at startup. The default is false.
#setup.dsl.overwrite: false
# =================================== Kibana ===================================

# Starting with Beats version 6.0.0, the dashboards are loaded via the Kibana API.
# This requires a Kibana endpoint configuration.

@cmacknz
Copy link
Member

cmacknz commented Oct 3, 2023

Similarly for the error:

./filebeat setup --index-management
Exiting: error creating index management handler: error creating ES handler: only one lifecycle management type can be used, but both ILM and DSL are enabled

We should be able to suggest which one of ILM or DSL should be used because we know which ES type we are connected to.

@cmacknz
Copy link
Member

cmacknz commented Oct 3, 2023

If I set the following DSL configuration I get an error. If I remove the policy_name setting and use the default it works:

setup.dsl.enabled: true
# Set the lifecycle policy name. The default policy name is
# 'filebeat'.
setup.dsl.policy_name: mypolicy
setup.dsl.policy_file: "dsl_policy.json"
setup.dsl.overwrite: true
❯ cat dsl_policy.json
{"data_retention": "71d"}

❯ ./filebeat setup --index-management
Exiting: error loading template: error updating lifecycle policy: error creating policy from config: error submitting policy: error creating lifecycle policy: got 404 from elasticsearch: {"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index [mypolicy]","resource.type":"index_or_alias","resource.id":"mypolicy","index_uuid":"_na_","index":"mypolicy"}],"type":"index_not_found_exception","reason":"no such index [mypolicy]","resource.type":"index_or_alias","resource.id":"mypolicy","index_uuid":"_na_","index":"mypolicy"},"status":404}

If I set setup.dsl.policy_name: filebeat-* then it works.

It seems like the policy_name for DSL is actually an index pattern. Is that correct? If so the parameter name should be changed and the documentation updated.

@fearful-symmetry
Copy link
Contributor Author

@cmacknz Yah, the policy_name is the name of the data stream that the policy is for.

My initial idea was just to keep the DSL and ILM config the same, but you're probably right, policy_name is a tad confusing here. Perhaps data_stream_name makes more sense.

@cmacknz
Copy link
Member

cmacknz commented Oct 3, 2023

Perhaps data_stream_name makes more sense.

It seems like it can be a pattern so data_stream_pattern would be best, assuming that is right.

#setup.dsl.enabled: true

# Set the lifecycle policy name or pattern. For DSL, this name must match the data stream that the lifecycle is for.
# The default data stream pattern is %{[beat.name]}-%{[beat.version]}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Were you intended to actually have %{[beat.name]}-%{[beat.version]} be templated or did you want this literal text in the docs?

I don't know that I've seen this format used elsewhere but I might be missing it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, you're right, that's probably not the right way to do that...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, turns out there's no template variable for .VersionName or anything similar, so the next best thing is metricbeat-%{[agent.version]} an such.

@fearful-symmetry
Copy link
Contributor Author

Kinda baffled by the tar error in the packaging step, let's try that again...

@fearful-symmetry
Copy link
Contributor Author

/test

@fearful-symmetry
Copy link
Contributor Author

Alright, think we're at the point where we can force-merge?

@cmacknz
Copy link
Member

cmacknz commented Oct 4, 2023

Yes, one more test and I'll merge it. I still think some of the error messages can be improved but that doesn't need to block this PR, will file a follow up for that.

Copy link
Member

@cmacknz cmacknz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-tested the latest DSL changes with the data_stream_pattern config, still works. Thank!

Merging this one.

@cmacknz
Copy link
Member

cmacknz commented Oct 4, 2023

@cmacknz cmacknz added the QA:Ready For Testing Code is merged and ready for QA to validate label Oct 4, 2023
@cmacknz
Copy link
Member

cmacknz commented Oct 4, 2023

@elastic/fleet-qasource-external please test that ILM continues to work.

You will need to add manual test cases for the new data stream lifecycle (DSL) configuration when using a serverless project. The instructions are in the PR description.

It would also be a good idea to do some exploratory testing around ILM and DSL.

@cmacknz
Copy link
Member

cmacknz commented Oct 10, 2023

@elastic/fleet-qasource-external we will want to test this for each Beat individually to confirm that:

  1. Configuring DSL with a serverless project works as expected.
  2. Configuring ILM with a stateful deployment continues to work as it did before.

@cmacknz
Copy link
Member

cmacknz commented Oct 11, 2023

@elastic/fleet-qasource-external The reference documentation for configuring ILM can be found in each Beat's reference configuration file:

ILM

# ====================== Index Lifecycle Management (ILM) ======================
# Configure index lifecycle management (ILM) to manage the backing indices
# of your data streams.
# Enable ILM support. Valid values are true, or false.
#setup.ilm.enabled: true
# Set the lifecycle policy name. The default policy name is
# 'beatname'.
#setup.ilm.policy_name: "mypolicy"
# The path to a JSON file that contains a lifecycle policy configuration. Used
# to load your own lifecycle policy.
#setup.ilm.policy_file:
# Disable the check for an existing lifecycle policy. The default is true. If
# you disable this check, set setup.ilm.overwrite: true so the lifecycle policy
# can be installed.
#setup.ilm.check_exists: true
# Overwrite the lifecycle policy at startup. The default is false.
#setup.ilm.overwrite: false

A sample ILM configuration is below which can be saved to a file for testing:

{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_age": "10d",
            "max_primary_shard_size": "50gb"
          }
        }
      }
    }
  }
}

A sample ILM configuration for use with filebeat setup --index-management where the configuration is saved to ilm_template.json above is:

# ====================== Index Lifecycle Management (ILM) ======================
setup.ilm.enabled: false
setup.ilm.policy_name: "mypolicy"
setup.ilm.policy_file: "ilm_template.json"
#setup.ilm.check_exists: true
setup.ilm.overwrite: true

To setup ILM you will need to first generate an API key which can be done from stack management. Note that the API key has to have the Beats format.

Screenshot 2023-10-11 at 2 44 48 PM

The loaded lifecycle policies can be viewed in Stack Management. The default ILM policy for filebeat will be named mypolicy per the configuration above and it should match the policy you uploaded with the setup command.

Screenshot 2023-10-11 at 2 45 45 PM

DSL

# ======================== Data Stream Lifecycle (DSL) =========================
# Configure Data Stream Lifecycle to manage data streams while connected to Serverless elasticsearch.
# These settings are mutually exclusive with ILM settings which are not supported in Serverless projects.
# Enable DSL support. Valid values are true, or false.
#setup.dsl.enabled: true
# Set the lifecycle policy name or pattern. For DSL, this name must match the data stream that the lifecycle is for.
# The default data stream pattern is filebeat-%{[agent.version]}"
# The template string `%{[agent.version]}` will resolve to the current stack version.
# The other possible template value is `%{[beat.name]}`.
#setup.dsl.data_stream_pattern: "filebeat-%{[agent.version]}"
# The path to a JSON file that contains a lifecycle policy configuration. Used
# to load your own lifecycle policy.
# If no custom policy is specified, a default policy with a lifetime of 7 days will be created.
#setup.dsl.policy_file:
# Disable the check for an existing lifecycle policy. The default is true. If
# you disable this check, set setup.dsl.overwrite: true so the lifecycle policy
# can be installed.
#setup.dsl.check_exists: true
# Overwrite the lifecycle policy at startup. The default is false.
#setup.dsl.overwrite: false

To setup DSL you will need to generate an API key which can be done from the Serverless project security page:

Screenshot 2023-10-11 at 2 48 03 PM

A sample DSL configuration is below which can be saved to a file for testing:

{"data_retention": "5d"}

If you run filebeat setup --index-management with the example configuration below where dsl_policy.json contains the policy above you should see the following on the Index Management page:

# ======================== Data Stream Lifecycle (DSL) =========================
setup.dsl.enabled: true
setup.dsl.data_stream_pattern: "filebeat-*"
setup.dsl.policy_file: "dsl_policy.json"
# setup.dsl.check_exists: true
setup.dsl.overwrite: true
Screenshot 2023-10-11 at 2 38 46 PM

@amolnater-qasource
Copy link

Hi @cmacknz

Thank you for all the details and helping us setting the beats over Serverless.

We have raised an issue for data unavailability under Discover tab at #36873

Further, we will continue our testing on the same.

Thanks!

@amolnater-qasource
Copy link

Hi @cmacknz

We have completed the testing by installing all beats on stateful 8.12.0 SNAPSHOT kibana cloud environment and 8.12.0 SNAPSHOT serverless environment.

Beats Installed:

  • Filebeat
  • Metricbeat
  • Winlogbeat
  • Heartbeat
  • Auditbeat
  • Packetbeat

Build details:

8.12.0 SNAPSHOT Stateful
BUILD: 68707
COMMIT: c03b54d44cf3945a6d0b04b2534d034d9b39ec41

8.12.0 SNAPSHOT Serverless
BUILD: 68602
COMMIT: 7a559ff0cadf81226fd5901445a3e05545c6123d
S.No. Scenario Filebeat Metricbeat Winlogbeat Auditbeat Heartbeat Packetbeat
1 [Windows]ES instance, with both ILM and DSL enabled in the config, ensure beats fails for serverless/stateful. PASS PASS PASS PASS PASS PASS
2 [Windows]ES instance with existing datastream config, all default settings, with ILM overwrite disabled. PASS PASS PASS PASS PASS PASS
3 [Windows]ES instance with existing datastream config, with ILM overwrite enabled. PASS PASS PASS PASS PASS PASS
4 [Windows]ES instance with existing datastream config, all default settings, with DSL overwrite disabled. PASS PASS PASS PASS PASS PASS
5 [Windows]ES instance with existing datastream config, with DSL overwrite enabled. PASS PASS PASS PASS PASS PASS
6 [Windows]Beats data on for 8.12.0 SNAPSHOT Stateful environment with ILM enabled. PASS PASS PASS PASS PASS PASS
7 [Windows]Beats data on for 8.12.0 SNAPSHOT Serverless environment with DSL enabled. PASS PASS PASS PASS PASS PASS
8 [Ubuntu22]ES instance, with both ILM and DSL enabled in the config, ensure beats fails for serverless/stateful. PASS PASS NA PASS PASS PASS
9 [Ubuntu22]ES instance with existing datastream config, all default settings, with ILM overwrite disabled. PASS PASS NA PASS PASS PASS
10 [Ubuntu22]ES instance with existing datastream config, with ILM overwrite enabled. PASS PASS NA PASS PASS PASS
11 [Ubuntu22]ES instance with existing datastream config, all default settings, with DSL overwrite disabled. PASS PASS NA PASS PASS PASS
12 [Ubuntu22]ES instance with existing datastream config, with DSL overwrite enabled. PASS PASS NA PASS PASS PASS
13 Ubuntu22]Beats data on for 8.12.0 SNAPSHOT Stateful environment with ILM enabled. PASS PASS NA PASS PASS PASS
14 [Ubuntu22]Beats data on for 8.12.0 SNAPSHOT Serverless environment with DSL enabled. PASS PASS NA PASS PASS PASS

Screenshots:
Stateful:
image
image

Serverless:
image

cc: @pierrehilbert @jlind23

Please let us know if we are missing any scenario needs to be covered here.

Thanks!!

@pierrehilbert
Copy link
Collaborator

Thanks for those tests, looks great!

Scholar-Li pushed a commit to Scholar-Li/beats that referenced this pull request Feb 5, 2024
* make serverless integration tests run

* update deps

* linter, error handling

* still fixing error handling

* fixing old formatting verbs

* still finding format verbs

* add docs, fix typos

* initial functional pass

* fix setup, config

* fix naming of config section

* add headers

* make linter happy

* still making linter happy

* tinkering with tests

* still fixing tests

* revert file

* tinker with export

* fix logging in tests

* fix load checking in setup

* fix url in integration test

* fix commented out test line

* stil tinkering with integraton test

* fix bad init in tests, add more check to ES handler

* add init checks for client handler, add more unit tests

* make template loader serverless aware

* change naming, error handling, rework config system

* fix up integration tests

* clean up load tests

* stil making linter happy

* simplify manager init, fix tests, update docs

* minor test fixes

* clean up tests

* clean up typos, remove legacy error handling

* expand logging

* logging, error handling changes

* change error messages

* update lifetimes for serverless elasticsearch

* fix integration tests

* change error handling, clean up log messages

* tinker with DSL config name

* update docs

* fix name example
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
QA:Needs Validation Needs validation by the QA Team QA:Ready For Testing Code is merged and ready for QA to validate Team:Elastic-Agent Label for the Agent team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants