Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean up / document metrics monitor fields #39413

Merged
merged 18 commits into from
Jun 6, 2024

Conversation

faec
Copy link
Contributor

@faec faec commented May 6, 2024

Document the semantics of many metrics monitoring variables, and rename some metrics APIs to more clearly indicate their function. As a side effect, fix several metrics reporting / publishing bugs in the Elasticsearch output, including #39146.

Add the output.events.dead_letter metric to distinguish events that were ingested to the dead letter index after a fatal error (previously these events were just reported as "acked").

This could have been a shorter fix, but it was hard to properly test since the metrics were changed from two separate functions with a lot of special cases. I ended up reorganizing the Elasticsearch Publish helpers to make the logic more clear. The new layout makes it much easier to test the error handling and metrics reporting.

The bugs fixed by this refactor are:

  • When a batch was split, the events in it were not reported to the observer via RetryableErrors. This caused activeEvents to increase permanently even after the events were handled.
  • When a previously-failed event was ingested to the dead letter index as a raw string, it was reported to the observer as Acked (success). The new logic creates a new dead_letter metric specifically for this case.
  • When a previously-failed event encountered a fatal error ingesting to the dead letter index:
    • It was reported to the observer as both a permanent error and a retryable error, giving incorrect event counts. It's now reported as just a permanent error.
    • It was added to the retry list anyway, which would block all ingestion if the error really was fatal since the queue could never advance past that event. It's now dropped, the same as with a permanent error in the main index.
  • If the Elasticsearch bulk index response was invalid, the associated events were dropped and reported as acknowledged

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Related issues

@faec faec added bug cleanup Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team labels May 6, 2024
@faec faec self-assigned this May 6, 2024
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels May 6, 2024
@faec faec requested a review from fearful-symmetry May 6, 2024 13:06
Copy link
Contributor

mergify bot commented May 6, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @faec? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

@elasticmachine
Copy link
Collaborator

💔 Tests Failed

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2024-05-06T13:06:41.317+0000

  • Duration: 122 min 9 sec

Test stats 🧪

Test Results
Failed 2
Passed 29365
Skipped 2044
Total 31411

Test errors 2

Expand to view the tests failures

Build&Test / libbeat-unitTest / TestCollectPublishFailsNone – github.com/elastic/beats/v7/libbeat/outputs/elasticsearch
    Expand to view the error details

     Failed 
    

    Expand to view the stacktrace

     === RUN   TestCollectPublishFailsNone
    --- FAIL: TestCollectPublishFailsNone (0.00s)
    panic: interface conversion: interface {} is nil, not *elasticsearch.encodedEvent [recovered]
    	panic: interface conversion: interface {} is nil, not *elasticsearch.encodedEvent
    
    goroutine 122 [running]:
    testing.tRunner.func1.2({0x147ce40, 0xc000383bc0})
    	/var/lib/jenkins/workspace/PR-39413-1-fd020dae-7cd0-4ea4-9adb-7c7ff81cb105/.gvm/versions/go1.21.9.linux.amd64/src/testing/testing.go:1545 +0x238
    testing.tRunner.func1()
    	/var/lib/jenkins/workspace/PR-39413-1-fd020dae-7cd0-4ea4-9adb-7c7ff81cb105/.gvm/versions/go1.21.9.linux.amd64/src/testing/testing.go:1548 +0x397
    panic({0x147ce40?, 0xc000383bc0?})
    	/var/lib/jenkins/workspace/PR-39413-1-fd020dae-7cd0-4ea4-9adb-7c7ff81cb105/.gvm/versions/go1.21.9.linux.amd64/src/runtime/panic.go:914 +0x21f
    github.com/elastic/beats/v7/libbeat/outputs/elasticsearch.(*Client).bulkCollectPublishFails(0xc0003eb040, {0xc000029800, 0xafd, 0xc00}, {0xc0001f8000, 0x64, 0x64})
    	/var/lib/jenkins/workspace/PR-39413-1-fd020dae-7cd0-4ea4-9adb-7c7ff81cb105/src/github.com/elastic/beats/libbeat/outputs/elasticsearch/client.go:398 +0xdcd
    github.com/elastic/beats/v7/libbeat/outputs/elasticsearch.TestCollectPublishFailsNone(0xc0003bc360?)
    	/var/lib/jenkins/workspace/PR-39413-1-fd020dae-7cd0-4ea4-9adb-7c7ff81cb105/src/github.com/elastic/beats/libbeat/outputs/elasticsearch/client_test.go:266 +0x256
    testing.tRunner(0xc0003eab60, 0x1699f30)
    	/var/lib/jenkins/workspace/PR-39413-1-fd020dae-7cd0-4ea4-9adb-7c7ff81cb105/.gvm/versions/go1.21.9.linux.amd64/src/testing/testing.go:1595 +0xff
    created by testing.(*T).Run in goroutine 1
    	/var/lib/jenkins/workspace/PR-39413-1-fd020dae-7cd0-4ea4-9adb-7c7ff81cb105/.gvm/versions/go1.21.9.linux.amd64/src/testing/testing.go:1648 +0x3ad
     
    

Build&Test / libbeat-goIntegTest / TestClientBulkPublishEventsWithDeadletterIndex – github.com/elastic/beats/v7/libbeat/outputs/elasticsearch
    Expand to view the error details

     Failed 
    

    Expand to view the stacktrace

     === RUN   TestClientBulkPublishEventsWithDeadletterIndex
        client_integration_test.go:243: Expecting mapping conflict
    --- FAIL: TestClientBulkPublishEventsWithDeadletterIndex (0.39s)
     
    

Steps errors 8

Expand to view the steps failures

filebeat-goIntegTest - mage goIntegTest
  • Took 18 min 15 sec . View more details here
  • Description: mage goIntegTest
libbeat-unitTest - mage build unitTest
  • Took 12 min 39 sec . View more details here
  • Description: mage build unitTest
libbeat-unitTest - mage build unitTest
  • Took 9 min 17 sec . View more details here
  • Description: mage build unitTest
libbeat-unitTest - mage build unitTest
  • Took 9 min 18 sec . View more details here
  • Description: mage build unitTest
libbeat-goIntegTest - mage goIntegTest
  • Took 15 min 40 sec . View more details here
  • Description: mage goIntegTest
libbeat-goIntegTest - mage goIntegTest
  • Took 11 min 1 sec . View more details here
  • Description: mage goIntegTest
libbeat-goIntegTest - mage goIntegTest
  • Took 11 min 5 sec . View more details here
  • Description: mage goIntegTest
Error signal
  • Took 0 min 0 sec . View more details here
  • Description: Error 'hudson.AbortException: script returned exit code 1'

🐛 Flaky test report

❕ There are test failures but not known flaky tests.

Expand to view the summary

Genuine test errors 2

💔 There are test failures but not known flaky tests, most likely a genuine test failure.

  • Name: Build&Test / libbeat-unitTest / TestCollectPublishFailsNone – github.com/elastic/beats/v7/libbeat/outputs/elasticsearch
  • Name: Build&Test / libbeat-goIntegTest / TestClientBulkPublishEventsWithDeadletterIndex – github.com/elastic/beats/v7/libbeat/outputs/elasticsearch

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

Copy link
Contributor

mergify bot commented May 17, 2024

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b libbeat-metrics-cleanup upstream/libbeat-metrics-cleanup
git merge upstream/main
git push upstream libbeat-metrics-cleanup

Copy link
Contributor

mergify bot commented May 25, 2024

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b libbeat-metrics-cleanup upstream/libbeat-metrics-cleanup
git merge upstream/main
git push upstream libbeat-metrics-cleanup

@faec faec requested a review from belimawr May 25, 2024 00:10
@faec faec marked this pull request as ready for review May 25, 2024 00:10
@faec faec requested a review from a team as a code owner May 25, 2024 00:10
@elasticmachine
Copy link
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

Comment on lines 463 to 464
// 409 is used to indicate an event with same ID already exists if
// `create` op_type is used.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// 409 is used to indicate an event with same ID already exists if
// `create` op_type is used.
// 409 is used to indicate an event with same ID already exists if
// `create` op_type is used.

You now also get a 409 if the TSDB dimensions are the same for two events, they form an implicit ID. This is probably the more common reason to hit this now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

const (
defaultEventType = "doc"
)

var bulkRequestParams = map[string]string{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we get some comments explaining this ES param? It's a bit odd without any context

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

// which are also reported in events.dropped.
queueACKed: monitoring.NewUint(reg, "queue.acked"),

// (Gauge) queue.filled.pct.events measures the fraction (from 0 to 1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we drop the .events from this, ahead of adding support for the queue size in bytes?

People will want to alert on this, they shouldn't have to define it twice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already part of #39774

@faec faec merged commit 37db599 into elastic:main Jun 6, 2024
105 of 108 checks passed
@faec faec deleted the libbeat-metrics-cleanup branch June 6, 2024 18:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug cleanup Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Elasticsearch output can report an incorrect count for active events
5 participants