[7.17](backport #39392) [Bug] fix high IO after sudden filebeat stop (#35893) #39795

mergify · 2024-06-04T12:49:13Z

Proposed commit message

When the log file is corrupted (which is likely after a sudden, unclean system shutdown), we set a flag that forces an immediate checkpoint, but the flag is never cleared afterwards. As a result, Filebeat checkpoints on every log operation, which causes a high IO load on the server and makes Filebeat fall behind.

This change resets the logInvalid flag after a successful checkpoint.

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have made corresponding changes to the default configuration files
I have added tests that prove my fix is effective or that my feature works
I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

[ ]

How to test this PR locally

I have not in fact tested the PR. I have only checked that it builds. I'm also not 100% sure of the correctness, but the change is really simple and I hope that a maintainer can quickly confirm whether the fix is correct. In any case the current code clearly will cause the issues I described, since logInvalid is only ever set to true, never to false. Therefore I think that the beat trivially cannot recover.

Related issues

Closes #35893 (High io consumption after sudden filebeat stop)

Logs

See details in #35893. The issue fixed here manifests itself with a log message such as:

WARN memlog/store.go:130 Incomplete or corrupted log file in /var/lib/filebeat/registry/filebeat. Continue with last known complete and consistent state. Reason: unexpected EOF

(or invalid character '\x00' instead of EOF)

This is an automatic backport of pull request #39392 done by Mergify.

In case of corrupted log file (which has good chances to happen in case of sudden unclean system shutdown), we set a flag which causes us to checkpoint immediately, but never do anything else besides that. This causes Filebeat to just checkpoint on each log operation (therefore causing a high IO load on the server and also causing Filebeat to fall behind).

This change resets the logInvalid flag after a successful checkpointing.

Co-authored-by: Tiago Queiroz <[email protected]>
(cherry picked from commit 217f5a6)

# Conflicts:
#	libbeat/statestore/backend/memlog/diskstore.go

mergify · 2024-06-04T12:49:16Z

Cherry-pick of 217f5a6 has failed:

On branch mergify/bp/7.17/pr-39392
Your branch is up to date with 'origin/7.17'.

You are currently cherry-picking commit 217f5a6264.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	modified:   CHANGELOG.next.asciidoc
	new file:   libbeat/statestore/backend/memlog/store_test.go

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   libbeat/statestore/backend/memlog/diskstore.go

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

mergify · 2024-06-10T04:02:04Z

This pull request has not been merged yet. Could you please review and merge it @belimawr? 🙏

elasticmachine · 2024-06-10T07:16:54Z

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

belimawr · 2024-06-10T16:42:59Z

I'm not quite sure why the test failed; I didn't find a clear error message aside from `Error: failed modules: kubernetes`. I'll try re-running the tests.

mergify · 2024-06-17T04:00:55Z

This pull request has not been merged yet. Could you please review and merge it @belimawr? 🙏

mergify · 2024-06-24T04:01:04Z

This pull request has not been merged yet. Could you please review and merge it @belimawr? 🙏

mergify bot requested a review from a team as a code owner June 4, 2024 12:49

mergify bot added the backport and conflicts (There is a conflict in the backported pull request) labels Jun 4, 2024

mergify bot requested review from belimawr and fearful-symmetry and removed request for a team June 4, 2024 12:49

botelastic bot added the needs_team (Indicates that the issue/PR needs a Team:* label) label Jun 4, 2024

belimawr self-assigned this Jun 4, 2024

belimawr added 3 commits June 7, 2024 09:25

Fix backport merge conflicts

429d341

Fix lint warnings

38119b8

Fix lint warning without changing behaviour, all errors that were not handled are only logged.

fix lint warning

20ad844

belimawr approved these changes Jun 7, 2024

pierrehilbert added the Team:Elastic-Agent-Data-Plane (Label for the Agent Data Plane team) label Jun 10, 2024

botelastic bot removed the needs_team (Indicates that the issue/PR needs a Team:* label) label Jun 10, 2024

Merge branch '7.17' into mergify/bp/7.17/pr-39392

6e7dafd

pierrehilbert and others added 2 commits June 13, 2024 15:36

Merge branch '7.17' into mergify/bp/7.17/pr-39392

c151f10

Update deprecated Kubernetes API and improve error reporting

0cf9cd1

This commit improves error reporting in Go integration tests, when a module fails, its name and error are collected and printed at the end. The deprecated `batch/v1beta1` is replaced by `batch/v1` in Kubernetes manifests.

belimawr requested a review from a team as a code owner June 13, 2024 21:49

pierrehilbert added 2 commits June 17, 2024 08:51

Merge branch '7.17' into mergify/bp/7.17/pr-39392

bbb1c21

Merge branch '7.17' into mergify/bp/7.17/pr-39392

94650a9

Merge branch '7.17' into mergify/bp/7.17/pr-39392

a089708

belimawr merged commit 4cc14a1 into 7.17 Jun 24, 2024
111 of 115 checks passed

belimawr deleted the mergify/bp/7.17/pr-39392 branch June 24, 2024 14:52
