Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[7.17](backport #39392) [Bug] fix high IO after sudden filebeat stop (#35893) #39795

Merged
merged 10 commits into from
Jun 24, 2024

Conversation

mergify[bot]
Copy link
Contributor

@mergify mergify bot commented Jun 4, 2024

Proposed commit message

In case of corrupted log file (which has good chances to happen in case of sudden unclean system shutdown), we set a flag which causes us to checkpoint immediately, but never do anything else besides that. This causes filebeat to just checkpoint on each log operation (therefore causing a high IO load on the server and also causing filebeat to fall behind).

This change resets the logInvalid flag after a successful checkpointing.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

  • [ ]

How to test this PR locally

I have not in fact tested the PR. I have only checked that it builds. I'm also not 100% sure of the correctness, but the change is really simple and I hope that a maintainer can quickly confirm whether the fix is correct. In any case the current code clearly will cause the issues I described, since logInvalid is only ever set to true, never to false. Therefore I think that the beat trivially cannot recover.

Related issues

Logs

See details in #35893 but the issue we fix otherwise manifests itself with a message of:

WARN memlog/store.go:130 Incomplete or corrupted log file in /var/lib/filebeat/registry/filebeat. Continue with last known complete and consistent state. Reason: unexpected EOF

(or invalid character '\x00' instead of EOF)


This is an automatic backport of pull request #39392 done by Mergify.

In case of corrupted log file (which has good chances to happen in case
of sudden unclean system shutdown), we set a flag which causes us to
checkpoint immediately, but never do anything else besides that. This
causes Filebeat to just checkpoint on each log operation (therefore
causing a high IO load on the server and also causing Filebeat to fall
behind).

This change resets the logInvalid flag after a successful checkpointing.

Co-authored-by: Tiago Queiroz <[email protected]>
(cherry picked from commit 217f5a6)

# Conflicts:
#	libbeat/statestore/backend/memlog/diskstore.go
@mergify mergify bot requested a review from a team as a code owner June 4, 2024 12:49
@mergify mergify bot added backport conflicts There is a conflict in the backported pull request labels Jun 4, 2024
@mergify mergify bot requested review from belimawr and fearful-symmetry and removed request for a team June 4, 2024 12:49
Copy link
Contributor Author

mergify bot commented Jun 4, 2024

Cherry-pick of 217f5a6 has failed:

On branch mergify/bp/7.17/pr-39392
Your branch is up to date with 'origin/7.17'.

You are currently cherry-picking commit 217f5a6264.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	modified:   CHANGELOG.next.asciidoc
	new file:   libbeat/statestore/backend/memlog/store_test.go

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   libbeat/statestore/backend/memlog/diskstore.go

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Jun 4, 2024
@belimawr belimawr self-assigned this Jun 4, 2024
Fix lint warning without changing behaviour, all errors that were not
handled are only logged.
Copy link
Contributor Author

mergify bot commented Jun 10, 2024

This pull request has not been merged yet. Could you please review and merge it @belimawr? 🙏

@pierrehilbert pierrehilbert added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Jun 10, 2024
@elasticmachine
Copy link
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Jun 10, 2024
@belimawr
Copy link
Contributor

I'm not quite sure why the test failed, I didn't find a clear error message aside from Error: failed modules: kubernetes. I'll try re-running the tests.

pierrehilbert and others added 2 commits June 13, 2024 15:36
This commit improves error reporting in Go integration tests, when a
module fails, its name and error are collected and printed at the end.

The deprecated `batch/v1beta1` is replaced by `batch/v1` in Kubernetes
manifests.
@belimawr belimawr requested a review from a team as a code owner June 13, 2024 21:49
Copy link
Contributor Author

mergify bot commented Jun 17, 2024

This pull request has not been merged yet. Could you please review and merge it @belimawr? 🙏

Copy link
Contributor Author

mergify bot commented Jun 24, 2024

This pull request has not been merged yet. Could you please review and merge it @belimawr? 🙏

@belimawr belimawr merged commit 4cc14a1 into 7.17 Jun 24, 2024
111 of 115 checks passed
@belimawr belimawr deleted the mergify/bp/7.17/pr-39392 branch June 24, 2024 14:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport conflicts There is a conflict in the backported pull request Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants