[7.17](backport #39392) [Bug] fix high IO after sudden filebeat stop (#35893) #39795
In case of corrupted log file (which has good chances to happen in case of sudden unclean system shutdown), we set a flag which causes us to checkpoint immediately, but never do anything else besides that. This causes Filebeat to just checkpoint on each log operation (therefore causing a high IO load on the server and also causing Filebeat to fall behind).

This change resets the logInvalid flag after a successful checkpointing.

Co-authored-by: Tiago Queiroz <[email protected]>
(cherry picked from commit 217f5a6)

# Conflicts:
#	libbeat/statestore/backend/memlog/diskstore.go
Cherry-pick of 217f5a6 has failed:
To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally
Fix lint warning without changing behaviour, all errors that were not handled are only logged.
This pull request has not been merged yet. Could you please review and merge it @belimawr? 🙏
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
I'm not quite sure why the test failed, I didn't find a clear error message aside from
This commit improves error reporting in Go integration tests: when a module fails, its name and error are collected and printed at the end. The deprecated `batch/v1beta1` is replaced by `batch/v1` in Kubernetes manifests.
Proposed commit message
If the log file is corrupted (which is likely after a sudden unclean system shutdown), we set a flag that causes us to checkpoint immediately, but we never do anything else with it. As a result, Filebeat checkpoints on every log operation, which causes a high IO load on the server and makes Filebeat fall behind.
This change resets the logInvalid flag after a successful checkpointing.
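The behaviour described above can be illustrated with a minimal sketch. Only the `logInvalid` flag name comes from this PR; the `store` type, its counters, and the method names are hypothetical simplifications and do not mirror the actual code in `libbeat/statestore/backend/memlog/diskstore.go`.

```go
package main

import "errors"

// store is a hypothetical, heavily simplified stand-in for the disk store.
type store struct {
	logInvalid     bool // set when the on-disk log is found to be corrupted
	checkpoints    int  // number of full checkpoints written (expensive)
	appends        int  // number of cheap log appends
	failCheckpoint bool // test knob: simulate a failing checkpoint
}

var errCheckpoint = errors.New("checkpoint failed")

// writeCheckpoint writes the full state and, on success, clears logInvalid.
// Clearing the flag here is the idea behind the fix; without it, every
// later operation would take the checkpoint path again.
func (s *store) writeCheckpoint() error {
	if s.failCheckpoint {
		return errCheckpoint // flag stays set so the next operation retries
	}
	s.checkpoints++
	s.logInvalid = false
	return nil
}

// logOperation records one update: a full checkpoint if the log is
// invalid, otherwise a cheap append.
func (s *store) logOperation() error {
	if s.logInvalid {
		return s.writeCheckpoint()
	}
	s.appends++
	return nil
}
```

In this sketch, five calls to `logOperation` on a store that starts with `logInvalid` set produce one checkpoint followed by four appends. If the `s.logInvalid = false` line is removed, the same five calls produce five checkpoints, which is the high-IO pattern reported in #35893.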
Checklist
`CHANGELOG.next.asciidoc` or `CHANGELOG-developer.next.asciidoc`.
Author's Checklist
How to test this PR locally
I have not in fact tested the PR. I have only checked that it builds. I'm also not 100% sure of the correctness, but the change is really simple and I hope that a maintainer can quickly confirm whether the fix is correct. In any case the current code clearly will cause the issues I described, since `logInvalid` is only ever set to `true`, never to `false`. Therefore I think that the beat trivially cannot recover.
Related issues
Logs
See details in #35893 but the issue we fix otherwise manifests itself with a message of:
(or `invalid character '\x00'` instead of EOF)
This is an automatic backport of pull request #39392 done by Mergify.
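For context on the error messages quoted in the Logs section, the following sketch shows how Go's standard `encoding/json` decoder reports a truncated entry and a NUL-filled entry, two shapes a log file can take after an unclean shutdown. That the store decodes its log with `encoding/json` is an assumption here, and `decodeEntry` is a hypothetical helper, not a function from the Beats code base.

```go
package main

import (
	"bytes"
	"encoding/json"
)

// decodeEntry tries to decode a single JSON log entry from raw bytes.
// Hypothetical helper for illustration only.
func decodeEntry(raw []byte) (map[string]interface{}, error) {
	var entry map[string]interface{}
	err := json.NewDecoder(bytes.NewReader(raw)).Decode(&entry)
	return entry, err
}
```

With this helper, empty input yields the error `EOF`, an entry cut off mid-object such as `{"op":"set"` yields `unexpected EOF`, and a block of NUL bytes yields `invalid character '\x00' looking for beginning of value`.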