[Bug] fix high IO after sudden filebeat stop (#35893) #39392
Conversation
💚 CLA has been signed
This pull request does not have a backport label. To fixup this pull request, you need to add the backport labels for the needed branches.
❕ Build Aborted
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
Hi @emmanueltouzery, looking at the code and the original issue, I can see it is indeed a bug. Restarting Filebeat should fix the problem, but I didn't have time to test it. To get this merged we'd need some tests, and the flag should only be reset back to `false` once the log file is back in a valid state. Would you be interested in further working on this PR? I didn't take a deep look at our current tests, but I believe it would be simple to write a unit test that simulates this condition and ensures the fix works as expected. I can try guiding you on this.
Well, first off, I'm delighted that the issue will likely be fixed.
Sorry for closing and reopening, I'm typing on a smartphone. I can confirm that restarting Filebeat fixes the issue. So if that makes sense for you, I can try to improve the PR further with your guidance, and worst case you take over. To start with, could you tell me in which file the test should be added? With regards to when to reset the flag, you mention truncating the log; is that the checkpointing operation, or does that happen elsewhere?
ping?
@belimawr I'm interested in trying to get a fix merged for this issue. Could you tell me in which file the test should be added? With regards to when to reset the flag, you mention truncating the log; is that the checkpointing operation, or does that happen elsewhere? Thank you!
Hey @emmanueltouzery, sorry for the delay in replying.
Yes, the checkpoint operation: once the log file is truncated, it will be in a valid state, so we can already reset the `logInvalid` flag there:
beats/libbeat/statestore/backend/memlog/diskstore.go Lines 396 to 417 in 23d1066
As you're modifying `diskstore.go`, the other tests in this package might help you. Ping me if you need more guidance.
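To make the suggestion concrete, here is a minimal Go sketch of the idea; `diskStore`, `checkpointClean`, and `writeCheckpoint` are simplified stand-ins for illustration, not the actual libbeat memlog API:

```go
package memlogsketch

// diskStore is a toy stand-in for the memlog disk store; logInvalid marks
// that the on-disk log file failed validation and must not be appended to.
type diskStore struct {
	logInvalid bool
}

// checkpointClean stands in for the operation that writes the full state to
// the checkpoint file and truncates the log, leaving the store valid again.
func (s *diskStore) checkpointClean() error {
	// ... serialize the state, fsync, truncate the log file ...
	return nil
}

// writeCheckpoint shows where the reset belongs: once a checkpoint has
// succeeded, the log file is valid again, so logInvalid must be cleared.
// Without this reset, every later log operation keeps forcing a checkpoint.
func (s *diskStore) writeCheckpoint() error {
	if err := s.checkpointClean(); err != nil {
		return err
	}
	s.logInvalid = false // reset after a successful checkpoint
	return nil
}
```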
Force-pushed from 55f74ed to 36eda84
I have updated the PR now. I added the test against `store`, not `diskstore`, because the read validation is done in `store`. I also reused an invalid truncated log sample file from another test. Let me know whether that looks good! I saw the CI fail; the errors seem unrelated to my code. Let me know if I'm wrong.
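For readers who can't open the diff, here is a rough sketch of the shape such a test could take, using a toy store type instead of the real statestore API (every name here is invented):

```go
package memlogsketch

import "testing"

// toyStore mimics the behaviour under test: an invalid log forces a full
// checkpoint, and a successful checkpoint must clear the flag so that later
// operations go back to cheap log appends.
type toyStore struct {
	logInvalid  bool
	checkpoints int
	appends     int
}

func (s *toyStore) set(key, value string) {
	if s.logInvalid {
		s.checkpoints++      // simulate the expensive full-state checkpoint
		s.logInvalid = false // the behaviour under test: reset after the checkpoint
		return
	}
	s.appends++ // simulate appending the operation to the log file
}

func TestLogInvalidIsResetAfterCheckpoint(t *testing.T) {
	// Simulate opening a store whose log file was detected as corrupted.
	s := &toyStore{logInvalid: true}

	s.set("a", "1")
	s.set("b", "2")
	s.set("c", "3")

	if s.checkpoints != 1 {
		t.Fatalf("expected exactly one checkpoint, got %d", s.checkpoints)
	}
	if s.appends != 2 {
		t.Fatalf("expected the remaining operations to be log appends, got %d", s.appends)
	}
}
```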
Force-pushed from a7fae3a to 6d13807
I also added a second commit to fix deprecation warnings in `diskstore`.
Force-pushed from 6d13807 to 8ab9a5e
The PR is looking great @emmanueltouzery! There are a few things to improve/add and we'll be ready to merge it 🤩
The PR is looking great! Just one small change and I'll approve it.
Co-authored-by: Tiago Queiroz <[email protected]>
Force-pushed from 6bae530 to 2c16d59
Applied your suggestion 👍
test
/test
I didn't look at the 7.17 code, but if the bug is also there, yes, we can backport it. I'll add the label to create the backport PR.
The latest test run failed because of a known issue: #39698. We'll have to wait for the test to be fixed or skipped and then update this PR with the latest `main`.
I can confirm that 7.x is affected too.
But I see other things getting merged despite this? And there has been no activity on #39698 in the last five days? I'm mentioning this because, from my point of view, the issue we're fixing here is not that benign... I'm hoping the fix goes in sometime soon.
You're right, the issue is not with your PR. We do our best not to force-merge PRs if CI is failing, even if it is for unrelated reasons. If there is a flaky test, sometimes not all PRs fail on it, so they get merged. I'll update your branch with the latest `main`.
/test
run docs-build
In case of corrupted log file (which has good chances to happen in case of sudden unclean system shutdown), we set a flag which causes us to checkpoint immediately, but never do anything else besides that. This causes Filebeat to just checkpoint on each log operation (therefore causing a high IO load on the server and also causing Filebeat to fall behind). This change resets the logInvalid flag after a successful checkpointing.

Co-authored-by: Tiago Queiroz <[email protected]>
(cherry picked from commit 217f5a6)

# Conflicts:
# libbeat/statestore/backend/memlog/diskstore.go
Thank you so much for the help, @emmanueltouzery!!
I see this is not in 8.14.0, so this will be released in 8.15.0? Thank you, and great we got this in!
Yes, this will land in 8.15.0 in the current state, but I think we should backport it to have it in 8.14.1.
Yes, it makes sense. I'll add the backport tag and monitor/merge the PR.
In case of corrupted log file (which has good chances to happen in case of sudden unclean system shutdown), we set a flag which causes us to checkpoint immediately, but never do anything else besides that. This causes Filebeat to just checkpoint on each log operation (therefore causing a high IO load on the server and also causing Filebeat to fall behind). This change resets the logInvalid flag after a successful checkpointing.

Co-authored-by: Tiago Queiroz <[email protected]>
(cherry picked from commit 217f5a6)
…35893) (#39842)

* Fix high IO after sudden filebeat stop (#35893) (#39392)

In case of corrupted log file (which has good chances to happen in case of sudden unclean system shutdown), we set a flag which causes us to checkpoint immediately, but never do anything else besides that. This causes Filebeat to just checkpoint on each log operation (therefore causing a high IO load on the server and also causing Filebeat to fall behind). This change resets the logInvalid flag after a successful checkpointing.

Co-authored-by: Tiago Queiroz <[email protected]>
(cherry picked from commit 217f5a6)

* Update CHANGELOG.next.asciidoc

---------

Co-authored-by: emmanueltouzery <[email protected]>
Co-authored-by: Tiago Queiroz <[email protected]>
…35893) (#39795)

In case of corrupted log file (which has good chances to happen in case of sudden unclean system shutdown), we set a flag which causes us to checkpoint immediately, but never do anything else besides that. This causes Filebeat to just checkpoint on each log operation (therefore causing a high IO load on the server and also causing Filebeat to fall behind). This change resets the logInvalid flag after a successful checkpointing.

Co-authored-by: Tiago Queiroz <[email protected]>
(cherry picked from commit 217f5a6)

# Conflicts:
# libbeat/statestore/backend/memlog/diskstore.go

---------

Co-authored-by: emmanueltouzery <[email protected]>
Co-authored-by: Tiago Queiroz <[email protected]>
Co-authored-by: Pierre HILBERT <[email protected]>
Proposed commit message
In case of a corrupted log file (which has a good chance of happening after a sudden, unclean system shutdown), we set a flag that causes us to checkpoint immediately, but never do anything else besides that. This causes Filebeat to just checkpoint on each log operation (therefore causing a high IO load on the server and also causing Filebeat to fall behind).
This change resets the `logInvalid` flag after a successful checkpoint.
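As an illustration of the mechanism described above, here is a minimal sketch with invented names (not the actual memlog diskstore code): once the invalid-log flag is set, every operation takes the expensive checkpoint path, and the fix is the single line that clears the flag after a checkpoint succeeds.

```go
package memlogsketch

// store is an illustrative stand-in for the memlog disk store; none of these
// names are the real beats API.
type store struct {
	logInvalid bool // set when the on-disk log file fails validation
}

func (s *store) checkpoint() error {
	// ... rewrite the full registry state to disk and truncate the log ...
	return nil
}

func (s *store) appendToLog(op string) error {
	// ... append a single operation to the log file (the cheap path) ...
	return nil
}

// logOperation sketches the flow described in the commit message: before the
// fix, logInvalid was never cleared, so every call took the checkpoint branch
// and rewrote the full state, causing constant high IO.
func (s *store) logOperation(op string) error {
	if s.logInvalid {
		if err := s.checkpoint(); err != nil {
			return err
		}
		s.logInvalid = false // the fix: clear the flag once a checkpoint succeeds
		return nil
	}
	return s.appendToLog(op)
}
```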
Checklist

- I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist
How to test this PR locally

I have not in fact tested the PR. I have only checked that it builds. I'm also not 100% sure of the correctness, but the change is really simple and I hope that a maintainer can quickly confirm whether the fix is correct. In any case, the current code clearly will cause the issues I described, since `logInvalid` is only ever set to `true`, never to `false`. Therefore I think that the beat trivially cannot recover.

Related issues
Logs

See details in #35893, but the issue we fix otherwise manifests itself with an error message mentioning an unexpected EOF (or invalid character '\x00' instead of EOF).