[Bug] fix high IO after sudden filebeat stop (#35893) #39392

Merged
merged 4 commits into elastic:main on Jun 4, 2024

Conversation

@emmanueltouzery (Contributor) commented May 3, 2024

Proposed commit message

In case of a corrupted log file (which is likely to happen after a sudden, unclean system shutdown), we set a flag which causes us to checkpoint immediately, but we never do anything else beyond that. As a result, Filebeat checkpoints on every log operation, causing a high IO load on the server and also causing Filebeat to fall behind.

This change resets the logInvalid flag after a successful checkpoint.
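
For illustration, the behaviour described above boils down to something like the following sketch (hypothetical and simplified; the names loosely mirror the memlog diskstore fields but this is not the actual implementation): the flag is only ever set, so once the log file is considered corrupted, every subsequent operation takes the expensive checkpoint path.

package sketch

// Hypothetical, simplified model of the reported behaviour, not the real code.
type diskstoreSketch struct {
	logInvalid        bool   // set when a corrupted log file is detected; before this fix it was never cleared
	logFileSize       uint64 // bytes appended to the log since the last checkpoint
	checkpointMaxSize uint64 // normal size threshold that triggers a checkpoint
}

// mustCheckpoint approximates the decision taken on every log operation.
func (s *diskstoreSketch) mustCheckpoint() bool {
	// Because logInvalid is never reset to false, once it is set every single
	// operation forces a full checkpoint, which produces the high IO load and
	// the growing backlog described above.
	return s.logInvalid || s.logFileSize >= s.checkpointMaxSize
}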

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.


How to test this PR locally

I have not in fact tested the PR; I have only checked that it builds. I'm also not 100% sure of the correctness, but the change is really simple and I hope that a maintainer can quickly confirm whether the fix is correct. In any case, the current code will clearly cause the issue I described, since logInvalid is only ever set to true and never back to false, so the Beat trivially cannot recover.

Related issues

  • #35893

Logs

See details in #35893; the issue being fixed otherwise manifests itself with a message like:

WARN memlog/store.go:130 Incomplete or corrupted log file in /var/lib/filebeat/registry/filebeat. Continue with last known complete and consistent state. Reason: unexpected EOF

(or invalid character '\x00' instead of EOF)

@emmanueltouzery requested a review from a team as a code owner on May 3, 2024 09:34
@emmanueltouzery requested reviews from rdner and leehinman on May 3, 2024 09:34
@botelastic bot added the needs_team label (indicates that the issue/PR needs a Team:* label) on May 3, 2024
cla-checker-service bot commented May 3, 2024

💚 CLA has been signed

mergify bot commented May 3, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @emmanueltouzery? 🙏.
To do so, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fix up this pull request, you need to add the backport labels for the needed branches, such as:

  • backport-v8.\d.0 is the label to automatically backport to the 8.\d branch (\d is the version digit)

@elasticmachine (Collaborator) commented May 3, 2024

❕ Build Aborted

The PR is not allowed to run in the CI yet

Build stats

  • Start Time: 2024-05-03T20:03:25.391+0000

  • Duration: 5 min 45 sec

Steps errors (2)

Load a resource file from a library
  • Took 0 min 0 sec
  • Description: approval-list/elastic/beats.yml
Error signal
  • Took 0 min 0 sec
  • Description: githubApiCall: The REST API call https://api.github.com/orgs/elastic/members/emmanueltouzery return the message : java.lang.Exception: httpRequest: Failure connecting to the service https://api.github.com/orgs/elastic/members/emmanueltouzery : httpRequest: Failure connecting to the service https://api.github.com/orgs/elastic/members/emmanueltouzery : Code: 404Error: {"message":"User does not exist or is not a member of the organization","documentation_url":"https://docs.github.com/rest/orgs/members#check-organization-membership-for-a-user"}

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@pierrehilbert added the Team:Elastic-Agent-Data-Plane label (Agent Data Plane team) on May 3, 2024
@elasticmachine (Collaborator) commented:

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@botelastic bot removed the needs_team label on May 3, 2024
@pierrehilbert requested a review from belimawr on May 3, 2024 14:15
@belimawr (Contributor) commented May 3, 2024

Hi @emmanueltouzery, looking at the code and the original issue, I can see it is indeed a bug. Restarting Filebeat should fix the problem, but I didn't have time to test it.

To get this merged we'd need some tests; setting logInvalid to false can be done right after truncating the log file.

Would you be interested in further working on this PR? I didn't take a deep look at our current tests, but I believe it would be simple to write a unit test that simulates this condition and ensures the fix works as expected.

I can try guiding you on this.

@emmanueltouzery (Contributor, Author) commented:

Well, first off, I'm delighted that the issue will likely be fixed.

@emmanueltouzery (Contributor, Author) commented:
Sorry for closing and reopening; I'm typing on a smartphone.

I can confirm that restarting filebeat fixes the issue.

So if you think it makes sense for you, I can try to improve the PR further with your guidance, and worst case you take over.

To start with, could you tell me in which file the test should be added?

With regards to when to reset the flag, you mention truncating the log; is that the checkpointing operation, or does that happen elsewhere?

@rdner removed their request for review on May 7, 2024 12:47
@emmanueltouzery (Contributor, Author) commented:
ping?

@emmanueltouzery (Contributor, Author) commented:
@belimawr I'm interested in trying to get a fix merged for this issue.

could you tell me in which file the test should be added?

With regards to when to reset the flag, you mention truncating the log; is that the checkpointing operation, or does that happen elsewhere?

Thank you!

@belimawr (Contributor) commented:
Hey @emmanueltouzery, sorry for the delay in replying.

With regards to when to reset the flag, you mention truncating the log; is that the checkpointing operation, or does that happen elsewhere?

Yes, the checkpoint operation. Once the log file is truncated, it will be in a valid state, so we can already reset the logInvalid flag.

func (s *diskstore) checkpointClearLog() {
	if s.logFile == nil {
		s.logNeedsTruncate = true
		return
	}
	// Truncate the log file and rewind to the start so that future
	// operations append to an empty, consistent log.
	err := s.logFile.Truncate(0)
	if err == nil {
		_, err = s.logFile.Seek(0, io.SeekStart)
	}
	if err != nil {
		// Truncation failed: drop the file handle, remember that the log
		// still needs truncating, and flag it as invalid so a checkpoint
		// is forced again.
		s.logFile.Close()
		s.logFile = nil
		s.logBuf = nil
		s.logNeedsTruncate = true
		s.logInvalid = true
	}
	s.logEntries = 0
	s.logFileSize = 0
}
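
For illustration, the reset suggested here would sit right after the successful truncate, roughly as follows (a hypothetical sketch against the quoted function, not necessarily the exact change that was merged):

	err := s.logFile.Truncate(0)
	if err == nil {
		_, err = s.logFile.Seek(0, io.SeekStart)
	}
	if err != nil {
		// ... existing error handling, which sets s.logInvalid = true ...
	} else {
		// The log file is now empty and consistent again, so the corruption
		// flag can safely be cleared; this reset is the essence of the fix.
		s.logInvalid = false
	}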

could you tell me in which file the test should be added?

As you're modifying diskstore.go, you can create a test file next to it, diskstore_test.go, and add at least one unit test there. I believe it should be simple to instantiate the disk store pointing to a file you can easily corrupt, validate that logInvalid gets reset, and check that new data starts to be added to the log file.

Ping me if you need more guidance. The other tests in this package might help you.
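
For illustration, a self-contained sketch of the kind of test described above could look like the following. It uses a hypothetical stand-in type instead of the real memlog store, so the names and the corruption heuristic below are assumptions rather than the actual package API:

package sketch

import (
	"os"
	"path/filepath"
	"testing"
)

// fakeStore mimics only the behaviour under discussion: a corrupted log file
// marks the store invalid, and a successful checkpoint clears the flag again.
type fakeStore struct {
	logPath    string
	logInvalid bool
}

func (s *fakeStore) open() error {
	data, err := os.ReadFile(s.logPath)
	if err != nil {
		return err
	}
	// Treat a log that ends mid-entry (no trailing newline) as corrupted,
	// loosely mirroring the "unexpected EOF" warning from the issue.
	if len(data) > 0 && data[len(data)-1] != '\n' {
		s.logInvalid = true
	}
	return nil
}

func (s *fakeStore) checkpoint() error {
	// Truncating the log leaves it empty and therefore consistent again.
	if err := os.Truncate(s.logPath, 0); err != nil {
		return err
	}
	s.logInvalid = false // the reset this PR introduces
	return nil
}

func TestCheckpointResetsLogInvalid(t *testing.T) {
	logPath := filepath.Join(t.TempDir(), "log.json")
	// Simulate an unclean shutdown by writing a truncated log entry.
	if err := os.WriteFile(logPath, []byte(`{"op":"set","id":"k`), 0o600); err != nil {
		t.Fatal(err)
	}
	s := &fakeStore{logPath: logPath}
	if err := s.open(); err != nil {
		t.Fatal(err)
	}
	if !s.logInvalid {
		t.Fatal("expected logInvalid to be set for a corrupted log file")
	}
	if err := s.checkpoint(); err != nil {
		t.Fatal(err)
	}
	if s.logInvalid {
		t.Fatal("expected logInvalid to be reset after a successful checkpoint")
	}
}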

@emmanueltouzery force-pushed the bug_35893 branch 3 times, most recently from 55f74ed to 36eda84 on May 19, 2024 14:11
@emmanueltouzery (Contributor, Author) commented:
I have updated the PR now. I added the test against store, not diskstore, because the read validation is done in store. I also reused an invalid truncated log sample file from another test. Let me know whether that looks good!

I saw the CI fail; the errors seem unrelated to my code. Let me know if I'm wrong.

@emmanueltouzery force-pushed the bug_35893 branch 2 times, most recently from a7fae3a to 6d13807 on May 19, 2024 14:50
@emmanueltouzery (Contributor, Author) commented:
I also added a second commit to fix deprecation warnings in diskstore.

@belimawr (Contributor) left a comment

The PR is looking great @emmanueltouzery! There are a few things to improve/add and we'll be ready to merge it 🤩

Resolved review threads on:
  • CHANGELOG.next.asciidoc
  • libbeat/statestore/backend/memlog/store_test.go (3 threads)

@belimawr (Contributor) left a comment

The PR is looking great! Just one small change and I'll approve it.

Resolved review thread on:
  • libbeat/statestore/backend/memlog/store_test.go

@emmanueltouzery (Contributor, Author) commented:

Applied your suggestion 👍
I was wondering, will this get backported to 7.x as well?

@dliappis (Contributor) commented:
test

@dliappis (Contributor) commented:
/test

@belimawr (Contributor) commented:

Applied your suggestion 👍 I was wondering, will this get backported to 7.x as well?

I didn't look at the 7.17 code, but if the bug is also there, yes, we can backport it. I'll add the label to create the backport PR.

@belimawr added the backport-7.17 label (automated backport to the 7.17 branch with mergify) on May 23, 2024
@belimawr (Contributor) commented:

The latest test run failed because of a known issue: #39698. We'll have to wait for the test to be fixed or skipped and then update this PR with main.

@emmanueltouzery (Contributor, Author) commented:
I can confirm that 7.x is affected too.

@emmanueltouzery (Contributor, Author) commented:

The latest test run failed because of a known issue: #39698. We'll have to wait for the test to be fixed or skipped and then update this PR with main.

But I see other things getting merged despite this, and no activity on #39698 in the last five days?

I'm mentioning this because, from my point of view, the issue we're fixing here is not that benign... I'm hoping the fix goes in sometime soon.

@belimawr (Contributor) commented Jun 3, 2024

I'm mentioning this because, from my point of view, the issue we're fixing here is not that benign... I'm hoping the fix goes in sometime soon.

You're right, the issue is not with your PR. We do our best not to force-merge PRs if CI is failing, even if it is for unrelated reasons. If there is a flaky test, sometimes not all PRs fail on it, so they get merged.

I'll update your branch with the latest main and re-trigger CI.

@dliappis (Contributor) commented Jun 4, 2024

/test

@dliappis (Contributor) commented Jun 4, 2024

run docs-build

@belimawr merged commit 217f5a6 into elastic:main on Jun 4, 2024 (106 checks passed)
mergify bot pushed a commit that referenced this pull request Jun 4, 2024

Co-authored-by: Tiago Queiroz <[email protected]>
(cherry picked from commit 217f5a6)

# Conflicts:
#	libbeat/statestore/backend/memlog/diskstore.go
@belimawr (Contributor) commented Jun 4, 2024

Thank you so much for the help, @emmanueltouzery!!

@emmanueltouzery (Contributor, Author) commented:

I see this is not in 8.14.0, so this will be released in 8.15.0?

Thank you, and great we got this in!

@pierrehilbert (Collaborator) commented:

Yes, this will land in 8.15.0 in the current state, but I think we should backport it to have it in 8.14.1. @belimawr WDYT?

@belimawr (Contributor) commented:

Yes, this will land in 8.15.0 in the current state, but I think we should backport it to have it in 8.14.1. @belimawr WDYT?

Yes, it makes sense. I'll add the backport tag and monitor/merge the PR.

@belimawr added the backport-v8.14.0 label (automated backport with mergify) on Jun 10, 2024
mergify bot pushed a commit that referenced this pull request Jun 10, 2024

Co-authored-by: Tiago Queiroz <[email protected]>
(cherry picked from commit 217f5a6)
pierrehilbert pushed a commit that referenced this pull request Jun 13, 2024
…35893) (#39842)

* Fix high IO after sudden filebeat stop (#35893) (#39392)

Co-authored-by: Tiago Queiroz <[email protected]>
(cherry picked from commit 217f5a6)

* Update CHANGELOG.next.asciidoc

---------

Co-authored-by: emmanueltouzery <[email protected]>
Co-authored-by: Tiago Queiroz <[email protected]>
belimawr added a commit that referenced this pull request Jun 24, 2024
…35893) (#39795)

Co-authored-by: Tiago Queiroz <[email protected]>
(cherry picked from commit 217f5a6)

# Conflicts:
#	libbeat/statestore/backend/memlog/diskstore.go

---------

Co-authored-by: emmanueltouzery <[email protected]>
Co-authored-by: Tiago Queiroz <[email protected]>
Co-authored-by: Pierre HILBERT <[email protected]>
Labels
  • backport-7.17: Automated backport to the 7.17 branch with mergify
  • backport-v8.14.0: Automated backport with mergify
  • Team:Elastic-Agent-Data-Plane: Label for the Agent Data Plane team

Development

Successfully merging this pull request may close these issues.

High io consumption after sudden filebeat stop (#35893)
5 participants