[Bug] fix high IO after sudden filebeat stop (#35893) #39392

Merged
merged 4 commits into elastic:main on Jun 4, 2024

Conversation

@emmanueltouzery (Contributor) commented May 3, 2024

Proposed commit message

In case of a corrupted log file (which is likely to happen after a sudden, unclean system shutdown), we set a flag which causes us to checkpoint immediately, but we never do anything else beyond that. As a result, Filebeat checkpoints on every log operation, causing a high IO load on the server and also causing Filebeat to fall behind.

This change resets the logInvalid flag after a successful checkpoint.
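
For illustration, the behaviour described above boils down to something like the following sketch (hypothetical and simplified; the names loosely mirror the memlog diskstore fields but this is not the actual implementation): the flag is only ever set, so once the log file is considered corrupted, every subsequent operation takes the expensive checkpoint path.

package sketch

// Hypothetical, simplified model of the reported behaviour, not the real code.
type diskstoreSketch struct {
	logInvalid        bool   // set when a corrupted log file is detected; before this fix it was never cleared
	logFileSize       uint64 // bytes appended to the log since the last checkpoint
	checkpointMaxSize uint64 // normal size threshold that triggers a checkpoint
}

// mustCheckpoint approximates the decision taken on every log operation.
func (s *diskstoreSketch) mustCheckpoint() bool {
	// Because logInvalid is never reset to false, once it is set every single
	// operation forces a full checkpoint, which produces the high IO load and
	// the growing backlog described above.
	return s.logInvalid || s.logFileSize >= s.checkpointMaxSize
}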

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.


How to test this PR locally

I have not in fact tested the PR; I have only checked that it builds. I'm also not 100% sure of the correctness, but the change is really simple and I hope that a maintainer can quickly confirm whether the fix is correct. In any case, the current code will clearly cause the issue I described, since logInvalid is only ever set to true and never back to false, so the Beat trivially cannot recover.

Related issues

  • #35893

Logs

See details in #35893; the issue being fixed otherwise manifests itself with a message like:

WARN memlog/store.go:130 Incomplete or corrupted log file in /var/lib/filebeat/registry/filebeat. Continue with last known complete and consistent state. Reason: unexpected EOF

(or invalid character '\x00' instead of EOF)

@emmanueltouzery requested a review from a team as a code owner on May 3, 2024 09:34
@emmanueltouzery requested reviews from rdner and leehinman on May 3, 2024 09:34
@botelastic bot added the needs_team label (indicates that the issue/PR needs a Team:* label) on May 3, 2024
cla-checker-service bot commented May 3, 2024

💚 CLA has been signed

mergify bot commented May 3, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @emmanueltouzery? 🙏.
To do so, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fix up this pull request, you need to add the backport labels for the needed branches, such as:

  • backport-v8.\d.0 is the label to automatically backport to the 8.\d branch (\d is the version digit)

@elasticmachine (Collaborator) commented May 3, 2024

❕ Build Aborted

The PR is not allowed to run in the CI yet

Build stats

  • Start Time: 2024-05-03T20:03:25.391+0000

  • Duration: 5 min 45 sec

Steps errors (2)

Load a resource file from a library
  • Took 0 min 0 sec
  • Description: approval-list/elastic/beats.yml
Error signal
  • Took 0 min 0 sec
  • Description: githubApiCall: The REST API call https://api.github.com/orgs/elastic/members/emmanueltouzery return the message : java.lang.Exception: httpRequest: Failure connecting to the service https://api.github.com/orgs/elastic/members/emmanueltouzery : httpRequest: Failure connecting to the service https://api.github.com/orgs/elastic/members/emmanueltouzery : Code: 404Error: {"message":"User does not exist or is not a member of the organization","documentation_url":"https://docs.github.com/rest/orgs/members#check-organization-membership-for-a-user"}

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@pierrehilbert added the Team:Elastic-Agent-Data-Plane label (Agent Data Plane team) on May 3, 2024
@elasticmachine (Collaborator) commented:

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@botelastic bot removed the needs_team label on May 3, 2024
@pierrehilbert requested a review from belimawr on May 3, 2024 14:15
@belimawr (Contributor) commented May 3, 2024

Hi @emmanueltouzery, looking at the code and the original issue, I can see it is indeed a bug. Restarting Filebeat should fix the problem, but I didn't have time to test it.

To get this merged we'd need some tests; setting logInvalid to false can be done right after truncating the log file.

Would you be interested in further working on this PR? I didn't take a deep look at our current tests, but I believe it would be simple to write a unit test that simulates this condition and ensures the fix works as expected.

I can try guiding you on this.

@emmanueltouzery (Contributor, Author) commented:

Well, first off, I'm delighted that the issue will likely be fixed.

@emmanueltouzery (Contributor, Author) commented:
Sorry for closing and reopening; I'm typing on a smartphone.

I can confirm that restarting filebeat fixes the issue.

So if you think it makes sense for you, I can try to improve the PR further with your guidance, and worst case you take over.

To start with, could you tell me in which file the test should be added?

With regards to when to reset the flag, you mention truncating the log; is that the checkpointing operation, or does that happen elsewhere?

@rdner removed their request for review on May 7, 2024 12:47
@emmanueltouzery (Contributor, Author) commented:
ping?

@emmanueltouzery (Contributor, Author) commented:
@belimawr I'm interested in trying to get a fix merged for this issue.

could you tell me in which file the test should be added?

With regards to when to reset the flag, you mention truncating the log; is that the checkpointing operation, or does that happen elsewhere?

Thank you!

@belimawr (Contributor) commented:
Hey @emmanueltouzery, sorry for the delay in replying.

With regards to when to reset the flag, you mention truncating the log; is that the checkpointing operation, or does that happen elsewhere?

Yes, the checkpoint operation. Once the log file is truncated, it will be in a valid state, so we can already reset the logInvalid flag.

func (s *diskstore) checkpointClearLog() {
	if s.logFile == nil {
		s.logNeedsTruncate = true
		return
	}
	// Truncate the log file and rewind to the start so that future
	// operations append to an empty, consistent log.
	err := s.logFile.Truncate(0)
	if err == nil {
		_, err = s.logFile.Seek(0, io.SeekStart)
	}
	if err != nil {
		// Truncation failed: drop the file handle, remember that the log
		// still needs truncating, and flag it as invalid so a checkpoint
		// is forced again.
		s.logFile.Close()
		s.logFile = nil
		s.logBuf = nil
		s.logNeedsTruncate = true
		s.logInvalid = true
	}
	s.logEntries = 0
	s.logFileSize = 0
}
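
For illustration, the reset suggested here would sit right after the successful truncate, roughly as follows (a hypothetical sketch against the quoted function, not necessarily the exact change that was merged):

	err := s.logFile.Truncate(0)
	if err == nil {
		_, err = s.logFile.Seek(0, io.SeekStart)
	}
	if err != nil {
		// ... existing error handling, which sets s.logInvalid = true ...
	} else {
		// The log file is now empty and consistent again, so the corruption
		// flag can safely be cleared; this reset is the essence of the fix.
		s.logInvalid = false
	}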

could you tell me in which file the test should be added?

As you're modifying diskstore.go, you can create a test file next to it, diskstore_test.go, and add at least one unit test there. I believe it should be simple to instantiate the disk store pointing to a file you can easily corrupt, validate that logInvalid gets reset, and check that new data starts to be added to the log file.

Ping me if you need more guidance. The other tests in this package might help you.
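
For illustration, a self-contained sketch of the kind of test described above could look like the following. It uses a hypothetical stand-in type instead of the real memlog store, so the names and the corruption heuristic below are assumptions rather than the actual package API:

package sketch

import (
	"os"
	"path/filepath"
	"testing"
)

// fakeStore mimics only the behaviour under discussion: a corrupted log file
// marks the store invalid, and a successful checkpoint clears the flag again.
type fakeStore struct {
	logPath    string
	logInvalid bool
}

func (s *fakeStore) open() error {
	data, err := os.ReadFile(s.logPath)
	if err != nil {
		return err
	}
	// Treat a log that ends mid-entry (no trailing newline) as corrupted,
	// loosely mirroring the "unexpected EOF" warning from the issue.
	if len(data) > 0 && data[len(data)-1] != '\n' {
		s.logInvalid = true
	}
	return nil
}

func (s *fakeStore) checkpoint() error {
	// Truncating the log leaves it empty and therefore consistent again.
	if err := os.Truncate(s.logPath, 0); err != nil {
		return err
	}
	s.logInvalid = false // the reset this PR introduces
	return nil
}

func TestCheckpointResetsLogInvalid(t *testing.T) {
	logPath := filepath.Join(t.TempDir(), "log.json")
	// Simulate an unclean shutdown by writing a truncated log entry.
	if err := os.WriteFile(logPath, []byte(`{"op":"set","id":"k`), 0o600); err != nil {
		t.Fatal(err)
	}
	s := &fakeStore{logPath: logPath}
	if err := s.open(); err != nil {
		t.Fatal(err)
	}
	if !s.logInvalid {
		t.Fatal("expected logInvalid to be set for a corrupted log file")
	}
	if err := s.checkpoint(); err != nil {
		t.Fatal(err)
	}
	if s.logInvalid {
		t.Fatal("expected logInvalid to be reset after a successful checkpoint")
	}
}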

@emmanueltouzery force-pushed the bug_35893 branch 3 times, most recently from 55f74ed to 36eda84 on May 19, 2024 14:11
@emmanueltouzery (Contributor, Author) commented:
I have updated the PR now. I added the test against store, not diskstore, because the read validation is done in store. I also reused an invalid truncated log sample file from another test. Let me know whether that looks good!

I saw the CI fail; the errors seem unrelated to my code. Let me know if I'm wrong.

@emmanueltouzery force-pushed the bug_35893 branch 2 times, most recently from a7fae3a to 6d13807 on May 19, 2024 14:50
@emmanueltouzery (Contributor, Author) commented:
I also added a second commit to fix deprecation warnings in diskstore.

@belimawr (Contributor) left a comment

The PR is looking great @emmanueltouzery! There are a few things to improve/add and we'll be ready to merge it 🤩

Resolved review threads on:
  • CHANGELOG.next.asciidoc
  • libbeat/statestore/backend/memlog/store_test.go (3 threads)

@belimawr (Contributor) left a comment

The PR is looking great! Just one small change and I'll approve it.

Resolved review thread on:
  • libbeat/statestore/backend/memlog/store_test.go

@emmanueltouzery (Contributor, Author) commented:

Applied your suggestion 👍
I was wondering, will this get backported to 7.x as well?

@dliappis (Contributor) commented:
test

@dliappis (Contributor) commented:
/test

@belimawr (Contributor) commented:

Applied your suggestion 👍 I was wondering, will this get backported to 7.x as well?

I didn't look at the 7.17 code, but if the bug is also there, yes, we can backport it. I'll add the label to create the backport PR.

@belimawr added the backport-7.17 label (automated backport to the 7.17 branch with mergify) on May 23, 2024
@belimawr (Contributor) commented:

The latest test run failed because of a known issue: #39698. We'll have to wait for the test to be fixed or skipped and then update this PR with main.

@emmanueltouzery (Contributor, Author) commented:
I can confirm that 7.x is affected too.

@emmanueltouzery (Contributor, Author) commented:

The latest test run failed because of a known issue: #39698. We'll have to wait for the test to be fixed or skipped and then update this PR with main.

But I see other things getting merged despite this, and no activity on #39698 in the last five days?

I'm mentioning this because, from my point of view, the issue we're fixing here is not that benign... I'm hoping the fix goes in sometime soon.

@belimawr (Contributor) commented Jun 3, 2024

I'm mentioning this because, from my point of view, the issue we're fixing here is not that benign... I'm hoping the fix goes in sometime soon.

You're right, the issue is not with your PR. We do our best not to force-merge PRs if CI is failing, even if it is for unrelated reasons. If there is a flaky test, sometimes not all PRs fail on it, so they get merged.

I'll update your branch with the latest main and re-trigger CI.

@dliappis (Contributor) commented Jun 4, 2024

/test

@dliappis (Contributor) commented Jun 4, 2024

run docs-build

@belimawr merged commit 217f5a6 into elastic:main on Jun 4, 2024 (106 checks passed)
mergify bot pushed a commit that referenced this pull request Jun 4, 2024

Co-authored-by: Tiago Queiroz <[email protected]>
(cherry picked from commit 217f5a6)

# Conflicts:
#	libbeat/statestore/backend/memlog/diskstore.go
@belimawr (Contributor) commented Jun 4, 2024

Thank you so much for the help, @emmanueltouzery!!

@emmanueltouzery (Contributor, Author) commented:

I see this is not in 8.14.0, so this will be released in 8.15.0?

Thank you, and great we got this in!

@pierrehilbert (Collaborator) commented:

Yes, this will land in 8.15.0 in the current state, but I think we should backport it to have it in 8.14.1. @belimawr WDYT?

@belimawr (Contributor) commented:

Yes, this will land in 8.15.0 in the current state, but I think we should backport it to have it in 8.14.1. @belimawr WDYT?

Yes, it makes sense. I'll add the backport tag and monitor/merge the PR.

@belimawr added the backport-v8.14.0 label (automated backport with mergify) on Jun 10, 2024
mergify bot pushed a commit that referenced this pull request Jun 10, 2024

Co-authored-by: Tiago Queiroz <[email protected]>
(cherry picked from commit 217f5a6)
pierrehilbert pushed a commit that referenced this pull request Jun 13, 2024
…35893) (#39842)

* Fix high IO after sudden filebeat stop (#35893) (#39392)

Co-authored-by: Tiago Queiroz <[email protected]>
(cherry picked from commit 217f5a6)

* Update CHANGELOG.next.asciidoc

---------

Co-authored-by: emmanueltouzery <[email protected]>
Co-authored-by: Tiago Queiroz <[email protected]>
belimawr added a commit that referenced this pull request Jun 24, 2024
…35893) (#39795)

Co-authored-by: Tiago Queiroz <[email protected]>
(cherry picked from commit 217f5a6)

# Conflicts:
#	libbeat/statestore/backend/memlog/diskstore.go

---------

Co-authored-by: emmanueltouzery <[email protected]>
Co-authored-by: Tiago Queiroz <[email protected]>
Co-authored-by: Pierre HILBERT <[email protected]>
Labels
  • backport-7.17: Automated backport to the 7.17 branch with mergify
  • backport-v8.14.0: Automated backport with mergify
  • Team:Elastic-Agent-Data-Plane: Label for the Agent Data Plane team

Development

Successfully merging this pull request may close these issues.

High io consumption after sudden filebeat stop (#35893)
5 participants