High io consumption after sudden filebeat stop #35893

Closed
Hitych opened this issue Jun 23, 2023 · 4 comments · Fixed by #39392
Labels
Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team

Comments


Hitych commented Jun 23, 2023

Hi! I asked about this on discuss.elastic.co but got no answer.

The problem is very high IO after a sudden termination of Filebeat. The cause is a checkpoint being performed on every log operation, because the log_invalid flag is set to true after the initial log read fails. After an abnormal termination of Filebeat, the log may be left in an inconsistent state, and reading such a log can fail with the error: Incomplete or corrupted log file in /usr/share/filebeat/data/registry/filebeat. Continue with last known complete and consistent state. Reason: invalid character '\\x00' looking for beginning of value
After that, Filebeat clears the log file but never writes to it again; it just makes checkpoint after checkpoint.

Steps to reproduce:

  1. Start filebeat
  2. Shutdown machine suddenly
  3. Start machine again
  4. Start filebeat
  5. Check the log for errors
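
For readers wondering where the "invalid character '\x00'" part of the message comes from, here is a minimal sketch. It assumes the registry log is a sequence of JSON values (which the error text suggests) and that an unclean shutdown left NUL bytes at the end of the file. `decodeLog` and the sample entry are hypothetical and are not taken from the Filebeat code base.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
)

// decodeLog reads consecutive JSON values from data and returns how many
// were decoded before the first error. A clean end of input (io.EOF) is
// not reported as an error.
func decodeLog(data []byte) (int, error) {
	dec := json.NewDecoder(bytes.NewReader(data))
	n := 0
	for {
		var entry map[string]interface{}
		if err := dec.Decode(&entry); err != nil {
			if err == io.EOF {
				return n, nil
			}
			return n, err
		}
		n++
	}
}

func main() {
	// Hypothetical log content: one complete entry followed by NUL bytes,
	// as might be left behind when the machine loses power mid-write.
	data := append([]byte(`{"op":"set","id":1}`+"\n"), 0, 0, 0, 0)

	n, err := decodeLog(data)
	fmt.Printf("decoded %d entries, err: %v\n", n, err)
}
```

The first entry decodes fine and the decoder then fails on the NUL padding, which should produce an error of the same shape as the one quoted above.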
botelastic bot added the needs_team label Jun 23, 2023
Contributor

emmanueltouzery commented Feb 12, 2024

Author

Hitych commented Mar 15, 2024

@elastic/obs-dc can anyone help here?

emmanueltouzery added a commit to emmanueltouzery/beats that referenced this issue May 3, 2024
In case of corrupted log file (which has good chances to happen in case
of sudden unclean system shutdown), we set a flag which causes us to
checkpoint immediately, but never do anything else besides that. This
causes filebeat to just checkpoint on each log operation (therefore
causing a high IO load on the server and also causing filebeat to fall
behind).

This change resets the logInvalid flag after a successful checkpointing.
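
The behaviour described in this commit message can be sketched as follows. This is a simplified model under stated assumptions, not the code in libbeat/statestore/backend/memlog/diskstore.go: the `store` type, its fields, and `simulate` are hypothetical names used only for illustration. Only the `logInvalid` flag name comes from the commit message.

```go
package main

import "fmt"

// store is a hypothetical stand-in for the on-disk store.
type store struct {
	logInvalid  bool // set when the log file could not be read at startup
	resetOnOK   bool // true models the fix, false models the old behaviour
	checkpoints int  // number of full checkpoints written
	appends     int  // number of cheap log appends
}

// checkpoint models writing a full snapshot of the state. With the fix,
// a successful checkpoint clears the logInvalid flag.
func (s *store) checkpoint() error {
	s.checkpoints++
	if s.resetOnOK {
		s.logInvalid = false
	}
	return nil
}

// logOperation models recording one state update. While logInvalid is
// set, every update is turned into a full checkpoint.
func (s *store) logOperation() error {
	if s.logInvalid {
		return s.checkpoint()
	}
	s.appends++
	return nil
}

// simulate starts from a corrupted log and applies ops updates.
func simulate(fixed bool, ops int) (checkpoints, appends int) {
	s := &store{logInvalid: true, resetOnOK: fixed}
	for i := 0; i < ops; i++ {
		_ = s.logOperation()
	}
	return s.checkpoints, s.appends
}

func main() {
	c, a := simulate(false, 1000)
	fmt.Printf("without reset: %d checkpoints, %d appends\n", c, a)
	c, a = simulate(true, 1000)
	fmt.Printf("with reset:    %d checkpoints, %d appends\n", c, a)
}
```

In this model, leaving the flag set turns every one of the 1000 updates into a checkpoint, while resetting it after the first successful checkpoint leaves one checkpoint and 999 ordinary appends. That matches the high IO load and the fix described above, at least in outline.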
belimawr added the Team:Elastic-Agent-Data-Plane label May 3, 2024
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

botelastic bot removed the needs_team label May 3, 2024
Contributor

belimawr commented May 3, 2024

Hey folks, thanks for finding this bug and proposing a fix! Looking at the code I can see it indeed is a bug. Restarting Filebeat should bring it back into a consistent state. While not perfect, it is at least a workaround.

emmanueltouzery added a commit to emmanueltouzery/beats that referenced this issue May 19, 2024
emmanueltouzery added a commit to emmanueltouzery/beats that referenced this issue May 19, 2024
emmanueltouzery added a commit to emmanueltouzery/beats that referenced this issue May 19, 2024
emmanueltouzery added a commit to emmanueltouzery/beats that referenced this issue May 19, 2024
emmanueltouzery added a commit to emmanueltouzery/beats that referenced this issue May 20, 2024
emmanueltouzery added a commit to emmanueltouzery/beats that referenced this issue May 22, 2024
emmanueltouzery added a commit to emmanueltouzery/beats that referenced this issue May 22, 2024
belimawr added a commit that referenced this issue Jun 4, 2024
In case of corrupted log file (which has good chances to happen in case
of sudden unclean system shutdown), we set a flag which causes us to
checkpoint immediately, but never do anything else besides that. This
causes Filebeat to just checkpoint on each log operation (therefore
causing a high IO load on the server and also causing Filebeat to fall
behind).

This change resets the logInvalid flag after a successful checkpointing.

Co-authored-by: Tiago Queiroz <[email protected]>
mergify bot pushed a commit that referenced this issue Jun 4, 2024
Co-authored-by: Tiago Queiroz <[email protected]>
(cherry picked from commit 217f5a6)

# Conflicts:
#	libbeat/statestore/backend/memlog/diskstore.go
mergify bot pushed a commit that referenced this issue Jun 10, 2024
Co-authored-by: Tiago Queiroz <[email protected]>
(cherry picked from commit 217f5a6)
pierrehilbert pushed a commit that referenced this issue Jun 13, 2024
…35893) (#39842)

* Fix high IO after sudden filebeat stop (#35893) (#39392)

Co-authored-by: Tiago Queiroz <[email protected]>
(cherry picked from commit 217f5a6)

* Update CHANGELOG.next.asciidoc

---------

Co-authored-by: emmanueltouzery <[email protected]>
Co-authored-by: Tiago Queiroz <[email protected]>
belimawr added a commit that referenced this issue Jun 24, 2024
…35893) (#39795)

Co-authored-by: Tiago Queiroz <[email protected]>
(cherry picked from commit 217f5a6)

# Conflicts:
#	libbeat/statestore/backend/memlog/diskstore.go

---------

Co-authored-by: emmanueltouzery <[email protected]>
Co-authored-by: Tiago Queiroz <[email protected]>
Co-authored-by: Pierre HILBERT <[email protected]>