Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cherry-pick #22673 to 7.x: [Auditbeat] Recover from errors in audit monitoring routine #22724

Merged
merged 1 commit into from
Nov 24, 2020

Conversation

adriansr
Copy link
Contributor

@adriansr adriansr commented Nov 24, 2020

Cherry-pick of PR #22673 to 7.x branch. Original message:

The auditd module spawns a monitoring goroutine that fetches auditd status every 15s. Due to this routine using a single audit client, if an update fails (because a netlink message is late or other causes), the audit client can get out of sync with the stream, failing in all subsequent requests.

For reasons that aren't 100% clear to me at the moment, this error condition leads to a lot of [audit_send_repl] (2.6.x) / [audit_send_reply] (3.x+) kernel threads being created. (Reproduced in 2.6.32, no other versions tested).

The following error will appear every 15s:

ERROR [auditd] auditd/audit_linux.go:183 get status request failed:failed to get audit status ack: unexpected sequence number for reply (expected 6286 but got 6285)

ps -ef will show a lot of audit_send_repl threads:

[...]
root     27790     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27791     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27792     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27793     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27794     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27795     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27796     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27797     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27798     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27799     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27800     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27801     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27802     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27803     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27804     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27805     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27806     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27807     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27808     2  0 12:52 ?        00:00:00 [audit_send_repl]
[...]

This patch updates the error-handling logic to create a new audit client when a status update fails, allowing to recover and preventing the proliferation of audit_send_repl kernel threads.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

It's easy to reproduce this issue by modifying the code at

if ms.kernelLost.enabled {
client, err := libaudit.NewAuditClient(nil)
if err != nil {
reporter.Error(err)
ms.log.Errorw("Failure creating audit monitoring client", "error", err)
}
go func() {
defer client.Close()
timer := time.NewTicker(lostEventsUpdateInterval)
defer timer.Stop()
for {
select {
case <-reporter.Done():
return
case <-timer.C:
if status, err := client.GetStatus(); err == nil {
ms.updateKernelLostMetric(status.Lost)
} else {
ms.log.Error("get status request failed:", err)
}
}
}
}()
}

  • Set a lower update interval (milliseconds)
  • call client.GetStatusAsync(false) outside of the polling loop.

Similar can be used to validate this fix. Ideally sending an async getstatus every few iterations of the loop.

Sorry, something went wrong.

Verified

This commit was signed with the committer’s verified signature. The key has been revoked.
eggplants haruna
…22673)

The auditd module spawns a monitoring goroutine that fetches auditd status
every 15s. Due to this routine using a single audit client, if an update
fails (because a netlink message is late or other causes), the audit client
can get out of sync with the stream, failing in all subsequent requests.

For reasons that aren't 100% clear to me at the moment, this error condition
leads to a lot of `[audit_send_repl]` (2.6.x) / `[audit_send_reply]` (3.x+)
kernel threads being created.

```
ERROR [auditd] auditd/audit_linux.go:183 get status request failed:failed to get audit status ack: unexpected sequence number for reply (expected 6286 but got 6285)
```

```
$ ps -ef
[...]
root     27790     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27791     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27792     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27793     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27794     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27795     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27796     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27797     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27798     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27799     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27800     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27801     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27802     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27803     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27804     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27805     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27806     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27807     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27808     2  0 12:52 ?        00:00:00 [audit_send_repl]
[...]
```

This patch updates the error-handling logic to create a new audit client
when a status update fails, allowing to recover and preventing the
proliferation of `audit_send_repl` kernel threads.

(cherry picked from commit ca9550f)
@adriansr adriansr requested a review from a team as a code owner November 24, 2020 09:29
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Nov 24, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/security-external-integrations (Team:Security-External Integrations)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Nov 24, 2020
@adriansr adriansr requested review from a team and removed request for a team November 24, 2020 09:30
@elasticmachine
Copy link
Collaborator

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview

Expand to view the summary

Build stats

  • Build Cause: [Pull request #22724 opened]

  • Start Time: 2020-11-24T09:30:42.941+0000

  • Duration: 23 min 52 sec

Test stats 🧪

Test Results
Failed 0
Passed 455
Skipped 59
Total 514

💚 Flaky test report

Tests succeeded.

Expand to view the summary

Test stats 🧪

Test Results
Failed 0
Passed 455
Skipped 59
Total 514

@adriansr adriansr merged commit 53a2e7e into elastic:7.x Nov 24, 2020
@zube zube bot removed the [zube]: Done label Feb 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants