Cherry-pick #22673 to 7.x: [Auditbeat] Recover from errors in audit monitoring routine #22724

adriansr · 2020-11-24T09:29:58Z

Cherry-pick of PR #22673 to 7.x branch. Original message:

The auditd module spawns a monitoring goroutine that fetches auditd status every 15s. Because this routine uses a single audit client, a failed update (caused by a late netlink message or some other reason) can leave the audit client out of sync with the stream, so that every subsequent request fails.

For reasons that aren't 100% clear to me at the moment, this error condition leads to a lot of [audit_send_repl] (2.6.x) / [audit_send_reply] (3.x+) kernel threads being created. (Reproduced on 2.6.32; no other versions tested.)

The following error will appear every 15s:

ERROR [auditd] auditd/audit_linux.go:183 get status request failed:failed to get audit status ack: unexpected sequence number for reply (expected 6286 but got 6285)

ps -ef will show a lot of audit_send_repl threads:

[...]
root     27790     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27791     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27792     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27793     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27794     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27795     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27796     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27797     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27798     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27799     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27800     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27801     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27802     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27803     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27804     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27805     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27806     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27807     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27808     2  0 12:52 ?        00:00:00 [audit_send_repl]
[...]

This patch updates the error-handling logic to create a new audit client when a status update fails, which allows the monitoring routine to recover and prevents the proliferation of audit_send_repl kernel threads.
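
The recovery idea described above can be sketched roughly as follows. This is a minimal illustration and not the code in this patch: `statusClient`, `pollOnce`, and `fakeClient` are hypothetical names standing in for the audit client and the monitoring loop body, and the sketch assumes the replacement client is created lazily on the next tick.

```go
package main

import (
	"errors"
	"fmt"
)

// statusClient is a hypothetical stand-in for the part of the audit client
// that the monitoring routine uses. It is not the go-libaudit API.
type statusClient interface {
	Lost() (uint32, error)
	Close() error
}

// pollOnce performs one monitoring tick. If there is no usable client it
// creates one first. If the status request fails, the client is closed and
// dropped so that the next tick starts with a fresh one.
func pollOnce(client statusClient, newClient func() (statusClient, error), update func(uint32)) (statusClient, error) {
	if client == nil {
		c, err := newClient()
		if err != nil {
			return nil, fmt.Errorf("failure creating audit monitoring client: %w", err)
		}
		client = c
	}
	lost, err := client.Lost()
	if err != nil {
		// The client may be out of sync with the stream; do not reuse it.
		_ = client.Close()
		return nil, fmt.Errorf("get status request failed: %w", err)
	}
	update(lost)
	return client, nil
}

// fakeClient simulates a client that is either healthy or permanently broken.
type fakeClient struct {
	broken bool
	closed bool
}

func (f *fakeClient) Lost() (uint32, error) {
	if f.broken {
		return 0, errors.New("unexpected sequence number for reply")
	}
	return 7, nil
}

func (f *fakeClient) Close() error {
	f.closed = true
	return nil
}

func main() {
	var lost uint32
	update := func(n uint32) { lost = n }
	newClient := func() (statusClient, error) { return &fakeClient{}, nil }

	// Start with a client that is already out of sync.
	var client statusClient = &fakeClient{broken: true}
	for tick := 1; tick <= 3; tick++ {
		var err error
		client, err = pollOnce(client, newClient, update)
		fmt.Printf("tick %d: lost=%d err=%v\n", tick, lost, err)
	}
}
```

In this sketch only the first tick fails; later ticks run against a fresh client instead of repeating the same error every interval.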

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have made corresponding changes to the default configuration files
I have added tests that prove my fix is effective or that my feature works
I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

It's easy to reproduce this issue by modifying the code at

beats/auditbeat/module/auditd/audit_linux.go

Lines 159 to 182 in bb973c4

    if ms.kernelLost.enabled {
    	client, err := libaudit.NewAuditClient(nil)
    	if err != nil {
    		reporter.Error(err)
    		ms.log.Errorw("Failure creating audit monitoring client", "error", err)
    	}
    	go func() {
    		defer client.Close()
    		timer := time.NewTicker(lostEventsUpdateInterval)
    		defer timer.Stop()
    		for {
    			select {
    			case <-reporter.Done():
    				return
    			case <-timer.C:
    				if status, err := client.GetStatus(); err == nil {
    					ms.updateKernelLostMetric(status.Lost)
    				} else {
    					ms.log.Error("get status request failed:", err)
    				}
    			}
    		}
    	}()
    }

Set a lower update interval (milliseconds).
Call client.GetStatusAsync(false) outside of the polling loop.

A similar modification can be used to validate this fix, ideally by sending an async GetStatus request every few iterations of the loop.
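
To see why an unread asynchronous reply keeps a single client failing, a toy model may help. The sketch below is an assumption made for illustration only: `fakeStream` is a hypothetical in-memory queue, not the netlink protocol or the go-libaudit client. It only models the off-by-one pattern visible in the error message above.

```go
package main

import "fmt"

// fakeStream is a toy request/reply channel: every request is answered, in
// order, with a reply carrying the request's sequence number.
type fakeStream struct {
	seq     uint32
	replies []uint32
}

// send issues a request and queues its reply without reading it, loosely
// mirroring an asynchronous status request whose reply is never consumed.
func (s *fakeStream) send() uint32 {
	s.seq++
	s.replies = append(s.replies, s.seq)
	return s.seq
}

// getStatus issues a request and reads exactly one reply, failing when the
// reply does not carry the sequence number that was just sent.
func (s *fakeStream) getStatus() error {
	want := s.send()
	got := s.replies[0]
	s.replies = s.replies[1:]
	if got != want {
		return fmt.Errorf("unexpected sequence number for reply (expected %d but got %d)", want, got)
	}
	return nil
}

func main() {
	s := &fakeStream{}
	fmt.Println("before async request:", s.getStatus())

	s.send() // async request: its reply stays in the queue

	for i := 0; i < 3; i++ {
		fmt.Println("after async request: ", s.getStatus())
	}
}
```

In this model every synchronous request after the stray one reads the previous reply, so the error repeats on each call until the stream state is discarded, which is what replacing the client achieves.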

…22673) (cherry picked from commit ca9550f)

elasticmachine · 2020-11-24T09:30:01Z

Pinging @elastic/security-external-integrations (Team:Security-External Integrations)

elasticmachine · 2020-11-24T09:53:42Z

💚 Build Succeeded

Build stats

Build Cause: [Pull request #22724 opened]
Start Time: 2020-11-24T09:30:42.941+0000
Duration: 23 min 52 sec

Test stats 🧪

Test	Results
Failed	0
Passed	455
Skipped	59
Total	514

💚 Flaky test report

Tests succeeded.

adriansr requested a review from a team as a code owner November 24, 2020 09:29

adriansr added [zube]: In Review backport Team:Security-External Integrations labels Nov 24, 2020

botelastic bot added the needs_team label Nov 24, 2020

botelastic bot removed the needs_team label Nov 24, 2020

adriansr added the review label Nov 24, 2020

adriansr requested review from a team and removed request for a team November 24, 2020 09:30

marc-gr approved these changes Nov 24, 2020

View reviewed changes

adriansr merged commit 53a2e7e into elastic:7.x Nov 24, 2020

zube bot added [zube]: Done and removed [zube]: In Review labels Nov 24, 2020

zube bot removed the [zube]: Done label Feb 23, 2021
