auditbeat: Add a cached file hasher for auditbeat #41952

haesbaert · 2024-12-09T08:46:34Z

Proposed commit message

This implements an LRU cache on top of the FileHasher from hasher.go. It will be used in the new backend for the system process module on Linux.

The cache is indexed by file path and stores the metadata (what we get from stat(2)/statx(2)) along with the hashes of each file.

When we want to hash a file, we stat() the file, then do a cache lookup and compare against the stored metadata. If it differs, we rehash; if not, we use the cached values.
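The lookup flow described above can be sketched roughly as follows. This is a minimal illustration, not the code from cached_hasher.go: the names `cachedHasher`, `fileMeta`, `entry`, and `demo` are hypothetical, `os.Stat` stands in for stat(2)/statx(2), SHA-256 stands in for whatever hashes are configured, and the second stat() on the cold path is left out.

```go
package main

import (
	"container/list"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
)

// fileMeta is a hypothetical stand-in for the stat(2)/statx(2) fields that
// get compared; access time is deliberately not part of it.
type fileMeta struct {
	size    int64
	mtimeNs int64
}

type entry struct {
	path string
	meta fileMeta
	hash string
}

// cachedHasher is a small LRU keyed by file path (illustrative only).
type cachedHasher struct {
	capacity     int
	order        *list.List               // front = most recently used
	items        map[string]*list.Element // path -> element holding *entry
	hits, misses int
}

func newCachedHasher(capacity int) *cachedHasher {
	return &cachedHasher{capacity: capacity, order: list.New(), items: map[string]*list.Element{}}
}

// HashFile stats the file, returns the cached hash if the metadata is
// unchanged, and otherwise rehashes and stores the new result.
func (c *cachedHasher) HashFile(path string) (string, error) {
	fi, err := os.Stat(path)
	if err != nil {
		return "", err
	}
	meta := fileMeta{size: fi.Size(), mtimeNs: fi.ModTime().UnixNano()}
	if el, ok := c.items[path]; ok {
		e := el.Value.(*entry)
		if e.meta == meta { // hot path: stat + lookup
			c.order.MoveToFront(el)
			c.hits++
			return e.hash, nil
		}
		c.order.Remove(el) // metadata changed: drop the stale entry
		delete(c.items, path)
	}
	c.misses++ // cold path: read and hash the whole file
	data, err := os.ReadFile(path)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(data)
	e := &entry{path: path, meta: meta, hash: hex.EncodeToString(sum[:])}
	c.items[path] = c.order.PushFront(e)
	if c.order.Len() > c.capacity { // evict the least recently used entry
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).path)
	}
	return e.hash, nil
}

// demo hashes an empty temp file twice, rewrites it, and hashes it again.
func demo() (int, int, bool, error) {
	f, err := os.CreateTemp("", "cached-hasher")
	if err != nil {
		return 0, 0, false, err
	}
	f.Close()
	defer os.Remove(f.Name())
	c := newCachedHasher(8)
	first, err := c.HashFile(f.Name()) // miss: nothing cached yet
	if err != nil {
		return 0, 0, false, err
	}
	if _, err := c.HashFile(f.Name()); err != nil { // hit: metadata unchanged
		return 0, 0, false, err
	}
	if err := os.WriteFile(f.Name(), []byte("v2"), 0o644); err != nil {
		return 0, 0, false, err
	}
	second, err := c.HashFile(f.Name()) // miss: size changed
	if err != nil {
		return 0, 0, false, err
	}
	return c.hits, c.misses, first != second, nil
}

func main() {
	fmt.Println(demo())
}
```

Here `demo` reports the hit count, the miss count, and whether the hash changed after the rewrite. Comparing the whole metadata struct with `==` keeps the hot path to a stat and a map lookup.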

The cache ignores access time (atime); it's only interested in write modifications. If the machine doesn't support statx(2), it falls back to stat(2) but still stores the result in the same unix.Statx_t.
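As a rough sketch of what "ignoring atime" can look like when comparing two metadata snapshots: the `fileStat` and `timestamp` types below are hypothetical, trimmed-down mirrors of the statx(2) fields, and the exact set of fields the real cache compares is an assumption.

```go
package main

import "fmt"

// timestamp and fileStat are hypothetical, trimmed-down mirrors of the
// timestamp and metadata fields exposed by statx(2).
type timestamp struct {
	Sec  int64
	Nsec uint32
}

type fileStat struct {
	Ino   uint64
	Size  uint64
	Mode  uint16
	Atime timestamp // access time: changes on reads, ignored below
	Mtime timestamp // content modification time
	Ctime timestamp // inode change time
}

// unchanged reports whether two snapshots match in every field except atime.
func unchanged(a, b fileStat) bool {
	// a and b are copies, so zeroing atime does not touch the caller's data.
	a.Atime, b.Atime = timestamp{}, timestamp{}
	return a == b
}

func main() {
	cached := fileStat{Ino: 42, Size: 1024, Mode: 0o755,
		Atime: timestamp{Sec: 100}, Mtime: timestamp{Sec: 90}, Ctime: timestamp{Sec: 90}}

	read := cached
	read.Atime = timestamp{Sec: 200} // the file was only read

	written := cached
	written.Size = 2048
	written.Mtime = timestamp{Sec: 300} // the file was rewritten

	fmt.Println(unchanged(cached, read), unchanged(cached, written))
}
```

Zeroing atime on the copies before a plain struct comparison means a read of the file never invalidates a cache entry, while any change to the remaining fields does.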

With this we end up with a stat() + lookup on the hot path, and a stat() + stat() + insert on the cold path.

The motivation for this is that the new backend ends up fetching "all processes", which in turn causes it to try to hash at every event. The current/old hasher just can't cope with it:

1. Hashing for each event is simply too expensive, in the 100us-50ms range on the default configuration, which puts us below 1000/s.
2. It has scan-rate throttling that, on the default configuration, easily ends up at 40ms per event (25/s).
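The per-second rates quoted in this description follow directly from the per-event latency, assuming events are hashed one after another. A small sketch of the arithmetic (the helper name `eventsPerSecond` is made up for illustration):

```go
package main

import (
	"fmt"
	"time"
)

// eventsPerSecond returns how many sequential events fit into one second
// when each event costs perEvent.
func eventsPerSecond(perEvent time.Duration) float64 {
	return float64(time.Second) / float64(perEvent)
}

func main() {
	for _, d := range []time.Duration{
		40 * time.Millisecond, // throttled hashing on the default configuration
		time.Millisecond,      // the latency that corresponds to 1000/s
		5 * time.Microsecond,  // upper bound quoted for the cached path
	} {
		fmt.Printf("%v per event -> %.0f events/s\n", d, eventsPerSecond(d))
	}
}
```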

With the cache things improve considerably; cache hits stay below 5us (200k/s) in all cases:

MISSES
"miss (/usr/sbin/sshd) took 2.571359ms"
"miss (/usr/bin/containerd) took 52.099386ms"
"miss (/usr/sbin/gssproxy) took 160us"
"miss (/usr/sbin/atd) took 50.032us"
HITS
"hit (/usr/sbin/sshd) took 2.163us"
"hit (/usr/lib/systemd/systemd) took 3.024us"
"hit (/usr/lib/systemd/systemd) took 859ns"
"hit (/usr/sbin/sshd) took 805ns"

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
~~I have made corresponding changes to the documentation~~
~~I have made corresponding changes to the default configuration files~~
I have added tests that prove my fix is effective or that my feature works
I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.


mergify · 2024-12-09T08:47:10Z

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @haesbaert? 🙏.
For such, you'll need to label your PR with:

The upcoming major version of the Elastic Stack
The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit

mergify · 2024-12-09T08:47:10Z

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label and remove the backport-8.x label.

elasticmachine · 2024-12-09T12:02:10Z

Pinging @elastic/sec-linux-platform (Team:Security-Linux Platform)

auditbeat/helper/hasher/cached_hasher.go

nicholasberlin

LGTM

(cherry picked from commit 8ec2e31)
Co-authored-by: Christiano Haesbaert <[email protected]>

haesbaert added the enhancement label Dec 9, 2024

botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Dec 9, 2024

haesbaert added the Team:Security-Linux Platform Linux Platform Team in Security Solution label Dec 9, 2024

botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Dec 9, 2024

mergify bot assigned haesbaert Dec 9, 2024

mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Dec 9, 2024

haesbaert added 4 commits December 9, 2024 09:54

changelog

613e0bf

missing require on tests

c1bac9a

linter

8a7ab88

curb cached_hasher_test to linux

d05c828

haesbaert marked this pull request as ready for review December 9, 2024 12:02

haesbaert requested a review from a team as a code owner December 9, 2024 12:02

nicholasberlin reviewed Dec 9, 2024


haesbaert added 2 commits December 11, 2024 10:49

remove if we cant stat

37686a2

Merge branch 'main' into cached-hasher

aaed9e6

nicholasberlin approved these changes Dec 11, 2024


haesbaert merged commit 8ec2e31 into main Dec 11, 2024
30 checks passed

haesbaert deleted the cached-hasher branch December 11, 2024 16:04

mergify bot mentioned this pull request Dec 11, 2024

[8.x](backport #41952) auditbeat: Add a cached file hasher for auditbeat #41992

Merged

4 tasks

haesbaert mentioned this pull request Dec 13, 2024

auditbeat: system/process module backed by quark #42032

Open

6 tasks
