auditbeat: Add a cached file hasher for auditbeat #41952
Conversation
This implements an LRU cache on top of the FileHasher from hasher.go; it will be used in the new backend for the system process module on Linux.

The cache is indexed by file path and stores the metadata (what we get from stat(2)/statx(2)) along with the hashes of each file. When we want to hash a file, we stat() the file, then do a cache lookup and compare against the stored metadata: if it differs we rehash, otherwise we use the cached values. The cache ignores access time (atime); it is only interested in write modifications. If the machine doesn't support statx(2) it falls back to stat(2) but fills the same Unix.Statx_t.

With this we end up with a stat() + lookup on the hot path, and a stat() + stat() + insert on the cold path.

The motivation for this is that the new backend ends up fetching "all processes", which in turn causes it to try to hash at every event; the current/old hasher just can't cope with that:

1. Hashing for each event is simply too expensive, in the 100us-50ms range on the default configuration, which puts us below 1000/s.
2. It has scan rate throttling that, on the default configuration, easily ends up at 40ms per event (25/s).

With the cache things improve considerably; we stay below 5us (200k/s) in all cases:

```
MISSES
"miss (/usr/sbin/sshd) took 2.571359ms"
"miss (/usr/bin/containerd) took 52.099386ms"
"miss (/usr/sbin/gssproxy) took 160us"
"miss (/usr/sbin/atd) took 50.032us"

HITS
"hit (/usr/sbin/sshd) took 2.163us"
"hit (/usr/lib/systemd/systemd) took 3.024us"
"hit (/usr/lib/systemd/systemd) took 859ns"
"hit (/usr/sbin/sshd) took 805ns"
```
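To make the hot/cold path split concrete, below is a minimal Go sketch of the lookup logic described above. It is an illustration under assumptions, not the code in this PR: the names (`cachedHasher`, `cacheEntry`, `statMeta`, `HashFile`), the `map[string]string` digest shape, and the use of `github.com/hashicorp/golang-lru/v2` are hypothetical stand-ins for whatever hasher.go and the new backend actually use.

```go
package cachedhasher

import (
	lru "github.com/hashicorp/golang-lru/v2"
	"golang.org/x/sys/unix"
)

// cacheEntry pairs the stat(2)/statx(2) metadata observed at hash time with
// the digests computed for that file.
type cacheEntry struct {
	meta   unix.Statx_t
	hashes map[string]string // e.g. "sha256" -> hex digest (hypothetical shape)
}

// cachedHasher front-ends an uncached hash function with a path-keyed LRU.
type cachedHasher struct {
	cache *lru.Cache[string, cacheEntry]
	hash  func(path string) (map[string]string, error) // the underlying, uncached hasher
}

func newCachedHasher(size int, hash func(string) (map[string]string, error)) (*cachedHasher, error) {
	c, err := lru.New[string, cacheEntry](size)
	if err != nil {
		return nil, err
	}
	return &cachedHasher{cache: c, hash: hash}, nil
}

// statMeta fetches metadata via statx(2) and zeroes atime, so that plain
// reads never invalidate a cache entry. A real implementation also falls
// back to stat(2) on kernels without statx(2).
func statMeta(path string) (unix.Statx_t, error) {
	var stx unix.Statx_t
	if err := unix.Statx(unix.AT_FDCWD, path, 0, unix.STATX_BASIC_STATS, &stx); err != nil {
		return unix.Statx_t{}, err
	}
	stx.Atime = unix.StatxTimestamp{} // ignore access time
	return stx, nil
}

// HashFile implements the two paths from the description:
// hot path  = stat() + cache lookup (metadata unchanged),
// cold path = stat() + rehash + insert.
func (h *cachedHasher) HashFile(path string) (map[string]string, error) {
	meta, err := statMeta(path)
	if err != nil {
		return nil, err
	}
	if entry, ok := h.cache.Get(path); ok && entry.meta == meta {
		return entry.hashes, nil // hot path: reuse cached digests
	}
	hashes, err := h.hash(path) // cold path: file is new or was modified
	if err != nil {
		return nil, err
	}
	h.cache.Add(path, cacheEntry{meta: meta, hashes: hashes})
	return hashes, nil
}
```

Note that this sketch rehashes after a single stat(); the cold path described above is stat() + stat() + insert, presumably re-checking the metadata around the hash, which is omitted here for brevity.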
This pull request does not have a backport label.
To fixup this pull request, you need to add the backport labels for the needed branches.
Pinging @elastic/sec-linux-platform (Team:Security-Linux Platform)
LGTM
(cherry picked from commit 8ec2e31)
Co-authored-by: Christiano Haesbaert <[email protected]>
Proposed commit message
This implements an LRU cache on top of the FileHasher from hasher.go; it will be used in the new backend for the system process module on Linux.
The cache is indexed by file path and stores the metadata (what we get from stat(2)/statx(2)) along with the hashes of each file.
When we want to hash a file, we stat() the file, then do a cache lookup and compare against the stored metadata: if it differs we rehash, otherwise we use the cached values.
The cache ignores access time (atime); it is only interested in write modifications. If the machine doesn't support statx(2) it falls back to stat(2) but fills the same Unix.Statx_t (see the fallback sketch below).
With this we end up with a stat() + lookup on the hot path, and a stat() + stat() + insert on the cold path.
The motivation for this is that the new backend ends up fetching "all processes", which in turn causes it to try to hash at every event; the current/old hasher just can't cope with that:

1. Hashing for each event is simply too expensive, in the 100us-50ms range on the default configuration, which puts us below 1000/s.
2. It has scan rate throttling that, on the default configuration, easily ends up at 40ms per event (25/s).
With the cache things improve considerably; we stay below 5us (200k/s) in all cases (see the MISSES/HITS timings above).
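As a companion to the sketch above, here is a hedged illustration of the statx(2)-to-stat(2) fallback mentioned in the commit message, filling the same `unix.Statx_t` either way so the cache comparison stays identical. The helper name `statWithFallback` and the exact fields copied in the fallback are assumptions; the real code may copy more fields and detect missing statx(2) support differently.

```go
package cachedhasher

import "golang.org/x/sys/unix"

// statWithFallback fills a unix.Statx_t via statx(2), or via stat(2) when the
// kernel does not implement statx (ENOSYS), so callers always compare the
// same structure regardless of kernel support.
func statWithFallback(path string, stx *unix.Statx_t) error {
	err := unix.Statx(unix.AT_FDCWD, path, 0, unix.STATX_BASIC_STATS, stx)
	if err != unix.ENOSYS {
		return err // success, or a real error other than "statx unsupported"
	}
	// Fallback: stat(2), then translate into the Statx_t layout. Only the
	// fields this sketch cares about are copied (an assumption).
	var st unix.Stat_t
	if err := unix.Stat(path, &st); err != nil {
		return err
	}
	*stx = unix.Statx_t{
		Ino:   st.Ino,
		Size:  uint64(st.Size),
		Mode:  uint16(st.Mode),
		Mtime: unix.StatxTimestamp{Sec: st.Mtim.Sec, Nsec: uint32(st.Mtim.Nsec)},
		Ctime: unix.StatxTimestamp{Sec: st.Ctim.Sec, Nsec: uint32(st.Ctim.Nsec)},
	}
	return nil
}
```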
Checklist
- [ ] I have made corresponding changes to the documentation
- [ ] I have made corresponding changes to the default configuration files
- [ ] I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.