detect consecutive timeouts without events and alert accordingly to a configurable value #1622

leodido · 2021-04-16T10:32:44Z

What type of PR is this?

/kind feature

Any specific area of the project related to this PR?

NONE

What this PR does / why we need it:

This PR lets Falco detect a very uncommon situation, a long run of consecutive timeouts with no event, and alert the user about it.

As most readers already know, Falco receives events from the drivers through the libraries.

Not all the events that the libraries emit are of interest to Falco.
For this reason, and for other more complex ones (e.g., a timeout while reading an event from the ring buffer),
Falco receives timeouts (`SCAP_TIMEOUT`).

In the majority of cases, when Falco receives a timeout it also receives an event, which Falco discards.

But if Falco receives too many consecutive timeouts without an event, something is likely going wrong at a lower level.

These code changes let the user configure when Falco treats such a run of timeouts as a problem and alerts.

Through the `syscall_event_timeouts.max_consecutives` config field, the user sets how many consecutive timeouts without an event Falco tolerates before emitting an alert (with DEBUG priority).

Besides the message, the alert contains the current time and the time of the last processed (`SCAP_SUCCESS`) event (if available, otherwise "none").
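To make the behavior described above concrete, here is a minimal sketch of the counting logic. It is not the code in this PR: `scap_result`, `timeout_watcher`, and `on_result` are hypothetical names, and two details are assumptions, namely that a timeout carrying an event neither counts nor resets the run, and that the counter restarts after each alert (which matches the repeated notifications in the Reproduce output).

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-ins for the two libscap results discussed above.
enum class scap_result { success, timeout };

// Hypothetical helper, not the PR's code: tracks timeouts that carry no event.
class timeout_watcher
{
public:
	explicit timeout_watcher(uint32_t max_consecutives) : m_max(max_consecutives) {}

	// Feed one iteration of the capture loop. `now` is the current time as text.
	// Returns the alert text when the threshold is reached, "" otherwise.
	std::string on_result(scap_result res, bool has_event, const std::string& now)
	{
		if(res == scap_result::success)
		{
			m_last_event_time = now; // time of the last processed event
			m_count = 0;
			return "";
		}
		if(has_event)
		{
			// Timeout with an event to discard: the common case.
			// Assumption: it neither counts nor resets the run.
			return "";
		}
		if(++m_count < m_max)
		{
			return "";
		}
		m_count = 0; // restart, so the alert repeats while the condition lasts
		return "Falco internal: timeouts notification. " + std::to_string(m_max) +
		       " consecutive timeouts without event. (last_event_time=" +
		       (m_last_event_time.empty() ? std::string("none") : m_last_event_time) + ")";
	}

private:
	uint32_t m_max;
	uint32_t m_count = 0;
	std::string m_last_event_time; // empty until an event has been processed
};
```

With `timeout_watcher w(1000);`, the caller would invoke `w.on_result(...)` once per loop iteration and emit a DEBUG message whenever the returned string is not empty.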

Which issue(s) this PR fixes:

NONE

Special notes for your reviewer:

On my CPU, the default value of 1000 for the `max_consecutives` config field maps to an interval of roughly 30-40 seconds (depending on the system load too).

In this scenario, Falco alerts if it has not processed any event for 30-40 seconds.

I'm wondering whether we're good with this value or whether we should double it.

Reproduce

The only simple way to reproduce this unlikely situation and test this PR is the following.

Start Falco in userspace mode:

sudo ./build/userspace/falco/falco -r rules/falco_rules.yaml -u

Send it some userspace events through this userspace producer:

sudo ./userspace-example // don't pay attention to the sudo for now, it's an example tool

Observe Falco receiving a fake renameat event:

10:00:44.036991000: Warning Shell history had been deleted or renamed (user=<NA> user_loginuid=-1 type=renameat command=<NA> fd.name=<NA> name=<NA> path=<NA> oldpath=/tmp/bash_history host (id=host))

Wait roughly 40 seconds:

10:01:22.199076638: Debug Falco internal: timeouts notification. 1000 consecutive timeouts without event. (last_event_time=10:00:44.036991000)
10:02:00.780884781: Debug Falco internal: timeouts notification. 1000 consecutive timeouts without event. (last_event_time=10:00:44.036991000)
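The timestamps above allow a quick sanity check of the 30-40 second estimate. The two notifications are 38.58 seconds apart, so with a threshold of 1000 each timeout without an event takes about 38.6 ms on this machine, and the first notification arrives about 38.16 seconds after the last processed event. The sketch below reproduces that arithmetic; `to_ns` and `scale_interval_ns` are hypothetical helpers, and the projection for a doubled threshold assumes the per-timeout duration stays constant, which depends on system load.

```cpp
#include <cstdint>
#include <cstdio>

// Parse "HH:MM:SS.nnnnnnnnn" (nine fractional digits, as in the log lines
// above) into nanoseconds since midnight. Returns -1 on malformed input.
int64_t to_ns(const char* ts)
{
	int h = 0, m = 0, s = 0;
	long long frac = 0;
	if(std::sscanf(ts, "%d:%d:%d.%9lld", &h, &m, &s, &frac) != 4)
	{
		return -1;
	}
	return ((h * 60LL + m) * 60LL + s) * 1000000000LL + frac;
}

// Scale a measured interval linearly to another threshold.
// Assumption: the duration of a single timeout does not change.
int64_t scale_interval_ns(int64_t interval_ns, int64_t measured_threshold, int64_t new_threshold)
{
	return interval_ns * new_threshold / measured_threshold;
}
```

For example, `scale_interval_ns(to_ns("10:02:00.780884781") - to_ns("10:01:22.199076638"), 1000, 2000)` gives 77163616286 ns, so doubling the default would mean roughly 77 seconds between notifications under the same load.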

Does this PR introduce a user-facing change?:

new: Falco outputs an alert in the unlikely situation it's receiving too many consecutive timeouts without an event
new: configuration field `syscall_event_timeouts.max_consecutives` to configure after how many consecutive timeouts without an event Falco must alert

…uts without an event is greater than a given threshold
The rationale is that when Falco gets a large number of consecutive timeouts without a valid event, something is going wrong. This is because the libs normally send timeouts to Falco (also) to signal events to discard. In such cases, which are the majority, `ev` exists and is not `null`.
Signed-off-by: Leonardo Di Donato <[email protected]>
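Since `ev` is almost always non-null on a timeout, the null case is a natural candidate for a branch-prediction hint, which is what the likely/unlikely macros in this PR's commit list are for. The sketch below shows one common way to write such macros using the GCC/Clang builtin `__builtin_expect`. The upper-case macro names, the `event` placeholder, and `is_empty_timeout` are hypothetical and are not taken from the PR's diff.

```cpp
#include <cstddef>

// Assumed definitions: GCC and Clang provide __builtin_expect;
// other compilers fall back to the plain condition.
#if defined(__GNUC__) || defined(__clang__)
#define LIKELY(x) __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define LIKELY(x) (!!(x))
#define UNLIKELY(x) (!!(x))
#endif

struct event {}; // hypothetical placeholder for the library's event type

// Returns true when a timeout arrived with no event attached.
// The hint tells the compiler that a null event is the rare path.
inline bool is_empty_timeout(bool timed_out, const event* ev)
{
	return timed_out && UNLIKELY(ev == nullptr);
}
```

The hint only influences code layout and never changes the result of the condition, so the check behaves the same with or without it.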

Falco uses a shared buffer between the kernel and userspace to receive events (e.g., system call information) in userspace. However, the underlying libraries can also time out for various reasons. For example, there could have been issues while reading an event, or the particular event needs to be skipped.
Normally, it is very unlikely that Falco gets many consecutive timeouts without receiving an event. Falco is able to detect such an uncommon situation. Here you can configure the maximum number of consecutive timeouts without an event after which you want Falco to alert. By default this value is set to 1000 consecutive timeouts without any event at all.
Signed-off-by: Leonardo Di Donato <[email protected]>

leodido · 2021-04-16T13:09:35Z

/milestone 0.28.1

leodido · 2021-04-16T16:04:12Z

/cc @fntlnz

poiana · 2021-04-19T09:05:50Z

LGTM label has been added.

Git tree hash: c4d81e7bfbf8b63ea3dd5ee7986a910b15138a84

poiana · 2021-04-19T14:56:18Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fntlnz, leogr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [fntlnz,leogr]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

leodido and others added 5 commits April 15, 2021 10:31

new(userspace/engine): likely/unlikely macros in utils

00403dd

Signed-off-by: Leonardo Di Donato <[email protected]>

update(userspace/falco): a null event when there's a timeout is unlikely

8fafd34

Co-authored-by: Lorenzo Fontana <[email protected]>
Signed-off-by: Leonardo Di Donato <[email protected]>

update(userspace/falco): print out current time when a timeouts notif…

a33b71b

…ication gets emitted
Also, print out the time of the last processed event in the output fields of the notification.
Signed-off-by: Leonardo Di Donato <[email protected]>

poiana added release-note kind/feature dco-signoff: yes labels Apr 16, 2021

poiana requested review from Kaizhe and leogr April 16, 2021 10:32

poiana added the size/M label Apr 16, 2021

update(userspace/falco): handle the case there wasn't been any previo…

7bcd4a3

…usly processed event
Signed-off-by: Leonardo Di Donato <[email protected]>

leodido force-pushed the feature/detect-consecutive-timeouts branch from 71b2ba0 to 7bcd4a3 Compare April 16, 2021 10:40

poiana added this to the 0.28.1 milestone Apr 16, 2021

poiana requested a review from fntlnz April 16, 2021 16:04

leogr approved these changes Apr 19, 2021

poiana assigned leogr Apr 19, 2021

poiana added the lgtm label Apr 19, 2021

poiana added the approved label Apr 19, 2021

fntlnz approved these changes Apr 19, 2021

poiana assigned fntlnz Apr 19, 2021

poiana merged commit 600501e into master Apr 19, 2021

poiana deleted the feature/detect-consecutive-timeouts branch April 19, 2021 14:56
