detect consecutive timeouts without events and alert accordingly to a configurable value #1622

Merged
merged 6 commits into from
Apr 19, 2021

leodido
Member

@leodido leodido commented Apr 16, 2021

What type of PR is this?

/kind feature

Any specific area of the project related to this PR?

NONE

What this PR does / why we need it:

This PR makes Falco able to detect a very uncommon situation and alert the user about it.

Falco receives events from the drivers through the libraries.

Not all the events that the libraries emit are of interest to Falco.
For this reason, and for other more complex reasons (e.g., a timeout while reading an event from the ring buffer),
Falco receives timeouts (`SCAP_TIMEOUT`).

In the majority of cases, when Falco receives a timeout it also receives an event that Falco discards.

However, if Falco receives too many consecutive timeouts without an event, something is likely going wrong at a lower level.

These code changes let the user configure when Falco should detect this unlikely situation and alert about it.

Through the `syscall_event_timeouts.max_consecutive` config field, the user sets the number of consecutive timeouts without an event after which Falco emits an alert (with DEBUG priority).

Besides the message, the alert contains the current time and the time of the last processed (`SCAP_SUCCESS`) event (if available, otherwise "none").
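
The detection logic described above can be sketched roughly as follows. This is a minimal Python illustration, not Falco's actual C++ implementation: the `Result` enum, the `TimeoutWatcher` class, and the `observe` method are hypothetical names. Only the idea of counting consecutive timeouts without an event, the default of 1000, and the shape of the message come from this PR. Resetting the counter after each alert, and leaving it unchanged when a timeout carries an event to discard, are assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List, Optional


class Result(Enum):
    """Hypothetical stand-ins for the result codes mentioned in this PR."""
    SUCCESS = auto()  # an event was read and processed
    TIMEOUT = auto()  # the read timed out


@dataclass
class TimeoutWatcher:
    """Counts consecutive timeouts that arrive without an event (illustrative only)."""
    max_consecutive: int = 1000  # default value proposed in this PR
    consecutive: int = 0
    last_event_time: Optional[str] = None
    alerts: List[str] = field(default_factory=list)

    def observe(self, result: Result, has_event: bool = False,
                event_time: Optional[str] = None) -> None:
        if result is Result.SUCCESS:
            # A processed event ends any run of timeouts.
            self.consecutive = 0
            self.last_event_time = event_time
            return
        if has_event:
            # Common case: the timeout comes with an event to discard.
            # Assumption: it is not counted and does not reset the counter.
            return
        self.consecutive += 1
        if self.consecutive >= self.max_consecutive:
            last = self.last_event_time or "none"
            self.alerts.append(
                "Falco internal: timeouts notification. "
                f"{self.consecutive} consecutive timeouts without event. "
                f"(last_event_time={last})"
            )
            # Assumption: start counting again after alerting.
            self.consecutive = 0
```

Feeding the watcher one `Result.SUCCESS` followed by `max_consecutive` timeouts without an event appends a single alert that carries the time of that last event. With no prior event, the alert reports `last_event_time=none`.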

Which issue(s) this PR fixes:

NONE

Special notes for your reviewer:

On my CPU, the default value of 1000 for the max_consecutives config value maps to an interval of roughly 30-40 seconds (depending on the system load too).

In this scenario, Falco alerts if it has not processed any event for 30-40 seconds.

I'm wondering whether we're good with this value or whether we should double it.
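
To reason about doubling the value, a rough back-of-the-envelope sketch may help. It assumes that timeouts arrive at a roughly constant rate, so the alert interval scales linearly with the threshold. That is a simplification, since the rate depends on the machine and its load. The function name `scaled_interval` is hypothetical.

```python
def scaled_interval(observed_seconds: float, observed_threshold: int,
                    new_threshold: int) -> float:
    """Estimate the alert interval for a new threshold.

    Assumes a constant time per timeout, derived from one observation.
    """
    seconds_per_timeout = observed_seconds / observed_threshold
    return seconds_per_timeout * new_threshold


# Observation from this PR: a threshold of 1000 maps to roughly 30-40 seconds.
for observed in (30.0, 40.0):
    estimate = scaled_interval(observed, 1000, 2000)
    print(f"1000 -> {observed:.0f}s, 2000 -> about {estimate:.0f}s")
```

Under that assumption, a threshold of 2000 would mean Falco stays silent for roughly 60-80 seconds before alerting on the same machine.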

Reproduce

The only simple way to reproduce this unlikely situation and test this PR is the following.

  1. Start Falco in userspace mode:

         sudo ./build/userspace/falco/falco -r rules/falco_rules.yaml -u

  2. Send it some userspace events through this userspace producer (don't pay attention to the sudo for now, it's an example tool):

         sudo ./userspace-example

  3. Observe Falco receiving a fake renameat event:

         10:00:44.036991000: Warning Shell history had been deleted or renamed (user=<NA> user_loginuid=-1 type=renameat command=<NA> fd.name=<NA> name=<NA> path=<NA> oldpath=/tmp/bash_history host (id=host))

  4. Wait roughly 40 seconds:

         10:01:22.199076638: Debug Falco internal: timeouts notification. 1000 consecutive timeouts without event. (last_event_time=10:00:44.036991000)
         10:02:00.780884781: Debug Falco internal: timeouts notification. 1000 consecutive timeouts without event. (last_event_time=10:00:44.036991000)
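
To check how long Falco actually went without events, the two notification lines above can be parsed with a small script. This is a sketch that assumes the exact line format shown in this PR. The regular expression and the helper names `to_ns` and `parse_notification` are hypothetical, and times that wrap past midnight are not handled.

```python
import re
from typing import Optional

# Matches the notification format shown in the sample output of this PR.
LINE_RE = re.compile(
    r"^(?P<ts>\d{2}:\d{2}:\d{2}\.\d+): Debug Falco internal: "
    r"timeouts notification\. (?P<count>\d+) consecutive timeouts "
    r"without event\. \(last_event_time=(?P<last>[^)]+)\)$"
)


def to_ns(ts: str) -> int:
    """Convert 'HH:MM:SS.fffffffff' to nanoseconds since midnight."""
    hms, frac = ts.split(".")
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    nanos = int(frac.ljust(9, "0")[:9])  # normalize to 9 fractional digits
    return ((hours * 60 + minutes) * 60 + seconds) * 1_000_000_000 + nanos


def parse_notification(line: str) -> Optional[dict]:
    """Return the parsed fields of a notification line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    last = match.group("last")
    return {
        "time_ns": to_ns(match.group("ts")),
        "count": int(match.group("count")),
        "last_event_ns": None if last == "none" else to_ns(last),
    }


first = parse_notification(
    "10:01:22.199076638: Debug Falco internal: timeouts notification. "
    "1000 consecutive timeouts without event. (last_event_time=10:00:44.036991000)"
)
silence_s = (first["time_ns"] - first["last_event_ns"]) / 1e9
print(f"{silence_s:.3f} seconds since the last event")
```

For the first notification this gives a gap of about 38 seconds since the renameat event, in line with the 30-40 second estimate for 1000 timeouts.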

Does this PR introduce a user-facing change?:

new: Falco outputs an alert in the unlikely situation it's receiving too many consecutive timeouts without an event
new: configuration field `syscall_event_timeouts.max_consecutive` to configure after how many consecutive timeouts without an event Falco must alert

leodido and others added 5 commits April 15, 2021 10:31
…uts without an event is greater than a given threshold

The rationale is that if Falco obtains a considerable number of
consecutive timeouts without a valid event, something is
going wrong.

This is because, normally, the libs send timeouts to Falco (also) to signal events to discard.
In such cases, which are the majority, `ev` exists and is not
`null`.

Signed-off-by: Leonardo Di Donato <[email protected]>
Falco uses a shared buffer between the kernel and userspace to receive
the events (eg., system call information) in userspace.
However, the underlying libraries can also time out for various reasons.
For example, there could have been issues while reading an event.
Or the particular event needs to be skipped.
Normally, it's very unlikely that Falco receives many consecutive timeouts without any event.
Falco is able to detect such an uncommon situation.
Here you can configure the maximum number of consecutive timeouts without an event
after which you want Falco to alert.
By default, this value is set to 1000 consecutive timeouts without any event.

Signed-off-by: Leonardo Di Donato <[email protected]>
…ication gets emitted

Also, print out the time of the last processed event in the output
fields of the notification.

Signed-off-by: Leonardo Di Donato <[email protected]>
…usly processed event

Signed-off-by: Leonardo Di Donato <[email protected]>
@leodido leodido force-pushed the feature/detect-consecutive-timeouts branch from 71b2ba0 to 7bcd4a3 Compare April 16, 2021 10:40
@leodido
Member Author

leodido commented Apr 16, 2021

/milestone 0.28.1

@poiana poiana added this to the 0.28.1 milestone Apr 16, 2021
@leodido
Member Author

leodido commented Apr 16, 2021

/cc @fntlnz

@poiana poiana requested a review from fntlnz April 16, 2021 16:04
@poiana
Contributor

poiana commented Apr 19, 2021

LGTM label has been added.

Git tree hash: c4d81e7bfbf8b63ea3dd5ee7986a910b15138a84

@poiana
Contributor

poiana commented Apr 19, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fntlnz, leogr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@poiana poiana merged commit 600501e into master Apr 19, 2021
@poiana poiana deleted the feature/detect-consecutive-timeouts branch April 19, 2021 14:56