
>50% kube node CPU spike with Falco deployed #1710

Closed
stephanmiehe opened this issue Aug 19, 2021 · 11 comments

Comments

@stephanmiehe
Contributor

Describe the bug

We're seeing Falco cause a significant increase in CPU utilization when deployed as a DaemonSet to our Kubernetes clusters. We removed all rules from the deployment to troubleshoot, but the resource utilization remains the same, which suggests it's unrelated to our ruleset. We have 1000+ pods and 100+ nodes. The largest jump is generally seen in the kube-system namespace, yet around ten other namespaces (our custom application workloads) also spike.

How to reproduce it

Deploy Falco 0.29.1 to a large Kubernetes cluster with the eBPF probe enabled, roughly as sketched below.
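
A minimal sketch of such a deployment, assuming the falcosecurity Helm chart with its ebpf.enabled and image.tag values (exact value names may differ across chart versions):

# Sketch: deploy Falco 0.29.1 as a DaemonSet with the eBPF probe enabled.
# Chart value names (ebpf.enabled, image.tag) are assumptions; verify them for your chart version.
helm repo add falcosecurity https://falcosecurity.github.io/charts
helm repo update
helm install falco falcosecurity/falco \
  --namespace falco --create-namespace \
  --set image.tag=0.29.1 \
  --set ebpf.enabled=true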

Expected behaviour

Based on the documentation, this CPU consumption is significantly higher than what is expected for a cluster of this size; we would expect roughly a 5% CPU increase.

Screenshots

[Screenshot: Screen Shot 2021-08-19 at 2:19:05 PM]

[Screenshot: Screen Shot 2021-08-19 at 3:28:18 PM]

Environment

  • Falco version: 0.29.1 (also tried the branch from new: k8s node filtering #1671)
    Falco version: 0.29.1-2+6016c59
    Driver version: f7029e2522cc4c81841817abeeeaa515ed944b6c
  • System info:
Thu Aug 19 22:50:26 2021: Falco version 0.29.1-2+6016c59 (driver version f7029e2522cc4c81841817abeeeaa515ed944b6c)
Thu Aug 19 22:50:26 2021: Falco initialized with configuration file /etc/falco/falco.yaml
Thu Aug 19 22:50:26 2021: Loading rules from file /etc/falco/falco_rules.yaml:
Thu Aug 19 22:50:26 2021: Loading rules from file /etc/falco/falco_rules.local.yaml:
{
  "machine": "x86_64",
  "nodename": "falco-mh7n8",
  "release": "4.19.0-0.bpo.17-amd64",
  "sysname": "Linux",
  "version": "#1 SMP Debian 4.19.194-1~deb9u1 (2021-06-20)"
}
  • Cloud provider or hardware configuration:
  • OS: Debian GNU/Linux 9 (stretch)
  • Kernel: 4.19.0-0.bpo.17-amd64 #1 SMP Debian 4.19.194-3~deb9u1 (2021-07-19) x86_64 GNU/Linux
  • Installation method: Kubernetes Daemonset. Kubernetes server v1.15.4

Additional context

@holyspectral

Not sure if related, but when I tried Falco 0.26.0 with the eBPF probe, I also noticed performance degradation and syscall event drops in my cluster. After some investigation, it looks like a huge amount of events were coming from the getsockopt() syscall. Removing that syscall from here improved performance a lot. I didn't find a way to change this option without changing code, so I built Falco from source. Hope this helps.
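
A rough way to check whether getsockopt() dominates the event stream on a node (just a sketch; requires perf and root on the node) is to count the syscall at the tracepoint level for a few seconds:

# Sketch: count getsockopt() calls system-wide for 10 seconds via the syscalls tracepoint.
perf stat -e 'syscalls:sys_enter_getsockopt' -a -- sleep 10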

@leogr
Member

leogr commented Aug 20, 2021

Is 0.28.0 affected by this issue too?

@MattUebel

We discussed a bit in Slack, but I found similar behavior with 0.28.0.
[Screenshot: node CPU utilization over time]
The left side of the graph is with 0.29.1, then a drop in CPU utilization where I deleted Falco, and then an increase again with the 0.28.0 deployment.

@MattUebel

I did a deploy of docker.io/falcosecurity/falco:0.29.1 earlier with no arguments, to see what would happen to kube-system namespace CPU utilization without the -k and -K options.

Also, no rules were enabled.

Saw a similar response as before. Here's CPU capacity for the nodes in the cluster; the dip is the duration of the deploy.
[Screenshot: node CPU capacity during the test deploy]

Some things I'm going to try to figure out; any advice on these would be much appreciated 🙏

  • Can we get perf stats out of Falco, e.g., the worst-performing syscall captures?
  • Can we exclude specific namespaces? (probably a k8s config, if it's possible for a DaemonSet)
  • What about the kernel module? Does it perform better than the eBPF probe? (a rough sketch of switching is below)
  • What about an OS-level install of Falco?
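
For the stats and kernel-module questions, a rough sketch of redeploying with the kernel module while collecting Falco's own event statistics, assuming the chart exposes ebpf.enabled and extraArgs values and that this Falco version supports the -s and --stats-interval flags:

# Sketch: switch from the eBPF probe to the kernel module, and have Falco
# append internal event statistics to a file every 5000 ms.
# Value names and flags are assumptions; double-check them for your chart/Falco version.
helm upgrade falco falcosecurity/falco \
  --namespace falco \
  --set ebpf.enabled=false \
  --set "extraArgs={-s,/var/log/falco_stats.log,--stats-interval,5000}"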

@MattUebel

This was discussed yesterday in the community call, and today I believe I can confirm the root cause here was the sysctl net.core.bpf_jit_enable set to its default of 0.

Setting it to 1 resulted in a dramatic decrease in CPU utilization, leaving the CPU increase perhaps in the low single digits.
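
For anyone hitting the same thing, enabling the BPF JIT on a node looks roughly like this (a sketch; how you persist it depends on your node provisioning):

# Check the current value (0 = interpreter only, 1 = JIT enabled).
sysctl net.core.bpf_jit_enable

# Enable the JIT on the running node.
sudo sysctl -w net.core.bpf_jit_enable=1

# Persist it across reboots.
echo 'net.core.bpf_jit_enable = 1' | sudo tee /etc/sysctl.d/99-bpf-jit.conf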

@Issif
Member

Issif commented Sep 5, 2021

This was discussed yesterday in the community call, and today I believe I can confirm the root cause here was the sysctl net.core.bpf_jit_enable set to its default of 0.

Setting it to 1 resulted in a dramatic decrease in CPU utilization, leaving the CPU increase perhaps in the low single digits.

Seems relevant to add this to the main documentation, in a section about tuning. WDYT? cc @leogr @danpopSD

@MattUebel

Seems relevant to add this to the main documentation, in a section about tuning. WDYT? cc @leogr @danpopSD

👋 @Issif I agree, and I made https://github.com/falcosecurity/falco/issues/1721 to track updating the docs 😄

@leogr
Member

leogr commented Sep 6, 2021

Great finding, thank you all!

@MattUebel Also, thank you for having opened that issue. Btw, I will move it to the falco-website repository.

Anyway, I agree we definitely need to update the documentation to cover this case. Is anyone willing to make a PR? 😸

@poiana
Contributor

poiana commented Dec 5, 2021

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

@stephanmiehe
Contributor Author

/close

@poiana
Contributor

poiana commented Dec 5, 2021

@stephanmiehe: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@poiana poiana closed this as completed Dec 5, 2021