
>50% kube node CPU spike with Falco deployed #1710

Closed
stephanmiehe opened this issue Aug 19, 2021 · 11 comments

Comments

@stephanmiehe
Contributor

Describe the bug

We're seeing Falco cause a significant increase in CPU utilization when deployed as a DaemonSet to our Kubernetes clusters. We removed all rules from the deployment to troubleshoot, but the resource utilization remains the same, which suggests it's unrelated to our ruleset. We have 1000+ pods and 100+ nodes. The largest jump is generally seen in the kube-system namespace, yet around ten other namespaces (our custom application workloads) also spike.

How to reproduce it

Deploy Falco 0.29.1 to a large Kubernetes cluster with the eBPF probe enabled, roughly as sketched below.
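
A minimal sketch of such a deployment, assuming the falcosecurity Helm chart with its ebpf.enabled and image.tag values (exact value names may differ across chart versions):

# Sketch: deploy Falco 0.29.1 as a DaemonSet with the eBPF probe enabled.
# Chart value names (ebpf.enabled, image.tag) are assumptions; verify them for your chart version.
helm repo add falcosecurity https://falcosecurity.github.io/charts
helm repo update
helm install falco falcosecurity/falco \
  --namespace falco --create-namespace \
  --set image.tag=0.29.1 \
  --set ebpf.enabled=true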

Expected behaviour

Based on the documentation, this CPU consumption is significantly higher than what is expected for a cluster of this size; we would expect roughly a 5% CPU increase.

Screenshots

[Screenshot: Screen Shot 2021-08-19 at 2:19:05 PM]

[Screenshot: Screen Shot 2021-08-19 at 3:28:18 PM]

Environment

  • Falco version: 0.29.1 (also tried the branch from new: k8s node filtering #1671)
    Falco version: 0.29.1-2+6016c59
    Driver version: f7029e2522cc4c81841817abeeeaa515ed944b6c
  • System info:
Thu Aug 19 22:50:26 2021: Falco version 0.29.1-2+6016c59 (driver version f7029e2522cc4c81841817abeeeaa515ed944b6c)
Thu Aug 19 22:50:26 2021: Falco initialized with configuration file /etc/falco/falco.yaml
Thu Aug 19 22:50:26 2021: Loading rules from file /etc/falco/falco_rules.yaml:
Thu Aug 19 22:50:26 2021: Loading rules from file /etc/falco/falco_rules.local.yaml:
{
  "machine": "x86_64",
  "nodename": "falco-mh7n8",
  "release": "4.19.0-0.bpo.17-amd64",
  "sysname": "Linux",
  "version": "#1 SMP Debian 4.19.194-1~deb9u1 (2021-06-20)"
}
  • Cloud provider or hardware configuration:
  • OS: Debian GNU/Linux 9 (stretch)
  • Kernel: 4.19.0-0.bpo.17-amd64 #1 SMP Debian 4.19.194-3~deb9u1 (2021-07-19) x86_64 GNU/Linux
  • Installation method: Kubernetes Daemonset. Kubernetes server v1.15.4

Additional context

@holyspectral

Not sure if related, but when I tried Falco 0.26.0 with the eBPF probe, I also noticed performance degradation and syscall event drops in my cluster. After some investigation, it looks like a huge amount of events were coming from the getsockopt() syscall. Removing that syscall from here improved performance a lot. I didn't find a way to change this option without changing code, so I built Falco from source. Hope this helps.
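
A rough way to check whether getsockopt() dominates the event stream on a node (just a sketch; requires perf and root on the node) is to count the syscall at the tracepoint level for a few seconds:

# Sketch: count getsockopt() calls system-wide for 10 seconds via the syscalls tracepoint.
perf stat -e 'syscalls:sys_enter_getsockopt' -a -- sleep 10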

@leogr
Member

leogr commented Aug 20, 2021

Is 0.28.0 affected by this issue too?

@MattUebel

We discussed a bit in Slack, but I found similar behavior with 0.28.0.
[Screenshot: node CPU utilization over time]
The left side of the graph is with 0.29.1, then a drop in CPU utilization where I deleted Falco, and then an increase again with the 0.28.0 deployment.

@MattUebel

I did a deploy of docker.io/falcosecurity/falco:0.29.1 earlier with no arguments, to see what would happen to kube-system namespace CPU utilization without the -k and -K options.

Also, no rules were enabled.

Saw a similar response as before. Here's CPU capacity for the nodes in the cluster; the dip is the duration of the deploy.
[Screenshot: node CPU capacity during the test deploy]

Some things I'm going to try to figure out; any advice on these would be much appreciated 🙏

  • Can we get perf stats out of Falco, e.g., the worst-performing syscall captures?
  • Can we exclude specific namespaces? (probably a k8s config, if it's possible for a DaemonSet)
  • What about the kernel module? Does it perform better than the eBPF probe? (a rough sketch of switching is below)
  • What about an OS-level install of Falco?
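
For the stats and kernel-module questions, a rough sketch of redeploying with the kernel module while collecting Falco's own event statistics, assuming the chart exposes ebpf.enabled and extraArgs values and that this Falco version supports the -s and --stats-interval flags:

# Sketch: switch from the eBPF probe to the kernel module, and have Falco
# append internal event statistics to a file every 5000 ms.
# Value names and flags are assumptions; double-check them for your chart/Falco version.
helm upgrade falco falcosecurity/falco \
  --namespace falco \
  --set ebpf.enabled=false \
  --set "extraArgs={-s,/var/log/falco_stats.log,--stats-interval,5000}"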

@MattUebel

This was discussed yesterday in the community call, and today I believe I can confirm the root cause here was the sysctl net.core.bpf_jit_enable set to its default of 0.

Setting it to 1 resulted in a dramatic decrease in CPU utilization, leaving the CPU increase perhaps in the low single digits.
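
For anyone hitting the same thing, enabling the BPF JIT on a node looks roughly like this (a sketch; how you persist it depends on your node provisioning):

# Check the current value (0 = interpreter only, 1 = JIT enabled).
sysctl net.core.bpf_jit_enable

# Enable the JIT on the running node.
sudo sysctl -w net.core.bpf_jit_enable=1

# Persist it across reboots.
echo 'net.core.bpf_jit_enable = 1' | sudo tee /etc/sysctl.d/99-bpf-jit.conf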

@Issif
Member

Issif commented Sep 5, 2021

This was discussed yesterday in the community call, and today I believe I can confirm the root cause here was the sysctl net.core.bpf_jit_enable set to its default of 0.

Setting it to 1 resulted in a dramatic decrease in CPU utilization, leaving the CPU increase perhaps in the low single digits.

Seems relevant to add this to the main documentation, in a section about tuning. WDYT? cc @leogr @danpopSD

@MattUebel

Seems relevant to add this to the main documentation, in a section about tuning. WDYT? cc @leogr @danpopSD

👋 @Issif I agree, and I made https://github.com/falcosecurity/falco/issues/1721 to track updating the docs 😄

@leogr
Member

leogr commented Sep 6, 2021

Great finding, thank you all!

@MattUebel Also, thank you for having opened that issue. Btw, I will move it to the falco-website repository.

Anyway, I agree we definitely need to update the documentation to cover this case. Is anyone willing to make a PR? 😸

@poiana
Contributor

poiana commented Dec 5, 2021

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

@stephanmiehe
Contributor Author

/close

@poiana
Contributor

poiana commented Dec 5, 2021

@stephanmiehe: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@poiana poiana closed this as completed Dec 5, 2021