Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

High CPU Usage when using Falco v0.28.0 and higher #2996

Closed
Junia18 opened this issue Jan 8, 2024 · 17 comments
Closed

High CPU Usage when using Falco v0.28.0 and higher #2996

Junia18 opened this issue Jan 8, 2024 · 17 comments

Comments

@Junia18
Copy link

Junia18 commented Jan 8, 2024

I am encountering a critical issue while running Falco v0.36.2 as a daemonset on an Openshift cluster (OCP v4.12) with three worker nodes. The problem manifests as a substantial drop rate of events and a notable doubling of CPU usage across all three worker nodes.

This behavior is particularly concerning as it impacts the effective functioning of Falco and the overall performance of the cluster. The kernel probe is being utilized in this configuration.

Steps to Reproduce:

  • Deploy Falco v0.36.2 as a daemonset on an Openshift cluster (OCP v4.12) with three worker nodes.
  • Observe the event drop rate and CPU usage on each worker node.

CPU usage before (when using Falco versions upto v0.26.2)

> oc adm top nodes
NAME                             CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
master0.sweden.cp.fyre.ibm.com   1577m        21%    10056Mi         69%
master1.sweden.cp.fyre.ibm.com   1301m        17%    12124Mi         83%
master2.sweden.cp.fyre.ibm.com   899m         11%    10133Mi         69%
worker0.sweden.cp.fyre.ibm.com   2022m        13%    12960Mi         42%
worker1.sweden.cp.fyre.ibm.com   2422m        15%    7887Mi          26%
worker2.sweden.cp.fyre.ibm.com   1844m        11%    11767Mi         38%

CPU usage after (when using Falco versions > v0.27.0)

> oc adm top nodes
NAME                               CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
master0.normandy.cp.fyre.ibm.com   2014m        26%    8543Mi          58%
master1.normandy.cp.fyre.ibm.com   1090m        14%    7409Mi          51%
master2.normandy.cp.fyre.ibm.com   1506m        20%    10557Mi         72%
worker0.normandy.cp.fyre.ibm.com   4468m        28%    10971Mi         36%
worker1.normandy.cp.fyre.ibm.com   4059m        26%    12588Mi         41%
worker2.normandy.cp.fyre.ibm.com   4043m        26%    12829Mi         42%

Logs

Fri Jan  5 09:23:35 2024: Setting metadata download max size to 100 MB
Fri Jan  5 09:23:35 2024: Setting metadata download chunk wait time to 1000 μs
Fri Jan  5 09:23:35 2024: Setting metadata download watch frequency to 1 seconds
Fri Jan  5 09:23:35 2024: (12) syscalls in rules: execve, kill, link, linkat, mkdir, mkdirat, open, openat, rename, rmdir, unlink, unlinkat
Fri Jan  5 09:23:35 2024: +(58) syscalls (Falco's state engine set of syscalls): accept, accept4, bind, capset, chdir, chroot, clone, clone3, close, connect, creat, dup, dup2, dup3, epoll_create, epoll_create1, eventfd, eventfd2, execveat, fchdir, fcntl, fork, getsockopt, inotify_init, inotify_init1, io_uring_setup, memfd_create, mount, open_by_handle_at, openat2, pidfd_getfd, pidfd_open, pipe, pipe2, prctl, prlimit, procexit, recvfrom, recvmsg, sendmsg, sendto, setgid, setpgid, setresgid, setresuid, setrlimit, setsid, setuid, shutdown, signalfd, signalfd4, socket, socketpair, timerfd_create, umount, umount2, userfaultfd, vfork
Fri Jan  5 09:23:35 2024: (70) syscalls selected in total (final set): accept, accept4, bind, capset, chdir, chroot, clone, clone3, close, connect, creat, dup, dup2, dup3, epoll_create, epoll_create1, eventfd, eventfd2, execve, execveat, fchdir, fcntl, fork, getsockopt, inotify_init, inotify_init1, io_uring_setup, kill, link, linkat, memfd_create, mkdir, mkdirat, mount, open, open_by_handle_at, openat, openat2, pidfd_getfd, pidfd_open, pipe, pipe2, prctl, prlimit, procexit, recvfrom, recvmsg, rename, rmdir, sendmsg, sendto, setgid, setpgid, setresgid, setresuid, setrlimit, setsid, setuid, shutdown, signalfd, signalfd4, socket, socketpair, timerfd_create, umount, umount2, unlink, unlinkat, userfaultfd, vfork
Fri Jan  5 09:23:35 2024: The chosen syscall buffer dimension is: 8388608 bytes (8 MBs)
Fri Jan  5 09:23:35 2024: Loaded event sources: syscall
Fri Jan  5 09:23:35 2024: Enabled event sources: syscall
Fri Jan  5 09:23:35 2024: Opening event source 'syscall'
Fri Jan  5 09:23:35 2024: Opening 'syscall' source with Kernel module
Fri Jan  5 09:24:03 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:03 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:03 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:03 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:03 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:03 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:03 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:04 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:04 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:04 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:04 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:04 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:04 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:05 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:05 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:05 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:05 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:05 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:05 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:06 2024: Syscall event drop but token bucket depleted, skipping actions

Environment

  • Falco version: 0.36.2

  • System info:
    "machine": "x86_64",
    "release": "4.18.0-372.73.1.el8_6.x86_64",
    "sysname": "Linux",
    "version": "Digwatch compiler #1 SMP Fri Sep 8 13:16:27 EDT 2023"

  • Cloud provider or hardware configuration:

  • OS:
    NAME="Red Hat Enterprise Linux"
    VERSION="8.8 (Ootpa)"
    ID="rhel"
    ID_LIKE="fedora"
    VERSION_ID="8.8"
    PLATFORM_ID="platform:el8"
    PRETTY_NAME="Red Hat Enterprise Linux 8.8 (Ootpa)"
    ANSI_COLOR="0;31"
    CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"

  • Kernel:
    Linux 4.18.0-372.73.1.el8_6.x86_64 Digwatch compiler #1 SMP Fri Sep 8 13:16:27 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux

  • Installation method: from source

@Andreagit97
Copy link
Member

Andreagit97 commented Jan 8, 2024

ei @Junia18 thank you for reporting! I see that in Falco 0.36.2 you have the k8s metadata enrichment enabled. This unfortunately could cause huge drops... not sure which deployment method you are using, but the k8s enrichment could be disabled by removing the -k/-K flags if you are running falco binary manually, or with the following command if you are using the default falco helm chart

helm install falco falcosecurity/falco  --set collectors.kubernetes.enabled=false 

Could you disable it and retry please?

@Andreagit97 Andreagit97 added this to the TBD milestone Jan 8, 2024
@Junia18
Copy link
Author

Junia18 commented Jan 10, 2024

@Andreagit97 I have removed the -k and -K flags, but the -pk flag is still needed as we are consuming Kubernetes metadata. Despite these changes, CPU utilization remains high. We are running the Falco binary using the following command:

falco -K /var/run/secrets/kubernetes.io/serviceaccount/token -k https://$(KUBERNETES_SERVICE_HOST) --cri /var/run/crio/crio.sock -pk -v

Is there any other solution I can try?

@alacuku
Copy link
Member

alacuku commented Jan 10, 2024

Hi @Junia18, could you try the new k8s-metacollector and k8smeta plugin? We are going to ship them with falco 0.37 but you can test them with falco 0.36.2. Here you can find the modified chart: falcosecurity/charts#598

For more info please see the following issue: #2973

Let me know if you need any help!

@Junia18
Copy link
Author

Junia18 commented Jan 11, 2024

@alacuku I was wondering if there's any information available about the anticipated release date for Falco v0.37.0. I appreciate your time and assistance.

@alacuku
Copy link
Member

alacuku commented Jan 11, 2024

The expected release date for Falco 0.37 is the end of January 2024.

@poiana
Copy link
Contributor

poiana commented Apr 10, 2024

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

@Junia18
Copy link
Author

Junia18 commented Apr 16, 2024

/remove-lifecycle stale

@Junia18
Copy link
Author

Junia18 commented Apr 16, 2024

@Andreagit97 Despite upgrading to Falco v0.37.1, we continue to experience high CPU usage, accompanied by frequent syscall event drops in the logs.

CPU usage after upgrading Falco to v0.37.1

➜ oc adm top nodes
NAME                               CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
master0.brussels.cp.fyre.ibm.com   1458m        19%    11325Mi         78%       
master1.brussels.cp.fyre.ibm.com   1123m        14%    11305Mi         78%       
master2.brussels.cp.fyre.ibm.com   1851m        24%    9500Mi          65%       
worker0.brussels.cp.fyre.ibm.com   6779m        43%    15696Mi         51%       
worker1.brussels.cp.fyre.ibm.com   5537m        35%    17222Mi         56%       
worker2.brussels.cp.fyre.ibm.com   4758m        30%    14445Mi         47%

Logs

Tue Apr 16 14:51:20 2024: Falco version: 0.37.1 (x86_64)
Tue Apr 16 14:51:20 2024: CLI args: /usr/bin/falco -c /config/falco_config.yaml --cri /var/run/crio/crio.sock -pk -v
Tue Apr 16 14:51:20 2024: Falco initialized with configuration file: /config/falco_config.yaml
Tue Apr 16 14:51:20 2024: System info: Linux version 5.14.0-284.55.1.el9_2.x86_64 ([email protected]) (gcc (GCC) 11.3.1 20221121 (Red Hat 11.3.1-4), GNU ld version 2.35.2-37.el9) #1 SMP PREEMPT_DYNAMIC Mon Feb 19 16:57:59 EST 2024
Tue Apr 16 14:51:20 2024: Configured rules filenames:
Tue Apr 16 14:51:20 2024: /config/falco_rules.yaml
Tue Apr 16 14:51:20 2024: Loading rules from file /config/falco_rules.yaml
Tue Apr 16 14:51:20 2024: Watching file '/config/falco_config.yaml'
Tue Apr 16 14:51:20 2024: Watching file '/config/falco_rules.yaml'
Tue Apr 16 14:51:20 2024: (12) syscalls in rules: execve, kill, link, linkat, mkdir, mkdirat, open, openat, rename, rmdir, unlink, unlinkat
Tue Apr 16 14:51:20 2024: +(58) syscalls (Falco's state engine set of syscalls): accept, accept4, bind, capset, chdir, chroot, clone, clone3, close, connect, creat, dup, dup2, dup3, epoll_create, epoll_create1, eventfd, eventfd2, execveat, fchdir, fcntl, fork, getsockopt, inotify_init, inotify_init1, io_uring_setup, memfd_create, mount, open_by_handle_at, openat2, pidfd_getfd, pidfd_open, pipe, pipe2, prctl, prlimit, procexit, recvfrom, recvmsg, sendmsg, sendto, setgid, setpgid, setresgid, setresuid, setrlimit, setsid, setuid, shutdown, signalfd, signalfd4, socket, socketpair, timerfd_create, umount, umount2, userfaultfd, vfork
Tue Apr 16 14:51:20 2024: (70) syscalls selected in total (final set): accept, accept4, bind, capset, chdir, chroot, clone, clone3, close, connect, creat, dup, dup2, dup3, epoll_create, epoll_create1, eventfd, eventfd2, execve, execveat, fchdir, fcntl, fork, getsockopt, inotify_init, inotify_init1, io_uring_setup, kill, link, linkat, memfd_create, mkdir, mkdirat, mount, open, open_by_handle_at, openat, openat2, pidfd_getfd, pidfd_open, pipe, pipe2, prctl, prlimit, procexit, recvfrom, recvmsg, rename, rmdir, sendmsg, sendto, setgid, setpgid, setresgid, setresuid, setrlimit, setsid, setuid, shutdown, signalfd, signalfd4, socket, socketpair, timerfd_create, umount, umount2, unlink, unlinkat, userfaultfd, vfork
Tue Apr 16 14:51:20 2024: The chosen syscall buffer dimension is: 8388608 bytes (8 MBs)
Tue Apr 16 14:51:20 2024: Loaded event sources: syscall
Tue Apr 16 14:51:20 2024: Enabled event sources: syscall
Tue Apr 16 14:51:20 2024: Opening event source 'syscall'
Tue Apr 16 14:51:20 2024: Opening 'syscall' source with Kernel module
Tue Apr 16 14:54:01 2024: Syscall event drop but token bucket depleted, skipping actions
Tue Apr 16 14:54:04 2024: Syscall event drop but token bucket depleted, skipping actions
Tue Apr 16 14:54:06 2024: Syscall event drop but token bucket depleted, skipping actions
Tue Apr 16 14:54:07 2024: Syscall event drop but token bucket depleted, skipping actions
Tue Apr 16 14:54:08 2024: Syscall event drop but token bucket depleted, skipping actions
Tue Apr 16 14:55:02 2024: Syscall event drop but token bucket depleted, skipping actions

Running the Falco binary using the following command:
falco", "-c", "/config/falco_config.yaml", "--cri", "/var/run/docker.sock", "-pk", "-v"

New Environment:

  • Falco version: 0.37.1

  • System info:
    "machine": "x86_64",
    "release": "5.14.0-284.55.1.el9_2.x86_64",
    "sysname": "Linux",
    "version": "Digwatch compiler #1 SMP PREEMPT_DYNAMIC Mon Feb 19 16:57:59 EST 2024"

  • Cloud provider or hardware configuration:

  • OS:
    NAME="Red Hat Enterprise Linux"
    VERSION="8.9 (Ootpa)"
    ID="rhel"
    ID_LIKE="fedora"
    VERSION_ID="8.9"
    PLATFORM_ID="platform:el8"
    PRETTY_NAME="Red Hat Enterprise Linux 8.9 (Ootpa)"
    ANSI_COLOR="0;31"
    CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"

  • Kernel:
    Linux 5.14.0-284.55.1.el9_2.x86_64

@Junia18
Copy link
Author

Junia18 commented May 6, 2024

@Andreagit97 Just a gentle reminder that I am waiting for your input on this issue for over two weeks now. Any updates or insights you can provide would be greatly appreciated. Thanks!

@incertum
Copy link
Contributor

incertum commented May 6, 2024

Hi 👋 @Junia18 happy to also help since Andrea is constantly working on many areas. Between v0.28 and now a lot changed.

Recommending to follow our Troubleshooting guides https://falco.org/docs/troubleshooting/dropping/ and explore the base_syscalls option and other optimizations.

In your original comment I see that you are showing metrics from different clusters #2996 (comment) -- unfortunately it's hard to make truly fair comparisons and looking at the CPU usages it does not appear to be a huge delta. Perhaps also checkout this paragraph https://falco.org/docs/metrics/performance/#server-load-and-falco-event-drops stating that Falco's usage is never constant. In case you are interested in a more fair comparison: would you consider running the different Falco versions on the exact same nodes plus also export our internal metrics (https://falco.org/docs/metrics/falco-metrics/) just so we can check if event rates etc were similar.

Please let us know if the guides are helpful.

@incertum
Copy link
Contributor

incertum commented May 6, 2024

^ small correction: #2996 (comment) the internal metrics didn't yet exist in v0.28 so we can't make the more detailed comparisons.

@Andreagit97
Copy link
Member

ei @Junia18 as suggested by @incertum i would use the base_syscal.repairl option.

Looking at your actual situation, you are tracing 70 syscalls

Tue Apr 16 14:51:20 2024: (12) syscalls in rules: execve, kill, link, linkat, mkdir, mkdirat, open, openat, rename, rmdir, unlink, unlinkat
Tue Apr 16 14:51:20 2024: +(58) syscalls (Falco's state engine set of syscalls): accept, accept4, bind, capset, chdir, chroot, clone, clone3, close, connect, creat, dup, dup2, dup3, epoll_create, epoll_create1, eventfd, eventfd2, execveat, fchdir, fcntl, fork, getsockopt, inotify_init, inotify_init1, io_uring_setup, memfd_create, mount, open_by_handle_at, openat2, pidfd_getfd, pidfd_open, pipe, pipe2, prctl, prlimit, procexit, recvfrom, recvmsg, sendmsg, sendto, setgid, setpgid, setresgid, setresuid, setrlimit, setsid, setuid, shutdown, signalfd, signalfd4, socket, socketpair, timerfd_create, umount, umount2, userfaultfd, vfork
Tue Apr 16 14:51:20 2024: (70) syscalls selected in total (final set): accept, accept4, bind, capset, chdir, chroot, clone, clone3, close, connect, creat, dup, dup2, dup3, epoll_create, epoll_create1, eventfd, eventfd2, execve, execveat, fchdir, fcntl, fork, getsockopt, inotify_init, inotify_init1, io_uring_setup, kill, link, linkat, memfd_create, mkdir, mkdirat, mount, open, open_by_handle_at, openat, openat2, pidfd_getfd, pidfd_open, pipe, pipe2, prctl, prlimit, procexit, recvfrom, recvmsg, rename, rmdir, sendmsg, sendto, setgid, setpgid, setresgid, setresuid, setrlimit, setsid, setuid, shutdown, signalfd, signalfd4, socket, socketpair, timerfd_create, umount, umount2, unlink, unlinkat, userfaultfd, vfork

With base_syscall.repait=true you should obtain something like

Tue May  7 11:12:06 2024: (12) syscalls in rules: execve, kill, link, linkat, mkdir, mkdirat, open, openat, rename, rmdir, unlink, unlinkat
Tue May  7 11:12:06 2024: +(18) syscalls (Falco's state engine set of syscalls): capset, chdir, chroot, clone, clone3, close, execveat, fchdir, fork, prctl, procexit, setgid, setpgid, setresgid, setresuid, setsid, setuid, vfork
Tue May  7 11:12:06 2024: (30) syscalls selected in total (final set): capset, chdir, chroot, clone, clone3, close, execve, execveat, fchdir, fork, kill, link, linkat, mkdir, mkdirat, open, openat, prctl, procexit, rename, rmdir, setgid, setpgid, setresgid, setresuid, setsid, setuid, unlink, unlinkat, vfork

It's enough to provide Falco with -o base_syscalls.repair=true or changing it directly in the config. This should improve the drop situation.

@poiana
Copy link
Contributor

poiana commented Aug 5, 2024

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

@poiana
Copy link
Contributor

poiana commented Sep 4, 2024

Stale issues rot after 30d of inactivity.

Mark the issue as fresh with /remove-lifecycle rotten.

Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle rotten

@johannespostler
Copy link

We experience the same issue on a number of Ubuntu 22.04 VMs using falco 0.37.1.

@poiana
Copy link
Contributor

poiana commented Oct 14, 2024

Rotten issues close after 30d of inactivity.

Reopen the issue with /reopen.

Mark the issue as fresh with /remove-lifecycle rotten.

Provide feedback via https://github.com/falcosecurity/community.
/close

@poiana
Copy link
Contributor

poiana commented Oct 14, 2024

@poiana: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue with /reopen.

Mark the issue as fresh with /remove-lifecycle rotten.

Provide feedback via https://github.com/falcosecurity/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@poiana poiana closed this as completed Oct 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants