High CPU Usage when using Falco v0.28.0 and higher #2996

Junia18 · 2024-01-08T08:33:42Z

I am encountering a critical issue while running Falco v0.36.2 as a daemonset on an Openshift cluster (OCP v4.12) with three worker nodes. The problem manifests as a substantial drop rate of events and a notable doubling of CPU usage across all three worker nodes.

This behavior is particularly concerning as it impacts the effective functioning of Falco and the overall performance of the cluster. The kernel probe is being utilized in this configuration.

Steps to Reproduce:

Deploy Falco v0.36.2 as a daemonset on an Openshift cluster (OCP v4.12) with three worker nodes.
Observe the event drop rate and CPU usage on each worker node.

CPU usage before (when using Falco versions upto v0.26.2)

> oc adm top nodes
NAME                             CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
master0.sweden.cp.fyre.ibm.com   1577m        21%    10056Mi         69%
master1.sweden.cp.fyre.ibm.com   1301m        17%    12124Mi         83%
master2.sweden.cp.fyre.ibm.com   899m         11%    10133Mi         69%
worker0.sweden.cp.fyre.ibm.com   2022m        13%    12960Mi         42%
worker1.sweden.cp.fyre.ibm.com   2422m        15%    7887Mi          26%
worker2.sweden.cp.fyre.ibm.com   1844m        11%    11767Mi         38%

CPU usage after (when using Falco versions > v0.27.0)

> oc adm top nodes
NAME                               CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
master0.normandy.cp.fyre.ibm.com   2014m        26%    8543Mi          58%
master1.normandy.cp.fyre.ibm.com   1090m        14%    7409Mi          51%
master2.normandy.cp.fyre.ibm.com   1506m        20%    10557Mi         72%
worker0.normandy.cp.fyre.ibm.com   4468m        28%    10971Mi         36%
worker1.normandy.cp.fyre.ibm.com   4059m        26%    12588Mi         41%
worker2.normandy.cp.fyre.ibm.com   4043m        26%    12829Mi         42%

Logs

Fri Jan  5 09:23:35 2024: Setting metadata download max size to 100 MB
Fri Jan  5 09:23:35 2024: Setting metadata download chunk wait time to 1000 Î¼s
Fri Jan  5 09:23:35 2024: Setting metadata download watch frequency to 1 seconds
Fri Jan  5 09:23:35 2024: (12) syscalls in rules: execve, kill, link, linkat, mkdir, mkdirat, open, openat, rename, rmdir, unlink, unlinkat
Fri Jan  5 09:23:35 2024: +(58) syscalls (Falco's state engine set of syscalls): accept, accept4, bind, capset, chdir, chroot, clone, clone3, close, connect, creat, dup, dup2, dup3, epoll_create, epoll_create1, eventfd, eventfd2, execveat, fchdir, fcntl, fork, getsockopt, inotify_init, inotify_init1, io_uring_setup, memfd_create, mount, open_by_handle_at, openat2, pidfd_getfd, pidfd_open, pipe, pipe2, prctl, prlimit, procexit, recvfrom, recvmsg, sendmsg, sendto, setgid, setpgid, setresgid, setresuid, setrlimit, setsid, setuid, shutdown, signalfd, signalfd4, socket, socketpair, timerfd_create, umount, umount2, userfaultfd, vfork
Fri Jan  5 09:23:35 2024: (70) syscalls selected in total (final set): accept, accept4, bind, capset, chdir, chroot, clone, clone3, close, connect, creat, dup, dup2, dup3, epoll_create, epoll_create1, eventfd, eventfd2, execve, execveat, fchdir, fcntl, fork, getsockopt, inotify_init, inotify_init1, io_uring_setup, kill, link, linkat, memfd_create, mkdir, mkdirat, mount, open, open_by_handle_at, openat, openat2, pidfd_getfd, pidfd_open, pipe, pipe2, prctl, prlimit, procexit, recvfrom, recvmsg, rename, rmdir, sendmsg, sendto, setgid, setpgid, setresgid, setresuid, setrlimit, setsid, setuid, shutdown, signalfd, signalfd4, socket, socketpair, timerfd_create, umount, umount2, unlink, unlinkat, userfaultfd, vfork
Fri Jan  5 09:23:35 2024: The chosen syscall buffer dimension is: 8388608 bytes (8 MBs)
Fri Jan  5 09:23:35 2024: Loaded event sources: syscall
Fri Jan  5 09:23:35 2024: Enabled event sources: syscall
Fri Jan  5 09:23:35 2024: Opening event source 'syscall'
Fri Jan  5 09:23:35 2024: Opening 'syscall' source with Kernel module
Fri Jan  5 09:24:03 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:03 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:03 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:03 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:03 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:03 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:03 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:04 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:04 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:04 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:04 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:04 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:04 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:05 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:05 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:05 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:05 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:05 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:05 2024: Syscall event drop but token bucket depleted, skipping actions
Fri Jan  5 09:24:06 2024: Syscall event drop but token bucket depleted, skipping actions

Environment

Falco version: 0.36.2
System info:
"machine": "x86_64",
"release": "4.18.0-372.73.1.el8_6.x86_64",
"sysname": "Linux",
"version": "Digwatch compiler #1 SMP Fri Sep 8 13:16:27 EDT 2023"
Cloud provider or hardware configuration:
OS:
NAME="Red Hat Enterprise Linux"
VERSION="8.8 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.8 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"
Kernel:
Linux 4.18.0-372.73.1.el8_6.x86_64 Digwatch compiler #1 SMP Fri Sep 8 13:16:27 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux
Installation method: from source

The text was updated successfully, but these errors were encountered:

Andreagit97 · 2024-01-08T15:38:36Z

ei @Junia18 thank you for reporting! I see that in Falco 0.36.2 you have the k8s metadata enrichment enabled. This unfortunately could cause huge drops... not sure which deployment method you are using, but the k8s enrichment could be disabled by removing the -k/-K flags if you are running falco binary manually, or with the following command if you are using the default falco helm chart

helm install falco falcosecurity/falco  --set collectors.kubernetes.enabled=false

Could you disable it and retry please?

Junia18 · 2024-01-10T13:27:12Z

@Andreagit97 I have removed the -k and -K flags, but the -pk flag is still needed as we are consuming Kubernetes metadata. Despite these changes, CPU utilization remains high. We are running the Falco binary using the following command:

falco -K /var/run/secrets/kubernetes.io/serviceaccount/token -k https://$(KUBERNETES_SERVICE_HOST) --cri /var/run/crio/crio.sock -pk -v

Is there any other solution I can try?

alacuku · 2024-01-10T13:37:04Z

Hi @Junia18, could you try the new k8s-metacollector and k8smeta plugin? We are going to ship them with falco 0.37 but you can test them with falco 0.36.2. Here you can find the modified chart: falcosecurity/charts#598

For more info please see the following issue: #2973

Let me know if you need any help!

Junia18 · 2024-01-11T08:29:23Z

@alacuku I was wondering if there's any information available about the anticipated release date for Falco v0.37.0. I appreciate your time and assistance.

alacuku · 2024-01-11T09:07:05Z

The expected release date for Falco 0.37 is the end of January 2024.

poiana · 2024-04-10T09:50:58Z

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

Junia18 · 2024-04-16T12:53:41Z

/remove-lifecycle stale

Junia18 · 2024-04-16T15:46:48Z

@Andreagit97 Despite upgrading to Falco v0.37.1, we continue to experience high CPU usage, accompanied by frequent syscall event drops in the logs.

CPU usage after upgrading Falco to v0.37.1

➜ oc adm top nodes
NAME                               CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
master0.brussels.cp.fyre.ibm.com   1458m        19%    11325Mi         78%       
master1.brussels.cp.fyre.ibm.com   1123m        14%    11305Mi         78%       
master2.brussels.cp.fyre.ibm.com   1851m        24%    9500Mi          65%       
worker0.brussels.cp.fyre.ibm.com   6779m        43%    15696Mi         51%       
worker1.brussels.cp.fyre.ibm.com   5537m        35%    17222Mi         56%       
worker2.brussels.cp.fyre.ibm.com   4758m        30%    14445Mi         47%

Logs

Tue Apr 16 14:51:20 2024: Falco version: 0.37.1 (x86_64)
Tue Apr 16 14:51:20 2024: CLI args: /usr/bin/falco -c /config/falco_config.yaml --cri /var/run/crio/crio.sock -pk -v
Tue Apr 16 14:51:20 2024: Falco initialized with configuration file: /config/falco_config.yaml
Tue Apr 16 14:51:20 2024: System info: Linux version 5.14.0-284.55.1.el9_2.x86_64 ([email protected]) (gcc (GCC) 11.3.1 20221121 (Red Hat 11.3.1-4), GNU ld version 2.35.2-37.el9) #1 SMP PREEMPT_DYNAMIC Mon Feb 19 16:57:59 EST 2024
Tue Apr 16 14:51:20 2024: Configured rules filenames:
Tue Apr 16 14:51:20 2024: /config/falco_rules.yaml
Tue Apr 16 14:51:20 2024: Loading rules from file /config/falco_rules.yaml
Tue Apr 16 14:51:20 2024: Watching file '/config/falco_config.yaml'
Tue Apr 16 14:51:20 2024: Watching file '/config/falco_rules.yaml'
Tue Apr 16 14:51:20 2024: (12) syscalls in rules: execve, kill, link, linkat, mkdir, mkdirat, open, openat, rename, rmdir, unlink, unlinkat
Tue Apr 16 14:51:20 2024: +(58) syscalls (Falco's state engine set of syscalls): accept, accept4, bind, capset, chdir, chroot, clone, clone3, close, connect, creat, dup, dup2, dup3, epoll_create, epoll_create1, eventfd, eventfd2, execveat, fchdir, fcntl, fork, getsockopt, inotify_init, inotify_init1, io_uring_setup, memfd_create, mount, open_by_handle_at, openat2, pidfd_getfd, pidfd_open, pipe, pipe2, prctl, prlimit, procexit, recvfrom, recvmsg, sendmsg, sendto, setgid, setpgid, setresgid, setresuid, setrlimit, setsid, setuid, shutdown, signalfd, signalfd4, socket, socketpair, timerfd_create, umount, umount2, userfaultfd, vfork
Tue Apr 16 14:51:20 2024: (70) syscalls selected in total (final set): accept, accept4, bind, capset, chdir, chroot, clone, clone3, close, connect, creat, dup, dup2, dup3, epoll_create, epoll_create1, eventfd, eventfd2, execve, execveat, fchdir, fcntl, fork, getsockopt, inotify_init, inotify_init1, io_uring_setup, kill, link, linkat, memfd_create, mkdir, mkdirat, mount, open, open_by_handle_at, openat, openat2, pidfd_getfd, pidfd_open, pipe, pipe2, prctl, prlimit, procexit, recvfrom, recvmsg, rename, rmdir, sendmsg, sendto, setgid, setpgid, setresgid, setresuid, setrlimit, setsid, setuid, shutdown, signalfd, signalfd4, socket, socketpair, timerfd_create, umount, umount2, unlink, unlinkat, userfaultfd, vfork
Tue Apr 16 14:51:20 2024: The chosen syscall buffer dimension is: 8388608 bytes (8 MBs)
Tue Apr 16 14:51:20 2024: Loaded event sources: syscall
Tue Apr 16 14:51:20 2024: Enabled event sources: syscall
Tue Apr 16 14:51:20 2024: Opening event source 'syscall'
Tue Apr 16 14:51:20 2024: Opening 'syscall' source with Kernel module
Tue Apr 16 14:54:01 2024: Syscall event drop but token bucket depleted, skipping actions
Tue Apr 16 14:54:04 2024: Syscall event drop but token bucket depleted, skipping actions
Tue Apr 16 14:54:06 2024: Syscall event drop but token bucket depleted, skipping actions
Tue Apr 16 14:54:07 2024: Syscall event drop but token bucket depleted, skipping actions
Tue Apr 16 14:54:08 2024: Syscall event drop but token bucket depleted, skipping actions
Tue Apr 16 14:55:02 2024: Syscall event drop but token bucket depleted, skipping actions

Running the Falco binary using the following command:
falco", "-c", "/config/falco_config.yaml", "--cri", "/var/run/docker.sock", "-pk", "-v"

New Environment:

Falco version: 0.37.1
System info:
"machine": "x86_64",
"release": "5.14.0-284.55.1.el9_2.x86_64",
"sysname": "Linux",
"version": "Digwatch compiler #1 SMP PREEMPT_DYNAMIC Mon Feb 19 16:57:59 EST 2024"
Cloud provider or hardware configuration:
OS:
NAME="Red Hat Enterprise Linux"
VERSION="8.9 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.9"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.9 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"
Kernel:
Linux 5.14.0-284.55.1.el9_2.x86_64

Junia18 · 2024-05-06T14:09:08Z

@Andreagit97 Just a gentle reminder that I am waiting for your input on this issue for over two weeks now. Any updates or insights you can provide would be greatly appreciated. Thanks!

incertum · 2024-05-06T19:37:18Z

Hi 👋 @Junia18 happy to also help since Andrea is constantly working on many areas. Between v0.28 and now a lot changed.

Recommending to follow our Troubleshooting guides https://falco.org/docs/troubleshooting/dropping/ and explore the base_syscalls option and other optimizations.

In your original comment I see that you are showing metrics from different clusters #2996 (comment) -- unfortunately it's hard to make truly fair comparisons and looking at the CPU usages it does not appear to be a huge delta. Perhaps also checkout this paragraph https://falco.org/docs/metrics/performance/#server-load-and-falco-event-drops stating that Falco's usage is never constant. In case you are interested in a more fair comparison: would you consider running the different Falco versions on the exact same nodes plus also export our internal metrics (https://falco.org/docs/metrics/falco-metrics/) just so we can check if event rates etc were similar.

Please let us know if the guides are helpful.

incertum · 2024-05-06T19:38:58Z

^ small correction: #2996 (comment) the internal metrics didn't yet exist in v0.28 so we can't make the more detailed comparisons.

Andreagit97 · 2024-05-07T09:13:54Z

ei @Junia18 as suggested by @incertum i would use the base_syscal.repairl option.

Looking at your actual situation, you are tracing 70 syscalls

Tue Apr 16 14:51:20 2024: (12) syscalls in rules: execve, kill, link, linkat, mkdir, mkdirat, open, openat, rename, rmdir, unlink, unlinkat
Tue Apr 16 14:51:20 2024: +(58) syscalls (Falco's state engine set of syscalls): accept, accept4, bind, capset, chdir, chroot, clone, clone3, close, connect, creat, dup, dup2, dup3, epoll_create, epoll_create1, eventfd, eventfd2, execveat, fchdir, fcntl, fork, getsockopt, inotify_init, inotify_init1, io_uring_setup, memfd_create, mount, open_by_handle_at, openat2, pidfd_getfd, pidfd_open, pipe, pipe2, prctl, prlimit, procexit, recvfrom, recvmsg, sendmsg, sendto, setgid, setpgid, setresgid, setresuid, setrlimit, setsid, setuid, shutdown, signalfd, signalfd4, socket, socketpair, timerfd_create, umount, umount2, userfaultfd, vfork
Tue Apr 16 14:51:20 2024: (70) syscalls selected in total (final set): accept, accept4, bind, capset, chdir, chroot, clone, clone3, close, connect, creat, dup, dup2, dup3, epoll_create, epoll_create1, eventfd, eventfd2, execve, execveat, fchdir, fcntl, fork, getsockopt, inotify_init, inotify_init1, io_uring_setup, kill, link, linkat, memfd_create, mkdir, mkdirat, mount, open, open_by_handle_at, openat, openat2, pidfd_getfd, pidfd_open, pipe, pipe2, prctl, prlimit, procexit, recvfrom, recvmsg, rename, rmdir, sendmsg, sendto, setgid, setpgid, setresgid, setresuid, setrlimit, setsid, setuid, shutdown, signalfd, signalfd4, socket, socketpair, timerfd_create, umount, umount2, unlink, unlinkat, userfaultfd, vfork

With base_syscall.repait=true you should obtain something like

Tue May  7 11:12:06 2024: (12) syscalls in rules: execve, kill, link, linkat, mkdir, mkdirat, open, openat, rename, rmdir, unlink, unlinkat
Tue May  7 11:12:06 2024: +(18) syscalls (Falco's state engine set of syscalls): capset, chdir, chroot, clone, clone3, close, execveat, fchdir, fork, prctl, procexit, setgid, setpgid, setresgid, setresuid, setsid, setuid, vfork
Tue May  7 11:12:06 2024: (30) syscalls selected in total (final set): capset, chdir, chroot, clone, clone3, close, execve, execveat, fchdir, fork, kill, link, linkat, mkdir, mkdirat, open, openat, prctl, procexit, rename, rmdir, setgid, setpgid, setresgid, setresuid, setsid, setuid, unlink, unlinkat, vfork

It's enough to provide Falco with -o base_syscalls.repair=true or changing it directly in the config. This should improve the drop situation.

poiana · 2024-08-05T10:09:19Z

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

poiana · 2024-09-04T10:10:27Z

Stale issues rot after 30d of inactivity.

Mark the issue as fresh with /remove-lifecycle rotten.

Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle rotten

johannespostler · 2024-09-14T05:32:16Z

We experience the same issue on a number of Ubuntu 22.04 VMs using falco 0.37.1.

poiana · 2024-10-14T10:11:24Z

Rotten issues close after 30d of inactivity.

Reopen the issue with /reopen.

Mark the issue as fresh with /remove-lifecycle rotten.

Provide feedback via https://github.com/falcosecurity/community.
/close

poiana · 2024-10-14T10:11:28Z

@poiana: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue with /reopen.

Mark the issue as fresh with /remove-lifecycle rotten.

Provide feedback via https://github.com/falcosecurity/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Junia18 added the kind/bug label Jan 8, 2024

Andreagit97 added this to the TBD milestone Jan 8, 2024

poiana added the lifecycle/stale label Apr 10, 2024

poiana removed the lifecycle/stale label Apr 16, 2024

poiana added the lifecycle/stale label Aug 5, 2024

poiana added lifecycle/rotten and removed lifecycle/stale labels Sep 4, 2024

poiana closed this as completed Oct 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

High CPU Usage when using Falco v0.28.0 and higher #2996

High CPU Usage when using Falco v0.28.0 and higher #2996

Junia18 commented Jan 8, 2024 •

edited

Loading

Andreagit97 commented Jan 8, 2024 •

edited

Loading

Junia18 commented Jan 10, 2024

alacuku commented Jan 10, 2024 •

edited

Loading

Junia18 commented Jan 11, 2024

alacuku commented Jan 11, 2024

poiana commented Apr 10, 2024

Junia18 commented Apr 16, 2024

Junia18 commented Apr 16, 2024

Junia18 commented May 6, 2024

incertum commented May 6, 2024

incertum commented May 6, 2024

Andreagit97 commented May 7, 2024

poiana commented Aug 5, 2024

poiana commented Sep 4, 2024

johannespostler commented Sep 14, 2024

poiana commented Oct 14, 2024

poiana commented Oct 14, 2024

High CPU Usage when using Falco v0.28.0 and higher #2996

High CPU Usage when using Falco v0.28.0 and higher #2996

Comments

Junia18 commented Jan 8, 2024 • edited Loading

Andreagit97 commented Jan 8, 2024 • edited Loading

Junia18 commented Jan 10, 2024

alacuku commented Jan 10, 2024 • edited Loading

Junia18 commented Jan 11, 2024

alacuku commented Jan 11, 2024

poiana commented Apr 10, 2024

Junia18 commented Apr 16, 2024

Junia18 commented Apr 16, 2024

Junia18 commented May 6, 2024

incertum commented May 6, 2024

incertum commented May 6, 2024

Andreagit97 commented May 7, 2024

poiana commented Aug 5, 2024

poiana commented Sep 4, 2024

johannespostler commented Sep 14, 2024

poiana commented Oct 14, 2024

poiana commented Oct 14, 2024

Junia18 commented Jan 8, 2024 •

edited

Loading

Andreagit97 commented Jan 8, 2024 •

edited

Loading

alacuku commented Jan 10, 2024 •

edited

Loading