-
Notifications
You must be signed in to change notification settings - Fork 376
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add missed probes metrics #1941
Conversation
de8076f
to
d53cdd7
Compare
d53cdd7
to
6e23002
Compare
f3305dc
to
97ca33b
Compare
✅ Deploy Preview for tetragon ready!
To edit notification comments on pull requests, go to your Netlify site configuration. |
1c504b7
to
d270fd3
Compare
MissedProbes = metrics.NewBPFCounter(prometheus.NewDesc( | ||
prometheus.BuildFQName(consts.MetricsNamespace, "", "missed_probes_total"), | ||
"The total number of Tetragon probe missed per policy,probe.", | ||
[]string{"policy", "attach"}, nil, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the attach
label exactly?
I'm asking because tetragon_policy_events_total metric has a hook
label with values like kprobe:__sys_setfsgid
. Is attach
essentially same or only similar? I'm thinking of standardizing some labels.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so attach is kernel function name for single kprobe and kprobe_multi (3 functions)
for kprobe multi
similar for uprobes.. I'm not sure it's good idea to standardize this, we might want to change in future ;-)
} | ||
|
||
var ( | ||
MissedProbes = metrics.NewBPFCounter(prometheus.NewDesc( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To have this metric documented in the metrics reference, we need to create a separate dummy collector that reports this metrics with example labels. See 861489a for an example, and consts.go for example labels.
This is not ideal tbh. I'm working on a better metrics library for Tetragon use cases, so hopefully soon it will be more straightforward.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, will add that
d270fd3
to
9209527
Compare
97b720d
to
62c22cf
Compare
62c22cf
to
17d8993
Compare
62db9fa
to
cc05052
Compare
b24768d
to
a87a9e2
Compare
59b3588
to
6966716
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good :)
744570d
to
22437f4
Compare
b161f68
to
9ef973e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, LGTM.
One question what exactly is this missed metric? :)
Is it when a link exists without a program? Or is it something else?
so it's incremented any time we don't execute the bpf program due to recursion in kernel, there was a case +- year ago where we were loosing exit events for this reason due to wrong |
@lambdanis could you please recheck, I changed few things on metrics side because the tetragon base changed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments:
- Ideally we should use
MustNewCustomCounter
andNewCustomCollector
frompkg/metrics
here. See how other collectors were defined in a05fbc7 or 7309c9b. If it's a trouble for some reason, I'm ok with merging as-is, but then we should have a follow-up item to refactor it. - The new metrics are not documented. Using helpers from 1. should provide an easy way to include them in the generated reference.
- I'm not sure
kprobe_multi (3 functions)
is a useful label value. Seeing such metric, how do I find out which functions were missed?
80c5f4d
to
42d1674
Compare
ok, used that in new version
ok
we don't have any more insight than that.. we get the counter per link/program that is attached to multiple functions, |
42d1674
to
457c7e2
Compare
I see you use |
457c7e2
to
f712d75
Compare
At the moment we keep unloaded programs in AllPrograms. Making AllPrograms local and locked by mutex. Adding new helpers progsAdd and progsCleanup to add and cleanup programs (on sensor unload) respectively. Signed-off-by: Jiri Olsa <[email protected]>
This just matters for log output, changing from: time="2024-03-28T13:08:34Z" level=info msg="Loading registered BPF probe" Attach="3 functions" Program=... into: time="2024-03-28T13:08:34Z" level=info msg="Loading registered BPF probe" Attach="kprobe_multi (3 functions)" Program=... It has more info and will help in following changes where the Attach string will serve as the metric label. Signed-off-by: Jiri Olsa <[email protected]>
Storing link in program so we can access link's stats and store them as metrics. Signed-off-by: Jiri Olsa <[email protected]>
Storing ebpf.Program instance in program.Program to allow ebpf related stats retrieval in following changes. Signed-off-by: Jiri Olsa <[email protected]>
Adding support to set policy name for program so we can use it as a metric label in following changes. We are going to support missed stats retrieval for kprobes in following changes so at ATM we set policy for kprobe programs which are in base sensor and generic kprobe sensor. Signed-off-by: Jiri Olsa <[email protected]>
Adding HasMissedStatsKprobe/Multi functions to detect support of kprobe missed stats. Checking this by searching BTF for needed structs, because AFAICS there's no 'functional' test possible. Signed-off-by: Jiri Olsa <[email protected]>
f712d75
to
997de7c
Compare
Adding metrics for missed runs on program and link level to kprobemetrics package and logic to store and collect missed stats. The missed stats are supported for all programs and kprobe/kprobe-multi links. They are stored per 'attach name' and 'policy name'. For programs (not just kprobes): tetragon_missed_prog_probes_total{attach="__x64_sys_linkat",policy="sys-linkat-passwd"} 68 tetragon_missed_prog_probes_total{attach="acct_process",policy="__base__"} 60 tetragon_missed_prog_probes_total{attach="sched/sched_process_exec",policy="__base__"} 64 tetragon_missed_prog_probes_total{attach="security_bprm_committing_creds",policy="__base__"} 66 tetragon_missed_prog_probes_total{attach="wake_up_new_task",policy="__base__"} 62 For kprobe and kprobe-multi links: tetragon_missed_link_probes_total{attach="__x64_sys_linkat",policy="sys-linkat-passwd"} 45 tetragon_missed_link_probes_total{attach="acct_process",policy="__base__"} 39 tetragon_missed_link_probes_total{attach="security_bprm_committing_creds",policy="__base__"} 43 tetragon_missed_link_probes_total{attach="wake_up_new_task",policy="__base__"} 41 tetragon_missed_prog_probes_total{attach="acct_process",policy="__base__"} 40 tetragon_missed_prog_probes_total{attach="kprobe_multi (1 functions)",policy="sys-linkat-passwd"} 48 tetragon_missed_prog_probes_total{attach="sched/sched_process_exec",policy="__base__"} 44 tetragon_missed_prog_probes_total{attach="security_bprm_committing_creds",policy="__base__"} 46 tetragon_missed_prog_probes_total{attach="wake_up_new_task",policy="__base__"} 42 Note changing the healthMetrics group to be created as not constrained, so it can carry new metrics. It will be addressed in future by adding debug metrics group. Signed-off-by: Jiri Olsa <[email protected]>
Adding test for tetragon_missed_prog_probes_total metric. At the moment I'm able to trigged missed event only for kprobe_multi and for prog recursion miss. We will need to do special setup to hit other missed counts, adding that on my todo list. Signed-off-by: Jiri Olsa <[email protected]>
997de7c
to
02900fe
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good :)
wip needs cilium/ebpf#1295