Net: Implement deferred panic #314

solardiz · 2024-02-25T22:21:17Z

Nov 10, 2022

When LKRG decides to panic the kernel and we have networking enabled, the panic should be deferred until after we've at least tried sending the message out.

However, in such a configuration we should probably apply the previous level of enforcement (where applicable) right away, so that switching from that level to panic clearly does not weaken security. For example, kill the task right away (where applicable), then send the message that we're about to panic the kernel, then actually panic.
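The ordering proposed above can be modeled in a few lines. This is only a userspace sketch in Python, not LKRG code: `deferred_panic` and its three callback parameters are hypothetical names introduced for illustration, and the message text is made up.

```python
from typing import Callable, List


def deferred_panic(
    reason: str,
    enforce_previous_level: Callable[[], None],
    send_message: Callable[[str], bool],
    panic: Callable[[str], None],
) -> None:
    """Model of the proposed ordering (all names here are hypothetical)."""
    # 1. Apply the previous enforcement level first (e.g. kill the task),
    #    so that deferring the panic leaves no window of weaker security.
    enforce_previous_level()
    # 2. Try to get the message out; a failure here must not prevent the panic.
    try:
        send_message("about to panic: " + reason)
    except Exception:
        pass
    # 3. Only now actually panic.
    panic(reason)


def demo() -> List[str]:
    """Record the order in which the three steps run."""
    steps: List[str] = []
    deferred_panic(
        "Kernel: 1 checksums changed unexpectedly",
        enforce_previous_level=lambda: steps.append("enforce"),
        send_message=lambda msg: steps.append("send") is None,
        panic=lambda why: steps.append("panic"),
    )
    return steps
```

Calling `demo()` records the steps in the order enforce, send, panic. The point of the sketch is that the panic step is reached even if sending raises.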

solardiz · 2024-02-25T22:23:14Z

Nov 23, 2022 (which was before we implemented kprobe self-test via 26f36ed)

Without deferred panic yet, testing with echo 0 > /sys/kernel/debug/kprobes/enabled, only this gets transferred to the remote (tested twice - once with kernel.panic=-1 and once with kernel.panic=0 - with the same result):

1669221616777553,1669221579179197,404779645,6,317,404779215,-;Kprobes globally disabled
1669221626697537,1669221589102050,414702497,2,318,414702399,-;LKRG: ALERT: DETECT: Kernel: _stext hash changed unexpectedly

whereas the full messages would be:

[  404.779215] Kprobes globally disabled
[  414.702399] LKRG: ALERT: DETECT: Kernel: _stext hash changed unexpectedly
[  414.707913] LKRG: ALERT: DETECT: Kernel: 1 checksums changed unexpectedly
[  414.707949] LKRG: ALERT: BLOCK: Kernel: 1 checksums changed unexpectedly
[  414.707988] Kernel panic - not syncing: Kernel: 1 checksums changed unexpectedly

followed by a backtrace.
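To see exactly which local lines never reached the remote, the two captures above can be compared mechanically. The sketch below is an illustration, not an LKRG tool. It assumes that the last four header fields of each remote record follow the /dev/kmsg layout (priority, sequence number, timestamp in microseconds, flags), which is consistent with the timestamps matching the local output; the meaning of the three leading fields is not stated here, so they are kept as opaque values. The names `Record`, `parse_record`, `local_style` and `missing_on_remote` are hypothetical.

```python
from typing import List, NamedTuple, Tuple

REMOTE = [
    "1669221616777553,1669221579179197,404779645,6,317,404779215,-;Kprobes globally disabled",
    "1669221626697537,1669221589102050,414702497,2,318,414702399,-;"
    "LKRG: ALERT: DETECT: Kernel: _stext hash changed unexpectedly",
]
LOCAL = [
    "[  404.779215] Kprobes globally disabled",
    "[  414.702399] LKRG: ALERT: DETECT: Kernel: _stext hash changed unexpectedly",
    "[  414.707913] LKRG: ALERT: DETECT: Kernel: 1 checksums changed unexpectedly",
    "[  414.707949] LKRG: ALERT: BLOCK: Kernel: 1 checksums changed unexpectedly",
    "[  414.707988] Kernel panic - not syncing: Kernel: 1 checksums changed unexpectedly",
]


class Record(NamedTuple):
    prefix: Tuple[str, ...]  # leading fields, kept opaque (meaning assumed unknown)
    priority: int
    sequence: int
    timestamp_us: int
    flags: str
    message: str


def parse_record(line: str) -> Record:
    """Split one remote record into header fields and message text."""
    header, _, message = line.partition(";")
    # Assumption: the last four header fields are /dev/kmsg-style.
    *prefix, priority, sequence, timestamp_us, flags = header.split(",")
    return Record(tuple(prefix), int(priority), int(sequence),
                  int(timestamp_us), flags, message)


def local_style(rec: Record) -> str:
    """Render a record the way the local kernel log prints it."""
    seconds, micros = divmod(rec.timestamp_us, 1_000_000)
    return f"[{seconds:5d}.{micros:06d}] {rec.message}"


def missing_on_remote(remote: List[str], local: List[str]) -> List[str]:
    """Return the local lines that have no matching remote record."""
    received = {local_style(parse_record(r)) for r in remote}
    return [line for line in local if line not in received]
```

On the capture above, `missing_on_remote(REMOTE, LOCAL)` returns the last three local lines, which include the panic message itself.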

As to deferred panic, I am wondering whether we should limit that to LKRG-induced panics or maybe hook into the kernel's panic code (or something it calls) and similarly defer non-LKRG panics (only the final stopping/rebooting, but not the messages). For example, we could have a wait-until-sent-or-timeout loop in a callback we'd register with kmsg_dump_register (an exported symbol across our supported kernels).
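The wait-until-sent-or-timeout idea can be sketched as a bounded polling loop. This is a userspace model in Python, not kernel code: `wait_until_sent_or_timeout`, `is_sent` and the interval values are hypothetical, and a real panic-path callback would presumably have to busy-wait rather than sleep, which is not modeled here.

```python
import time
from typing import Callable


def wait_until_sent_or_timeout(
    is_sent: Callable[[], bool],
    timeout_s: float,
    poll_interval_s: float = 0.01,
) -> bool:
    """Poll is_sent() until it returns True or timeout_s elapses.

    Returns True if the message was sent in time, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        # Check first, so an already-sent message never waits at all.
        if is_sent():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(poll_interval_s, remaining))
```

The timeout bounds how long the final stop or reboot is delayed when the network is down, which is the case where waiting indefinitely would be worst.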

solardiz · 2024-02-25T22:24:23Z

Nov 28, 2022

> we could have a wait-until-sent-or-timeout loop in a callback we'd register with kmsg_dump_register

I've just experimented with this. First, by code review, those callbacks are made too late for us - after the shutdown of SMP, whereas we'd want our network sending code to run on another CPU because the panicking one is in an unsuitable state (it was already in an unknown state, and is further modified by the panic in progress). Second, in my testing the callback is somehow not called at all - which I haven't been able to figure out yet.

solardiz · 2024-09-23T23:50:59Z

After test-implementing #336 and repeating the echo 0 > /sys/kernel/debug/kprobes/enabled test (albeit with newer LKRG and kernel), I got on the remote logging host (~40 ms ping round-trip over the Internet):

2024-09-23T23:47:15Z kprobes: Kprobes globally disabled
2024-09-23T23:47:28Z LKRG: ALERT: DETECT: Kprobes: Don't work as intended (disabled?)
2024-09-23T23:47:28Z LKRG: ALERT: DETECT: Kernel: _stext hash changed unexpectedly
2024-09-23T23:47:28Z LKRG: ALERT: DETECT: Kernel: 2 checksums changed unexpectedly
2024-09-23T23:47:28Z LKRG: ALERT: BLOCK: Kernel: 2 checksums changed unexpectedly
2024-09-23T23:47:28Z Kernel panic - not syncing: Kernel: 2 checksums changed unexpectedly
2024-09-23T23:47:28Z CPU: 2 PID: 849242 Comm: kworker/u8:3 Tainted: G           OE      6.1.[censored].x86_64 #1
2024-09-23T23:47:28Z Workqueue: events_unbound p_check_integrity [lkrg]
2024-09-23T23:47:28Z Call Trace:
2024-09-23T23:47:28Z  <TASK>
2024-09-23T23:47:28Z  dump_stack_lvl+0x45/0x5e

... and that's where it ends, even though locally the messages included the full backtrace.

That's an improvement in what got through! However, I haven't yet tested this specific setup without the #336 fix, so I don't know what caused the improvement.

solardiz · 2024-09-24T00:41:28Z

Tried this again without the #336 fix, got:

2024-09-24T00:38:48Z kprobes: Kprobes globally disabled
2024-09-24T00:38:49Z LKRG: ALERT: DETECT: Kprobes: Don't work as intended (disabled?)
2024-09-24T00:38:49Z LKRG: ALERT: DETECT: Kernel: _stext hash changed unexpectedly
2024-09-24T00:38:49Z LKRG: ALERT: DETECT: Kernel: Module hash changed unexpectedly, name lkrg
2024-09-24T00:38:49Z LKRG: ALERT: DETECT: Kernel: Module list hash changed unexpectedly
2024-09-24T00:38:49Z LKRG: ALERT: DETECT: Kernel: Module KOBJ list hash changed unexpectedly
2024-09-24T00:38:49Z LKRG: ALERT: DETECT: Kernel: Module KOBJ hash changed unexpectedly, name lkrg
2024-09-24T00:38:49Z LKRG: ALERT: DETECT: Kernel: 6 checksums changed unexpectedly
2024-09-24T00:38:49Z LKRG: ALERT: BLOCK: Kernel: 6 checksums changed unexpectedly
2024-09-24T00:38:49Z Kernel panic - not syncing: Kernel: 6 checksums changed unexpectedly
2024-09-24T00:38:49Z CPU: 2 PID: 2798 Comm: kworker/u8:3 Tainted: G           OE      6.1.[censored].x86_64 #1

So much of this is random, but this newer kernel is in general luckier at getting some info through than the older one was during my testing in 2022.

solardiz added the enhancement label on Feb 25, 2024
