Net: Implement deferred panic #314

Open
solardiz opened this issue Feb 25, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

@solardiz
Contributor

Nov 10, 2022

When LKRG decides to panic the kernel and we have networking enabled, the panic should be deferred until after we've at least tried sending the message out.

However, in such a configuration we should probably apply the previous level of enforcement (where applicable) right away, so that switching from that level to panic clearly does not weaken security. For example, kill the task right away (where applicable), then send the message that we're about to panic the kernel, then actually panic.

@solardiz
Contributor Author

Nov 23, 2022 (which was before we implemented kprobe self-test via 26f36ed)

Without deferred panic implemented yet, testing with echo 0 > /sys/kernel/debug/kprobes/enabled, I get only this transferred to the remote (tested twice, once with kernel.panic=-1 and once with kernel.panic=0, with the same result):

1669221616777553,1669221579179197,404779645,6,317,404779215,-;Kprobes globally disabled
1669221626697537,1669221589102050,414702497,2,318,414702399,-;LKRG: ALERT: DETECT: Kernel: _stext hash changed unexpectedly

whereas the full messages would be:

[  404.779215] Kprobes globally disabled
[  414.702399] LKRG: ALERT: DETECT: Kernel: _stext hash changed unexpectedly
[  414.707913] LKRG: ALERT: DETECT: Kernel: 1 checksums changed unexpectedly
[  414.707949] LKRG: ALERT: BLOCK: Kernel: 1 checksums changed unexpectedly
[  414.707988] Kernel panic - not syncing: Kernel: 1 checksums changed unexpectedly

followed by a backtrace.
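
To compare what reached the remote with the local log, the two record formats can be lined up. The sketch below assumes that the last four comma-separated fields before the `;` follow the /dev/kmsg layout (syslog prefix, sequence number, microseconds since boot, flags), which is consistent with the timestamps above (404779215 µs corresponds to `[  404.779215]`). The meaning of the three leading fields is not stated in this thread, so they are skipped as opaque. The names `kmsg_fields`, `parse_record` and `format_dmesg` are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Fields recovered from one remote record (hypothetical helper type). */
struct kmsg_fields {
    unsigned prefix;          /* syslog prefix: facility * 8 + level */
    unsigned long long seq;   /* record sequence number */
    unsigned long long usec;  /* microseconds since boot */
    const char *text;         /* message text, points into the input line */
};

/*
 * Parse "...,prefix,seq,usec,flags;text".
 * Assumption: exactly four fields precede the ';'. Any fields before
 * those four are treated as opaque and ignored.
 * Returns 0 on success, -1 on malformed input.
 */
static int parse_record(const char *line, struct kmsg_fields *out)
{
    assert(line != NULL && out != NULL);

    const char *semi = strchr(line, ';');
    if (semi == NULL)
        return -1;

    /* Walk back from ';' to the start of the fourth-from-last field. */
    const char *p = semi;
    int commas = 0;
    while (p > line) {
        if (p[-1] == ',' && ++commas == 4)
            break;
        p--;
    }
    if (commas < 3)
        return -1;

    if (sscanf(p, "%u,%llu,%llu,", &out->prefix, &out->seq, &out->usec) != 3)
        return -1;
    out->text = semi + 1;
    return 0;
}

/* Render the record the way the local log shows it: "[ secs.usecs] text". */
static void format_dmesg(const struct kmsg_fields *f, char *buf, size_t len)
{
    snprintf(buf, len, "[%5llu.%06llu] %s",
             f->usec / 1000000ULL, f->usec % 1000000ULL, f->text);
}
```

Under those assumptions, the first remote record parses to prefix 6 (level `prefix & 7` is 6, info) and the second to prefix 2 (crit), and both line up with the first two local lines. The remaining local lines have no remote counterpart.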

As to deferred panic, I am wondering whether we should limit that to LKRG-induced panics or maybe hook into the kernel's panic code (or something it calls) and similarly defer non-LKRG panics (only the final stopping/rebooting, but not the messages). For example, we could have a wait-until-sent-or-timeout loop in a callback we'd register with kmsg_dump_register (an exported symbol across our supported kernels).
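
A minimal userspace sketch of the wait-until-sent-or-timeout idea follows. It only models the loop logic and is not kernel code: the clock and the "everything sent?" probe are passed in as hypothetical callbacks (`now_ms_fn`, `all_sent_fn`), and `fake_link` with its helpers exists purely to exercise the loop. How such a loop would be hooked into the panic path is left open here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hooks: a millisecond clock and a "send queue drained?" probe. */
typedef uint64_t (*now_ms_fn)(void *ctx);
typedef bool (*all_sent_fn)(void *ctx);

/*
 * Poll until everything has been sent or timeout_ms has elapsed.
 * Returns true if the messages got out, false on timeout.
 * Either way the caller proceeds to the final stop/reboot afterwards.
 */
static bool wait_sent_or_timeout(now_ms_fn now_ms, all_sent_fn all_sent,
                                 void *ctx, uint64_t timeout_ms)
{
    assert(now_ms != NULL && all_sent != NULL);

    const uint64_t start = now_ms(ctx);
    for (;;) {
        if (all_sent(ctx))
            return true;
        if (now_ms(ctx) - start >= timeout_ms)
            return false;
        /* A real implementation would pause briefly between polls. */
    }
}

/* Test double: every clock read advances time by 1 ms. */
struct fake_link {
    uint64_t now;      /* current fake time in ms */
    uint64_t sent_at;  /* fake time at which the queue is drained */
};

static uint64_t fake_now(void *ctx)
{
    struct fake_link *l = ctx;
    return l->now++;
}

static bool fake_all_sent(void *ctx)
{
    const struct fake_link *l = ctx;
    return l->now >= l->sent_at;
}
```

With `fake_link` set to drain at 5 ms and a 100 ms timeout the loop reports success. With a drain time of 1000 ms it gives up at the timeout, so the panic is delayed by a bounded amount even if the network never delivers.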

@solardiz
Contributor Author

Nov 28, 2022

> we could have a wait-until-sent-or-timeout loop in a callback we'd register with kmsg_dump_register

I've just experimented with this. First, by code review, those callbacks are made too late for us: they run after SMP shutdown, whereas we'd want our network sending code to run on another CPU, because the panicking one is in an unsuitable state (it was already in an unknown state, and is further modified by the panic in progress). Second, in my testing the callback is somehow not called at all, which I haven't figured out yet.

@solardiz added the enhancement (New feature or request) label on Feb 25, 2024
@solardiz
Contributor Author

solardiz commented Sep 23, 2024

After test-implementing #336 and repeating the echo 0 > /sys/kernel/debug/kprobes/enabled test (albeit with a newer LKRG and kernel), I got this on the remote logging host (~40 ms ping round-trip over the Internet):

2024-09-23T23:47:15Z kprobes: Kprobes globally disabled
2024-09-23T23:47:28Z LKRG: ALERT: DETECT: Kprobes: Don't work as intended (disabled?)
2024-09-23T23:47:28Z LKRG: ALERT: DETECT: Kernel: _stext hash changed unexpectedly
2024-09-23T23:47:28Z LKRG: ALERT: DETECT: Kernel: 2 checksums changed unexpectedly
2024-09-23T23:47:28Z LKRG: ALERT: BLOCK: Kernel: 2 checksums changed unexpectedly
2024-09-23T23:47:28Z Kernel panic - not syncing: Kernel: 2 checksums changed unexpectedly
2024-09-23T23:47:28Z CPU: 2 PID: 849242 Comm: kworker/u8:3 Tainted: G           OE      6.1.[censored].x86_64 #1
2024-09-23T23:47:28Z Workqueue: events_unbound p_check_integrity [lkrg]
2024-09-23T23:47:28Z Call Trace:
2024-09-23T23:47:28Z  <TASK>
2024-09-23T23:47:28Z  dump_stack_lvl+0x45/0x5e

... and that's where it ends, even though locally the messages included the full backtrace.

That's an improvement in what got through! However, I haven't yet tested this specific setup without the attempted #336 fix, so I don't know what caused the improvement.

@solardiz
Contributor Author

Tried this again without the #336 fix, got:

2024-09-24T00:38:48Z kprobes: Kprobes globally disabled
2024-09-24T00:38:49Z LKRG: ALERT: DETECT: Kprobes: Don't work as intended (disabled?)
2024-09-24T00:38:49Z LKRG: ALERT: DETECT: Kernel: _stext hash changed unexpectedly
2024-09-24T00:38:49Z LKRG: ALERT: DETECT: Kernel: Module hash changed unexpectedly, name lkrg
2024-09-24T00:38:49Z LKRG: ALERT: DETECT: Kernel: Module list hash changed unexpectedly
2024-09-24T00:38:49Z LKRG: ALERT: DETECT: Kernel: Module KOBJ list hash changed unexpectedly
2024-09-24T00:38:49Z LKRG: ALERT: DETECT: Kernel: Module KOBJ hash changed unexpectedly, name lkrg
2024-09-24T00:38:49Z LKRG: ALERT: DETECT: Kernel: 6 checksums changed unexpectedly
2024-09-24T00:38:49Z LKRG: ALERT: BLOCK: Kernel: 6 checksums changed unexpectedly
2024-09-24T00:38:49Z Kernel panic - not syncing: Kernel: 6 checksums changed unexpectedly
2024-09-24T00:38:49Z CPU: 2 PID: 2798 Comm: kworker/u8:3 Tainted: G           OE      6.1.[censored].x86_64 #1

So much of this is random, but this newer kernel is generally luckier at getting some info through than the older one was during my testing in 2022.
