Kernel panic #3673

maxpain · 2024-11-12T09:20:08Z

I have kernel panic almost every day on Talos Linux v1.8.2 (Linux 6.6.58).
Talos is deployed on bare metal nodes (Dell R6615) with NVMe SSD.
For the network, I use Broadcom 2x25G (50G in LACP bonding) with MTU 9000 (jumbo frame).

I use an image built on factory.talos.dev:

customization:
    extraKernelArgs:
        - console=ttyS0,115200n8r
        - -lockdown
        - lockdown=integrity
        - cpufreq.default_governor=performance
        - amd_pstate=active
        - mitigations=off
        - iommu=off
    systemExtensions:
        officialExtensions:
            - siderolabs/amd-ucode
            - siderolabs/amdgpu-firmware
            - siderolabs/drbd

For CNI I use Cilium in eBPF mode.

[40145.614353] general protection fault, probably for non-canonical address 0x9e759c37ee555c76: 0000 [#1] SMP PTI
[40145.624361] CPU: 18 PID: 234918 Comm: conn48291 Tainted: G           O       6.6.58-talos #1
[40145.632800] Hardware name: Dell Inc. PowerEdge R6615/067N9T, BIOS 1.9.5 09/12/2024
[40145.640376] RIP: 0010:is_uprobe_at_func_entry+0x28/0x80
[40145.645609] Code: 90 90 0f 1f 44 00 00 65 48 8b 04 25 80 e3 02 00 48 83 b8 30 0b 00 00 00 74 60 48 8b 80 30 0b 00 00 48 8b 50 30 48 85 d2 74 50 <80> 3a 55 b8 01 00 00 00 74 1b 48 8b 8f 88 00 00 00 48 83 f9 33 74
[40145.664366] RSP: 0018:ffffc900007c8bc8 EFLAGS: 00010082
[40145.669599] RAX: ffff88813eafb120 RBX: ffffc900007c8c20 RCX: 00007f116e206296
[40145.676740] RDX: 9e759c37ee555c76 RSI: 0000000000000001 RDI: ffffc90111fa3f58
[40145.683880] RBP: ffffc90111fa3f58 R08: 000000000002aee0 R09: 0000000000000008
[40145.691021] R10: ffffc90111fa0000 R11: ffffc900007c8ff8 R12: 0000000000000000
[40145.698162] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[40145.705303] FS:  00007f113e959700(0000) GS:ffff88defb500000(0000) knlGS:0000000000000000
[40145.713398] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[40145.719155] CR2: 000015b40194c804 CR3: 0000000363b74003 CR4: 0000000000f70ee0
[40145.726294] PKRU: 55555554
[40145.729014] Call Trace:
[40145.731468]  <IRQ>
[40145.733502]  ? die_addr+0x36/0x90
[40145.736836]  ? exc_general_protection+0x217/0x420
[40145.741553]  ? asm_exc_general_protection+0x26/0x30
[40145.746450]  ? is_uprobe_at_func_entry+0x28/0x80
[40145.751083]  perf_callchain_user+0x20a/0x360
[40145.755365]  get_perf_callchain+0x147/0x1d0
[40145.759559]  bpf_get_stackid+0x60/0x90
[40145.763319]  bpf_prog_9aac297fb833e2f5_do_perf_event+0x434/0x53b
[40145.769333]  ? __smp_call_single_queue+0xad/0x120
[40145.774049]  bpf_overflow_handler+0x75/0x110
[40145.778330]  __perf_event_overflow+0x114/0x360
[40145.782787]  perf_swevent_hrtimer+0x134/0x150
[40145.787155]  ? __wake_up_common+0x73/0x180
[40145.791258]  ? timerqueue_del+0x2e/0x50
[40145.795107]  ? __pfx_perf_swevent_hrtimer+0x10/0x10
[40145.799996]  __hrtimer_run_queues+0x118/0x240
[40145.804365]  ? ktime_get_update_offsets_now+0x49/0x110
[40145.809511]  hrtimer_interrupt+0xf8/0x240
[40145.813531]  __sysvec_apic_timer_interrupt+0x4a/0xe0
[40145.818508]  sysvec_apic_timer_interrupt+0x6d/0x90
[40145.823310]  </IRQ>
[40145.825426]  <TASK>
[40145.827537]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
[40145.832687] RIP: 0010:__kmem_cache_free+0x1cb/0x350
[40145.837576] Code: 48 85 db 0f 84 00 01 00 00 48 89 c2 48 0f ca 49 33 94 24 b8 00 00 00 48 89 10 49 8b 04 24 65 48 03 05 99 bd 37 61 48 8b 70 08 <4c> 39 68 10 0f 85 0b 01 00 00 48 8b 10 41 8b 44 24 28 48 01 d8 48
[40145.856331] RSP: 0018:ffffc90111fa3b70 EFLAGS: 00000282
[40145.861561] RAX: ffff88defb533910 RBX: ffff88813eafb120 RCX: ffffea0000000000
[40145.868698] RDX: 9e759c37ee555c76 RSI: 0000000000119862 RDI: ffff88810004e200
[40145.875836] RBP: ffffc90111fa3bc0 R08: 0000000000000086 R09: 00007f1153f9f9c0
[40145.882980] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88810004e200
[40145.890120] R13: ffffea0004fabec0 R14: 0000000000000000 R15: 0000000000000000
[40145.897266]  ? uprobe_free_utask+0x62/0x80
[40145.901378]  ? acct_collect+0x4c/0x220
[40145.905141]  uprobe_free_utask+0x62/0x80
[40145.909075]  mm_release+0x12/0xb0
[40145.912401]  do_exit+0x26b/0xaa0
[40145.915643]  __x64_sys_exit+0x1b/0x20
[40145.919317]  do_syscall_64+0x5a/0x80
[40145.922911]  entry_SYSCALL_64_after_hwframe+0x78/0xe2
[40145.927976] RIP: 0033:0x7f116e206296
[40145.931565] Code: 28 06 00 00 0f 84 ec 01 00 00 48 8b 44 24 08 f6 80 08 03 00 00 40 0f 85 7a 01 00 00 ba 3c 00 00 00 0f 1f 00 31 ff 89 d0 0f 05 <eb> f8 48 89 c8 48 c7 00 00 00 00 00 48 8d 48 f8 48 39 d0 75 ed 48
[40145.950321] RSP: 002b:00007f113e958a40 EFLAGS: 00000246 ORIG_RAX: 000000000000003c
[40145.957891] RAX: ffffffffffffffda RBX: 00007f113e859000 RCX: 00007f116e206296
[40145.965033] RDX: 000000000000003c RSI: 00007f1153f9f9c0 RDI: 0000000000000000
[40145.972177] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000056b90006
[40145.979317] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f1149f8925e
[40145.986456] R13: 00007f1149f8925f R14: 00007f113e959700 R15: 00007f113e958b00
[40145.993606]  </TASK>
[40145.995808] Modules linked in: drbd_transport_tcp(O) drbd(O) ahci i40e sp5100_tco bnxt_en amd64_edac megaraid_sas libahci nvme k10temp watchdog
[40146.008673] ---[ end trace 0000000000000000 ]---
[40146.013298] RIP: 0010:is_uprobe_at_func_entry+0x28/0x80
[40146.018531] Code: 90 90 0f 1f 44 00 00 65 48 8b 04 25 80 e3 02 00 48 83 b8 30 0b 00 00 00 74 60 48 8b 80 30 0b 00 00 48 8b 50 30 48 85 d2 74 50 <80> 3a 55 b8 01 00 00 00 74 1b 48 8b 8f 88 00 00 00 48 83 f9 33 74
[40146.037290] RSP: 0018:ffffc900007c8bc8 EFLAGS: 00010082
[40146.042521] RAX: ffff88813eafb120 RBX: ffffc900007c8c20 RCX: 00007f116e206296
[40146.049662] RDX: 9e759c37ee555c76 RSI: 0000000000000001 RDI: ffffc90111fa3f58
[40146.056805] RBP: ffffc90111fa3f58 R08: 000000000002aee0 R09: 0000000000000008
[40146.063946] R10: ffffc90111fa0000 R11: ffffc900007c8ff8 R12: 0000000000000000
[40146.071088] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[40146.078227] FS:  00007f113e959700(0000) GS:ffff88defb500000(0000) knlGS:0000000000000000
[40146.086321] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[40146.092077] CR2: 000015b40194c804 CR3: 0000000363b74003 CR4: 0000000000f70ee0
[40146.099222] PKRU: 55555554
[40146.101943] Kernel panic - not syncing: Fatal exception in interrupt
[40146.108739] Kernel Offset: disabled
[40146.112246] Rebooting in 10 seconds..

The text was updated successfully, but these errors were encountered:

korniltsev · 2024-11-18T04:55:12Z

Seems like a kernel/ebpf subsystem bug.
Do you use pyroscope.ebpf component in alloy as profiler? Or is it coroot profiling? Which version? Which configuration?

bpf_prog_9aac297fb833e2f5_do_perf_event suggests it could be pyroscope either or coroot

Will you be able to share your kernel+modules so we could try reproducing it?

maxpain · 2024-11-19T08:00:55Z

Or is it coroot profiling?

Yes, It's coroot.
We deploy coroot using the official helm chart with default configuration.
Version of helm chart: 0.15.16

Will you be able to share your kernel+modules so we could try reproducing it?

I use Talos v1.8.2 (Linux 6.6.58) built on factory.talos.dev with following configuration:

customization:
    extraKernelArgs:
        - console=ttyS0,115200n8r
        - -lockdown
        - lockdown=integrity
        - cpufreq.default_governor=performance
        - amd_pstate=active
        - mitigations=off
        - iommu=off
    systemExtensions:
        officialExtensions:
            - siderolabs/amd-ucode
            - siderolabs/amdgpu-firmware
            - siderolabs/drbd

https://factory.talos.dev/image/c4402c8cf9c87bcdc3947f2cc6e9486f413ca69716fa3b0a4c0c9863aafe963f/v1.8.2/metal-amd64-secureboot.iso

borkmann · 2025-01-09T10:18:20Z

@maxpain would you be able to test a kernel patch?

maxpain · 2025-01-09T10:23:27Z

@maxpain would you be able to test a kernel patch?

It would not be easy since I built Talos Linux images using factory.talos.dev and Secure Boot.

I think the panic could be caused by Puppeteer (chrome for developers) pods in my cluster. I can reproduce it on another hardware.

borkmann · 2025-01-09T10:45:20Z

Would you be able to test this one?

From e73a85a3fc1753656aba6d365640b16dca432ae1 Mon Sep 17 00:00:00 2001
From: Daniel Borkmann <[email protected]>
Date: Thu, 9 Jan 2025 09:01:59 +0000
Subject: [PATCH] events: Fix GPF due to corrupted utask->auprobe pointer

Fixes: cfa7f3d2c526 ("perf,x86: avoid missing caller address in stack traces captured in uprobe")
Signed-off-by: Daniel Borkmann <[email protected]>
Cc: Andrii Nakryiko <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Jiri Olsa <[email protected]>
Link: https://github.com/grafana/pyroscope/issues/3673
---
 arch/x86/events/core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index c75c482d4c52..05f9cedf2691 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2835,6 +2835,8 @@ static bool is_uprobe_at_func_entry(struct pt_regs *regs)

        if (!current->utask)
                return false;
+       if (!current->utask->active_uprobe)
+               return false;

        auprobe = current->utask->auprobe;
        if (!auprobe)
-- 
2.43.0

olsajiri · 2025-01-09T13:18:43Z

I can reproduce with following running in separate terminals

# while :; do bpftrace -e 'uprobe:/bin/ls:_start  { printf("hit\n"); }' -c ls; done
# bpftrace -e 'profile:hz:100000 { @[ustack()] = count(); }'

looks like we have a fix already, will send out shortly

Max Makarov reported kernel panic [1] in perf user callchain code. The reason for that is the race between uprobe_free_utask and bpf profiler code doing the perf user stack unwind and is triggered within uprobe_free_utask function: - after current->utask is freed and - before current->utask is set to NULL general protection fault, probably for non-canonical address 0x9e759c37ee555c76: 0000 [#1] SMP PTI RIP: 0010:is_uprobe_at_func_entry+0x28/0x80 ... ? die_addr+0x36/0x90 ? exc_general_protection+0x217/0x420 ? asm_exc_general_protection+0x26/0x30 ? is_uprobe_at_func_entry+0x28/0x80 perf_callchain_user+0x20a/0x360 get_perf_callchain+0x147/0x1d0 bpf_get_stackid+0x60/0x90 bpf_prog_9aac297fb833e2f5_do_perf_event+0x434/0x53b ? __smp_call_single_queue+0xad/0x120 bpf_overflow_handler+0x75/0x110 ... asm_sysvec_apic_timer_interrupt+0x1a/0x20 RIP: 0010:__kmem_cache_free+0x1cb/0x350 ... ? uprobe_free_utask+0x62/0x80 ? acct_collect+0x4c/0x220 uprobe_free_utask+0x62/0x80 mm_release+0x12/0xb0 do_exit+0x26b/0xaa0 __x64_sys_exit+0x1b/0x20 do_syscall_64+0x5a/0x80 It can be easily reproduced by running following commands in separate terminals: # while :; do bpftrace -e 'uprobe:/bin/ls:_start { printf("hit\n"); }' -c ls; done # bpftrace -e 'profile:hz:100000 { @[ustack()] = count(); }' Fixing this by making sure current->utask pointer is set to NULL before we start to release the utask object. [1] grafana/pyroscope#3673 Reported-by: Max Makarov <[email protected]> Signed-off-by: Jiri Olsa <[email protected]>

borkmann · 2025-01-09T14:33:55Z

@olsajiri found a better variant which does not incur an additional runtime check, got submitted here: https://lore.kernel.org/bpf/[email protected]/

Max Makarov reported kernel panic [1] in perf user callchain code. The reason for that is the race between uprobe_free_utask and bpf profiler code doing the perf user stack unwind and is triggered within uprobe_free_utask function: - after current->utask is freed and - before current->utask is set to NULL general protection fault, probably for non-canonical address 0x9e759c37ee555c76: 0000 [#1] SMP PTI RIP: 0010:is_uprobe_at_func_entry+0x28/0x80 ... ? die_addr+0x36/0x90 ? exc_general_protection+0x217/0x420 ? asm_exc_general_protection+0x26/0x30 ? is_uprobe_at_func_entry+0x28/0x80 perf_callchain_user+0x20a/0x360 get_perf_callchain+0x147/0x1d0 bpf_get_stackid+0x60/0x90 bpf_prog_9aac297fb833e2f5_do_perf_event+0x434/0x53b ? __smp_call_single_queue+0xad/0x120 bpf_overflow_handler+0x75/0x110 ... asm_sysvec_apic_timer_interrupt+0x1a/0x20 RIP: 0010:__kmem_cache_free+0x1cb/0x350 ... ? uprobe_free_utask+0x62/0x80 ? acct_collect+0x4c/0x220 uprobe_free_utask+0x62/0x80 mm_release+0x12/0xb0 do_exit+0x26b/0xaa0 __x64_sys_exit+0x1b/0x20 do_syscall_64+0x5a/0x80 It can be easily reproduced by running following commands in separate terminals: # while :; do bpftrace -e 'uprobe:/bin/ls:_start { printf("hit\n"); }' -c ls; done # bpftrace -e 'profile:hz:100000 { @[ustack()] = count(); }' Fixing this by making sure current->utask pointer is set to NULL before we start to release the utask object. [1] grafana/pyroscope#3673 Reported-by: Max Makarov <[email protected]> Signed-off-by: Jiri Olsa <[email protected]> Acked-by: Oleg Nesterov <[email protected]>

Max Makarov reported kernel panic [1] in perf user callchain code. The reason for that is the race between uprobe_free_utask and bpf profiler code doing the perf user stack unwind and is triggered within uprobe_free_utask function: - after current->utask is freed and - before current->utask is set to NULL general protection fault, probably for non-canonical address 0x9e759c37ee555c76: 0000 [#1] SMP PTI RIP: 0010:is_uprobe_at_func_entry+0x28/0x80 ... ? die_addr+0x36/0x90 ? exc_general_protection+0x217/0x420 ? asm_exc_general_protection+0x26/0x30 ? is_uprobe_at_func_entry+0x28/0x80 perf_callchain_user+0x20a/0x360 get_perf_callchain+0x147/0x1d0 bpf_get_stackid+0x60/0x90 bpf_prog_9aac297fb833e2f5_do_perf_event+0x434/0x53b ? __smp_call_single_queue+0xad/0x120 bpf_overflow_handler+0x75/0x110 ... asm_sysvec_apic_timer_interrupt+0x1a/0x20 RIP: 0010:__kmem_cache_free+0x1cb/0x350 ... ? uprobe_free_utask+0x62/0x80 ? acct_collect+0x4c/0x220 uprobe_free_utask+0x62/0x80 mm_release+0x12/0xb0 do_exit+0x26b/0xaa0 __x64_sys_exit+0x1b/0x20 do_syscall_64+0x5a/0x80 It can be easily reproduced by running following commands in separate terminals: # while :; do bpftrace -e 'uprobe:/bin/ls:_start { printf("hit\n"); }' -c ls; done # bpftrace -e 'profile:hz:100000 { @[ustack()] = count(); }' Fixing this by making sure current->utask pointer is set to NULL before we start to release the utask object. [1] grafana/pyroscope#3673 Reported-by: Max Makarov <[email protected]> Signed-off-by: Jiri Olsa <[email protected]> Acked-by: Oleg Nesterov <[email protected]> Acked-by: Andrii Nakryiko <[email protected]>

Max Makarov reported kernel panic [1] in perf user callchain code. The reason for that is the race between uprobe_free_utask and bpf profiler code doing the perf user stack unwind and is triggered within uprobe_free_utask function: - after current->utask is freed and - before current->utask is set to NULL general protection fault, probably for non-canonical address 0x9e759c37ee555c76: 0000 [#1] SMP PTI RIP: 0010:is_uprobe_at_func_entry+0x28/0x80 ... ? die_addr+0x36/0x90 ? exc_general_protection+0x217/0x420 ? asm_exc_general_protection+0x26/0x30 ? is_uprobe_at_func_entry+0x28/0x80 perf_callchain_user+0x20a/0x360 get_perf_callchain+0x147/0x1d0 bpf_get_stackid+0x60/0x90 bpf_prog_9aac297fb833e2f5_do_perf_event+0x434/0x53b ? __smp_call_single_queue+0xad/0x120 bpf_overflow_handler+0x75/0x110 ... asm_sysvec_apic_timer_interrupt+0x1a/0x20 RIP: 0010:__kmem_cache_free+0x1cb/0x350 ... ? uprobe_free_utask+0x62/0x80 ? acct_collect+0x4c/0x220 uprobe_free_utask+0x62/0x80 mm_release+0x12/0xb0 do_exit+0x26b/0xaa0 __x64_sys_exit+0x1b/0x20 do_syscall_64+0x5a/0x80 It can be easily reproduced by running following commands in separate terminals: # while :; do bpftrace -e 'uprobe:/bin/ls:_start { printf("hit\n"); }' -c ls; done # bpftrace -e 'profile:hz:100000 { @[ustack()] = count(); }' Fixing this by making sure current->utask pointer is set to NULL before we start to release the utask object. [1] grafana/pyroscope#3673 Fixes: cfa7f3d ("perf,x86: avoid missing caller address in stack traces captured in uprobe") Reported-by: Max Makarov <[email protected]> Signed-off-by: Jiri Olsa <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Oleg Nesterov <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/r/[email protected]

maxpain mentioned this issue Nov 12, 2024

Kernel Panic coroot/coroot#377

Open

knylander-grafana added the type/bug Something isn't working label Nov 13, 2024

maxpain mentioned this issue Nov 13, 2024

Kernel Panic on Talos Linux v1.8.2 cilium/cilium#35874

Closed

3 tasks

korniltsev self-assigned this Nov 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Kernel panic #3673

Kernel panic #3673

maxpain commented Nov 12, 2024

korniltsev commented Nov 18, 2024

maxpain commented Nov 19, 2024

borkmann commented Jan 9, 2025

maxpain commented Jan 9, 2025

borkmann commented Jan 9, 2025

olsajiri commented Jan 9, 2025

borkmann commented Jan 9, 2025

Kernel panic #3673

Kernel panic #3673

Comments

maxpain commented Nov 12, 2024

korniltsev commented Nov 18, 2024

maxpain commented Nov 19, 2024

borkmann commented Jan 9, 2025

maxpain commented Jan 9, 2025

borkmann commented Jan 9, 2025

olsajiri commented Jan 9, 2025

borkmann commented Jan 9, 2025