TS-7800-V2: linux-6.6.y: Inconsistent tssdcard failure #94

Open
ts-kris opened this issue Jul 12, 2024 · 1 comment
ts-kris commented Jul 12, 2024

This is difficult to reproduce; it is unclear at this time what triggers it.

[    7.743784] 8<--- cut here ---
[    7.750479] Unable to handle kernel NULL pointer dereference at virtual address 00000a05 when read
[    7.759494] [00000a05] *pgd=00000000
[    7.763110] Internal error: Oops: 5 [#1] PREEMPT SMP ARM
[    7.768452] Modules linked in:
[    7.771527] CPU: 1 PID: 61 Comm: kworker/u4:1 Not tainted 6.6.50-00190-gf78eed5874d4 #1
[    7.779574] Hardware name: Marvell Armada 380/385 (Device Tree)
[    7.785523] Workqueue: tssdcarda diskpoll_thread
[    7.790180] PC is at ___slab_alloc+0x27c/0x6bc
[    7.794660] LR is at ___slab_alloc+0x11c/0x6bc
[    7.799135] pc : [<c0271b30>]    lr : [<c02719d0>]    psr: 20000013
[    7.805434] sp : e0ae1df8  ip : dfb61788  fp : 00000000
[    7.810684] r10: 00000008  r9 : 00000a01  r8 : 00000000
[    7.815932] r7 : 20000013  r6 : c1001380  r5 : dfb61788  r4 : dfbb68f0
[    7.822492] r3 : 00000026  r2 : c1a2b780  r1 : 80100010  r0 : dfbb68f0
[    7.829052] Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
[    7.836224] Control: 10c5387d  Table: 0000404a  DAC: 00000051
[    7.841995] Register r0 information: non-slab/vmalloc memory
[    7.847689] Register r1 information: non-paged memory
[    7.852767] Register r2 information: slab task_struct start c1a2b780 pointer offset 0 size 2368
[    7.861530] Register r3 information: non-paged memory
[    7.866608] Register r4 information: non-slab/vmalloc memory
[    7.872296] Register r5 information: non-slab/vmalloc memory
[    7.877985] Register r6 information: slab kmem_cache start c1001380 pointer offset 0 size 124
[    7.886569] Register r7 information: non-paged memory
[    7.891647] Register r8 information: NULL pointer
[    7.896376] Register r9 information: non-paged memory
[    7.901454] Register r10 information: non-paged memory
[    7.906618] Register r11 information: NULL pointer
[    7.911435] Register r12 information: non-slab/vmalloc memory
[    7.917211] Process kworker/u4:1 (pid: 61, stack limit = 0xf8b75b6d)
[    7.923598] Stack: (0xe0ae1df8 to 0xe0ae2000)
[    7.927980] 1de0:                                                       00000010 c026f5b0
[    7.936201] 1e00: 00000dc0 c1a2b780 80100010 c1001100 c0241814 c1a2b780 e0ae1df8 00000048
[    7.944422] 1e20: c128c800 00000cc0 00000cc0 c128c800 00000cc0 c022bca4 e0ae1e50 00000048
[    7.952642] 1e40: 00100010 00000000 c1a7a32c c0272350 e0ae1e64 c1a2b780 1ef12000 00000230
[    7.960861] 1e60: 00000008 e0ae1e90 c1a2b780 00000000 c1a7a32c c0271fa4 dfb61788 00000230
[    7.969080] 1e80: c1001380 00000dc0 00000230 c02733dc 00000230 00000000 00000002 c022bd14
[    7.977300] 1ea0: c128c800 00000000 00000000 6e68e8cf c18c3c00 00000000 00000008 c2e2e000
[    7.985519] 1ec0: c18c3c4c c02442fc c0241814 00000dc0 00000008 c0241814 00000008 00000002
[    7.993738] 1ee0: c18c3c00 c0411ce8 00000008 c2e2e000 c0d7e64c 00000000 c1369524 c1a2b780
[    8.001958] 1f00: 00000000 c0411e28 c1369524 c1369414 c1369410 c065e894 c1a7a300 c1a1b300
[    8.010179] 1f20: c1369528 c1008000 c1369524 c01350dc 00000000 c1a2b780 c1008000 c1a1b305
[    8.018399] 1f40: c1008020 c1a7a300 c1008000 c1008000 c1008020 c1a2b780 c1a7a32c 00000000
[    8.026619] 1f60: 00000000 c01356a4 00000000 c1a2b780 c1abf1c0 c1abf2c0 c0135444 c1a7a300
[    8.034838] 1f80: e0859edc 00000000 00000000 c013b4b0 c1abf1c0 c013b3ac 00000000 00000000
[    8.043056] 1fa0: 00000000 00000000 00000000 c010014c 00000000 00000000 00000000 00000000
[    8.051274] 1fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[    8.059492] 1fe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000
[    8.067713]  ___slab_alloc from __slab_alloc.constprop.0+0x34/0x68
[    8.073947]  __slab_alloc.constprop.0 from __kmem_cache_alloc_node+0x90/0x148
[    8.081140]  __kmem_cache_alloc_node from kmalloc_node_trace+0xc/0x14
[    8.087640]  kmalloc_node_trace from bdi_alloc+0x1c/0x7c
[    8.093001]  bdi_alloc from __alloc_disk_node+0x54/0x170
[    8.098361]  __alloc_disk_node from __blk_alloc_disk+0x24/0x50
[    8.104239]  __blk_alloc_disk from diskpoll_thread+0xd0/0x178
[    8.110027]  diskpoll_thread from process_scheduled_works+0x184/0x258
[    8.116514]  process_scheduled_works from worker_thread+0x260/0x2b8
[    8.122825]  worker_thread from kthread+0x104/0x10c
[    8.127750]  kthread from ret_from_fork+0x14/0x28
[    8.132498] Exception stack(0xe0ae1fb0 to 0xe0ae1ff8)
[    8.137577] 1fa0:                                     00000000 00000000 00000000 00000000
[    8.145795] 1fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[    8.154012] 1fe0: 00000000 00000000 00000000 00000000 00000013 00000000
[    8.160665] Code: e1a01004 e1a00006 ebfffd08 eaffff74 (e5993004) 
[    8.166810] ---[ end trace 0000000000000000 ]---
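
One way to read the `Code:` line above: the word in parentheses is the instruction that faulted. The following is a minimal sketch, assuming the standard A32 single-register load/store immediate encoding; `decode_arm_ldr_imm` is a hypothetical helper written for illustration, not part of any kernel tool. It decodes `e5993004` and combines it with the `r9` value from the register dump to check whether the result matches the reported fault address.

```python
def decode_arm_ldr_imm(word):
    """Decode an A32 LDR/STR (immediate offset) instruction word.

    Hypothetical helper for illustration; it only handles the
    immediate-offset form and ignores the condition field.
    """
    if (word >> 26) & 0b11 != 0b01 or (word >> 25) & 1:
        raise ValueError("not an immediate-offset LDR/STR encoding")
    return {
        "load": bool((word >> 20) & 1),         # L bit: 1 = load, 0 = store
        "byte": bool((word >> 22) & 1),         # B bit: 1 = byte access
        "add": bool((word >> 23) & 1),          # U bit: 1 = add the offset
        "pre_indexed": bool((word >> 24) & 1),  # P bit: 1 = offset applied before access
        "rn": (word >> 16) & 0xF,               # base register number
        "rd": (word >> 12) & 0xF,               # destination register number
        "imm": word & 0xFFF,                    # 12-bit immediate offset
    }


insn = decode_arm_ldr_imm(0xE5993004)  # faulting word from the "Code:" line
r9 = 0x00000A01                        # r9 value from the register dump

offset = insn["imm"] if insn["add"] else -insn["imm"]
fault_addr = (r9 + offset) & 0xFFFFFFFF

print(f"ldr r{insn['rd']}, [r{insn['rn']}, #{offset}]")
print(f"computed fault address: {fault_addr:#010x}")
```

The decoded instruction is a word load from `r9 + 4`, and `0x00000a01 + 4` equals the reported address `00000a05`. That suggests, without proving it, that `___slab_alloc` was following a small bogus pointer (`0xa01`) rather than a true NULL, which would be consistent with corrupted allocator state rather than a bug at the allocation call site itself.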
ts-kris self-assigned this Jul 12, 2024
markfeathers pushed a commit that referenced this issue Aug 14, 2024
[ Upstream commit 305a5170dc5cf3d395bb4c4e9239bca6d0b54b49 ]

Currently, mdadm supports --revert-reshape to abort the reshape while
reassembling, as exercised by the test 07revert-grow. However, the
following BUG_ON() can be triggered by the test:

kernel BUG at drivers/md/raid5.c:6278!
invalid opcode: 0000 [#1] PREEMPT SMP PTI
irq event stamp: 158985
CPU: 6 PID: 891 Comm: md0_reshape Not tainted 6.9.0-03335-g7592a0b0049a #94
RIP: 0010:reshape_request+0x3f1/0xe60
Call Trace:
 <TASK>
 raid5_sync_request+0x43d/0x550
 md_do_sync+0xb7a/0x2110
 md_thread+0x294/0x2b0
 kthread+0x147/0x1c0
 ret_from_fork+0x59/0x70
 ret_from_fork_asm+0x1a/0x30
 </TASK>

The root cause is that --revert-reshape updates raid_disks from 5 to 4
while the reshape position is still set. After the array is reassembled,
the reshape position is read from the superblock, and during reshape the
check of 'writepos', which is calculated from the old reshape position,
fails.

Fix this panic the easy way first, by converting the BUG_ON() to
WARN_ON() and stopping the reshape if the checks fail.

Note that mdadm must fix --revert-reshape as well, and md/raid should
probably enhance metadata validation too; however, that means
reassembly will fail, so there must be user tools to fix the wrong
metadata.

Signed-off-by: Yu Kuai <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sasha Levin <[email protected]>
markfeathers pushed a commit that referenced this issue Aug 19, 2024
[ Upstream commit 305a5170dc5cf3d395bb4c4e9239bca6d0b54b49 ]


ts-kris commented Sep 16, 2024

This seems to be an issue only at SD card initialization during startup, with roughly a 5% chance of occurring. If the SD card initializes properly, it is rock solid. So far there have been about 90 hours of testing (plus probably another 72 when first bootstrapping the TS-7800-V2 on 6.6.y) hammering on the SD card on a unit that booted successfully. Tests include end-to-end sequential reads/writes, bonnie++ tests, and random access across the entire filesystem with hash testing for file integrity.
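
As a rough planning aid for verifying any future fix, the arithmetic below estimates how many consecutive clean boots would be needed before a 5% per-boot failure could reasonably be considered gone. It is a sketch only: it assumes each boot is independent and that the 5% figure is accurate, and `clean_boots_needed` is a hypothetical name used for illustration.

```python
import math


def clean_boots_needed(failure_rate, confidence):
    """Smallest n such that (1 - failure_rate) ** n <= 1 - confidence.

    In other words: if the bug were still present at the given per-boot
    failure rate, n clean boots in a row would happen by chance with
    probability at most (1 - confidence).
    """
    if not (0 < failure_rate < 1 and 0 < confidence < 1):
        raise ValueError("failure_rate and confidence must be between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(1 - failure_rate))


boots = clean_boots_needed(failure_rate=0.05, confidence=0.95)
print(f"clean boots needed for 95% confidence: {boots}")
```

Under these assumptions the answer is 59 boots, since 0.95 raised to the 59th power is just under 0.05. If the true failure rate is lower than 5%, correspondingly more boot cycles would be needed.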
