After "apt upgrade" every other boot fails #6349
Comments
Which medium are these Pis booting from?
@pelwell Micro SD card.
What sort of micro SD card?
It's a SanDisk Extreme Pro 128GB on all three of the devices. For what it's worth, I also have a bunch of external HDDs attached to each Pi; however, those have their own power supply and are connected via a powered hub. I also use an initramfs with dropbear for rootfs decryption, following https://github.com/ViRb3/pi-encrypted-boot-ssh. This setup had never caused issues until yesterday.
Is your debug output above from a serial port? Or rather, do you have a Pi 5-compatible serial cable you can use to capture debug output? If so, try with
These logs were taken from the HDMI output; I just wrote down the important lines by hand. I do actually have a debug probe, and I hooked it up for the past two days, but unfortunately, even after 30+ restarts at various times of the day and intervals, I could not reproduce the issue. I thought maybe it was a fluke, so I restarted my other two Raspberry Pis to test, and surprisingly both of them failed to boot exactly like before. Sadly, those two devices are remote, so I can't attach a probe to them. I would like to mention that when this issue occurred before, I could enter the initramfs/dropbear environment, but after decrypting the rootfs, the mounting timeout would happen. In my last attempt on the remote Raspberry Pis, I did
Sorry I can't provide any concrete details, but since the day the Raspberry Pi 5s were released, I have been running these 3 in the exact same setup with no issues. So while I can't tell what's wrong, something definitely seems off. Thanks!
Is it possible that you need to regenerate the initramfs following the kernel update?
It auto-regenerates during every update, and I explicitly looked for those lines to make sure it didn't go wrong, as it was my first suspect as well. I also regenerated it manually, just to double-check.
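(For reference, and as an assumption on my part rather than something stated in the thread, a manual regeneration on Raspberry Pi OS would look something like this:)

```
# Rebuild the initramfs for every installed kernel, then confirm the
# freshly written images exist and have current timestamps.
sudo update-initramfs -u -k all
ls -alt /boot/initrd.img-*
```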
I would also like to add that I have updated all 3 devices in this exact way at least 10 times since the Raspberry Pi 5 was released and never experienced an issue. Here's the ls of the initrd files:

```
# ls -alt /boot/initrd*
-rw------- 1 root root 13890585 Sep 13 18:36 /boot/initrd.img-6.6.31+rpt-rpi-2712
-rw------- 1 root root 13888365 Sep 13 18:35 /boot/initrd.img-6.6.31+rpt-rpi-v8
-rw------- 1 root root 14076413 Sep 13 18:35 /boot/initrd.img-6.6.47+rpt-rpi-2712
-rw------- 1 root root 14070518 Sep 13 18:35 /boot/initrd.img-6.6.47+rpt-rpi-v8
```

Actually, one more thing I forgot to mention: I am running the kernel with 4KB pages instead of 16KB for compatibility reasons with my external disks, which are formatted with BTRFS and 4KB sectors.
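(Side note, not from the thread: on a Pi 5 the 4KB-page kernel is normally selected in config.txt. A sketch, assuming the stock kernel images:)

```
# /boot/firmware/config.txt
# Assumption: force the v8 (4KB page) kernel instead of the default
# 16KB-page kernel_2712.img on a Raspberry Pi 5.
kernel=kernel8.img
```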
One of my remote Raspberry Pi 5s just hung up out of the blue. It did not reboot; the green light was still on, but it was completely unresponsive (could not SSH, no disk activity, etc.). After force-restarting it, I checked the logs, and there was absolutely nothing since it crashed ~8 hours ago. The last logs were just some routine networking-related cron jobs I have running. I have now downgraded all my devices to the previous kernel, 6.6.31, and will report back how it goes. For reference, this is how I did it:

```
set -e
cd /boot/firmware
mv kernel8.img kernel8.img.bak
mv kernel_2712.img kernel_2712.img.bak
mv initramfs_2712 initramfs_2712.bak
mv initramfs8 initramfs8.bak
cp -a ../vmlinuz-6.6.31+rpt-rpi-2712 kernel_2712.img
cp -a ../vmlinuz-6.6.31+rpt-rpi-v8 kernel8.img
cp -a ../initrd.img-6.6.31+rpt-rpi-2712 initramfs_2712
cp -a ../initrd.img-6.6.31+rpt-rpi-v8 initramfs8
```

Reference of kernel versions: https://archive.raspberrypi.com/debian/pool/main/l/linux/
Since downgrading, I have restarted the servers multiple times and had no issues. No random crashes either. I am therefore very confident that this was due to the kernel upgrade. Sadly, I'm not sure how I can help more.
The major change to the SD card handling has been the addition of support for command queueing. I'm going to list a number of kernel builds in order of increasing time. For each build, run
The builds:
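(The command and the build list did not survive extraction. Purely as an illustration, installing a specific test build is usually done with rpi-update and a firmware commit hash; <hash> below is a placeholder, not one of the builds from this thread:)

```
# Install a specific firmware/kernel build identified by a commit hash,
# then reboot and confirm which kernel is actually running.
sudo rpi-update <hash>
sudo reboot
uname -a
```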
I think I have the same issue on my RPi 5 on 6.6.47. The OS and all data are on a SanDisk Extreme 256 GB SD card. Sometimes it hangs at boot, but for me it's less than 50% of the time. Additionally, sometimes during normal system runtime, the whole system suddenly freezes. Because I have a system monitor open at all times, I always see disk I/O on my SD card completely stop when that happens. From one moment to the next there is not a single byte of disk I/O to or from the SD card any longer, and from there the system completely freezes and needs a hard power cycle. For me this happens much more frequently when there is a lot of I/O going on, such as pulling large container images. For example, there is one particular image that's almost 2 GB which I tried to pull dozens of times last week, and each time the system froze eventually. It can also happen during normal desktop use, e.g. while browsing with Firefox or using VS Code, though less frequently. I can reasonably rule out a faulty SD card because I bought a new SanDisk Extreme 128 GB card, cloned the OS to it (yes, I did properly downsize the partitions) and had the exact same issue with that. Just now I downgraded to 6.6.22 using
Thanks for chiming in @fshimizu! Happy to hear this is not an isolated case. I have been meaning to troubleshoot more, but sadly have a busy week ahead. I hope I can help more after that. Can I ask if you have any external disks connected? When you mention that I/O causes the crash, do you mean load on the SD card or on an external disk?
@ViRb3 Happy to try and clarify further. No external disks. This Pi is only for experimenting and a bit of desktop use. So there are peripherals and an audio device connected via USB, but the only storage is the SD card. So, all I/O is on the SD card.
Let me be careful here: I can't really confirm that I/O directly causes the crash. It's just an observation that, for me, it seems to happen much more often during times of heavy I/O, but not exclusively. I tried to get more info, e.g. by running
Thanks for the update. @pelwell posted some commit hashes of various major changes in the kernel above. Can you follow the instructions and test which ones do and don't crash? I'm away from my Pis, so I don't dare touch them, as I can't do anything if they hang.
This is certainly not exhaustive, but I installed the kernels listed and ran the following test procedure, which had previously frozen my system on 6.6.47 reliably.
These are the results.
Test 5 was a control with the regular kernel installed by the distro, which had been freezing for me. On the first try I didn't even get to the main test (pulling the image); the system already froze while just starting some other, smaller containers. Test 6 was another control to verify that it would work again with the older kernel.
On the version with the crash/freeze, can you add
@P33M thanks a lot for that suggestion. You might be onto something here. I've done the same test as before, with and without the
All tests were done with the kernel from the distribution where I initially noticed the issue:
I will keep using the problematic 6.6.47 kernel with the
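(The option under discussion is named later in the thread as sd_cqe=off. As an assumption about how it is applied, on a Pi 5 it would be set as a device-tree parameter in config.txt:)

```
# /boot/firmware/config.txt
# Assumption: disable SD command queueing via the sd_cqe dtparam,
# then reboot for the change to take effect.
dtparam=sd_cqe=off
```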
To hopefully rule out some other factors, here are some details on @ViRb3's setup and my settings in comparison.
I have no storage other than the SD card.
I use neither storage encryption nor dropbear.
I haven't changed that, so I use whatever is the default in a fresh installation.
For each of the cards where boot fails without sd_cqe=off, can you run and post the output of
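(The requested command was lost in extraction. Purely as an illustration, the card identification registers can also be read from sysfs; this assumes the SD card shows up as mmcblk0:)

```
# Dump the CID/CSD/SCR registers plus manufacturer, name, date and
# revision fields for the card behind mmcblk0.
for f in cid csd scr manfid oemid name date fwrev hwrev serial; do
    printf '%s: ' "$f"; cat "/sys/block/mmcblk0/device/$f"
done
```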
Here you go: device-info.txt. At the moment I only have this one SD card. The other one I had cloned from it previously has already been overwritten.
I've managed to hang a card I previously thought was reliable: a SanDisk Extreme Pro 64GB with a date of 09/2021. Running fstrim, sync and fio in parallel managed to wedge the card in about 20 minutes.
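(A sketch of one way to generate that kind of parallel load; the file name, sizes and runtime are my own choices, not values from the thread, and fio must be installed:)

```
#!/bin/bash
# Run random 4k direct writes with fio in the background while
# repeatedly issuing fstrim and sync in the foreground loop.
fio --name=stress --filename=/home/pi/fio-test --size=1G \
    --rw=randwrite --bs=4k --direct=1 --time_based --runtime=1200 &
FIO_PID=$!
while kill -0 "$FIO_PID" 2>/dev/null; do
    sudo fstrim -v /   # trim free space on the root filesystem
    sync               # flush pending writes to the card
    sleep 5
done
wait "$FIO_PID"
```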
I can reproduce this on our own cards; it smells a lot like a deadlock, as there are no IO or CQE engine errors, but eventually all tasks that were accessing the mmc device report scheduler timeouts. I was operating under the assumption that async and sync mmc_blk commands had sufficient guards to prevent doing sync while async requests were pending, but perhaps not.
This reverts commit 216df57. See raspberrypi#6349. There is an unknown hang when issuing flush commands that may be triggered when IO is pending. Revert while the investigation takes place.
I can now get CQE to halt cleanly after finishing all its current tasks, but things still fall over because of the MMC framework's multiple interlocking mechanisms for declaring "I'm exclusively using the host". Now at least I get timeouts attributed to the MMC layer, but forward progress for IO still stops.
Hi, I was already reporting problems with CQE on a SanDisk Extreme Pro 128GB and a SanDisk Extreme 64GB during testing. I could not find the cause at that time, but maybe now I have found something: using a swap partition (instead of the default swap file) on the SD card together with CQE results in failed boots in 8/10 cases. The fstab looks like:
Without the swap partition, the system boots with CQE in 10/10 cases, but sometimes some individual files are read incorrectly. I mean, there is no error reading the file, but the checksums do not match. Disabling CQE cures all the errors. With the new kernel (6.6.51-1+rpt3 (2024-10-08)), which changes the default CQE behaviour to off, I sometimes observe CQE recovery in dmesg. Maybe the developers enabled some kind of CQE debugging? This is without the swap partition, while checksumming the files on the SD card. I believe that CQE recovery somehow reads the files just fine (no error in checksum), but it is slightly slower. It looks like this:
Does it mean that my SD card is faulty with CQE enabled? I have just received the new official Raspberry Pi A2 SD card. I'll try that one as well.
In about 30 minutes, after the auto-build completes, please try testing with
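(The build reference was lost here. A later comment mentions the pulls/6418 build, so, as an assumption, fetching it would look something like this:)

```
# Assumption: the test build lives on the pulls/6418 ref of the
# firmware repo and can be installed with rpi-update.
sudo rpi-update pulls/6418
sudo reboot
```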
Thanks a lot for your efforts, @P33M. I'll test when I get some time. One question ahead of time:
Is there an easy way to check that without a serial cable or other debug setup while testing? I will do the same tests as I did previously in any case. Just wondering if I can get more info out of it than simply "hangs" or "probably doesn't hang".
After a round of my usual amateurish tests with the test build and with my old kernel from the distro, I'm cautiously optimistic.
I'll keep using the pulls/6418 build with
In other news: my SD card now gives off a distinct glow in the dark. :-)
I have some good news! I'm still testing the stock 6.6.51-1+rpt3 (2024-10-08) kernel; I didn't have time to test pulls/6418 yet. I have in total 4 different A2 SD cards, including the new official Raspberry Pi A2 card:
The recovery messages in dmesg are not so frequent. I'm checksumming 5GB of files on the SD card 100 times and get, on average, only about 20 recovery messages, so it is 1/5. On the other hand, the swap partition creates problems almost every time, in approximately 4/5 of reboots. Now I think it is related to trim: fstrim alone works as expected, but fstrim immediately followed by a card read has a fatal effect. (This is probably what happens when activating a swap partition.)
I'll try to make it more reproducible.
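(A sketch of the kind of trim-then-read check being described; the file set, hash tool and cache-drop step are my assumptions:)

```
#!/bin/bash
# Record checksums, trim the filesystem, drop the page cache so the
# next read really comes from the card, then verify the checksums.
sha256sum /usr/bin/* > /tmp/before.sha256
sudo fstrim -v /
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
sha256sum --check --quiet /tmp/before.sha256 || echo "mismatch after fstrim"
```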
fstrim works for me on a Sandisk 64GB Extreme Pro A2. Lots of these cards should get trim/discard demoted to an erase operation, as there is a pre-existing quirk for broken behaviour. Please provide card 2+3 register information as per the command earlier in this issue, and the output of
If it helps, here is the quirks line of my Sandisk Extreme (non-Pro) A2 256 for which I provided the register info earlier.
Edit: This is on pulls/6418 with
Well, I don't think this is just a trim problem; actually it is a CQE+trim problem. With CQE disabled, trim works just fine on cards 2 and 3. It looks like the trim leaves CQE in a bad state, which corrupts the first read just after? Quirks:
The following test fails on cards 2 and 3 with CQE enabled, while it passes with CQE disabled:
Sometimes there is another error like
With CQE enabled, the failure rate is 10/10 on cards 2 and 3! Card 1 passes the test with CQE enabled. With card 4 I can't enable CQE. SD card info:
2:
3:
4:
The mechanism by which fstrim issues discard ops has the same potential failure mode as fsync in the apt kernel (hang with no obvious IO error). When trimming a large filesystem with many free extents, hundreds of trim commands for approx. the preferred erase size are issued back-to-back. Does the rpi-update version fix this?
No, unfortunately pulls/6418 does not fix the problem. I found out that the checksum of /usr/bin/ls actually changes. You can try any other command instead of
Actually, just the first 64kB of
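(As an illustration of that kind of check, and purely my assumption about how to do it: hash only the first 64kB of the file, reading it directly from the card so the page cache can't mask the problem:)

```
# Hash the first 64 kB of /usr/bin/ls before and after a trim;
# iflag=direct bypasses the page cache so the read hits the card.
sudo dd if=/usr/bin/ls bs=64k count=1 iflag=direct status=none | sha256sum
sudo fstrim -v /
sudo dd if=/usr/bin/ls bs=64k count=1 iflag=direct status=none | sha256sum
# Differing hashes indicate the read just after the trim was corrupted.
```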
Cards 2 & 3 are Sandisk models that fall within the range of dates for cards with known-bad flush behaviour, so I would expect them to be unreliable anyway. However, I have a card that is newer than the newest card with known-bad flush behaviour, a Sandisk Extreme Pro A2 64GB, that does not have the cache flush issue but does get hosed if fstrim is invoked.
In the garbled line where the kernel splat was printed to the console, I did
This is a separate failure mode from the OP, so please create a new issue.
Actually this is CQE related, but due to a clumsy interaction between the discard quirk and the spec. CMD38 is supported in CQ mode, but only with a Discard op. The quirk (applied to all Sandisk cards) demotes Discard to Erase, which presumably results in undefined behaviour. If I check for the correct option and disable CQ otherwise, the Sandisk card no longer returns bogus data. Pull request updated, and in about 45 minutes the autobuilds will complete. Re-run rpi-update (delete /boot/firmware/.firmware_revision first).
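(Spelled out, and assuming the pulls/6418 ref mentioned earlier is still the one to install, the re-run would look something like this:)

```
# Forget the previously installed revision so rpi-update re-downloads
# the updated build, then install it and reboot.
sudo rm /boot/firmware/.firmware_revision
sudo rpi-update pulls/6418
sudo reboot
```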
I've tried version 15645c8e..., which gives kernel 6.6.56-v8-16k+ #1805 SMP PREEMPT Tue Oct 15 20:06:35 BST 2024, but the problem persists.
The current state-of-the-art is
That fixes all errors on cards 2 and 3. I tested all the cases. Good job, thanks!
@P33M thanks so much for your work! When can we expect this to arrive in regular apt updates?
Oh, I've only now realized that CQ on cards 2 and 3 is actually disabled. The performance without CQ is lower than with CQ. For example, card 2 gives 3800 with CQ and only 2000 without CQ, as measured by pibenchmarks. Isn't there a way to keep CQ on and instead sanitize the trim call?
You're conflating two things: issuing commands to cards in the wrong state (a software bug), and cards that do not respond correctly to commands in CQ mode even if the state is correct (buggy cards). I can fix the first, but I can't fix the second, so those cards don't get CQ enabled.
Would it be possible to flush CQ after trim for those cards? Or issue a fake read of 64kB after trim (maybe 128kB to be sure)? I know these cards have at least one more issue, the one resulting in CQE recovery, but it seems that CQE recovery returns correct data, and it is much less frequent than the read-after-trim issue, which hits every time.
Describe the bug
I ran `apt update && apt upgrade` today. Since then, approximately every other reboot fails (a 50% chance) with the following error:
I was unfortunate enough to update 3 of my Raspberry Pi 5s, and every single one experiences this issue.
Steps to reproduce the behaviour
apt update && apt upgrade
Device(s)
Raspberry Pi 5
System
```
$ cat /etc/rpi-issue
Raspberry Pi reference 2023-12-11
Generated using pi-gen, https://github.com/RPi-Distro/pi-gen, 2acf7afcba7d11500313a7b93bb55a2aae20b2d6, stage2

$ vcgencmd version
2024/07/30 15:25:46
Copyright (c) 2012 Broadcom
version 790da7ef (release) (embedded)

$ uname -a
Linux server 6.6.47+rpt-rpi-v8 #1 SMP PREEMPT Debian 1:6.6.47-1+rpt1 (2024-09-02) aarch64 GNU/Linux
```
Logs
No response
Additional context
No response