btrfs scrub start -r tries to write data unless mounted read-only #934

Open
m0gg opened this issue Dec 21, 2024 · 9 comments
Comments

@m0gg

m0gg commented Dec 21, 2024

This happened to me while checking a recovered md RAID in read-only mode.
System information:

# btrfs --version
btrfs-progs v6.12
-EXPERIMENTAL -INJECT -STATIC +LZO +ZSTD +UDEV +FSVERITY +ZONED CRYPTO=builtin
# uname -a
Linux <redacted> 6.12.5-gentoo-dist #1 SMP PREEMPT_DYNAMIC Sun Dec 15 03:17:02 -00 2024 x86_64 Intel(R) Xeon(R) CPU E3-1246 v3 @ 3.50GHz GenuineIntel GNU/Linux

This lsblk snippet shows the block device layers:

NAME                        MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
loop0                         7:0    0   4,5T  0 loop  
└─md127                       9:127  0  13,6T  1 raid5 
  ├─vg--archive-data--crypt 253:0    0     4T  0 lvm   
  │ └─data                  253:3    0     4T  0 crypt /run/media/system/dm-3

Note that md127 was started in read-only mode.
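To spot a read-only layer in a stack like this one, the RO column of the lsblk output can be checked programmatically. The following is a minimal sketch, assuming the default column layout shown above; the function name read_only_devices is hypothetical and the input is simply the snippet from this report.

```python
import re

# The lsblk snippet from the report, copied as text.
LSBLK_SNIPPET = """\
NAME                        MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
loop0                         7:0    0   4,5T  0 loop
└─md127                       9:127  0  13,6T  1 raid5
  ├─vg--archive-data--crypt 253:0    0     4T  0 lvm
  │ └─data                  253:3    0     4T  0 crypt /run/media/system/dm-3
"""


def read_only_devices(lsblk_text):
    """Return the names of all devices whose RO column is 1."""
    lines = lsblk_text.splitlines()
    ro_index = lines[0].split().index("RO")
    names = []
    for line in lines[1:]:
        # Drop the tree-drawing characters lsblk puts in front of child names.
        fields = re.sub(r"^[\s│├└─|`-]+", "", line).split()
        if len(fields) > ro_index and fields[ro_index] == "1":
            names.append(fields[0])
    return names


if __name__ == "__main__":
    print(read_only_devices(LSBLK_SNIPPET))
```

Here only md127 is flagged, while the LVM and crypt layers stacked on top of it still report RO as 0, which is why the filesystem could be mounted rw in the first place.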

When running btrfs scrub start -r on the filesystem on data (mounted rw), the kernel reports attempted writes to the read-only device md127 after about 10G of scrubbed data:

[174366.203678] BTRFS info (device dm-3): first mount of filesystem e18f0c40-88de-413f-9d7e-dcc8136ad6dd
[174366.203691] BTRFS info (device dm-3): using crc32c (crc32c-intel) checksum algorithm
[174366.203696] BTRFS info (device dm-3): using free-space-tree
[174441.289198] BTRFS info (device dm-3): scrub: started on devid 1
[174475.439500] Trying to write to read-only block-device md127
[174475.439546] btrfs_dev_stat_inc_and_print: 362 callbacks suppressed
[174475.439554] BTRFS error (device dm-3): bdev /dev/mapper/data errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
[174475.439588] BTRFS error (device dm-3): bdev /dev/mapper/data errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
[174475.439610] BTRFS error (device dm-3): bdev /dev/mapper/data errs: wr 3, rd 0, flush 0, corrupt 0, gen 0
[174475.439657] BTRFS error (device dm-3): bdev /dev/mapper/data errs: wr 4, rd 0, flush 0, corrupt 0, gen 0
[174475.439693] BTRFS error (device dm-3): bdev /dev/mapper/data errs: wr 5, rd 0, flush 0, corrupt 0, gen 0
[174475.439722] BTRFS error (device dm-3): bdev /dev/mapper/data errs: wr 6, rd 0, flush 0, corrupt 0, gen 0
[174475.439758] BTRFS error (device dm-3): bdev /dev/mapper/data errs: wr 7, rd 0, flush 0, corrupt 0, gen 0
[174475.439787] BTRFS error (device dm-3): bdev /dev/mapper/data errs: wr 8, rd 0, flush 0, corrupt 0, gen 0
[174475.439815] BTRFS error (device dm-3): bdev /dev/mapper/data errs: wr 9, rd 0, flush 0, corrupt 0, gen 0
[174475.439852] BTRFS error (device dm-3): bdev /dev/mapper/data errs: wr 10, rd 0, flush 0, corrupt 0, gen 0
[174475.445886] BTRFS: error (device dm-3) in btrfs_commit_transaction:2523: errno=-5 IO failure (Error while writing out transaction)
[174475.445915] BTRFS info (device dm-3 state E): forced readonly
[174475.445927] BTRFS warning (device dm-3 state E): Skipping commit of aborted transaction.
[174475.445938] BTRFS error (device dm-3 state EA): Transaction aborted (error -5)
[174475.445948] BTRFS: error (device dm-3 state EA) in cleanup_transaction:2017: errno=-5 IO failure
[174475.446157] BTRFS warning (device dm-3 state EA): failed setting block group ro: -5
[174475.446192] BTRFS info (device dm-3 state EA): scrub: not finished on devid 1 with status: -5

Everything's fine when mounted ro.
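For anyone triaging a similar log, the relevant events can be pulled out with a few regular expressions. This is only a sketch: the patterns are derived from the log lines quoted above and are not a stable kernel interface, and the function name summarize is hypothetical.

```python
import re

# Patterns derived from the log lines in this report (assumption: the
# message wording matches the kernel version shown above).
RO_WRITE = re.compile(r"Trying to write to read-only block-device (\S+)")
DEV_STATS = re.compile(
    r"bdev (\S+) errs: wr (\d+), rd (\d+), flush (\d+), corrupt (\d+), gen (\d+)"
)
COUNTERS = ("wr", "rd", "flush", "corrupt", "gen")


def summarize(log_lines):
    """Collect read-only write attempts, last device stats and forced-ro state."""
    ro_devices = set()
    last_stats = {}
    forced_readonly = False
    for line in log_lines:
        match = RO_WRITE.search(line)
        if match:
            ro_devices.add(match.group(1))
        match = DEV_STATS.search(line)
        if match:
            values = [int(v) for v in match.groups()[1:]]
            last_stats[match.group(1)] = dict(zip(COUNTERS, values))
        if "forced readonly" in line:
            forced_readonly = True
    return {
        "ro_devices": sorted(ro_devices),
        "last_stats": last_stats,
        "forced_readonly": forced_readonly,
    }


SAMPLE = [
    "[174475.439500] Trying to write to read-only block-device md127",
    "[174475.439554] BTRFS error (device dm-3): bdev /dev/mapper/data errs: wr 1, rd 0, flush 0, corrupt 0, gen 0",
    "[174475.439852] BTRFS error (device dm-3): bdev /dev/mapper/data errs: wr 10, rd 0, flush 0, corrupt 0, gen 0",
    "[174475.445915] BTRFS info (device dm-3 state E): forced readonly",
]

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

In the log above, only the wr counter increases while rd and corrupt stay at 0, which points at rejected writes and not at damaged data.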

@Forza-tng
Contributor

It is expected that Btrfs tries to write to the block devices, even when mounting ro (log replay, etc). I do not think btrfs can run on a ro block device.

@m0gg
Author

m0gg commented Dec 21, 2024

> It is expected that Btrfs tries to write to the block devices, even when mounting ro (log replay, etc). I do not think btrfs can run on a ro block device.

The man-page - btrfs-scrub(8) - about the -r flag:

run in read-only mode, do not attempt to correct
anything, can be run on a read-only filesystem

As I wrote, everything's fine when mounted ro: no complaints about writes to an ro device.

@Zygo

Zygo commented Dec 21, 2024

There are multiple agents here. The documentation could be clearer.

The scrub is read-only, i.e. errors found in blocks that are read and verified by the scrub ioctl are not corrected.

The filesystem is read-write. Errors have been found while running the scrub, so the device stats are incremented. These updates to the device stats items will be committed in the next transaction, which is what failed in the logs above.

Also, scrub reads the filesystem metadata trees in order to get device maps, extent maps, and data csums for verification. If any of these reads fail, the filesystem will attempt to correct these pages on disk by writing the correct data over the incorrect data.

If any other process reads the filesystem while the scrub is running, the other process is not affected by the -r flag on scrub. If those reads encounter correctable errors, the filesystem will attempt to correct the data and overwrite bad blocks.

Try it with the preferred metadata patches and set up data-only and metadata-only drives. You should see that scrub -r will never write to a data-only drive.

@m0gg
Author

m0gg commented Dec 21, 2024

That's what I guessed too after finding out I forgot to mount ro the first time. A process running with an ro option causing writes was still scary enough for me to report it.

> The documentation could be clearer.

I agree. While this might be a corner case, I still think it should be noted that the fs could still try to fix things on its own.

@adam900710
Collaborator

First, if scrub finds no errors, it should not trigger any writes to the fs. So even if the target block device is RO, as long as no data/metadata/superblock errors are found, scrub itself will not trigger a write.

According to your output, scrub has found no errors so far, so the write is not triggered by scrub itself.

The direct cause is that there is a transaction that needs to be committed, and we failed to commit it.

The root cause is that scrub is done on commit roots, so to avoid writing to and scrubbing the same block group at the same time, we mark the current scrub target block group as read-only.

But marking it read-only needs to start a transaction and may even force a chunk allocation, which needs to join or start a new transaction, which in turn causes new metadata to be created and written back.
That writeback triggered the error.

That's why scrub provides a read-only mode, which will not try to allocate a chunk (i.e. update the metadata) during scrub.

As for why even a RW scrub is fine if your fs is mounted RO:

That's because the function btrfs_inc_block_group_ro() used by scrub automatically avoids chunk allocation if the fs is already mounted RO. Thus even for a RW scrub, as long as no error is found, everything is fine.

So there is nothing special here and nothing related to any patchset; these are just corner cases of the scrub implementation.
The overall rules are:

  • RW scrub on RW fs
    High chance of writing to the fs, whether or not errors are found.

  • RW scrub on RO fs
    If no errors are found, it's the same as RO scrub.

  • RO scrub on RO fs
    Purely RO.

  • RO scrub on RW fs
    Scrub itself will not cause any writes.

And your report matches the first RW scrub on RW fs case, thus write is expected.

@m0gg
Author

m0gg commented Dec 21, 2024

> And your report matches the first RW scrub on RW fs case, thus write is expected.

That statement is not true. I clearly stated that I started an RO scrub on an RW fs which resides on an RO device.

Worth mentioning:
I successfully copied all of the FS contents in that setup without triggering the error. Only the scrub (or any intentional write operation) would trigger it.

Since you already closed this issue, I guess you do not deem "RO scrub may cause writes to the underlying device unless mounted RO" worthy enough to be noted?

@adam900710
Collaborator

OK, the problem is in btrfs_inc_block_group_ro(), which doesn't really honor the scrub RO, but only the fs RO flag.

Thus a RO scrub will trigger a transaction on a RW-mounted fs.

I can add an extra check to avoid this. Although on such RW mounted fs, you may hit -ENOSPC if there is not much space left.
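The difference between the behavior described here and the proposed extra check can be illustrated with a toy decision function. This is a sketch of the logic as stated in this comment only, not the kernel code: both function names and both parameters are hypothetical, and the real btrfs_inc_block_group_ro() has other inputs that are not modeled.

```python
def may_allocate_chunk_current(fs_mounted_ro, scrub_readonly):
    """Behavior as described in the comment: only the fs RO flag is honored."""
    # scrub_readonly is deliberately ignored here; that is the reported problem.
    return not fs_mounted_ro


def may_allocate_chunk_with_extra_check(fs_mounted_ro, scrub_readonly):
    """Proposed variant: a read-only scrub also suppresses chunk allocation."""
    return not fs_mounted_ro and not scrub_readonly


if __name__ == "__main__":
    for fs_ro in (False, True):
        for scrub_ro in (False, True):
            print(
                f"fs_ro={fs_ro!s:5} scrub_ro={scrub_ro!s:5} "
                f"current={may_allocate_chunk_current(fs_ro, scrub_ro)!s:5} "
                f"with_check={may_allocate_chunk_with_extra_check(fs_ro, scrub_ro)}"
            )
```

The two variants differ only for a read-only scrub on a read-write mount, which is exactly the case reported in this issue.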

@adam900710 adam900710 reopened this Dec 21, 2024
@m0gg
Author

m0gg commented Dec 21, 2024

> which doesn't really honor the scrub RO, but only the fs RO flag

This sounds unintentional and IMHO deserves to be fixed. Thank you very much!

> Although on such RW mounted fs, you may hit -ENOSPC if there is not much space left.

This seems like a very minor inconvenience.

@adam900710
Collaborator

adam900710 commented Dec 21, 2024

Unfortunately, handling RO scrub on a RW mount is not that easy in the code:

  • We have to start a transaction
    This ensures there are no conflicts between marking the block group RO and writing back the target block group.
    Thus we hold a transaction handle to prevent the current transaction from being committed until we lock the ro_block_group_mutex.

  • We still update the super blocks even if the current transaction is empty

So even if we skip the chunk allocation part, we still have an empty transaction to commit and have to update the super block.

But if we skip holding a transaction and continue, we risk a conflict that corrupts the target block group.
The best solution is to make btrfs detect an empty transaction and skip it entirely (i.e. no writes at all), but that will require quite some changes.

I'd go with a doc update for now, to warn about the modification to the fs.
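The effect of an empty transaction on a read-only device can be shown with a small toy model. This is a sketch under loud assumptions: ToyDevice, commit_transaction, try_commit and the skip_empty switch are all hypothetical names, and the model only mirrors what is described in this thread (a commit always rewrites the super block unless empty transactions are skipped).

```python
import errno


class ToyDevice:
    """Stand-in for a block device; records writes and rejects them if read-only."""

    def __init__(self, read_only):
        self.read_only = read_only
        self.writes = []

    def write(self, what):
        if self.read_only:
            raise OSError(errno.EIO, f"write of {what} rejected: device is read-only")
        self.writes.append(what)


def commit_transaction(device, dirty_items, skip_empty=False):
    """Write back dirty items, then the super block; return the number of writes."""
    if skip_empty and not dirty_items:
        # Hypothetical optimization from the comment above: no writes at all.
        return 0
    for item in dirty_items:
        device.write(item)
    # Per the comment, the super block is updated even for an empty transaction.
    device.write("superblock")
    return len(dirty_items) + 1


def try_commit(device, dirty_items, skip_empty=False):
    """Return the write count, or a negative errno if the commit failed."""
    try:
        return commit_transaction(device, dirty_items, skip_empty)
    except OSError as exc:
        return -exc.errno


if __name__ == "__main__":
    print(try_commit(ToyDevice(read_only=True), []))                   # fails with -EIO
    print(try_commit(ToyDevice(read_only=True), [], skip_empty=True))  # no writes
    print(try_commit(ToyDevice(read_only=True), ["dev_stats"], skip_empty=True))
```

In this model, skipping empty transactions helps only while nothing else is dirty. As soon as another update is pending, such as the device stats item mentioned earlier in the thread, the commit writes again and fails on the read-only device.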

adam900710 added a commit to adam900710/btrfs-progs that referenced this issue Dec 22, 2024
[BUG]
There is a bug report that read-only scrub on a read-write fs still
causes writes into the fs, and that will be caught if there is a
read-only block device among the storage stack.

This will cause a kernel warning on failed transaction commit:

 BTRFS info (device dm-3): first mount of filesystem e18f0c40-88de-413f-9d7e-dcc8136ad6dd
 BTRFS info (device dm-3): using crc32c (crc32c-intel) checksum algorithm
 BTRFS info (device dm-3): using free-space-tree
 BTRFS info (device dm-3): scrub: started on devid 1
 Trying to write to read-only block-device md127
 btrfs_dev_stat_inc_and_print: 362 callbacks suppressed
 BTRFS error (device dm-3): bdev /dev/mapper/data errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
 BTRFS error (device dm-3): bdev /dev/mapper/data errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
 BTRFS error (device dm-3): bdev /dev/mapper/data errs: wr 3, rd 0, flush 0, corrupt 0, gen 0
 BTRFS error (device dm-3): bdev /dev/mapper/data errs: wr 4, rd 0, flush 0, corrupt 0, gen 0
 BTRFS error (device dm-3): bdev /dev/mapper/data errs: wr 5, rd 0, flush 0, corrupt 0, gen 0
 BTRFS error (device dm-3): bdev /dev/mapper/data errs: wr 6, rd 0, flush 0, corrupt 0, gen 0
 BTRFS error (device dm-3): bdev /dev/mapper/data errs: wr 7, rd 0, flush 0, corrupt 0, gen 0
 BTRFS error (device dm-3): bdev /dev/mapper/data errs: wr 8, rd 0, flush 0, corrupt 0, gen 0
 BTRFS error (device dm-3): bdev /dev/mapper/data errs: wr 9, rd 0, flush 0, corrupt 0, gen 0
 BTRFS error (device dm-3): bdev /dev/mapper/data errs: wr 10, rd 0, flush 0, corrupt 0, gen 0
 BTRFS: error (device dm-3) in btrfs_commit_transaction:2523: errno=-5 IO failure (Error while writing out transaction)
 BTRFS info (device dm-3 state E): forced readonly
 BTRFS warning (device dm-3 state E): Skipping commit of aborted transaction.
 BTRFS error (device dm-3 state EA): Transaction aborted (error -5)
 BTRFS: error (device dm-3 state EA) in cleanup_transaction:2017: errno=-5 IO failure
 BTRFS warning (device dm-3 state EA): failed setting block group ro: -5
 BTRFS info (device dm-3 state EA): scrub: not finished on devid 1 with status: -5

[CAUSE]
The root cause is inside btrfs_inc_block_group_ro(), where we need to
hold a transaction handle, to prevent the transaction to be committed,
until we hold ro_block_group_mutex.

This will cause an empty transaction by itself, thus even if we can mark
the block group read-only without any extra workload, we still need to
commit the new and empty transaction.

Unfortunately this means RO scrub on RW filesystem will always cause the
fs to be updated.

[FIX]
The best fix is to make btrfs to avoid empty commit transaction, but
even with that done, read-only scrub on rw mount can still cause real
metadata updates (e.g. allocate new chunks and update device error
statistics).

It will be very complex to make read-only scrub to be fully read-only
on a read-write btrfs.

Thankfully read-only scrub on read-write mount with read-only device in
the storage stack is pretty rare, thus a documentation update should be
enough.

Issue: kdave#934
Signed-off-by: Qu Wenruo <[email protected]>