Zpool commands hang if nvme / lvm disk is disconnected (2.2.6) #16806

Open
thoro opened this issue Nov 25, 2024 · 5 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@thoro

thoro commented Nov 25, 2024

System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | Talos |
| Distribution Version | 1.8.1 |
| Kernel Version | 6.6.54-talos |
| Architecture | x86 |
| OpenZFS Version | 2.2.6 |

Describe the problem you're observing

We run each zpool on top of LVM on top of NVMe-oF (our architecture requires this layering).

When a disk is forcefully disconnected (NVMe controller deleted or disconnected), ZFS starts to exhibit what look like locking problems: zpool and zfs commands hang indefinitely.

My understanding is that ZFS should tolerate faults of the underlying device and put the pool into a suspended state. The suspension does seem to happen, but ZFS is unusable after the failure.

Describe how to reproduce the problem

  1. connect volume via nvmeof
  2. create zfs on the volume
  3. disconnect nvmeof volume
  4. zfs hangs
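
The steps above can be sketched as a command list. This is a dry run only, not the reporter's actual procedure: the transport, target address, port, and namespace device path are hypothetical parameters, and the LVM layout (volume group named after the pool, logical volume named `data`) is inferred from the `vdev=/dev/<pool>/data` paths in the logs. The function only builds and prints the commands, because running them would overwrite the namespace device.

```python
import shlex


def repro_commands(nqn, traddr, pool, transport="tcp", trsvcid="4420",
                   namespace_dev="/dev/nvme0n1"):
    """Build (but do not run) the commands for the reproduction steps.

    All arguments are placeholders; the defaults are assumptions, not
    values taken from the report.
    """
    lv_path = f"/dev/{pool}/data"  # VG named like the pool, LV named "data"
    return [
        # 1. connect the volume via NVMe-oF
        ["nvme", "connect", "-t", transport, "-a", traddr, "-s", trsvcid, "-n", nqn],
        # LVM layer between the NVMe namespace and ZFS
        ["pvcreate", namespace_dev],
        ["vgcreate", pool, namespace_dev],
        ["lvcreate", "-l", "100%FREE", "-n", "data", pool],
        # 2. create the pool on the logical volume
        ["zpool", "create", pool, lv_path],
        # 3. disconnect the NVMe-oF controller while the pool is imported
        ["nvme", "disconnect", "-n", nqn],
        # 4. this is the kind of command reported to hang
        ["zpool", "status", pool],
    ]


if __name__ == "__main__":
    for cmd in repro_commands("nqn.2024-07.example:vol", "192.0.2.10", "tank"):
        print(shlex.join(cmd))
```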

Include any warning/errors/backtraces from the system logs

In the following logs you can see that the NVMe-oF device is forcefully removed and the pool gets suspended because no device is available anymore, but all ZFS commands (zpool and zfs) also get stuck.

dc-at-prod-02-h05: kern:    info: [2024-11-25T13:26:50.298266798Z]: nvme nvme4: Removing ctrl: NQN "nqn.2024-07.at.avafin.nvme.storage.at-storage-d01.at-storage-d01-ssd8:v-6a637038-f658-4ff6-b8cf-862e8a027d5d"
 SUBSYSTEM=nvme
 DEVICE=c247:4
dc-at-prod-02-h05: kern:    info: [2024-11-25T13:26:51.368764798Z]: nvme nvme5: Removing ctrl: NQN "nqn.2024-07.at.avafin.nvme.storage.at-storage-d02.at-storage-d02-ssd7:v-9bae315b-25e9-44ed-8925-a47ce453c6f9"
 SUBSYSTEM=nvme
 DEVICE=c247:5
dc-at-prod-02-h05: kern:    info: [2024-11-25T13:26:52.511555798Z]: nvme nvme6: Removing ctrl: NQN "nqn.2024-07.at.avafin.nvme.storage.at-storage-d01.at-storage-d01-ssd7:v-53eb6c53-765a-48c9-9b6a-9501e10a70e9"
 SUBSYSTEM=nvme
 DEVICE=c247:6
dc-at-prod-02-h05: kern:    info: [2024-11-25T13:26:53.742369798Z]: nvme nvme7: Removing ctrl: NQN "nqn.2024-07.at.avafin.nvme.storage.at-storage-d02.at-storage-d02-ssd8:v-d7eafbbd-64c4-4a6e-ae6c-2919862cbfce"
 SUBSYSTEM=nvme
 DEVICE=c247:7
dc-at-prod-02-h05: kern:    info: [2024-11-25T13:28:26.214815798Z]: z_wr_iss: attempt to access beyond end of device
nvme6n1: rw=257, sector=184560366, nr_sectors = 2 limit=0
dc-at-prod-02-h05: kern:    info: [2024-11-25T13:28:26.214951798Z]: z_wr_iss: attempt to access beyond end of device
nvme6n1: rw=257, sector=2399152622, nr_sectors = 4 limit=0
dc-at-prod-02-h05: kern: warning: [2024-11-25T13:28:26.363682798Z]: zio pool=v-53eb6c53-765a-48c9-9b6a-9501e10a70e9 vdev=/dev/v-53eb6c53-765a-48c9-9b6a-9501e10a70e9/data error=5 type=2 offset=94493858816 size=1024 flags=1589376
dc-at-prod-02-h05: kern: warning: [2024-11-25T13:28:26.513641798Z]: zio pool=v-53eb6c53-765a-48c9-9b6a-9501e10a70e9 vdev=/dev/v-53eb6c53-765a-48c9-9b6a-9501e10a70e9/data error=5 type=2 offset=1228365093888 size=2048 flags=1589376
dc-at-prod-02-h05: kern:    info: [2024-11-25T13:28:26.513693798Z]: z_wr_int_6: attempt to access beyond end of device
nvme6n1: rw=257, sector=2874771074, nr_sectors = 1 limit=0
dc-at-prod-02-h05: kern:    info: [2024-11-25T13:28:26.699123798Z]: z_wr_int_0: attempt to access beyond end of device
nvme6n1: rw=257, sector=3758107118, nr_sectors = 4 limit=0
dc-at-prod-02-h05: kern: warning: [2024-11-25T13:28:26.886377798Z]: zio pool=v-53eb6c53-765a-48c9-9b6a-9501e10a70e9 vdev=/dev/v-53eb6c53-765a-48c9-9b6a-9501e10a70e9/data error=5 type=2 offset=1471881741312 size=512 flags=1589376
dc-at-prod-02-h05: kern: warning: [2024-11-25T13:28:27.038426798Z]: zio pool=v-53eb6c53-765a-48c9-9b6a-9501e10a70e9 vdev=/dev/v-53eb6c53-765a-48c9-9b6a-9501e10a70e9/data error=5 type=2 offset=1924149795840 size=2048 flags=1589376
dc-at-prod-02-h05: kern:    info: [2024-11-25T13:28:27.190504798Z]: z_wr_int_6: attempt to access beyond end of device
nvme6n1: rw=0, sector=2576, nr_sectors = 16 limit=0
dc-at-prod-02-h05: kern:    info: [2024-11-25T13:28:27.190522798Z]: z_wr_int_9: attempt to access beyond end of device
nvme6n1: rw=257, sector=536897766, nr_sectors = 1 limit=0
dc-at-prod-02-h05: kern: warning: [2024-11-25T13:28:27.190536798Z]: zio pool=v-53eb6c53-765a-48c9-9b6a-9501e10a70e9 vdev=/dev/v-53eb6c53-765a-48c9-9b6a-9501e10a70e9/data error=5 type=2 offset=274890607616 size=512 flags=1589376
dc-at-prod-02-h05: kern:    info: [2024-11-25T13:28:27.190563798Z]: z_wr_int_9: attempt to access beyond end of device
nvme6n1: rw=257, sector=654322655, nr_sectors = 23 limit=0
dc-at-prod-02-h05: kern: warning: [2024-11-25T13:28:27.190569798Z]: zio pool=v-53eb6c53-765a-48c9-9b6a-9501e10a70e9 vdev=/dev/v-53eb6c53-765a-48c9-9b6a-9501e10a70e9/data error=5 type=2 offset=335012150784 size=11776 flags=1074267264
dc-at-prod-02-h05: kern:    info: [2024-11-25T13:28:27.190580798Z]: z_wr_int_9: attempt to access beyond end of device
nvme6n1: rw=257, sector=940182226, nr_sectors = 4 limit=0
dc-at-prod-02-h05: kern: warning: [2024-11-25T13:28:27.190584798Z]: zio pool=v-53eb6c53-765a-48c9-9b6a-9501e10a70e9 vdev=/dev/v-53eb6c53-765a-48c9-9b6a-9501e10a70e9/data error=5 type=2 offset=481372251136 size=2048 flags=1589376
dc-at-prod-02-h05: kern:    info: [2024-11-25T13:28:27.190595798Z]: z_wr_int_9: attempt to access beyond end of device
nvme6n1: rw=257, sector=940616465, nr_sectors = 21 limit=0
dc-at-prod-02-h05: kern: warning: [2024-11-25T13:28:27.190599798Z]: zio pool=v-53eb6c53-765a-48c9-9b6a-9501e10a70e9 vdev=/dev/v-53eb6c53-765a-48c9-9b6a-9501e10a70e9/data error=5 type=2 offset=481594581504 size=10752 flags=1589376
dc-at-prod-02-h05: kern:    info: [2024-11-25T13:28:27.190608798Z]: z_wr_int_9: attempt to access beyond end of device
nvme6n1: rw=257, sector=941977473, nr_sectors = 2 limit=0
dc-at-prod-02-h05: kern: warning: [2024-11-25T13:28:27.190612798Z]: zio pool=v-53eb6c53-765a-48c9-9b6a-9501e10a70e9 vdev=/dev/v-53eb6c53-765a-48c9-9b6a-9501e10a70e9/data error=5 type=2 offset=482291417600 size=1024 flags=1589376
dc-at-prod-02-h05: kern: warning: [2024-11-25T13:28:27.190685798Z]: zio pool=v-53eb6c53-765a-48c9-9b6a-9501e10a70e9 vdev=/dev/v-53eb6c53-765a-48c9-9b6a-9501e10a70e9/data error=5 type=2 offset=483717777920 size=1536 flags=1074267264
dc-at-prod-02-h05: kern: warning: [2024-11-25T13:28:27.190710798Z]: zio pool=v-53eb6c53-765a-48c9-9b6a-9501e10a70e9 vdev=/dev/v-53eb6c53-765a-48c9-9b6a-9501e10a70e9/data error=5 type=2 offset=506869227008 size=10752 flags=1589376
dc-at-prod-02-h05: kern: warning: [2024-11-25T13:28:27.190722798Z]: zio pool=v-53eb6c53-765a-48c9-9b6a-9501e10a70e9 vdev=/dev/v-53eb6c53-765a-48c9-9b6a-9501e10a70e9/data error=5 type=2 offset=509395511296 size=1024 flags=1589376
dc-at-prod-02-h05: kern: warning: [2024-11-25T13:28:27.190732798Z]: zio pool=v-53eb6c53-765a-48c9-9b6a-9501e10a70e9 vdev=/dev/v-53eb6c53-765a-48c9-9b6a-9501e10a70e9/data error=5 type=2 offset=1245545234432 size=1024 flags=1589376
dc-at-prod-02-h05: kern: warning: [2024-11-25T13:28:30.153006798Z]: zio pool=v-53eb6c53-765a-48c9-9b6a-9501e10a70e9 vdev=/dev/v-53eb6c53-765a-48c9-9b6a-9501e10a70e9/data error=5 type=1 offset=270336 size=8192 flags=721089
dc-at-prod-02-h05: kern: warning: [2024-11-25T13:28:30.153030798Z]: zio pool=v-53eb6c53-765a-48c9-9b6a-9501e10a70e9 vdev=/dev/v-53eb6c53-765a-48c9-9b6a-9501e10a70e9/data error=5 type=1 offset=2199022739456 size=8192 flags=721089
dc-at-prod-02-h05: kern: warning: [2024-11-25T13:28:30.153036798Z]: zio pool=v-53eb6c53-765a-48c9-9b6a-9501e10a70e9 vdev=/dev/v-53eb6c53-765a-48c9-9b6a-9501e10a70e9/data error=5 type=1 offset=2199023001600 size=8192 flags=721089
dc-at-prod-02-h05: kern: warning: [2024-11-25T13:28:30.154242798Z]: WARNING: Pool 'v-53eb6c53-765a-48c9-9b6a-9501e10a70e9' has encountered an uncorrectable I/O failure and has been suspended.
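
For triage, the `zio` warning lines above can be tallied per pool. The following is a minimal sketch, assuming the line format shown in this log; the function name and regular expression are illustrative and not part of OpenZFS. `error=5` is `EIO` on Linux, and `type=1` / `type=2` correspond to read and write in the OpenZFS `zio_type` enum.

```python
import re
from collections import Counter

# Matches the "zio pool=... vdev=... error=..." kernel warnings shown above.
ZIO_LINE = re.compile(
    r"zio pool=(?P<pool>\S+) vdev=(?P<vdev>\S+) error=(?P<error>\d+) "
    r"type=(?P<type>\d+) offset=(?P<offset>\d+) size=(?P<size>\d+) "
    r"flags=(?P<flags>\d+)"
)

# Only the two types that appear in this log are named here.
ZIO_TYPES = {1: "read", 2: "write"}


def summarize_zio_errors(lines):
    """Count zio errors keyed by (pool, errno, I/O kind)."""
    counts = Counter()
    for line in lines:
        m = ZIO_LINE.search(line)
        if m is None:
            continue  # not a zio warning line
        kind = ZIO_TYPES.get(int(m["type"]), f"type={m['type']}")
        counts[(m["pool"], int(m["error"]), kind)] += 1
    return counts
```

Feeding it the kernel log (for example `summarize_zio_errors(open("kern.log"))`, where `kern.log` is a hypothetical file name) gives one counter entry per pool, errno, and I/O direction.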
@thoro thoro added the Type: Defect Incorrect behavior (e.g. crash, hang) label Nov 25, 2024
@amotin
Member

amotin commented Nov 25, 2024

Unfortunately, pool suspension was never safe for system well-being. We should indeed work more on it, but that is a long process, not something that can be achieved easily. For the foreseeable future I'd say just "don't do it". Plan enough redundancy that you do not routinely lose the pool's critical mass.

@thoro
Author

thoro commented Nov 25, 2024

Is there already a tracking issue for problems like this?

@stuartthebruce

FWIW, I have seen similar problems with redundant mirror devices behind the Linux nvme-rdma kernel driver being removed from a zpool, but I have not seen the problem with nvme-tcp.

@Gendra13

> is there already a tracking issue for issues like this?

There are quite a few, going back more than 10 years.
There is also a (somewhat stalled) pull request, #11082, which unfortunately is not close to being implemented.

@thoro
Author

thoro commented Nov 25, 2024

It would already be fine if a single broken pool didn't break the whole server. We import between 20 and 30 pools per server, and if one of them breaks due to an issue like this, we have to force-restart the whole server, since we can't even export the ZFS devices without the command hanging.
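
Until suspension is handled more gracefully, a monitoring or automation script can at least avoid blocking itself on a hung command. The sketch below is an illustration under assumptions, not a fix: `run_with_deadline` is a hypothetical helper, and it assumes the hung `zpool` process may be stuck inside the kernel, where it would not act on a kill signal. For that reason the helper sends the signal but never waits for the child to exit.

```python
import subprocess


def run_with_deadline(argv, timeout_s=10.0):
    """Run argv and return its stdout, or None if it misses the deadline.

    On timeout the child is signalled but deliberately not waited for,
    because a process blocked in the kernel may never exit.
    """
    proc = subprocess.Popen(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        text=True,
    )
    try:
        out, _ = proc.communicate(timeout=timeout_s)
        return out
    except subprocess.TimeoutExpired:
        proc.kill()  # best effort only; do not call proc.wait() here
        return None
```

A caller could treat `run_with_deadline(["zpool", "status", "-x"], 5.0) is None` as "ZFS is unresponsive on this host" and alert, instead of adding another stuck process per polling interval.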
