We are running a zpool on top of LVM on top of NVMe-oF (our architecture requires this).
When a disk is disconnected forcefully (nvme controller delete, disconnect), ZFS starts to exhibit locking problems: zpool and zfs commands hang indefinitely.
From my understanding, ZFS should be tolerant of faults in the underlying device and put the pool into a suspended state. This does happen, but after the failure ZFS is unusable.
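For context, the suspension behavior is governed by the failmode pool property; a minimal sketch of inspecting and changing it (the pool name "tank" is a placeholder):

```
# Default: failmode=wait suspends all I/O until the device returns.
zpool get failmode tank

# Alternative: fail new writes with EIO instead of blocking callers.
# Note this does not prevent the command hangs described in this issue.
zpool set failmode=continue tank

# After the underlying device comes back, attempt to resume a suspended pool.
zpool clear tank
```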
Describe how to reproduce the problem
connect a volume via NVMe-oF
create a zpool on the volume
disconnect the NVMe-oF volume
zpool and zfs commands hang (see the sketch below)
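A minimal reproduction sketch, assuming an NVMe/TCP target (the target address, NQN, and device name below are placeholders):

```
# Connect the remote namespace (hypothetical target address and NQN).
nvme connect -t tcp -a 192.0.2.10 -s 4420 -n nqn.2014-08.org.example:storage

# Create a pool on the new device (here assumed to appear as /dev/nvme1n1).
zpool create tank /dev/nvme1n1

# Force-remove the device out from under the pool.
nvme disconnect -n nqn.2014-08.org.example:storage

# The pool suspends; subsequent zpool/zfs commands hang.
zpool status tank
```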
Include any warning/errors/backtraces from the system logs
In the following logs you can see that the NVMe-oF device is forcefully removed and the pool gets suspended because no device is available anymore, but all ZFS commands (zpool and zfs) also get stuck.
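One way to capture such backtraces is to dump the kernel stacks of the blocked processes; a sketch (requires root, and assumes a hung `zpool` process exists):

```
# Dump the kernel stack of the oldest hung zpool process.
cat /proc/$(pgrep -xo zpool)/stack

# Or ask the kernel to log all blocked (D-state) tasks to the ring buffer.
echo w > /proc/sysrq-trigger
dmesg | tail -n 100
```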
Unfortunately, pool suspension was never safe for system well-being. We should indeed work more on it, but it is a long process, not something that can easily be achieved. Within a visible time frame I'd say just "don't do it": plan enough redundancy so that you do not routinely lose a pool's critical mass.
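Following that advice, the practical mitigation today is redundancy inside the pool, e.g. mirroring across independent NVMe-oF targets so that losing one path degrades the pool instead of suspending it; a sketch (device names are placeholders):

```
# Mirror two namespaces backed by *different* NVMe-oF targets, so a single
# controller delete leaves the pool DEGRADED but online rather than suspended.
zpool create tank mirror /dev/nvme1n1 /dev/nvme2n1
zpool status tank
```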
FWIW, I have seen similar problems when redundant mirror devices behind the Linux nvme-rdma kernel driver are removed from a zpool, but I have not seen the problem with nvme-tcp.
Is there already a tracking issue for problems like this?
There are quite a few going back more than 10 years.
There is also a (somewhat stalled) pull request, #11082, which unfortunately is far from being complete.
It would already help if a single broken pool didn't break the whole server. We import 20-30 pools per server, and if one of them breaks due to such an issue we have to force-restart the entire server, since we can't even export the ZFS pools without the commands hanging.