Failover of 1 gateway using the ceph orch daemon stop command failed in a 4-gateway configuration; fio gets stuck and the nvme disks disappear from the client node after a while.

Details:

Before failover:
State of GW1, GW2, GW3 and GW4 before the failover (per-gateway state output omitted here).
Running IOs on all disks exposed through nvme from all 4 gateways.
IOs start executing on GW4 for the corresponding subsystem and namespace.

Failover failed:
Performed failover of GW4 using the ceph orch daemon stop command, roughly as sketched below.
ANA id 3 is now picked up by GW1 (GW1 state output omitted here).
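For reference, the IO load and the failover step amount to roughly the following; the fio target is one of the /dev/nvme13n* namespaces mentioned later in the report, while the nvmeof daemon name is an illustrative placeholder and not a value taken from this report:

# run fio against one of the NVMe-oF namespaces exposed on the client
fio --name=gw-failover-io --filename=/dev/nvme13n1 --ioengine=libaio --direct=1 \
    --rw=randwrite --bs=4k --iodepth=32 --numjobs=1 --time_based --runtime=600

# locate the nvmeof gateway daemons and stop the one running on the GW4 host
# (the daemon name is a made-up example of the cephadm naming scheme)
ceph orch ps | grep nvmeof
ceph orch daemon stop nvmeof.nvmeof_pool.node4.xyzabc

# failback later by starting the same daemon again
ceph orch daemon start nvmeof.nvmeof_pool.node4.xyzabc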
However, IOs get stuck for more than 3 hours and are not being served by GW1. Repeated per-image IO samples during this period show only a trickle of reads (see below).
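The samples below are in the format printed by rbd perf image iostat; a way to watch them live from a Ceph client (the pool name "rbd" is an assumption for illustration, not taken from this report):

# watch per-image IO on the RBD pool backing the nvmeof namespaces
rbd perf image iostat rbd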
NAME      WR   RD   WR_BYTES  RD_BYTES  WR_LAT   RD_LAT
image_10  0/s  1/s  0 B/s     54 KiB/s  0.00 ns  163.79 us
image_11  0/s  1/s  0 B/s     54 KiB/s  0.00 ns  191.56 us
image_8   0/s  1/s  0 B/s     54 KiB/s  0.00 ns  175.21 us
image_7   0/s  0/s  0 B/s     819 B/s   0.00 ns  201.62 us
image_7   0/s  1/s  0 B/s     54 KiB/s  0.00 ns  51.91 us
(all other samples show no per-image activity)
Also, after some time, the disks originally served by gateway 1 (the /dev/nvme13n* devices) and the disks it picked up after gateway 4's failure (the /dev/nvme9n* devices) both disappear from the client:
[root@ceph-mytest-578wbg-node8 ~]# nvme list
Node Generic SN Model Namespace Usage Format FW Rev
--------------------- --------------------- -------------------- ---------------------------------------- ---------- -------------------------- ---------------- --------
/dev/nvme5n3 /dev/ng5n3 2 Ceph bdev Controller 0x3 536.87 GB / 536.87 GB 512 B + 0 B 23.01.1
/dev/nvme5n2 /dev/ng5n2 2 Ceph bdev Controller 0x2 536.87 GB / 536.87 GB 512 B + 0 B 23.01.1
/dev/nvme5n1 /dev/ng5n1 2 Ceph bdev Controller 0x1 536.87 GB / 536.87 GB 512 B + 0 B 23.01.1
/dev/nvme1n3 /dev/ng1n3 1 Ceph bdev Controller 0x3 536.87 GB / 536.87 GB 512 B + 0 B 23.01.1
/dev/nvme1n2 /dev/ng1n2 1 Ceph bdev Controller 0x2 536.87 GB / 536.87 GB 512 B + 0 B 23.01.1
/dev/nvme1n1 /dev/ng1n1 1 Ceph bdev Controller 0x1 536.87 GB / 536.87 GB 512 B + 0 B 23.01.1
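When the namespaces disappear like this, the remaining NVMe-oF connections can be inspected from the client, and discovery can be re-run to reconnect to the gateways; the transport address and port below are placeholders, not values taken from this report:

# show which subsystems the client still sees and the state of each path
nvme list-subsys

# re-run discovery and reconnect to all advertised gateways
nvme connect-all --transport=tcp --traddr=10.0.0.100 --trsvcid=8009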
@caroav As discussed yesterday, I retried the test with the latest downstream build and with RHEL 9.3, and the issue is not seen.
I am able to successfully fail over and fail back.
We can close this issue for now; if I see the bug again I will open a new one.