Hi,

I know a Ceph cluster, and specifically the mons, should always be up, but on some occasions they are not.
For example: when we completely shut down a 3-node Ceph cluster.
If this cluster has, for example, autostart containers (restart policy = always) AND the Ceph components are also Docker containers, the whole thing ends in a complete failure: the Docker engine cannot start, because it keeps retrying to connect to the rbd volumes through wetopi/rbd, which calls the ceph command-line tools, and those are stuck (because Ceph is still not up).

"Error while checking if volume 'xxxxxxx' exists in driver 'wetopi/rbd:latest'" .. retrying .. and spending many seconds there ... and that for every volume ...

If wetopi reports to Docker that it doesn't have the volume (because Ceph is down), Docker will, I guess, fail to start the container, and that's OK.
But having the wetopi calls stuck forever (again, because the ceph CLI utilities are themselves stuck forever) is a nightmare. Even a simple "docker ps" freezes ...

My question: would it be possible to "fail gracefully" by putting a timeout around the ceph tools? I see in the source code that the wrapper exists. I haven't looked at it deeply yet (because it's late now :-( ); I will do so tomorrow, or perhaps you already know the reason.

Thanks
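To illustrate what I mean by a timeout around the ceph tools, here is a minimal sketch in Go. It is not taken from the wetopi/rbd source: `runWithTimeout`, `isTimeout` and `ErrCommandTimeout` are hypothetical names, and `sleep 5` only stands in for a ceph/rbd command that hangs while the mons are down. It relies on `exec.CommandContext`, which kills the child process once the context deadline expires.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os/exec"
	"time"
)

// ErrCommandTimeout is a hypothetical sentinel error, used so callers can
// tell "the tool hung" apart from "the tool ran and failed".
var ErrCommandTimeout = errors.New("command timed out")

// runWithTimeout runs an external command and kills it if it has not
// finished within the given timeout. Hypothetical helper, not plugin code.
func runWithTimeout(timeout time.Duration, name string, args ...string) ([]byte, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	// CommandContext kills the process when ctx is done.
	out, err := exec.CommandContext(ctx, name, args...).CombinedOutput()
	if ctx.Err() == context.DeadlineExceeded {
		return out, fmt.Errorf("%s: %w after %s", name, ErrCommandTimeout, timeout)
	}
	return out, err
}

// isTimeout reports whether err was caused by the deadline expiring.
func isTimeout(err error) bool {
	return errors.Is(err, ErrCommandTimeout)
}

func main() {
	// "sleep 5" simulates a CLI call that never returns in time.
	_, err := runWithTimeout(500*time.Millisecond, "sleep", "5")
	fmt.Println("timed out:", isTimeout(err))
}
```

With something like this, the plugin could answer Docker with an error after a bounded delay instead of blocking. One caveat: a process stuck in uninterruptible kernel I/O may not die on the kill signal, so this probably does not cover every way a hung cluster can block a client.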
nul0op changed the title from "complete docker lookup if ceph cli are stuck (for wathever reason)" to "complete docker lockup if ceph cli are stuck (for wathever reason)" on Nov 18, 2024
It seems that List() in rbd-docker.go doesn't enforce any timeout before calling the GetRbdImages() API.
I will check the Ceph API docs to see if there is something there.
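If the Ceph API itself offers no timeout, one option might be to bound the call from the caller's side. The sketch below is only an assumption about how that could look, not the plugin's actual code: `callWithTimeout`, `errCallTimeout` and `stuckList` are hypothetical names, and `stuckList` merely simulates a listing call such as GetRbdImages() hanging while the cluster is unreachable.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errCallTimeout is a hypothetical error returned when the wrapped call
// does not finish in time.
var errCallTimeout = errors.New("call timed out")

// result carries the return values of the wrapped call across a channel.
type result[T any] struct {
	val T
	err error
}

// callWithTimeout runs fn in a goroutine and waits at most timeout for it.
// Note: if fn never returns, its goroutine stays blocked in the background;
// this only unblocks the caller, it does not cancel the underlying call.
func callWithTimeout[T any](timeout time.Duration, fn func() (T, error)) (T, error) {
	done := make(chan result[T], 1) // buffered so a late sender never blocks
	go func() {
		v, err := fn()
		done <- result[T]{v, err}
	}()

	select {
	case r := <-done:
		return r.val, r.err
	case <-time.After(timeout):
		var zero T
		return zero, errCallTimeout
	}
}

// stuckList is a hypothetical stand-in for an image listing call that
// hangs because the monitors cannot be reached.
func stuckList() ([]string, error) {
	time.Sleep(time.Hour)
	return nil, nil
}

func main() {
	_, err := callWithTimeout(200*time.Millisecond, stuckList)
	fmt.Println(err) // prints "call timed out"
}
```

This keeps List() responsive towards Docker, at the cost of possibly leaking one blocked goroutine per timed-out call, so a real fix would preferably use a timeout or cancellation supported by the Ceph client itself, if one exists.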