complete docker lockup if ceph cli are stuck (for wathever reason) #32

Open
nul0op opened this issue Nov 18, 2024 · 1 comment

nul0op commented Nov 18, 2024

Hi,

I know that a Ceph cluster, and specifically the mons, should always be up, but on some occasions they are not.
For example: when we completely shut down a 3-node Ceph cluster,
and this cluster has, for example, autostart containers (restart policy = always) AND the Ceph components are themselves Docker containers...

This leads the whole thing to a complete failure: the Docker engine cannot start because it keeps retrying to connect to the rbd volumes, and wetopi/rbd calls the ceph command-line tools, which are stuck (because Ceph is still not up).

Docker logs "Error while checking if volume 'xxxxxxx' exists in driver 'wetopi/rbd:latest'", then retries, spending many seconds there, and it does this for every volume.

If wetopi reports to Docker that it doesn't have the volume (because Ceph is down), Docker will, I guess, fail to start the container, and that's OK.

But having the wetopi calls stuck forever (again, because the ceph CLI utilities are themselves stuck forever) is a nightmare. Even a simple "docker ps" freezes...

My question: could the plugin fail gracefully? Having a timeout around the ceph tools would be fine. I see in the source code that the wrapper exists. I haven't looked deeply yet (because it's now late :-( ); I will do so tomorrow... or perhaps you already know the reason.
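
To illustrate what I mean by a timeout around the ceph tools, here is a minimal sketch in Go. It is not the plugin's actual wrapper: `runWithTimeout`, `isTimeout`, `ErrTimeout` and `demoTimeout` are hypothetical names, and `sleep 5` only stands in for a ceph/rbd command that hangs while the mons are down.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os/exec"
	"time"
)

// ErrTimeout is a hypothetical sentinel error, not part of wetopi/rbd.
var ErrTimeout = errors.New("command timed out")

// demoTimeout is an arbitrary value chosen for this example only.
const demoTimeout = 200 * time.Millisecond

// runWithTimeout runs an external command and kills it if it has not
// finished when the timeout expires.
func runWithTimeout(timeout time.Duration, name string, args ...string) ([]byte, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	// CommandContext kills the process once ctx is done.
	out, err := exec.CommandContext(ctx, name, args...).CombinedOutput()
	if ctx.Err() == context.DeadlineExceeded {
		return out, fmt.Errorf("%s: %w", name, ErrTimeout)
	}
	return out, err
}

// isTimeout reports whether err was caused by the timeout expiring.
func isTimeout(err error) bool {
	return errors.Is(err, ErrTimeout)
}

func main() {
	// "sleep 5" is a stand-in for a CLI call that never returns.
	_, err := runWithTimeout(demoTimeout, "sleep", "5")
	fmt.Println("timed out:", isTimeout(err), "error:", err)
}
```

With something like this, a stuck command would turn into an ordinary error after a bounded delay, which the plugin could then report back to Docker instead of blocking.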

Thanks

nul0op changed the title from "complete docker lookup if ceph cli are stuck (for wathever reason)" to "complete docker lockup if ceph cli are stuck (for wathever reason)" on Nov 18, 2024

nul0op commented Nov 18, 2024

It seems that List() in rbd-docker.go doesn't enforce any timeout before calling the GetRbdImages() API.
I will check the Ceph API doc to see if there is something there.
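
If the API itself offers no timeout, one possible workaround is to bound the wait on the caller's side. The sketch below is only an assumption about how that could look: `callWithTimeout`, `slowList`, `fastList`, `errListTimeout` and `listTimeout` are hypothetical, and the `func() ([]string, error)` shape is a guess, not the real signature of GetRbdImages().

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errListTimeout is a hypothetical error for this example.
var errListTimeout = errors.New("listing images timed out")

// listTimeout is an arbitrary value chosen for this example only.
const listTimeout = 100 * time.Millisecond

// callWithTimeout runs fn in a goroutine and stops waiting after timeout.
// fn itself is not cancelled: if it never returns, its goroutine stays
// blocked. Only the caller is released.
func callWithTimeout(timeout time.Duration, fn func() ([]string, error)) ([]string, error) {
	type result struct {
		images []string
		err    error
	}
	// Buffered so that a late fn can still send its result and exit.
	done := make(chan result, 1)
	go func() {
		images, err := fn()
		done <- result{images, err}
	}()

	select {
	case r := <-done:
		return r.images, r.err
	case <-time.After(timeout):
		return nil, errListTimeout
	}
}

// slowList stands in for a call that blocks while Ceph is down.
func slowList() ([]string, error) {
	time.Sleep(2 * time.Second)
	return []string{"vol1"}, nil
}

// fastList stands in for the same call on a healthy cluster.
func fastList() ([]string, error) {
	return []string{"vol1"}, nil
}

func main() {
	_, err := callWithTimeout(listTimeout, slowList)
	fmt.Println("slow:", err)
	images, err := callWithTimeout(listTimeout, fastList)
	fmt.Println("fast:", images, err)
}
```

The trade-off is that a blocked call still holds its goroutine until it returns, so this only keeps the plugin responsive; a real timeout inside the Ceph client would be cleaner if one exists.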
