Nomad Autoscaler Idle forever with context.canceled error #711
Comments
Hi @bernardoVale 👋 What kind of policies are these? Are they application scaling policies defined in jobs? If that's the case, the error seems to be coming from Nomad itself, more specifically from here:
Have you noticed any errors in the Nomad log when this happens?
Correct, using the Nomad target plugin.
It happened again today, and no, there were no error logs in the Nomad servers/clients.
We had this recur today, and I thought to dig into the nomad-autoscaler logs to see what was happening at the time the errors started. It looked like:
So it begins to happen after the external plugin (nomad target) exits with an error. I went looking for the code that restarts the plugin on error, and couldn't find it. I did find this: https://github.com/hashicorp/nomad-autoscaler/blob/main/plugins/manager/manager.go#L202
So is it the case that nomad-autoscaler simply can't recover if an external plugin dies? The 'rpc error:' text is generic enough that I wonder whether the error message is coming from the gRPC communication between the plugin client and the plugin server, rather than between the plugin server and the Nomad API.
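To make the question above concrete, here is a minimal sketch of the restart-on-exit behaviour I'd expect, under stated assumptions. Nothing in it is the autoscaler's actual code: `targetPlugin`, `fakePlugin`, `manager`, and `launch` are hypothetical names invented for illustration, and the error text is simulated. The idea is only that when a call fails and the plugin process has exited, the manager relaunches the plugin and retries once instead of returning the same error on every evaluation.

```go
package main

import (
	"errors"
	"fmt"
)

// targetPlugin is a hypothetical stand-in for an external plugin client.
type targetPlugin interface {
	Status() (string, error)
	Exited() bool
}

// fakePlugin simulates an external plugin process that can die.
type fakePlugin struct {
	dead bool
}

func (p *fakePlugin) Status() (string, error) {
	if p.dead {
		// Simulated failure; not a real autoscaler or gRPC log line.
		return "", errors.New("simulated: plugin process exited")
	}
	return "ready", nil
}

func (p *fakePlugin) Exited() bool { return p.dead }

// manager holds one plugin instance plus a (hypothetical) launcher
// that can start a fresh instance.
type manager struct {
	launch   func() targetPlugin
	plugin   targetPlugin
	restarts int
}

func newManager(launch func() targetPlugin) *manager {
	return &manager{launch: launch, plugin: launch()}
}

// status calls the plugin. If the call fails and the plugin has exited,
// it relaunches the plugin once and retries; any other error is returned
// unchanged.
func (m *manager) status() (string, error) {
	s, err := m.plugin.Status()
	if err == nil || !m.plugin.Exited() {
		return s, err
	}
	m.plugin = m.launch()
	m.restarts++
	return m.plugin.Status()
}

func main() {
	m := newManager(func() targetPlugin { return &fakePlugin{} })

	// Simulate the external plugin process dying.
	m.plugin.(*fakePlugin).dead = true

	s, err := m.status()
	fmt.Println(s, err, m.restarts) // prints "ready <nil> 1"
}
```

Without the relaunch step, every later call in this sketch would keep returning the same error, which resembles the stuck state reported in this issue. Whether a real fix should retry inline like this or restart the plugin out of band is a design choice I'm not sure about.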
OK, I just tried this out by manually running
The logs were a fine copy of the above:
So in terms of possible solutions, I guess one of:
?
I've got a test case and first pass at a fix in #931
I couldn't find a way to reproduce this error, but sometimes the autoscaler gets stuck and returns `context canceled` to all policy evaluations:

Each eval loop, it prints one `context canceled` error per policy. Once it gets into this state, only a restart fixes it.

Here's the SIGABRT dump: