One scaling policy gets stuck often #879
Comments
Hi @DTTerastar 👋 Thank you for the detailed report! Looking at the logs, it seems like one node is not being marked as having completed its drain.
If you look carefully you will notice that this node also never receives the drain-complete message.
A node not draining can't be a fatal event for the autoscaler, right? I'm on Windows, where things only work 99% of the time. I thought that after the drain timeout it would just forcibly stop the node and move on.
That's right. We have a drain deadline that was supposed to configure the Nomad drain with a timeout value (see `nomad-autoscaler/sdk/helper/scaleutils/node_drain.go`, lines 82 to 84 at commit `ccd5f7c`).
But for some reason the drain didn't seem to complete, or maybe it did complete and the bug may be in the Nomad SDK not unblocking callers. So that's why I'm trying to understand the details of what you've observed. Do you remember if the drain deadline was reached?
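For context, here is a minimal sketch of how a drain with a deadline is typically started and monitored through the Nomad Go API. This is an illustration under my assumptions, not the autoscaler's actual code; the function name and node ID are placeholders.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/hashicorp/nomad/api"
)

// drainNode starts a drain with a deadline and then waits for the SDK's
// drain monitor to report completion.
func drainNode(nodeID string, deadline time.Duration) error {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		return err
	}

	// The deadline tells Nomad to force-stop any remaining allocations
	// once it expires, so the drain should never run indefinitely.
	spec := &api.DrainSpec{
		Deadline:         deadline,
		IgnoreSystemJobs: false,
	}
	resp, err := client.Nodes().UpdateDrain(nodeID, spec, false, nil)
	if err != nil {
		return fmt.Errorf("failed to start drain: %w", err)
	}

	// The monitor channel is expected to close once the drain finishes
	// (or the context is canceled). If it never closes, the caller blocks
	// here forever, which is what the stack traces in this issue suggest.
	for msg := range client.Nodes().MonitorDrain(context.Background(), nodeID, resp.LastIndex, false) {
		log.Println(msg.String())
	}
	return nil
}

func main() {
	// Placeholder node ID and a 15m deadline (the value reported later in this thread).
	if err := drainNode("REPLACE-WITH-NODE-ID", 15*time.Minute); err != nil {
		log.Fatal(err)
	}
}
```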
I can confirm we are experiencing the same issue described above, and yes, the drain deadline was far surpassed. Ours is set to 15m and we've had cases where we were stuck for 18-24 hours. Debug logs in the autoscaler show it's not even evaluating the "stuck" policy when this situation occurs. Nodes in that scaling group constantly receive drain messages. We have thousands of nodes, so it's very difficult to tell which one (or more) it's getting stuck on. I can provide logs if they would be of assistance.
@lgfa29 Pinging for visibility.
Hi @gibbonsjohnm 👋 Thank you for the extra information. Yes, any logs you can provide could be useful. You can send them via email to [email protected], referencing this issue ID in the subject, if they contain sensitive data.

As I mentioned in my previous message, I suspect the issue is that the Nomad SDK is not unblocking callers when the drain is complete. More specifically, looking at the stack traces provided by @DTTerastar we can see several calls blocked in the drain monitoring code.

So another thing to look for is the status of the node that is being drained. Does it still have allocations in the running state?

One last thing: I no longer work for HashiCorp, so I won't be following this issue anymore. I think we have enough information for someone else on the team to start investigating it further.
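As a rough way to check that, a small program against the Nomad Go API can list what the cluster still thinks is running on a node. This is a hypothetical helper for illustration, not autoscaler code; the node ID is taken from the command line.

```go
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/hashicorp/nomad/api"
)

// Lists allocations placed on a node and prints any that are still in the
// "running" client status. Usage (illustrative): go run main.go <node-id>
func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	allocs, _, err := client.Nodes().Allocations(os.Args[1], nil)
	if err != nil {
		log.Fatal(err)
	}

	for _, alloc := range allocs {
		if alloc.ClientStatus == api.AllocClientStatusRunning {
			fmt.Printf("still running: %s (%s) job=%s\n", alloc.Name, alloc.ID, alloc.JobID)
		}
	}
}
```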
Yes, one of the nodes does in fact still think it has a running system allocation, even though the Docker container has been stopped and removed. I seem unable to remove it.
Even making the node eligible again, scheduling new workloads, and initiating a manual drain does not clean up the ghost allocation. |
Note that I manually terminated the machine with the ghost allocation; scale-in started after it disconnected from the cluster and I restarted the autoscaler.
Thanks for the extra info @gibbonsjohnm.
This makes me think of hashicorp/nomad#20116 🤔 But nevertheless, the Nomad SDK drain monitoring function should not block forever because drains have deadlines. Once the deadline is reached the function should return control to the caller, even if there are still (or the client thinks there are still) allocations running.
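Assuming the hang really is the monitor channel never closing, a defensive wrapper on the caller side would at least bound the wait. This is a sketch of that idea, not a proposed patch; the package name, `monitorWithDeadline`, and the grace period are made up for illustration.

```go
package drainguard

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/hashicorp/nomad/api"
)

// monitorWithDeadline consumes the drain monitor channel but gives up once
// the drain deadline (plus a grace period) has passed, so the caller can
// never block forever even if the channel is never closed.
func monitorWithDeadline(client *api.Client, nodeID string, index uint64, deadline time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), deadline+5*time.Minute)
	defer cancel()

	ch := client.Nodes().MonitorDrain(ctx, nodeID, index, false)
	for {
		select {
		case msg, ok := <-ch:
			if !ok {
				// Channel closed: the SDK considers the drain finished.
				return nil
			}
			log.Println(msg.String())
		case <-ctx.Done():
			return fmt.Errorf("gave up waiting for drain of node %s: %w", nodeID, ctx.Err())
		}
	}
}
```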
It's specifically my Windows ASG policy.
Version: v0.4.2 (I see 0.4.3 was released, will test that!)
See attached debug log:
https://gist.github.com/DTTerastar/9bf09f78ce247da5325900652ce2cc53
As well as the sigabrt:
https://gist.github.com/DTTerastar/b67c8f4289af99e5ad54bf05214d80ba
This is the config: