Unexplained gap between Policy Evaluations #885
Could there be something to do with reusing the … that will be used by different policies?
I think this is a red herring. I have found more reproductions today where a policy is not evaluated again for 8 minutes (instead of 1 min 40 s), and in between there are no other scalings of other policies. The log lines are the same as provided above.
Hi @peter-lockhart-pub, thanks for raising this issue, and apologies for the delayed response. My first thought was that it's possible all the workers are busy doing other scaling work, as described within #348, but your comment "in between there are no other scalings of other policies" suggests this might not be the case. It would be useful if we could get some pprof data when you experience this issue, particularly the goroutine dump for the nomad-autoscaler application. This API page has information on the available endpoints and required configuration. This can be pasted into this ticket, sent to our atdot email, or added to any internal customer ticket you have open. I wonder whether a follow-up and slightly related feature would be to add scaling policy priorities, similar to how Nomad evaluations and job priorities work when queueing in the broker.
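For reference, capturing a goroutine dump from a running agent usually looks something like the following. This is a sketch only: it assumes the agent's HTTP API is listening on `localhost:8080` and that debug endpoints have been enabled per the agent API documentation mentioned above; the exact path and port may differ in your deployment.

```shell
# Fetch a full goroutine dump (debug=2 gives stack traces for every
# goroutine) from the agent's pprof handler and save it to a file.
# Host, port, and path are assumptions -- check the agent API docs
# for the endpoints and configuration your version exposes.
curl -s "http://localhost:8080/debug/pprof/goroutine?debug=2" -o goroutine.txt
```

The resulting `goroutine.txt` is plain text and can be attached directly to a ticket.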
Hey @jrasell, I am returning from a long holiday and so will catch up on this ASAP. To complete the loop, there is an internal ticket for this as well: 145463.
Thanks @jrasell, I have caught the reproduction, and within 10 minutes I captured some of the debug profiles and attached them to the ticket mentioned above.
Given an autoscaler using v0.4.3, with the following setup:
and 6 scaling policies following this template:
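The actual setup and policy template were not captured in this export. For illustration only, a cluster scaling policy targeting an AWS ASG typically follows the shape below; the ASG name, node class, thresholds, and intervals here are made-up placeholders, not the reporter's values.

```hcl
# Illustrative cluster scaling policy (AWS ASG target).
# All concrete values are hypothetical.
scaling "cluster_policy" {
  enabled = true
  min     = 1
  max     = 20

  policy {
    cooldown            = "2m"
    evaluation_interval = "1m"

    check "cpu_allocated_percentage" {
      source = "nomad-apm"
      query  = "percentage-allocated_cpu"

      strategy "target-value" {
        target = 70
      }
    }

    target "aws-asg" {
      dry-run             = "false"
      aws_asg_name        = "example-asg-us-east-1"
      node_class          = "example"
      node_drain_deadline = "5m"
    }
  }
}
```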
We sometimes see that policies set to evaluate every 1-2 minutes are not evaluated that often. The observable behaviour is that our graphs show the ASG should be scaling out because it is running at high CPU (as per some of the checks in the policy), but it isn't until much later (e.g. 5-15 minutes) that the autoscaler evaluates the policy and discovers it needs to scale out. It's hard to tell how often this happens, as the only time we become aware of it is when an alert fires because Nomad jobs fail to place on full Nomad nodes - so it may be happening frequently, just with fewer consequences than the rare times when our nodes fill up.
We have 3 ASGs in us-east-1, and 3 ASGs in us-west-2. So our autoscaler has 6 policies, one for each ASG.
Our policies are set to evaluate every 1-2 minutes, but sometimes we observe that they are not evaluated that frequently. Each policy directly maps to 1 ASG. After finding a past reproduction of the issue, I filtered for the problematic policy ID and observed the following logs:
Observe that it is 10 minutes again before this is re-evaluated. Also observe the big gap in time between the following 2 logs, with no other logs with that policy ID being logged in between:
Given there are more workers than policies, what else could be stopping this policy from being evaluated as frequently as it should? What causes the delay between the policy being placed in cooldown and it being queued again? I have the full log lines I can share directly if you think that would help. From a cursory glance through the many log lines, there continue to be many policy checks done for other policies, and 3 scale-ins done on other policies scattered between 14:31 and 14:41.
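Filtering the agent logs by policy ID, as described above, can be done with a plain grep. The log lines below are illustrative placeholders (the policy IDs and line format are invented for the example, not the autoscaler's actual output):

```shell
# Create a hypothetical log excerpt; policy IDs here are placeholders.
cat > autoscaler.log <<'EOF'
2024-01-01T14:31:00Z [DEBUG] policy_eval: evaluating policy: policy_id=aaaa-1111
2024-01-01T14:31:05Z [DEBUG] policy_eval: evaluating policy: policy_id=bbbb-2222
2024-01-01T14:41:00Z [DEBUG] policy_eval: evaluating policy: policy_id=aaaa-1111
EOF

# Keep only the lines for the policy under investigation, making
# gaps between its evaluations easy to spot.
grep 'policy_id=aaaa-1111' autoscaler.log
```

In this sample, the grep surfaces the 10-minute gap between the first and last line for that policy.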
Many thanks.