Fail-safe scale-up when metrics plugin returns no data #622
Comments
Thanks for the suggestion and PR @erulabs! Having an APM outage does sound like a huge pain... but I'm not sure what the safe option would be in this case 🤔 Since this is (hopefully) not a common situation, I worry that it may not be immediately obvious what the consequence of using this option would be. It would also be good to have some way of distinguishing between a long-term failure vs. a blip: we may not want to fully scale on the first error. This all leads me to believe that we would need a more advanced way of configuring this, but I don't have any good ideas right now 😞 The first thing that pops into my head would be a special error_handler block at the policy level, something like this:

policy {
  check "api_needs_uppies" {
    # ...
  }

  check "api_needs_downies" {
    # ...
  }

  error_handler {
    failures_before_triggering = 10

    strategy "fixed-value" {
      value = 5
    }
  }
}

That block doesn't take any checks of its own, just a strategy to apply once the failure threshold is reached. @jrasell do you have any thoughts on this one?
@lgfa29 Interesting idea! Certainly your solution is more elegant, although would you be able to say "error_handler -> strategy -> scale-up by delta/percentage" with that pattern? Re: "We may not want to fully scale on first error": I think I still prefer the simpler fail-safe "scale-up-on-error", although I must admit it might be somewhat enterprise-y!
For a handful of policies I would agree, but my concern is for deployments that have several jobs, created by different teams, where the sum of all the scale-ups triggered at once could be quite significant.
Not right now, but I think it's a fairly simple strategy to implement 🤔 It would be similar to the fixed-value idea above, so the policy would look something like this:

policy {
  check "api_needs_uppies" {
    # ...
  }

  check "api_needs_downies" {
    # ...
  }

  error_handler {
    failures_before_triggering = 10

    strategy "relative-value" {
      delta = 1 # Add 1 new instance if all checks fail.
    }
  }
}

Depending on your evaluation interval and how long the outage lasts, this policy can give you a more controlled scale-up and maybe help with the thundering herd problem.

Would this match the approach you were thinking about?
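To make the pacing concrete, here is a rough sketch of how that hypothetical error_handler block could interact with the policy's evaluation_interval (the 30s interval is an assumed value, and neither error_handler nor relative-value exists today):

policy {
  evaluation_interval = "30s" # Assumed value, purely for illustration.

  check "api_needs_uppies" {
    # ...
  }

  # Hypothetical block, as sketched above.
  error_handler {
    # At 30s per evaluation, roughly 5 minutes of consecutive
    # failures before the first scale-up is triggered.
    failures_before_triggering = 10

    strategy "relative-value" {
      # After that, roughly +1 instance every 30s until metrics
      # return or the group's max is reached.
      delta = 1
    }
  }
}

Raising failures_before_triggering would make the policy more tolerant of short metric blips, at the cost of reacting more slowly to a real outage.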
@lgfa29 Yes, I believe the "error_handler" trigger would be exactly what we're looking for - in order to "fail open", as it were. What would it take to implement this? I'd love to help! Because we're hosted on AWS, and because the outage that triggered this investigation was Datadog's, we're also looking into https://github.com/lob/nomad-autoscaler-cloudwatch-apm
Original issue description:

You may be aware that Datadog is experiencing a large outage today. This means that nomad-autoscaler, when using Datadog as a source, is unable to collect any metrics.

There is currently an on_error option set to either ignore or fail (ignore moving on to other checks, fail stopping all actions). It seems to me that a third option is reasonable here, perhaps called scale.

The idea would be that if a check fails to get any metrics, it could be set to on_error = "scale", which would consider the check active. In that case, if Datadog goes offline or no metrics are reported, the nomad-autoscaler would trigger a scale-up and add additional instances (according to the delta configured on the check).

The end result is that if our metrics become unavailable, we fail safe and scale up towards our max.
Any interest in this feature? I will probably write this for our purposes over at @classdojo, but I suspect it would be a good mainline feature for nomad-autoscaler in general!
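Purely as a sketch of the proposal (the check name, query, and threshold numbers below are made up, and on_error currently only accepts "ignore" or "fail"):

policy {
  check "datadog_request_rate" {
    source = "datadog"
    query  = "avg:nginx.requests.per_s{service:api}" # Illustrative query.

    # Proposed third value; today only "ignore" and "fail" are accepted.
    on_error = "scale"

    strategy "threshold" {
      # Add 2 instances when requests per second exceed 800...
      lower_bound = 800
      delta       = 2
      # ...and, with the proposed on_error = "scale", apply the same
      # delta when the APM returns no data, so the group drifts toward
      # its max instead of sitting still during an outage.
    }
  }
}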