keep scaling when nodes are draining #672

Open
janory opened this issue Jul 19, 2023 · 4 comments

janory commented Jul 19, 2023

Hi! 👋

We recently started to use the Nomad Autoscaler agent and we really like it. 🚀
We are using the Autoscaler with the Nomad APM, aws-asg target and target-value strategy plugins.

We have multiple long-running (1–45 minute) batch jobs on our nodes, and when a scale-in action happens, the node drain won't finish until the last batch job on the node completes.

This leads to constant warning messages like this:

2023-07-18T13:17:01.646Z [TRACE] policy_manager.policy_handler: target is not ready: policy_id=4a1d5af4-323a-d939-d208-18672288565c
2023-07-18T13:17:01.646Z [WARN] internal_plugin.aws-asg: node pool status readiness check failed: error="node 872ae150-f1a2-12b1-2197-cd32a3b49546 is draining"
2023-07-18T13:17:01.642Z [TRACE] policy_manager.policy_handler: getting target status: policy_id=4a1d5af4-323a-d939-d208-18672288565c
2023-07-18T13:17:01.642Z [TRACE] policy_manager.policy_handler: tick: policy_id=4a1d5af4-323a-d939-d208-18672288565c

because the Autoscaler implicitly checks the ASG target's status on each tick (handleTick -> generateEvaluation -> Status -> IsPoolReady -> FilterNodes -> if node.Drain).
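
Roughly, that readiness path behaves like the sketch below (a simplified illustration only; the names are ours and do not match the actual autoscaler source):

// Simplified sketch of the readiness path described above; names are
// illustrative and do not match the real autoscaler code.
package main

import "fmt"

type Node struct {
	ID    string
	Drain bool
}

// filterNodes fails as soon as any node in the pool is draining, which marks
// the whole target as "not ready" for this evaluation tick.
func filterNodes(nodes []Node) ([]Node, error) {
	ready := make([]Node, 0, len(nodes))
	for _, n := range nodes {
		if n.Drain {
			return nil, fmt.Errorf("node %s is draining", n.ID)
		}
		ready = append(ready, n)
	}
	return ready, nil
}

// isPoolReady mirrors the IsPoolReady step: any error from filterNodes means
// the policy handler skips scaling for this tick and logs a warning like the
// one above.
func isPoolReady(nodes []Node) (bool, error) {
	if _, err := filterNodes(nodes); err != nil {
		return false, fmt.Errorf("node pool status readiness check failed: %w", err)
	}
	return true, nil
}

func main() {
	nodes := []Node{{ID: "872ae150", Drain: true}, {ID: "aa11bb22"}}
	ready, err := isPoolReady(nodes)
	fmt.Println(ready, err) // false, with a "node ... is draining" error
}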

Based on the comment here, and on what we are experiencing, the Autoscaler stops any further scaling actions until all draining activity has completed.

This is an issue for us because, in the worst case, the long-running batch jobs can prevent us from scaling for 45 minutes.

Would it be possible to add a config option for the idFn function to filter out draining nodes and keep scaling?

We would also like to better understand the risks of scaling a cluster that has draining nodes, and why such a cluster is considered unstable.


janory commented Jul 24, 2023

I was thinking about something like this: #679
Although this alone probably won't be enough, because even if this part passes, the processLastActivity call would set the Ready flag to false.


tgross commented Aug 2, 2023

Hi @janory!

We would also like to better understand the risks of scaling a cluster that has draining nodes, and why such a cluster is considered unstable.

I think the major challenge here is that the nodes might be draining for reasons outside the control of the autoscaler. Maybe you've run nomad node drain -enable :node_id out of band so that software on the host can be upgraded, and the plan is to return that node to work immediately afterwards. Or maybe the host is having unrecoverable problems unrelated to scale in/out, and you've drained it so that you can decommission it afterwards. Either way, the autoscaler would need to know whether or not to count that node in the total capacity.

If we do decide to ignore this check, then we need to adjust our expectations of what plugins return as node count. For example, if there are 5 instances in an ASG but 2 are draining, maybe the policy calculation should only count 3 nodes to account for either of those two situations?
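
As a rough sketch of that adjustment (a hypothetical helper, not an existing plugin API), the count would exclude draining instances:

// Hypothetical sketch: count only non-draining instances toward the pool
// size the policy calculation sees. Not an existing plugin API.
package main

import "fmt"

type Instance struct {
	ID       string
	Draining bool
}

// effectiveCount returns the number of instances that would be counted if
// draining nodes were excluded from the pool.
func effectiveCount(instances []Instance) int {
	n := 0
	for _, inst := range instances {
		if !inst.Draining {
			n++
		}
	}
	return n
}

func main() {
	// 5 instances in the ASG, 2 draining -> the policy would count 3.
	asg := []Instance{
		{ID: "i-1"}, {ID: "i-2"}, {ID: "i-3"},
		{ID: "i-4", Draining: true}, {ID: "i-5", Draining: true},
	}
	fmt.Println(effectiveCount(asg)) // 3
}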


douglaje commented Sep 21, 2023

Hi @tgross, we've run into this issue/constraint as well. After moving to AWS spot instances, which can receive interruption notices at any moment (and nearly continuously if you've got a large enough mixed cluster), our autoscaler would stop scaling for up to half an hour at a time (due to any node in the cluster being draining/initializing/other-than-ready) and we'd totally blow our SLA.

For us, not scaling quickly is a bigger sin than not scaling exactly. We don't mind underestimating capacity, so we've customized the aws-asg and nomad-apm plugins so that FilterNodes no longer errors on non-ready nodes (it excludes them instead).
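
Conceptually, the change looks something like the sketch below (illustrative only, not the actual patch to the plugins):

// Illustrative sketch: exclude draining/non-ready nodes from the pool instead
// of failing the whole readiness check. Not the actual plugin change.
package main

import "fmt"

type Node struct {
	ID    string
	Drain bool
	Ready bool
}

// filterNodesLenient drops nodes that are draining or not ready so scaling
// can proceed based on the remaining eligible nodes.
func filterNodesLenient(nodes []Node) []Node {
	eligible := make([]Node, 0, len(nodes))
	for _, n := range nodes {
		if n.Drain || !n.Ready {
			continue // excluded, but no error is returned
		}
		eligible = append(eligible, n)
	}
	return eligible
}

func main() {
	nodes := []Node{
		{ID: "a", Ready: true},
		{ID: "b", Drain: true, Ready: true},
		{ID: "c"}, // not ready
	}
	fmt.Println(len(filterNodesLenient(nodes))) // 1
}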

It might be nice to be able to provide a strictness=ignore_unstable option to the autoscaler plugins to selectively override certain cautious behaviors built into the autoscaler, but part of the problem is that this check happens in nearly every plugin (both the APM and target plugins, in our case), and my Golang experience is minimal at best.


lgfa29 commented Dec 22, 2023

Thank you for the extra input, @douglaje.

I've experimented with bypassing these checks, but I'm still unsure about their impact. The biggest blocker here is that a policy is not allowed to be evaluated in parallel, meaning that only a single scaling action is allowed to happen at a time. But if you have multiple policies targeting the same set of nodes, or if the scaling action takes so long that the evaluation times out, then this can be bypassed as well.

I've opened #811 to start some discussion around this. As I mentioned, I'm still unsure about it, so I'm at least marking these new configuration options as experimental, and we will probably not document them for now. If you would be willing to try them, we could perhaps consider merging it.

For reference, this is the policy file I used for testing. I split scaling up and down into two different policies so the actions could, in theory, happen at the same time. Another important thing about the AWS ASG target plugin is that ASG events also affect its cooldown, so you also need different values there.

scaling "cluster_up" {
  enabled = true
  min     = 1
  max     = 4

  policy {
    cooldown            = "3s"
    evaluation_interval = "10s"

    check "up" {
      source = "prometheus"
      query  = "sum(nomad_client_allocations_running)/count(nomad_client_allocations_running)"

      strategy "threshold" {
        lower_bound = 3.9
        delta       = 1
      }
    }

    target "aws-asg" {
      dry-run             = "false"
      aws_asg_name        = "hashistack-nomad_client"
      node_class          = "hashistack"
      node_drain_deadline = "10m"

      # EXPERIMENTAL.
      node_filter_ignore_drain = true
      ignore_asg_events        = true
    }
  }
}

scaling "cluster_down" {
  enabled = true
  min     = 1
  max     = 4

  policy {
    cooldown            = "10s"
    evaluation_interval = "10s"

    check "down" {
      source = "prometheus"
      query  = "sum(nomad_client_allocations_running)/count(nomad_client_allocations_running)"

      strategy "threshold" {
        upper_bound = 3.1
        delta       = -1
      }
    }

    target "aws-asg" {
      dry-run             = "false"
      aws_asg_name        = "hashistack-nomad_client"
      node_class          = "hashistack"
      node_drain_deadline = "10m"

      # EXPERIMENTAL.
      node_filter_ignore_drain = true
      ignore_asg_events        = true
    }
  }
}

@tgross tgross moved this from Triaging to Needs Roadmapping in Nomad - Community Issues Triage Aug 6, 2024
@tgross tgross added the hcc/jira label Aug 6, 2024
@tgross tgross removed their assignment Aug 6, 2024