Autoscaling node pools getting ignored #269
You are not supposed to change the autoscaling settings of an existing node pool, as that can only lead to issues IMO. Labels and taints are also not supported for autoscaled nodes; it's a limitation of the cluster autoscaler for Hetzner unfortunately, and there isn't much I can do about it.

If autoscaling is enabled correctly for a node pool (and you haven't changed those settings afterwards), the autoscaler should create nodes as soon as pods have been pending for a certain amount of time because of a lack of resources. If that doesn't happen, please share more details. Also, if the node count in your account reaches the limit imposed by Hetzner on your account - which is low for new accounts - new nodes won't be created and you need to request a limit increase.
I'm not changing existing settings. I'm bootstrapping a completely new environment inside a fresh Hetzner Cloud project. My account has a couple thousand cores and servers of unused quota. I'm starting
I thought that you had added/removed the lines from the config file and rerun the create command. When you expect new nodes to be created by the autoscaler, do you see pods in the Pending state? Also, what do you see in the autoscaler's logs?
Yup, when I deploy a hello-world test app with two replicas that targets the
The log shows the
I did specify the minimum number of nodes as 1 though, so I would expect the scaler to always keep a minimum of one node available, shouldn't it? I was also wondering, there's

By the way, I have another hetzner-k3s cluster where autoscaling works flawlessly, but there it's the only node pool there is. So maybe that's playing a role here?
Interesting, thanks for sharing more details. Can you try this:
So to see if there is something wrong with that node pool for some reason. With the steps above it should, in theory, move the pods to new nodes from the new autoscaled pool. I have been using autoscaling with 2 clusters since I added it and haven't come across this issue so far, so I'm not sure what's happening, also because I don't have much control over the cluster autoscaler itself. I should get more familiar with its codebase.

BTW, others have reported that the minimum number of nodes is not respected by the autoscaler, and I noticed this myself too. It only creates nodes when there are pending jobs. I will see if I can report it to their GitHub issues.
I think I found the problem. My taints and labels are missing on the autoscaled nodes. But let's go through what I tried step by step:

So, first I did steps 1 and 2 (of your list above), but the autoscaler config did not change. I don't think
When I look at the spec, I see that the autoscaler is still running with the no longer existing
Next I added the following section to the
For testing I explicitly left out any labels and taints. I re-ran the create command (steps 3 and 4). This time the autoscaler was restarted with a new configuration:
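For reference, a pool section of the kind described might look like the sketch below. This is not the reporter's actual snippet; the name, instance type, and location are placeholders, and it assumes the worker_node_pools / autoscaling layout from the hetzner-k3s config format (check the README for your version):

```yaml
worker_node_pools:
  # ...existing static pools omitted...
  - name: autoscaled-test      # placeholder name
    instance_type: cpx31       # placeholder instance type
    location: fsn1             # placeholder location
    autoscaling:
      enabled: true
      min_instances: 1
      max_instances: 3
```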
I then created an app with 200 replicas requesting lots of CPU and memory. Now when I check the autoscaler log I see the nodes scaling up:
and after a while the nodes show up on the cluster:
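A load-test workload of the kind described (many replicas with sizeable resource requests, so pods go Pending and trigger a scale-up) could be sketched roughly like this; the image, names, and request sizes are made up for illustration:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scale-test             # placeholder name
spec:
  replicas: 200
  selector:
    matchLabels:
      app: scale-test
  template:
    metadata:
      labels:
        app: scale-test
    spec:
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:          # requests large enough to leave pods Pending,
              cpu: "1"         # which is what the autoscaler reacts to
              memory: 1Gi
```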
So I think the reason it didn't work before is that the application I was deploying was targeting the label node-role.fixcloud.io=jobs, and the labels (and taints) I have in my

My use case is: I have short-running (~2h), CPU-intensive jobs that run once per day. When they run, I would like the system to add additional nodes and scale them back down once the jobs have finished after 2h or so. So I'd like the autoscaler to only add instances for those jobs, but not for any other pods that might get scheduled. My idea was to use labels and taints to achieve this.
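For that use case, the pod side would typically pin the jobs to the labelled pool with a nodeSelector and tolerate the pool's taint, for example via a CronJob along these lines. The label key/value come from the thread; the taint, schedule, image, and resource figures are assumptions for illustration:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: daily-jobs                         # placeholder name
spec:
  schedule: "0 2 * * *"                    # assumed once-per-day schedule
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          nodeSelector:
            node-role.fixcloud.io: jobs    # label mentioned in the thread
          tolerations:
            - key: node-role.fixcloud.io   # assumed taint on the jobs pool
              operator: Equal
              value: jobs
              effect: NoSchedule
          containers:
            - name: worker
              image: busybox:1.36          # placeholder image
              command: ["sh", "-c", "echo crunching && sleep 7200"]
              resources:
                requests:                  # sized to force a scale-up
                  cpu: "2"
                  memory: 2Gi
```

With the taint in place, pods without the toleration stay off the jobs nodes, and once the jobs finish the autoscaler can remove the idle nodes after its scale-down delay.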
Thanks a lot for reporting back in detail! It seems that I need to fix a bug where rerunning the create command doesn't update the autoscaler config when all autoscaled pools are removed from the config. Also, I knew that taints and tolerations aren't supported by the autoscaler yet but didn't remember. Sorry. Glad you got it sorted though :)
If combining labels/taints with autoscaling is not a valid configuration, maybe the config validation should abort with an error if someone tries to combine the two, just to make it very explicit that the given configuration is invalid. WDYT?
Yep, good idea. |
Adding labels and taints to autoscaled nodes is now supported by the autoscaler, so I am scheduling this for v2.0.1 since v2 is going to be released probably next weekend already. |
Closing in favor of #317 since the discussion moved on to labels and taints.
I'm running into an issue where, if I enable autoscaling on one of the worker pools, that pool is ignored completely by hetzner-k3s.

Using the following configuration file, if I remove the last four lines, the jobs pool is created. With those autoscaling lines in the config, however, the nodes aren't created and the label/taint code throws errors.
This is my config:
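The configuration file itself didn't survive the copy here. As a rough illustration of the shape being described (not the reporter's actual file; names, sizes, locations, and the taint are placeholders, assuming the hetzner-k3s v1 labels/taints syntax), a jobs pool with labels/taints plus the four autoscaling lines might look like this:

```yaml
worker_node_pools:
  - name: workers                      # placeholder static pool
    instance_type: cpx31
    instance_count: 3
    location: nbg1
  - name: jobs
    instance_type: cpx41               # placeholder instance type
    location: nbg1
    labels:
      - key: node-role.fixcloud.io
        value: jobs
    taints:
      - key: node-role.fixcloud.io     # placeholder taint
        value: jobs:NoSchedule
    autoscaling:                       # the "last four lines" referred to above
      enabled: true
      min_instances: 1
      max_instances: 5                 # placeholder maximum
```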
The log initially doesn't show any errors during node creation; the jobs servers are just skipped as if they didn't exist in the config. Later in the log, where it tries to set the labels and taints, it does show some errors:

If I delete the last four lines from the config, I get the jobs worker pool, but then I have to scale it manually.

Is there anything obvious that I'm doing wrong?