Autoscaled nodes not joining cluster #466

Open
enter-marlah opened this issue Oct 15, 2024 · 16 comments

@enter-marlah

enter-marlah commented Oct 15, 2024

Hello!

We are running Hetzner-k3s version 2.0.8 with the following worker pool config:

worker_node_pools:
- name: med-static
  instance_type: cpx31
  instance_count: 3
  location: hel1
  autoscaling:
    enabled: true
    min_instances: 0
    max_instances: 6

The nodes are created in Hetzner after autoscaling is triggered by stressing the cluster, but they never join the cluster afterwards. We can SSH into the machines, but they have, for example, no SSH keys set and nothing related to k3s installed. For the static nodes the SSH keys are set correctly.

We think this has something to do with the earlier cloud-init wait problem from issue #379.

If we read the code correctly, the cloud_init_wait.sh script is not called when creating an autoscaled node?

We are running a private-network-only cluster. Regarding PR #458, our cloud-init takes several minutes on both static and autoscaled nodes.
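
For anyone debugging the same thing, a quick way to check on an affected node whether cloud-init ever ran the k3s setup is something like the following (a diagnostic sketch; the log path is the Ubuntu default, and you may need the Hetzner web console if SSH keys are missing):

cloud-init status --long                # did cloud-init finish, error out, or is it still running?
less /var/log/cloud-init-output.log     # output of the user data, including any k3s install steps
systemctl status k3s-agent              # only present if the k3s agent was actually installed
ip route; cat /etc/resolv.conf          # does the node have the default route and DNS the NAT setup should provide?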

@vitobotta
Owner

Hi, can you share your full config file (minus the token)?

@enter-marlah
Author

enter-marlah commented Oct 15, 2024

---
cluster_name: kube-prod
kubeconfig_path: "./kubeconfig"
k3s_version: v1.29.3+k3s1
include_instance_type_in_instance_name: true

networking:
  ssh:
    port: 22
    use_agent: false
    public_key_path: "./id_rsa_hetzner_prod.pub"
    private_key_path: "./id_rsa_hetzner_prod"
  allowed_networks:
    ssh:
      - 0.0.0.0/0
    api:
      - 0.0.0.0/0
  public_network:
    ipv4: false
    ipv6: false
  private_network:
    enabled : true
    subnet: 10.0.0.0/16
    existing_network_name: "KubeNet"
  cni:
    enabled: true
    encryption: false
    mode: flannel

manifests:
  cloud_controller_manager_manifest_url: "https://github.com/hetznercloud/hcloud-cloud-controller-manager/releases/download/v1.20.0/ccm-networks.yaml"
  csi_driver_manifest_url: "https://raw.githubusercontent.com/hetznercloud/csi-driver/v2.8.0/deploy/kubernetes/hcloud-csi.yml"
  system_upgrade_controller_deployment_manifest_url: "https://github.com/rancher/system-upgrade-controller/releases/download/v0.13.4/system-upgrade-controller.yaml"
  system_upgrade_controller_crd_manifest_url: "https://github.com/rancher/system-upgrade-controller/releases/download/v0.13.4/crd.yaml"
  cluster_autoscaler_manifest_url: "https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/hetzner/examples/cluster-autoscaler-run-on-master.yaml"

datastore:
  mode: etcd
  external_datastore_endpoint: postgres://....

schedule_workloads_on_masters: false

image: ubuntu-22.04

masters_pool:
  instance_type: cpx11
  instance_count: 3
  location: hel1

worker_node_pools:
- name: med-static
  instance_type: cpx31
  instance_count: 3
  location: hel1
  autoscaling:
    enabled: true
    min_instances: 0
    max_instances: 6

embedded_registry_mirror:
  enabled: true

post_create_commands:
 - echo 'network:\n  version:\ 2\n  ethernets:\n    enp7s0:\n      critical:\ true\n      nameservers:\n        addresses:\ [10.0.0.2]\n      routes:\n      - on-link:\ true\n        to:\ 0.0.0.0/0\n        via:\ 10.0.0.1' > /etc/netplan/50-cloud-init.yaml
 - sed -i 's/\\//g' /etc/netplan/50-cloud-init.yaml
 - sed -i 's/^nameserver.*/nameserver 10.0.0.2/' /etc/resolv.conf
 - netplan apply
 - apt update
 - apt upgrade -y
 - apt autoremove -y
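
(For readers: the echo + sed pair above writes a netplan file with escaped characters and then strips the backslashes, so that the private interface uses the Hetzner NAT gateway for the default route and DNS. Reconstructed from the commands, not copied from a node, /etc/netplan/50-cloud-init.yaml should end up looking roughly like this:)

network:
  version: 2
  ethernets:
    enp7s0:
      critical: true
      nameservers:
        addresses: [10.0.0.2]
      routes:
      - on-link: true
        to: 0.0.0.0/0
        via: 10.0.0.1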

@vitobotta
Owner

Hi, this has been reported a couple of times before, but I haven't had a chance to try to reproduce the problem yet. Can you share more details on how you have configured the network in Hetzner? The more detail the better, as it might help me understand where the problem lies.

@TimoGoetze

Hi, I have the same problem. Is there any known working config? My network config is the same as for the static workers, but after autoscaling and waiting a few minutes, I cannot SSH into the autoscaled worker and it does not join the cluster.

- name: medium-autoscaled
  instance_type: cpx31
  instance_count: 1
  location: hel1
  autoscaling:
    enabled: true
    min_instances: 0
    max_instances: 4
  image: debian-12
  additional_packages:
  - ifupdown
  post_create_commands:
  - ip route add default via 10.100.0.1  # Adapt this to your gateway IP
  - echo "nameserver 185.12.64.1" > /etc/resolv.conf

@vitobotta
Owner

> Hi, I have the same problem. Is there any known working config? My network config is the same as for the static workers, but after autoscaling and waiting a few minutes, I cannot SSH into the autoscaled worker and it does not join the cluster.
>
> - name: medium-autoscaled
>   instance_type: cpx31
>   instance_count: 1
>   location: hel1
>   autoscaling:
>     enabled: true
>     min_instances: 0
>     max_instances: 4
>   image: debian-12
>   additional_packages:
>   - ifupdown
>   post_create_commands:
>   - ip route add default via 10.100.0.1  # Adapt this to your gateway IP
>   - echo "nameserver 185.12.64.1" > /etc/resolv.conf

The problem is, the few reports I've come across about these issues all involve some custom commands to tweak the network settings, and that's something I haven't checked yet. So far, with the default network configuration, I haven't been able to recreate any of those problems.

@TimoGoetze

So it might have to do with the "private network only" setup? Because that's my only real difference: using a NAT routing VM for Internet access from inside the cluster.
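
(For reference, such a NAT routing VM is usually just IP forwarding plus masquerading on the gateway; a generic sketch, where the public interface name and the private subnet are assumptions:)

# On the NAT gateway VM, roughly:
sysctl -w net.ipv4.ip_forward=1
iptables -t nat -A POSTROUTING -s 10.0.0.0/16 -o eth0 -j MASQUERADE   # eth0 = public interface (assumed)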

@vitobotta
Owner

I can't be sure because I haven't had a chance to verify this, but that's my suspicion at the moment.

@saashqdev

Same here. Just trying the generic cluster_config.yaml and the following section:

...
worker_node_pools:
- name: small-static
  instance_type: cpx21
  instance_count: 2
  location: hel1
  # image: debian-11
  # labels:
  #   - key: purpose
  #     value: blah
  # taints:
  #   - key: something
  #     value: value1:NoSchedule
- name: medium-autoscaled
  instance_type: cpx31
  instance_count: 1
  location: fsn1
  autoscaling:
    enabled: true
    min_instances: 0
    max_instances: 2

"medium-autoscaled" doesn't get created.

Full config:

hetzner_token: <my token>
cluster_name: saashqcloud
kubeconfig_path: "./kubeconfig"
k3s_version: v1.30.3+k3s1

networking:
  ssh:
    port: 22
    use_agent: false # set to true if your key has a passphrase
    public_key_path: "~/.ssh/id_rsa.pub"
    private_key_path: "~/.ssh/id_rsa"
  allowed_networks:
    ssh:
      - 0.0.0.0/0
    api: # this will firewall port 6443 on the nodes
      - 0.0.0.0/0
  public_network:
    ipv4: true
    ipv6: true
  private_network:
    enabled: true
    subnet: 10.0.0.0/16
    existing_network_name: ""
  cni:
    enabled: true
    encryption: false
    mode: flannel

  # cluster_cidr: 10.244.0.0/16 # optional: a custom IPv4/IPv6 network CIDR to use for pod IPs
  # service_cidr: 10.43.0.0/16 # optional: a custom IPv4/IPv6 network CIDR to use for service IPs. Warning, if you change this, you should also change cluster_dns!
  # cluster_dns: 10.43.0.10 # optional: IPv4 Cluster IP for coredns service. Needs to be an address from the service_cidr range


# manifests:
#   cloud_controller_manager_manifest_url: "https://github.com/hetznercloud/hcloud-cloud-controller-manager/releases/download/v1.20.0/ccm-networks.yaml"
#   csi_driver_manifest_url: "https://raw.githubusercontent.com/hetznercloud/csi-driver/v2.9.0/deploy/kubernetes/hcloud-csi.yml"
#   system_upgrade_controller_deployment_manifest_url: "https://github.com/rancher/system-upgrade-controller/releases/download/v0.13.4/system-upgrade-controller.yaml"
#   system_upgrade_controller_crd_manifest_url: "https://github.com/rancher/system-upgrade-controller/releases/download/v0.13.4/crd.yaml"
#   cluster_autoscaler_manifest_url: "https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/hetzner/examples/cluster-autoscaler-run-on-master.yaml"

datastore:
  mode: etcd # etcd (default) or external
  external_datastore_endpoint: postgres://....

schedule_workloads_on_masters: false

# image: rocky-9 # optional: default is ubuntu-24.04
# autoscaling_image: 103908130 # optional, defaults to the `image` setting
# snapshot_os: microos # optional: specified the os type when using a custom snapshot

masters_pool:
  instance_type: cpx21
  instance_count: 3
  location: nbg1

worker_node_pools:
- name: small-static
  instance_type: cpx21
  instance_count: 2
  location: hel1
  # image: debian-11
  # labels:
  #   - key: purpose
  #     value: blah
  # taints:
  #   - key: something
  #     value: value1:NoSchedule
- name: medium-autoscaled
  instance_type: cpx31
  instance_count: 1
  location: fsn1
  autoscaling:
    enabled: true
    min_instances: 0
    max_instances: 2

embedded_registry_mirror:
  enabled: false # Check if your k3s version is compatible before enabling this option. You can find more information at https://docs.k3s.io/installation/registry-mirror

# additional_packages:
# - somepackage

# post_create_commands:
# - apt update
# - apt upgrade -y
# - apt autoremove -y

# kube_api_server_args:
# - arg1
# - ...
# kube_scheduler_args:
# - arg1
# - ...
# kube_controller_manager_args:
# - arg1
# - ...
# kube_cloud_controller_manager_args:
# - arg1
# - ...
# kubelet_args:
# - arg1
# - ...
# kube_proxy_args:
# - arg1
# - ...
# api_server_hostname: k8s.example.com # optional: DNS for the k8s API LoadBalancer. After the script has run, create a DNS record with the address of the API LoadBalancer.

@vitobotta
Owner

Hey @saashqdev, do you have any pods that are waiting for resources that aren't currently available in the cluster? If not, the autoscaler won't do anything.
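
(Pods stuck waiting on resources show up as Pending; one quick check, assuming kubectl is pointed at this cluster:)

kubectl get pods -A --field-selector=status.phase=Pending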

@saashqdev

Ah, no I don't. I'll take it out then. Thanks

@vitobotta
Owner

You can keep it, so it scales automatically only when actually needed :)

@saashqdev

OK, no problem. We still see the medium auto-scaled worker pool not showing up, though.

@vitobotta
Owner

So, as I was saying, the autoscaler will only add new nodes when there are pods waiting to be deployed and the cluster doesn’t have enough resources to handle them. If the cluster already has enough resources or if you’re running workloads without specifying the needed CPU and memory, the autoscaler won’t take any action.
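
As an illustration (a generic sketch, not from this cluster; the name, image, and sizes are arbitrary), a workload whose requests exceed the free capacity of the existing nodes will leave pods Pending, and that is what makes the autoscaler add nodes:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: autoscaler-test        # hypothetical test workload
spec:
  replicas: 5
  selector:
    matchLabels:
      app: autoscaler-test
  template:
    metadata:
      labels:
        app: autoscaler-test
    spec:
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
        resources:
          requests:             # the scheduler and autoscaler act on requests, not actual usage
            cpu: "1"
            memory: 1Gi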

@saashqdev

saashqdev commented Nov 28, 2024

ok, got it - I can't get over how well this tool works...

@TimoGoetze

Since my last post to this issue, I followed Vito's documentation exactly with a new cluster project. The autoscaler works with no issue, as long as you DO NOT use a private-IP-only setup.
The root cause must be somewhere there: the nodes only have private IPs on the virtual metal and need a NAT router to reach the Internet.

@vitobotta
Owner

> Since my last post to this issue, I followed Vito's documentation exactly with a new cluster project. The autoscaler works with no issue, as long as you DO NOT use a private-IP-only setup. The root cause must be somewhere there: the nodes only have private IPs on the virtual metal and need a NAT router to reach the Internet.

Thanks for confirming that. I need to find some time to test those other scenarios.
