worker-node does not get ready after reboot #11425

rdxmb · 2024-08-07T16:49:52Z

What happened?

In a kubespray-cluster with a single control-plane:

After rebooting a worker node (one without any control-plane components), the node does not become Ready again.

What did you expect to happen?

The worker-node gets ready again after the reboot.

How can we reproduce it (as minimally and precisely as possible)?

Deploy a kubespray-cluster with a single control-plane.

Reboot a worker node without draining it first.

OS

Linux 6.8.0-39-generic x86_64
PRETTY_NAME="Ubuntu 24.04 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo

We run ansible via gitlab-ci with quay.io/kubespray/kubespray:v2.25.0, so the versions are:

Version of Ansible

ansible [core 2.16.7]
  config file = /builds/reddoxx/operations/provisioning/anything-on-pmc/ansible.cfg
  configured module search path = ['/builds/reddoxx/operations/provisioning/anything-on-pmc/library', '/usr/share/ansible']
  ansible python module location = /usr/local/lib/python3.10/dist-packages/ansible
  ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections
  executable location = /usr/local/bin/ansible
  python version = 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (/usr/bin/python3)
  jinja version = 3.1.4
  libyaml = True

Version of Python

python version = 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (/usr/bin/python3)

Version of Kubespray (commit)

7e0a40725 (which is v2.25.0)

Network plugin used

calico

Full inventory with variables

https://gist.github.com/rdxmb/099f6ebd3979369f059a1efdc18f0ec2

Command used to invoke ansible

ansible-playbook -i $INVENTORY /kubespray/cluster.yml

Output of ansible run

--- everything is OK here, so I am not posting the output ---

Anything else we need to know

To me this looks like a chicken-and-egg problem:

kubelet cannot connect to the apiserver via localhost:6443, where nginx-proxy-node-[n] should be running and routing to the kubernetes-apiserver.

nginx-proxy-node-[n] cannot become ready because kubelet is not working correctly ...
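One way to confirm this loop on an affected node is to check where kubelet.conf points and whether anything is listening there. This is only a minimal sketch, not part of kubespray; the helper names kubeconfig_server and is_local_endpoint are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers (not part of kubespray) for inspecting a kubeconfig.

# Print the API server URL(s) a kubeconfig-style file points at.
kubeconfig_server() {
    # Matches lines of the form "    server: https://host:port"
    sed -n 's/^[[:space:]]*server:[[:space:]]*//p' "$1"
}

# Succeed if the given URL targets the local host (localhost or 127.0.0.1).
is_local_endpoint() {
    case "$1" in
        https://localhost:*|https://127.0.0.1:*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example usage on the node (run as root):
#   url=$(kubeconfig_server /etc/kubernetes/kubelet.conf)
#   is_local_endpoint "$url" && ss -ltn | grep ':6443 '
```

If kubelet points at the local endpoint and the ss command prints nothing, nothing is listening on port 6443, which matches the "connection refused" errors in the kubelet log.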

root@node-8:~# systemctl status kubelet | tail
Aug 07 16:34:59 node-8 kubelet[917]: E0807 16:34:59.146019     917 controller.go:146] "Failed to ensure lease exists, will retry" err="Get \"https://localhost:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/node-8?timeout=10s\": dial tcp 127.0.0.1:6443: connect: connection refused" interval="7s"
Aug 07 16:34:59 node-8 kubelet[917]: I0807 16:34:59.193916     917 csi_plugin.go:913] Failed to contact API server when waiting for CSINode publishing: Get "https://localhost:6443/apis/storage.k8s.io/v1/csinodes/node-8": dial tcp 127.0.0.1:6443: connect: connection refused
Aug 07 16:34:59 node-8 kubelet[917]: I0807 16:34:59.566827     917 kubelet_node_status.go:352] "Setting node annotation to enable volume controller attach/detach"
Aug 07 16:34:59 node-8 kubelet[917]: I0807 16:34:59.568002     917 kubelet_node_status.go:669] "Recording event message for node" node="node-8" event="NodeHasSufficientMemory"
Aug 07 16:34:59 node-8 kubelet[917]: I0807 16:34:59.568047     917 kubelet_node_status.go:669] "Recording event message for node" node="node-8" event="NodeHasNoDiskPressure"
Aug 07 16:34:59 node-8 kubelet[917]: I0807 16:34:59.568062     917 kubelet_node_status.go:669] "Recording event message for node" node="node-8" event="NodeHasSufficientPID"
Aug 07 16:34:59 node-8 kubelet[917]: I0807 16:34:59.568089     917 kubelet_node_status.go:70] "Attempting to register node" node="node-8"
Aug 07 16:34:59 node-8 kubelet[917]: E0807 16:34:59.568756     917 kubelet_node_status.go:92] "Unable to register node with API server" err="Post \"https://localhost:6443/api/v1/nodes\": dial tcp 127.0.0.1:6443: connect: connection refused" node="node-8"
Aug 07 16:35:00 node-8 kubelet[917]: I0807 16:35:00.193654     917 csi_plugin.go:913] Failed to contact API server when waiting for CSINode publishing: Get "https://localhost:6443/apis/storage.k8s.io/v1/csinodes/node-8": dial tcp 127.0.0.1:6443: connect: connection refused
Aug 07 16:35:01 node-8 kubelet[917]: I0807 16:35:01.193301     917 csi_plugin.go:913] Failed to contact API server when waiting for CSINode publishing: Get "https://localhost:6443/apis/storage.k8s.io/v1/csinodes/node-8": dial tcp 127.0.0.1:6443: connect: connection refused

root@node-8:~# grep server /etc/kubernetes/kubelet.conf
    server: https://localhost:6443

root@node-8:~# crictl pods | tail
fb128f73279b0       21 hours ago        NotReady            prometheus-prometheus-0                                reddoxx-cloud-wharf         0                   (default)
c9b63478fcff3       21 hours ago        NotReady            max-map-count-setter-fjs9s                             rdx-node-bootstrap-sysctl   0                   (default)
c77504859d18e       21 hours ago        NotReady            kube-prometheus-stack-prometheus-node-exporter-g4vsp   kube-prometheus-stack       0                   (default)
772c45953bc1c       21 hours ago        NotReady            csi-rbdplugin-g4rqx                                    ceph-csi                    0                   (default)
491f1658497c3       21 hours ago        NotReady            minio-operator-7cbcd9b458-t4w6j                        minio-operator              0                   (default)
3ed2f5b5913a9       21 hours ago        NotReady            kustomize-controller-54df4985d-b2rbg                   flux-system                 0                   (default)
093aedf6b4a90       22 hours ago        NotReady            nodelocaldns-jhxcc                                     kube-system                 0                   (default)
8a8f493601caa       22 hours ago        NotReady            calico-node-dx7fj                                      kube-system                 0                   (default)
6b1ab648b8e7b       22 hours ago        NotReady            kube-proxy-vvdvk                                       kube-system                 0                   (default)
16e3dc577108c       22 hours ago        NotReady            nginx-proxy-node-8                                     kube-system                 0                   (default)

There is also a backup file created by kubespray that contains the correct server IP:

root@node-8:~# diff /etc/kubernetes/kubelet.conf /etc/kubernetes/kubelet.conf.5151.2024-08-06@18\:32\:01~ 
    server: https://localhost:6443                            |     server: https://10.139.131.91:6443

Workaround

cp /etc/kubernetes/kubelet.conf.5151.2024-08-06@18\:32\:01~ /etc/kubernetes/kubelet.conf

systemctl restart kubelet

root@node-8:~# crictl pods | grep nginx
e8e2c11236d72       52 seconds ago      Ready               nginx-proxy-node-8                                     kube-system                 1                   (default)
16e3dc577108c       22 hours ago        NotReady            nginx-proxy-node-8                                     kube-system                 0                   (default)

Now the node becomes Ready again. 🎉
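For anyone repeating the workaround on several nodes, the copy step could be wrapped roughly as below. This is a sketch under assumptions: the function names are hypothetical, the backup naming (kubelet.conf.<number>.<timestamp>~) is taken from the files shown above, and the newest backup is not guaranteed to contain the right server address, so check it before restoring.

```shell
#!/bin/sh
# Hypothetical helpers for restoring a backup copy of a config file.

# Print the most recently modified backup of a file, or fail if none exists.
# Backups are expected to look like kubelet.conf.5151.2024-08-06@18:32:01~
latest_backup() {
    newest=$(ls -t "$1".*~ 2>/dev/null | head -n 1)
    [ -n "$newest" ] && printf '%s\n' "$newest"
}

# Keep a copy of the current file, then replace it with the newest backup.
restore_latest_backup() {
    conf=$1
    backup=$(latest_backup "$conf") || return 1
    cp -p "$conf" "$conf.before-restore" &&
    cp -p "$backup" "$conf"
}

# Example usage on the node (run as root):
#   latest_backup /etc/kubernetes/kubelet.conf        # inspect this file first
#   restore_latest_backup /etc/kubernetes/kubelet.conf && systemctl restart kubelet
```

The current file is saved as kubelet.conf.before-restore so the change can be undone. A later kubespray run may rewrite kubelet.conf again.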

Just some more information:

On another node, I checked the timestamps of kubelet.conf and its backup:

root@node-7:~# ls -la /etc/kubernetes/kubelet.conf /etc/kubernetes/kubelet.conf.4690.2024-08-06@18\:32\:01~ 
-rw------- 1 root root 1950 Aug  6 18:32 /etc/kubernetes/kubelet.conf
-rw------- 1 root root 1954 Aug  6 18:31 /etc/kubernetes/kubelet.conf.4690.2024-08-06@18:32:01~

Please do not be confused by the additional vars in the inventory; we use the same inventory to create the VMs first ...


rdxmb · 2024-08-09T20:48:57Z

Setting loadbalancer_apiserver_localhost: false in group_vars fixes the problem.

After a reboot, the worker node comes back into the cluster.
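For reference, the setting could be applied to a group_vars file idempotently along these lines. This is a sketch: the function name is hypothetical, and the file path in the usage comment is a placeholder, since the right file depends on the inventory layout.

```shell
#!/bin/sh
# Hypothetical helper: make sure a group_vars YAML file contains
#   loadbalancer_apiserver_localhost: false
# as a top-level key. An existing uncommented entry is rewritten in place
# (sed leaves a .bak copy); otherwise the line is appended.
set_lb_localhost_false() {
    file=$1
    if grep -q '^loadbalancer_apiserver_localhost:' "$file" 2>/dev/null; then
        sed -i.bak \
            's/^loadbalancer_apiserver_localhost:.*/loadbalancer_apiserver_localhost: false/' \
            "$file"
    else
        printf 'loadbalancer_apiserver_localhost: false\n' >> "$file"
    fi
}

# Example usage (the group_vars path is a placeholder, adjust to your inventory):
#   set_lb_localhost_false path/to/group_vars/all/all.yml
#   ansible-playbook -i $INVENTORY /kubespray/cluster.yml
```

The playbook has to be re-run afterwards so that kubelet.conf on the nodes is regenerated with the new endpoint.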

k8s-triage-robot · 2024-11-07T21:08:57Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

rdxmb added the kind/bug label on Aug 7, 2024

k8s-ci-robot added the lifecycle/stale label on Nov 7, 2024
