
worker-node does not get ready after reboot #11425

Open
rdxmb opened this issue Aug 7, 2024 · 2 comments
Labels
kind/bug, lifecycle/stale

Comments


rdxmb commented Aug 7, 2024

What happened?

In a Kubespray cluster with a single control plane:

When a worker node (one without any control-plane components) is rebooted, it does not become Ready again.

What did you expect to happen?

The worker node becomes Ready again after the reboot.

How can we reproduce it (as minimally and precisely as possible)?

Deploy a Kubespray cluster with a single control plane.

Reboot a worker node without draining it first.
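
A minimal sketch of the reproduction, assuming shell access to the worker node and kubectl access from the control plane:

# on the worker node (do not drain it first)
reboot

# after the node is back up, check from the control plane
kubectl get nodes    # the rebooted worker stays NotReady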

OS

Linux 6.8.0-39-generic x86_64
PRETTY_NAME="Ubuntu 24.04 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo

We run Ansible via GitLab CI with quay.io/kubespray/kubespray:v2.25.0, so the versions are:

Version of Ansible

ansible [core 2.16.7]
  config file = /builds/reddoxx/operations/provisioning/anything-on-pmc/ansible.cfg
  configured module search path = ['/builds/reddoxx/operations/provisioning/anything-on-pmc/library', '/usr/share/ansible']
  ansible python module location = /usr/local/lib/python3.10/dist-packages/ansible
  ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections
  executable location = /usr/local/bin/ansible
  python version = 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (/usr/bin/python3)
  jinja version = 3.1.4
  libyaml = True

Version of Python

python version = 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (/usr/bin/python3)

Version of Kubespray (commit)

7e0a40725 (which is v2.25.0)

Network plugin used

calico

Full inventory with variables

https://gist.github.com/rdxmb/099f6ebd3979369f059a1efdc18f0ec2

Command used to invoke ansible

ansible-playbook -i $INVENTORY /kubespray/cluster.yml

Output of ansible run

--- everything is fine here, so I am not posting the output ---

Anything else we need to know

To me this looks like a chicken-and-egg problem:

kubelet cannot connect to the apiserver via localhost:6443, where nginx-proxy-node-[n] should be listening and proxying to the kube-apiserver.

nginx-proxy-node-[n] cannot become ready because kubelet is not working correctly ...

root@node-8:~# systemctl status kubelet | tail
Aug 07 16:34:59 node-8 kubelet[917]: E0807 16:34:59.146019     917 controller.go:146] "Failed to ensure lease exists, will retry" err="Get \"https://localhost:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/node-8?timeout=10s\": dial tcp 127.0.0.1:6443: connect: connection refused" interval="7s"
Aug 07 16:34:59 node-8 kubelet[917]: I0807 16:34:59.193916     917 csi_plugin.go:913] Failed to contact API server when waiting for CSINode publishing: Get "https://localhost:6443/apis/storage.k8s.io/v1/csinodes/node-8": dial tcp 127.0.0.1:6443: connect: connection refused
Aug 07 16:34:59 node-8 kubelet[917]: I0807 16:34:59.566827     917 kubelet_node_status.go:352] "Setting node annotation to enable volume controller attach/detach"
Aug 07 16:34:59 node-8 kubelet[917]: I0807 16:34:59.568002     917 kubelet_node_status.go:669] "Recording event message for node" node="node-8" event="NodeHasSufficientMemory"
Aug 07 16:34:59 node-8 kubelet[917]: I0807 16:34:59.568047     917 kubelet_node_status.go:669] "Recording event message for node" node="node-8" event="NodeHasNoDiskPressure"
Aug 07 16:34:59 node-8 kubelet[917]: I0807 16:34:59.568062     917 kubelet_node_status.go:669] "Recording event message for node" node="node-8" event="NodeHasSufficientPID"
Aug 07 16:34:59 node-8 kubelet[917]: I0807 16:34:59.568089     917 kubelet_node_status.go:70] "Attempting to register node" node="node-8"
Aug 07 16:34:59 node-8 kubelet[917]: E0807 16:34:59.568756     917 kubelet_node_status.go:92] "Unable to register node with API server" err="Post \"https://localhost:6443/api/v1/nodes\": dial tcp 127.0.0.1:6443: connect: connection refused" node="node-8"
Aug 07 16:35:00 node-8 kubelet[917]: I0807 16:35:00.193654     917 csi_plugin.go:913] Failed to contact API server when waiting for CSINode publishing: Get "https://localhost:6443/apis/storage.k8s.io/v1/csinodes/node-8": dial tcp 127.0.0.1:6443: connect: connection refused
Aug 07 16:35:01 node-8 kubelet[917]: I0807 16:35:01.193301     917 csi_plugin.go:913] Failed to contact API server when waiting for CSINode publishing: Get "https://localhost:6443/apis/storage.k8s.io/v1/csinodes/node-8": dial tcp 127.0.0.1:6443: connect: connection refused
root@node-8:~# grep server /etc/kubernetes/kubelet.conf
    server: https://localhost:6443
root@node-8:~# crictl pods | tail
fb128f73279b0       21 hours ago        NotReady            prometheus-prometheus-0                                reddoxx-cloud-wharf         0                   (default)
c9b63478fcff3       21 hours ago        NotReady            max-map-count-setter-fjs9s                             rdx-node-bootstrap-sysctl   0                   (default)
c77504859d18e       21 hours ago        NotReady            kube-prometheus-stack-prometheus-node-exporter-g4vsp   kube-prometheus-stack       0                   (default)
772c45953bc1c       21 hours ago        NotReady            csi-rbdplugin-g4rqx                                    ceph-csi                    0                   (default)
491f1658497c3       21 hours ago        NotReady            minio-operator-7cbcd9b458-t4w6j                        minio-operator              0                   (default)
3ed2f5b5913a9       21 hours ago        NotReady            kustomize-controller-54df4985d-b2rbg                   flux-system                 0                   (default)
093aedf6b4a90       22 hours ago        NotReady            nodelocaldns-jhxcc                                     kube-system                 0                   (default)
8a8f493601caa       22 hours ago        NotReady            calico-node-dx7fj                                      kube-system                 0                   (default)
6b1ab648b8e7b       22 hours ago        NotReady            kube-proxy-vvdvk                                       kube-system                 0                   (default)
16e3dc577108c       22 hours ago        NotReady            nginx-proxy-node-8                                     kube-system                 0                   (default)
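
A hedged way to confirm the circular dependency on the broken node (the static-pod manifest path below is assumed from Kubespray defaults and may differ in other setups):

# nothing should be listening on 127.0.0.1:6443 after the reboot
ss -tlnp | grep 6443
# the nginx-proxy container exists but is not running
crictl ps -a | grep nginx-proxy
# the static pod that is supposed to provide localhost:6443 (path assumed)
cat /etc/kubernetes/manifests/nginx-proxy.yml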

There is also a backup file created by Kubespray that contains the correct server IP:

root@node-8:~# diff /etc/kubernetes/kubelet.conf /etc/kubernetes/kubelet.conf.5151.2024-08-06@18\:32\:01~ 
    server: https://localhost:6443                            |     server: https://10.139.131.91:6443

Workaround

cp /etc/kubernetes/kubelet.conf.5151.2024-08-06@18\:32\:01~ /etc/kubernetes/kubelet.conf
systemctl restart kubelet
root@node-8:~# crictl pods | grep nginx
e8e2c11236d72       52 seconds ago      Ready               nginx-proxy-node-8                                     kube-system                 1                   (default)
16e3dc577108c       22 hours ago        NotReady            nginx-proxy-node-8                                     kube-system                 0                   (default)

Now the node gets ready again. 🎉

Just some more information:

  1. On another node, I checked the timestamps of kubelet.conf and its backup:
root@node-7:~# ls -la /etc/kubernetes/kubelet.conf /etc/kubernetes/kubelet.conf.4690.2024-08-06@18\:32\:01~ 
-rw------- 1 root root 1950 Aug  6 18:32 /etc/kubernetes/kubelet.conf
-rw------- 1 root root 1954 Aug  6 18:31 /etc/kubernetes/kubelet.conf.4690.2024-08-06@18:32:01~

  2. Please do not be confused by the additional vars in the inventory; we use the same inventory to create the VMs first ...
rdxmb added the kind/bug label on Aug 7, 2024

rdxmb commented Aug 9, 2024

Setting loadbalancer_apiserver_localhost: false in group_vars fixes the problem.

After a reboot, the worker node comes back into the cluster.
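
For reference, a minimal sketch of the fix (the exact group_vars file path is assumed; any group_vars file that applies to the cluster nodes works):

# e.g. inventory/<name>/group_vars/k8s_cluster/k8s-cluster.yml (path assumed)
loadbalancer_apiserver_localhost: false

followed by re-running the same playbook so kubelet.conf points directly at the apiserver address again (as in the backup file above):

ansible-playbook -i $INVENTORY /kubespray/cluster.yml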

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Nov 7, 2024