Microk8s v1.29 snap installation failed on plain Debian 12.4 #4361

Closed
TecIntelli opened this issue Jan 10, 2024 · 43 comments
Labels
kind/bug Something isn't working

Comments

@TecIntelli

Summary

Over the last few days I noticed that installing MicroK8s v1.29/stable (6364) fails on a fresh (plain) Debian 12.4 system (tested on AWS EC2 with the default Debian 12 image provided by AWS). After a few tests I can summarize the behavior as follows:

admin@ip-172-31-16-112:~$ microk8s status
microk8s is not running. Use microk8s inspect for a deeper inspection.
admin@ip-172-31-16-112:~$ microk8s inspect
Inspecting system
Inspecting Certificates
Inspecting services
  Service snap.microk8s.daemon-cluster-agent is running
  Service snap.microk8s.daemon-containerd is running
  Service snap.microk8s.daemon-kubelite is running
  Service snap.microk8s.daemon-k8s-dqlite is running
  Service snap.microk8s.daemon-apiserver-kicker is running
  Copy service arguments to the final report tarball
Inspecting AppArmor configuration
Gathering system information
  Copy processes list to the final report tarball
  Copy disk usage information to the final report tarball
  Copy memory usage information to the final report tarball
  Copy server uptime to the final report tarball
  Copy openSSL information to the final report tarball
  Copy snap list to the final report tarball
  Copy VM name (or none) to the final report tarball
  Copy current linux distribution to the final report tarball
  Copy asnycio usage and limits to the final report tarball
  Copy inotify max_user_instances and max_user_watches to the final report tarball
  Copy network configuration to the final report tarball
Inspecting kubernetes cluster
  Inspect kubernetes cluster
sudo: unable to resolve host ip-172-31-16-112: Name or service not known
sudo: unable to resolve host ip-172-31-16-112: Name or service not known
sudo: unable to resolve host ip-172-31-16-112: Name or service not known
sudo: unable to resolve host ip-172-31-16-112: Name or service not known
sudo: unable to resolve host ip-172-31-16-112: Name or service not known
sudo: unable to resolve host ip-172-31-16-112: Name or service not known
Inspecting dqlite
  Inspect dqlite
cp: cannot stat '/var/snap/microk8s/6364/var/kubernetes/backend/localnode.yaml': No such file or directory

Building the report tarball
  Report tarball is at /var/snap/microk8s/6364/inspection-report-20240110_102300.tar.gz

microk8s_1.29_6364-inspection-report-20240110_102300.tar.gz

  • Refreshing the v1.28 (6089) instance to v1.29 (6364) works at first glance, but the inspect output does not look right:
admin@ip-172-31-18-155:~$ microk8s kubectl get all -A
NAMESPACE     NAME                                         READY   STATUS    RESTARTS   AGE
kube-system   pod/coredns-864597b5fd-k7hvt                 1/1     Running   0          2m29s
kube-system   pod/calico-kube-controllers-77bd7c5b-fp4zd   1/1     Running   0          2m29s

NAMESPACE     NAME                 TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                  AGE
default       service/kubernetes   ClusterIP   10.152.183.1    <none>        443/TCP                  2m35s
kube-system   service/kube-dns     ClusterIP   10.152.183.10   <none>        53/UDP,53/TCP,9153/TCP   2m32s

NAMESPACE     NAME                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
kube-system   daemonset.apps/calico-node   1         1         1       1            1           kubernetes.io/os=linux   2m34s

NAMESPACE     NAME                                      READY   UP-TO-DATE   AVAILABLE   AGE
kube-system   deployment.apps/coredns                   1/1     1            1           2m32s
kube-system   deployment.apps/calico-kube-controllers   1/1     1            1           2m34s

NAMESPACE     NAME                                               DESIRED   CURRENT   READY   AGE
kube-system   replicaset.apps/coredns-864597b5fd                 1         1         1       2m29s
kube-system   replicaset.apps/calico-kube-controllers-77bd7c5b   1         1         1       2m29s

admin@ip-172-31-18-155:~$ microk8s inspect
Inspecting system
Inspecting Certificates
Inspecting services
  Service snap.microk8s.daemon-cluster-agent is running
  Service snap.microk8s.daemon-containerd is running
  Service snap.microk8s.daemon-kubelite is running
  Service snap.microk8s.daemon-k8s-dqlite is running
  Service snap.microk8s.daemon-apiserver-kicker is running
  Copy service arguments to the final report tarball
Inspecting AppArmor configuration
Gathering system information
  Copy processes list to the final report tarball
  Copy disk usage information to the final report tarball
  Copy memory usage information to the final report tarball
  Copy server uptime to the final report tarball
  Copy openSSL information to the final report tarball
  Copy snap list to the final report tarball
  Copy VM name (or none) to the final report tarball
  Copy current linux distribution to the final report tarball
  Copy asnycio usage and limits to the final report tarball
  Copy inotify max_user_instances and max_user_watches to the final report tarball
  Copy network configuration to the final report tarball
Inspecting kubernetes cluster
  Inspect kubernetes cluster
sudo: unable to resolve host ip-172-31-18-155: Name or service not known
sudo: unable to resolve host ip-172-31-18-155: Name or service not known
sudo: unable to resolve host ip-172-31-18-155: Name or service not known
sudo: unable to resolve host ip-172-31-18-155: Name or service not known
sudo: unable to resolve host ip-172-31-18-155: Name or service not known
sudo: unable to resolve host ip-172-31-18-155: Name or service not known
Inspecting dqlite
  Inspect dqlite

Building the report tarball
  Report tarball is at /var/snap/microk8s/6364/inspection-report-20240110_103926.tar.gz

microk8s-1.28_6089-refreshed-1.29_6364-inspection-report-20240110_103926.tar.gz

  • The strangest thing: after I removed the MicroK8s package via sudo snap remove --purge microk8s and installed v1.29 (6364) again, the (single-node) cluster seemed to work as expected, but the inspect output still does not look right:
admin@ip-172-31-18-155:~$ microk8s kubectl get all -A
NAMESPACE     NAME                                         READY   STATUS    RESTARTS   AGE
kube-system   pod/calico-node-bggsw                        1/1     Running   0          106s
kube-system   pod/coredns-864597b5fd-wzdz9                 1/1     Running   0          105s
kube-system   pod/calico-kube-controllers-77bd7c5b-vlk94   1/1     Running   0          105s

NAMESPACE     NAME                 TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                  AGE
default       service/kubernetes   ClusterIP   10.152.183.1    <none>        443/TCP                  111s
kube-system   service/kube-dns     ClusterIP   10.152.183.10   <none>        53/UDP,53/TCP,9153/TCP   109s

NAMESPACE     NAME                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
kube-system   daemonset.apps/calico-node   1         1         1       1            1           kubernetes.io/os=linux   111s

NAMESPACE     NAME                                      READY   UP-TO-DATE   AVAILABLE   AGE
kube-system   deployment.apps/coredns                   1/1     1            1           110s
kube-system   deployment.apps/calico-kube-controllers   1/1     1            1           111s

NAMESPACE     NAME                                               DESIRED   CURRENT   READY   AGE
kube-system   replicaset.apps/coredns-864597b5fd                 1         1         1       106s
kube-system   replicaset.apps/calico-kube-controllers-77bd7c5b   1         1         1       106s

admin@ip-172-31-18-155:~$ microk8s inspect
Inspecting system
Inspecting Certificates
Inspecting services
  Service snap.microk8s.daemon-cluster-agent is running
  Service snap.microk8s.daemon-containerd is running
  Service snap.microk8s.daemon-kubelite is running
  Service snap.microk8s.daemon-k8s-dqlite is running
  Service snap.microk8s.daemon-apiserver-kicker is running
  Copy service arguments to the final report tarball
Inspecting AppArmor configuration
Gathering system information
  Copy processes list to the final report tarball
  Copy disk usage information to the final report tarball
  Copy memory usage information to the final report tarball
  Copy server uptime to the final report tarball
  Copy openSSL information to the final report tarball
  Copy snap list to the final report tarball
  Copy VM name (or none) to the final report tarball
  Copy current linux distribution to the final report tarball
  Copy asnycio usage and limits to the final report tarball
  Copy inotify max_user_instances and max_user_watches to the final report tarball
  Copy network configuration to the final report tarball
Inspecting kubernetes cluster
  Inspect kubernetes cluster
sudo: unable to resolve host ip-172-31-18-155: Name or service not known
sudo: unable to resolve host ip-172-31-18-155: Name or service not known
sudo: unable to resolve host ip-172-31-18-155: Name or service not known
sudo: unable to resolve host ip-172-31-18-155: Name or service not known
sudo: unable to resolve host ip-172-31-18-155: Name or service not known
sudo: unable to resolve host ip-172-31-18-155: Name or service not known
Inspecting dqlite
  Inspect dqlite
cp: cannot stat '/var/snap/microk8s/6364/var/kubernetes/backend/localnode.yaml': No such file or directory

Building the report tarball
  Report tarball is at /var/snap/microk8s/6364/inspection-report-20240110_104641.tar.gz

microk8s-reinstall-1.29_6364-inspection-report-20240110_104641.tar.gz.tar.gz

What Should Happen Instead?

I hope someone on the development team can find the reason for this behavior. My guess is that the v1.28 installation sets up something on the host system that the v1.29 installation fails to set up, and that snap remove --purge does not remove it.

Reproduction Steps

Explained above (incl. inspection tarballs).

If there are any points left, I will try to answer your questions.
Thanks!

@odoo-sh

odoo-sh commented Jan 10, 2024

In my case:
snap remove --purge microk8s
snap install microk8s --classic --channel=1.29/stable

root@microk8s-master:~# microk8s status
microk8s is not running. Use microk8s inspect for a deeper inspection.

The inspect log looks the same as your first inspect output.

I don't know what's going on.

@TecIntelli
Author

@odoo-sh thanks for your fast feedback.

Do you really mean my first inspect output, which represents a v1.28 installation without errors? Or do you mean my last output after a re-installation (microk8s-reinstall-1.29_6364-inspection-report-20240110_104641.tar.gz.tar.gz)?

Sorry, just to clarify.

@odoo-sh

odoo-sh commented Jan 11, 2024

root@microk8s-master:~# microk8s start
root@microk8s-master:~# microk8s status
microk8s is not running. Use microk8s inspect for a deeper inspection.
root@microk8s-master:~# microk8s inspect
Inspecting system
Inspecting Certificates
Inspecting services
  Service snap.microk8s.daemon-cluster-agent is running
  Service snap.microk8s.daemon-containerd is running
  Service snap.microk8s.daemon-kubelite is running
  Service snap.microk8s.daemon-k8s-dqlite is running
  Service snap.microk8s.daemon-apiserver-kicker is running
  Copy service arguments to the final report tarball
Inspecting AppArmor configuration
Gathering system information
  Copy processes list to the final report tarball
  Copy disk usage information to the final report tarball
  Copy memory usage information to the final report tarball
  Copy server uptime to the final report tarball
  Copy openSSL information to the final report tarball
  Copy snap list to the final report tarball
  Copy VM name (or none) to the final report tarball
  Copy current linux distribution to the final report tarball
  Copy asnycio usage and limits to the final report tarball
  Copy inotify max_user_instances and max_user_watches to the final report tarball
  Copy network configuration to the final report tarball
Inspecting kubernetes cluster
  Inspect kubernetes cluster
Inspecting dqlite
  Inspect dqlite
cp: cannot stat '/var/snap/microk8s/6357/var/kubernetes/backend/localnode.yaml': No such file or directory

Building the report tarball
  Report tarball is at /var/snap/microk8s/6357/inspection-report-20240111_061543.tar.gz

@KlockiLego

I have the same problem on Ubuntu Server 22.04.
snap install microk8s --classic --channel=1.29/stable

The file localnode.yaml does not exist.

microk8s inspect

Inspecting dqlite
  Inspect dqlite
cp: cannot stat '/var/snap/microk8s/6364/var/kubernetes/backend/localnode.yaml': No such file or directory

@itsyoshio

itsyoshio commented Jan 18, 2024

Same problem here as well. I'm also wondering whether those of you who got 1.29 running (e.g. by upgrading from 1.28) also find kubectl port-forward unusable? It's extremely slow for me.

@Zvirovyi

Same issue on Ubuntu Desktop 23.10:

Inspecting dqlite
  Inspect dqlite
cp: cannot stat '/var/snap/microk8s/6370/var/kubernetes/backend/localnode.yaml': No such file or directory

@neoaggelos
Contributor

Hi @TecIntelli and other folks who are running into this, sorry for taking so long to check this.

This seems to be related to cgroups. I see the following in the error logs (and I can also reproduce it on Debian 12 systems):

Jan 10 10:16:39 ip-172-31-16-112 microk8s.daemon-kubelite[8441]: E0110 10:16:39.649969    8441 kubelet.go:1542] "Failed to start ContainerManager" err="failed to initialize top level QOS containers: root container [kubepods] doesn't exist"
Jan 10 10:16:39 ip-172-31-16-112 systemd[1]: snap.microk8s.daemon-kubelite.service: Main process exited, code=exited, status=1/FAILURE
Jan 10 10:16:39 ip-172-31-16-112 systemd[1]: snap.microk8s.daemon-kubelite.service: Failed with result 'exit-code'.
Jan 10 10:16:39 ip-172-31-16-112 systemd[1]: snap.microk8s.daemon-kubelite.service: Consumed 6.137s CPU time.
Jan 10 10:16:39 ip-172-31-16-112 systemd[1]: snap.microk8s.daemon-kubelite.service: Scheduled restart job, restart counter is at 1.
Jan 10 10:16:39 ip-172-31-16-112 systemd[1]: Stopped snap.microk8s.daemon-kubelite.service - Service for snap application microk8s.daemon-kubelite.
Jan 10 10:16:39 ip-172-31-16-112 systemd[1]: snap.microk8s.daemon-kubelite.service: Consumed 6.137s CPU time.
Jan 10 10:16:39 ip-172-31-16-112 systemd[1]: Started snap.microk8s.daemon-kubelite.service - Service for snap application microk8s.daemon-kubelite.

One workaround is to turn off per-QoS cgroups and node-allocatable enforcement on the kubelet with:

echo '
--cgroups-per-qos=false
--enforce-node-allocatable=""
' | sudo tee -a /var/snap/microk8s/current/args/kubelet

sudo snap restart microk8s.daemon-kubelite

Afterwards, MicroK8s should come up. We will take this back to find the root cause and to see what mitigations we could apply to prevent this in out-of-the-box deployments.
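
For anyone scripting this workaround: the tee -a command above adds the two flags again every time it is run. A minimal sketch of an idempotent variant follows, assuming a POSIX shell; append_arg_once is a hypothetical helper name and not part of MicroK8s. The flags and the args path are the ones from the comment above.

```shell
#!/bin/sh
# append_arg_once FILE ARG
# Hypothetical helper (not part of MicroK8s): append ARG to FILE as its own
# line, but only if FILE does not already contain exactly that line.
append_arg_once() {
    file=$1
    arg=$2
    # -x: whole-line match, -F: fixed string, -q: quiet.
    # "--" ends option parsing, because ARG itself starts with "--".
    if grep -qxF -- "$arg" "$file" 2>/dev/null; then
        return 0
    fi
    # If the file is non-empty and does not end with a newline, add one first
    # so the new flag does not get glued onto the last existing line.
    if [ -s "$file" ] && [ -n "$(tail -c 1 "$file")" ]; then
        printf '\n' >> "$file"
    fi
    printf '%s\n' "$arg" >> "$file"
}

# Example usage (run as root; path and flags taken from the comment above):
#   ARGS=/var/snap/microk8s/current/args/kubelet
#   append_arg_once "$ARGS" '--cgroups-per-qos=false'
#   append_arg_once "$ARGS" '--enforce-node-allocatable=""'
#   snap restart microk8s.daemon-kubelite
```

Running the example twice should leave the args file unchanged after the first run, which makes it safer to use from provisioning scripts.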

@neoaggelos
Contributor

To add some more details, this is what I'm seeing on a Debian 12 instance where I can reproduce the issue:

root@test-debian:/sys/fs/cgroup/kubepods# mount -t cgroup2
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)

root@test-debian:/sys/fs/cgroup/kubepods# cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory hugetlb pids rdma misc
root@test-debian:/sys/fs/cgroup/kubepods# cat /sys/fs/cgroup/kubepods/cgroup.controllers
cpu io memory hugetlb pids rdma misc

root@test-debian:/sys/fs/cgroup/kubepods# echo '+cpuset' > cgroup.subtree_control
bash: echo: write error: No such file or directory
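
To compare other machines against the output above, here is a small sketch that checks whether a controller is listed in a cgroup v2 cgroup.controllers file. has_controller is a hypothetical helper name. The two paths in the usage comment are the ones shown above, and the kubepods path only exists once the kubelet has created that cgroup.

```shell
#!/bin/sh
# has_controller FILE NAME
# Hypothetical helper: succeed (exit 0) if NAME appears as a separate word in
# a cgroup v2 controllers file, which holds one space-separated line such as
# "cpu io memory pids".
has_controller() {
    file=$1
    name=$2
    for c in $(cat "$file"); do
        # Compare whole words, so that "cpu" does not match "cpuset".
        [ "$c" = "$name" ] && return 0
    done
    return 1
}

# Example usage:
#   for f in /sys/fs/cgroup/cgroup.controllers \
#            /sys/fs/cgroup/kubepods/cgroup.controllers; do
#       if has_controller "$f" cpuset; then
#           echo "$f: cpuset available"
#       else
#           echo "$f: cpuset missing"
#       fi
#   done
```

On the Debian 12 instance above, this would report cpuset as available at the root and missing under kubepods.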

@TecIntelli
Author

Here is an update on what we have found regarding the problem described initially. Maybe somebody else can explain more about these findings.

It might be an issue with kernel 6.1 as used on Debian 12 (last tried with the latest version, 6.1.69).
After we manually upgraded the kernel to 6.5.10, we could install MicroK8s 1.29 latest/edge (6469) without problems, and all expected pods came up properly.

Let me attach the inspect files just to compare if required:
Kernel 6.1.69: debian12.4_kernel6.1.69-1_inspection-report-20240130_125914.tar.gz
Kernel 6.5.10: debian12.4_kernel6.5.10-1~bpo12+1_inspection-report-20240130_130629.tar.gz

Additionally (building on the details from @neoaggelos), we have figured out that the cause on kernel 6.1.x might be a delegation issue. If we apply the following before installing MicroK8s, the initial problem does not occur.

# mkdir -p /etc/systemd/system/user@.service.d
# cat > /etc/systemd/system/user@.service.d/delegate.conf << EOF
[Service]
Delegate=cpu cpuset io memory pids
EOF
# systemctl daemon-reload

github - opencontainers - cgroupv2
Let me also attach the inspect files with these settings:
debian12.4_kernel6.1.69-1_inspection-report-20240130_135728.tar.gz

@neoaggelos
Contributor

Hi @TecIntelli, thanks a lot for looking deeper and coming up with a path towards a solution. It is still not clear to me how we could handle this on the MicroK8s side; I do not think it's a good approach to mess with the system like this.

@dimw

dimw commented Jan 30, 2024

I unexpectedly ran into the same issue on an HA cluster running Ubuntu 22.04 LTS (Hetzner cloud server) and MicroK8s 1.29/stable. First, I spotted weird behavior on one faulty node of the HA cluster (a container stayed in the Terminating state and could not be deleted). After rebooting, I observed that the microk8s status output was flaky, alternating between proper status reports, "not running" messages, and "random" execution errors. No issues were reported when running microk8s inspect.

At some point I realized that journalctl -f -u snap.microk8s.daemon-kubelite was logging a lot, with some errors hidden in between. It took me a while to understand that microk8s.daemon-kubelite was actually not starting (which was sadly not reflected by microk8s inspect):

"Failed to start ContainerManager" err="failed to initialize top level QOS containers: root container [kubepods] doesn't exist"
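
Since microk8s inspect reports the kubelite service as running even in this state, searching the journal for this exact message is a quicker check. A minimal sketch, assuming the error text stays as quoted in this thread; has_kubepods_error is a hypothetical helper name:

```shell
#!/bin/sh
# has_kubepods_error
# Hypothetical helper: read log lines on stdin and succeed (exit 0) if the
# ContainerManager failure quoted in this thread is among them.
has_kubepods_error() {
    # -F: fixed string, so the square brackets are matched literally.
    grep -qF "root container [kubepods] doesn't exist"
}

# Example usage:
#   if journalctl -u snap.microk8s.daemon-kubelite -n 1000 --no-pager \
#           | has_kubepods_error; then
#       echo "kubelite is failing on the kubepods cgroup"
#   fi
```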

After setting up a new clean machine with Ubuntu 22.04.3 LTS and 1.29/stable (single node), I ran into the same problem of microk8s.daemon-kubelite not starting. On top of that, I got the missing localnode.yaml error reported by @Zvirovyi earlier.

For now, I managed to restore the cluster by downgrading MicroK8s to v1.28.3:

snap refresh microk8s --classic --channel=1.28/stable

PS: Adding and removing nodes from the HA cluster was very smooth in every stage, even with the "broken" 1.29/stable. Kudos to the maintainers!

@TecIntelli
Author

TecIntelli commented Jan 30, 2024

@dimw Do you remember which kernel version was running on your broken node with Ubuntu 22.04 and on the new clean host with Ubuntu 22.04.3?
I tested a new AWS EC2 instance with Ubuntu 22.04.3 LTS without any problems. The snap package with MicroK8s 1.29/stable (6364) on a single node starts as expected.
Kernel: 6.2.0-1018-aws

@dimw

dimw commented Jan 31, 2024

@TecIntelli I made a snapshot of the machine before purging it, so I have now restored it and checked the data. Both machines have the same configuration:

  • Ubuntu 22.04.3 LTS
  • Kernel: 5.15.0-92-generic

@TecIntelli
Author

@dimw I was curious and ran a short test on an AWS EC2 instance with Ubuntu 22.04.3 and kernel 5.15.0-1052-aws. Unfortunately, I cannot confirm the behavior you mentioned: when I installed MicroK8s 1.29/stable (6364) via snap, it ran smoothly and all pods came up as expected.
The issue might be a different one.

Here is the inspect file of the single-node instance:
ubuntu22.04.3_kernel5.15.0-1052-aws_inspection-report-20240131_132643.tar.gz

@dimw

dimw commented Feb 1, 2024

@TecIntelli I repeated the process yesterday, installed the newest Ubuntu on Hetzner Cloud, and ran into the following two issues again:

  • microk8s.daemon-kubelite not starting
  • error in the microk8s inspect output:
    cp: cannot stat '/var/snap/microk8s/6364/var/kubernetes/backend/localnode.yaml': No such file or directory
Expand for details
$ apt update
$ apt upgrade -y
$ apt install snapd -y
$ snap install microk8s --classic --channel=1.29/stable
$ reboot # after kernel upgrade
$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 22.04.3 LTS
Release:	22.04
Codename:	jammy
$ uname -r
5.15.0-92-generic

$ microk8s start
$ microk8s inspect
microk8s inspect
Inspecting system
Inspecting Certificates
Inspecting services
  Service snap.microk8s.daemon-cluster-agent is running
  Service snap.microk8s.daemon-containerd is running
  Service snap.microk8s.daemon-kubelite is running
  Service snap.microk8s.daemon-k8s-dqlite is running
  Service snap.microk8s.daemon-apiserver-kicker is running
  Copy service arguments to the final report tarball
Inspecting AppArmor configuration
Gathering system information
  Copy processes list to the final report tarball
  Copy disk usage information to the final report tarball
  Copy memory usage information to the final report tarball
  Copy server uptime to the final report tarball
  Copy openSSL information to the final report tarball
  Copy snap list to the final report tarball
  Copy VM name (or none) to the final report tarball
  Copy current linux distribution to the final report tarball
  Copy asnycio usage and limits to the final report tarball
  Copy inotify max_user_instances and max_user_watches to the final report tarball
  Copy network configuration to the final report tarball
Inspecting kubernetes cluster
  Inspect kubernetes cluster
Inspecting dqlite
  Inspect dqlite
cp: cannot stat '/var/snap/microk8s/6364/var/kubernetes/backend/localnode.yaml': No such file or directory

Building the report tarball
  Report tarball is at /var/snap/microk8s/6364/inspection-report-20240131_203342.tar.gz

$ journalctl -u snap.microk8s.daemon-kubelite -n 1000 | grep "err="
Jan 31 20:38:20 ubuntu-4gb-fsn1-2 microk8s.daemon-kubelite[65475]: E0131 20:38:20.663704   65475 kubelet.go:2353] "Skipping pod synchronization" err="[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]"
Jan 31 20:38:20 ubuntu-4gb-fsn1-2 microk8s.daemon-kubelite[65475]: E0131 20:38:20.721005   65475 container_manager_linux.go:881] "Unable to get rootfs data from cAdvisor interface" err="unable to find data in memory cache"
Jan 31 20:38:20 ubuntu-4gb-fsn1-2 microk8s.daemon-kubelite[65475]: E0131 20:38:20.772964   65475 kubelet.go:2353] "Skipping pod synchronization" err="[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]"
Jan 31 20:38:20 ubuntu-4gb-fsn1-2 microk8s.daemon-kubelite[65475]: E0131 20:38:20.967043   65475 kubelet.go:1542] "Failed to start ContainerManager" err="failed to initialize top level QOS containers: root container [kubepods] doesn't exist"

I also tried the same with Ubuntu 20.04.6 LTS (kernel 5.4.0-170-generic) and got the same microk8s inspect error, but microk8s.daemon-kubelite starts and the cluster seems to be operational.

@Sampy84

Sampy84 commented Mar 27, 2024

Hi all,

I have the same issue on Oracle Linux 9.3:

microk8s inspect
Inspecting system
Inspecting Certificates
Inspecting services
  Service snap.microk8s.daemon-cluster-agent is running
  Service snap.microk8s.daemon-containerd is running
  Service snap.microk8s.daemon-kubelite is running
  Service snap.microk8s.daemon-k8s-dqlite is running
  Service snap.microk8s.daemon-apiserver-kicker is running
  Copy service arguments to the final report tarball
Inspecting AppArmor configuration
Gathering system information
  Copy processes list to the final report tarball
  Copy disk usage information to the final report tarball
  Copy memory usage information to the final report tarball
  Copy server uptime to the final report tarball
  Copy openSSL information to the final report tarball
  Copy snap list to the final report tarball
  Copy VM name (or none) to the final report tarball
  Copy current linux distribution to the final report tarball
  Copy asnycio usage and limits to the final report tarball
  Copy inotify max_user_instances and max_user_watches to the final report tarball
  Copy network configuration to the final report tarball
Inspecting kubernetes cluster
  Inspect kubernetes cluster
Inspecting dqlite
  Inspect dqlite
cp: cannot stat '/var/snap/microk8s/6641/var/kubernetes/backend/localnode.yaml': No such file or directory

Building the report tarball
  Report tarball is at /var/snap/microk8s/6641/inspection-report-20240327_173242.tar.gz

@idc77

idc77 commented Apr 5, 2024

Same on Ubuntu Server 22.04.4.

@robertkottelin

robertkottelin commented Apr 6, 2024

Same on Debian GNU/Linux 12 (bookworm)

microk8s.inspect:

Inspecting system
Inspecting Certificates
Inspecting services
Service snap.microk8s.daemon-cluster-agent is running
Service snap.microk8s.daemon-containerd is running
Service snap.microk8s.daemon-kubelite is running
Service snap.microk8s.daemon-k8s-dqlite is running
Service snap.microk8s.daemon-apiserver-kicker is running
Copy service arguments to the final report tarball
Inspecting AppArmor configuration
Gathering system information
Copy processes list to the final report tarball
Copy disk usage information to the final report tarball
Copy memory usage information to the final report tarball
Copy server uptime to the final report tarball
Copy openSSL information to the final report tarball
Copy snap list to the final report tarball
Copy VM name (or none) to the final report tarball
Copy current linux distribution to the final report tarball
Copy asnycio usage and limits to the final report tarball
Copy inotify max_user_instances and max_user_watches to the final report tarball
Copy network configuration to the final report tarball
Inspecting kubernetes cluster
Inspect kubernetes cluster
Inspecting dqlite
Inspect dqlite
cp: cannot stat '/var/snap/microk8s/6668/var/kubernetes/backend/localnode.yaml': No such file or directory

Building the report tarball
Report tarball is at /var/snap/microk8s/6668/inspection-report-20240406_123559.tar.gz

@IshwarChincholkar

System info:

PRETTY_NAME="Ubuntu 22.04 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04 (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

When I installed version 1.29 using snap install microk8s --classic --channel=1.29/stable
and ran inspect:

microk8s inspect
Inspecting system
Inspecting Certificates
Inspecting services
Service snap.microk8s.daemon-cluster-agent is running
Service snap.microk8s.daemon-containerd is running
Service snap.microk8s.daemon-kubelite is running
Service snap.microk8s.daemon-k8s-dqlite is running
Service snap.microk8s.daemon-apiserver-kicker is running
Copy service arguments to the final report tarball
Inspecting AppArmor configuration
Gathering system information
Copy processes list to the final report tarball
Copy disk usage information to the final report tarball
Copy memory usage information to the final report tarball
Copy server uptime to the final report tarball
Copy openSSL information to the final report tarball
Copy snap list to the final report tarball
Copy VM name (or none) to the final report tarball
Copy current linux distribution to the final report tarball
Copy asnycio usage and limits to the final report tarball
Copy inotify max_user_instances and max_user_watches to the final report tarball
Copy network configuration to the final report tarball
Inspecting kubernetes cluster
Inspect kubernetes cluster
Inspecting dqlite
Inspect dqlite
cp: cannot stat '/var/snap/microk8s/6641/var/kubernetes/backend/localnode.yaml': No such file or directory

WARNING: Maximum number of inotify user watches is less than the recommended value of 1048576.
Increase the limit with:
echo fs.inotify.max_user_watches=1048576 | sudo tee -a /etc/sysctl.conf
sudo sysctl --system

I got the error above.

Solution:

snap remove --purge microk8s
snap install microk8s --classic --channel=1.28/stable

@Nospamas

Nospamas commented May 3, 2024

I ran into this same issue, with both the cp: cannot stat '/var/snap/microk8s/6370/var/kubernetes/backend/localnode.yaml': No such file or directory error and the "Failed to start ContainerManager" err="failed to initialize top level QOS containers: root container [kubepods] doesn't exist" error. The latter is what I searched for on Google.

This happened on a fresh Ubuntu Server 22.04.4 minimal installation on which I had only run apt upgrade. It looks like the default version of MicroK8s installed was 1.29/stable (selected during the server install), which resulted in the above errors.

The fix for me was to roll back to 1.28, i.e.:

sudo snap remove --purge microk8s
sudo snap install microk8s --classic --channel=1.28/stable

This let me start the node back up. Subsequently I upgraded to the latest version:

sudo snap refresh microk8s --channel 1.30/stable

and rejoined the node to the cluster.

microk8s join 192.168.x.x:25000/xxxxxxxxxxxxxxxxxxxxxxxxxx/xxxxxxxxx

Everything seems to be in order.

@neoaggelos
Contributor

Hi @Nospamas and all, this seems to have started on Debian, but it is currently affecting Ubuntu versions as well. It is related to the kubepods cgroup not getting the cpuset controller on 1.29 and 1.30.

We have a fix, #4503, that is out on the 1.29/edge and 1.30/edge channels and will shortly find its way to 1.29/stable and 1.30/stable respectively. So, if people are currently experiencing issues, I would recommend:

# switch to 1.30/edge channel if running 1.30
sudo snap refresh microk8s --channel 1.30/edge

# switch to 1.29/edge channel if running 1.29
sudo snap refresh microk8s --channel 1.29/edge

The issue will remain open until the bugfix is promoted to stable.

@arcrowinteractive


Will we be able to switch back to stable once the bug is fixed? Or is it best to use version 1.28?

@carlos00027

I hate microk8s

@quyen66

quyen66 commented Jun 6, 2024

Hi,

I have the same issue on my Ubuntu 22.04.
I installed version 1.29/edge, but microk8s is not running and inspect reports this error:

Inspecting system
Inspecting Certificates
Inspecting services
  Service snap.microk8s.daemon-cluster-agent is running
  Service snap.microk8s.daemon-containerd is running
  Service snap.microk8s.daemon-kubelite is running
  Service snap.microk8s.daemon-k8s-dqlite is running
  Service snap.microk8s.daemon-apiserver-kicker is running
  Copy service arguments to the final report tarball
Inspecting AppArmor configuration
Gathering system information
  Copy processes list to the final report tarball
  Copy disk usage information to the final report tarball
  Copy memory usage information to the final report tarball
  Copy server uptime to the final report tarball
  Copy openSSL information to the final report tarball
  Copy snap list to the final report tarball
  Copy VM name (or none) to the final report tarball
  Copy current linux distribution to the final report tarball
  Copy asnycio usage and limits to the final report tarball
  Copy inotify max_user_instances and max_user_watches to the final report tarball
  Copy network configuration to the final report tarball
Inspecting kubernetes cluster
  Inspect kubernetes cluster
Inspecting dqlite
  Inspect dqlite
cp: cannot stat '/var/snap/microk8s/6887/var/kubernetes/backend/localnode.yaml': No such file or directory

Building the report tarball
  Report tarball is at /var/snap/microk8s/6887/inspection-report-20240606_145417.tar.gz

report file
inspection-report-20240606_145417.tar.gz

@leosimoesp

k3s is life!

@zioCristia

Hi, same issue here on Ubuntu 22.04 with both 1.29/edge and 1.30/edge.

Refreshing to 1.28/stable seems to work.

The fix for me was to roll back to 1.28, i.e.:

sudo snap remove --purge microk8s
sudo snap install microk8s --classic --channel=1.28/stable

This let me start the node back up. Subsequently I upgraded to the latest:

sudo snap refresh microk8s --channel 1.30/stable

and I no longer get:
cp: cannot stat '/var/snap/microk8s/6887/var/kubernetes/backend/localnode.yaml': No such file or directory

But microk8s status still reports that it is not running:

$ microk8s status
microk8s is not running. Use microk8s inspect for a deeper inspection.
$ microk8s inspect
Inspecting system
Inspecting Certificates
Inspecting services
  Service snap.microk8s.daemon-cluster-agent is running
  Service snap.microk8s.daemon-containerd is running
  Service snap.microk8s.daemon-kubelite is running
  Service snap.microk8s.daemon-k8s-dqlite is running
  Service snap.microk8s.daemon-apiserver-kicker is running
  Copy service arguments to the final report tarball
Inspecting AppArmor configuration
Gathering system information
  Copy processes list to the final report tarball
  Copy disk usage information to the final report tarball
  Copy memory usage information to the final report tarball
  Copy server uptime to the final report tarball
  Copy openSSL information to the final report tarball
  Copy snap list to the final report tarball
  Copy VM name (or none) to the final report tarball
  Copy current linux distribution to the final report tarball
  Copy asnycio usage and limits to the final report tarball
  Copy inotify max_user_instances and max_user_watches to the final report tarball
  Copy network configuration to the final report tarball
Inspecting kubernetes cluster
  Inspect kubernetes cluster
Inspecting dqlite
  Inspect dqlite

Building the report tarball
  Report tarball is at /var/snap/microk8s/6946/inspection-report-20240619_233116.tar.gz

@thirumalaivasan-k

thirumalaivasan-k commented Jul 1, 2024

Creating the missing file fixed my issue.
I created a configuration file named localnode.yaml in the directory where it was reported missing. The contents of the file specify a Kubernetes ConfigMap with the following details:

apiVersion: v1
kind: ConfigMap
metadata:
  name: localnode-config
  namespace: kube-system
data:
  address: 192.168.1.100:19001
  role: node

This configuration sets the address to 192.168.1.100:19001 and assigns the role as a node.

Purpose of localnode.yaml:
The localnode.yaml file defines configuration for a single-node Kubernetes cluster. It specifies details like the node’s IP address, port, and role (e.g., as a regular node) within the cluster. MicroK8s uses this configuration to set up the local Kubernetes environment.

@jli113

jli113 commented Jul 31, 2024

Switching to the edge channel is not working for me. WSL Ubuntu 20.04, 22.04, and 24.04 all gave:
cp: cannot stat '/var/snap/microk8s/****/var/kubernetes/backend/localnode.yaml': No such file or directory

@marcwittke

Are there plans to get this into the stable branch before the end of life of 1.28 on 2024-10-28?

@NerdyGriffin

As of writing this, both 1.30/edge and 1.30/stable now point to v1.30.4. Does that mean the fix is now merged?

I first saw this problem appear on an Ubuntu 24.04 (x86) VM after upgrading that node from v1.30.3 to v1.30.4. My other nodes run Ubuntu 24.04 on ARM (all Pi 5); they were upgraded from v1.30.1 to v1.30.4 and did not have this problem.

@hdzcalmir

I am still facing this issue on my new VPS despite trying all available fixes, and none of them have worked.

@nirmesh

nirmesh commented Sep 8, 2024

I am also facing the same issue.

@Monochromics

Re: the top level QOS problems, it looks like the systemd delegate conf was included in 1.31 (and maybe ported to others, didn't check). It is /not/ in 1.28. Presumably the changes could just be executed manually though via:

mkdir -p /etc/systemd/system/snap.microk8s.daemon-kubelite.service.d
tee /etc/systemd/system/snap.microk8s.daemon-kubelite.service.d/delegate.conf > /dev/null <<EOF
[Service]
Delegate=yes
EOF
systemctl daemon-reload
snap restart microk8s
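
To see whether the delegation took effect, one could check that the kubepods cgroup lists the cpuset controller. This is a rough sketch, assuming cgroup v2 and assuming the cgroup lives at /sys/fs/cgroup/kubepods (the path may differ depending on the cgroup driver). has_controller is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: succeed if CONTROLLER is listed in a cgroup v2
# cgroup.controllers file (one space-separated line of controller names).
# Returns 0 if listed, 1 if not listed, 2 if the file cannot be read.
has_controller() {
    controller="$1"
    file="${2:-/sys/fs/cgroup/kubepods/cgroup.controllers}"  # assumed path
    [ -r "$file" ] || return 2
    for name in $(cat "$file"); do
        [ "$name" = "$controller" ] && return 0
    done
    return 1
}
```

Usage would be something like has_controller cpuset && echo "cpuset is available to kubepods". A return code of 2 means the file could not be read, which may simply mean the cgroup has not been created yet.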

@berkayoz
Member

Hey folks, we did not see this on 1.28 initially. We've backported the workaround/fix to 1.28 with #4667; it should be promoted to 1.28/stable in the next few weeks.

We are following the upstream issue kubernetes/kubernetes#122955 (comment) and the possible fix kubernetes/kubernetes#125923

@szszi

szszi commented Sep 16, 2024

Hi,
try this:
sudo microk8s stop
sudo modprobe nf_conntrack
sudo microk8s start
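
Before restarting anything, it may be worth checking whether the module is already loaded. A minimal sketch, assuming the usual /proc/modules layout where the module name is the first field on each line. module_loaded is a made-up helper name, and a module compiled into the kernel will not show up in this listing.

```shell
#!/bin/sh
# Hypothetical helper: succeed if MODULE appears in a /proc/modules-style
# listing, where the module name is the first field on each line.
module_loaded() {
    module="$1"
    listing="${2:-/proc/modules}"
    awk -v m="$module" '$1 == m { found = 1 } END { exit found ? 0 : 1 }' "$listing"
}

# Example (commented out so the sketch has no side effects):
# module_loaded nf_conntrack || sudo modprobe nf_conntrack
```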

@aipirvu

aipirvu commented Sep 20, 2024

Same issue..
Ubuntu: 24.04 LTS
Microk8s: 1.28.13, 1.29.8, 1.30.4, 1.31.0

As a workaround, following the suggestion by thirusubash (tested on 1.31.0 and 1.29.8), I created the file manually. You can create it based on the configuration found in the cluster.yaml file in the same directory. In my case, I had an HA cluster, so cluster.yaml contained the information for all 3 of my nodes. I just picked the entry for my current node and created the localnode.yaml file with it, e.g.:

- ID: 132467980
  Address: 127.0.0.1:19001
  Role: 0

Note: If you drained your node, microk8s status will still report that the cluster is not running, while microk8s inspect will no longer report the missing file and microk8s kubectl commands will work. Make sure to uncordon your node so you don't spend time troubleshooting a nonexistent problem like I did.
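
Picking the entry out of cluster.yaml by hand can also be scripted. The following is only a sketch under assumptions: extract_node_entry is a made-up helper name, and it assumes cluster.yaml is a flat list whose entries each start with "- " and carry an Address field, as in the example above.

```shell
#!/bin/sh
# Hypothetical helper: print the list entry in a cluster.yaml-style file
# whose Address field equals ADDRESS. Entries are assumed to start with "- ".
extract_node_entry() {
    address="$1"
    file="${2:-cluster.yaml}"
    awk -v addr="$address" '
        function emit() { if (found) printf "%s", rec; rec = ""; found = 0 }
        /^- / { emit() }                      # a new entry begins
        { rec = rec $0 "\n" }                 # collect the entry line by line
        {
            value = $0
            if (sub(/^[- ]*Address:[ ]*/, "", value) && value == addr) found = 1
        }
        END { emit() }
    ' "$file"
}
```

Run from the backend directory, extract_node_entry 127.0.0.1:19001 cluster.yaml > localnode.yaml would write the matching entry; the address here is the one from the example and should be replaced with the current node's own.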

@nubblesite

Ran into this today. Seems like microk8s is genuinely unusable until this is fixed, any updates to this?

@james-ro-williams

I also ran into this issue. Any news on a fix?

@jcjveraa

jcjveraa commented Oct 1, 2024

Chipping in: I have this too on bare-metal AMD64 clusters (three identical 6-CPU, 16 GB RAM mini desktop PCs) on the latest Ubuntu Server version, 24.04 (minimized install). The only thing I have installed is microk8s, using the option in the Ubuntu installer to install the snap; I then followed the (few) steps to get an HA cluster running.

I'm running a 3-node HA cluster. Repeatedly, when I take down the cluster (shut down the machines) and later start it again, 1 or 2 random nodes report 'NotReady' for a long time when running kubectl get no. Sometimes they come up; sometimes they stay this way effectively indefinitely (>30 minutes). During that time the cluster is indeed not functioning: random services are unavailable.

When I ssh into the machine running the NotReady node, microk8s status says essentially all is fine.

When I then run microk8s inspect, it errors out like reported in this topic.

Inspecting dqlite
  Inspect dqlite
cp: cannot stat '/var/snap/microk8s/7232/var/kubernetes/backend/localnode.yaml': No such file or directory

After this, the node reports Ready on kubectl get no and is functioning seemingly fine - all pods and services running.


Some system details - note that the microk8s status reports 'running' both on Ready and NotReady nodes indifferently.

$ microk8s version
MicroK8s v1.30.5 revision 7232

$ microk8s status
microk8s is running
high-availability: yes
  datastore master nodes: XXX:19001 YYY:19001 ZZZ:19001
  datastore standby nodes: none
addons:
  enabled:
    dashboard-ingress    # (community) Ingress definition for Kubernetes dashboard
    community            # (core) The community addons repository
    dashboard            # (core) The Kubernetes dashboard
    dns                  # (core) CoreDNS
    ha-cluster           # (core) Configure high availability on the current node
    helm                 # (core) Helm - the package manager for Kubernetes
    helm3                # (core) Helm 3 - the package manager for Kubernetes
    ingress              # (core) Ingress controller for external access
    metrics-server       # (core) K8s Metrics Server for API access to service metrics
    rbac                 # (core) Role-Based Access Control for authorisation
  disabled:
    # removed 

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 24.04.1 LTS
Release:        24.04
Codename:       noble

# (edited for essential details)
$ sudo lshw 
lenovo                   
    description: Mini PC
    vendor: LENOVO
    version: ThinkCentre M920q
    width: 64 bits
    capabilities: smbios-3.2.1 dmi-3.2.1 smp vsyscall32
  *-core
       description: Motherboard
       product: 3136
       vendor: LENOVO
       physical id: 0
       version: SDK0J40697 WIN 3305155531172
     *-firmware
          description: BIOS
          vendor: LENOVO
          physical id: 0
          version: M1UKT77A
          date: 04/10/2024
          size: 64KiB
          capacity: 12MiB
          capabilities: pci upgrade shadowing cdboot bootselect socketedrom edd int13floppy1200 int13floppy720 int13floppy2880 int5printscreen int9keyboard int14serial int17printer acpi usb biosbootspecification uefi
     *-memory
          description: System Memory
          physical id: 3b
          slot: System board or motherboard
          size: 16GiB
     *-cpu
          description: CPU
          product: Intel(R) Core(TM) i5-8500T CPU @ 2.10GHz
          vendor: Intel Corp.
          physical id: 48
          bus info: cpu@0
          version: 6.158.10
          serial: To Be Filled By O.E.M.
          slot: U3E1
          size: 3299MHz
          capacity: 3500MHz
          width: 64 bits
          clock: 100MHz
          capabilities: lm fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp x86-64 constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64
 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb pti ssbd ibrs ibpb stibp tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp vnmi md_clear flush_l1d arch_capabilities cpufreq

@Anushkach

Got this error. It seems there is still no fix for this issue after 11 months.

@berkayoz
Member

berkayoz commented Nov 6, 2024

Hey folks, this issue now contains reports related to multiple causes.

To address certain ones:

  1. About the "failed to initialize top level QOS containers" issue: this seems to be an upstream bug that will get fixed in 1.32+. See my previous comment for more info.
  2. About the nf_conntrack issue: we've landed a patch, "fix: ensure nf_conntrack module loaded for kubelite" (#4705), which tries to load nf_conntrack before starting kubelite. The patches should have landed on the stable channels for 1.28+.
  3. About the cp: cannot stat '/var/snap/microk8s/x1/var/kubernetes/backend/localnode.yaml': No such file or directory issue: this is a "false positive" in the inspect script. This file was used by previous versions of k8s-dqlite and is not created or used in 1.28+. You can check this by creating a fresh cluster with 1.30 nodes, for example.

I'm closing this issue since a workaround has been issued for the original bug report. Please create a separate issue if you are facing a different problem, thanks!
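
Since the localnode.yaml message in point 3 is noise on 1.28+, it can be filtered out when reading inspect output so that any other error stands out. This is a sketch only: drop_localnode_false_positive is a made-up name, and the match string is taken from the error text quoted in this thread.

```shell
#!/bin/sh
# Hypothetical filter: drop the known localnode.yaml "cannot stat" line
# from inspect output read on stdin, and pass everything else through.
drop_localnode_false_positive() {
    # grep exits non-zero when nothing is left to print; that is not an error here.
    grep -Fv "backend/localnode.yaml': No such file or directory" || true
}

# Example (commented out so the sketch has no side effects):
# microk8s inspect 2>&1 | drop_localnode_false_positive
```

The 2>&1 in the example is there because the cp error is written to stderr rather than stdout.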

@AminHA1248

In my case the issue was resolved by creating the missing file:
touch /var/snap/microk8s/7449/var/kubernetes/backend/localnode.yaml
