K0S run via docker-compose doesn't recover from host rebooting (single host) #5023

Open
tmeltser opened this issue Sep 22, 2024 · 17 comments
Assignees: juanluisvaladas
Labels: documentation (Improvements or additions to documentation), question (Further information is requested)

Comments

@tmeltser commented Sep 22, 2024

Before creating an issue, make sure you've checked the following:

  • You are running the latest released version of k0s
  • Make sure you've searched for existing issues, both open and closed
  • Make sure you've searched for PRs too, a fix might've been merged already
  • You're looking at docs for the released version; "main" branch docs are usually ahead of released versions.

Platform

as360@AS360-AIO-Ubuntu:~$ uname -srvmo; cat /etc/os-release || lsb_release -a
Linux 6.8.0-45-generic #45-Ubuntu SMP PREEMPT_DYNAMIC Fri Aug 30 12:02:04 UTC 2024 x86_64 GNU/Linux
PRETTY_NAME="Ubuntu 24.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.1 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo

Version

v1.30.4+k0s.0

Sysinfo

`k0s sysinfo`
Total memory: 35.2 GiB (pass)
Disk space available for /var/lib/k0s: 197.0 GiB (pass)
Name resolution: localhost: [::1 127.0.0.1] (pass)
Operating system: Linux (pass)
  Linux kernel release: 6.8.0-45-generic (pass)
  Max. file descriptors per process: current: 1048576 / max: 1048576 (pass)
  AppArmor: unavailable (pass)
  Executable in PATH: modprobe: /sbin/modprobe (pass)
  Executable in PATH: mount: /bin/mount (pass)
  Executable in PATH: umount: /bin/umount (pass)
  /proc file system: mounted (0x9fa0) (pass)
  Control Groups: version 2 (pass)
    cgroup controller "cpu": available (is a listed root controller) (pass)
    cgroup controller "cpuacct": available (via cpu in version 2) (pass)
    cgroup controller "cpuset": available (is a listed root controller) (pass)
    cgroup controller "memory": available (is a listed root controller) (pass)
    cgroup controller "devices": available (device filters attachable) (pass)
    cgroup controller "freezer": available (cgroup.freeze exists) (pass)
    cgroup controller "pids": available (is a listed root controller) (pass)
    cgroup controller "hugetlb": available (is a listed root controller) (pass)
    cgroup controller "blkio": available (via io in version 2) (pass)
  CONFIG_CGROUPS: Control Group support: no kernel config found (warning)
  CONFIG_NAMESPACES: Namespaces support: no kernel config found (warning)
  CONFIG_NET: Networking support: no kernel config found (warning)
  CONFIG_EXT4_FS: The Extended 4 (ext4) filesystem: no kernel config found (warning)
  CONFIG_PROC_FS: /proc file system support: no kernel config found (warning)

What happened?

k0s running on a single-node system (with multiple services run by docker-compose) doesn't survive repeated host reboots: it comes back up after a few reboots and then stops recovering, or it doesn't come up at all after some reboots.
Attached below is a sample docker-compose file to demonstrate the problem.
Tried on Ubuntu 24.04 and CentOS 9 - same results.

Steps to reproduce

  1. Take the sample docker-compose file (attached below)
  2. Run the following command: docker compose -f aio-compose-sample.yaml up -d --wait
  3. Reboot the host several times; at some point, after a handful of restarts (or even after the first one), k0s breaks

Expected behavior

k0s should always survive host restarts.

Actual behavior

After a few restarts, k0s breaks down:

# docker compose -f aio-compose-sample.yaml exec k0s k0s kubectl get pods -A
NAMESPACE       NAME                                           READY   STATUS        RESTARTS       AGE
cert-manager    cert-manager-9647b459d-hlxr2                   1/1     Running       1 (4h3m ago)   4h15m
cert-manager    cert-manager-cainjector-5d8798687c-h8lk4       1/1     Running       2 (4h3m ago)   4h15m
cert-manager    cert-manager-webhook-c77744d75-b5vcn           1/1     Running       1 (4h3m ago)   4h15m
ingress-nginx   ingress-nginx-admission-create-bxxgb           0/1     Pending       0              17m
ingress-nginx   ingress-nginx-admission-create-lh4p7           0/1     Terminating   0              3h58m
ingress-nginx   ingress-nginx-controller-55df698df5-6vtxj      1/1     Running       1 (4h3m ago)   4h16m
k0s-system      k0s-pushgateway-86bd768578-cp7cq               1/1     Running       1 (4h3m ago)   4h17m
kube-system     coredns-85c69f454c-2hgn7                       1/1     Running       1 (4h3m ago)   4h17m
kube-system     konnectivity-agent-27m8k                       1/1     Terminating   1 (4h3m ago)   4h17m
kube-system     kube-proxy-pxnl5                               1/1     Running       1 (4h3m ago)   4h17m
kube-system     kube-router-84vsr                              1/1     Terminating   1 (4h3m ago)   4h17m
kube-system     metrics-server-7cc78958fc-gkj7l                1/1     Running       1 (4h3m ago)   4h17m
openebs         openebs-localpv-provisioner-86d8949887-49rr7   1/1     Running       0              4h2m
openebs         openebs-pre-upgrade-hook-6jcts                 0/1     Pending       0              3h38m
# docker compose -f aio-compose-sample.yaml exec k0s k0s kubectl get nodes
NAME   STATUS     ROLES           AGE     VERSION
k0s    NotReady   control-plane   4h18m   v1.30.4+k0s

Screenshots and logs

Kindly advise what logs are needed, and I'll be happy to add them.

Additional context

Adding a sample docker compose to demonstrate the problem:
aio-compose-sample.zip

k0s status and Docker version info:

as360@AS360-AIO-Ubuntu:~$ docker compose -f aio-compose-sample.yaml exec k0s k0s status
Version: v1.30.4+k0s.0
Process ID: 7
Role: controller
Workloads: true
SingleNode: false
Kube-api probing successful: true
Kube-api probing last error:

as360@AS360-AIO-Ubuntu:~$ docker version
Client: Docker Engine - Community
 Version:           27.3.1
 API version:       1.47
 Go version:        go1.22.7
 Git commit:        ce12230
 Built:             Fri Sep 20 11:40:59 2024
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          27.3.1
  API version:      1.47 (minimum version 1.24)
  Go version:       go1.22.7
  Git commit:       41ca978
  Built:            Fri Sep 20 11:40:59 2024
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.7.22
  GitCommit:        7f7fdf5fed64eb6a7caf99b3e12efcf9d60e311c
 runc:
  Version:          1.1.14
  GitCommit:        v1.1.14-0-g2c9f560
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
tmeltser added the bug (Something isn't working) label on Sep 22, 2024
juanluisvaladas self-assigned this on Sep 23, 2024
@twz123 (Member) commented Sep 27, 2024

> Kindly advise what logs are needed, and I'll be happy to add them.

The logs of the k0s Docker container would be helpful, and the logs of the failing containers, too. You could also add the --debug flag to k0s for even more detailed logs. You might want to try to collect a support bundle as well.
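
A minimal sketch of how those logs could be collected, assuming the compose file and service name from the attached sample (aio-compose-sample.yaml, service k0s):

# Capture the k0s container logs to a file
docker compose -f aio-compose-sample.yaml logs k0s > k0s.log

# Logs of an individual failing pod, e.g. the kube-router pod shown above
docker compose -f aio-compose-sample.yaml exec k0s k0s kubectl logs -n kube-system kube-router-84vsr

The --debug flag would be appended to the k0s controller command in the compose file, as shown in the later diffs in this thread.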

@tmeltser (Author) commented Sep 29, 2024

Hi,
Please find attached the k0s log file (with the debug flag turned on): k0s.log
As for the logs of the failed containers, kubectl doesn't seem to be able to supply them in this state (Ubuntu host, after reboot):

as360@AS360-AIO-Ubuntu:~$ docker compose -f aio-compose-sample.yaml exec k0s k0s kubectl logs konnectivity-agent-j6llm -n kube-system
Error from server: Get "https://172.17.0.2:10250/containerLogs/kube-system/konnectivity-agent-j6llm/konnectivity-agent": No agent available
as360@AS360-AIO-Ubuntu:~$ docker compose -f aio-compose-sample.yaml exec k0s k0s kubectl logs ingress-nginx-admission-create-9x6tb -n ingress-nginx
Error from server: Get "https://172.17.0.2:10250/containerLogs/ingress-nginx/ingress-nginx-admission-create-9x6tb/create": No agent available

@twz123 (Member) commented Oct 8, 2024

It could be that there are some stuck containers from previous runs. When k0s is shut down, it won't stop running pods/containers; you need to drain the node manually. Moreover, when running k0s in Docker, the cgroup hierarchy is possibly not properly respected, and container processes (or at least their cgroup hierarchy) might keep running. I can imagine that this causes some trouble. Could you try adding volumes for /opt/cni and /etc/cni/net.d to your compose config? After looking at the logs, I assume that some old kube-router container is blocking a new one, but the old one can't be removed properly by containerd because, after the restart, the CNI plugins are no longer installed.
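
A sketch of how those mounts could be added to the k0s service in the compose file (untested; anonymous volumes shown here, which is also what is discussed below):

services:
  k0s:
    # ... existing service settings from the sample compose file ...
    volumes:
      - /opt/cni          # CNI plugin binaries, so they survive container recreation
      - /etc/cni/net.d    # CNI network configuration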

@tmeltser (Author) commented Oct 8, 2024

I can't drain the node manually, since we are talking about an unexpected machine restart/reboot.
As for the requested volumes, no problem, I'll do that and report back (I assume we are talking about anonymous volumes and not host-mounted ones, right?).

@tmeltser (Author) commented Oct 9, 2024

I've added the two new (anonymous) volumes and it didn't make any difference (k0s failed to come up after the first restart). Attaching the updated sample compose file:
aio-compose-sample.zip
Attaching the k0s log file:
k0s.log
Any other suggestions to resolve this problem?

@tmeltser (Author) commented:

Any advice on the subject would be very much appreciated...

@github-actions (bot) commented:

The issue is marked as stale since no activity has been recorded in 30 days

github-actions bot added the Stale label on Nov 29, 2024
@tmeltser (Author) commented:

Anyone?

github-actions bot removed the Stale label on Nov 30, 2024
@twz123 (Member) commented Dec 2, 2024

The problem seems to stem from unstable k0s controller IPs. Once the cluster is initialized, the controller's IP must not change, but Docker Compose may assign a different one after a reboot. I could make it work after a reboot when using fixed IPs:

--- aio-compose-sample.yaml
+++ aio-compose-sample.yaml
@@ -23,7 +23,9 @@
       - "6443:6443"
       - "80:30080"
       - "443:30443"
-    network_mode: "bridge"
+    networks:
+      sample_net:
+        ipv4_address: 192.168.1.100
     environment:
       K0S_CONFIG: |-
         apiVersion: k0s.k0sproject.io/v1beta1
@@ -121,7 +123,9 @@
       - MSSQL_SA_PASSWORD=SomePass
     ports:
       - '1433:1433'
-    network_mode: "bridge"
+    networks:
+      sample_net:
+        ipv4_address: 192.168.1.200
   dpr:
     container_name: dpr
     image: registry:2
@@ -141,4 +145,13 @@
       REGISTRY_HTTP_TLS_KEY: /var/ssl/private/dpr.key
     ports:
       - '5443:5443'
-    network_mode: "bridge"
+    networks:
+      sample_net:
+        ipv4_address: 192.168.1.201
+
+networks:
+  sample_net:
+    driver: bridge
+    ipam:
+      config:
+        - subnet: 192.168.1.0/24

It seems the docs paragraph about custom networks in Docker has to be revisited. Apparently it works with custom networks too nowadays, at least as long as they are bridge networks?

twz123 added the question (Further information is requested) label and removed the bug (Something isn't working) label on Dec 2, 2024
@tmeltser (Author) commented Dec 2, 2024

Thanks for the prompt reply, but in the designated environment we don't control the IP(s), and we can't guarantee the ability to keep the IP(s) constant. Is there any other direction that doesn't involve fixing the IP(s)?

@twz123 (Member) commented Dec 2, 2024

You can also use a load-balanced DNS name to access the controller(s); see the docs on Control Plane High Availability for details. If all you have is a single-controller setup, you can make it simpler by using localhost, 127.0.0.1, or the Docker-managed host name for everything. Note that this will then show up in the kubeconfig files generated by k0s/k0sctl as well; you need to change the server URL in the kubeconfigs accordingly to connect to the cluster from the outside.
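
As an untested sketch, the admin kubeconfig could be exported from the container and its server URL adjusted (compose file and service name taken from the attached sample):

# Export the admin kubeconfig from the k0s container
docker compose -f aio-compose-sample.yaml exec k0s k0s kubeconfig admin > admin.conf

# Edit the 'server:' URL in admin.conf to an address reachable from outside the
# container (e.g. https://<host-name-or-ip>:6443) before using it with kubectl.
kubectl --kubeconfig admin.conf get nodes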

Try to set the following in your k0s config (note that I haven't tested this 🙃):

spec:
  api:
    externalAddress: 127.0.0.1
  storage:
    type: etcd
    etcd:
      peerAddress: 127.0.0.1

Also note that for a single node, it's usually easier not to use etcd but an SQLite database via kine. I'd replace the k0s flags --enable-worker --no-taints with --single, which selects kine as the storage backend automatically and also doesn't run the k0s join API, which isn't required in a single-node setup. In that case, you don't need to specify anything in spec.storage.
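
As a sketch of that change (untested), the controller command from the sample compose file would then look along these lines:

# before (from the attached sample compose file)
k0s controller --config=/etc/k0s/config.yaml --enable-worker --no-taints --enable-metrics-scraper --debug
# after: --single runs a single-node setup with kine/SQLite, so spec.storage can be left out
k0s controller --config=/etc/k0s/config.yaml --single --enable-metrics-scraper --debug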

@tmeltser (Author) commented Dec 3, 2024

Thanks, I will give it a try and report back on how it works.

@tmeltser (Author) commented Dec 4, 2024

I've made the change, but the cluster doesn't come up now.
Attaching my updated docker compose file (zipped); what am I missing here?
aio-compose-sample.zip

@twz123 (Member) commented Dec 5, 2024

> what am I missing here?

Two things: First, I didn't really think about the fact that endpoints in Kubernetes can't contain loopback addresses. Therefore, you also need to disable k0s's endpoint reconciler so that it doesn't try to set 127.0.0.1 as the address for kubernetes.default.svc. You can do this by adding --disable-components=endpoint-reconciler to the k0s flags. Second, your example still uses etcd and doesn't include the etcd peer address override in the k0s config. I changed the k0s flags in your new docker compose file as follows: I replaced --enable-worker --no-taints with --single and added --disable-components=endpoint-reconciler. That made k0s come up as usual after a host reboot.

--- aio-compose-sample.yaml
+++ aio-compose-sample.yaml
@@ -3,7 +3,7 @@
   k0s:
     container_name: k0s
     image: docker.io/k0sproject/k0s:v1.30.4-k0s.0
-    command: sh -c "apk add --no-cache --no-check-certificate ca-certificates && update-ca-certificates && k0s controller --config=/etc/k0s/config.yaml --enable-worker --no-taints --enable-metrics-scraper --debug"
+    command: sh -c "apk add --no-cache --no-check-certificate ca-certificates && update-ca-certificates && k0s controller --config=/etc/k0s/config.yaml --single --enable-metrics-scraper --disable-components=endpoint-reconciler"
     hostname: k0s
     privileged: true
     cgroup: host

@tmeltser (Author) commented Dec 5, 2024

Thanks, I will make another attempt with all the inputs (sorry for missing the required command-line changes) and report back.

@tmeltser (Author) commented Dec 7, 2024

Hi @twz123,
First - good news: I have tried the latest configuration with multiple restarts and the system survived all of them, thanks!
Second - I will keep the issue open for a few days, as I want to perform some more tests to verify all is well in other scenarios.
Third - I would like to suggest incorporating the latest sample file into the official documentation, as I'm sure anyone intending to run k0s via a compose file would benefit from it (the sample file also contains some insights of my own compared to what exists in the official documentation).

jnummelin added the documentation (Improvements or additions to documentation) label on Dec 9, 2024
@tmeltser (Author) commented Dec 27, 2024

@twz123 - sorry for the delay, but we have performed additional tests, and the solution seems rock solid!
@twz123 / @jnummelin - shall I close the issue, or leave it open for someone else to close after the documentation is updated?
