Cluster can not start after reboot all pods #683

Open
chernomor opened this issue Jun 26, 2024 · 1 comment

chernomor commented Jun 26, 2024

Report

I set up a cluster following https://docs.percona.com/percona-operator-for-mysql/ps/kubectl.html on a single-node k3s. All MySQL pods were working fine, but after I rebooted the k3s node the MySQL cluster cannot start.

More about the problem

I created the file /var/lib/mysql/sleep-forever (with touch) in the master pod cluster1-mysql-0, so it is running now, but the slaves are in CrashLoopBackOff:

$ kubectl -n mysql-test get pods -o wide
NAME                                             READY   STATUS             RESTARTS          AGE   IP           NODE                                  NOMINATED NODE   READINESS GATES
cluster1-haproxy-0                               2/2     Running            4 (18h ago)       22h   10.42.0.54   rt-chernomor   <none>           <none>
cluster1-haproxy-1                               2/2     Running            4 (18h ago)       22h   10.42.0.51   rt-chernomor   <none>           <none>
cluster1-haproxy-2                               2/2     Running            4 (18h ago)       22h   10.42.0.52   rt-chernomor   <none>           <none>
percona-server-mysql-operator-78ccf4bd45-67p2j   1/1     Running            2 (18h ago)       22h   10.42.0.47   rt-chernomor   <none>           <none>
cluster1-orc-0                                   2/2     Running            4 (18h ago)       22h   10.42.0.48   rt-chernomor   <none>           <none>
cluster1-orc-2                                   2/2     Running            4 (18h ago)       22h   10.42.0.50   rt-chernomor   <none>           <none>
cluster1-orc-1                                   2/2     Running            4 (18h ago)       22h   10.42.0.53   rt-chernomor   <none>           <none>
cluster1-mysql-0                                 3/3     Running            454 (41m ago)     18h   10.42.0.56   rt-chernomor   <none>           <none>
cluster1-mysql-2                                 2/3     CrashLoopBackOff   464 (2m24s ago)   18h   10.42.0.49   rt-chernomor   <none>           <none>
cluster1-mysql-1                                 1/3     CrashLoopBackOff   468 (26s ago)     18h   10.42.0.55   rt-chernomor   <none>           <none>

Bootstrap log from a slave pod:

$ kubectl -n mysql-test exec -it cluster1-mysql-1  -- tail -f /var/lib/mysql/bootstrap.log
Defaulted container "mysql" out of: mysql, xtrabackup, pt-heartbeat, mysql-init (init)
2024/06/26 07:54:31 bootstrap failed: select donor: connect to 10-42-0-49.cluster1-mysql-unready.mysql-test: ping DB: dial tcp 10.42.0.49:33062: connect: connection refused
2024/06/26 07:54:41 Peers: [10-42-0-49.cluster1-mysql-unready.mysql-test 10-42-0-55.cluster1-mysql-unready.mysql-test 10-42-0-56.cluster1-mysql-unready.mysql-test]
2024/06/26 07:54:41 bootstrap finished in 0.003150 seconds
2024/06/26 07:54:41 bootstrap failed: select donor: connect to 10-42-0-49.cluster1-mysql-unready.mysql-test: ping DB: dial tcp 10.42.0.49:33062: connect: connection refused
2024/06/26 07:54:51 Peers: [10-42-0-49.cluster1-mysql-unready.mysql-test 10-42-0-55.cluster1-mysql-unready.mysql-test 10-42-0-56.cluster1-mysql-unready.mysql-test]
2024/06/26 07:54:51 bootstrap finished in 0.003226 seconds
2024/06/26 07:54:51 bootstrap failed: select donor: connect to 10-42-0-49.cluster1-mysql-unready.mysql-test: ping DB: dial tcp 10.42.0.49:33062: connect: connection refused
2024/06/26 07:55:01 Peers: [10-42-0-49.cluster1-mysql-unready.mysql-test 10-42-0-55.cluster1-mysql-unready.mysql-test 10-42-0-56.cluster1-mysql-unready.mysql-test]
2024/06/26 07:55:01 bootstrap finished in 0.003110 seconds
2024/06/26 07:55:01 bootstrap failed: select donor: connect to 10-42-0-49.cluster1-mysql-unready.mysql-test: ping DB: dial tcp 10.42.0.49:33062: connect: connection refused
2024/06/26 08:00:31 Peers: [10-42-0-49.cluster1-mysql-unready.mysql-test 10-42-0-55.cluster1-mysql-unready.mysql-test 10-42-0-56.cluster1-mysql-unready.mysql-test]
2024/06/26 08:00:31 bootstrap finished in 0.003058 seconds
2024/06/26 08:00:31 bootstrap failed: select donor: connect to 10-42-0-49.cluster1-mysql-unready.mysql-test: ping DB: dial tcp 10.42.0.49:33062: connect: connection refused
2024/06/26 08:00:41 Peers: [10-42-0-49.cluster1-mysql-unready.mysql-test 10-42-0-55.cluster1-mysql-unready.mysql-test 10-42-0-56.cluster1-mysql-unready.mysql-test]
2024/06/26 08:00:41 bootstrap finished in 0.002679 seconds
2024/06/26 08:00:41 bootstrap failed: select donor: connect to 10-42-0-49.cluster1-mysql-unready.mysql-test: ping DB: dial tcp 10.42.0.49:33062: connect: connection refused
2024/06/26 08:00:51 Peers: [10-42-0-49.cluster1-mysql-unready.mysql-test 10-42-0-55.cluster1-mysql-unready.mysql-test 10-42-0-56.cluster1-mysql-unready.mysql-test]
2024/06/26 08:00:51 bootstrap finished in 0.003255 seconds
2024/06/26 08:00:51 bootstrap failed: select donor: connect to 10-42-0-49.cluster1-mysql-unready.mysql-test: ping DB: dial tcp 10.42.0.49:33062: connect: connection refused
2024/06/26 08:01:01 Peers: [10-42-0-49.cluster1-mysql-unready.mysql-test 10-42-0-55.cluster1-mysql-unready.mysql-test 10-42-0-56.cluster1-mysql-unready.mysql-test]
2024/06/26 08:01:01 bootstrap finished in 0.003455 seconds
2024/06/26 08:01:01 bootstrap failed: select donor: connect to 10-42-0-49.cluster1-mysql-unready.mysql-test: ping DB: dial tcp 10.42.0.49:33062: connect: connection refused
2024/06/26 08:01:11 Peers: [10-42-0-49.cluster1-mysql-unready.mysql-test 10-42-0-55.cluster1-mysql-unready.mysql-test 10-42-0-56.cluster1-mysql-unready.mysql-test]
2024/06/26 08:01:11 bootstrap finished in 0.002918 seconds
2024/06/26 08:01:11 bootstrap failed: select donor: connect to 10-42-0-49.cluster1-mysql-unready.mysql-test: ping DB: dial tcp 10.42.0.49:33062: connect: connection refused
command terminated with exit code 137

Steps to reproduce

  1. Set up a MySQL cluster on a single node
  2. Reboot the node
  3. The MySQL pods do not start

Versions

  1. Kubernetes
    k3s version v1.29.5+k3s1 (4e53a323)
    go version go1.21.9

  2. Operator
    83b9f60, v0.7.0

  3. Database
    mysql Ver 8.0.36-28 for Linux on x86_64 (Percona Server (GPL), Release 28, Revision 47601f19)

Anything else?

No response

@chernomor
Author

As far as I can see, bootstrapAsyncReplication expects all cluster peers from getTopology, but getTopology cannot connect to some peers because the MySQL pods are now in the CrashLoopBackOff state. I don't think it should require all pods to be available.

Another problem (or perhaps the first one), which is currently masked by sleep-forever: the master pod cannot start because it cannot resolve the primary name cluster1-mysql-0.cluster1-mysql.mysql-test retrieved from the replica status. This name does not resolve now because, while the pods are still starting, they only have names like 10-42-0-56.cluster1-mysql-unready.mysql-test (I could be wrong). I don't know how this can be fixed at the moment.

2024/06/25 16:51:32 Peers: [10-42-0-49.cluster1-mysql-unready.mysql-test 10-42-0-55.cluster1-mysql-unready.mysql-test 10-42-0-56.cluster1-mysql-unready.mysql-test]
2024/06/25 16:51:32 Primary: cluster1-mysql-0.cluster1-mysql.mysql-test Replicas: [cluster1-mysql-1.cluster1-mysql.mysql-test cluster1-mysql-2.cluster1-mysql.mysql-test]
2024/06/25 16:51:32 FQDN: cluster1-mysql-0.cluster1-mysql.mysql-test
2024/06/25 16:51:32 lookup cluster1-mysql-0 [10.42.0.56]
2024/06/25 16:51:32 PodIP: 10.42.0.56
2024/06/25 16:51:32 bootstrap finished in 0.021992 seconds
2024/06/25 16:51:32 bootstrap failed: get primary IP: lookup cluster1-mysql-0.cluster1-mysql.mysql-test: lookup cluster1-mysql-0.cluster1-mysql.mysql-test on 10.43.0.10:53: server misbehaving
2024/06/25 16:51:42 Peers: [10-42-0-49.cluster1-mysql-unready.mysql-test 10-42-0-55.cluster1-mysql-unready.mysql-test 10-42-0-56.cluster1-mysql-unready.mysql-test]
2024/06/25 16:51:42 Primary: cluster1-mysql-0.cluster1-mysql.mysql-test Replicas: [cluster1-mysql-1.cluster1-mysql.mysql-test cluster1-mysql-2.cluster1-mysql.mysql-test]
2024/06/25 16:51:42 FQDN: cluster1-mysql-0.cluster1-mysql.mysql-test
2024/06/25 16:51:42 lookup cluster1-mysql-0 [10.42.0.56]
2024/06/25 16:51:42 PodIP: 10.42.0.56
2024/06/25 16:51:42 bootstrap finished in 0.021340 seconds
2024/06/25 16:51:42 bootstrap failed: get primary IP: lookup cluster1-mysql-0.cluster1-mysql.mysql-test: lookup cluster1-mysql-0.cluster1-mysql.mysql-test on 10.43.0.10:53: server misbehaving

My changes to deploy/cr.yaml:

--- a/deploy/cr.yaml
+++ b/deploy/cr.yaml
@@ -31,7 +31,7 @@ spec:
 #      group: cert-manager.io
 
   mysql:
-    clusterType: group-replication
+    clusterType: async
     autoRecovery: true
     image: percona/percona-server:8.0.36-28
     imagePullPolicy: Always
@@ -58,9 +58,12 @@ spec:
 #      periodSeconds: 10
 #      failureThreshold: 3
 #      successThreshold: 1
+#
+    startupProbe:
+      failureThreshold: 5
 
     affinity:
-      antiAffinityTopologyKey: "kubernetes.io/hostname"
+       antiAffinityTopologyKey: "none"
 #      advanced:
