Spin application keeps restarting in AKS cluster #89

Open
mikkelhegn opened this issue Apr 18, 2024 · 9 comments

@mikkelhegn

I deployed the simple-app sample to an AKS cluster, and I'm observing that two of four pods are continuously restarting.

The cluster consists of four nodes (two amd64, two arm64). The restarts are happening on one node of each architecture. All nodes run Ubuntu 22.04.

I've obtained the containerd logs (using journalctl -u containerd) and attached them here. Let me know what else could be needed to troubleshoot this problem. I still have the cluster running and the app deployed.

containerd-restarting.log
containerd-not-restarting.log

@Mossaka
Member

Mossaka commented Apr 18, 2024

invalid metric type for ...

This line from the end of the log file is interesting; it seems to come from containerd's metric collection logic: https://github.com/containerd/containerd/blob/2f807b606a58020174e1aecfdec63810201e05bc/core/metrics/cgroups/v2/metrics.go#L151

Could you please verify that your nodes are on cgroups v2 by running mount | grep cgroup?

@mikkelhegn
Author

From the node where the app is restarting:

cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)
cgroup on /run/containerd/io.containerd.runtime.v2.task/k8s.io/79f773f2e2b1659aa4f3d7bb1d81a48666d3add27fce0b1b5d805a85d1021191/rootfs/sys/fs/cgroup type cgroup2 (ro,nosuid,nodev,noexec,relatime)

From the node where the app is not restarting:

cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)
cgroup on /run/containerd/io.containerd.runtime.v2.task/k8s.io/7f139ffd06a9c6a7d95d1e6c019b537d90232ee53b0d4f8417598669ad76db9d/rootfs/sys/fs/cgroup type cgroup2 (ro,nosuid,nodev,noexec,relatime)

@Mossaka
Member

Mossaka commented Apr 18, 2024

I am thinking there might be some metrics that we report at the shim level that are not recognized by containerd.
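
Roughly what I mean, as a sketch rather than the shim's actual code (containerd's collector is Go; this is written in Rust for consistency with the shim, assuming the rust-protobuf 3.x crate the shim already depends on): the stats response carries the serialized metrics in a protobuf Any, and containerd's cgroups v2 collector effectively dispatches on its type_url, so v1-typed metrics reported from a v2 node would be rejected as an "invalid metric type".

use protobuf::well_known_types::any::Any;

// Illustration only: the collector accepts the payload purely based on the
// Any's type_url, so labelling metrics as cgroup v1 on a unified-hierarchy
// node reproduces the "invalid metric type" log line seen above.
fn accept_v2_metrics(stats: &Any) -> Result<&[u8], String> {
    match stats.type_url.as_str() {
        "io.containerd.cgroups.v2.Metrics" => Ok(stats.value.as_slice()),
        other => Err(format!("invalid metric type {other}")),
    }
}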

@mikkelhegn
Author

I just deployed another application to the cluster (to rule out the simple sample app).

This one also restarts, but only on one of the nodes that the other app was restarting on. I've also noticed that the restarts have not happened for the last 11 hours on the node where the old app used to restart, and this node does not restart the new app. So I don't know if something "fell into place" on that node. (Let me know if this becomes too confusing to keep track of.)

However, here are the current restart stats for the two apps across the four nodes:

NAME                                                    READY   STATUS             RESTARTS        AGE   IP             NODE                                NOMINATED NODE   READINESS GATES
simple-spinapp-56687588d9-7d8zf                         1/1     Running            260 (11h ago)   38h   10.244.0.35    aks-nodepool1-31969921-vmss000001   <none>           <none>
simple-spinapp-56687588d9-s5g95                         1/1     Running            364 (21m ago)   38h   10.244.3.118   aks-nodepool2-19133108-vmss000001   <none>           <none>
simple-spinapp-56687588d9-xffsh                         1/1     Running            0               38h   10.244.2.8     aks-nodepool2-19133108-vmss000000   <none>           <none>
simple-spinapp-56687588d9-xh8l7                         1/1     Running            0               38h   10.244.1.14    aks-nodepool1-31969921-vmss000000   <none>           <none>
spinkube-live-mikkel-75cc76c678-4r7rl                   0/1     CrashLoopBackOff   7 (2m25s ago)   38m   10.244.3.123   aks-nodepool2-19133108-vmss000001   <none>           <none>
spinkube-live-mikkel-75cc76c678-f7476                   1/1     Running            0               38m   10.244.0.36    aks-nodepool1-31969921-vmss000001   <none>           <none>
spinkube-live-mikkel-75cc76c678-fm8dd                   1/1     Running            0               38m   10.244.1.15    aks-nodepool1-31969921-vmss000000   <none>           <none>
spinkube-live-mikkel-75cc76c678-j5pvj                   1/1     Running            0               38m   10.244.2.9     aks-nodepool2-19133108-vmss000000   <none>           <none>

@mikkelhegn
Author

Without me having done anything, the other node, which previously also restarted the spin sample, has started restarting both the simple sample and the new app:

NAME                                    READY   STATUS             RESTARTS          AGE     IP             NODE                                NOMINATED NODE   READINESS GATES
simple-spinapp-56687588d9-7d8zf         1/1     Running            307 (6m28s ago)   43h     10.244.0.130   aks-nodepool1-31969921-vmss000001   <none>           <none>
simple-spinapp-56687588d9-s5g95         0/1     CrashLoopBackOff   407 (80s ago)     43h     10.244.3.208   aks-nodepool2-19133108-vmss000001   <none>           <none>
simple-spinapp-56687588d9-xffsh         1/1     Running            0                 43h     10.244.2.8     aks-nodepool2-19133108-vmss000000   <none>           <none>
simple-spinapp-56687588d9-xh8l7         1/1     Running            0                 43h     10.244.1.14    aks-nodepool1-31969921-vmss000000   <none>           <none>
spinkube-live-mikkel-75cc76c678-4r7rl   1/1     Running            51 (39s ago)      5h19m   10.244.3.209   aks-nodepool2-19133108-vmss000001   <none>           <none>
spinkube-live-mikkel-75cc76c678-f7476   0/1     CrashLoopBackOff   47 (4m48s ago)    5h19m   10.244.0.131   aks-nodepool1-31969921-vmss000001   <none>           <none>
spinkube-live-mikkel-75cc76c678-fm8dd   1/1     Running            0                 5h19m   10.244.1.15    aks-nodepool1-31969921-vmss000000   <none>           <none>
spinkube-live-mikkel-75cc76c678-j5pvj   1/1     Running            0                 5h19m   10.244.2.9     aks-nodepool2-19133108-vmss000000   <none>           <none>

@mikkelhegn
Author

Just added a third node pool using Mariner / Azure Linux; both nodes in that node pool are restarting the apps.

NAME                                    READY   STATUS             RESTARTS         AGE     IP             NODE                                NOMINATED NODE   READINESS GATES
simple-spinapp-56687588d9-7d8zf         1/1     Running            307 (23m ago)    43h     10.244.0.130   aks-nodepool1-31969921-vmss000001   <none>           <none>
simple-spinapp-56687588d9-jwp9b         1/1     Running            1 (2m1s ago)     2m34s   10.244.4.5     aks-nodepool3-26262681-vmss000000   <none>           <none>
simple-spinapp-56687588d9-s5g95         0/1     CrashLoopBackOff   409 (2m6s ago)   43h     10.244.3.215   aks-nodepool2-19133108-vmss000001   <none>           <none>
simple-spinapp-56687588d9-wdt9b         1/1     Running            1 (2m1s ago)     2m34s   10.244.5.6     aks-nodepool3-26262681-vmss000001   <none>           <none>
simple-spinapp-56687588d9-xffsh         1/1     Running            0                43h     10.244.2.8     aks-nodepool2-19133108-vmss000000   <none>           <none>
simple-spinapp-56687588d9-xh8l7         1/1     Running            0                43h     10.244.1.14    aks-nodepool1-31969921-vmss000000   <none>           <none>
spinkube-live-mikkel-75cc76c678-4r7rl   1/1     Running            55 (4m46s ago)   5h37m   10.244.3.214   aks-nodepool2-19133108-vmss000001   <none>           <none>
spinkube-live-mikkel-75cc76c678-dx8zv   1/1     Running            2 (50s ago)      2m23s   10.244.5.7     aks-nodepool3-26262681-vmss000001   <none>           <none>
spinkube-live-mikkel-75cc76c678-f7476   0/1     CrashLoopBackOff   50 (3m19s ago)   5h37m   10.244.0.136   aks-nodepool1-31969921-vmss000001   <none>           <none>
spinkube-live-mikkel-75cc76c678-fm8dd   1/1     Running            0                5h37m   10.244.1.15    aks-nodepool1-31969921-vmss000000   <none>           <none>
spinkube-live-mikkel-75cc76c678-j5pvj   1/1     Running            0                5h37m   10.244.2.9     aks-nodepool2-19133108-vmss000000   <none>           <none>
spinkube-live-mikkel-75cc76c678-q88q5   1/1     Running            1 (50s ago)      2m23s   10.244.4.6     aks-nodepool3-26262681-vmss000000   <none>           <none>

@rajatjindal
Member

I checked on one of the nodes where this issue is happening and observed that the shim is returning cgroups v1 metrics:

root@aks-nodepool1-31969921-vmss000001:/# /tmp/ttrpc-client /run/containerd/f1e518aba0e9dccf.sock 94c1433b03b96c984b1bf15773e409682ae64b6d697f7392c1bb37256f492bca

type_url:"io.containerd.cgroups.v1.Metrics"  value:"\x12\x03\x08\xee\x01\x1a\x16\n\x12\x08\xfdܙ\xae\n\x10\xe4\xfd\xa0\xf9\x03\x18\x98\xdf\xf8\xb4\x06\x12\x00\"\t\x8a\x02\x06\x10\x80\xe0\x9c\xff\x02"

Trying to find out why that is the case.
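
For reference, /tmp/ttrpc-client is just a quick one-off tool that calls the shim's Task/Stats method over its ttrpc socket and prints the response. A rough equivalent in Rust, assuming the ttrpc and containerd-shim-protos crates (module paths and the generated stats signature are from memory, so treat them as approximate):

use containerd_shim_protos::api::StatsRequest;
use containerd_shim_protos::shim::shim_ttrpc::TaskClient;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Socket path and container id are the ones used in the command above.
    let conn = ttrpc::Client::connect("unix:///run/containerd/f1e518aba0e9dccf.sock")?;
    let task = TaskClient::new(conn);

    let mut req = StatsRequest::new();
    req.id = "94c1433b03b96c984b1bf15773e409682ae64b6d697f7392c1bb37256f492bca".to_string();

    // On the affected node this prints a response whose Any type_url is
    // io.containerd.cgroups.v1.Metrics, even though the node uses cgroup v2.
    let resp = task.stats(ttrpc::context::with_timeout(0), &req)?;
    println!("{:?}", resp);
    Ok(())
}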

@rajatjindal
Member

Another data point, though I am unable to make anything of it right now (I was expecting crictl to fail as well):

root@aks-nodepool1-31969921-vmss000001:/# crictl ps | grep spin
01748f8974627       bca3c4d784227       59 minutes ago      Running             simple-spinapp          323                 94c1433b03b96       simple-spinapp-56687588d9-7d8zf
f4ff19248a4ef       f99fcd5dfb557       About an hour ago   Running             spinkube-live-mikkel    62                  b0540e08d14db       spinkube-live-mikkel-75cc76c678-f7476

crictl stats f4ff19248a4ef
CONTAINER           NAME                   CPU %               MEM                 DISK                INODES
f4ff19248a4ef       spinkube-live-mikkel   0.00                33.43MB             7.881MB             23

crictl stats 01748f8974627
CONTAINER           NAME                CPU %               MEM                 DISK                INODES
01748f8974627       simple-spinapp      0.00                10.28MB             90.11kB             26

@rajatjindal
Member

I added some debug logging to the shim and used it on the node having crashes:

Apr 23 17:05:22 aks-nodepool1-31969921-vmss000003 containerd[212274]: time="2024-04-23T17:05:22.943002063Z" level=info msg="stats: StatsRequest { id: "2008530c02eec982e25a6ebe9f5e005b0704b4ed8546f7c7b5f3e104f9c17f40", special_fields: SpecialFields { unknown_fields: UnknownFields { fields: None }, cached_size: CachedSize { size: 0 } } }"

Apr 23 17:05:22 aks-nodepool1-31969921-vmss000003 containerd[212274]: time="2024-04-23T17:05:22.945109028Z" level=info msg="stats resp: StatsResponse { stats: MessageField(Some(Any { type_url: "io.containerd.cgroups.v1.Metrics", value: [18, 2, 8, 5, 26, 16, 10, 12, 8, 159, 181, 14, 16, 154, 197, 7, 24, 133, 240, 6, 18, 0, 34, 8, 138, 2, 5, 16, 128, 128, 169, 16], special_fields: SpecialFields { unknown_fields: UnknownFields { fields: None }, cached_size: CachedSize { size: 0 } } })), special_fields: SpecialFields { unknown_fields: UnknownFields { fields: None }, cached_size: CachedSize { size: 0 } } }"

So it seems like containerd is somehow receiving cgroup v1 metrics from the shim.
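
If that's what is happening, the shim presumably needs to pick the metrics type from the node's actual hierarchy rather than defaulting to v1. A sketch of the check, assuming the cgroups-rs crate (expected_metrics_type_url is only an illustrative helper):

// On these AKS nodes /sys/fs/cgroup is mounted as cgroup2 (see the mount
// output earlier in the thread), so is_cgroup2_unified_mode() should return
// true and the stats response should be labelled with the v2 type_url.
fn expected_metrics_type_url() -> &'static str {
    if cgroups_rs::hierarchies::is_cgroup2_unified_mode() {
        "io.containerd.cgroups.v2.Metrics"
    } else {
        "io.containerd.cgroups.v1.Metrics"
    }
}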

@Mossaka Mossaka self-assigned this Apr 23, 2024