Spin application keeps restarting in AKS cluster #89
Comments
This line from the end of the log file is interesting, and it seems to come from containerd's metric collection logic: https://github.com/containerd/containerd/blob/2f807b606a58020174e1aecfdec63810201e05bc/core/metrics/cgroups/v2/metrics.go#L151
Could you please verify that your nodes are on cgroups v2 by running |
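The exact command suggested here isn't preserved in the thread; a common way to check which cgroup version a node is using (my assumption, not necessarily the command that was asked for) is:

```sh
# Prints "cgroup2fs" on cgroups v2 (unified hierarchy) nodes,
# and "tmpfs" on cgroups v1 nodes
stat -fc %T /sys/fs/cgroup/
```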
From the node where the app is restarting:
From the node where the app is not restarting:
|
I am thinking there might be some metrics that we report at the shim level that are not recognized by containerd. |
I just deployed another application to the cluster (to rule out the simple sample app). This one also restarts, but only on one of the nodes that the other app was restarting on. I've also noted that the restarts have not happened for the last 11 hours on the node where the old app used to restart, and that node does not restart the new app. So I don't know if something "fell into place" on that node. (Let me know if this becomes too confusing to keep track of.) However, here are the current restart stats for the two apps across the four nodes:
|
Without having done anything, the other nodes, which previously also restarted the spin-sample, have started restarting both the simple-sample and the new app:
|
Just added a third node pool using Mariner / Azure Linux; both nodes in that node pool are restarting the apps.
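For reference, a node pool like this can be added with the Azure CLI; the resource group, cluster, and pool names below are placeholders, and these exact flags are my assumption rather than what was actually run:

```sh
# Add an Azure Linux (Mariner) node pool to an existing AKS cluster
# (older CLI versions use --os-sku Mariner / CBLMariner instead of AzureLinux)
az aks nodepool add \
  --resource-group <my-rg> \
  --cluster-name <my-cluster> \
  --name marinerpool \
  --node-count 2 \
  --os-sku AzureLinux
```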
|
I checked on one of the nodes where this issue is happening and observed that it is returning cgroups v1 metrics:
Trying to find out why that is the case. |
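The comment doesn't show exactly how this was checked; one possible way to look at the metrics containerd gets back from the shim for a running task (assuming the k8s.io namespace that Kubernetes uses and a task ID taken from the list) would be:

```sh
# List running tasks in the Kubernetes namespace, then request a single
# metrics sample for one of them; inspecting the returned data shows
# whether cgroup v1 or v2 stats are being reported
ctr -n k8s.io task ls
ctr -n k8s.io task metrics <task-id>
```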
Another data point, but I am unable to make anything out of it right now. (I was expecting crictl to fail as well.)
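The crictl check referred to here was presumably something along these lines (my guess at the commands; crictl pulls container stats through the CRI API, a different code path than the containerd metrics collection shown in the log above):

```sh
# List containers known to the CRI runtime, then fetch their resource
# usage stats via the CRI API
crictl ps
crictl stats
```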
|
I added some debug logging to the shim and used that on the node having crashes:
So it seems like it is somehow receiving cgroup v1 metrics. |
I deployed the simple-app sample to an AKS cluster, and am observing that two of four pods are continuously restarting.
The cluster consists of four nodes (two are amd64, two are arm64). The restarts are happening on one node of each architecture. All nodes run Ubuntu 22.04.
I've obtained the containerd logs (using journalctl -u containerd) and attached them here. Let me know what else might be needed to troubleshoot this problem. I still have the cluster running, and the app deployed.
containerd-restarting.log
containerd-not-restarting.log
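For anyone reproducing this, the observations above can be gathered with standard tooling; the commands below are a sketch of that, not necessarily the exact invocations used:

```sh
# Show restart counts and the node each pod is scheduled on
kubectl get pods -o wide

# On an affected node, dump the containerd journal to a file for attaching
journalctl -u containerd --no-pager > containerd-restarting.log
```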