Pod level metrics are not being surfaced by the shim #180
I wonder if this may be the issue: containerd/cri#922. Specifically, we may need to add the …
@jprendes and I spent some time setting up GDB debugging with the shim for this. While we did not come to any new conclusions, I wanted to share our repro steps.

Debugging with GDB

```json
{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [
        {
            "type": "gdb",
            "request": "attach",
            "name": "Attach to PID",
            "target": "{PID}",
            "cwd": "${workspaceRoot}",
            "valuesFormatting": "parseText",
            "gdbpath": "/home/kagold/projects/containerd-shim-spin/_scratch/resources-debug/sudo-gdb.sh"
        }
    ]
}
```
@Mossaka and I spent today debugging this issue. We ended up making the most progress by building kubelet with debug symbols and attaching a debugger to that. I'll attach a HackMD with our steps tomorrow. For now, the main discovery was that kubelet is failing to find pod stats for the wasm pods because it is reading from the wrong cgroup:

```
cat /sys/fs/cgroup/kubepods.slice/kubepods-pod99afd520_5b3e_4c8d_8471_4b6c8ee26b4c.slice/cpu.stat
usage_usec 0
user_usec 0
system_usec 0
core_sched.force_idle_usec 0
nr_periods 2
nr_throttled 0
throttled_usec 0
nr_bursts 0
burst_usec 0
kagold@kagold-ThinkPad-X1-Carbon-6th:~$ cat /sys/fs/cgroup/kubepods-pod99afd520_5b3e_4c8d_8471_4b6c8ee26b4c.slice\:cri-containerd\:c339025b3f0d9bec908636b49764fbca659636f54ad45c2eddd281049eb4df00/cpu.stat
usage_usec 25442435
user_usec 13538017
system_usec 11904418
core_sched.force_idle_usec 0
nr_periods 43128
nr_throttled 130
throttled_usec 11639387
nr_bursts 0
burst_usec 0
```

This is the line in the kubelet that is setting pod CPU and memory stats to 0, because it grabbed the stats from the wrong cgroup. The work of gathering stats starts in `listPodStatsPartiallyFromCRI`. Kubelet gathers an indexed table of container info. The following is an example of that table, where the first two entries are from a Linux container that gets appropriate pod stats and the latter two are from the wasm container. Kubelet indexes this table with key …

Note that, in general, the wasm container's cgroup path looked different from the other containers' and doesn't conform to either of the examples commented inline in this function:
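To make the mismatch above concrete, here is a minimal sketch (not kubelet's actual code; the `cpu_usage_usec` helper is hypothetical) that reads `usage_usec` from the two cgroup v2 paths shown in the `cat` output: the pod slice that kubelet consults reports 0, while the shim's container cgroup, sitting at the cgroup root, carries the real usage.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper: read `usage_usec` out of a cgroup v2 `cpu.stat` file.
fn cpu_usage_usec(cgroup_dir: &Path) -> io::Result<u64> {
    let stat = fs::read_to_string(cgroup_dir.join("cpu.stat"))?;
    Ok(stat
        .lines()
        .find_map(|line| line.strip_prefix("usage_usec "))
        .and_then(|v| v.trim().parse().ok())
        .unwrap_or(0))
}

fn main() -> io::Result<()> {
    // The pod-level slice that kubelet reads for the wasm pod (reports 0)...
    let pod_slice = Path::new(
        "/sys/fs/cgroup/kubepods.slice/kubepods-pod99afd520_5b3e_4c8d_8471_4b6c8ee26b4c.slice",
    );
    // ...versus the cgroup the shim actually charged, at the cgroup root.
    let shim_cgroup = Path::new(
        "/sys/fs/cgroup/kubepods-pod99afd520_5b3e_4c8d_8471_4b6c8ee26b4c.slice:cri-containerd:c339025b3f0d9bec908636b49764fbca659636f54ad45c2eddd281049eb4df00",
    );
    println!("pod slice usage_usec:   {}", cpu_usage_usec(pod_slice)?);
    println!("shim cgroup usage_usec: {}", cpu_usage_usec(shim_cgroup)?);
    Ok(())
}
```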
Great find!
Here is a HackMD with debugging configuration for the shim, containerd, and kubelet: https://hackmd.io/@kgoldenring/Sy2ktGfSJe

Would any of this be useful to add to the repository as official documentation?
@jsturtevant @Mossaka I am not sure where the best place to look is. We know this is likely related to cgroups, either in how we are leveraging …
I wonder if it has something to do with the handling of not using systemd to manage cgroups: https://github.com/containerd/runwasi/blob/95853b4e3339d0509bc8ea195e742442268ff6a7/crates/containerd-shim-wasm/src/sys/unix/container/instance.rs#L62

We are hitting this block, which uses the cgroup v2 manager: https://github.com/youki-dev/youki/blob/7c67acc33e45f21f1038109c36adec44b0b505b1/crates/libcgroups/src/common.rs#L361
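To illustrate why the cgroup driver handling could matter here, below is a minimal sketch (the `expand_slice`, `resolve_systemd_style`, and `resolve_literally` helpers are hypothetical illustrations, not runwasi's or youki's actual code) of how a systemd-form OCI `cgroupsPath` (`slice:prefix:name`) resolves under a systemd-aware manager versus being taken literally by a cgroupfs-style manager. The literal form matches the path observed in the `cat` output earlier in this thread.

```rust
/// Hypothetical sketch: expand a systemd slice name into its on-disk path.
/// systemd nests slices by dash, so "kubepods-podX.slice" lives under
/// "kubepods.slice".
fn expand_slice(slice: &str) -> String {
    let stem = slice.trim_end_matches(".slice");
    let mut acc = String::new();
    let mut path = String::new();
    for seg in stem.split('-') {
        if !acc.is_empty() {
            acc.push('-');
        }
        acc.push_str(seg);
        path.push('/');
        path.push_str(&acc);
        path.push_str(".slice");
    }
    path
}

/// Roughly what a systemd-aware cgroup manager would produce from
/// "slice:prefix:name" on a cgroup v2 host (systemd name escaping omitted).
fn resolve_systemd_style(cgroups_path: &str) -> Option<String> {
    let mut parts = cgroups_path.splitn(3, ':');
    let (slice, prefix, name) = (parts.next()?, parts.next()?, parts.next()?);
    Some(format!(
        "/sys/fs/cgroup{}/{prefix}-{name}.scope",
        expand_slice(slice)
    ))
}

/// What results if the same string is treated as a literal path relative to
/// the cgroup root, as the observed cpu.stat path suggests.
fn resolve_literally(cgroups_path: &str) -> String {
    format!("/sys/fs/cgroup/{cgroups_path}")
}

fn main() {
    let p = "kubepods-pod99afd520_5b3e_4c8d_8471_4b6c8ee26b4c.slice:cri-containerd:c339025b3f0d9bec908636b49764fbca659636f54ad45c2eddd281049eb4df00";
    println!("systemd-aware: {}", resolve_systemd_style(p).unwrap());
    println!("literal:       {}", resolve_literally(p));
}
```

If the shim ends up creating and charging the literal path while kubelet aggregates from the systemd-managed pod slice, the pod-level cpu.stat stays at zero, which would be consistent with the behavior reported above.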
This issue seems related @Mossaka: containerd/runwasi#276
This is a summary of a thread from the #spinkube CNCF Slack. Thank you @asteurer for discovering this issue!
The issue

Initial discovery: When running a CPU-intensive Spin app with the shim, CPU usage reporting on the pod stays static as load/requests increase (the output of kubectl top pods does not change). This makes it impossible to use the Horizontal Pod Autoscaler with SpinKube. This is consistent for all of the following tested K8s distributions -- note that the only distribution that does not exhibit this behavior is K3d:

Repro steps
Output may look similar to the following:
Notice how the Pod CPU and memory usage values are 0 while the container has properly propagated values.
If Pod metrics were properly reported, the app replicas would increase.
```
# After port forwarding to port 3000
bombardier localhost:3000 -n 10 -t 30s
```
Calling the stats API during the load test shows that while the container usageNanoCores jumped from 4728 to 497486, the pod metrics did not change, nor did the app replica count or the output of kubectl top pods for that Pod.

Other investigation
Pod metrics are surfaced for normal containers not executed with the shim (without runtime class wasmtime-spin-v2 specified):

However, if that same container is executed with the shim, Pod metrics are no longer surfaced.
Possible solutions and areas to investigate
Some areas to investigate that @jsturtevant and @radu-matei mentioned are the following: