WASM Pod stuck terminating in Kubernetes #418
Comments
Maybe it's stuck in the libcontainer part...? |
I'm also seeing this with containerd-shim-spin on Ubuntu 22.04: containerd logs (looping between StopPodSandbox/StopContainer for two wasi pods):
Given that the kubelet side of these logs is filled with timeouts, being stuck somewhere in libcontainer would not be too surprising. I'm not really a Rust person, but I'll try to get a runwasi dev environment set up later to at least get a little more insight into what could be failing here. |
I think this might be a bug that has been resolved in newer versions of the shims. This issue leads me to believe that a previous version of some of the underlying shared shim implementation had a problem terminating, which is resolved when using a newer version of the shim (via the kwasm node installer): ZEISS/enterprise-wasm#4 (comment). However, we should add integration tests that scale a deployment up and down and verify the pods are properly terminated (a rough sketch follows below). |
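A minimal sketch of such a check, not from the thread; the deployment name and label are hypothetical:

```sh
# Scale a wasi deployment up, then back down, and fail if the pods never disappear.
kubectl scale deployment wasi-demo --replicas=3
kubectl rollout status deployment wasi-demo --timeout=120s
kubectl scale deployment wasi-demo --replicas=0
# If the shim never terminates its tasks, this wait times out with pods stuck in Terminating.
kubectl wait --for=delete pod -l app=wasi-demo --timeout=120s
```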
@mochizuki875 May I ask you to give it a try with the latest version? |
@devigned I saw that and was surprised (because I was using the latest kwasm-operator release, 0.2.3, which pulls in the latest node-installer image). So I think there might be something weird going on 😅 |
@utam0k |
BTW is the latest version v0.4.0? |
The latest version of … I would recommend building the WasmEdge shim yourself. |
Is there some case where we aren't sending a container exit event and/or not releasing the |
To expand on that: The cri module in containerd is listening on the |
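For reference, here is a minimal sketch (not from the thread) of watching containerd's event stream to see whether the shim ever publishes a task exit event. The k3s socket path and the `k8s.io` namespace are assumptions based on the setup described later in this thread.

```go
package main

import (
	"context"
	"fmt"
	"log"

	containerd "github.com/containerd/containerd"
	"github.com/containerd/containerd/namespaces"
)

func main() {
	// Connect to the k3s-embedded containerd (path is an assumption).
	client, err := containerd.New("/run/k3s/containerd/containerd.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	ctx := namespaces.WithNamespace(context.Background(), "k8s.io")

	// Only watch task exit events. If nothing is printed here while a pod
	// sits in Terminating, the shim is likely not emitting the event.
	ch, errs := client.Subscribe(ctx, `topic=="/tasks/exit"`)
	for {
		select {
		case env := <-ch:
			fmt.Printf("%v %s %s\n", env.Timestamp, env.Namespace, env.Topic)
		case err := <-errs:
			log.Fatal(err)
		}
	}
}
```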
I am trying to debug this and found one data point that may be useful. If you look at the output below, the PID is 0 and the status is UNKNOWN. (Some lines removed from the output for readability.)
Upon further checking, one of them is the pause container and the other is the WASM workload that I am trying to run. Environment: k3s-based single-node cluster on Ubuntu 22. |
Hello folks, thank you for the suggestions and pointers during the runwasi community meeting. As discussed there, the following are the steps I used to reproduce the issue:
1. Start a k3s cluster.
2. Download the latest version of containerd-shim-spin.
3. Configure containerd to allow running Spin WASM workloads: add the following two lines to /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl (a sketch of these lines follows below).
4. Restart k3s.
5. Run `ctr task list` and observe that it returns quickly.
6. Create a Spin workload deployment.
7. Observe that `ctr task list` now takes a long time and returns the WASM workload/pause containers with PID 0 and status UNKNOWN.
8. Try deleting the pod and observe that it remains stuck.
Attachment: containerd.log. I tried to run through these steps after cleaning up my dev machine; if any of the steps above does not work as expected, please let me know and I can try to check why that is. Thanks again. |
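For reference, the two lines mentioned in step 3 are typically the runtime registration for the Spin shim. This is a sketch based on the general containerd-shim-spin setup docs rather than the exact lines used here, and the runtime_type suffix can differ between shim versions:

```toml
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.spin]
  runtime_type = "io.containerd.spin.v2"
```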
@rajatjindal pointed out that this may be related to this issue: containerd/ttrpc#72 |
I wrote a minimal ttrpc client to debug this: ttrpc-client/main.go
When I run the above client against a working container, it prints the
but with the shim for the WASM workload it hangs:
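The linked ttrpc-client/main.go is not reproduced in this thread; a minimal sketch of such a client, taking the shim socket path and container ID as arguments and calling the containerd task v2 ttrpc API, might look like this:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"net"
	"os"
	"time"

	task "github.com/containerd/containerd/api/runtime/task/v2"
	"github.com/containerd/ttrpc"
)

func main() {
	// usage: ttrpc-client <shim-socket-path> <container-id>
	sock, id := os.Args[1], os.Args[2]

	conn, err := net.DialTimeout("unix", sock, 5*time.Second)
	if err != nil {
		log.Fatal(err)
	}

	client := ttrpc.NewClient(conn)
	defer client.Close()
	svc := task.NewTaskClient(client)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Connect returns the shim and task pids. If the shim's ttrpc server is
	// wedged, this call blocks until the context deadline instead.
	resp, err := svc.Connect(ctx, &task.ConnectRequest{ID: id})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("shim pid: %d, task pid: %d, version: %s\n", resp.ShimPid, resp.TaskPid, resp.Version)
}
```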
Further, I tried to check whether the socket is listening and who owns it. With a working container (non-WASM workload), notice the output has a line with LISTENING and that there is a …

## check which process is holding the ttrpc socket file
netstat --all --program | grep ff1283bad53b
unix 3 [ ] STREAM CONNECTED 2907755 290017/containerd-s /run/containerd/s/ff1283bad53b7c324482797129a32896822a7693a8bd21c8b375b7e22914e543
unix 2 [ ACC ] STREAM LISTENING 2766418 290017/containerd-s /run/containerd/s/ff1283bad53b7c324482797129a32896822a7693a8bd21c8b375b7e22914e543
unix 3 [ ] STREAM CONNECTED 2906647 290017/containerd-s /run/containerd/s/ff1283bad53b7c324482797129a32896822a7693a8bd21c8b375b7e22914e543
## check the process details
root 290017 1 0 Mar12 ? 00:00:03 /var/lib/rancher/k3s/data/a3b46c0299091b71bfcc617b1e1fec1845c13bdd848584ceb39d2e700e702a4b/bin/containerd-shim-runc-v2 -namespace k8s.io -id 27cdf74b6c01c75f6d5f873d974cf7d0dc395016bd7742c73f307dfdbaf41539 -address /run/k3s/containerd/containerd.sock
65535 290103 290017 0 Mar12 ? 00:00:00 /pause
ubuntu 290369 290017 0 Mar12 ? 00:01:47 /metrics-server --cert-dir=/tmp --secure-port=10250 --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname --kubelet-use-node-status-port --metric-resolution=15s --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305
While for the WASM workload, the shim output looks as follows (notice some CONNECTING entries in the output but no LISTENING, and also no …
Update: maybe not having a …
|
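A hedged note on how the socket path used in the commands above can be located for a given container: runtime v2 shims record their ttrpc address in an address file under containerd's state directory. The k3s path below is an assumption for a default install:

```sh
# container ID as shown by `ctr -n k8s.io task list`
ID=<container-id>
cat /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/$ID/address
# e.g. unix:///run/containerd/s/ff1283bad53b7c32...
```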
going further into this debug direction:
the number |
Just an update on debugging this:
The above steps, in addition to the comments I added to the issue earlier, make me feel that I am still making progress in the right direction, and I will continue to debug along these lines for now. If you have any insights as to why we might be seeing this ttrpc behavior (or if you think this is not the right direction for debugging this issue), they would be very helpful. |
Thanks for digging into this @rajatjindal; I believe you are heading in the right direction. I could reproduce this issue with my local k3s cluster on Ubuntu. Will do more investigation on it. |
I can verify that this happens on both the latest and older versions of k3s (e.g. |
Interestingly, I could not reproduce this issue with the wasmtime shim. Applying wasmtime workloads:
sudo ctr -n k8s.io --address /run/k3s/containerd/containerd.sock task list
TASK PID STATUS
153872d5e15fa359d51cfd989fd0b4eda61f57865aa661c1c97d692fb33fa36f 1166646 RUNNING
211afdce974e6be87ea9ef44627805931c1a018868e6822133400474e0a615fc 1167640 RUNNING
2ef159fb84388f3640e1fa0267cbc46d0917ffb9111877b3c0d7e9422aea088d 1167449 RUNNING
3e07e808382fdb65b205f87ad6e29874fd58012809784f2e6991fca10800bdee 1166386 RUNNING
bc9880cf00931317d3001805a65f56f0cdca1b8ee7c4a92c04b903c0dbacf46f 1166290 RUNNING
6215eb541460ed8e5f42007970a1ca7aac63e6aef6576371cadd543e9792d562 1170221 RUNNING
6171486f830bf174ea29dc233ad2584bb4293296e263d131aebcb17e67cbf923 1167881 RUNNING
5ca536c77fc9dd77255312352687e99762e7603b569869b9683b7f70ee40c3d9 1170077 RUNNING
dcfb4f3005ebcf652770485d277687355ab3980c83c255b1a78442ee762bf13e 1170078 RUNNING
0ea09d624e471804bee930879f6f207b85ffd1fd6c8d5ca2eda47df3991aa016 1170220 RUNNING
e8c13301e036bd9185ddb9a2c507b8e4f60d318b36f3a7e373cfecf6a7b6223e 1170222 RUNNING
0dfbcf4bc8d76581aecede1a1fcb27c7e48ec445ca0aac486c0ba282553ffd81 1166301 RUNNING
18651052148a7e1e0ee358a974fd4a13e0cdf9bd2f2f2abbdcb377062a9716b6 1167563 RUNNING
90c5b1dc67a725d58f51c130878c1192f3a3d8ff6e3396dc1ada9462e43f67f8 1167693 RUNNING
e2eb4ed9372ea8fa933f8251dc5bc8d14eae87bee056d031c184673f61710b71 1166597 RUNNING
0fe7d7697e8d741d54b6166c4251e3a52b0e47d70f9c256fbb6e034ad2adcbf6 1169927 RUNNING
c0b943ee316c38a4cfd846eb2c4300d095ed48148c1fa724aabad43723b2f5b6 1170082 RUNNING
1701687094cf866f0de01a6b50b4d5bc53822265d315866e27a7c28e30afccbd 1166533 RUNNING
ee01e6a7c189473726781ae8703b17e94a3790eeac157441901b13265d6a7699 1169982 RUNNING
fd3f5fe4d9354f283db253ef5db20cedf17b8a0dd7b706c78bd9234e52135ee2 1169995 RUNNING
There were no UNKNOWN / pid = 0 tasks.
Deleting the workloads:
k get po
NAME READY STATUS RESTARTS AGE
wasi-demo-6c5dcf6ddb-mbnb5 2/2 Terminating 0 98s
wasi-demo-6c5dcf6ddb-v75l8 2/2 Terminating 0 98s
wasi-demo-6c5dcf6ddb-qn8k7 2/2 Terminating 0 98s
k get po
No resources found in default namespace.
But it seems like the shim processes are still running.
I've attached containerd.log here in case you find it useful. |
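To confirm the leftover shims mentioned above, a quick check might look like this (the process name assumes the wasmtime shim binary from runwasi):

```sh
# shim processes that should have exited once their pods were deleted
ps -ef | grep -v grep | grep containerd-shim-wasmtime
```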
Is the problem exclusive to the |
Is the problem in the go codebase in |
Hi @jprendes, I am still quite new to the containerd/shim workflow and codebase and am trying out different things, so I cannot say for sure whether the issue is on the Go side of … I have been debugging this more, and one more thing I noticed is a difference in the format of the cgroup path (refer below). I am not entirely sure if this is expected, but it seems that for the path as seen for the Spin container, the above code will crash.
Update: I can confirm that once I comment out the code that calls cgroup_metrics here
|
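To illustrate the two cgroup path formats being discussed (the concrete paths below are hypothetical examples, not taken from this cluster): with the systemd cgroup driver the runtime spec carries a colon-separated slice:prefix:name triple, while with the cgroupfs driver it is a plain path, and code that assumes one form can fail on the other. A small Go sketch of the distinction:

```go
package main

import (
	"fmt"
	"strings"
)

// cgroupForm reports whether a cgroupsPath from the runtime spec looks like a
// systemd "slice:prefix:name" triple or a plain cgroupfs path.
func cgroupForm(path string) string {
	if parts := strings.Split(path, ":"); len(parts) == 3 {
		return "systemd driver (slice:prefix:name)"
	}
	return "cgroupfs path"
}

func main() {
	// Hypothetical examples of the two formats.
	fmt.Println(cgroupForm("kubepods-besteffort-pod1234.slice:cri-containerd:27cdf74b6c01"))
	fmt.Println(cgroupForm("/kubepods/besteffort/pod1234/27cdf74b6c01"))
}
```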
closed with #254 |
What happened?
I attempted to deploy a WASM Pod on Kubernetes, and the WASM Pod was successfully created and running.
However, when I delete the WASM Pod, it is not terminated and its status remains Terminating.
I don't have a deep understanding of this area and don't know whether this is definitely an issue in runwasi.
What did you expect to happen?
The WASM Pod is terminated.
How can we reproduce it (as minimally and precisely as possible)?
Environment
The architecture is as follows:
- OS: Ubuntu 22.04.3 LTS
- Kubernetes: v1.28.4
- Containerd: v1.7.10
- runwasi (containerd-shim-wasmedge): v0.3.0
- WasmEdge: 0.13.5
Containerd's runtime configuration is as follows (/etc/containerd/config.toml):
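The attached config is not reproduced in this issue; the relevant runtime registration typically looks like the following sketch (based on the runwasi documentation for the WasmEdge shim; treat it as an assumption rather than the exact file used here):

```toml
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.wasmedge]
  runtime_type = "io.containerd.wasmedge.v1"
```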
The Kubernetes RuntimeClass for WasmEdge is as follows (wasmedge-runtimeclass.yaml):
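The attached wasmedge-runtimeclass.yaml is not reproduced either; a typical RuntimeClass for this setup would be (the handler name must match the runtime key registered in the containerd config above):

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: wasmedge
handler: wasmedge
```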
The WASM Pod manifest is as follows (wasm-pod-manifest.yaml):
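The attached wasm-pod-manifest.yaml is likewise not reproduced; a sketch of such a Pod follows. The image and command are assumptions (the wasi demo app published by the runwasi project), as is the pod name matching the transcript below:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: wasm-example
spec:
  runtimeClassName: wasmedge
  containers:
    - name: wasm-example
      image: ghcr.io/containerd/runwasi/wasi-demo-app:latest
      command: ["/wasi-demo-app.wasm"]
```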
Reproduction steps
1. Create the WASM Pod; it will be successfully created and running.
2. Delete the WASM Pod; it will not finish terminating.
$ kubectl delete pod wasm-example
pod "wasm-example" deleted
(it is stuck here)

$ kubectl get pod
NAME           READY   STATUS        RESTARTS   AGE
wasm-example   1/1     Terminating   0          22m
Then Containerd's log is as follows:
Anything else we need to know?
Ubuntu 20.04.6 LTS.
$ kubectl apply -f wasm-pod-manifest-2.yaml
pod/wasm-example-2 created

$ kubectl get pod
NAME             READY   STATUS    RESTARTS   AGE
wasm-example     1/1     Running   0          7m40s
wasm-example-2   1/1     Running   0          42s

$ kubectl delete pod wasm-example-2
pod "wasm-example-2" deleted
(Successfully terminated)