This repository has been archived by the owner on Jun 28, 2024. It is now read-only.

CC: newly pulled pause image by snapshotter stored in an unexpected location #5781

Open
BbolroC opened this issue Oct 12, 2023 · 9 comments
Labels
bug (Incorrect behaviour), needs-review (Needs to be assessed by the team)

Comments

@BbolroC
Member

BbolroC commented Oct 12, 2023

Description of problem

With the configuration IMAGE_OFFLOAD_TO_GUEST=yes and FORKED_CONTAINERD=no, pod creation under IBM Z Secure Execution (SE) sometimes gets stuck in a CreateContainerError state with the following error:

Error: failed to create containerd container: create instance 697: object with key "697" already exists: unknown

This is a known issue with upstream containerd v1.6.8 (#5775 (comment)). A quick remedy would be to remove the pause image and let the snapshotter pull it again (a rough sketch of this remedy is included at the end of this description). However, the newly pulled image is stored in an unexpected location (/run/kata-containers/shared/sandboxes/${sandbox_id}/shared is expected), as follows:

# ls -lah /run/kata-containers/shared/sandboxes/a322d916b5dc547d1dce178d31b13091418793a9675a8aa006fcfecd49f8bbc1/shared
total 16K
drwxr-x--- 3 root root 160 Oct 12 11:04 .
drwx------ 5 root root 100 Oct 12 11:04 ..
-rw-r--r-- 1 root root 103 Oct 12 11:04 a322d916b5dc547d1dce178d31b13091418793a9675a8aa006fcfecd49f8bbc1-e9967091f9448d8a-resolv.conf
-rw-r--r-- 1 root root  11 Oct 12 11:04 efde0bf9b12e2e127bdb007f58e4dfb893d990fc64b8063f9594c1c1753c06ce-44e4e6f3b60b2926-hostname
-rw-r--r-- 1 root root 103 Oct 12 11:04 efde0bf9b12e2e127bdb007f58e4dfb893d990fc64b8063f9594c1c1753c06ce-4c6bb0d5b7fc98ff-resolv.conf
-rw-rw-rw- 1 root root   0 Oct 12 11:04 efde0bf9b12e2e127bdb007f58e4dfb893d990fc64b8063f9594c1c1753c06ce-83476f850307d009-termination-log
-rw-r--r-- 1 root root 205 Oct 12 11:04 efde0bf9b12e2e127bdb007f58e4dfb893d990fc64b8063f9594c1c1753c06ce-844b44105b991bcd-hosts
drwxrwxrwt 3 root root 140 Oct 12 11:04 efde0bf9b12e2e127bdb007f58e4dfb893d990fc64b8063f9594c1c1753c06ce-ab6d937a4d086125-serviceaccount
# ls -lah /run/containerd/io.containerd.runtime.v2.task/k8s.io/a322d916b5dc547d1dce178d31b13091418793a9675a8aa006fcfecd49f8bbc1/
total 28K
drwx------  3 root root  200 Oct 12 11:04 .
drwx--x--x 20 root root  400 Oct 12 11:04 ..
-rw-r--r--  1 root root   89 Oct 12 11:04 address
-rw-r--r--  1 root root 8.4K Oct 12 11:04 config.json
prwx------  1 root root    0 Oct 12 11:07 log
-rw-r--r--  1 root root  101 Oct 12 11:04 monitor_address
drwx--x--x  2 root root   40 Oct 12 11:04 rootfs
-rw-------  1 root root   32 Oct 12 11:04 shim-binary-path
-rw-r--r--  1 root root    7 Oct 12 11:04 shim.pid
lrwxrwxrwx  1 root root  121 Oct 12 11:04 work -> /var/lib/containerd/io.containerd.runtime.v2.task/k8s.io/a322d916b5dc547d1dce178d31b13091418793a9675a8aa006fcfecd49f8bbc1

This leads to a failure of the test `Test can pull an unencrypted image inside the guest`.

This could be resolved by bumping containerd to v1.7, but that is not an option at the moment.

The error only seems to happen on http://jenkins.katacontainers.io/job/kata-containers-CCv0-ubuntu-20.04-s390x-SE-daily/. We could skip the test until the update is done.
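For reference, a rough sketch of the quick remedy mentioned above (removing the pause image so the snapshotter has to pull it again). This is a sketch only, assuming crictl is pointed at the containerd instance in use; the pause image reference shown is hypothetical and should be replaced with whatever the cluster actually uses.

```
# Minimal sketch, not the exact commands used here:
# 1. locate the cached pause image on the host,
# 2. remove it so the next sandbox creation pulls it via the (nydus) snapshotter.
crictl images | grep pause            # find the pause image reference / ID
crictl rmi registry.k8s.io/pause:3.9  # hypothetical reference; use the one listed above
# Recreate the pod afterwards so the image is pulled again through the snapshotter.
```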

BbolroC added the bug (Incorrect behaviour) and needs-review (Needs to be assessed by the team) labels on Oct 12, 2023
BbolroC added a commit to BbolroC/tests that referenced this issue Oct 12, 2023
This PR skips the test `Test can pull an unencrypted image inside the guest` for IBM Z Secure Execution until containerd is updated to v1.7.

Fixes: kata-containers#5781

Signed-off-by: Hyounggyu Choi <[email protected]>
@fitzthum

Btw, this issue also shows up on other platforms and has surfaced across multiple PRs. It seems likely that this would also affect users deploying our upcoming release.

@BbolroC
Member Author

BbolroC commented Oct 17, 2023

Btw, this issue also shows up on other platforms and has surfaced across multiple PRs. It seems likely that this would also affect users deploying our upcoming release.

If this issue also applies to other platforms, it would affect users running a cluster (containerd 1.6.x) created without the snapshotter. What do you think? @stevenhorsman @fidencio

@stevenhorsman
Member

So I think there are potentially two separate things going on that may or may not be related:

Error: failed to create containerd container: create instance 697: object with key "697" already exists: unknown

errors, which we've seen a few times on different platforms, and

[ ${#rootfs[@]} -eq 1 ] 

check failures, which we've only seen on the s390x system. So either the two are unrelated, or most of the `key already exists` errors have happened on the AMD nodes, which don't run the same tests, so we simply wouldn't see the second failure there. Either way, I think we should potentially separate these issues.

@BbolroC
Member Author

BbolroC commented Oct 17, 2023

Yeah, I was thinking the same while writing the comment. I would say the latter is not what @fitzthum wanted to bring to the table. We need to discuss whether the `object with key "xxx" already exists` issue will affect users in the next release.

stevenhorsman added a commit to stevenhorsman/tests that referenced this issue Nov 6, 2023
In the kubernetes agent_image test we currently have a check:
```
echo "Check the image was not pulled in the host"
	local pod_id=$(kubectl get pods -o jsonpath='{.items..metadata.name}')
	retrieve_sandbox_id
	rootfs=($(find /run/kata-containers/shared/sandboxes/${sandbox_id}/shared \
		-name rootfs))
	[ ${#rootfs[@]} -eq 1 ]
```
to ensure that the image hasn't been pulled onto the host.
The reason that the check is for a single rootfs is that we found that
the pause image was always pulled on the host, presumably due to
it being needed to create the pod sandbox.

With the introduction of the nydus-snapshotter code we've found
that on some systems (SE and TDX) the pause image appears to be in a
different location, so check for 1 or 0 instead. See
kata-containers#5781 to track this.

We don't have time to understand this fully now, so we just want the
tests to pass while still checking that we don't have both the pause
and test pod container images pulled, so set the check to pass if
there are 1 or 0 rootfs entries found in
/run/kata-containers/shared/sandboxes/

Fixes: kata-containers#5790
Signed-off-by: stevenhorsman <[email protected]>
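For clarity, a minimal sketch of the relaxed check described in that commit (assumed form; the actual change may differ slightly):

```
# Pass when at most one rootfs is found on the host: only the pause image's
# rootfs, or none at all when nydus-snapshotter stores it elsewhere (SE/TDX).
rootfs=($(find "/run/kata-containers/shared/sandboxes/${sandbox_id}/shared" -name rootfs))
[ ${#rootfs[@]} -le 1 ]
```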
@ChengyuZhu6
Member

I found that test 4 failed due to a stale kata process on the TDX CI machine while running the operator tests:

# ps -ef | grep kata
root      717683  716131  0 17:05 ?        00:00:00 sudo -E ./run-local.sh -r kata-qemu-tdx
root      717684  717683  0 17:05 ?        00:00:00 /bin/bash ./run-local.sh -r kata-qemu-tdx
root      721166  672128  0 17:07 pts/29   00:00:00 grep --color=auto --exclude-dir=.bzr --exclude-dir=CVS --exclude-dir=.git --exclude-dir=.hg --exclude-dir=.svn --exclude-dir=.idea --exclude-dir=.tox kata
root     3051702       1  0 Nov01 ?        00:01:50 /opt/kata/bin/containerd-shim-kata-v2 -namespace k8s.io -address /run/containerd/containerd.sock -publish-binary /opt/confidential-containers/bin/containerd -id 70c83b7d3bf5ebb5bef7208bf816e2bccfb49962964d4559b50ab80d0112cf26

After killing the stale kata process, all the tests (including test 4) passed:
http://10.112.240.228:8080/job/confidential-containers-operator-main-centos8stream-x86_64-containerd_kata-qemu-tdx-PR/639/console
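For anyone hitting the same symptom, a rough sketch of the cleanup step described above (the PID is the one from the listing; adjust the pattern and PID to whatever stale shim you find):

```
# List leftover kata shim processes from earlier runs (long-running ones are suspect).
pgrep -af containerd-shim-kata-v2
# Kill the stale shim by PID (3051702 is the one from the listing above),
# then rerun the operator tests.
sudo kill 3051702
```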

@ChengyuZhu6
Member

@BbolroC This could potentially be the reason for the failure of test 4 on the SE machine as well.

@BbolroC
Member Author

BbolroC commented Nov 7, 2023

Thanks @ChengyuZhu6. I will check today whether that is also the cause for SE, after the Kata AC meeting (I have another appointment before it).

@BbolroC
Member Author

BbolroC commented Nov 7, 2023

@ChengyuZhu6 @stevenhorsman @fidencio I've confirmed that the 4th test, `Test can pull an unencrypted image inside the guest`, passed on the SE machine (with the latest commit in the CCv0 branch) when I reverted the acceptance criteria back to `[ ${#rootfs[@]} -eq 1 ]`.

@stevenhorsman
Member

@ChengyuZhu6 @stevenhorsman @fidencio I've confirmed that the 4th test, `Test can pull an unencrypted image inside the guest`, passed on the SE machine (with the latest commit in the CCv0 branch) when I reverted the acceptance criteria back to `[ ${#rootfs[@]} -eq 1 ]`.

Thanks, this means that when we move this into main we can go back to `-eq 1` rather than `-le 1`. Thanks a lot to Chengyu for discovering the root cause of this mystery!
