Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Proposal] New DMZ solution to defeat CVE-2019-5736 #4450

Closed
lifubang opened this issue Oct 16, 2024 · 7 comments
Closed

[Proposal] New DMZ solution to defeat CVE-2019-5736 #4450

lifubang opened this issue Oct 16, 2024 · 7 comments

Comments

@lifubang
Copy link
Member

lifubang commented Oct 16, 2024

Because of CVE-2019-5736, we have to take some actions to prevent runc binary in the host
could be modified from the container, we had many solutions for these years, for example:
bindfd, memfd, otmpfile, os tmpfile, Embedded dmz and overlay. Most of then can work
for most of times, but in practice, there were many issues for these solutions. So we should find
some new possible solutions.

For Embedded dmz solution, we embed a small binary file to start the real container process,
but it may cause some issues(#4518), especially the capability behavior change issue(#4125). So
we make this solution as a opt-in solution. Consider we have moved the binary clone code
from runc init to runc parent process(#3987), so the memory usage of binary clone will not
be included in container's memory cgroup accounting. We can embed a big binary file in runc,
and copy it to memfd, then use it to start runc init process.

The whole steps should like this:

  1. Move runc init to contrib/cmd and it could be compiled as a separate binary file, for example
    name it as runc-dmz;
  2. Embed runc-dmz to runc binary;
  3. Remove the runc-dmz file;
  4. Change the old dmz solution to copy runc-dmz binary data to memfd;
  5. Use this memfd to start runc init, like runc-dmz init.

I think the size of runc-dmz should be smaller than runc, and we can read it to memory paralleled,
so it would help to reduce the start time of the container, and will fix the issue #4449.

@cyphar
Copy link
Member

cyphar commented Oct 16, 2024

I'm not sure that a separated runc init would be much smaller (most of runc's logic lives in runc init after all) and doing this split would make the whole runc binary a fair bit larger. So the overhead of copying the binary would still be fairly large (given that the current ~15MB copy adds ~60% overhead using the figures from #4448.)

The overlayfs approach (which it seems I suggested doing in the commit message when I wrote this protection in 0a8e411 😅) has basically no performance or memory overhead (based on my performance testing in #4448, it has <1% overhead over doing no copying at all and that might just be noise) and solves the most common case. It also has the benefit of giving us page cache sharing, which none of the copy-based solutions can.

Conceptually, having runc init be separate sounds nice but I don't think the extra binary size overhead is worth it. 15MB for a program as small as runc is already insane, let alone the ~25MB we would get if we had runc init be a separate embedded binary.

(Also O_TMPFILE and tempfiles are the same solution. It's just that O_TMPFILE is the nicer way of doing it on newer kernels.)

@kolyshkin
Copy link
Contributor

I played with the separate runc init a long time ago; I don't remember any details now except it was not worth it.

Overlay approach indeed looks very promising since most of the systems where runc is used have overlayfs. The only catch is it requires a relatively new kernel v5.1, released in May 2019, meaning some very old distros might not have the syscalls needed (fsopen/fsmount); but even Ubuntu 18.04 has kernel 5.15 (in HWE), and RHEL8 kernel has needed syscalls backported.

@cyphar
Copy link
Member

cyphar commented Oct 25, 2024

IMHO now that we have the overlay solution, maybe we should consider dropping runc-dmz since it has too many limitations, complicates the binary cloning logic, and complicates our build process a fair bit (we might even be able to drop cc_platform.mk). Even if we were to go for the "separate runc init binary" solution, the code we currently have would not be needed anyway.

@lifubang
Copy link
Member Author

Agree!

@kolyshkin
Copy link
Contributor

IMHO now that we have the overlay solution, maybe we should consider dropping runc-dmz since it has too many limitations, complicates the binary cloning logic, and complicates our build process a fair bit (we might even be able to drop cc_platform.mk). Even if we were to go for the "separate runc init binary" solution, the code we currently have would not be needed anyway.

@cyphar not really related, but I'm also thinking about dropping cmd/contrib/memfd-bind. While not as complicated and limited as dmz, it is still complicated (to setup and maintain wrt upgrades) and limited (non-root users are out). OTOH it's harmless (but kind of useless now).

@cyphar
Copy link
Member

cyphar commented Oct 31, 2024

I'm of two minds about that. Using it completely removes any potential overhead from /proc/self/exe cloning, but on the other hand the overlayfs stuff is so low overhead that it seems unlikely someone is going to care enough to use it. The fact that it also makes the binary accessible only by root and requires a daemon really make it unlikely that someone is going to use it in anger...

It really sucks that we can't bind-mount the memfd itself (even open_tree(OPEN_TREE_CLONE) doesn't work). If we could then it would make more sense to keep it. AFAIK there also is no (trivial and non-destructive) way of knowing if a file is in the lowerdir of an overlayfs -- if we could figure that out then we could adjust memfd-bind to create an overlayfs file instead, which would solve the problem for all users (even rootless containers would be able to run without overhead).

@kolyshkin
Copy link
Contributor

AFAIK there also is no (trivial and non-destructive) way of knowing if a file is in the lowerdir of an overlayfs -- if we could figure that out then we could adjust memfd-bind to create an overlayfs file instead, which would solve the problem for all users (even rootless containers would be able to run without overhead).

OK, this is a valid argument for keeping memfd-bind -- while not very useful now, it could be improved to become more useful. I guess we can keep it for now and either improve or remove in time for v1.3 release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants