seccomp filter should return ENOSYS for unknown syscalls #2151
Comments
The inherent fragility of seccomp(): https://lwn.net/Articles/738694/ The discussion on the article is very instructive. Basically whatever you do with seccomp, there are potential future landmines. And those article comments didn't even go into kernel syscall & libc differences between HW architectures. |
Note that this is breaking running of musl 1.2.0+ binaries - see docker/for-win#8326 for another example - due to inability to perform the correct meaningful fallback. We can't just treat |
On the other hand, if we return |
I don't think glibc actually does that, but if you're blocking specific functionality of a syscall rather than the whole syscall, the choice of error code should be made on a per-syscall basis to match existing error semantics. For example |
@cyphar For unrecognized flag arguments, most system calls use an |
Note that this issue is now somewhat urgent because Linux has added an |
@fweimer Thanks for adding that. I really don't want this to remain a game of whack-a-mole where exceptions get added for each syscall once breakage is found. Upstream should do the right thing and stop producing |
I want to point out that if you want to change the seccomp filter in Docker, you'll need to open an issue in the Docker repo. We don't control the seccomp filter that Docker uses. Until recently the runtime-spec didn't support custom return values, but that has changed, so Docker will need to update their default seccomp profile. As I said, returning ENOSYS for deliberately blocked syscalls is less than ideal, so I'd suggest instead adding a new seccomp rule which encodes the highest-known syscall number and returns ENOSYS if the requested syscall has a higher number. This isn't fool-proof (several architectures have gaps in their syscall tables for historical reasons) but with the new unified syscall number work it's incredibly unlikely this will cause significant issues in the future. |
@cyphar First, runc needs to add the ability to specify two different "defaults": one for known but not specifically specified syscalls, and one for unknown syscalls. Currently, non-specified calls all return the same defaultAction. |
OK, can someone familiar with Docker and how that works open an issue on their side and link to this one? |
the default seccomp profile in docker is in https://github.com/moby/moby/blob/master/profiles/seccomp/default_linux.go and the runtime spec; https://github.com/opencontainers/runtime-spec/blob/f1164e526717e7d0b6e5ac24b05cb4b7401b0a98/config-linux.md#seccomp |
@jethrogb I don't think runc is in a position to do that -- what set of syscalls are "known" is a property of the profile being written, not the container runtime. If you write a profile today and 50 syscalls get added next week, runc (or rather libseccomp) will know about those syscalls but the old profile will not. Since Docker is the thing generating these profiles (and accepting user-specified profiles too), you would want the user-specified profile to say "this is the latest syscall at the time when I wrote this profile". The Docker-generated profile will then have a default action of |
Not today no, as you explain the profile description needs to be expanded to convey that information. |
Currently, runc also needs to know about the syscalls if they're included in the profile (opencontainers/runtime-spec#1071), so if a profile specifies syscalls that runc doesn't know about, the container fails to start (see moby/moby#41562). |
On Linux the major C libraries expect that syscalls that are blocked from running in the container runtime return ENOSYS to allow fallbacks to be used. Returning EPERM by default is not useful, particularly for syscalls that would return EPERM for actual access restrictions, e.g. the new faccessat2. The runtime-spec should set the standard and recommend that ENOSYS be returned, just as a kernel that doesn't support that syscall would. This allows C runtimes to fall back on other possible implementations given the userspace policies. Please see the upstream discussions: https://lwn.net/Articles/738694/ - Discusses fragility of syscall filtering. opencontainers/runc#2151 - glibc and musl request ENOSYS return for unknown syscalls. systemd/systemd#16739 - Discusses systemd-nspawn breakage with faccessat2. systemd/systemd#16819 - General policy for systemd-nspawn to return ENOSYS. seccomp/libseccomp#286 - Block unknown syscalls and return ENOSYS.
As a first-pass solution we can implement this in Docker et al by just assuming the largest syscall number specified in the profile at all is the last syscall before we give
@thaJeztah That is an issue but not really one that is super relevant here IMHO -- we just use libseccomp's syscall lookup features so updating libseccomp will update the supported syscall numbers. 🤷 |
Ah, I was wrong above -- I forgot that Docker could still work around it (by making the default |
Yes. That's what I tried to refer to in #2151 (comment), but later saw my complete brain-fart that the issue I linked to was about capabilities, not syscalls 😂 |
I talked about some of these issues in my Kubecon talk last week. In particular, just listing the calls to block, rather than using an allowlist, makes more sense, even if it results in failing open for new syscalls. We could tweak the error codes more easily in this case. |
Would there be a risk if new syscalls are added that were not known at the time that the seccomp profile was generated? I think this thread (or the one on the mailing list) mentions the option to have the profile include information about the highest syscall number that was known at the time the profile was generated (potentially allowing new syscalls to be treated with some default ( |
@justincormack I think failing-closed is very reasonable here; it's just that the error code is wrong. There's nothing wrong with a fail-closed mechanism that effectively just emulates an old kernel, which is what you get if you fail with |
It should remain an allow-list (fail-closed), the issue is that there are two kinds of failures that the current allow-list is handling with the same error code:
The issue is that currently we pretend that all syscalls not included in the allow-list are in category (1) when in reality we should be defaulting to (2) for syscalls that were not known about at profile-creation time. In other words, the issue is not simply that "the error code is wrong" -- it's that there are two errors being handled with one error code. Changing the default action to return We could loosely infer which syscalls are in category (2) by assuming any syscall with a larger syscall number than the largest one in the profile is in category (2). However it would be nicer to have this behaviour be something that profile writers control (either by explicitly specifying the "largest known" syscall, or even better by allowing profiles to do |
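To make the "loosely infer" idea concrete, here is a minimal sketch, not anything runc or Docker actually does. It assumes category (1) means syscalls the profile author knew about and chose not to allow, and category (2) means syscalls newer than the profile. The function names and the small syscall table are hypothetical; the numbers are the x86_64 ones and are only there for illustration.

```python
# Hypothetical excerpt of an x86_64 syscall table (name -> number).
SYSCALL_NUMBERS = {
    "read": 0,
    "write": 1,
    "ptrace": 101,
    "openat": 257,
    "statx": 332,
    "close_range": 436,
    "faccessat2": 439,
}


def inferred_boundary(profile_names, table):
    """Largest syscall number mentioned anywhere in the profile."""
    return max(table[name] for name in profile_names if name in table)


def classify(name, profile_names, table):
    """Return the result the filter would produce for one syscall."""
    if name in profile_names:
        return "ALLOW"
    if table[name] > inferred_boundary(profile_names, table):
        return "ENOSYS"  # assumed category (2): newer than the profile
    return "EPERM"       # assumed category (1): known, deliberately blocked


profile = ["read", "write", "openat", "statx"]
```

With this profile, `ptrace` is treated as deliberately blocked while `close_range` and `faccessat2` look like an old kernel that lacks them. The weakness is also visible here: a profile that only lists low-numbered syscalls would push many long-standing syscalls into the ENOSYS bucket, which is one reason an explicit "largest known" value in the profile would be more robust.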
I wouldn't characterize it as "just as incorrect". Semantically Of course I'd like to see this solved in a way that distinguishes the two cases, if this can be done in a way that works right by default and doesn't depend on the profile author understanding why |
I guess that's a fair point. We could switch Docker to use |
@cyphar This rule worries me because it could mean that as soon as the profile is re-created with knowledge of the |
I'm not entirely sure, but I don't believe the seccomp profile is included in the Docker container image? This is just a runtime thing? |
I actually meant rebuilding the runtime against newer kernel headers/libseccomp. It should not have this effect, either. |
What is the actual fix for this one? I think the massive number of references above show this is breaking a LOT of stuff in a lot of places. A lot of the bugs above mention using older os images as the "workaround", but nobody has found a solution. In the concourse bug I opened above, 32-bit binaries are failing to run in runc containers. statx is allowed via the seccomp config, but is failing with EPERM for an unknown reason. I opened opencontainers/runtime-spec#1122 to "change the default errnoRet", but even that seems like an incorrect solution. Does anyone have an idea on what is going on? |
That sounds like a bug in our BPF patching code (that really shouldn't be happening -- we do have handling for different architectures including the 32-on-64-bit "architecture" so it's a bit puzzling that it's not working as expected). I will take a look at this this week. |
@cyphar Thanks :-) Would moving away from Fedora 34 to something like Alpine be a temporary workaround? We really didn't run into this until we upgraded the container host from Fedora 33 to 34. |
I'm not sure to be honest -- I'm also confused how the container host upgrade could've caused this as well. If there's an issue with our 32-bit compat handling it should happen on every system. The obvious contender (kernel version change) doesn't really explain it either. |
@cyphar @fweimer-rh keeps mentioning something changed in Fedora 34, but really hasn't identified what beyond this:
@fweimer , @fweimer-rh , can you provide clarification on what changed in Fedora 34? |
On the host? |
The ENOSYS boundary is primarily set by the seccomp profile not libseccomp (if libseccomp doesn't know about a syscall it's ignored, but it's still primarily set by the profile -- which wouldn't have changed between versions since it's a Docker-version thing not related to the rest of the host and statx in particular has been allowed for quite a while). But yeah, as I said I'll take a look at this this week. |
This allows applications to detect whether the kernel supports a syscall or not. Previously, the error was unconditionally EPERM. There are many reports of glibc failing on new syscalls in containerized environments when the host runs an old kernel. More about the motivation for ENOSYS over EPERM: opencontainers/runc#2151 opencontainers/runc#2750 For the introduction of defaultErrnoRet, see: opencontainers/runtime-spec#1087 Previously, the FreeIPA profile was vendored from https://github.com/containers/podman/blob/main/vendor/github.com/containers/common/pkg/seccomp/seccomp.json Now it is merged directly from https://github.com/containers/common/blob/main/pkg/seccomp/seccomp.json Fixes: https://pagure.io/freeipa/issue/9008 Signed-off-by: Stanislav Levin <[email protected]>
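For readers who have not seen `defaultErrnoRet` in a profile, the following is a rough sketch of what such a profile could look like, generated from Python. The field names follow the runtime-spec seccomp section as I understand it; `build_profile` and the three allowed syscalls are made up for illustration and are far from a usable allow-list. The value 38 is ENOSYS on x86 and Arm Linux, and it differs on a few other architectures.

```python
import json

# ENOSYS on x86 and Arm Linux; hard-coded so the output does not depend on
# the platform this script runs on.
ENOSYS_LINUX = 38


def build_profile(allowed_names):
    """Build a minimal allow-list profile whose default result is ENOSYS."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "defaultErrnoRet": ENOSYS_LINUX,
        "syscalls": [
            {"names": sorted(allowed_names), "action": "SCMP_ACT_ALLOW"},
        ],
    }


profile_json = json.dumps(build_profile({"read", "write", "openat"}), indent=2)
```

Printing `profile_json` gives a JSON document that can be compared against the real containers/common profile linked above, which remains the authoritative reference.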
The glib2 shipping in Fedora 37 is hitting the classic seccomp EPERM vs ENOSYS issue for `close_range` when used via `createrepo_c`. Interestingly, Fedora 36 carried a patch for this: https://src.fedoraproject.org/rpms/glib2/c/a2259ad90593383c5ce982fbb233fd3658c0a7a1?branch=f36 But this patch is not carried in Fedora 37, presumably on the basis that by then hosts should be running a new enough runc to fix opencontainers/runc#2151. Clearly, that hasn't happened yet for whatever version of runc moby-engine uses in `ubuntu-latest`. Hack around this by running the container in privileged mode.
Currently, the seccomp filter installed on Linux returns EPERM even for system calls that are unknown. This is problematic when new system calls are added by Linux. Programs wishing to use the new system call will try to call it, and will implement a fallback mechanism when ENOSYS is returned (indicating the kernel doesn't support the call). However, when using containers, it will likely receive EPERM instead, failing instead of trying the fallback path.
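The failure mode described above can be simulated without touching real syscalls. This is only a sketch: `make_new_call`, `old_call`, and `access_with_fallback` are hypothetical stand-ins for a libc wrapper that tries a new syscall first (faccessat2 would be one real-world case), not actual glibc or musl code.

```python
import errno


def make_new_call(new_call_errno):
    """Return a fake 'new syscall' that always fails with the given errno."""
    def new_call(path):
        raise OSError(new_call_errno, "simulated failure")
    return new_call


def old_call(path):
    """Stand-in for the older syscall that every kernel supports."""
    return "old:" + path


def access_with_fallback(path, new_call):
    """Try the new call first; fall back only when ENOSYS is reported."""
    try:
        return new_call(path)
    except OSError as exc:
        if exc.errno == errno.ENOSYS:
            return old_call(path)
        raise  # EPERM and other errors are treated as real failures


def outcome(new_call_errno):
    """Describe what the caller observes for a given errno from the filter."""
    try:
        return access_with_fallback("/etc/hosts", make_new_call(new_call_errno))
    except OSError as exc:
        return "failed:" + errno.errorcode[exc.errno]
```

`outcome(errno.ENOSYS)` takes the fallback path, while `outcome(errno.EPERM)` surfaces the error to the caller, which is the breakage seen inside containers.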
In addition to the list of acceptable syscalls, the container definition should include a maximum known syscall number. The seccomp filter should be configured such that calls above the maximum return ENOSYS. When new syscalls are added, the maximum can be increased after the seccomp policy is updated.
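As a rough illustration of the proposal, the decision the filter would make can be written as a small pure function. This is a sketch of the intended semantics only, not BPF and not runc code; `filter_result`, `ALLOWED`, and `MAX_KNOWN` are hypothetical names, and the numbers are x86_64 syscall numbers picked for the example.

```python
import errno


def filter_result(nr, allowed, max_known):
    """Return 0 to allow the call, or the negative errno the filter reports."""
    if nr in allowed:
        return 0
    if nr > max_known:
        # Newer than the policy: behave like a kernel without this syscall.
        return -errno.ENOSYS
    # Known when the policy was written and deliberately left out.
    return -errno.EPERM


ALLOWED = {0, 1, 257}   # read, write, openat on x86_64
MAX_KNOWN = 332         # statx: newest syscall the policy author reviewed
```

Raising `MAX_KNOWN` after the policy has been reviewed against newer syscalls moves them from the ENOSYS bucket into either the allowed set or the EPERM bucket, which matches the update flow described above.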