Driver crashes unexpectedly with "Failed to read /host/proc/mounts", requiring pod restart (#284)
Comments
Thanks for opening the bug report, @dienhartd. We'll investigate further. Would you be able to review …

Please can you let us know what operating system you're running on the cluster nodes too!

I have the same problem, I was running on Amazon Linux 2.

Thanks for sharing, @John-Funcity. Please can you open a new issue so we can get logs relevant to your problem, and also include information such as the …
[ 2.280531] systemd[1]: systemd 219 running in system mode. (+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 -SECCOMP +BLKID +ELFUTILS +KMOD +IDN)
[ 2.291650] systemd[1]: Detected virtualization amazon.
[ 2.295150] systemd[1]: Detected architecture x86-64.
[ 2.298554] systemd[1]: Running in initial RAM disk.
[ 2.302928] systemd[1]: No hostname configured.
[ 2.306128] systemd[1]: Set hostname to <localhost>.
[ 2.309546] systemd[1]: Initializing machine ID from VM UUID.
[ 2.336041] systemd[1]: Reached target Local File Systems.
[ 2.340338] systemd[1]: Reached target Swap.
[ 2.344257] systemd[1]: Created slice Root Slice.
[ 2.497890] XFS (nvme0n1p1): Mounting V5 Filesystem
[ 2.666828] input: AT Translated Set 2 keyboard as /devices/platform/i8042/serio0/input/input0
[ 3.033970] XFS (nvme0n1p1): Ending clean mount
[ 3.253141] systemd-journald[863]: Received SIGTERM from PID 1 (systemd).
[ 3.309998] printk: systemd: 18 output lines suppressed due to ratelimiting
[ 3.537461] SELinux: Runtime disable is deprecated, use selinux=0 on the kernel cmdline.
[ 3.543529] SELinux: Disabled at runtime.
[ 3.610275] audit: type=1404 audit(1732528464.939:2): enforcing=0 old_enforcing=0 auid=4294967 …
Thanks @John-Funcity for the information, but could you please open a new issue so we're able to root cause the issues separately from this one. Please include the dmesg logs and other logs following the logging guide: https://github.com/awslabs/mountpoint-s3-csi-driver/blob/main/docs/LOGGING.md

Maybe this problem?

MountVolume.SetUp failed for volume "s3-models-pv" : rpc error: code = Internal desc = Could not mount "xxxx-models-test" at "/var/lib/kubelet/pods/xxxxxxxxx/volumes/kubernetes.io …

Same issue and logs. When I delete the CSI pod running on the node that produces the error, it is fixed.

Thanks for the reports @John-Funcity @fatihmete. Would you be able to share any log that might be relevant from …

The error does not occur in a specific pattern, and I cannot predict when it will happen. Similarly, I am getting the following error: …
CSI pods appear to be working without errors. I will add the logs when the problem occurs again.
@dannycjones would you prefer I opened a new issue as well? Seems I'm getting the exact same issue. Running k3s v1.30.6+k3s1 on Ubuntu 22.04 (also on Ubuntu 24.04) and s3-mountpoint 1.10.0. I am able to access /proc/mounts on the host, but I don't see anything in there related to S3 or CSI. What do we expect to find in there relating to the S3 CSI driver? Not much in dmesg (not sure if this is relevant): …
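For context, a hedged sketch of how one might look for Mountpoint-created mounts on the host. The device name `mountpoint-s3` and the `/var/lib/kubelet/...` path in the comment below are assumptions based on typical Mountpoint for S3 FUSE mounts, not confirmed details from this thread, and may differ on your nodes:

```
# List mounts that look like FUSE mounts; a Mountpoint-created mount would match here too.
grep -i -E 'mountpoint-s3|fuse' /proc/mounts
# A Mountpoint-backed CSI volume would typically appear similar to (illustrative only):
# mountpoint-s3 /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<pv-name>/mount fuse rw,nosuid,nodev,... 0 0
```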
My team has now moved this data from S3 to EFS. That said, when we were using S3 and the S3 Mountpoint driver, I'm not positive I had access to … This was usually Amazon Linux 2, though I believe this also happened with the Ubuntu AMI.
We are yet to identify the root cause for this issue, but we're looking for a way to omit mounting `/proc/mounts` from the host.
We were able to reproduce the issue by using https://github.com/aws-samples/comfyui-on-eks thanks to @Shellmode's suggestion. We have a potential fix with #321; we're working on verifying that this indeed solves the problem.

We're mounting `/proc/mounts` from the host (as `/host/proc/mounts`) to check existing mounts, and that mount resolves to a path under containerd's procfs entry:

$ cat /proc/`pgrep -f aws-s3-csi-driver`/mountinfo | grep /host/proc/mounts
632 546 0:19 /1657/mounts /host/proc/mounts rw,nosuid,nodev,noexec,relatime - proc proc rw
$ ps -p 1657
PID TTY TIME CMD
1657 ? 00:00:12 containerd

and if the containerd process restarts, our mount of `/host/proc/mounts` becomes invalid:

$ cat /proc/`pgrep -f aws-s3-csi-driver`/root/host/proc/mounts > /dev/null # it currently works
$ service containerd restart # restart containerd service
$ cat /proc/`pgrep -f aws-s3-csi-driver`/root/host/proc/mounts > /dev/null # now it fails
cat: /proc/13065/root/host/proc/mounts: Invalid argument
$ ps aux | grep containerd # because containerd got a new pid 13767
root 13767 1.3 1.5 1896628 58808 ? Ssl 18:08 0:00 /usr/bin/containerd
$ cat /proc/`pgrep -f aws-s3-csi-driver`/mountinfo | grep /host/proc/mounts # but our mount still refers to old pid 8985
1033 1001 0:19 /8985/mounts /host/proc/mounts rw,nosuid,nodev,noexec,relatime - proc proc rw

and the reason this happens more frequently with Karpenter/GPU nodes is that NVIDIA's container toolkit sends …
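A hedged diagnostic sketch, derived from the commands above, for checking on an affected node whether the driver's `/host/proc/mounts` mount has gone stale. It assumes the driver process matches `aws-s3-csi-driver` and that the mountinfo line has the same shape as shown above:

```
# Compare the PID that /host/proc/mounts was bound to against the current containerd PID.
driver_pid=$(pgrep -f aws-s3-csi-driver | head -n1)
mount_pid=$(grep ' /host/proc/mounts ' "/proc/${driver_pid}/mountinfo" | sed -E 's|.* /([0-9]+)/mounts .*|\1|')
containerd_pid=$(pgrep -x containerd | head -n1)
echo "mount bound to PID ${mount_pid}, containerd is PID ${containerd_pid}"
# If the two PIDs differ, reads of /host/proc/mounts fail with "Invalid argument" until the CSI pod restarts.
```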
#321: *Description of changes:* Currently, we spawn Mountpoint processes on the host using systemd. As a result, the mounts created by Mountpoint are not visible inside the CSI Driver Pod. To work around this, we were mounting `/proc/mounts` from the host and parsing this file to check existing mounts on the host. Mounting `/proc/mounts` sometimes causes problems with Karpenter, and it is also blocked by some SELinux policies, such as in this [issue](#284). This commit instead uses `HostToContainer` mount propagation on the `hostPath` mount for `/var/lib/kubelet`. Thanks to `HostToContainer`, any new mount created inside `/var/lib/kubelet` on the host is automatically propagated to our Pod. By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice. --------- Signed-off-by: Burak Varlı <[email protected]> Co-authored-by: Burak Varlı <[email protected]> Co-authored-by: Jiayi Nie <[email protected]>
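To illustrate the approach described in #321, here is a minimal sketch, not the driver's actual manifest: the DaemonSet name, labels, and image below are placeholders. It mounts `/var/lib/kubelet` from the host with `HostToContainer` propagation, so mounts created on the host under that path become visible inside the container:

```
# Write an illustrative manifest snippet; all names and the image are hypothetical.
cat <<'EOF' > hosttocontainer-example.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-s3-csi-node            # placeholder name
spec:
  selector:
    matchLabels:
      app: example-s3-csi-node
  template:
    metadata:
      labels:
        app: example-s3-csi-node
    spec:
      containers:
        - name: s3-plugin
          image: example.com/s3-csi-driver:latest    # placeholder image
          volumeMounts:
            - name: kubelet-dir
              mountPath: /var/lib/kubelet
              mountPropagation: HostToContainer      # host mounts under this path propagate into the container
      volumes:
        - name: kubelet-dir
          hostPath:
            path: /var/lib/kubelet
            type: Directory
EOF
```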
We've confirmed the fix on both regular pre-provisioned nodes (it was reproducible by restarting containerd) and Karpenter/GPU nodes. We're hoping that this will solve the issue for others too; we'll share an update here once we release this fix.
/kind bug
What happened?
Periodically, and without warning, one of my S3 Mountpoint driver pods will crash with gRPC errors until I delete it. This usually causes a dependent pod to fail to start. The replacement pod created immediately after deletion works fine, but this requires manual intervention once I notice the dependent pod crashing due to the missing PV.
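A hedged sketch of the manual workaround described above (deleting the affected CSI driver pod so its DaemonSet recreates it). The `kube-system` namespace and the `app=s3-csi-node` label are assumptions about a typical installation and may differ:

```
# Find the driver pod running on the affected node, then delete it so the DaemonSet recreates it.
kubectl get pods -n kube-system -l app=s3-csi-node -o wide
kubectl delete pod -n kube-system <s3-csi-node-pod-on-affected-node>
```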
What you expected to happen?
Error not to occur.
How to reproduce it (as minimally and precisely as possible)?
Unclear.
Anything else we need to know?:
Logs
Environment
- Kubernetes version (use `kubectl version`):
  - Client Version: v1.31.1
  - Server Version: v1.30.5-eks-ce1d5eb
- Driver version: v1.9.0
- Installation of the S3 Mountpoint driver is through eksctl, i.e. `eksctl create addon aws-mountpoint-s3-csi-driver`
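For reference, a hedged example of that install command; the cluster name, region, and role ARN below are placeholders, not values from this report:

```
# Install the addon on an existing EKS cluster (placeholder cluster, region, and role values).
eksctl create addon \
  --name aws-mountpoint-s3-csi-driver \
  --cluster my-cluster \
  --region us-east-1 \
  --service-account-role-arn arn:aws:iam::111122223333:role/AmazonS3CSIDriverRole
```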
Was directed by @muddyfish to file this issue here: #174 (comment)