
Volume ID already exists issue with statically provisioned Azure NFS File Shares #1955

Open
ashwajce opened this issue Jul 2, 2024 · 3 comments

ashwajce commented Jul 2, 2024

What happened:
In an AKS environment, mounting a large NFS file share (12Ti, ~11.45M files) into a pod causes significant delays: the pod stays in "ContainerCreating" status for about 3 days. This happens after syncing data onto the NFS share from a Linux VM; after the sync the share still binds to the PV and PVC correctly, but the pod does not start because an operation on the current PV already exists.

What you expected to happen:
The pod starts immediately with the volume attached.

How to reproduce it:

  1. Have a storage account.
  2. Expose an NFS file share of approximately 1Ti, containing data, under that storage account.
  3. Mount the share on a Linux VM (see the example mount command after the manifests below).
  4. Provision data via the Linux VM so that the share holds a considerable number of files (e.g. 2.4M).
  5. On AKS, provision the PV and attach the PVC in a namespace, e.g.:

kubectl apply -f persistent-volumes.yaml --namespace <masked>

apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: file.csi.azure.com
  name: pv-<masked>-shared-home
spec:
  capacity:
    storage: 1Ti
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: <masked>-shared
  mountOptions:
    # for recommended values see:
    # https://docs.microsoft.com/en-us/azure/storage/files/storage-troubleshoot-linux-file-connection-problems#troubleshoot-mount-issues
    # and https://learn.microsoft.com/en-us/azure/storage/files/storage-files-how-to-mount-nfs-shares?tabs=portal#mount-options
    - rsize=1048576
    - wsize=1048576
  csi:
    driver: file.csi.azure.com
    readOnly: false
    volumeHandle: pv-<masked>-shared-home
    volumeAttributes:
      resourceGroup: rg-app
      storageAccount: <masked>
      shareName: shared-home
      server: <masked>.privatelink.file.core.windows.net
      protocol: nfs
      skuName: Premium_LRS

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: <masked>-shared-home
spec:
  storageClassName: <masked>-shared
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Ti
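
For steps 3 and 4, the share was mounted on a Linux VM and populated there. A minimal sketch of that, assuming the documented Azure Files NFS v4.1 mount options and placeholder paths (/mnt/shared-home as mount point, /data as the data source):

# create a mount point and mount the NFS share exposed by the storage account
sudo mkdir -p /mnt/shared-home
sudo mount -t nfs <masked>.privatelink.file.core.windows.net:/<masked>/shared-home /mnt/shared-home -o vers=4,minorversion=1,sec=sys
# populate the share, e.g. with rsync, so it holds a large number of files
rsync -a /data/ /mnt/shared-home/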

Then run the Helm chart that deploys the pod.

Actual result:
kubectl describe pod shows the following event:

Warning  FailedMount  2m24s (x857 over 28h)  kubelet  MountVolume.MountDevice failed for volume "pv-shared-home" : rpc error: code = Aborted desc = An operation with the given Volume ID pv-shared-home already exists

Anything else we need to know?:
The issue also occurs after we have synced data into this NFS share (rsync to a Linux VM in Azure with the NFS share mounted).

Once the PV has been mounted and no operations are ongoing, the PV and PVC can be removed. On the second run the PVC is immediately mounted and available, even if we switch clusters.

Environment:

  • CSI Driver version: v1.29.5
  • Kubernetes version (use kubectl version): 1.28.3
  • OS (e.g. from /etc/os-release): AKSUbuntu-2204gen2containerd-202405.20.0
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@andyzhangx (Member) commented:

@ashwajce could you provide the kubelet logs from the node in question? Have you set any securityContext in the pod? This issue could be related to a slow chown operation if you set fsGroup in the pod securityContext; one workaround is to set fsGroupChangePolicy: None in the PV.

fsGroupChangePolicy: indicates how the volume's ownership will be changed by the driver; pod securityContext.fsGroupChangePolicy is ignored

https://github.com/kubernetes-sigs/azurefile-csi-driver/blob/master/docs/driver-parameters.md
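
Based on that suggestion, a minimal sketch of how the reporter's statically provisioned PV could carry this setting, assuming the driver accepts fsGroupChangePolicy under csi.volumeAttributes as described in the linked parameters doc (all other fields are copied from the PV in the issue):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-<masked>-shared-home
spec:
  # capacity, accessModes, reclaim policy, storageClassName and mountOptions as in the original PV above
  csi:
    driver: file.csi.azure.com
    readOnly: false
    volumeHandle: pv-<masked>-shared-home
    volumeAttributes:
      # skip the recursive ownership change a pod-level fsGroup would otherwise trigger on ~11.45M files
      fsGroupChangePolicy: "None"
      resourceGroup: rg-app
      storageAccount: <masked>
      shareName: shared-home
      server: <masked>.privatelink.file.core.windows.net
      protocol: nfs
      skuName: Premium_LRS

If fsGroup is not actually required for this share, dropping it from the pod securityContext would avoid the ownership change altogether.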

@k8s-triage-robot commented:

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) on Oct 5, 2024
@k8s-triage-robot commented:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label (Denotes an issue or PR that has aged beyond stale and will be auto-closed.) and removed the lifecycle/stale label on Nov 4, 2024