CSI ControllerUnpublishVolume called twice #24130

Open
ygersie opened this issue Oct 4, 2024 · 0 comments

Nomad version

1.8.2+ent

Problem description

While looking into an issue that still sometimes leads to stuck CSI volumes, I ran into the following scenario: when I stop an allocation and it is rescheduled onto the same node, I see events on two CSI controller plugins instead of just one. It looks like the ControllerUnpublishVolume RPC is incorrectly called a second time. I'm not sure whether this will ever cause problems, but it is at least somewhat unexpected.
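The duplicate call is only harmless because the EBS plugin treats a detach of an already-detached volume as a no-op (the "attachment not found" line in the controller logs below). A minimal sketch of that idempotency pattern, with a hypothetical `unpublishTracker` type that is not Nomad's or the plugin's actual code:

```go
package main

import (
	"fmt"
	"sync"
)

// unpublishTracker is a hypothetical guard illustrating how a caller could
// deduplicate unpublish requests per volume/node pair, so that a second
// ControllerUnpublishVolume for the same pair becomes a no-op.
type unpublishTracker struct {
	mu   sync.Mutex
	done map[string]bool
}

func (t *unpublishTracker) unpublish(volumeID, nodeID string) string {
	key := volumeID + "/" + nodeID
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.done[key] {
		// Mirrors the "attachment not found" path in the plugin logs:
		// the volume is already detached, so nothing happens.
		return "skipped: already unpublished"
	}
	t.done[key] = true
	return "detached"
}

func main() {
	tr := &unpublishTracker{done: map[string]bool{}}
	fmt.Println(tr.unpublish("vol-025ef963301234567", "i-0b2cad6b68d0f7cb2"))
	fmt.Println(tr.unpublish("vol-025ef963301234567", "i-0b2cad6b68d0f7cb2"))
}
```

If a plugin were not idempotent here, the second RPC could return an error or race with the subsequent ControllerPublishVolume, which is why the duplicate call is still worth fixing even though it currently appears benign.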

Logs

client plugin

I1004 14:31:49.090273       1 node.go:461] "NodeUnpublishVolume: called" args={"volume_id":"vol-025ef963301234567","target_path":"/local/csi/per-alloc/f10806cb-1dce-6fd1-1700-f49c4869442d/ygersie-test[0]/rw-file-system-single-node-writer"}
I1004 14:31:49.090423       1 node.go:479] "NodeUnpublishVolume: unmounting" target="/local/csi/per-alloc/f10806cb-1dce-6fd1-1700-f49c4869442d/ygersie-test[0]/rw-file-system-single-node-writer"
I1004 14:31:49.091389       1 mount_helper_common.go:93] unmounting "/local/csi/per-alloc/f10806cb-1dce-6fd1-1700-f49c4869442d/ygersie-test[0]/rw-file-system-single-node-writer" (corruptedMount: false, mounterCanSkipMountPointChecks: true)
I1004 14:31:49.091411       1 mount_linux.go:362] Unmounting /local/csi/per-alloc/f10806cb-1dce-6fd1-1700-f49c4869442d/ygersie-test[0]/rw-file-system-single-node-writer
I1004 14:31:49.095315       1 mount_helper_common.go:150] Warning: deleting path "/local/csi/per-alloc/f10806cb-1dce-6fd1-1700-f49c4869442d/ygersie-test[0]/rw-file-system-single-node-writer"
I1004 14:31:49.095409       1 node.go:475] "NodeUnPublishVolume: volume operation finished" volumeId="vol-025ef963301234567"
I1004 14:31:49.095427       1 inflight.go:74] "Node Service: volume operation finished" key="vol-025ef963301234567"
I1004 14:31:49.096428       1 node.go:289] "NodeUnstageVolume: called" args={"volume_id":"vol-025ef963301234567","staging_target_path":"/local/csi/staging/cie-orch/ygersie-test[0]/rw-file-system-single-node-writer"}
I1004 14:31:49.096694       1 node.go:329] "NodeUnstageVolume: unmounting" target="/local/csi/staging/cie-orch/ygersie-test[0]/rw-file-system-single-node-writer"
I1004 14:31:49.096715       1 mount_helper_common.go:93] unmounting "/local/csi/staging/cie-orch/ygersie-test[0]/rw-file-system-single-node-writer" (corruptedMount: false, mounterCanSkipMountPointChecks: true)
I1004 14:31:49.096726       1 mount_linux.go:362] Unmounting /local/csi/staging/cie-orch/ygersie-test[0]/rw-file-system-single-node-writer
I1004 14:31:49.102049       1 mount_helper_common.go:150] Warning: deleting path "/local/csi/staging/cie-orch/ygersie-test[0]/rw-file-system-single-node-writer"
I1004 14:31:49.102506       1 node.go:334] "NodeUnStageVolume: successfully unstaged volume" volumeID="vol-025ef963301234567" target="/local/csi/staging/cie-orch/ygersie-test[0]/rw-file-system-single-node-writer"
I1004 14:31:49.103043       1 node.go:304] "NodeUnStageVolume: volume operation finished" volumeID="vol-025ef963301234567"
I1004 14:31:49.103063       1 inflight.go:74] "Node Service: volume operation finished" key="vol-025ef963301234567"
I1004 14:32:09.950816       1 node.go:552] "NodeGetCapabilities: called" args={}
I1004 14:32:10.068667       1 node.go:113] "NodeStageVolume: called" args={"volume_id":"vol-025ef963301234567","publish_context":{"devicePath":"/dev/xvdaa"},"staging_target_path":"/local/csi/staging/cie-orch/ygersie-test[0]/rw-file-system-single-node-writer","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}}
I1004 14:32:10.068830       1 node_linux.go:82] "[Debug] Falling back to nvme volume ID lookup" devicePath="/dev/xvdaa"
I1004 14:32:10.068887       1 node_linux.go:95] "[Debug] successfully resolved" nvmeName="nvme-Amazon_Elastic_Block_Store_vol025ef963301234567" nvmeDevicePath="/dev/nvme1n1"
I1004 14:32:10.068912       1 node.go:207] "NodeStageVolume: find device path" devicePath="/dev/xvdaa" source="/dev/nvme1n1"
I1004 14:32:10.069209       1 node.go:235] "NodeStageVolume: checking if volume is already staged" device="" source="/dev/nvme1n1" target="/local/csi/staging/cie-orch/ygersie-test[0]/rw-file-system-single-node-writer"
I1004 14:32:10.069237       1 node.go:242] "NodeStageVolume: staging volume" source="/dev/nvme1n1" volumeID="vol-025ef963301234567" target="/local/csi/staging/cie-orch/ygersie-test[0]/rw-file-system-single-node-writer" fstype="ext4"
I1004 14:32:10.069251       1 mount_linux.go:579] Attempting to determine if disk "/dev/nvme1n1" is formatted using blkid with args: ([-p -s TYPE -s PTTYPE -o export /dev/nvme1n1])
I1004 14:32:10.093990       1 mount_linux.go:582] Output: "DEVNAME=/dev/nvme1n1\nTYPE=ext4\n"
I1004 14:32:10.094016       1 mount_linux.go:441] Checking for issues with fsck on disk: /dev/nvme1n1
I1004 14:32:10.122820       1 mount_linux.go:546] Attempting to mount disk /dev/nvme1n1 in ext4 format at /local/csi/staging/cie-orch/ygersie-test[0]/rw-file-system-single-node-writer
I1004 14:32:10.122851       1 mount_linux.go:220] Mounting cmd (mount) with arguments (-t ext4 -o defaults /dev/nvme1n1 /local/csi/staging/cie-orch/ygersie-test[0]/rw-file-system-single-node-writer)
I1004 14:32:10.137170       1 mount_linux.go:579] Attempting to determine if disk "/dev/nvme1n1" is formatted using blkid with args: ([-p -s TYPE -s PTTYPE -o export /dev/nvme1n1])
I1004 14:32:10.148963       1 mount_linux.go:582] Output: "DEVNAME=/dev/nvme1n1\nTYPE=ext4\n"
I1004 14:32:10.149041       1 resizefs_linux.go:139] ResizeFs.needResize - checking mounted volume /dev/nvme1n1
I1004 14:32:10.152965       1 resizefs_linux.go:143] Ext size: filesystem size=1073741824, block size=4096
I1004 14:32:10.152982       1 resizefs_linux.go:158] Volume /dev/nvme1n1: device size=1073741824, filesystem size=1073741824, block size=4096
I1004 14:32:10.153017       1 node.go:284] "NodeStageVolume: successfully staged volume" source="/dev/nvme1n1" volumeID="vol-025ef963301234567" target="/local/csi/staging/cie-orch/ygersie-test[0]/rw-file-system-single-node-writer" fstype="ext4"
I1004 14:32:10.153036       1 node.go:184] "NodeStageVolume: volume operation finished" volumeID="vol-025ef963301234567"
I1004 14:32:10.153049       1 inflight.go:74] "Node Service: volume operation finished" key="vol-025ef963301234567"
I1004 14:32:10.154175       1 node.go:408] "NodePublishVolume: called" args={"volume_id":"vol-025ef963301234567","publish_context":{"devicePath":"/dev/xvdaa"},"staging_target_path":"/local/csi/staging/cie-orch/ygersie-test[0]/rw-file-system-single-node-writer","target_path":"/local/csi/per-alloc/892e2aba-61fb-f2d3-1e29-a07ce47d715d/ygersie-test[0]/rw-file-system-single-node-writer","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}}
I1004 14:32:10.154677       1 node_linux.go:155] "NodePublishVolume: creating dir" target="/local/csi/per-alloc/892e2aba-61fb-f2d3-1e29-a07ce47d715d/ygersie-test[0]/rw-file-system-single-node-writer"
I1004 14:32:10.155798       1 node.go:744] "NodePublishVolume: mounting" source="/local/csi/staging/cie-orch/ygersie-test[0]/rw-file-system-single-node-writer" target="/local/csi/per-alloc/892e2aba-61fb-f2d3-1e29-a07ce47d715d/ygersie-test[0]/rw-file-system-single-node-writer" mountOptions=["bind"] fsType="ext4"
I1004 14:32:10.156953       1 mount_linux.go:220] Mounting cmd (mount) with arguments (-t ext4 -o bind /local/csi/staging/cie-orch/ygersie-test[0]/rw-file-system-single-node-writer /local/csi/per-alloc/892e2aba-61fb-f2d3-1e29-a07ce47d715d/ygersie-test[0]/rw-file-system-single-node-writer)
I1004 14:32:10.164868       1 mount_linux.go:220] Mounting cmd (mount) with arguments (-t ext4 -o bind,remount /local/csi/staging/cie-orch/ygersie-test[0]/rw-file-system-single-node-writer /local/csi/per-alloc/892e2aba-61fb-f2d3-1e29-a07ce47d715d/ygersie-test[0]/rw-file-system-single-node-writer)
I1004 14:32:10.169105       1 node.go:437] "NodePublishVolume: volume operation finished" volumeId="vol-025ef963301234567"
I1004 14:32:10.169168       1 inflight.go:74] "Node Service: volume operation finished" key="vol-025ef963301234567"

csi-ebs-controller plugin logs 1

I1004 14:31:49.128796       1 controller.go:431] "ControllerUnpublishVolume: called" args={"volume_id":"vol-025ef963301234567","node_id":"i-0b2cad6b68d0f7cb2"}
I1004 14:31:49.128815       1 controller.go:444] "ControllerUnpublishVolume: detaching" volumeID="vol-025ef963301234567" nodeID="i-0b2cad6b68d0f7cb2"
I1004 14:31:49.644706       1 cloud.go:746] "Waiting for volume state" volumeID="vol-025ef963301234567" actual="detaching" desired="detached"
I1004 14:31:50.747221       1 cloud.go:746] "Waiting for volume state" volumeID="vol-025ef963301234567" actual="detaching" desired="detached"
I1004 14:31:52.602655       1 cloud.go:746] "Waiting for volume state" volumeID="vol-025ef963301234567" actual="detaching" desired="detached"
I1004 14:31:55.945694       1 controller.go:452] "ControllerUnpublishVolume: detached" volumeID="vol-025ef963301234567" nodeID="i-0b2cad6b68d0f7cb2"
I1004 14:31:55.945720       1 inflight.go:74] "Node Service: volume operation finished" key="vol-025ef963301234567"
I1004 14:31:55.958235       1 controller.go:431] "ControllerUnpublishVolume: called" args={"volume_id":"vol-025ef963301234567","node_id":"i-0b2cad6b68d0f7cb2"}
I1004 14:31:55.958263       1 controller.go:444] "ControllerUnpublishVolume: detaching" volumeID="vol-025ef963301234567" nodeID="i-0b2cad6b68d0f7cb2"
I1004 14:31:56.035361       1 cloud.go:620] "DetachDisk: called on non-attached volume" volumeID="vol-025ef963301234567"
I1004 14:31:56.250962       1 controller.go:447] "ControllerUnpublishVolume: attachment not found" volumeID="vol-025ef963301234567" nodeID="i-0b2cad6b68d0f7cb2"
I1004 14:31:56.250989       1 inflight.go:74] "Node Service: volume operation finished" key="vol-025ef963301234567"

csi-ebs-controller plugin logs 2

I1004 14:32:03.171015       1 controller.go:379] "ControllerPublishVolume: called" args={"volume_id":"vol-025ef963301234567","node_id":"i-0b2cad6b68d0f7cb2","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}}
I1004 14:32:03.171346       1 controller.go:392] "ControllerPublishVolume: attaching" volumeID="vol-025ef963301234567" nodeID="i-0b2cad6b68d0f7cb2"
I1004 14:32:03.639316       1 cloud.go:574] "[Debug] AttachVolume" volumeID="vol-025ef963301234567" nodeID="i-0b2cad6b68d0f7cb2" resp=<
	{
	  AttachTime: 2024-10-04 14:32:03.625 +0000 UTC,
	  Device: "/dev/xvdaa",
	  InstanceId: "i-0b2cad6b68d0f7cb2",
	  State: "attaching",
	  VolumeId: "vol-025ef963301234567"
	}
 >
I1004 14:32:03.766956       1 cloud.go:746] "Waiting for volume state" volumeID="vol-025ef963301234567" actual="attaching" desired="attached"
I1004 14:32:04.838842       1 cloud.go:746] "Waiting for volume state" volumeID="vol-025ef963301234567" actual="attaching" desired="attached"
I1004 14:32:05.966122       1 controller.go:470] "ControllerGetCapabilities: called" args={}
I1004 14:32:06.717816       1 cloud.go:746] "Waiting for volume state" volumeID="vol-025ef963301234567" actual="attaching" desired="attached"
I1004 14:32:10.065279       1 manager.go:189] "[Debug] Releasing in-process" attachment entry="/dev/xvdaa" volume="vol-025ef963301234567"
I1004 14:32:10.065307       1 controller.go:400] "ControllerPublishVolume: attached" volumeID="vol-025ef963301234567" nodeID="i-0b2cad6b68d0f7cb2" devicePath="/dev/xvdaa"
I1004 14:32:10.065317       1 inflight.go:74] "Node Service: volume operation finished" key="vol-025ef963301234567"

nomad client logs

    2024-10-04T14:31:49.069Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=f10806cb-1dce-6fd1-1700-f49c4869442d task=ygersie type=Killed msg="Task successfully killed" failed=false
    2024-10-04T14:31:49.079Z [INFO]  client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=f10806cb-1dce-6fd1-1700-f49c4869442d task=ygersie plugin=/usr/sbin/nomad id=1306
    2024-10-04T14:31:49.087Z [INFO]  client.gc: marking allocation for GC: alloc_id=f10806cb-1dce-6fd1-1700-f49c4869442d
    2024-10-04T14:31:49.101Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: rpc error: volume max claims reached" rpc=CSIVolume.Claim server=10.125.56.219:4647
    2024-10-04T14:31:49.101Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: rpc error: volume max claims reached" rpc=CSIVolume.Claim server=10.125.56.219:4647
    2024-10-04T14:31:51.112Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: rpc error: volume max claims reached" rpc=CSIVolume.Claim server=10.125.56.181:4647
    2024-10-04T14:31:51.112Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: rpc error: volume max claims reached" rpc=CSIVolume.Claim server=10.125.56.181:4647
    2024-10-04T14:31:55.121Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: volume max claims reached" rpc=CSIVolume.Claim server=10.125.56.140:4647
    2024-10-04T14:31:55.121Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: volume max claims reached" rpc=CSIVolume.Claim server=10.125.56.140:4647

If the allocation moves to a different node, I don't see the duplicate ControllerUnpublishVolume; however, I do always see the following on the node that is going to run the replacement allocation:

    2024-10-04T14:47:30.266Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=87fc7373-8c51-16ba-2bde-fa763370ea2d task=ygersie type=Received msg="Task received by client" failed=false
    2024-10-04T14:47:35.918Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: rpc error: volume max claims reached" rpc=CSIVolume.Claim server=10.125.56.219:4647
    2024-10-04T14:47:35.918Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: rpc error: volume max claims reached" rpc=CSIVolume.Claim server=10.125.56.219:4647
    2024-10-04T14:47:37.928Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: volume max claims reached" rpc=CSIVolume.Claim server=10.125.56.140:4647
    2024-10-04T14:47:37.928Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: volume max claims reached" rpc=CSIVolume.Claim server=10.125.56.140:4647
    2024-10-04T14:47:41.970Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: rpc error: volume max claims reached" rpc=CSIVolume.Claim server=10.125.56.181:4647
    2024-10-04T14:47:41.970Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: rpc error: volume max claims reached" rpc=CSIVolume.Claim server=10.125.56.181:4647
    2024-10-04T14:47:51.828Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=87fc7373-8c51-16ba-2bde-fa763370ea2d task=ygersie type="Task Setup" msg="Building Task Directory" failed=false
    2024-10-04T14:47:51.890Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=87fc7373-8c51-16ba-2bde-fa763370ea2d task=ygersie type=Driver msg="Downloading image" failed=false
    2024-10-04T14:47:53.858Z [INFO]  client.driver_mgr.docker: created container: driver=docker container_id=d935cc0940bb82b81e92facc2f8e9611427fc5d375afc14b8261bf1a63f1d95f
    2024-10-04T14:47:54.147Z [INFO]  client.driver_mgr.docker: started container: driver=docker container_id=d935cc0940bb82b81e92facc2f8e9611427fc5d375afc14b8261bf1a63f1d95f
    2024-10-04T14:47:54.183Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=87fc7373-8c51-16ba-2bde-fa763370ea2d task=ygersie type=Started msg="Task started by client" failed=false

I guess that, although logged as an error, this isn't really a problem: the client keeps retrying to gain a claim while the volume hasn't been completely unpublished yet.
