While looking into an issue that still sometimes leads to stuck CSI volumes, I ran into the following scenario: when I stop an allocation and it is rescheduled onto the same node, I see events on two CSI controller plugins instead of just one. It looks like the `ControllerUnpublishVolume` RPC is incorrectly called a second time. I'm not sure whether this will ever cause problems, but it's at least somewhat unexpected.
```
2024-10-04T14:31:49.069Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=f10806cb-1dce-6fd1-1700-f49c4869442d task=ygersie type=Killed msg="Task successfully killed" failed=false
2024-10-04T14:31:49.079Z [INFO] client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=f10806cb-1dce-6fd1-1700-f49c4869442d task=ygersie plugin=/usr/sbin/nomad id=1306
2024-10-04T14:31:49.087Z [INFO] client.gc: marking allocation for GC: alloc_id=f10806cb-1dce-6fd1-1700-f49c4869442d
2024-10-04T14:31:49.101Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: rpc error: volume max claims reached" rpc=CSIVolume.Claim server=10.125.56.219:4647
2024-10-04T14:31:49.101Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: rpc error: volume max claims reached" rpc=CSIVolume.Claim server=10.125.56.219:4647
2024-10-04T14:31:51.112Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: rpc error: volume max claims reached" rpc=CSIVolume.Claim server=10.125.56.181:4647
2024-10-04T14:31:51.112Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: rpc error: volume max claims reached" rpc=CSIVolume.Claim server=10.125.56.181:4647
2024-10-04T14:31:55.121Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: volume max claims reached" rpc=CSIVolume.Claim server=10.125.56.140:4647
2024-10-04T14:31:55.121Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: volume max claims reached" rpc=CSIVolume.Claim server=10.125.56.140:4647
```
If the allocation moves to a different node, I don't see the duplicate `ControllerUnpublishVolume`; however, I do always see these errors on the node that is going to run the replacement allocation:
```
2024-10-04T14:47:30.266Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=87fc7373-8c51-16ba-2bde-fa763370ea2d task=ygersie type=Received msg="Task received by client" failed=false
2024-10-04T14:47:35.918Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: rpc error: volume max claims reached" rpc=CSIVolume.Claim server=10.125.56.219:4647
2024-10-04T14:47:35.918Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: rpc error: volume max claims reached" rpc=CSIVolume.Claim server=10.125.56.219:4647
2024-10-04T14:47:37.928Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: volume max claims reached" rpc=CSIVolume.Claim server=10.125.56.140:4647
2024-10-04T14:47:37.928Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: volume max claims reached" rpc=CSIVolume.Claim server=10.125.56.140:4647
2024-10-04T14:47:41.970Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: rpc error: volume max claims reached" rpc=CSIVolume.Claim server=10.125.56.181:4647
2024-10-04T14:47:41.970Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: rpc error: volume max claims reached" rpc=CSIVolume.Claim server=10.125.56.181:4647
2024-10-04T14:47:51.828Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=87fc7373-8c51-16ba-2bde-fa763370ea2d task=ygersie type="Task Setup" msg="Building Task Directory" failed=false
2024-10-04T14:47:51.890Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=87fc7373-8c51-16ba-2bde-fa763370ea2d task=ygersie type=Driver msg="Downloading image" failed=false
2024-10-04T14:47:53.858Z [INFO] client.driver_mgr.docker: created container: driver=docker container_id=d935cc0940bb82b81e92facc2f8e9611427fc5d375afc14b8261bf1a63f1d95f
2024-10-04T14:47:54.147Z [INFO] client.driver_mgr.docker: started container: driver=docker container_id=d935cc0940bb82b81e92facc2f8e9611427fc5d375afc14b8261bf1a63f1d95f
2024-10-04T14:47:54.183Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=87fc7373-8c51-16ba-2bde-fa763370ea2d task=ygersie type=Started msg="Task started by client" failed=false
```
Although these are logged as errors, I guess they aren't really a problem: the client keeps retrying to gain a claim while the volume hasn't been completely unpublished yet.
Nomad version
1.8.2+ent