Timed out after 180.001s. waiting for cluster deletion timed out #11162

adilGhaffarDev · 2024-09-10T10:02:04Z

Which jobs are flaking?

periodic-cluster-api-e2e-mink8s-release-1-8
periodic-cluster-api-e2e-main
periodic-cluster-api-e2e-latestk8s-main
periodic-cluster-api-e2e-release-1-8

Which tests are flaking?

capi-e2e [It] When testing Cluster API working on self-hosted clusters using ClusterClass with a HA control plane [ClusterClass] Should pivot the bootstrap cluster to a self-hosted cluster
capi-e2e [It] When testing Cluster API working on self-hosted clusters using ClusterClass [ClusterClass] Should pivot the bootstrap cluster to a self-hosted cluster
capi-e2e [It] When testing ClusterClass changes [ClusterClass] Should successfully rollout the managed topology upon changes to the ClusterClass

Since when has it been flaking?

there were few before, more flakes after 6-9-2024

Testgrid link

https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-release-1-8/1833344003316125696

https://storage.googleapis.com/k8s-triage/index.html?job=.*-cluster-api-.*&xjob=.*-provider-.*%7C.*-operator-.*#8899ccb732f9f0e048cb

Reason for failure (if possible)

MachinePool deletion is stuck

Anything else we need to know?

No response

Label(s) to be applied

/kind flake
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

adilGhaffarDev · 2024-09-10T10:04:26Z

cc @kubernetes-sigs/cluster-api-release-team

sbueringer · 2024-09-10T10:13:25Z

I took an initial look and the cluster deletion is stuck because MachinePools are stuck in deletion

cc @jackfrancis @willie-yao @Jont828
(feel free to cc other folks that are interested in help maintaining MachinePools)

jackfrancis · 2024-09-10T17:57:21Z

@sbueringer thx for triaging

sbueringer · 2024-09-12T05:22:03Z

Given how often this test fails and how long it has been failing, I think we should consider removing machinepools from the affected tests.

There's a high chance we are missing other issues in the rest of Cluster API because of this flake.

willie-yao · 2024-09-16T20:05:54Z

I can take a look at this if there are no other more urgent MP-related issues @sbueringer

sbueringer · 2024-09-17T09:19:34Z

I can't prioritize MP issues for MP maintainers. But from a CI stability perspective this one is really important.

I think either we can fix it soon, or we have to disable MPs for all affected tests. I just don't want to take the risk for much longer that this flake is hiding other issues.

Sunnatillo · 2024-10-23T16:41:00Z

/assign @serngawy

k8s-ci-robot · 2024-10-23T16:41:03Z

@Sunnatillo: GitHub didn't allow me to assign the following users: serngawy.

Note that only kubernetes-sigs members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @serngawy

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

serngawy · 2024-11-05T19:34:41Z

Hi,
adding comment to keep the issue investigation updated. Cannot reproduce the 3 tests failure locally, one thing should be mentioned the tests failed when the test env has the fs.inotify is set to its default value 128. The Capd-manager logs throwing error "Failed to allocate manager object: Too many open files." . After increasing the fs.inotify to 512 sysctl -w fs.inotify.max_user_instances=512 the tests never failed.

serngawy · 2024-11-05T19:35:11Z

/assign @serngawy

sbueringer · 2024-11-05T19:38:12Z

@serngawy That's a known issue documented here: https://cluster-api.sigs.k8s.io/user/troubleshooting#macos-and-docker-desktop----too-many-open-files

sbueringer · 2024-11-06T19:11:26Z

@serngawy Just sharing what I found when I was looking at this issue previously.

Concrete example: https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/periodic-cluster-api-e2e-main/1853889499981942784

Test flow: When testing Cluster API working on self-hosted clusters using ClusterClass with a HA control plane
https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/periodic-cluster-api-e2e-main/1853889499981942784 (click on open stderr)

Creating the workload cluster with name "self-hosted-zdnwwf"
Turning the workload cluster into a management cluster
clusterctl init against this new mgmt cluster (which will be the self-hosted cluster)
Moving the cluster to self hosted (clusterctl move) (21:36:21.532)
Upgrading the self-hosted cluster
Upgrade done
Moving the cluster back to bootstrap (clusterctl move)
Deleting cluster
=> cluster deletion times out

Resources of the cluster after timeout: https://gcsweb.k8s.io/gcs/kubernetes-ci-logs/logs/periodic-cluster-api-e2e-main/1853889499981942784/artifacts/clusters-afterDeletionTimedOut/self-hosted-zdnwwf/resources/self-hosted-enizsm/

Noteworthy:

MachinePool has deletionTimestamp but is still there: https://storage.googleapis.com/kubernetes-ci-logs/logs/periodic-cluster-api-e2e-main/1853889499981942784/artifacts/clusters-afterDeletionTimedOut/self-hosted-zdnwwf/resources/self-hosted-enizsm/MachinePool/self-hosted-zdnwwf-mp-0-7p4ps.yaml
DockerMachine "worker-ftgwry" that has a deletionTimestamp belonging to that MachinePool: https://storage.googleapis.com/kubernetes-ci-logs/logs/periodic-cluster-api-e2e-main/1853889499981942784/artifacts/clusters-afterDeletionTimedOut/self-hosted-zdnwwf/resources/self-hosted-enizsm/DockerMachine/worker-ftgwry.yaml

CAPD logs:

1105 21:56:25.380515       1 machine.go:545] "Deleting machine container" controller="dockermachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="DockerMachine" DockerMachine="self-hosted-enizsm/worker-cnyosl" namespace="self-hosted-enizsm" name="worker-cnyosl" reconcileID="f6281306-f8eb-4e1f-9801-2d6ae3b8ea38" DockerMachinePool="self-hosted-enizsm/self-hosted-zdnwwf-mp-0-7jnzg" Machine="self-hosted-enizsm/worker-cnyosl" Machine="self-hosted-enizsm/worker-cnyosl" Cluster="self-hosted-enizsm/self-hosted-zdnwwf"
I1105 21:56:27.018308       1 dockermachinepool_controller.go:213] "DockerMachinePool still has dependent DockerMachines, deleting them first and requeuing" controller="dockermachinepool" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="DockerMachinePool" DockerMachinePool="self-hosted-enizsm/self-hosted-zdnwwf-mp-0-7jnzg" namespace="self-hosted-enizsm" name="self-hosted-zdnwwf-mp-0-7jnzg" reconcileID="cbf73c06-b700-445b-bad2-8199f4956373" MachinePool="self-hosted-zdnwwf-mp-0-7p4ps" Cluster="self-hosted-enizsm/self-hosted-zdnwwf" count=1

If I see correctly the uid in the owner of DockerMachine does not match the uid of the DockerMachinePool
Also the CAPD controller tries to delete DockerMachine self-hosted-enizsm/worker-cnyosl while we only have DockerMachine: self-hosted-enizsm/worker-ftgwry

DockerMachine: worker-ftgwry

  creationTimestamp: "2024-11-05T21:36:37Z"
  ownerReferences:
  - apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: DockerMachinePool
    name: self-hosted-zdnwwf-mp-0-7jnzg
    uid: 1c5a4578-ea51-490d-8e96-ca3287b129d8

Note that this DockerMachine was created on the bootstrap (kind) cluster ~ 16 seconds after we triggered clusterctl move.

DockerMachinePool: self-hosted-zdnwwf-mp-0-7jnzg

  name: self-hosted-zdnwwf-mp-0-7jnzg
  uid: 6cffc598-a6dc-4766-ac3d-b6e2024b5d92

My initial guess is that there is something going wrong with clusterctl move and the DockerMachines.

I would recommend to go through the test locally to see how it is supposed to happen. Then we can try to see via the artifacts folder in Prow at which point the failed tests diverge from this behavior. If the current data in the artifacts folder is not enough we can consider adding more logs / data.

serngawy · 2024-11-08T20:39:33Z

Here are some finding after investigation;
For the "self-host" tests; after a successful run (not failed) checking the logs and CRs one dockerMachine still remain available even after the dockerMachinePool is deleted. In the logs below the dockerMachinePool was patching and suddenly a warning message from KubeAPIWarningLogger about finalizer name show up and then dockerMachinePool is gone. That make the dockerMachine CR stays forever.
Checking the code here this is wrong. When the dockerMachine has no ownerRef for a machine and only ownerRer for dockerMachinePool that make the dockerMachine never get deleted. AND here assuming that dokcer/machinePool will create the machine is not correct this will never happen.

My guessing is for a failed test run we hit this race condition and the DockerMachine stay forever which make dockerMachinePool remain and the dockerCluster remain ...etc. (Not sure what happen in my test run make the dockerMachinePool finalizer get removed so the dockerCluster get deleted and test pass as successful)

I will change the logic for the deleteMachinePoolMachine then check if we have better test results.

I1108 19:22:16.526651       1 dockermachinepool_controller_phases.go:118] "Reconciling DockerMachines" controller="dockermachinepool" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="DockerMachinePool" namespace="self-hosted-an99i8" name="self-hosted-jdn4p2-mp-0-t7zrj" reconcileID="de9d43ca-372e-4a10-aab3-12a117ffe081" MachinePool="self-hosted-jdn4p2-mp-0-zjs5c" Cluster="self-hosted-an99i8/self-hosted-jdn4p2" DockerMachinePool="self-hosted-an99i8/self-hosted-jdn4p2-mp-0-t7zrj"
I1108 19:22:16.527535       1 dockermachinepool_controller_phases.go:147] "Patching existing DockerMachine" controller="dockermachinepool" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="DockerMachinePool" DockerMachinePool="self-hosted-an99i8/self-hosted-jdn4p2-mp-0-t7zrj" namespace="self-hosted-an99i8" name="self-hosted-jdn4p2-mp-0-t7zrj" reconcileID="de9d43ca-372e-4a10-aab3-12a117ffe081" MachinePool="self-hosted-jdn4p2-mp-0-zjs5c" Cluster="self-hosted-an99i8/self-hosted-jdn4p2" DockerMachine="self-hosted-an99i8/worker-gnn71g"
I1108 19:22:16.550604       1 warning_handler.go:65] "metadata.finalizers: \"dockermachinepool.infrastructure.cluster.x-k8s.io\": prefer a domain-qualified finalizer name to avoid accidental conflicts with other finalizer writers" logger="KubeAPIWarningLogger"
E1108 19:22:16.550688       1 dockermachinepool_controller.go:133] "Failed to patch DockerMachinePool" err="failed to patch DockerMachinePool self-hosted-an99i8/self-hosted-jdn4p2-mp-0-t7zrj: dockermachinepools.infrastructure.cluster.x-k8s.io \"self-hosted-jdn4p2-mp-0-t7zrj\" not found" controller="dockermachinepool" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="DockerMachinePool" DockerMachinePool="self-hosted-an99i8/self-hosted-jdn4p2-mp-0-t7zrj" namespace="self-hosted-an99i8" name="self-hosted-jdn4p2-mp-0-t7zrj" reconcileID="1b291049-8764-45f5-b595-4f3faaf8e3f1" MachinePool="self-hosted-jdn4p2-mp-0-zjs5c" Cluster="self-hosted-an99i8/self-hosted-jdn4p2"
E1108 19:22:16.550722       1 controller.go:316] "Reconciler error" err="failed to patch DockerMachinePool self-hosted-an99i8/self-hosted-jdn4p2-mp-0-t7zrj: dockermachinepools.infrastructure.cluster.x-k8s.io \"self-hosted-jdn4p2-mp-0-t7zrj\" not found" controller="dockermachinepool" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="DockerMachinePool" DockerMachinePool="self-hosted-an99i8/self-hosted-jdn4p2-mp-0-t7zrj" namespace="self-hosted-an99i8" name="self-hosted-jdn4p2-mp-0-t7zrj" reconcileID="1b291049-8764-45f5-b595-4f3faaf8e3f1"
I1108 19:22:16.563893       1 dockermachine_controller.go:104] "Waiting for Machine Controller to set OwnerRef on DockerMachine" controller="dockermachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="DockerMachine" DockerMachine="self-hosted-an99i8/worker-gnn71g" namespace="self-hosted-an99i8" name="worker-gnn71g" reconcileID="eb740c15-c1d1-4dd9-a6bf-9f686c27d6d3" DockerMachinePool="self-hosted-an99i8/self-hosted-jdn4p2-mp-0-t7zrj"
I1108 19:22:16.835334       1 dockermachine_controller.go:104] "Waiting for Machine Controller to set OwnerRef on DockerMachine" controller="dockermachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="DockerMachine" DockerMachine="self-hosted-an99i8/worker-gnn71g" namespace="self-hosted-an99i8" name="worker-gnn71g" reconcileID="f69f5782-9b50-4092-afe8-1746726f94ca" DockerMachinePool="self-hosted-an99i8/self-hosted-jdn4p2-mp-0-t7zrj"

sbueringer · 2024-11-12T19:03:22Z

Feel free to open a PR. Not sure if I understood how you want to change the code.

In the example case I had above the DockerMachine already has a deletionTimestamp: https://storage.googleapis.com/kubernetes-ci-logs/logs/periodic-cluster-api-e2e-main/1853889499981942784/artifacts/clusters-afterDeletionTimedOut/self-hosted-zdnwwf/resources/self-hosted-enizsm/DockerMachine/worker-ftgwry.yaml

So changing this if (

cluster-api/test/infrastructure/docker/exp/internal/controllers/dockermachinepool_controller_phases.go

Line 290 in 6d30801

if machine == nil {

) won't change anything.

It still seems to me that clusterctl move is racy with MachinePools (#11162 (comment)).

k8s-ci-robot added kind/flake Categorizes issue or PR as related to a flaky test. needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 10, 2024

sbueringer mentioned this issue Sep 10, 2024

🐛 Make KCP pre-terminate hook more robust #11161

Merged

k8s-ci-robot removed needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 11, 2024

Sunnatillo added this to CAPI v1.9 release improvement tasks Sep 11, 2024

sbueringer mentioned this issue Sep 12, 2024

Upgrading a workload cluster using ClusterClass with RuntimeSDK test is flaky with error: Resource versions didn't stay stable #10838

Open

k8s-ci-robot assigned serngawy Nov 5, 2024

sbueringer added the area/machinepool Issues or PRs related to machinepools label Nov 6, 2024

sbueringer mentioned this issue Nov 6, 2024

Graduate MachinePools from experimental to stable API #9005

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Timed out after 180.001s. waiting for cluster deletion timed out #11162

Timed out after 180.001s. waiting for cluster deletion timed out #11162

adilGhaffarDev commented Sep 10, 2024 •

edited

Loading

adilGhaffarDev commented Sep 10, 2024

sbueringer commented Sep 10, 2024

jackfrancis commented Sep 10, 2024

sbueringer commented Sep 12, 2024 •

edited

Loading

willie-yao commented Sep 16, 2024

sbueringer commented Sep 17, 2024 •

edited

Loading

Sunnatillo commented Oct 23, 2024

k8s-ci-robot commented Oct 23, 2024

serngawy commented Nov 5, 2024

serngawy commented Nov 5, 2024

sbueringer commented Nov 5, 2024 •

edited

Loading

sbueringer commented Nov 6, 2024

serngawy commented Nov 8, 2024

sbueringer commented Nov 12, 2024 •

edited

Loading

Timed out after 180.001s. waiting for cluster deletion timed out #11162

Timed out after 180.001s. waiting for cluster deletion timed out #11162

Comments

adilGhaffarDev commented Sep 10, 2024 • edited Loading

Which jobs are flaking?

Which tests are flaking?

Since when has it been flaking?

Testgrid link

Reason for failure (if possible)

Anything else we need to know?

Label(s) to be applied

adilGhaffarDev commented Sep 10, 2024

sbueringer commented Sep 10, 2024

jackfrancis commented Sep 10, 2024

sbueringer commented Sep 12, 2024 • edited Loading

willie-yao commented Sep 16, 2024

sbueringer commented Sep 17, 2024 • edited Loading

Sunnatillo commented Oct 23, 2024

k8s-ci-robot commented Oct 23, 2024

serngawy commented Nov 5, 2024

serngawy commented Nov 5, 2024

sbueringer commented Nov 5, 2024 • edited Loading

sbueringer commented Nov 6, 2024

serngawy commented Nov 8, 2024

sbueringer commented Nov 12, 2024 • edited Loading

adilGhaffarDev commented Sep 10, 2024 •

edited

Loading

sbueringer commented Sep 12, 2024 •

edited

Loading

sbueringer commented Sep 17, 2024 •

edited

Loading

sbueringer commented Nov 5, 2024 •

edited

Loading

sbueringer commented Nov 12, 2024 •

edited

Loading