Timed out after 180.001s. waiting for cluster deletion timed out #11162
Comments
cc @kubernetes-sigs/cluster-api-release-team
I took an initial look; the cluster deletion is stuck because MachinePools are stuck in deletion. cc @jackfrancis @willie-yao @Jont828
@sbueringer thanks for triaging
Given how often this test fails and how long it has been failing, I think we should consider removing MachinePools from the affected tests. There is a high chance we are missing other issues in the rest of Cluster API because of this flake.
I can take a look at this if there are no other, more urgent MP-related issues @sbueringer
I can't prioritize MP issues for the MP maintainers. But from a CI stability perspective this one is really important. I think either we fix it soon, or we have to disable MPs for all affected tests. I just don't want to take the risk for much longer that this flake is hiding other issues.
/assign @serngawy
@Sunnatillo: GitHub didn't allow me to assign the following users: serngawy. Note that only kubernetes-sigs members with read permissions, repo collaborators, and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. In response to this: /assign @serngawy
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Hi,
/assign @serngawy
@serngawy That's a known issue documented here: https://cluster-api.sigs.k8s.io/user/troubleshooting#macos-and-docker-desktop----too-many-open-files
@serngawy Just sharing what I found when I was looking at this issue previously.
Concrete example: https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/periodic-cluster-api-e2e-main/1853889499981942784
Test flow: "When testing Cluster API working on self-hosted clusters using ClusterClass with a HA control plane"
Resources of the cluster after timeout: https://gcsweb.k8s.io/gcs/kubernetes-ci-logs/logs/periodic-cluster-api-e2e-main/1853889499981942784/artifacts/clusters-afterDeletionTimedOut/self-hosted-zdnwwf/resources/self-hosted-enizsm/
Noteworthy:
CAPD logs:
If I see correctly, the DockerMachine worker-ftgwry:

    creationTimestamp: "2024-11-05T21:36:37Z"
    ownerReferences:
    - apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
      kind: DockerMachinePool
      name: self-hosted-zdnwwf-mp-0-7jnzg
      uid: 1c5a4578-ea51-490d-8e96-ca3287b129d8

Note that this DockerMachine was created on the bootstrap (kind) cluster ~16 seconds after we triggered clusterctl move.

DockerMachinePool self-hosted-zdnwwf-mp-0-7jnzg:

    name: self-hosted-zdnwwf-mp-0-7jnzg
    uid: 6cffc598-a6dc-4766-ac3d-b6e2024b5d92

(Note that the ownerReference on the DockerMachine points at a DockerMachinePool UID that differs from the UID of the DockerMachinePool shown above.)

My initial guess is that something is going wrong with clusterctl move and the DockerMachines. I would recommend going through the test locally to see how it is supposed to happen. Then we can try to see, via the artifacts folder in Prow, at which point the failed tests diverge from this behavior. If the current data in the artifacts folder is not enough, we can consider adding more logs/data.
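If someone wants to reproduce this locally, one way to compare what the bootstrap and self-hosted clusters see is a small controller-runtime program that dumps DockerMachines and their ownerReferences. This is only a hypothetical debugging sketch (not part of the e2e suite), assuming a kubeconfig pointing at the cluster you want to inspect:

```go
package main

import (
	"context"
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/client/config"
)

func main() {
	ctx := context.Background()

	// Uses the current kubeconfig context; point it at the bootstrap (kind)
	// cluster or the self-hosted cluster depending on what you want to inspect.
	c, err := client.New(config.GetConfigOrDie(), client.Options{})
	if err != nil {
		panic(err)
	}

	// List DockerMachines as unstructured objects so no CAPD scheme is needed.
	list := &unstructured.UnstructuredList{}
	list.SetGroupVersionKind(schema.GroupVersionKind{
		Group:   "infrastructure.cluster.x-k8s.io",
		Version: "v1beta1",
		Kind:    "DockerMachineList",
	})
	if err := c.List(ctx, list); err != nil {
		panic(err)
	}

	for _, dm := range list.Items {
		fmt.Printf("%s/%s created=%s deletionTimestamp=%v\n",
			dm.GetNamespace(), dm.GetName(), dm.GetCreationTimestamp(), dm.GetDeletionTimestamp())
		for _, ref := range dm.GetOwnerReferences() {
			fmt.Printf("  owner: %s %s uid=%s\n", ref.Kind, ref.Name, ref.UID)
		}
	}
}
```

Running this against both clusters during a local run of the test should make it easier to spot at which point the objects (and their owner UIDs) diverge from the expected flow.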
Here are some findings after investigation. My guess is that in a failed test run we hit this race condition and the DockerMachine stays forever, which makes the DockerMachinePool remain, which in turn makes the DockerCluster remain, etc. (I am not sure what happened in my test run that made the DockerMachinePool finalizer get removed, so the DockerCluster got deleted and the test passed.) I will change the logic for deleteMachinePoolMachine and then check whether we get better test results.
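As a reference point for that kind of change, here is a minimal, hypothetical sketch (this is not the real deleteMachinePoolMachine, and the helper name is made up) of a check that tolerates the race where the owning Machine is already deleted. When it returns true, the deletion path could remove the DockerMachine finalizer (e.g. with controllerutil.RemoveFinalizer) instead of requeueing forever:

```go
// Hypothetical sketch (not the actual CAPD deleteMachinePoolMachine) of a
// race-tolerant check: if the owning CAPI Machine is already gone, the
// DockerMachine finalizer no longer has anything to wait for.
package controllers

import (
	"context"
	"strings"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ownerMachineGone returns true if the Machine referenced by the infra
// machine's ownerReferences no longer exists (or was never set).
func ownerMachineGone(ctx context.Context, c client.Client, infraMachine client.Object) (bool, error) {
	for _, ref := range infraMachine.GetOwnerReferences() {
		if ref.Kind != "Machine" || !strings.HasPrefix(ref.APIVersion, clusterv1.GroupVersion.Group) {
			continue
		}
		m := &clusterv1.Machine{}
		key := client.ObjectKey{Namespace: infraMachine.GetNamespace(), Name: ref.Name}
		if err := c.Get(ctx, key, m); err != nil {
			if apierrors.IsNotFound(err) {
				return true, nil // the owner Machine has been deleted
			}
			return false, err
		}
		return false, nil // the owner Machine still exists; use the normal deletion flow
	}
	return true, nil // no Machine owner recorded on the infra machine
}
```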
Feel free to open a PR. Not sure if I understood how you want to change the code. In the example case I had above, the DockerMachine already has a deletionTimestamp: https://storage.googleapis.com/kubernetes-ci-logs/logs/periodic-cluster-api-e2e-main/1853889499981942784/artifacts/clusters-afterDeletionTimedOut/self-hosted-zdnwwf/resources/self-hosted-enizsm/DockerMachine/worker-ftgwry.yaml So changing this `if` (Line 290 in 6d30801)
It still seems to me that clusterctl move is racy with MachinePools (#11162 (comment)).
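For background on why a race with clusterctl move is plausible: move pauses the Cluster before objects are moved, and controllers are expected to skip reconciliation of paused objects. Below is a generic sketch of that standard guard (the function name and wiring are assumptions, not CAPD's actual reconciler); anything a controller creates before it observes the pause, or outside this guard, can end up referencing stale UIDs after the move:

```go
// Generic "skip reconcile while paused" guard used across Cluster API
// controllers (sketch only; not CAPD's actual code).
package controllers

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/annotations"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func reconcileIfNotPaused(ctx context.Context, cluster *clusterv1.Cluster, obj client.Object) (ctrl.Result, error) {
	// clusterctl move sets Cluster.Spec.Paused (or the paused annotation)
	// before moving objects; reconcilers return early while paused.
	if annotations.IsPaused(cluster, obj) {
		return ctrl.Result{}, nil
	}
	// ... normal reconciliation of the object would continue here ...
	return ctrl.Result{}, nil
}
```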
Which jobs are flaking?
Which tests are flaking?
Since when has it been flaking?
There were a few flakes before; more flakes after 6-9-2024.
Testgrid link
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-release-1-8/1833344003316125696
https://storage.googleapis.com/k8s-triage/index.html?job=.*-cluster-api-.*&xjob=.*-provider-.*%7C.*-operator-.*#8899ccb732f9f0e048cb
Reason for failure (if possible)
MachinePool deletion is stuck
Anything else we need to know?
No response
Label(s) to be applied
/kind flake
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.