CAPV controller manager stuck during reconcile #2832

jansoukup · 2024-03-18T17:07:12Z

/kind bug

We have 1 management cluster with 7 workload clusters. Each workload cluster has ~25 worker nodes.
Sometimes during the reconciliation of all workload clusters, CAPV stops reconciling without any significant information in the logs (nor in CAPI logs). No new VMs are visible in vCenter, nothing is deleted, and new Machines remain in the "Provisioning" state indefinitely.
The quickest fix is to restart the CAPV deployment, after which everything runs smoothly again.

CAPV controller manager:

I0315 16:23:04.608993       1 vimmachine.go:432] "capv-controller-manager/vspheremachine-controller/cluster-1-test/cluster-1-test-worker-ce3e0b-8rg5v: updated vm" vm="cluster-1-test/cluster-1-test-worker-ktfbx-7lrt2"
I0315 16:23:04.653791       1 http.go:143] "controller-runtime/webhook/webhooks: wrote response" webhook="/validate-infrastructure-cluster-x-k8s-io-v1beta1-vspherevm" code=200 reason= UID=8c8eb2a6-ccb9-4663-b78a-aee2dc62a9f4 allowed=true
I0315 16:23:04.645912       1 request.go:622] Waited for 91.100271ms due to client-side throttling, not priority and fairness, request: PATCH:https://10.43.0.1:443/apis/infrastructure.cluster.x-k8s.io/v1beta1/namespaces/cluster-2-test/vspherevms/cluster-2-test-master-mnmsp
I0315 16:23:04.653134       1 http.go:96] "controller-runtime/webhook/webhooks: received request" webhook="/validate-infrastructure-cluster-x-k8s-io-v1beta1-vspherevm" UID=8c8eb2a6-ccb9-4663-b78a-aee2dc62a9f4 kind="infrastructure.cluster.x-k8s.io/v1beta1, Kind=VSphereVM" resource={Group:infrastructure.cluster.x-k8s.io Version:v1beta1 Resource:vspherevms}
I0315 16:23:04.696343       1 request.go:622] Waited for 138.639409ms due to client-side throttling, not priority and fairness, request: PATCH:https://10.43.0.1:443/apis/infrastructure.cluster.x-k8s.io/v1beta1/namespaces/cluster-1-test/vspherevms/cluster-1-test-worker-ktfbx-rvsbd
I0315 16:23:04.658346       1 vimmachine.go:432] "capv-controller-manager/vspheremachine-controller/cluster-2-test/cluster-2-test-master-87ac03-9lxbp: updated vm" vm="cluster-2-test/cluster-2-test-master-mnmsp"
I0315 16:24:21.806753       1 http.go:96] "controller-runtime/webhook/webhooks: received request" webhook="/validate-infrastructure-cluster-x-k8s-io-v1beta1-vspheremachine" UID=518c3035-294d-4a27-940b-6099a1e849f3 kind="infrastructure.cluster.x-k8s.io/v1beta1, Kind=VSphereMachine" resource={Group:infrastructure.cluster.x-k8s.io Version:v1beta1 Resource:vspheremachines}
I0315 16:24:21.807204       1 http.go:143] "controller-runtime/webhook/webhooks: wrote response" webhook="/validate-infrastructure-cluster-x-k8s-io-v1beta1-vspheremachine" code=200 reason= UID=518c3035-294d-4a27-940b-6099a1e849f3 allowed=true
I0315 16:24:21.904439       1 http.go:96] "controller-runtime/webhook/webhooks: received request" webhook="/mutate-infrastructure-cluster-x-k8s-io-v1beta1-vspheremachine" UID=db175ab7-9f4c-4554-9d74-73756c742219 kind="infrastructure.cluster.x-k8s.io/v1beta1, Kind=VSphereMachine" resource={Group:infrastructure.cluster.x-k8s.io Version:v1beta1 Resource:vspheremachines}
I0315 16:24:21.905326       1 http.go:143] "controller-runtime/webhook/webhooks: wrote response" webhook="/mutate-infrastructure-cluster-x-k8s-io-v1beta1-vspheremachine" code=200 reason= UID=db175ab7-9f4c-4554-9d74-73756c742219 allowed=true
I0315 16:24:21.908133       1 http.go:96] "controller-runtime/webhook/webhooks: received request" webhook="/validate-infrastructure-cluster-x-k8s-io-v1beta1-vspheremachine" UID=a3c5edd2-4f72-4b3c-a5ef-5a6079e753f5 kind="infrastructure.cluster.x-k8s.io/v1beta1, Kind=VSphereMachine" resource={Group:infrastructure.cluster.x-k8s.io Version:v1beta1 Resource:vspheremachines}
I0315 16:24:21.909097       1 http.go:143] "controller-runtime/webhook/webhooks: wrote response" webhook="/validate-infrastructure-cluster-x-k8s-io-v1beta1-vspheremachine" code=200 reason= UID=a3c5edd2-4f72-4b3c-a5ef-5a6079e753f5 allowed=true

CAPI controller manager:

I0315 15:17:35.753443       1 machine_controller_phases.go:286] "Waiting for infrastructure provider to create machine infrastructure and report status.ready" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="cluster-2-test/cluster-2-test-worker-2z784-52fc2" namespace="cluster-2-test" name="cluster-2-test-worker-2z784-52fc2" reconcileID=f9599a99-1da2-4872-b531-98446f266c73 MachineSet="cluster-2-test/cluster-2-test-worker-2z784" MachineDeployment="cluster-2-test/cluster-2-test-worker" Cluster="cluster-2-test/cluster-2-test" VSphereMachine="cluster-2-test/cluster-2-test-worker-1e6e95-4cvmt"

Omit state where CAPV runs in this strange state, without any info in logs.

Our workaround is scheduled Job for CAPV restart twice per day.

Environment:

Cluster-api-provider-vsphere version: 1.7.5
Kubernetes version: (use kubectl version): 1.24.17
OS (e.g. from /etc/os-release): Ubuntu 22.04

The text was updated successfully, but these errors were encountered:

sbueringer · 2024-03-22T10:46:52Z

I would guess that maybe a controller is stuck.

This could be confirmed via metrics (active workers or something) and via a go routine dump of the controller (via kill -ABRT )

Jellyfrog · 2024-06-13T11:59:49Z

We feel we have the same problem, especially when deleting clusters nothing really happens until we restart CAPV.
Will try to dump the controller next time

chrischdi · 2024-06-13T12:19:22Z

We feel we have the same problem, especially when deleting clusters nothing really happens until we restart CAPV. Will try to dump the controller next time

Could you please note which version of CAPV you had been using when this issue occured?

Jellyfrog · 2024-06-13T12:33:39Z

for this env the combo is:

NAME                     NAMESPACE       TYPE                     CURRENT VERSION   NEXT VERSION
addon-helm               caaph-system    AddonProvider            v0.2.3            v0.2.4
bootstrap-talos          cabpt-system    BootstrapProvider        v0.6.5            Already up to date
control-plane-talos      cacppt-system   ControlPlaneProvider     v0.5.6            Already up to date
cluster-api              capi-system     CoreProvider             v1.7.2            v1.7.3
infrastructure-vsphere   capv-system     InfrastructureProvider   v1.10.0           v1.10.1

ill try updating all of them and see.

sbueringer · 2024-06-13T15:37:50Z

Please check if you have --enable-keep-alive set: #2896. It should not be set, it can lead to deadlocks and we've dropped it on main already.

Jellyfrog · 2024-06-14T14:10:27Z

Please check if you have --enable-keep-alive set: #2896. It should not be set, it can lead to deadlocks and we've dropped it on main already.

Not set!

sbueringer · 2024-07-04T08:48:21Z

I would guess that maybe a controller is stuck.

This could be confirmed via metrics (active workers or something) and via a go routine dump of the controller (via kill -ABRT )

^^ Should help to figure out where the controller is stuck

mslga · 2024-08-28T08:06:39Z

Same bug after update CAPV controller from v1.8.4 to v1.10.0
CAPI from v1.5.3 to v1.7.2

sbueringer · 2024-08-28T09:25:24Z

There's no way for anyone to debug this without a go routine dump / stack traces. Until then we can only recommend for anyone using older versions to ensure --enable-keep-alive is set to false

k8s-triage-robot · 2024-11-26T09:30:10Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · 2024-12-26T10:08:56Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Mar 18, 2024

sbueringer added the area/govmomi Issues or PRs related to the govmomi mode label Apr 18, 2024

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 26, 2024

k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CAPV controller manager stuck during reconcile #2832

CAPV controller manager stuck during reconcile #2832

jansoukup commented Mar 18, 2024

sbueringer commented Mar 22, 2024 •

edited

Loading

Jellyfrog commented Jun 13, 2024 •

edited

Loading

chrischdi commented Jun 13, 2024

Jellyfrog commented Jun 13, 2024

sbueringer commented Jun 13, 2024

Jellyfrog commented Jun 14, 2024 •

edited

Loading

sbueringer commented Jul 4, 2024

mslga commented Aug 28, 2024

sbueringer commented Aug 28, 2024

k8s-triage-robot commented Nov 26, 2024

k8s-triage-robot commented Dec 26, 2024

CAPV controller manager stuck during reconcile #2832

CAPV controller manager stuck during reconcile #2832

Comments

jansoukup commented Mar 18, 2024

sbueringer commented Mar 22, 2024 • edited Loading

Jellyfrog commented Jun 13, 2024 • edited Loading

chrischdi commented Jun 13, 2024

Jellyfrog commented Jun 13, 2024

sbueringer commented Jun 13, 2024

Jellyfrog commented Jun 14, 2024 • edited Loading

sbueringer commented Jul 4, 2024

mslga commented Aug 28, 2024

sbueringer commented Aug 28, 2024

k8s-triage-robot commented Nov 26, 2024

k8s-triage-robot commented Dec 26, 2024

sbueringer commented Mar 22, 2024 •

edited

Loading

Jellyfrog commented Jun 13, 2024 •

edited

Loading

Jellyfrog commented Jun 14, 2024 •

edited

Loading