-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MD.Status.ReadyReplicas changes from 3 to 0 when machineset_controller updateStatus() hits "Unable to retrieve Node status" error #10195
Comments
/triage needs-discussion I would like to get @vincepri, @sbueringer, and @JoelSpeed opinions on this one. Currently, we consider available a replica for which we know it has node ready, and this seems semantically correct to me. The downside of this formulation is that available can flick whenever the node status changes, or whenever there are connection problems to the workload cluster and we cannot retrieve the node status anymore (like in this use case). If this is still the behavior we all want, then IMO the behavior of KCP and MD is correct: they should both reduce the number of available replicas based on the info available at a given time. However, what we can do is
|
Thanks @fabriziopandini. The error ErrClusterLocked should be gone in a short time, so marking the Node as notReady or unknown replica immediately after hitting error ErrClusterLocked might be over responsive. Also considering kube-controller-manager marks a Node as unhealthy after 40s unresponsive state.
|
BTW this could also impacted by #9810 discussed in #10165 (comment) |
Yes I think we may want to take a leaf out of KCMs book here and not immediately flick to the unready state. I would expect in the real world that users monitor things like unready nodes and, want to remediate that situation. Going unready early may lead to false positive alerts. I think in this case specifically, the |
I made a PR to fix this bug with a simple approach (not implementing unknownReplicas). Please kindly take a look. Thanks! |
/area machineset |
/priority important-longterm |
Note: CAPI v1.7 has fixes for the Did we already test if this is still an issue with v1.7.0-rc.1 or so? (v1.7.0 will be released today) |
Thanks @chrischdi. That PR solves the reconcile latence when hitting ErrClusterLocked errors. This issue describes the issue " |
Yeah just noted when reading code: The fix it may not directly change the above behavior, but it would lead to retrying every minute. |
This issue is currently awaiting triage. CAPI contributors will take a look as soon as possible, apply one of the Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
/remove-lifecycle stale |
PTAL #10897 The new |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
What steps did you take and what happened?
When calling clusterctlclient.Client.ApplyUpgrade(upgrade) to upgrade CAPI core components (its version is not changed) and a CAPI Infra Provider component(the version is changed), there is a very low probability that capi-controller-manager pod is restarted. Both capi-controller-manager pod log and pod previous log contains the error log "Unable to retrieve Node status":
E0223 18:31:51.557569 1 machineset_controller.go:883] "Unable to retrieve Node status" err="failed to create cluster accessor: failed to get lock for cluster: cluster is locked already" controller="machineset" controllerGroup="cluster.x-k8s.io" controllerKind="MachineSet" MachineSet="e2e-mycluster-v1-24-106-sks-upgrade/e2e-50a8u5-sks-upgrade-3m3w-workers-bkzcs" namespace="e2e-mycluster-v1-24-106-sks-upgrade" name="e2e-50a8u5-sks-upgrade-3m3w-workers-bkzcs" reconcileID=b9a3b2d2-00e9-4d0f-97b4-f2448292404d MachineDeployment="e2e-mycluster-v1-24-106-sks-upgrade/e2e-50a8u5-sks-upgrade-3m3w-workers" Cluster="e2e-mycluster-v1-24-106-sks-upgrade/e2e-50a8u5-sks-upgrade-3m3w" Machine="e2e-mycluster-v1-24-106-sks-upgrade/e2e-50a8u5-sks-upgrade-3m3w-workers-bkzcs-75tm4" node=""
This error causes the MD.Status.ReadyReplicas changes from 3 to 0 and after about 90s it will be changed back to 3. The reason is updateStatus() in machineset_controller.go ignores the error returned by getMachineNode() and treats the Node as not ready. In the mean time, KCP.Status.ReadyReplicas changes from 3 to 2 and back to 3 (after about only 8 seconds), and the reason might be kcp Reconcile() issues a requeue immediately after hitting ErrClusterLocked error.
Our code on top of CAPI watches MD.Status.ReadyReplicas and leads to an issue when MD.Status.ReadyReplicas changes from 3 to 0.
What did you expect to happen?
Cluster API version
1.5.2
Kubernetes version
1.24.17
Anything else you would like to add?
To avoid MD.Status.ReadyReplicas change in this case, we can
return error
rather thancontiune
athttps://github.com/kubernetes-sigs/cluster-api/blob/v1.5.2/internal/controllers/machineset/machineset_controller.go#L882-L884 when the error is ErrClusterLocked (or even return error on any error).
Label(s) to be applied
/kind bug
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.
The text was updated successfully, but these errors were encountered: