
MD.Status.ReadyReplicas changes from 3 to 0 when machineset_controller updateStatus() hits "Unable to retrieve Node status" error #10195

Open
jessehu opened this issue Feb 26, 2024 · 15 comments
Labels
area/machineset · kind/bug · lifecycle/stale · needs-triage · priority/important-longterm

Comments

@jessehu
Contributor

jessehu commented Feb 26, 2024

What steps did you take and what happened?

When calling clusterctlclient.Client.ApplyUpgrade(upgrade) to upgrade the CAPI core components (whose version is unchanged) and a CAPI infrastructure provider component (whose version is changed), there is a very low probability that the capi-controller-manager pod is restarted. Both the current and previous capi-controller-manager pod logs contain the error "Unable to retrieve Node status":

E0223 18:31:51.557569 1 machineset_controller.go:883] "Unable to retrieve Node status" err="failed to create cluster accessor: failed to get lock for cluster: cluster is locked already" controller="machineset" controllerGroup="cluster.x-k8s.io" controllerKind="MachineSet" MachineSet="e2e-mycluster-v1-24-106-sks-upgrade/e2e-50a8u5-sks-upgrade-3m3w-workers-bkzcs" namespace="e2e-mycluster-v1-24-106-sks-upgrade" name="e2e-50a8u5-sks-upgrade-3m3w-workers-bkzcs" reconcileID=b9a3b2d2-00e9-4d0f-97b4-f2448292404d MachineDeployment="e2e-mycluster-v1-24-106-sks-upgrade/e2e-50a8u5-sks-upgrade-3m3w-workers" Cluster="e2e-mycluster-v1-24-106-sks-upgrade/e2e-50a8u5-sks-upgrade-3m3w" Machine="e2e-mycluster-v1-24-106-sks-upgrade/e2e-50a8u5-sks-upgrade-3m3w-workers-bkzcs-75tm4" node=""

This error causes MD.Status.ReadyReplicas to change from 3 to 0; after about 90s it changes back to 3. The reason is that updateStatus() in machineset_controller.go ignores the error returned by getMachineNode() and treats the Node as not ready. In the meantime, KCP.Status.ReadyReplicas changes from 3 to 2 and back to 3 (after only about 8 seconds); the reason might be that the KCP Reconcile() requeues immediately after hitting the ErrClusterLocked error.

Our code built on top of CAPI watches MD.Status.ReadyReplicas, and this flapping from 3 to 0 causes an issue for us.
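
For reference, the counting logic in updateStatus() follows roughly this shape (a simplified sketch of the pattern rather than the exact CAPI code; the surrounding controller plumbing is omitted):

```go
for _, machine := range filteredMachines {
	node, err := r.getMachineNode(ctx, cluster, machine)
	if err != nil {
		// Any error here, including the transient ErrClusterLocked returned by
		// the cluster cache tracker, is only logged; the machine is skipped and
		// its replica is effectively counted as not ready, which is what drives
		// MD.Status.ReadyReplicas from 3 to 0.
		log.Error(err, "Unable to retrieve Node status")
		continue
	}
	if noderefutil.IsNodeReady(node) {
		readyReplicasCount++
	}
}
```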

What did you expect to happen?

  • MD.Status.ReadyReplicas should not change from 3 to 0 when hitting the ErrClusterLocked error (and arguably other transient errors), because the Nodes are actually ready.
  • KCP.Status.ReadyReplicas should not change either when hitting the ErrClusterLocked error.

Cluster API version

1.5.2

Kubernetes version

1.24.17

Anything else you would like to add?

To avoid the MD.Status.ReadyReplicas change in this case, we can return the error rather than continue at
https://github.com/kubernetes-sigs/cluster-api/blob/v1.5.2/internal/controllers/machineset/machineset_controller.go#L882-L884 when the error is ErrClusterLocked (or even return on any error).
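
Roughly, the suggested change would look like this (a sketch only; it assumes remote.ErrClusterLocked is the sentinel error surfaced by the cluster cache tracker, as seen in the log above, and that updateStatus() can return an error to its caller):

```go
node, err := r.getMachineNode(ctx, cluster, machine)
if err != nil {
	if errors.Is(err, remote.ErrClusterLocked) {
		// Transient condition: bubble the error up so the whole reconcile is
		// retried, instead of silently counting the replica as not ready.
		return errors.Wrap(err, "unable to retrieve Node status")
	}
	log.Error(err, "Unable to retrieve Node status")
	continue
}
```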

Label(s) to be applied

/kind bug
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 26, 2024
@fabriziopandini
Member

fabriziopandini commented Feb 26, 2024

/triage needs-discussion

I would like to get @vincepri, @sbueringer, and @JoelSpeed opinions on this one.

Currently, we consider a replica available when we know its node is ready, and this seems semantically correct to me.

The downside of this formulation is that availability can flicker whenever the node status changes, or whenever there are connection problems with the workload cluster and we cannot retrieve the node status anymore (as in this use case).

If this is still the behavior we all want, then IMO the behavior of KCP and MD is correct: they should both reduce the number of available replicas based on the info available at a given time.

However, what we can do is

  • Eventually, consider if and how to make MS recover more quickly in case of ErrClusterLocked, with the caveat that the time of the next reconciliation is not deterministic, because it depends on the contents of the controller queue, the number of workers, etc.
  • Eventually, consider if we need a way to be more explicit about what's going on (e.g. unknownReplicas)

@k8s-ci-robot k8s-ci-robot added triage/needs-discussion and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 26, 2024
@jessehu
Contributor Author

jessehu commented Feb 27, 2024

Thanks @fabriziopandini. The ErrClusterLocked error should clear after a short time, so marking the Node as not ready (or the replica as unknown) immediately after hitting ErrClusterLocked might be overly reactive. For comparison, kube-controller-manager only marks a Node as unhealthy after it has been unresponsive for 40s:

--node-monitor-grace-period duration     Default: 40s
Amount of time which we allow running Node to be unresponsive before marking it unhealthy.
Must be N times more than kubelet's nodeStatusUpdateFrequency, where N means number
of retries allowed for kubelet to post node status.

@jessehu
Contributor Author

jessehu commented Feb 28, 2024

BTW this could also be impacted by #9810, as discussed in #10165 (comment)

@JoelSpeed
Contributor

Yes, I think we may want to take a leaf out of KCM's book here and not immediately flip to the unready state. In the real world, I would expect that users monitor things like unready nodes and want to remediate that situation. Going unready early may lead to false-positive alerts.

In this case specifically (ErrClusterLocked), I think we would want to leave the Nodes in whatever state they were previously in and requeue the request to try again in, say, 20s. Do we currently track when we last observed the Node? We probably also want a timeout on this behaviour: if we haven't seen the Node in x time, then we assume its status is unknown.
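
Something along these lines could capture that idea (a conceptual sketch under the assumptions above; lastNodeObservedTime, nodeStatusTimeout, keepPreviousNodeState and markNodeStatusUnknown are hypothetical names used only for illustration, not existing CAPI fields or helpers):

```go
node, err := r.getMachineNode(ctx, cluster, machine)
if err != nil {
	if errors.Is(err, remote.ErrClusterLocked) {
		if time.Since(lastNodeObservedTime) < nodeStatusTimeout {
			// Keep whatever readiness we last observed and try again shortly.
			keepPreviousNodeState(machine)
			requeueAfter = 20 * time.Second
		} else {
			// We have not seen the Node for too long; stop trusting the old
			// state and report it as unknown instead.
			markNodeStatusUnknown(machine)
		}
		continue
	}
	log.Error(err, "Unable to retrieve Node status")
	continue
}
```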

@jessehu
Contributor Author

jessehu commented Mar 6, 2024

I made a PR to fix this bug with a simple approach (not implementing unknownReplicas). Please kindly take a look. Thanks!

@jessehu
Contributor Author

jessehu commented Mar 6, 2024

/area machineset

@k8s-ci-robot k8s-ci-robot added the area/machineset Issues or PRs related to machinesets label Mar 6, 2024
@fabriziopandini
Member

/priority important-longterm

@k8s-ci-robot k8s-ci-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Apr 11, 2024
@chrischdi
Member

Note: CAPI v1.7 has fixes for the ErrClusterLocked error.

Did we already test whether this is still an issue with v1.7.0-rc.1 or so? (v1.7.0 will be released today.)

@jessehu
Contributor Author

jessehu commented Apr 16, 2024

Thanks @chrischdi. That PR solves the reconcile latency when hitting ErrClusterLocked errors. This issue, however, describes MD.Status.ReadyReplicas changing from 3 to 0 when the ErrClusterLocked error is hit for the first time.

@chrischdi
Member

Yeah, just noted while reading the code:

The fix may not directly change the above behavior, but it would lead to retrying every minute.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Apr 18, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

CAPI contributors will take a look as soon as possible, apply one of the triage/* labels and provide further guidance.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 17, 2024
@jessehu
Contributor Author

jessehu commented Jul 26, 2024

/remove-lifecycle stale
There is a PR to resolve this issue, although it still has some open concerns: #10436

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 26, 2024
@fabriziopandini
Member

PTAL #10897

The new RemoteConnectionProbe condition and the related --remote-connection-grace-period and --remote-conditions-grace-period flags are laying the foundation for handling connection issues with the workload cluster in a consistent way across all controllers.
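
Conceptually, the pattern those flags enable looks something like this (a rough sketch of the idea only, not the actual implementation; the names below are illustrative):

```go
// If the connection probe to the workload cluster has been failing for longer
// than the configured grace period, node-derived conditions are reported as
// Unknown; until then, the last observed state is kept so that short
// connection blips do not flip ReadyReplicas.
if connectionDown {
	if time.Since(lastProbeSuccess) > remoteConditionsGracePeriod {
		setNodeConditionsUnknown(machine, "RemoteConnectionProbeFailed")
	} else {
		keepLastObservedConditions(machine)
	}
}
```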

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 27, 2024