Proactively bounce capi-controller-manager in case of netsplits #7445

Closed

jayunit100 opened this issue Oct 24, 2022 · 17 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@jayunit100
Contributor

jayunit100 commented Oct 24, 2022

(edit: removed "CCM", because it's an ambiguous abbreviation)

User Story

As a user on the edge, if one node in my management cluster (MC) somehow gets netsplit off from my control plane and worker nodes, I'd like the capi-controller-manager to fail, so that it bounces to a "potentially" healthy, connected node elsewhere in my MC.

Detailed Description

As an example of how I was stepping through this earlier, the steps I followed in kubernetes-sigs/cluster-api-provider-vsphere#1660 worked...

Anything else you would like to add:

I saw this in a very odd environment, admittedly, but I think it's still a good idea to broaden the definition of health for the capi-controller-manager, if possible. We wouldn't want a single capi-controller-manager that is having issues to slow down the remediation of a fleet of clusters running on different networks.

/kind feature

The diagram below shows the issue I ran into in this netsplit situation.

[diagram of the netsplit scenario]

@k8s-ci-robot added the kind/feature and needs-triage labels Oct 24, 2022
@chrischdi
Member

I think this is a general problem for all kinds of controllers and is not specific to CAPI's controllers. It should be (or maybe already is) solved in Kubernetes itself.

Did you check whether the pod gets evicted after five minutes? That would map to the kube-controller-manager's pod-eviction-timeout parameter (which defaults to 5m0s, xref).

In addition, there are two other potentially interesting configuration parameters for kube-apiserver, which I got from this issue: the DefaultTolerationSeconds admission plugin, default-not-ready-toleration-seconds, and default-unreachable-toleration-seconds (xref).
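For reference, a minimal sketch of where those knobs live on a kubeadm-based management cluster (assuming kubeadm is in use; the values below are illustrative, not recommendations):

```yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
apiServer:
  extraArgs:
    # Defaults applied to Pods by the DefaultTolerationSeconds admission plugin
    # (how long a Pod tolerates a not-ready/unreachable Node before eviction).
    default-not-ready-toleration-seconds: "60"
    default-unreachable-toleration-seconds: "60"
controllerManager:
  extraArgs:
    # How long the kube-controller-manager waits before evicting Pods from a
    # Node that stopped reporting; defaults to 5m0s.
    pod-eviction-timeout: "1m0s"
```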

A workaround would be to run the CAPI controller manager (and the other controllers, too) with multiple replicas and anti-affinity. A failed leader election should then lead to failover to another replica; a rough sketch follows below.
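A sketch of that workaround, assuming a default CAPI installation in the capi-system namespace (the label, image tag, and replica count here are assumptions, not the project's shipped defaults):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: capi-controller-manager
  namespace: capi-system
spec:
  replicas: 3                      # several replicas instead of the usual single one
  selector:
    matchLabels:
      cluster.x-k8s.io/provider: cluster-api
  template:
    metadata:
      labels:
        cluster.x-k8s.io/provider: cluster-api
    spec:
      affinity:
        podAntiAffinity:
          # Force replicas onto different nodes so a netsplit on one node
          # does not take out every candidate for the leader election.
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  cluster.x-k8s.io/provider: cluster-api
              topologyKey: kubernetes.io/hostname
      containers:
        - name: manager
          image: registry.k8s.io/cluster-api/cluster-api-controller:v1.3.0  # example tag
          args:
            - --leader-elect   # standby replicas take over if the leader loses its lease
```

Note that this only helps if the replica that wins the next leader election actually sits on a node with working connectivity to the affected workload cluster.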

@sbueringer
Member

sbueringer commented Oct 24, 2022

I'd like the CCM to fail, that way it bounces to a "potentially" healthy, connected other node in my MC.

Not sure if I understood the issue correctly, but changes to the CCM should be out of scope for the Cluster API project?

I thought the issue was that the CCM in the workload cluster doesn't set the providerID, not the CAPI controller in the mgmt cluster?

@sbueringer
Member

To be clear:

  • CCM = cloud controller manager
  • not capi controller manager

Wasn't your problem that the cloud controller manager was not setting the providerIDs on nodes?

@jayunit100
Contributor Author

  • I think the cloud controller manager issue may also have been a problem, but I saw that
  • during a netsplit situation, capi-controller-manager flips out if it can't connect to the control plane of a WL cluster, but it doesn't actually shut down or bounce to a new node.

@killianmuldoon
Contributor

killianmuldoon commented Oct 24, 2022

during a netsplit situation, capi-controller-manager flips out if it can't connect to the control plane of a WL cluster, but it doesn't actually shut down or bounce to a new node

Is the CAPI controller manager running with multiple replicas? If the controller is cut off from the network (split from the leader API server on the mgmt cluster), it should lose the lease and another replica should take it over.

If the management cluster is cut off from the workload cluster network entirely, it will just keep retrying to contact the workload cluster.

@jayunit100
Contributor Author

In this case, the capi-controller-manager's node

  • is able to connect to its own apiserver (so it doesn't lose the lease),
  • but not to the apiservers of other WL clusters (so it keeps failing at any MHC-type steps).

@chrischdi
Member

chrischdi commented Oct 24, 2022

So it looks something like this, and the red or orange connection here is broken in some way?

[diagram of the suspected broken connections]

@jayunit100
Contributor Author

Yup!

@jayunit100
Contributor Author

[diagram]

similar pic to yours

@killianmuldoon
Contributor

There are a few questions that come to mind:

  1. Should CAPI be considered to be in a non-healthy state if it can't contact a single workload cluster, or a majority, or all of them? How long should this be the case before taking action?
  2. Is the management cluster HA? Is the assumption that a CAPI controller on another node would be able to communicate with the workload cluster? Would it be able to communicate with the API server on the initial CAPI controller node?
  3. A compromise / best-effort mitigation could be for the CAPI controller to lose the lease if it can't contact all workload clusters (or some subset of them?). Then it's likely that another controller would get the lease and possibly be able to contact those clusters.

Could this be approached by watching a metric and alerting that the CP is not able to contact workloads? This is discussed here.

That would give someone with insight into the network setup a chance to remediate manually, or to set up some deployment-specific automation.
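As a concrete starting point for the alerting route, here is a rough sketch assuming the Prometheus Operator is installed and scraping the CAPI metrics endpoint; the controller label value and the threshold are assumptions, not recommendations:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: capi-reconcile-errors
  namespace: capi-system
spec:
  groups:
    - name: cluster-api
      rules:
        - alert: CAPIReconcileErrorsHigh
          # controller_runtime_reconcile_errors_total is exposed by
          # controller-runtime based managers such as capi-controller-manager;
          # the "machinehealthcheck" controller label value is an assumption.
          expr: rate(controller_runtime_reconcile_errors_total{controller="machinehealthcheck"}[10m]) > 0.1
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "CAPI is persistently failing to reconcile; possibly unable to reach a workload cluster"
```

An alert like this would page whoever understands the network topology, and they could then cordon the affected node or restart the controller elsewhere.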

@sbueringer
Member

sbueringer commented Nov 29, 2022

I think the answer is monitoring, to be honest. Monitor Cluster API and then produce corresponding alerts, with either manual playbooks or some automatic mitigation (although automation is super hard with things like netsplits).

Some thoughts:

the capi-controller-manager to fail, so that it bounces to a "potentially" healthy, connected node elsewhere in my MC.

If we include "can the CAPI controller contact workload clusters" in our readiness/liveness probes (or if the controller just shuts down), we get the following consequences (a rough probe sketch follows below):

  • we will get container restarts and CrashLoopBackOffs
  • this doesn't trigger a Pod eviction, and the Pod is not rescheduled
  • even if the CAPI controller Pod deletes itself, it would very likely get rescheduled on the same Node
    • afaik the kube-scheduler by default prefers Nodes which already have the container image locally

I don't think we could maintain stable operations with behavior like that.

Let's assume we have a bunch of workload clusters and some of them are offline (either because they are simply broken / misconfigured, or because of some "temporarily not reachable" edge scenario). The Cluster API controller would just be restarted permanently. Imagine what happens if you have a few hundred clusters...

The result is the same if you have some workload clusters that are reachable from some mgmt cluster nodes and others only from other nodes.
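To make that concrete, a connectivity-aware probe would look roughly like the sketch below. The probe path is hypothetical (today the manager only exposes the generic controller-runtime /healthz and /readyz endpoints), and repeated failures only restart the container in place:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: capi-controller-manager
spec:
  containers:
    - name: manager
      image: registry.k8s.io/cluster-api/cluster-api-controller:v1.3.0  # example tag
      livenessProbe:
        httpGet:
          path: /healthz-workload-connectivity   # hypothetical endpoint, does not exist today
          port: 9440                              # health probe port used in the default CAPI manifests
        periodSeconds: 30
        failureThreshold: 10
      # On failure the kubelet restarts this container in place (eventually
      # CrashLoopBackOff); the Pod object never moves to another Node.
```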

@jayunit100
Contributor Author

jayunit100 commented Nov 29, 2022

(deleting my comment as I meant to put it in the other issue) but I still think we should proactively time-bomb this container :)

@fabriziopandini
Member

/triage accepted

I'm trying to make up my mind on two sides of the problem:

  • how we can distinguish a netsplit (a problem local to the machine the controller runs on) from other network problems that affect connectivity to all or a subset of workload clusters
  • how the bounce should happen: what triggers the bounce, and how do we ensure the newly scheduled controller doesn't end up on the same machine?

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label Nov 30, 2022
@jayunit100
Contributor Author

jayunit100 commented Dec 3, 2022

Maybe it's just a naive time-bomb expiration? Like, once a week or so, the pod self-terminates gracefully.
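A rough sketch of that idea, reworked as a weekly rolling restart driven from outside the Pod rather than the Pod killing itself (the ServiceAccount and image are hypothetical, and the RBAC that lets it patch the Deployment is omitted):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: capi-weekly-bounce
  namespace: capi-system
spec:
  schedule: "0 3 * * 0"   # every Sunday at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: capi-bouncer     # hypothetical; needs RBAC to patch the Deployment
          restartPolicy: OnFailure
          containers:
            - name: kubectl
              image: bitnami/kubectl:latest    # example image
              command:
                - kubectl
                - rollout
                - restart
                - deployment/capi-controller-manager
                - --namespace=capi-system
```

Note that a restart alone doesn't guarantee the new Pod lands on a different node; that still needs anti-affinity, a cordon, or similar, which ties back to the scheduling question above.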

@k8s-triage-robot

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot added the needs-triage label and removed the triage/accepted label Jan 19, 2024
@fabriziopandini
Member

/close

Unfortunately we did not reach an agreement on a way forward, and the issue has not been active in the last year.

@k8s-ci-robot
Contributor

@fabriziopandini: Closing this issue.

In response to this:

/close

Unfortunately we did not reach an agreement on a way forward, and the issue has not been active in the last year.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
