[occm] multizone race condition #2590

sergelogvinov · 2024-05-08T14:10:30Z

We cannot definitively determine whether a node exists within our zone without the provider ID string.

What this PR does / why we need it:

It affects if OS_CCM_REGIONAL is true
The node trying to join to the cluster. It has non ready status and uninitialized taint.
The cloud-node controller is attempting to initialize the node simultaneously with the node-lifecycle controller's attempt to delete the node, as the node is located in another region

So we need to skip nodes without ProviderID in OS_CCM_REGIONAL mode.
Azure does the same az.IsNodeUnmanaged -> [IsNodeUnmanagedByProviderID] (https://github.com/kubernetes-sigs/cloud-provider-azure/blob/master/pkg/provider/azure_wrap.go#L86) - if ProviderID is not azure string (or empty) - set it as unmanaged node.

Which issue this PR fixes(if applicable):

part of #1924

Special notes for reviewers:

Release note:

OS_CCM_REGIONAL is alpha feature, we do not need to update release note.
Feature flag was introduced here #1909

NONE

linux-foundation-easycla · 2024-05-08T14:10:34Z

The committers listed above are authorized under a signed CLA.

✅ login: sergelogvinov / name: Serge (a2ca362)

k8s-ci-robot · 2024-05-08T14:10:37Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign anguslees for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

pkg/openstack/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot · 2024-05-08T14:10:40Z

Hi @sergelogvinov. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

mdbooth · 2024-05-10T09:21:09Z

pkg/openstack/instancesv2.go

+			klog.V(4).Infof("Instance %s should initialized first", node.Name)
+			return true, nil


I haven't checked the call-sites, but my initial thought is that this should return an error rather than true. It feels like we are potentially asserting that the instance exists when in fact we don't know.

cloud-provider does nothing if we return positive (it's alive and exist) response. Otherwise it delete/taint the node. The CCM in another region can make right decision.

Than ProviderID is empty - yep, better return an error.
If ProviderID is not openstack magic string - better do nothing. Errors will create noisy logs.
AzureCCM does the same.

This providerID topic was discussed recently, it seems there is the assumptions for cloud providers to set it, so new nodes will wait until the providerID is set unless the cloud provider indicates is not implemented ,kubernetes/kubernetes#123713

Otherwise the node is not going to be initialized

mdbooth · 2024-05-10T09:34:24Z

pkg/openstack/instancesv2.go

@@ -79,6 +79,18 @@ func (os *OpenStack) instancesv2() (*InstancesV2, bool) {

 // InstanceExists indicates whether a given node exists according to the cloud provider
 func (i *InstancesV2) InstanceExists(ctx context.Context, node *v1.Node) (bool, error) {
+	if i.regionProviderID {
+		if node.Spec.ProviderID == "" {
+			klog.V(4).Infof("Instance %s should initialized first", node.Name)


nit:

Suggested change

klog.V(4).Infof("Instance %s should initialized first", node.Name)

klog.V(4).Infof("Node %s should be initialized first", node.Name)

Incidentally, it looks like you've started moving to structured logging in at least the loadbalancer code.

This might be written as a structured log instead:

klog.V(4).InfoS("Node is not yet initialised", "node", klog.KObj(node))

mdbooth · 2024-05-10T09:37:25Z

pkg/openstack/utils.go

+// Return true if instance is not openstack.
+func instanceNodeUnmanaged(providerID string) bool {
+	if providerID != "" {
+		return strings.Contains(providerID, "://") && !strings.HasPrefix(providerID, ProviderName+"://")


What does the first assertion (strings.Contains(providerID, "://"))buy us?

instanceIDFromProviderID adds openstack:// if providerID has string like /instance-id.
We are checking whether the providerID is in the old style or not. If yes - node is ours (backward compatibility)

mdbooth · 2024-05-10T09:40:22Z

pkg/openstack/instancesv2.go

+
+		if instanceNodeUnmanaged(node.Spec.ProviderID) {
+			klog.V(4).Infof("Instance %s is not an OpenStack instance", node.Name)
+			return true, nil


Again, is true the correct return value here? I'd have to check the call-sites, but my feeling is that this could lead to incorrect behaviour possibly today, but if not in the future. My feeling is that we should return an error here.

mdbooth · 2024-05-10T09:41:49Z

pkg/openstack/instancesv2.go

+		if node.Spec.ProviderID == "" {
+			klog.V(4).Infof("Instance %s should initialized first", node.Name)
+			return false, nil
+		}
+
+		if instanceNodeUnmanaged(node.Spec.ProviderID) {
+			klog.V(4).Infof("Instance %s is not an OpenStack instance", node.Name)
+			return false, nil
+		}


Same comments as above re structured logging and returning a non-error result.

We cannot definitively determine whether a node exists within our zone without the provider ID string.

mdbooth · 2024-05-13T09:07:05Z

I feel like the underlying problem here is that getInstance doesn't support multiple regions when providerID is not set: https://github.com/sergelogvinov/cloud-provider-openstack/blob/3c97ceb61c74d4a58716b1d3109169d05451ea67/pkg/openstack/instancesv2.go#L153-L175

Would it be safer and less prone to hard-to-fix future edge cases to address that instead?

sergelogvinov · 2024-05-13T10:40:02Z

My changes based on these thoughts:

If we decide to add a new region into the existing cluster, we'll need to redeploy the CCM with the OS_CCM_REGIONAL flag. Subsequently, all new nodes will be assigned a region name in their ProviderID string.

CCM will use default region (regionA) for nodes lacking a region in their ProviderID.
Therefore, two instances of CCM, each configured with different region names, will operate independently.
CCM-regionA will locate instances by ProviderID regardless of region presence, while CCM-regionB will only recognize nodes with a region name in their ProviderID (as all nodes in regionB will have this).

mdbooth · 2024-05-13T11:24:52Z

Got it. However, this would still be baking logic into the CCM based on a particular deployment strategy. I think it would still address your use case if we, e.g. updated getInstance to be multi-region aware in the event that providerID is not set. This would also mean that we would not be asserting that an instance exists when we don't know that it does. I can't help but feel that the fact this this happens not to break anything currently is not something we should rely on indefinitely.

mdbooth · 2024-05-13T13:42:39Z

IIUC, the situation you're describing is one where we have multiple CCMs: one per region. The motivation for doing this is...

OpenStack CCM is not multi-region aware? Is there any other reason?

My initial thought was that it would be required for HA, but as long as your control plane is multi-region, a failure of a complete region where the CCM is running would result in the CCM running in another region.

It's not to reduce east-west traffic, either, because the CCM is only talking to KAS, and etcd data is presumably replicated across your control plane anyway.

That isn't quite where I thought I was going with this comment, btw. I just thought I'd state the motivation for wanting to do this first, only to realise I didn't understand it.

Regardless, I think this is fundamentally a design question of supporting multiple multiple CCMs in a single cluster. I wonder if the node folks have any existing guidance on this. I vaguely recall having discussed this before. My impressions from that discussion, which would have been around the time of the external cloud-provider split, where:

Multiple CCMs are not supported
Nobody seemed to be thinking about multiple CCMs from the same provider, only different providers

Still, I wonder if it's worth reaching out to those folks. Potential outcomes:

We're heading in a direction which is explicitly and deliberately unsupported for... reasons
There's some existing prior art we can reference
They're interested in a solution and are waiting for somebody to work on it

Not sure what the forum to raise this is, tbh. I wonder if @aojea, @danwinship, or @andrewsykim could signpost us appropriately?

aojea · 2024-05-13T14:31:11Z

pkg/openstack/utils.go

+// A providerID is build out of '${ProviderName}:///${instance-id}' which contains ':///'.
+// or '${ProviderName}://${region}/${instance-id}' which contains '://'.
+// See cloudprovider.GetInstanceProviderID and Instances.InstanceID.
+func instanceIDFromProviderID(providerID string) (instanceID string, region string, err error) {


it seems this is just moving code from one place to another, can you make this in a different commit so it simplifies the review?

aojea · 2024-05-13T14:32:11Z

pkg/openstack/utils.go

+}
+
+// Return true if instance is not openstack.
+func isNodeUnmanaged(providerID string) bool {


ok, this is the new change, the notion of "unmanaged node"

aojea · 2024-05-13T14:35:02Z

pkg/openstack/instancesv2.go

@@ -79,9 +79,20 @@ func (os *OpenStack) instancesv2() (*InstancesV2, bool) {

 // InstanceExists indicates whether a given node exists according to the cloud provider
 func (i *InstancesV2) InstanceExists(ctx context.Context, node *v1.Node) (bool, error) {
+	if i.regionProviderID {


I'm not familiar with this concept, it seems that this capability makes the controller to skip the nodes considered unmanaged

My point is that, the regionProviderID is an alpha feature, meaning that all code and logic within this block can be improved in the future, if i miss something.

sergelogvinov · 2024-05-13T14:37:01Z

I agree with you. Multi-region or hybrid cloud setups introduce complexity. Determining the affiliation of uninitialized nodes becomes challenging.

I've conducted extensive research and experimented with various setups, including multi-region or hybrid Kubernetes clusters. For instance https://github.com/sergelogvinov/terraform-talos?tab=readme-ov-file#multi-cloud-compatibility

Currently, I've been working on a branch OCCM that supports multi-region OpenStack setups, particularly for node and node-lifecycle controllers. However, configuring routes and services in such setups can be quite challenging. I believe that CSI-cinder can also be adapted for multi-region deployments. I had initially planned to introduce it after this PR to simplify the review process.

PS. I've developed my own multi-regional implementation of CCM/CSI for Proxmox, Proxmox CSI Plugin and Proxmox Cloud Controller Manager. After about a year of production experience with my implementation, I'm confident in its viability.

aojea · 2024-05-13T14:37:37Z

definitely a topic to discuss in sig-cloud-provider , we recently had a long discussion on the provider ID topic that seems to not be well specified kubernetes/kubernetes#123713 so I will also validate all the assumptions so we not make any assumption now that is not correct

sergelogvinov · 2024-05-13T15:22:45Z

definitely a topic to discuss in sig-cloud-provider , we recently had a long discussion on the provider ID topic that seems to not be well specified kubernetes/kubernetes#123713 so I will also validate all the assumptions so we not make any assumption now that is not correct

this changes affect only node-lifecycle controller, but discussion kubernetes/kubernetes#123713 was about node-controller

PS. With this patch a node without a ProviderID can never be deleted...

mdbooth · 2024-05-13T16:12:18Z

The architecture here is:

providerID has the form openstack://<region>/<instance uuid>
One CCM per region; Nodes are sharded by region in the providerID
A single 'primary' CCM is responsible for assigning a providerID to Nodes which don't have one yet

Is that right?

Do the per-region CCMs actually shard the Nodes, or do we just ensure the responses from InstancesV2 are such that the controller will never take any action for a Node which is not in its region?

mdbooth · 2024-05-13T16:24:11Z

Have you considered having kubelet set the providerID on Node creation, btw? Here's how CAPO does this:

https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/1e0b60fd0ffd7fc1e940ada7e85ec8d724381146/templates/cluster-template.yaml#L21-L24

This is used by kubeadm when it creates the service which runs kubelet. It gets the instance-id from the local metadata service.

Here's how we do it in OpenShift... for AWS unfortunately but the principal is the same (I'm convinced we did this for OpenStack too... may have to fix that 🤔):

https://github.com/openshift/machine-config-operator/blob/master/templates/common/aws/files/usr-local-bin-aws-kubelet-providerid.yaml

That unit runs on boot and sets a KUBELET_PROVIDERID environment variable which the kubelet unit uses on startup. Again it uses the local (AWS) metadata service to get the instanceID, and AZ in that case.

Unfortunately I don't think OpenStack region is available via the metadata service. However, a workaround for this might be to have a separate template for each region where the region was hard-coded and it just filled in instanceID from the metadata service.

mdbooth · 2024-05-13T16:48:13Z

Coming back to this patch specifically, I think I'd be more comfortable proposing a patch in k/k similar to kubernetes/kubernetes#123713 which causes the node-lifecycle controller to have some explicit behaviour (e.g. ignore) for Nodes which don't have a providerID.

I continue to worry that in this PR we are mis-reporting to cloud-provider on the basis that we expect cloud-provider to have a certain behaviour. I think that will be fragile and hard to maintain.

dulek · 2024-05-23T09:18:39Z

/ok-to-test

k8s-triage-robot · 2024-08-21T09:59:23Z

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle stale
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

kayrus · 2024-09-17T18:13:10Z

/test pull-cloud-provider-openstack-check

kayrus · 2024-09-17T18:13:18Z

/remove-lifecycle stale

MatthieuFin · 2024-09-26T12:31:16Z

Have you considered having kubelet set the providerID on Node creation, btw? Here's how CAPO does this:

https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/1e0b60fd0ffd7fc1e940ada7e85ec8d724381146/templates/cluster-template.yaml#L21-L24

This is used by kubeadm when it creates the service which runs kubelet. It gets the instance-id from the local metadata service.

Here's how we do it in OpenShift... for AWS unfortunately but the principal is the same (I'm convinced we did this for OpenStack too... may have to fix that 🤔):

https://github.com/openshift/machine-config-operator/blob/master/templates/common/aws/files/usr-local-bin-aws-kubelet-providerid.yaml

That unit runs on boot and sets a KUBELET_PROVIDERID environment variable which the kubelet unit uses on startup. Again it uses the local (AWS) metadata service to get the instanceID, and AZ in that case.

Unfortunately I don't think OpenStack region is available via the metadata service. However, a workaround for this might be to have a separate template for each region where the region was hard-coded and it just filled in instanceID from the metadata service.

Hi @mdbooth , I try to used CAPO in multi-cloud environment and you're right I don't encounter this race condition scenario since provider-id is set by kubeadm, I create different template one per cloud with region hardcoded like that:

---
apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
kind: KubeadmConfigTemplate
metadata:
  name: pool-region-one
  annotations:
    controlplane.cluster.x-k8s.io/skip-coredns: ""
    controlplane.cluster.x-k8s.io/skip-kube-proxy: ""
spec:
  template:
    spec:
      joinConfiguration:
        nodeRegistration:
          kubeletExtraArgs:
            cloud-provider: external
            provider-id: openstack://region-one/'{{ instance_id }}'
...

This work pretty well with multiple OCCM deployed with variables OS_CCM_REGIONAL="true" and OS_V1_INSTANCES="true" because instance v2 implementation seems to not handle support of OS_CCM_REGIONAL behavior.

However, CAPO don't handle region in magic string provider-id in his CRD:

kubectl get openstackmachine
NAME                       CLUSTER   INSTANCESTATE   READY   PROVIDERID                                          MACHINE                    AGE
control-plane-6r5xs        capoc     ACTIVE          true    openstack:///87db7660-bc3b-48fd-b5c9-0fa2b2816eed   control-plane-6r5xs        3h14m
pool-region-one-66rgq-t5c72   capoc     ACTIVE          true    openstack:///f48f60bd-bae1-4ed9-b58f-aa1d90aa6579   pool-region-one-66rgq-t5c72   154m

And he don't create other node because he is stuck on waiting this first one because CRD's provider-id don't containt region name and don't match kubernetes node spec.providerID which contains region name.

I think that we simply have to add region in provider-id here https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/main/controllers/openstackmachine_controller.go#L425C1-L425C112 .

I am looking to find a way to implement that correctly in CAPO.

k8s-ci-robot · 2024-10-22T20:21:27Z

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot added the release-note-none Denotes a PR that doesn't merit a release note. label May 8, 2024

k8s-ci-robot requested review from anguslees and FengyunPan2 May 8, 2024 14:10

k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label May 8, 2024

k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 8, 2024

k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label May 8, 2024

sergelogvinov force-pushed the multizone-join-node branch 2 times, most recently from ff0fb62 to 6016b87 Compare May 8, 2024 14:25

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels May 8, 2024

sergelogvinov force-pushed the multizone-join-node branch from 6016b87 to 745d91c Compare May 10, 2024 08:04

k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 10, 2024

mdbooth reviewed May 10, 2024

View reviewed changes

sergelogvinov force-pushed the multizone-join-node branch 2 times, most recently from 33c3917 to 68fbb88 Compare May 10, 2024 16:39

[occm] multizone race condition

a2ca362

We cannot definitively determine whether a node exists within our zone without the provider ID string.

sergelogvinov force-pushed the multizone-join-node branch from 68fbb88 to a2ca362 Compare May 10, 2024 16:40

aojea reviewed May 13, 2024

View reviewed changes

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 23, 2024

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 21, 2024

k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 17, 2024

MatthieuFin mentioned this pull request Sep 27, 2024

Multi openstack cloud support with OCCM kubernetes-sigs/cluster-api-provider-openstack#2183

Closed

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[occm] multizone race condition #2590

[occm] multizone race condition #2590

sergelogvinov commented May 8, 2024 •

edited

Loading

linux-foundation-easycla bot commented May 8, 2024 •

edited

Loading

k8s-ci-robot commented May 8, 2024

k8s-ci-robot commented May 8, 2024

mdbooth May 10, 2024

sergelogvinov May 10, 2024

aojea May 13, 2024

mdbooth May 10, 2024

mdbooth May 10, 2024

sergelogvinov May 10, 2024 •

edited

Loading

mdbooth May 10, 2024

mdbooth May 10, 2024

mdbooth commented May 13, 2024

sergelogvinov commented May 13, 2024

mdbooth commented May 13, 2024

mdbooth commented May 13, 2024

aojea May 13, 2024

aojea May 13, 2024

aojea May 13, 2024

sergelogvinov May 13, 2024

sergelogvinov commented May 13, 2024

aojea commented May 13, 2024

sergelogvinov commented May 13, 2024

mdbooth commented May 13, 2024 •

edited

Loading

mdbooth commented May 13, 2024

mdbooth commented May 13, 2024

dulek commented May 23, 2024

k8s-triage-robot commented Aug 21, 2024

kayrus commented Sep 17, 2024

kayrus commented Sep 17, 2024

MatthieuFin commented Sep 26, 2024

k8s-ci-robot commented Oct 22, 2024

		klog.V(4).Infof("Instance %s should initialized first", node.Name)
		return true, nil

	klog.V(4).Infof("Instance %s should initialized first", node.Name)
	klog.V(4).Infof("Node %s should be initialized first", node.Name)

[occm] multizone race condition #2590

Are you sure you want to change the base?

[occm] multizone race condition #2590

Conversation

sergelogvinov commented May 8, 2024 • edited Loading

linux-foundation-easycla bot commented May 8, 2024 • edited Loading

k8s-ci-robot commented May 8, 2024

k8s-ci-robot commented May 8, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sergelogvinov May 10, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mdbooth commented May 13, 2024

sergelogvinov commented May 13, 2024

mdbooth commented May 13, 2024

mdbooth commented May 13, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sergelogvinov commented May 13, 2024

aojea commented May 13, 2024

sergelogvinov commented May 13, 2024

mdbooth commented May 13, 2024 • edited Loading

mdbooth commented May 13, 2024

mdbooth commented May 13, 2024

dulek commented May 23, 2024

k8s-triage-robot commented Aug 21, 2024

kayrus commented Sep 17, 2024

kayrus commented Sep 17, 2024

MatthieuFin commented Sep 26, 2024

k8s-ci-robot commented Oct 22, 2024

sergelogvinov commented May 8, 2024 •

edited

Loading

linux-foundation-easycla bot commented May 8, 2024 •

edited

Loading

sergelogvinov May 10, 2024 •

edited

Loading

mdbooth commented May 13, 2024 •

edited

Loading