
OCPBUGS-35256: Allow bootstrap node in existing instance groups #66

Merged

Conversation

@patrickdillon commented Aug 1, 2024

With the move to CAPG-based installs, the installer places the bootstrap node in the same instance group as one of the masters, due to limitations in CAPG. (In Terraform installs the bootstrap node has its own instance group.) Having the bootstrap node in the same instance group as a master exposes a bug in the logic of a patch we carry. As the commit message states:

This change finds pre-existing instance-groups that ONLY contain
instances that belong to the cluster, uses them for the backend
service.

The logic for that check is implemented here:

if names.HasAll(groupInstances.UnsortedList()...) {
	igLinks = append(igLinks, ig.SelfLink)
	skip.Insert(groupInstances.UnsortedList()...)
}

The issue is that the check uses only "node" names, and cloud-provider-gcp does not consider the bootstrap VM a node. We can resolve the issue by relaxing the logic to require only that all instances in the instance group have the node prefix.
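A minimal sketch of that relaxed check, assuming a hypothetical helper and instance names (the actual carry patch lives in ensureInternalInstanceGroups and may differ):

	package main

	import (
		"fmt"
		"strings"
	)

	// allHaveNodePrefix reports whether every instance name in the group
	// starts with the cluster's node name prefix. The bootstrap VM is not a
	// Kubernetes node, but its name shares the prefix, so a group holding
	// masters plus the bootstrap VM still passes the relaxed check.
	func allHaveNodePrefix(instanceNames []string, nodePrefix string) bool {
		if len(instanceNames) == 0 {
			return false
		}
		for _, name := range instanceNames {
			if !strings.HasPrefix(name, nodePrefix) {
				return false
			}
		}
		return true
	}

	func main() {
		// Hypothetical names; real clusters derive them from the infra ID.
		group := []string{"mycluster-abcde-master-0", "mycluster-abcde-bootstrap"}
		fmt.Println(allHaveNodePrefix(group, "mycluster-abcde")) // prints: true
	}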

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 1, 2024
@patrickdillon (Author)

/retest images

openshift-ci bot commented Aug 3, 2024

@patrickdillon: The /retest command does not accept any targets.
The following commands are available to trigger required jobs:

  • /test e2e-gcp-ovn
  • /test e2e-gcp-ovn-upgrade
  • /test fmt
  • /test images

The following commands are available to trigger optional jobs:

  • /test okd-scos-images
  • /test verify-commits

Use /test all to run the following jobs that were automatically triggered:

  • pull-ci-openshift-cloud-provider-gcp-master-e2e-gcp-ovn
  • pull-ci-openshift-cloud-provider-gcp-master-e2e-gcp-ovn-upgrade
  • pull-ci-openshift-cloud-provider-gcp-master-fmt
  • pull-ci-openshift-cloud-provider-gcp-master-images
  • pull-ci-openshift-cloud-provider-gcp-master-verify-commits

In response to this:

/retest images

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@patrickdillon patrickdillon changed the title WIP: Allow bootstrap node in existing instance groups OCPBUGS-35256: Allow bootstrap node in existing instance groups Aug 5, 2024
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 5, 2024
@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Aug 5, 2024
@openshift-ci-robot

@patrickdillon: This pull request references Jira Issue OCPBUGS-35256, which is invalid:

  • expected the bug to target the "4.17.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

The logic that allows cloud-provider-gcp to utilize existing instance groups checks that the instance group contains only cluster nodes. cloud-provider-gcp does not consider the bootstrap node to be a node, so if the bootstrap node is in the same instance group as one of the masters, cloud-provider-gcp will try to add that master to another instance group.

We can resolve the issue by relaxing the logic to ensure that all instances in the instance group have the node prefix.

Opening this to test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@patrickdillon (Author)

/jira refresh

@openshift-ci-robot

@patrickdillon: This pull request references Jira Issue OCPBUGS-35256, which is invalid:

  • expected the bug to target the "4.17.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@patrickdillon (Author)

/jira refresh

@openshift-ci-robot

@patrickdillon: This pull request references Jira Issue OCPBUGS-35256, which is invalid:

  • expected the bug to target the "4.17.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@patrickdillon (Author)

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Aug 5, 2024
@openshift-ci-robot

@patrickdillon: This pull request references Jira Issue OCPBUGS-35256, which is valid.

3 validation(s) were run on this bug:
  • bug is open, matching expected state (open)
  • bug target version (4.17.0) matches configured target version for branch (4.17.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @jianli-wei

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from jianli-wei August 5, 2024 13:39
@patrickdillon (Author)

While testing my changes to this PR, it was necessary to run go mod vendor to pull my changes into the vendor directory, apparently due to:

k8s.io/cloud-provider-gcp/providers => ./providers
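That line is the replace directive in the repository's go.mod pointing the providers module at the in-tree ./providers directory; a sketch of the resulting workflow:

	# go.mod carries a local replace, so edits under ./providers are not
	# visible to the main module until the vendor tree is regenerated:
	#   replace k8s.io/cloud-provider-gcp/providers => ./providers
	go mod vendor    # re-copies ./providers sources into ./vendor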

@patrickdillon (Author)

As this builds, minimally, on top of another carry patch, please let me know if there's anything I can do commit-wise to make this easier for future carries/rebases.

@@ -166,7 +166,7 @@ func (g *Cloud) EnsureLoadBalancer(ctx context.Context, clusterName string, svc
 		return nil, err
 	}
 
-	klog.V(4).Infof("EnsureLoadBalancer(%v, %v, %v, %v, %v): ensure %v loadbalancer", clusterName, svc.Namespace, svc.Name, loadBalancerName, g.region, desiredScheme)
+	klog.V(4).Infof("debug EnsureLoadBalancer(%v, %v, %v, %v, %v): ensure %v loadbalancer", clusterName, svc.Namespace, svc.Name, loadBalancerName, g.region, desiredScheme)

We should either remove the debug prefix in all the logging calls or change the log level.

I suspect this is left over from debugging work, though :)

@patrickdillon (Author)

oh right, but I forgot to revendor after cleaning up my work. Thanks for catching this. Should be fixed now!

@nrb commented Aug 5, 2024

the installer places the bootstrap node in the same instance group as one of the masters, due to limitations in CAPG

Can you refresh my memory - how many instance groups are there total for the masters?

@bfournie commented Aug 5, 2024

the installer places the bootstrap node in the same instance group as one of the masters, due to limitations in CAPG

Can you refresh my memory - how many instance groups are there total for the masters?

One for each master, so 3 instance groups. The bootstrap VM is included in the first instance group. See https://issues.redhat.com/browse/OCPBUGS-35256?focusedId=24931575&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-24931575

@patrickdillon (Author)

the installer places the bootstrap node in the same instance group as one of the masters, due to limitations in CAPG

Can you refresh my memory - how many instance groups are there total for the masters?

This code update is only relevant for private installs, in which case the control plane nodes only belong to a single instance group. Here is a screenshot from a terraform install (note the bootstrap ig), showing the VMs and their instance groups:

[screenshot: VMs and instance groups in a Terraform install, with a separate bootstrap instance group]

Here is a screenshot for capg-based, where bootstrap shares an IG with a master:

[screenshot: VMs and instance groups in a CAPG-based install, with the bootstrap VM sharing an instance group with a master]
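For illustration only (instance and group names here are hypothetical), the two layouts look roughly like this:

	Terraform install (bootstrap in its own group):
	    bootstrap-ig:   <infraID>-bootstrap
	    master-ig-a:    <infraID>-master-0
	    master-ig-b:    <infraID>-master-1
	    master-ig-c:    <infraID>-master-2

	CAPG-based install (bootstrap shares the first master group):
	    master-ig-a:    <infraID>-master-0, <infraID>-bootstrap
	    master-ig-b:    <infraID>-master-1
	    master-ig-c:    <infraID>-master-2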

@nrb commented Aug 5, 2024

Ah, ok, this has to do with the instance groups being confined to a region/zone, and the bootstrap node will have to be in one of them.

I'm ok with carrying this change to our carried logic.

I see there was discussion about using kubernetes-sigs/cluster-api-provider-gcp#1266 instead. Usually I'd prefer to use upstream, but IMO expanding the CAPG API to accommodate for this specific use case might be too broad of a change. I'm not a CAPG reviewer or maintainer, though.

@@ -668,7 +668,8 @@ func (g *Cloud) ensureInternalInstanceGroups(name string, nodes []*v1.Node) ([]s
 			parts := strings.Split(ins.Instance, "/")
 			groupInstances.Insert(parts[len(parts)-1])
 		}
-		if names.HasAll(groupInstances.UnsortedList()...) {
+		groupInstanceNames := groupInstances.UnsortedList()

I'm fine with the code as is, but since this is looking like it will be a long term carry, a comment as to why we have these conditions will help reduce issue archeology in the future.

@patrickdillon (Author)

The motivation is included in the commit message, which I just brushed up again.

Do you think that works, or do we also need an in-code comment?

This updates the logic from ac00767 for using existing instance groups
to allow the bootstrap node to be present in the instance group.
Prior to this commit, the logic to utilize existing instance groups
checked that the instance group contains only cluster nodes (which the
bootstrap node is not), so if the bootstrap node is in the same ig as
one of the masters, the master would be added to a second ig, which
breaks things.

We can resolve the issue by relaxing the logic to ensure that
all instances in the instance group have the node prefix.

Vendors the changes in order to be utilized by the cloud provider.

Fixes OCPBUGS-35256
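Read together with the truncated hunk above, a hypothetical shape of the relaxed condition (the allHaveNodePrefix helper from the earlier sketch and the clusterNodePrefix variable are assumptions, not the exact carry patch):

	groupInstanceNames := groupInstances.UnsortedList()
	// Use this pre-existing group if it holds only cluster nodes, or,
	// relaxed for OCPBUGS-35256, if every instance shares the node name
	// prefix; the latter admits the bootstrap VM, which is not a node.
	if names.HasAll(groupInstanceNames...) || allHaveNodePrefix(groupInstanceNames, clusterNodePrefix) {
		igLinks = append(igLinks, ig.SelfLink)
		skip.Insert(groupInstanceNames...)
	}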
@bfournie commented Aug 5, 2024

I tested this PR with an internal cluster and it seems to work; I no longer see the wrongSubnetwork error in cloud-controller-manager that I was seeing in
https://issues.redhat.com/browse/OCPBUGS-35256

I think it's preferable to use this PR as opposed to adding knowledge of a bootstrap VM to kubernetes-sigs/cluster-api-provider-gcp#1266

@bfournie commented Aug 5, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 5, 2024
openshift-ci bot commented Aug 5, 2024

@patrickdillon: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@nrb commented Aug 6, 2024

/approve

openshift-ci bot commented Aug 6, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nrb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 6, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 8ce997d into openshift:master Aug 6, 2024
6 checks passed
@openshift-ci-robot

@patrickdillon: Jira Issue OCPBUGS-35256: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-35256 has been moved to the MODIFIED state.

In response to this:

With the move to CAPG-based installs, the installer places the bootstrap node in the same instance group as one of the masters, due to limitations in CAPG. (In Terraform installs the bootstrap node has its own instance group.) Having the bootstrap node in the same instance group as a master exposes a bug in the logic of a patch we carry. As the commit message states:

This change finds pre-existing instance-groups that ONLY contain
instances that belong to the cluster, uses them for the backend
service.

The logic for that check is implemented here:

if names.HasAll(groupInstances.UnsortedList()...) {
	igLinks = append(igLinks, ig.SelfLink)
	skip.Insert(groupInstances.UnsortedList()...)
}

The issue is that the check uses only "node" names, and cloud-provider-gcp does not consider the bootstrap VM a node. We can resolve the issue by relaxing the logic to require only that all instances in the instance group have the node prefix.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-bot

[ART PR BUILD NOTIFIER]

Distgit: ose-gcp-cloud-controller-manager
This PR has been included in build ose-gcp-cloud-controller-manager-container-v4.17.0-202408061615.p0.g8ce997d.assembly.stream.el9.
All builds following this will include this PR.
