
OCPBUGS-35256: Allow bootstrap node in existing instance groups #66

Merged

Conversation

@patrickdillon commented Aug 1, 2024

With the move to CAPG-based installs, the installer places the bootstrap node in the same instance group as one of the masters, due to limitations in CAPG. (In Terraform installs the bootstrap node has its own instance group.) Having the bootstrap node in the same instance group as a master exposes a bug in the logic of a patch we carry. As the commit message states:

This change finds pre-existing instance-groups that ONLY contain
instances that belong to the cluster, uses them for the backend
service.

The logic for that check is implemented here:

if names.HasAll(groupInstances.UnsortedList()...) {
	igLinks = append(igLinks, ig.SelfLink)
	skip.Insert(groupInstances.UnsortedList()...)
}

The issue is that the check uses only "node" names, and cloud-provider-gcp does not consider the bootstrap VM a node. We can resolve the issue by relaxing the logic to require only that all instances in the instance group have the node prefix.
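A minimal sketch of that relaxed check, assuming a hypothetical helper and instance names (the actual carry patch lives in ensureInternalInstanceGroups and may differ):

	package main

	import (
		"fmt"
		"strings"
	)

	// allHaveNodePrefix reports whether every instance name in the group
	// starts with the cluster's node name prefix. The bootstrap VM is not a
	// Kubernetes node, but its name shares the prefix, so a group holding
	// masters plus the bootstrap VM still passes the relaxed check.
	func allHaveNodePrefix(instanceNames []string, nodePrefix string) bool {
		if len(instanceNames) == 0 {
			return false
		}
		for _, name := range instanceNames {
			if !strings.HasPrefix(name, nodePrefix) {
				return false
			}
		}
		return true
	}

	func main() {
		// Hypothetical names; real clusters derive them from the infra ID.
		group := []string{"mycluster-abcde-master-0", "mycluster-abcde-bootstrap"}
		fmt.Println(allHaveNodePrefix(group, "mycluster-abcde")) // prints: true
	}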

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 1, 2024
@patrickdillon (Author)

/retest images

openshift-ci bot commented Aug 3, 2024

@patrickdillon: The /retest command does not accept any targets.
The following commands are available to trigger required jobs:

  • /test e2e-gcp-ovn
  • /test e2e-gcp-ovn-upgrade
  • /test fmt
  • /test images

The following commands are available to trigger optional jobs:

  • /test okd-scos-images
  • /test verify-commits

Use /test all to run the following jobs that were automatically triggered:

  • pull-ci-openshift-cloud-provider-gcp-master-e2e-gcp-ovn
  • pull-ci-openshift-cloud-provider-gcp-master-e2e-gcp-ovn-upgrade
  • pull-ci-openshift-cloud-provider-gcp-master-fmt
  • pull-ci-openshift-cloud-provider-gcp-master-images
  • pull-ci-openshift-cloud-provider-gcp-master-verify-commits

In response to this:

/retest images

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@patrickdillon patrickdillon changed the title WIP: Allow bootstrap node in existing instance groups OCPBUGS-35256: Allow bootstrap node in existing instance groups Aug 5, 2024
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 5, 2024
@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Aug 5, 2024
@openshift-ci-robot

@patrickdillon: This pull request references Jira Issue OCPBUGS-35256, which is invalid:

  • expected the bug to target the "4.17.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

The logic that allows cloud-provider-gcp to utilize existing instance groups checks that the instance group contains only cluster nodes. cloud-provider-gcp does not consider the bootstrap node to be a node, so if the bootstrap node is in the same instance group as one of the masters, cloud-provider-gcp will try to add that master to another instance group.

We can resolve the issue by relaxing the logic to ensure that all instances in the instance group have the node prefix.

Opening this to test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@patrickdillon (Author)

/jira refresh

@openshift-ci-robot

@patrickdillon: This pull request references Jira Issue OCPBUGS-35256, which is invalid:

  • expected the bug to target the "4.17.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@patrickdillon (Author)

/jira refresh

@openshift-ci-robot

@patrickdillon: This pull request references Jira Issue OCPBUGS-35256, which is invalid:

  • expected the bug to target the "4.17.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@patrickdillon (Author)

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Aug 5, 2024
@openshift-ci-robot

@patrickdillon: This pull request references Jira Issue OCPBUGS-35256, which is valid.

3 validation(s) were run on this bug:
  • bug is open, matching expected state (open)
  • bug target version (4.17.0) matches configured target version for branch (4.17.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @jianli-wei

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from jianli-wei August 5, 2024 13:39
@patrickdillon (Author)

While testing my changes to this PR, it was necessary to run go mod vendor to pull my changes into the vendor directory, apparently due to:

k8s.io/cloud-provider-gcp/providers => ./providers
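That line is the replace directive in the repository's go.mod pointing the providers module at the in-tree ./providers directory; a sketch of the resulting workflow:

	# go.mod carries a local replace, so edits under ./providers are not
	# visible to the main module until the vendor tree is regenerated:
	#   replace k8s.io/cloud-provider-gcp/providers => ./providers
	go mod vendor    # re-copies ./providers sources into ./vendor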

@patrickdillon (Author)

As this builds, minimally, on top of another carry patch, please let me know if there's anything I can do commit-wise to make this easier for future carries/rebases.

@@ -166,7 +166,7 @@ func (g *Cloud) EnsureLoadBalancer(ctx context.Context, clusterName string, svc
 		return nil, err
 	}
 
-	klog.V(4).Infof("EnsureLoadBalancer(%v, %v, %v, %v, %v): ensure %v loadbalancer", clusterName, svc.Namespace, svc.Name, loadBalancerName, g.region, desiredScheme)
+	klog.V(4).Infof("debug EnsureLoadBalancer(%v, %v, %v, %v, %v): ensure %v loadbalancer", clusterName, svc.Namespace, svc.Name, loadBalancerName, g.region, desiredScheme)

We should either remove the debug prefix in all the logging calls or change the log level.

I suspect this is left over from debugging work, though :)

@patrickdillon (Author)

oh right, but I forgot to revendor after cleaning up my work. Thanks for catching this. Should be fixed now!

@nrb commented Aug 5, 2024

the installer places the bootstrap node in the same instance group as one of the masters, due to limitations in CAPG

Can you refresh my memory - how many instance groups are there total for the masters?

@bfournie commented Aug 5, 2024

the installer places the bootstrap node in the same instance group as one of the masters, due to limitations in CAPG

Can you refresh my memory - how many instance groups are there total for the masters?

One for each master, so 3 instance groups. The bootstrap VM is included in the first instance group. See https://issues.redhat.com/browse/OCPBUGS-35256?focusedId=24931575&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-24931575

@patrickdillon (Author)

the installer places the bootstrap node in the same instance group as one of the masters, due to limitations in CAPG

Can you refresh my memory - how many instance groups are there total for the masters?

This code update is only relevant for private installs, in which case the control plane nodes only belong to a single instance group. Here is a screenshot from a terraform install (note the bootstrap ig), showing the VMs and their instance groups:

[screenshot: VMs and instance groups in a Terraform install, with a separate bootstrap instance group]

Here is a screenshot for capg-based, where bootstrap shares an IG with a master:

[screenshot: VMs and instance groups in a CAPG-based install, with the bootstrap VM sharing an instance group with a master]
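For illustration only (instance and group names here are hypothetical), the two layouts look roughly like this:

	Terraform install (bootstrap in its own group):
	    bootstrap-ig:   <infraID>-bootstrap
	    master-ig-a:    <infraID>-master-0
	    master-ig-b:    <infraID>-master-1
	    master-ig-c:    <infraID>-master-2

	CAPG-based install (bootstrap shares the first master group):
	    master-ig-a:    <infraID>-master-0, <infraID>-bootstrap
	    master-ig-b:    <infraID>-master-1
	    master-ig-c:    <infraID>-master-2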

@nrb commented Aug 5, 2024

Ah, ok, this has to do with the instance groups being confined to a region/zone, and the bootstrap node will have to be in one of them.

I'm ok with carrying this change to our carried logic.

I see there was discussion about using kubernetes-sigs/cluster-api-provider-gcp#1266 instead. Usually I'd prefer to use upstream, but IMO expanding the CAPG API to accommodate for this specific use case might be too broad of a change. I'm not a CAPG reviewer or maintainer, though.

@@ -668,7 +668,8 @@ func (g *Cloud) ensureInternalInstanceGroups(name string, nodes []*v1.Node) ([]s
 			parts := strings.Split(ins.Instance, "/")
 			groupInstances.Insert(parts[len(parts)-1])
 		}
-		if names.HasAll(groupInstances.UnsortedList()...) {
+		groupInstanceNames := groupInstances.UnsortedList()

I'm fine with the code as is, but since this is looking like it will be a long term carry, a comment as to why we have these conditions will help reduce issue archeology in the future.

@patrickdillon (Author)

The motivation is included in the commit message, which I just brushed up again.

Do you think that works, or do we also need an in-code comment?

This updates the logic from ac00767 for using existing instance groups
to allow the bootstrap node to be present in the instance group.
Prior to this commit, the logic to utilize existing instance groups
checked that the instance group contains only cluster nodes (which the
bootstrap node is not), so if the bootstrap node is in the same ig as
one of the masters, the master would be added to a second ig, which
breaks things.

We can resolve the issue by relaxing the logic to ensure that
all instances in the instance group have the node prefix.

Vendors the changes in order to be utilized by the cloud provider.

Fixes OCPBUGS-35256
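Read together with the truncated hunk above, a hypothetical shape of the relaxed condition (the allHaveNodePrefix helper from the earlier sketch and the clusterNodePrefix variable are assumptions, not the exact carry patch):

	groupInstanceNames := groupInstances.UnsortedList()
	// Use this pre-existing group if it holds only cluster nodes, or,
	// relaxed for OCPBUGS-35256, if every instance shares the node name
	// prefix; the latter admits the bootstrap VM, which is not a node.
	if names.HasAll(groupInstanceNames...) || allHaveNodePrefix(groupInstanceNames, clusterNodePrefix) {
		igLinks = append(igLinks, ig.SelfLink)
		skip.Insert(groupInstanceNames...)
	}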
@bfournie commented Aug 5, 2024

I tested this PR with an internal cluster and it seems to work; I no longer see the wrongSubnetwork error in cloud-controller-manager that I was seeing in
https://issues.redhat.com/browse/OCPBUGS-35256

I think it's preferable to use this PR as opposed to adding knowledge of a bootstrap VM to kubernetes-sigs/cluster-api-provider-gcp#1266

@bfournie commented Aug 5, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 5, 2024
openshift-ci bot commented Aug 5, 2024

@patrickdillon: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@nrb commented Aug 6, 2024

/approve

openshift-ci bot commented Aug 6, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nrb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 6, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 8ce997d into openshift:master Aug 6, 2024
6 checks passed
@openshift-ci-robot

@patrickdillon: Jira Issue OCPBUGS-35256: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-35256 has been moved to the MODIFIED state.

In response to this:

With the move to CAPG-based installs, the installer places the bootstrap node in the same instance group as one of the masters, due to limitations in CAPG. (In Terraform installs the bootstrap node has its own instance group.) Having the bootstrap node in the same instance group as a master exposes a bug in the logic of a patch we carry. As the commit message states:

This change finds pre-existing instance-groups that ONLY contain
instances that belong to the cluster, uses them for the backend
service.

The logic for that check is implemented here:

if names.HasAll(groupInstances.UnsortedList()...) {
	igLinks = append(igLinks, ig.SelfLink)
	skip.Insert(groupInstances.UnsortedList()...)
}

The issue is that the check uses only "node" names, and cloud-provider-gcp does not consider the bootstrap VM a node. We can resolve the issue by relaxing the logic to require only that all instances in the instance group have the node prefix.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-bot

[ART PR BUILD NOTIFIER]

Distgit: ose-gcp-cloud-controller-manager
This PR has been included in build ose-gcp-cloud-controller-manager-container-v4.17.0-202408061615.p0.g8ce997d.assembly.stream.el9.
All builds following this will include this PR.
