
Fast-fail make e2e/single-az if cluster has nodes in multiple AZs #2289

Open
wants to merge 1 commit into base: master
Conversation

@AndrewSirenko (Contributor) commented Jan 15, 2025

What type of PR is this?

/kind cleanup

What is this PR about? / Why do we need it?

Today, if you run make e2e/single-az against a cluster with nodes in multiple AZs, certain tests will mysteriously fail, because they require that all nodes be in a single AZ. This is a painful gotcha for new contributors and for anyone who tries to run all of our e2e tests.

For the short term, this PR makes make e2e/single-az fail fast if your cluster has worker nodes in multiple AZs.

In the future, we should see whether the [single-az] e2e tests themselves can be made to pass on a multi-AZ cluster. (Maybe pick an AZ at the start of the test run and add a node selector for that AZ to all test pods?)
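That future idea might look roughly like this sketch (the hard-coded zone and the rendering step are hypothetical; a real harness would detect the zone from a worker node's label):

```shell
# Sketch of the node-selector idea. TEST_ZONE is hard-coded here for
# illustration; the real test harness would detect it (e.g. from the
# first worker node's topology.kubernetes.io/zone label).
TEST_ZONE="us-west-2a"

# Fragment each test pod spec would gain, pinning all pods to one AZ:
selector_fragment=$(cat <<EOF
nodeSelector:
  topology.kubernetes.io/zone: ${TEST_ZONE}
EOF
)
echo "$selector_fragment"
```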

How was this change tested?

Start with a multi-AZ cluster and observe the failure. Then delete or cordon the nodes in the other AZs and observe that the tests pass.

❯ make cluster/create
...
❯ make e2e/single-az
AWS_AVAILABILITY_ZONES=us-west-2a \
TEST_PATH=./tests/e2e/... \
GINKGO_FOCUS="\[ebs-csi-e2e\] \[single-az\]" \
GINKGO_PARALLEL=5 \
HELM_EXTRA_FLAGS="--set=controller.volumeModificationFeature.enabled=true,sidecars.provisioner.additionalArgs[0]='--feature-gates=VolumeAttributesClass=true',sidecars.resizer.additionalArgs[0]='--feature-gates=VolumeAttributesClass=true',node.enableMetrics=true" \
./hack/e2e/run.sh
###
## ERROR. single-az tests require all worker nodes to be in a single availability zone (AZ) that matches env var $AWS_AVAILABILITY_ZONES (Currently set as "us-west-2a"). Consider cordoning nodes in other AZs.
#
make: *** [e2e/single-az] Error 1


❯ k cordon i-09ab24e3d3e662057
node/i-09ab24e3d3e662057 cordoned
❯ k cordon i-0093186eb892244a7
node/i-0093186eb892244a7 cordoned

❯ make e2e/single-az
AWS_AVAILABILITY_ZONES=us-west-2a \
TEST_PATH=./tests/e2e/... \
GINKGO_FOCUS="\[ebs-csi-e2e\] \[single-az\]" \
GINKGO_PARALLEL=5 \
HELM_EXTRA_FLAGS="--set=controller.volumeModificationFeature.enabled=true,sidecars.provisioner.additionalArgs[0]='--feature-gates=VolumeAttributesClass=true',sidecars.resizer.additionalArgs[0]='--feature-gates=VolumeAttributesClass=true',node.enableMetrics=true" \
./hack/e2e/run.sh
###
## Applying snapshot controller and CRDs
#
...

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. labels Jan 15, 2025
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from andrewsirenko. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 15, 2025

Code Coverage Diff

This PR does not change the code coverage

@AndrewSirenko (Contributor, Author)

/retest

The Windows e2e failures were likely a flake (pod timeout).

@AndrewSirenko AndrewSirenko changed the title wip: Improve devloop by fast-failing make e2e/single-az if cluster has nodes in multiple AZs Improve devloop by fast-failing make e2e/single-az if cluster has nodes in multiple AZs Jan 15, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 15, 2025
@AndrewSirenko (Contributor, Author)

/hold

Assuming I'll need one more manual test after all comments are resolved.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 15, 2025
if [[ $(kubectl get nodes \
--selector '!node-role.kubernetes.io/control-plane' \
-o jsonpath='{.items[*].metadata.labels.topology\.kubernetes\.io/zone}' |
sort | uniq | wc -l) -ne 1 ]]; then
Contributor:
Can we do greater than 1 (rather than not equal), so if the nodes don't have the topology label for whatever reason this doesn't break that case?
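A minimal sketch of that tolerant variant (count_zones here is a hypothetical stand-in for the kubectl pipeline in the PR; it reads whitespace-separated zone labels on stdin, so nodes missing the label contribute nothing and zero zones passes the check):

```shell
# Count distinct zone labels; with -gt 1 the check only fails when MORE
# than one zone is present, so unlabeled nodes (zero zones) do not trip it.
count_zones() {
  tr ' ' '\n' | sed '/^$/d' | sort -u | wc -l
}

if [ "$(printf '%s' 'us-west-2a us-west-2a us-west-2b' | count_zones)" -gt 1 ]; then
  echo "multiple AZs detected"
fi
```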

@@ -133,6 +133,17 @@ else
else
set -x
set +e
# Fail early if running single-az tests AND current cluster has nodes in multiple AZs. TODO see if we can add node selector for single-az tests.
if [[ "$GINKGO_FOCUS" =~ "single-az" ]]; then
Contributor:

This is too late to do the check - the driver has already been installed and this will leave the cluster in a broken state. It should be moved way earlier in the file to fail ASAP, probably around line 49.

Contributor (Author):
Good point. I tested that make e2e/single-az still works on retry despite the early exit, but you're right we can just check early.

@torredil (Member) left a comment:

certain tests will mysteriously fail (because they require there only being nodes in one AZ)

Is it every single test under single-az, or a select few? Providing an example or logs to review an instance of this failure would be helpful.

Comment on lines +142 to +143
loudecho "ERROR. Attempting to run '[single-az]' tests on cluster with Nodes in multiple availability-zones. Try again once cluster only has Nodes in one AZ"
exit 1
Member:

Thoughts on logging this warning but not actually returning a failure when a multi-az cluster is detected? Is there ever a scenario where a maintainer might want to run the single-az tests in a multi-az cluster anyway? (suppose someone has multi-az cluster up already, and would prefer to cordon a few nodes instead of creating a new cluster).

@AndrewSirenko commented Jan 15, 2025:

Is it every single test under single-az, or a select few?

A select few. We have a backlog spike/task for resolving the root problem. This is a short-term solution so new contributors stop getting stuck.

Thoughts on logging this warning but not actually returning a failure when a multi-az cluster is detected?

Why not fail fast if we know the target will fail? It's easy to miss a warning.

Is there ever a scenario where a maintainer might want to run the single-az tests in a multi-az cluster anyway?

How does a maintainer know that the failing single-az tests are not their fault if they run them anyway?

Also, a maintainer can run the tests without going through the make e2e/single-az target.

I thought about ignoring cordoned nodes in the check, but a colleague convinced me that this might be a misleading workaround. (Edited) A maintainer can just delete the nodes in the extra AZs if they really don't want a new cluster.

@AndrewSirenko commented Jan 15, 2025:

Providing an example or logs to review an instance of this failure would be helpful.

Yes, there will be full logs once all other comments are addressed.

Until then, I recommend checking out this PR locally and running the example commands I provided, with and without the changes, to get a visceral feeling of the pain 🥲

@AndrewSirenko commented Jan 15, 2025:

I was talking to @ElijahQuinones; he said cordoning nodes is sufficient to pass the tests, so I'll filter cordoned nodes out of the check!

He also suggested a yes/no prompt here as a manual override, but I prefer filtering out cordoned nodes because I'm not sure we want an interactive prompt inside a Makefile target.
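Filtering cordoned nodes out of the zone count could be sketched like this (the two-column input format and helper name are assumed purely for illustration; the real check would derive zone and cordon status from kubectl get nodes):

```shell
# count_uncordoned_zones reads one "ZONE CORDONED" pair per node on
# stdin and counts distinct zones among nodes that are NOT cordoned
# (CORDONED != "true"). Hypothetical helper, not the PR's actual code.
count_uncordoned_zones() {
  awk '$2 != "true" { print $1 }' | sed '/^$/d' | sort -u | wc -l
}

printf 'us-west-2a false\nus-west-2b true\nus-west-2c true\n' |
  count_uncordoned_zones
# prints: 1
```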

Member:

I think it would be easier, instead of filtering out cordoned nodes, to just ask "do you want to continue?" whenever a multi-AZ cluster is detected. That way, someone who is knowingly running these tests with cordoned nodes can continue, and someone who is not can exit. Another issue with only showing a warning: if I read the message and then stop the tests with a Ctrl-C, the tests can leak volumes and pods, since they will never get cleaned up.
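The yes/no prompt being proposed could look something like this (function name and wording are hypothetical):

```shell
# confirm_multi_az asks the operator whether to proceed anyway when a
# multi-AZ cluster is detected; returns 0 on "y"/"Y", 1 otherwise.
confirm_multi_az() {
  printf 'Cluster has nodes in multiple AZs. Continue anyway? [y/N] '
  read -r reply
  case "$reply" in
    y|Y) return 0 ;;
    *)   return 1 ;;
  esac
}

# Non-interactive demo; a real run would read from the terminal.
printf 'y\n' | confirm_multi_az && echo "continuing"
```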

@AndrewSirenko AndrewSirenko changed the title Improve devloop by fast-failing make e2e/single-az if cluster has nodes in multiple AZs Fast-fail make e2e/single-az if cluster has nodes in multiple AZs Jan 15, 2025
@AndrewSirenko (Contributor, Author)

Looks like the e2e/single-az Makefile target sets AWS_AVAILABILITY_ZONES=us-west-2a.

I'll probably change the check so that the AZ of the nodes must also match what is in $AWS_AVAILABILITY_ZONES (and add that to the warning).
