Fast-fail `make e2e/single-az` if cluster has nodes in multiple AZs #2289
base: master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Code Coverage Diff: This PR does not change the code coverage.
Force-pushed from b47bb6b to 76cea0b
/retest

Windows e2e failures were likely a flake (pod timeout).
/hold

Assuming I'll need one more manual test after all comments are resolved.
```bash
if [[ $(kubectl get nodes \
  --selector '!node-role.kubernetes.io/control-plane' \
  -o jsonpath='{.items[*].metadata.labels.topology\.kubernetes\.io/zone}' |
  sort | uniq | wc -l) -ne 1 ]]; then
```
Can we do greater than 1 (rather than not equal), so if the nodes don't have the topology label for whatever reason this doesn't break that case?
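For illustration, a minimal sketch of the suggested "greater than 1" variant (not the PR's final code; the surrounding context and variable names are assumptions):

```bash
# Sketch only: count distinct AZ labels across worker nodes and fail only when
# more than one distinct value is found, so unlabeled nodes (count 0 or 1)
# do not trip the check.
zone_count=$(kubectl get nodes \
  --selector '!node-role.kubernetes.io/control-plane' \
  -o jsonpath='{.items[*].metadata.labels.topology\.kubernetes\.io/zone}' \
  | tr ' ' '\n' | sort -u | grep -c . || true)
if [[ "$zone_count" -gt 1 ]]; then
  echo "ERROR: worker nodes span multiple availability zones" >&2
  exit 1
fi
```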
```
@@ -133,6 +133,17 @@ else
else
set -x
set +e
# Fail early if running single-az tests AND current cluster has nodes in multiple AZs. TODO see if we can add node selector for single-az tests.
if [[ "$GINKGO_FOCUS" =~ "single-az" ]]; then
```
This is too late to do the check - the driver has already been installed and this will leave the cluster in a broken state. It should be moved way earlier in the file to fail ASAP, probably around line 49.
Good point. I tested that `make e2e/single-az` still works on retry despite the early exit, but you're right, we can just check early.
> certain tests will mysteriously fail (because they require there only being nodes in one AZ)

Is it every single test under `single-az`, or a select few? Providing an example or logs to review an instance of this failure would be helpful.
loudecho "ERROR. Attempting to run '[single-az]' tests on cluster with Nodes in multiple availability-zones. Try again once cluster only has Nodes in one AZ" | ||
exit 1 |
Thoughts on logging this warning but not actually returning a failure when a multi-az cluster is detected? Is there ever a scenario where a maintainer might want to run the single-az tests in a multi-az cluster anyway? (Suppose someone has a multi-az cluster up already, and would prefer to cordon a few nodes instead of creating a new cluster.)
> Is it every single test under single-az, or a select few?

A select few. We have a backlog spike/task for resolving the root problem. This is a short-term solution so new contributors stop getting stuck.

> Thoughts on logging this warning but not actually returning a failure when a multi-az cluster is detected?

Why not fast-fail if we know the target will fail? It's easy to miss a warning.

> Is there ever a scenario where a maintainer might want to run the single-az tests in a multi-az cluster anyway?

How does a maintainer know that failing single-az tests are not their fault if they run them anyway? Also, a maintainer can run the tests without the `make e2e/single-az` target.

I thought about ignoring cordoned nodes for the check, but a colleague convinced me that this might be a misleading workaround. (Edited) A maintainer can just delete the nodes in extra AZs if they really don't want a new cluster.
> providing an example or logs to review an instance of this failure would be helpful.

Yes, there will be full logs once all other comments are addressed. Until then, I recommend checking out this PR locally and running the example commands I provided, with and without the changes, to get a visceral feeling of the pain 🥲
Was talking to @ElijahQuinones, and he said cordoning nodes is sufficient to pass the tests, so I'll filter out cordoned nodes from the check!

He also had the idea of a yes/no prompt here as a manual override, but I prefer the cordoned-node filtering because I'm not sure we want a prompt within a makefile target.
I think it would probably be easier to, instead of filtering out cordoned nodes, just ask "do you want to continue?" whenever you detect a multi-AZ cluster. That way, if someone is knowingly running these tests with cordoned nodes they can continue, and if they are not they can exit. Another issue with just the warning is that if you show the warning and I then want to stop the tests with a Ctrl-C after reading the message, you can leak volumes and pods from the tests, as they will never get cleaned up.
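For illustration, a minimal sketch of the suggested confirmation prompt (the `zone_count` variable and wording are assumptions, not code from this PR):

```bash
# Sketch only: warn and ask for confirmation instead of failing outright.
if [[ "$zone_count" -gt 1 ]]; then
  loudecho "WARNING: cluster has worker nodes in multiple AZs; some '[single-az]' tests may fail"
  read -r -p "Continue anyway? [y/N] " answer
  if [[ ! "$answer" =~ ^[Yy]$ ]]; then
    exit 1
  fi
fi
```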
Looks like I'll probably change the check such that the AZs of the nodes must also match what is in `$AWS_AVAILABILITY_ZONES` (and add that to the warning).
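A hedged sketch of what that extra check could look like, assuming `$AWS_AVAILABILITY_ZONES` is a comma-separated list (the variable format is an assumption; `loudecho` is the repo's existing logging helper):

```bash
# Sketch only: fail if any worker-node AZ is not listed in AWS_AVAILABILITY_ZONES.
node_zones=$(kubectl get nodes \
  --selector '!node-role.kubernetes.io/control-plane' \
  -o jsonpath='{.items[*].metadata.labels.topology\.kubernetes\.io/zone}')
for zone in $node_zones; do
  if [[ ",${AWS_AVAILABILITY_ZONES}," != *",${zone},"* ]]; then
    loudecho "ERROR: node AZ '${zone}' is not in AWS_AVAILABILITY_ZONES ('${AWS_AVAILABILITY_ZONES}')"
    exit 1
  fi
done
```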
What type of PR is this?
/kind cleanup
What is this PR about? / Why do we need it?
Today, if you try to run `make e2e/single-az` on a cluster with nodes in multiple AZs, certain tests will mysteriously fail (because they require there only being nodes in one AZ). This is a painful gotcha for new contributors or those who try to run all of our e2e tests.

For the short term, this PR fast-fails `make e2e/single-az` if your cluster has worker nodes in multiple AZs.

In the future, we should see if we can do something in the `[single-az]` e2e tests themselves to have them pass on a multi-az cluster. (Maybe pick an AZ at the start of the test, and add a node selector for that AZ to all pods?)

How was this change tested?
Have a multi-AZ cluster. See the failure. Delete or cordon the nodes in other AZs. Tests pass.
Does this PR introduce a user-facing change?