Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🐛 Do not update KCP and MS status when unable to get workload cluster #10229
🐛 Do not update KCP and MS status when unable to get workload cluster #10229
Changes from all commits
0561c98
132bdbc
dbb056b
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add an additional call to updateStatus here:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Will add the case after the discussion is resolved: #10229 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here a proposed solution to the ErrClusterLocked issue: (that Fabrizio was referring to)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is that when the workload cluster is reachable, we only get ErrClusterLocked for a very short amount of time (the time it takes to create a client). For this case it is good enough to simply retry creating the client.
We will fallback to the previous behavior only if the workload cluster is actually not reachable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool!I'm not quite sure if it's worthy to add this bulk of code to resolve the temporary inconsistency of some status replicas and conditions fields caused by ErrClusterLocked error. The simple change in current PR can solve this problem while there are inconsistency but acceptable and won't cause issues per my thinking (maybe I missed some cases).
cc @fabriziopandini for awareness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sbueringer and I are in total agreement, and we are trying to help in re-focusing the PR to the original issue. We also took some additional steps to help in finding a way forward by proposing an alternative solution.
With regards to the current PR, I already tried to explain my concern and I will be happy to add more if you have any doubts (and I already answered one).
But given the concern above, I'm personally -1 to merge the PR in the current state.
Instead, we both think the change proposed by @sbueringer solves the original issues, but ultimately it is up to you to accept it or not
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @fabriziopandini for the details. Here are my two cents:
return err
instead ofcontinue
since canCommunicateWithWorkloadCluster won't change in the for loop.newStatus.Replicas = int32(len(filteredMachines))
.However KCP controller is different and already updates some status fields before hitting ErrClusterLocked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more thing: not sure if
canCommunicateWithWorkloadCluster == true
can 100% guarentee getMachineNode() won't hit ErrClusterLocked since they are two serial func calls but not atom calls ?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our main concern is basically that if we return an error here as soon as we hit ErrClusterLocked we don't update the status at all anymore. This should be okay in the happy path, which is that we can actually communicate with the workload cluster and it's just a matter of waiting until another goroutine successfully created the client so we can use it. The problem we see is if the workload cluster is actually not reachable. Because in that case we will just continuously return the error forever. In this case we "freeze" the values of status: FullyLabeledReplicas, ReadyReplicas, AvailableReplicas, ObservedGeneration and a few conditions.
The concern Fabrizio raised (and I didn't have on my radar before) is that if we freeze the status indefinitely in these cases (or rather until the workload cluster is reachable) this can be pretty confusing for users. So for this case we should actually have a more extensive solution which also covers signalling to users that we can't communicate with the workload cluster anymore.
What we were trying to suggest was a mitigation for the issue for the happy path, where we actually can talk to the workload cluster, but replicas are flipping only because of the way we create the client initially for a cluster.
I'm not really sure about this one. Basically we introduced the current "TryLock" over a "Lock" to make sure we don't block all reconcile for a cluster when a workload cluster is not reachable. The "Lock" we had before lead to really problematic performance issues when some workload clusters (with a lot of Machines) were unreachable.
So I think we should be careful with introducing these retries as they can lead to serious performance degradation (basically by just introducing this here, every reconcile of a MachineSet of an unreachable cluster will take 1 second).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the issue
But if I'm not wrong with the current implementation we are going to freeze the replica count indefinetly, which could be confusing or eventually also wrong, given that the number of replicas/machine might change in the meantime.
Frankly speaking, in order to properly fix this issue I think that we first need to figure out how to get/store the "last seen" information for each node.
Given that info, we can decide if it is ok to flip the replica status or if to wait (but I don't have a concrete idea on how to do so, I need some time to research into it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Important point that I discussed with Fabrizio. We're not only freezing the replica fields, we would also freeze other status fields like ObservedGeneration and conditions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why are we ignoring not found.
Is that for the case when someone manually deletes a node? If yes, please add a comment (but it also seems unrelated to the issue we are trying to fix)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for clarification.
The idea behind this one was to signal to the calling function that the node doesn't exist (vs. that it's just an error). Because not finding the Node is also a valuable information (which basically comes down to that the node is not ready)
This was added to preserve the previous behavior one level above (where we only logged the error before but still considered a not found node a node that is not ready)
But yeah it becomes pretty had to understand