-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
switch trusted/remote cluster management to atomic write #48009
Conversation
b0603a5
to
dd4ce9e
Compare
This pull request is automatically being deployed by Amplify Hosting (learn more). |
1051970
to
3bfc519
Compare
3bfc519
to
9b0530b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just spotted one typo. On the whole this new logic seems much easier to follow!
if active { | ||
if currentlyActive { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An explicit truth table like switch might make this a bit clearer and reduce some of the indentation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Attempted. The reduced indentation was nice, but I found it overall a bit less readable, so leaving as-is for now.
return condacts, nil | ||
} | ||
|
||
func updateCertAuthoritiesCondActs(cas []types.CertAuthority, active bool, currentlyActive bool) ([]backend.ConditionalAction, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are all of the possible conditional actions taken here covered in tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added a separate dedicated test just for covering each case with and without a conflict.
9b0530b
to
65f7014
Compare
65f7014
to
f58c589
Compare
@fspmarshall See the table below for backport results.
|
if existingCluster == nil { | ||
return a.createTrustedCluster(ctx, tc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Concurrent calls to UpsertTrustedCluster
can result in an AlreadyExists
error, which at the very least feels at least a bit wrong.
Are we already retrying every operation in the terraform provider no matter what the error is? Will we remember to do so for UpsertTrustedCluster
if and when that comes to be automatable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I tried adding retry to this logic, but ended up discarding it. Once you start to try to distinguish between AlreadyExists
due to concurrent tc creation vs AlreadyExists
due to conflicting CAs and handling CompareFailed
from failed updates, it all starts to get a bit messy. My (potentially weak) justification for leaving as-is is that anything currently updating these in a racy fashion is already broken because it is likely to cause conflicting state (e.g. role map on CA disagreeing with role map in TC). Unexpected error seems like a strictly better outcome. I can take another stab at if you feel strongly about it, but right now I'm leaning toward just accepting that this method might yield unexpected errors during concurrent use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you find yourself around the code maybe we should replace the AlreadyExists with a CompareFailed here in UpsertTrustedCluster? Otherwise it's not even worth the mental effort in opening a PR, I agree.
var err error | ||
existingCluster, err = a.GetTrustedCluster(ctx, trustedCluster.GetName()) | ||
existingCluster, err = a.GetTrustedCluster(ctx, tc.GetName()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not wrong in the current implementation of (*Server).GetTrustedCluster
but the typical error contract is that the return value must not be inspected in any way if the err is not nil - GetTrustedCluster
might be accidentally updated to return a concrete type, which would make existingCluster
always non-nil (but invalid) even in case of an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in #48176
This PR switches creation/update/delete of trusted and remote clusters (and their associated resources) to use the
AtomicWrite
API. Previously, the various checks and writes associated with these operations were performed sequentially, leading to various possible inconsistent states in the event of crashes, errors, or concurrent writes. With these changes, all relevant trust resources are now updated as a single atomic operation.