switch trusted/remote cluster management to atomic write #48009

Merged
merged 1 commit into from
Oct 29, 2024

Conversation

fspmarshall
Contributor

This PR switches creation/update/delete of trusted and remote clusters (and their associated resources) to use the AtomicWrite API. Previously, the various checks and writes associated with these operations were performed sequentially, leading to various possible inconsistent states in the event of crashes, errors, or concurrent writes. With these changes, all relevant trust resources are now updated as a single atomic operation.

This pull request is automatically being deployed by Amplify Hosting (learn more).

Access this pull request here: https://pr-48009.d3pp5qlev8mo18.amplifyapp.com

@fspmarshall fspmarshall force-pushed the fspmarshall/trust-atomics branch 4 times, most recently from 1051970 to 3bfc519 Compare October 28, 2024 17:40
@fspmarshall fspmarshall force-pushed the fspmarshall/trust-atomics branch from 3bfc519 to 9b0530b Compare October 28, 2024 22:26
Contributor

@timothyb89 timothyb89 left a comment

Just spotted one typo. On the whole this new logic seems much easier to follow!

Comment on lines +154 to +155
if active {
if currentlyActive {
Contributor

An explicit truth-table-style switch might make this a bit clearer and reduce some of the indentation.

Contributor Author

Attempted. The reduced indentation was nice, but I found it overall a bit less readable, so leaving as-is for now.
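For readers unfamiliar with the suggestion, a truth-table-style switch over the two booleans could look like the sketch below. The function name and the returned labels are hypothetical; the thread does not show what each branch of `updateCertAuthoritiesCondActs` does, so the labels only mark the four combinations.

```go
package main

import "fmt"

// describeTransition returns a hypothetical label for each combination of the
// desired state (active) and the stored state (currentlyActive).
func describeTransition(active, currentlyActive bool) string {
	switch {
	case active && currentlyActive:
		return "update active in place"
	case active && !currentlyActive:
		return "activate"
	case !active && currentlyActive:
		return "deactivate"
	default: // !active && !currentlyActive
		return "update inactive in place"
	}
}

func main() {
	for _, active := range []bool{true, false} {
		for _, currentlyActive := range []bool{true, false} {
			fmt.Printf("active=%t currentlyActive=%t -> %s\n",
				active, currentlyActive, describeTransition(active, currentlyActive))
		}
	}
}
```

The flat switch lists all four cases at one indentation level, at the cost of repeating both conditions in every case, which is the readability trade-off discussed here.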

return condacts, nil
}

func updateCertAuthoritiesCondActs(cas []types.CertAuthority, active bool, currentlyActive bool) ([]backend.ConditionalAction, error) {
Contributor

Are all of the possible conditional actions taken here covered in tests?

Contributor Author

Added a separate dedicated test that covers each case with and without a conflict.

@fspmarshall fspmarshall force-pushed the fspmarshall/trust-atomics branch from 65f7014 to f58c589 Compare October 29, 2024 17:09
@fspmarshall fspmarshall added this pull request to the merge queue Oct 29, 2024
Merged via the queue into master with commit a54c311 Oct 29, 2024
40 checks passed
@fspmarshall fspmarshall deleted the fspmarshall/trust-atomics branch October 29, 2024 17:53
@public-teleport-github-review-bot

@fspmarshall See the table below for backport results.

Branch Result
branch/v15 Failed
branch/v16 Failed
branch/v17 Create PR

Comment on lines +67 to +68
if existingCluster == nil {
return a.createTrustedCluster(ctx, tc)
Contributor

Concurrent calls to UpsertTrustedCluster can result in an AlreadyExists error, which at the very least feels a bit wrong.

Are we already retrying every operation in the terraform provider no matter what the error is? Will we remember to do so for UpsertTrustedCluster if and when that comes to be automatable?

Contributor Author

So I tried adding retry to this logic, but ended up discarding it. Once you start to try to distinguish between AlreadyExists due to concurrent tc creation vs AlreadyExists due to conflicting CAs and handling CompareFailed from failed updates, it all starts to get a bit messy. My (potentially weak) justification for leaving as-is is that anything currently updating these in a racy fashion is already broken because it is likely to cause conflicting state (e.g. role map on CA disagreeing with role map in TC). An unexpected error seems like a strictly better outcome. I can take another stab at it if you feel strongly about it, but right now I'm leaning toward just accepting that this method might yield unexpected errors during concurrent use.

Contributor

If you find yourself around the code maybe we should replace the AlreadyExists with a CompareFailed here in UpsertTrustedCluster? Otherwise it's not even worth the mental effort in opening a PR, I agree.

var err error
-existingCluster, err = a.GetTrustedCluster(ctx, trustedCluster.GetName())
+existingCluster, err = a.GetTrustedCluster(ctx, tc.GetName())
Contributor

Not wrong in the current implementation of (*Server).GetTrustedCluster but the typical error contract is that the return value must not be inspected in any way if the err is not nil - GetTrustedCluster might be accidentally updated to return a concrete type, which would make existingCluster always non-nil (but invalid) even in case of an error.
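The pitfall described here is Go's typed-nil behavior: a nil pointer of a concrete type stored in an interface variable yields a non-nil interface. The following self-contained sketch uses hypothetical types (`cluster`, `clusterV2`, `getConcrete`) rather than Teleport's real ones to show why the value should only be used after `err` has been checked.

```go
package main

import (
	"errors"
	"fmt"
)

// cluster is a hypothetical interface standing in for a resource type.
type cluster interface{ GetName() string }

type clusterV2 struct{ name string }

func (c *clusterV2) GetName() string { return c.name }

// getConcrete returns a concrete pointer type, as a future refactor might.
// It always fails here to simulate a lookup error.
func getConcrete(name string) (*clusterV2, error) {
	return nil, errors.New("not found")
}

// lookupUnsafe assigns the result straight into an interface variable, so a
// nil *clusterV2 becomes a non-nil interface value.
func lookupUnsafe(name string) (cluster, error) {
	var existing cluster
	var err error
	existing, err = getConcrete(name)
	return existing, err
}

// lookupSafe touches the value only after confirming err is nil.
func lookupSafe(name string) (cluster, error) {
	c, err := getConcrete(name)
	if err != nil {
		return nil, err
	}
	return c, nil
}

func main() {
	unsafeVal, _ := lookupUnsafe("leaf")
	safeVal, _ := lookupSafe("leaf")
	// The unsafe variant is non-nil even though the lookup failed.
	fmt.Println(unsafeVal == nil, safeVal == nil)
}
```

In the unsafe variant, a later `existing == nil` check would take the wrong branch after a failed lookup, which is why the error contract forbids inspecting the value when `err` is non-nil.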

Contributor Author

Fixed in #48176

4 participants