HA sync issue #148

Open
kfox1111 opened this issue Dec 13, 2023 · 4 comments

Comments

@kfox1111
Collaborator

kfox1111 commented Dec 13, 2023

There was a conversation on Slack about multiple instances of the SPIRE server each creating their own CA when in HA mode. The usual practice is to wait a certain amount of time for a new instance's CA to sync to the agents before adding that instance to the LoadBalancer. We currently do not do this. We either need to make the server's initialDelaySeconds a larger number, like 60+ seconds, or make a dynamic readiness probe that waits only on new instances. If this is not done, agents may get valid certs that other agents don't trust for a while.

@kfox1111
Collaborator Author

Probably need a sidecar with just busybox in it. It does nothing but sleep. Then add a readiness probe on it that checks the age of the files in the PVC. If they are less than a few minutes old, report unready for a minute.
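A minimal sketch of the file-age check described above, written in Python for illustration only (a busybox sidecar would use shell instead). The function names, the 120-second default, and the choice to use the oldest file's modification time as a proxy for "how long has this instance existed" are all assumptions, not something decided in this thread.

```python
import os
import time


def oldest_mtime(directory):
    """Return the oldest modification time of any file under directory, or None if empty."""
    oldest = None
    for root, _dirs, files in os.walk(directory):
        for name in files:
            mtime = os.path.getmtime(os.path.join(root, name))
            if oldest is None or mtime < oldest:
                oldest = mtime
    return oldest


def is_ready(directory, min_age_seconds=120, now=None):
    """Report ready only once the oldest file is at least min_age_seconds old.

    Assumption: the oldest file in the data directory was written when the
    instance first started, so its age approximates the age of the instance's CA.
    """
    now = time.time() if now is None else now
    oldest = oldest_mtime(directory)
    if oldest is None:
        # No data yet: treat this as a brand-new instance and stay unready.
        return False
    return (now - oldest) >= min_age_seconds
```

A probe wrapper would call `is_ready` on the mounted data directory and exit non-zero while it returns `False`. Using the oldest file rather than the newest is meant to keep later writes from flipping an established instance back to unready, but a file that is replaced in place would still reset the clock.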

@kfox1111
Collaborator Author

kfox1111 commented Dec 13, 2023

Can skip the sidecar if an upstreamAuthority is specified, or if the sqlite backend is used (replicas must always be 1).

@edwbuck
Collaborator

edwbuck commented Jun 18, 2024

I'm not sure we need to fix this in the charts, because this is how it is designed to work without Helm charts.

Yes, there is a period of time where certs are not in sync, because it is a distributed system. During that period, agents typically get valid certs, because both CAs are valid (their validity overlaps). With that in mind, there are also scenarios where a CA is purposefully expired, but that involves revocation lists (or waiting for the TTL of the cert to expire).

In both cases, how the CA is sourced and used is a function of the UpstreamAuthority plugin, and SPIRE is designed to not source and sync these synchronously across all Server instances.

Is the proposal to fix this asynchronous behavior by adding additional logic in SPIRE to make the asynchronous behavior synchronous?

@kfox1111
Collaborator Author

It's a problem when not using an UpstreamAuthority plugin. Non-containerized SPIRE has the same issue, but hits it less often because the orchestration layers in non-Kubernetes environments are not as good. Humans take long enough to set things up that it's less likely to be hit. K8s has enough automation that it does hit it, and some folks will want to autoscale their SPIRE servers and would definitely hit this issue in that situation, I believe.

The solution would be to make sure the pod doesn't go ready until the CA is added to the bundle in k8s.
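As a rough sketch of that readiness condition: given the new server's CA certificate and the currently published bundle, both as PEM text, the pod would report ready only once the CA appears in the bundle. How the two PEM strings are obtained is not covered in this thread and is left out here; the function names are hypothetical, and comparing SHA-256 fingerprints of the DER bytes is just one plausible way to do the match.

```python
import base64
import hashlib
import re

# Matches each certificate block in a PEM document and captures its base64 body.
_PEM_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----(.*?)-----END CERTIFICATE-----", re.DOTALL
)


def cert_fingerprints(pem_text):
    """Return the set of SHA-256 fingerprints of every certificate in a PEM string."""
    fingerprints = set()
    for body in _PEM_BLOCK.findall(pem_text):
        # Strip line breaks and other whitespace before decoding to DER bytes.
        der = base64.b64decode("".join(body.split()))
        fingerprints.add(hashlib.sha256(der).hexdigest())
    return fingerprints


def ca_in_bundle(server_ca_pem, bundle_pem):
    """True when every certificate in server_ca_pem also appears in bundle_pem."""
    wanted = cert_fingerprints(server_ca_pem)
    # An empty or unparsable server CA never counts as "present".
    return bool(wanted) and wanted <= cert_fingerprints(bundle_pem)
```

Unlike a fixed delay, a check along these lines would only hold back instances whose CA has not propagated yet, so restarts of an established server would go ready immediately.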


2 participants