HA sync issue #148

kfox1111 · 2023-12-13T14:06:50Z

There was a conversation on Slack about multiple instances of the SPIRE server each making their own CAs in HA mode. The usual practice is to wait a certain amount of time for a new instance's CA to sync to the agents before adding that instance to the LoadBalancer. We currently do not do this. We either need to raise the server's initialDelaySeconds to a larger number, like 60+ seconds, or add a dynamic readiness probe that waits only on new instances. If this is not done, agents may get valid certs that other agents don't trust for a while.

kfox1111 · 2023-12-13T14:41:18Z

Probably need a sidecar with just busybox in it. It does nothing but sleep. Then add a readiness probe to it that checks the age of the files in the PVC: if they are less than a few minutes old, report unready for a minute.
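A minimal sketch of that file-age check, written in Python for readability rather than as the busybox script the sidecar would actually run. It assumes "age" means file modification time, that an empty directory should count as unready, and that a 180-second threshold stands in for "a few minutes"; the function names are hypothetical.

```python
import os
import time


def newest_mtime(directory):
    """Return the latest modification time of any file under directory, or None."""
    newest = None
    for root, _dirs, files in os.walk(directory):
        for name in files:
            mtime = os.path.getmtime(os.path.join(root, name))
            if newest is None or mtime > newest:
                newest = mtime
    return newest


def is_ready(directory, min_age_seconds=180, now=None):
    """Report ready only when the newest file is at least min_age_seconds old."""
    now = time.time() if now is None else now
    newest = newest_mtime(directory)
    if newest is None:
        # Assumption: no files yet means the server has not written its data.
        return False
    return (now - newest) >= min_age_seconds
```

An exec-style readiness probe would call `is_ready` on the PVC mount path and exit non-zero while it returns `False`. Files that the server rewrites continuously would keep resetting the clock, so the check would need to look only at files written once at startup.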

kfox1111 · 2023-12-13T14:41:38Z

Can skip the sidecar if there is an upstreamAuthority specified, or if the sqlite backend is used (replicas must always be 1).

edwbuck · 2024-06-18T18:01:45Z

I'm not sure we need to fix this in the charts, because this is how it is designed to work without Helm charts.

Yes, there is a period of time where certs are not in sync, because it is a distributed system. During that period of time, agents typically get valid certs, because both CAs are valid (their validity overlaps). With that in mind, there are also scenarios where a CA is purposefully expired, but that involves revocation lists (or waiting for the TTL of the cert to expire).

In both cases, how the CA is sourced and used is a function of the UpstreamAuthority plugin, and SPIRE is designed to not source and sync these synchronously across all Server instances.

Is the proposal to fix this asynchronous behavior by adding additional logic in SPIRE to make the asynchronous behavior synchronous?

kfox1111 · 2024-06-18T19:34:12Z

It's a problem when not using an UpstreamAuthority plugin. Non-containerized SPIRE has the same issue, but hits it less often because the orchestration layers in non-Kubernetes environments are not as good. Humans take long enough to set things up that it's less likely to be hit. K8s has enough automation that it does hit it, and some folks will want to autoscale their SPIRE servers and would definitely hit this issue in that situation, I believe.

The solution would be to make sure the pod doesn't go ready until the CA is added to the bundle in k8s.
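A rough sketch of the comparison such a readiness check might make, assuming the new server's CA certificate and the published bundle are both available as PEM text. How each is fetched is left out, and the function names are hypothetical rather than anything SPIRE or the chart provides.

```python
import base64
import re

# Matches the base64 body of each PEM certificate block.
PEM_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----(.*?)-----END CERTIFICATE-----", re.DOTALL
)


def pem_to_der_set(pem_text):
    """Decode every CERTIFICATE block in pem_text into its DER bytes."""
    ders = set()
    for body in PEM_BLOCK.findall(pem_text):
        # Strip whitespace and newlines before base64-decoding the block body.
        ders.add(base64.b64decode("".join(body.split())))
    return ders


def ca_in_bundle(ca_pem, bundle_pem):
    """True when every certificate in ca_pem also appears in bundle_pem."""
    ca = pem_to_der_set(ca_pem)
    # An empty ca_pem is treated as "not ready" rather than trivially present.
    return bool(ca) and ca <= pem_to_der_set(bundle_pem)
```

The pod would stay unready while `ca_in_bundle` returns `False`. Comparing decoded bytes instead of raw text avoids false negatives from differences in line wrapping between the two PEM sources.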

edwbuck added the Proc: needs triage label Jun 18, 2024
