
Retry should be enabled by default #193

Open
bobh66 opened this issue May 25, 2023 · 3 comments
Labels
bug Something isn't working

Comments

@bobh66
Contributor

bobh66 commented May 25, 2023

What happened?

One of Crossplane's best features is that it just keeps retrying when things fail, ever optimistic that something will change in the future to allow things to succeed "next time".

provider-helm is an exception: by default it does NOT retry, and there is no way to make it retry forever. rollbackLimit defaults to nil, which disables retries, and it cannot be set to mean "forever" (for example via 0 or -1). The only option is to set it to a "large" number, which will eventually be exceeded.
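For reference, here is a minimal sketch of a Release with rollbackLimit set. The field placement follows my reading of provider-helm's Release spec, and the chart details are placeholders:

```yaml
apiVersion: helm.crossplane.io/v1beta1
kind: Release
metadata:
  name: example-release
spec:
  # Retry a failed install/upgrade up to 3 times.
  # When unset (nil), retries are disabled entirely.
  rollbackLimit: 3
  forProvider:
    chart:
      name: example-chart                      # placeholder
      repository: https://example.com/charts  # placeholder
      version: 1.0.0                           # placeholder
    namespace: default
```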

Ideally the Release would behave like all other Managed Resources and continuously retry until it is successful.

To be consistent with other providers and with Crossplane itself, rollbackLimit should default to a value that means infinite retries; alternatively, nil should indicate infinite retries and 0 should disable them.

rollbackLimit is also not an obvious attribute name for an option that controls retries. A better solution might be to add a retryEnabled attribute that turns retries on and off, and keep rollbackLimit to specify the number of retries, where nil means no limit.

I'd be glad to push a PR if any of these suggestions are agreeable.

How can we reproduce it?

Deploy a Release whose Helm install fails and observe that provider-helm does not retry it.

@bobh66 bobh66 added the bug Something isn't working label May 25, 2023
@turkenh
Collaborator

turkenh commented May 26, 2023

FWIW, provider-helm does retry for errors other than a failed release. For example, it will retry forever if it hits an error while pulling the chart or communicating with the Kubernetes API.

I would be supportive of implementing a negative value that means retry forever, but I'm hesitant to change the default behavior for everyone.

Also related: https://fluxcd.io/flux/components/helm/helmreleases/#configuring-failure-remediation

@bobh66
Contributor Author

bobh66 commented Oct 4, 2024

I encountered this problem again today, so I'd really like to find a way to make it retry by default. How about a CLI parameter on the provider, such as --always-retry, that makes an unset rollbackLimit equivalent to infinite retries? I agree it's not ideal to change an attribute's behavior via an optional parameter, but the default behavior should match every other provider and what Kubernetes does by default: keep trying.
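To illustrate how such a flag could be wired up, here is a sketch using Crossplane's DeploymentRuntimeConfig to pass provider arguments. Note that --always-retry is the hypothetical flag proposed above; it does not exist in provider-helm today:

```yaml
# Sketch only: --always-retry is a proposed flag, not an existing one.
apiVersion: pkg.crossplane.io/v1beta1
kind: DeploymentRuntimeConfig
metadata:
  name: provider-helm-always-retry
spec:
  deploymentTemplate:
    spec:
      selector: {}
      template:
        spec:
          containers:
            - name: package-runtime
              args:
                # Proposed: treat an unset rollbackLimit as infinite retries.
                - --always-retry
```

The DeploymentRuntimeConfig would then be referenced from the Provider's runtimeConfigRef so the flag reaches the provider pod.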

@kingdonb

I think there's a mismatch of expectations. Flux has HelmRelease, and it is also not retry-by-default for releases that have failed. Since GA you can run flux reconcile helmrelease my-app --force to trigger a forced upgrade, but Helm is quite expensive to run in a reconcile loop, and there would be no more reliable way to topple a cluster than running a bunch of helm install / helm upgrade retries in perpetuity.

I found this issue while looking for a way to reconcile a release with --force using the Crossplane CLI. I still haven't found one, but I did find that helm uninstall triggers a retry within 10 minutes. It would be good to have a more direct way that doesn't disrupt the release. In my case the release had failed and its services were already offline, so nothing was disrupted by the uninstall.

But I think there could be other cases where you want the release to retry from a failed state without uninstalling. Helm has a way to retry a failed upgrade, but a failed install needs to be uninstalled before it can be tried again.

Even knowing all of that, I was still surprised to see Crossplane didn't retry on failure; that's definitely what GitOps users expect. But following the Flux model it should probably not retry indefinitely because, again, Helm is expensive, and retrying forever could easily swamp the node that helm-controller runs on. With many failing Helm releases it could even deadlock the control loop, which has only so many goroutines for concurrent Helm releases before they start to queue up behind one another.

(I don't know whether that's exactly how the Crossplane Helm provider works; I assume it works like Flux's helm-controller, but with the Helm CLI instead of the Helm SDK underneath.)
