
Retry should be enabled by default #193

Open
bobh66 opened this issue May 25, 2023 · 3 comments
Labels
bug Something isn't working

Comments

@bobh66
Contributor

bobh66 commented May 25, 2023

What happened?

One of Crossplane's best features is that it just keeps retrying when things fail, ever optimistic that something will change in the future to allow things to succeed "next time".

provider-helm is an exception: by default it does NOT retry, and there is no way to make it retry forever. rollbackLimit defaults to nil, which disables retries, and it cannot be set to mean "forever" (for example via 0 or -1). The only option is to set it to a "large" number, which will eventually be exceeded.
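For reference, here is a minimal sketch of a Release with rollbackLimit set. The field placement follows my reading of provider-helm's Release spec, and the chart details are placeholders:

```yaml
apiVersion: helm.crossplane.io/v1beta1
kind: Release
metadata:
  name: example-release
spec:
  # Retry a failed install/upgrade up to 3 times.
  # When unset (nil), retries are disabled entirely.
  rollbackLimit: 3
  forProvider:
    chart:
      name: example-chart                      # placeholder
      repository: https://example.com/charts  # placeholder
      version: 1.0.0                           # placeholder
    namespace: default
```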

Ideally the Release would behave like all other Managed Resources and continuously retry until it is successful.

To be consistent with other providers and with Crossplane itself, rollbackLimit should default to a value that means infinite retries; alternatively, nil should indicate infinite retries and 0 should disable them.

rollbackLimit is also not an obvious attribute name for an option that controls retries. A better solution might be to add a retryEnabled attribute that turns retries on and off, and keep rollbackLimit to specify the number of retries, where nil means no limit.

I'd be glad to push a PR if any of these suggestions are agreeable.

How can we reproduce it?

Deploy a Release whose Helm install fails and observe that provider-helm does not retry it.

@bobh66 bobh66 added the bug Something isn't working label May 25, 2023
@turkenh
Collaborator

turkenh commented May 26, 2023

FWIW, provider-helm does retry for errors other than a failed release. For example, it will retry forever if it hits an error while pulling the chart or communicating with the Kubernetes API.

I would be supportive of implementing a negative value that means retry forever, but I'm hesitant to change the default behavior for everyone.

Also related: https://fluxcd.io/flux/components/helm/helmreleases/#configuring-failure-remediation

@bobh66
Contributor Author

bobh66 commented Oct 4, 2024

I encountered this problem again today, so I'd really like to find a way to make it retry by default. How about a CLI parameter on the provider, such as --always-retry, that makes an unset rollbackLimit equivalent to infinite retries? I agree it's not ideal to change an attribute's behavior via an optional parameter, but the default behavior should match every other provider and what Kubernetes does by default: keep trying.
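To illustrate how such a flag could be wired up, here is a sketch using Crossplane's DeploymentRuntimeConfig to pass provider arguments. Note that --always-retry is the hypothetical flag proposed above; it does not exist in provider-helm today:

```yaml
# Sketch only: --always-retry is a proposed flag, not an existing one.
apiVersion: pkg.crossplane.io/v1beta1
kind: DeploymentRuntimeConfig
metadata:
  name: provider-helm-always-retry
spec:
  deploymentTemplate:
    spec:
      selector: {}
      template:
        spec:
          containers:
            - name: package-runtime
              args:
                # Proposed: treat an unset rollbackLimit as infinite retries.
                - --always-retry
```

The DeploymentRuntimeConfig would then be referenced from the Provider's runtimeConfigRef so the flag reaches the provider pod.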

@kingdonb

I think there's a mismatch of expectations. Flux has HelmRelease, and it is also not retry-by-default for releases that have failed. Since GA you can run flux reconcile helmrelease my-app --force to trigger a forced upgrade, but Helm is quite expensive to run in a reconcile loop, and there would be no more reliable way to topple a cluster than running a bunch of helm install / helm upgrade retries in perpetuity.

I found this issue while looking for a way to reconcile a release with --force using the Crossplane CLI. I still haven't found one, but I did find that helm uninstall triggers a retry within 10 minutes. It would be good to have a more direct way that doesn't disrupt the release. In my case the release had failed and its services were already offline, so nothing was disrupted by the uninstall.

But I think there could be other cases where you want the release to retry from a failed state without uninstalling. Helm has a way to retry a failed upgrade, but a failed install needs to be uninstalled before it can be tried again.

Even knowing all of that, I was still surprised to see Crossplane didn't retry on failure; that's definitely what GitOps users expect. But following the Flux model it should probably not retry indefinitely because, again, Helm is expensive, and retrying forever could easily swamp the node that helm-controller runs on. With many failing Helm releases it could even deadlock the control loop, which has only so many goroutines for concurrent Helm releases before they start to queue up behind one another.

(I don't know whether that's exactly how the Crossplane Helm provider works; I assume it works like Flux's helm-controller, but with the Helm CLI instead of the Helm SDK underneath.)
