
Helm Operator: charts using the Helm template function "lookup" always fail to sync the release. #5728

Closed
mikeshng opened this issue May 4, 2022 · 12 comments · Fixed by #6691

@mikeshng (Contributor) commented May 4, 2022

Bug Report

What did you do?

Using operator-sdk, I created a Helm operator for a Helm chart that uses the lookup Helm template function. The Helm operator is unable to sync the release and always fails with the error message:

{"level":"error","ts":1651678030.818603,"logger":"helm.controller","msg":"Failed to sync release","namespace":"default","name":"etcdbackup-sample","apiVersion":"charts.my.domain/v1alpha1","kind":"EtcdBackup","release":"etcdbackup-sample","error":"failed to get candidate release: template: etcd-backup/templates/deployment.yaml:34:48: executing \"etcd-backup/templates/deployment.yaml\" at <\"default\">: nil pointer evaluating interface {}.name","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}

This is likely because the Helm operator's sync release performs a Helm upgrade dry run, and there is an ongoing Helm issue with dry-run and lookup. See https://github.com/helm/helm/issues/8137

In the meantime, is there a workaround the Helm operator can apply so that charts using the lookup function are unblocked?

What did you expect to see?

The Helm operator is able to sync the release.

What did you see instead? Under which circumstances?

I see the error above continuously.

Environment

Operator type:
/language helm

Kubernetes cluster type:
kind cluster

$ kind version
kind v0.11.1 go1.16.4 linux/amd64

$ operator-sdk version

operator-sdk version: "v1.20.0", commit: "deb3531ae20a5805b7ee30b71f13792b80bd49b1", kubernetes version: "1.23", go version: "go1.17.9", GOOS: "linux", GOARCH: "amd64"

$ kubectl version

Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.4", GitCommit:"e87da0bd6e03ec3fea7933c4b5263d151aafd07c", GitTreeState:"clean", BuildDate:"2021-02-18T16:12:00Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-21T23:01:33Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}

Possible Solution

Additional context

When I ran the operator-sdk create api command, I already saw a warning message that is the same as the sync release failure message. Here is the full list of commands I ran to reproduce. The chart I used is here.

$ operator-sdk init --plugins helm --helm-chart etcd-backup
Writing kustomize manifests for you to edit...
Creating the API:
$ operator-sdk create api --helm-chart etcd-backup
Writing kustomize manifests for you to edit...
Created helm-charts/etcd-backup
Generating RBAC rules
WARN[0000] Using default RBAC rules: failed to generate RBAC rules: failed to get default manifest: failed to render chart templates: template: etcd-backup/templates/deployment.yaml:34:48: executing "etcd-backup/templates/deployment.yaml" at <"default">: nil pointer evaluating interface {}.name 
$ make install
$ kubectl apply -f config/samples/charts_v1alpha1_etcdbackup.yaml
$ make run
/usr/local/bin/helm-operator run
{"level":"info","ts":1651678028.5615184,"logger":"cmd","msg":"Version","Go Version":"go1.17.9","GOOS":"linux","GOARCH":"amd64","helm-operator":"v1.20.0","commit":"deb3531ae20a5805b7ee30b71f13792b80bd49b1"}
{"level":"info","ts":1651678028.5725975,"logger":"cmd","msg":"Watch namespaces not configured by environment variable WATCH_NAMESPACE or file. Watching all namespaces.","Namespace":""}
{"level":"info","ts":1651678029.1749399,"logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":1651678029.1755068,"logger":"helm.controller","msg":"Watching resource","apiVersion":"charts.my.domain/v1alpha1","kind":"EtcdBackup","namespace":"","reconcilePeriod":"1m0s"}
{"level":"info","ts":1651678029.175779,"msg":"Starting server","path":"/metrics","kind":"metrics","addr":"[::]:8080"}
{"level":"info","ts":1651678029.1758053,"msg":"Starting server","kind":"health probe","addr":"[::]:8081"}
{"level":"info","ts":1651678029.1759253,"logger":"controller.etcdbackup-controller","msg":"Starting EventSource","source":"kind source: *unstructured.Unstructured"}
{"level":"info","ts":1651678029.175962,"logger":"controller.etcdbackup-controller","msg":"Starting Controller"}
{"level":"info","ts":1651678029.2769535,"logger":"controller.etcdbackup-controller","msg":"Starting workers","worker count":8}
{"level":"info","ts":1651678030.1477642,"logger":"controller.etcdbackup-controller","msg":"Starting EventSource","source":"kind source: *unstructured.Unstructured"}
{"level":"info","ts":1651678030.1478019,"logger":"helm.controller","msg":"Watching dependent resource","ownerApiVersion":"charts.my.domain/v1alpha1","ownerKind":"EtcdBackup","apiVersion":"apps/v1","kind":"Deployment"}
{"level":"info","ts":1651678030.1478174,"logger":"helm.controller","msg":"Installed release","namespace":"default","name":"etcdbackup-sample","apiVersion":"charts.my.domain/v1alpha1","kind":"EtcdBackup","release":"etcdbackup-sample"}
{"level":"error","ts":1651678030.818603,"logger":"helm.controller","msg":"Failed to sync release","namespace":"default","name":"etcdbackup-sample","apiVersion":"charts.my.domain/v1alpha1","kind":"EtcdBackup","release":"etcdbackup-sample","error":"failed to get candidate release: template: etcd-backup/templates/deployment.yaml:34:48: executing \"etcd-backup/templates/deployment.yaml\" at <\"default\">: nil pointer evaluating interface {}.name","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}
{"level":"error","ts":1651678031.1916735,"logger":"helm.controller","msg":"Failed to update status after sync release failure","namespace":"default","name":"etcdbackup-sample","apiVersion":"charts.my.domain/v1alpha1","kind":"EtcdBackup","release":"etcdbackup-sample","error":"Operation cannot be fulfilled on etcdbackups.charts.my.domain \"etcdbackup-sample\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}
{"level":"error","ts":1651678031.1918292,"logger":"controller.etcdbackup-controller","msg":"Reconciler error","name":"etcdbackup-sample","namespace":"default","error":"failed to get candidate release: template: etcd-backup/templates/deployment.yaml:34:48: executing \"etcd-backup/templates/deployment.yaml\" at <\"default\">: nil pointer evaluating interface {}.name","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}
{"level":"error","ts":1651678031.879193,"logger":"helm.controller","msg":"Failed to sync release","namespace":"default","name":"etcdbackup-sample","apiVersion":"charts.my.domain/v1alpha1","kind":"EtcdBackup","release":"etcdbackup-sample","error":"failed to get candidate release: template: etcd-backup/templates/deployment.yaml:34:48: executing \"etcd-backup/templates/deployment.yaml\" at <\"default\">: nil pointer evaluating interface {}.name","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}
...
@everettraven (Contributor) commented:

Interesting. I will start looking into this and see if there is something reasonable that we can implement as a workaround.

Thanks for bringing this issue to our attention!

/assign

@everettraven (Contributor) commented:

Hi @mikeshng,

Here is what I have found when digging into this issue a bit more:

Reading the error message, it seems this is where it actually occurs:

candidateRelease, err := m.getCandidateRelease(m.namespace, m.releaseName, m.chart, m.values)
if err != nil {
	return fmt.Errorf("failed to get candidate release: %w", err)
}

Diving into the getCandidateRelease function we see:

func (m manager) getCandidateRelease(namespace, name string, chart *cpb.Chart,
	values map[string]interface{}) (*rpb.Release, error) {
	upgrade := action.NewUpgrade(m.actionConfig)
	upgrade.Namespace = namespace
	upgrade.DryRun = true
	return upgrade.Run(name, chart, values)
}

In this getCandidateRelease function we are pretty tightly coupled with Helm's functionality. Taking a look at the issue you provided, I noticed that there seemed to be a workaround that involved installing the Helm release and then immediately uninstalling it. I am hesitant to do something like that, as it would significantly change the underlying functionality and, I think, could expose a lot of potential for related issues (i.e. what happens if the install works but the uninstall does not?).
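For context on why the dry run trips over lookup: a client-side dry run renders the chart without a Kubernetes client, so lookup has nothing to query and returns nil. Below is a minimal sketch of that distinction using Helm's public pkg/engine API; it is illustrative only, not the operator's or Helm's actual code, and assumes a Helm version without server-side dry-run support:

// Illustrative sketch: whether "lookup" can reach the cluster depends on how
// the chart is rendered. A client-side dry run renders with engine.Render
// (no REST config), so lookup returns an empty object and expressions like
// .metadata.name fail with a nil pointer error.
package chartrender

import (
	"helm.sh/helm/v3/pkg/chart"
	"helm.sh/helm/v3/pkg/chartutil"
	"helm.sh/helm/v3/pkg/engine"
	"k8s.io/client-go/rest"
)

func render(ch *chart.Chart, vals chartutil.Values, cfg *rest.Config, dryRun bool) (map[string]string, error) {
	if dryRun || cfg == nil {
		// No Kubernetes client: lookup always comes back empty.
		return engine.Render(ch, vals)
	}
	// With a REST config, lookup can query live cluster objects.
	return engine.RenderWithClient(ch, vals, cfg)
}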

I think the simplest fix in this case would be for the user to implement a workaround in their templates. In the issue you referenced, there is a suggested workaround of wrapping the lookup function in an if statement. In your template:

image: {{ (lookup "v1" "Namespace" "" "default").metadata.name }}

would become something like:

{{ if (lookup "v1" "Namespace" "" "default") }}
image: {{ (lookup "v1" "Namespace" "" "default").metadata.name }}
{{ else }}
image: "default"
{{ end }}

I tested these changes with helm template ./etcd-backup --validate and helm install ./etcd-backup --dry-run --generate-name and no errors occurred.

I also tested these changes by running the same commands that you provided in the additional context section and the operator ran as expected.

I also don't see this mentioned anywhere in the docs, so it would probably be a good idea for us to make a docs change stating that the following commands should run without errors before a Helm chart is used in an operator:

$ helm template ./chart --validate
$ helm install ./chart --dry-run --generate-name

All in all, I don't think there are any reasonable changes we could make on our end to accommodate this, and I think it should be left up to the user to ensure that their Helm chart works properly when used in a dry-run setting.

I am new to the Operator SDK team and community, so I would like to leave this issue open to be discussed in our next community meeting on May 9th and see if anyone else with a bit more experience in the Helm operator realm has any insights into this.

Thanks again for raising the issue!

@mikeshng (Contributor, Author) commented May 6, 2022

> I noticed that there seemed to be a workaround that involved installing the Helm release and then immediately uninstalling it. I am hesitant to do something like that, as it would significantly change the underlying functionality and, I think, could expose a lot of potential for related issues (i.e. what happens if the install works but the uninstall does not?).

I agree. Installing then uninstalling doesn't seem like a good idea.

> I think the simplest fix in this case would be for the user to implement a workaround in their templates.

Thank you! This is exactly what I needed.

> I also don't see this mentioned anywhere in the docs, so it would probably be a good idea for us to make a docs change stating that the following commands should run without errors before a Helm chart is used in an operator:
>
> $ helm template ./chart --validate
> $ helm install ./chart --dry-run --generate-name

I feel like if those are the necessary validation steps then operator-sdk should probably be able to perform them without having to ask the user to run those manual steps.

> All in all, I don't think there are any reasonable changes we could make on our end to accommodate this, and I think it should be left up to the user to ensure that their Helm chart works properly when used in a dry-run setting.

I assume most end users are not familiar with the inner workings of operator-sdk. They might not know that failing dry-run Helm operations will cause the Helm operator to not function properly. For example, the charts I have install, upgrade, and uninstall fine with helm when no dry run is involved, so most users will probably assume such a chart should be fine for the Helm operator.

> I am new to the Operator SDK team and community, so I would like to leave this issue open to be discussed in our next community meeting on May 9th and see if anyone else with a bit more experience in the Helm operator realm has any insights into this.
>
> Thanks again for raising the issue!

Thank you for your triage. Your suggestions are great! Much appreciated.

@rashmigottipati added the needs discussion and help wanted labels May 9, 2022
@everettraven (Contributor) commented:

During the community meeting it was decided to leave this open to see if any solutions to the dry-run issue could be formulated.

As per my last comment, we seem to be pretty coupled to the functionality defined by Helm for checking if an upgrade would be valid via the dry run option. The most prevalent workaround from the referenced helm issue that Operator SDK could try implementing is to attempt to install and immediately uninstall the release. I don't feel like this would be the best way to go about this due to the potential domino effect that could occur if this process doesn't go smoothly.

I think the simplest fix in this case would be for the user to implement a workaround in their templates. In the referenced issue there is a workaround of wrapping functions that rely on cluster access in an if statement that nil-checks the object returned by the function.
Something like:

image: {{ (lookup "v1" "Namespace" "" "default").metadata.name }}

would become something like:

{{ if (lookup "v1" "Namespace" "" "default") }}
image: {{ (lookup "v1" "Namespace" "" "default").metadata.name }}
{{ else }}
image: "default"
{{ end }}

I think having end users make sure that the following Helm commands run successfully against their Helm charts would prevent this issue from occurring in their Helm operators, since these commands exercise the same dry-run environment that the Helm operator uses:

$ helm template ./chart --validate
$ helm install ./chart --dry-run --generate-name

@varshaprasad96 @fabianvf @joelanford what are your thoughts?

@joelanford (Member) commented:

I'd just echo what's been said:

  1. doing an actual upgrade to see if we need to upgrade pretty much defeats the purpose.
  2. workaround for now is basically "don't use lookup". If you use lookup, but then handle it returning nil, it seems like you'll end up getting dry-runs that return a diff from the existing release, which will trigger a real upgrade (which will use the lookup), and then the cycle will continue.
  3. docs that mention this shortcoming would be good
  4. maybe we leave this issue open and move the discussion of this use case to the helm issue? based on technosophos's responses in that issue, it seems like the original intention of --dry-run was to not talk to the server, so if that upstream issue isn't going anywhere, perhaps we just need to document that lookup is not supported in helm-operator.

@varshaprasad96 (Member) commented:

We probably need docs for the workaround. @everettraven, can you create a follow-up issue to track that? Meanwhile, since we have a solution here, I'm closing this issue.

@holyspectral (Contributor) commented:

Helm 3.13+ has a new flag --dry-run=server, which can be used to support lookup. https://github.com/helm/helm/releases/tag/v3.13.0
ArgoCD already plans to support it: argoproj/argo-cd#5202
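
If the operator adopts this, a rough sketch of what getCandidateRelease could look like is below. This is only an illustration: it assumes the vendored Helm Go SDK (3.13+) exposes the server-side dry-run mode on the upgrade action, and the exact field name should be verified against that Helm version.

func (m manager) getCandidateRelease(namespace, name string, chart *cpb.Chart,
	values map[string]interface{}) (*rpb.Release, error) {
	upgrade := action.NewUpgrade(m.actionConfig)
	upgrade.Namespace = namespace
	upgrade.DryRun = true
	// Assumption: Helm 3.13+ lets the upgrade action request a server-side
	// dry run, so templates are rendered with cluster access and lookup works.
	// The field name below is illustrative; check the Helm version in use.
	upgrade.DryRunOption = "server"
	return upgrade.Run(name, chart, values)
}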

@joelanford (Member) commented:

Reopened so we can look into the new --dry-run=server option.

@varshaprasad96 (Member) commented:

@holyspectral would you be open to submitting a PR to fix this? Thanks!

@holyspectral (Contributor) commented:

/assign

@oceanc80 (Collaborator) commented:

@holyspectral I was hoping to cut this release today since it is quite overdue. As there's no open PR for this yet, do you want to try to get one in today (in which case I can wait until Monday to cut the release), or should we punt this to the next release milestone?

@oceanc80 oceanc80 modified the milestones: v1.34.0, v1.35.0 Feb 16, 2024
@holyspectral (Contributor) commented:

Hi @oceanc80, sorry, I was traveling and missed your message. I should be able to work on this tomorrow. There will be a new parameter in watches.yaml to avoid breaking existing applications; other than that it should be straightforward. I will create a PR in a few days.
