Helm uninstall stuck indefinitely if `tigera-operator` pre-delete job fails to schedule #9220

TheNilesh · 2024-09-11T18:52:27Z

Expected Behavior

During Helm uninstallation of the tigera-operator chart, the uninstall job (defined as a pre-delete hook) should execute and complete or fail, allowing Helm to remove the associated resources without hanging indefinitely.

Current Behavior

Helm uninstallation becomes stuck indefinitely if the tigera-operator-uninstall job (pre-delete hook) is not scheduled or fails to execute, particularly when the cluster has no active nodes. This prevents the uninstallation process from completing, requiring manual intervention.

Possible Solution

A potential solution could involve adding configuration options to manage the behavior of the pre-delete hook. For example:

Introduce a configurable timeout for the pre-delete job. If the job is not scheduled or completed within the specified time, Helm should forcefully delete the release to avoid being stuck.
Implement a flag in the Helm chart to allow force deletion of resources after the pre-delete job times out.
Allow disabling the pre-delete hook for clusters that may not have nodes, ensuring the Helm uninstall can still proceed in control-plane-only environments.
Use the Kubernetes Job fields `activeDeadlineSeconds` and `backoffLimit` to bound the job's runtime and retries, ensuring it doesn't block indefinitely.
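
To illustrate the last suggestion, here is a minimal sketch of what a bounded pre-delete hook Job could look like, built as a Python dict and printed as JSON. The job name and namespace come from this issue; the image, the default values, and the `build_uninstall_job` helper are placeholders I made up for illustration, not what the chart actually ships.

```python
import json


def build_uninstall_job(deadline_seconds: int = 120, backoff_limit: int = 0) -> dict:
    """Return a pre-delete hook Job manifest with bounded runtime and retries.

    Hypothetical helper: the image below is a placeholder, not the chart's real one.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": "tigera-operator-uninstall",
            "namespace": "tigera-operator",
            "annotations": {
                # Run this Job before Helm deletes the release's resources.
                "helm.sh/hook": "pre-delete",
                # Clean the Job up whether it succeeded or failed.
                "helm.sh/hook-delete-policy": "before-hook-creation,hook-succeeded,hook-failed",
            },
        },
        "spec": {
            # Mark the Job as failed once it has been active this long.
            "activeDeadlineSeconds": deadline_seconds,
            # Number of retries before the Job is marked as failed.
            "backoffLimit": backoff_limit,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "uninstall",
                            # Placeholder image (assumption).
                            "image": "example.invalid/uninstall:placeholder",
                        }
                    ],
                }
            },
        },
    }


# JSON is valid input for `kubectl apply -f -`, so the output can be inspected or applied.
print(json.dumps(build_uninstall_job(), indent=2))
```

As far as I understand, a Job that exceeds its deadline is marked failed, so Helm would most likely report a hook error instead of waiting, but it would probably not go on to delete the release by itself. The force-delete or disable-hook options above would still be needed for that.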

Steps to Reproduce (for bugs)

Install the tigera-operator Helm chart on a Kubernetes cluster managed through a hosted control plane using Cluster API (CAPI), Kamaji, and Sveltos.
Remove all nodes from the virtual cluster (leaving only the control plane).
Attempt to uninstall the Helm chart using `helm uninstall tigera-operator -n tigera-operator`.
Observe that Helm becomes stuck, waiting for the uninstall job that never runs due to the lack of nodes.

Context

This issue affects environments where the Helm chart is installed on a hosted control plane, specifically clusters managed by Cluster API (CAPI), Kamaji, and Sveltos. When no worker nodes are present, the uninstall job (pre-delete hook) cannot be scheduled, leaving the Helm uninstallation stuck indefinitely. This interferes with automated cluster management workflows, where clean resource removal is required even when no nodes are attached to the cluster. It mainly happens when the tigera-operator installation is managed through a Sveltos cluster profile.

Your Environment

Calico version: v3.28.0
Orchestrator version: Kubernetes 1.29.6
Operating System and version: Ubuntu 20.04
Cluster API (CAPI), Kamaji, Sveltos


caseydavenport · 2024-09-13T20:26:39Z

I believe helm has a `--no-hooks` option you can pass in order to disable the hook if there are no schedulable nodes in the cluster: https://helm.sh/docs/helm/helm_uninstall/

  --no-hooks             prevent hooks from running during uninstallation

Would that do the trick in your case?
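
For automated workflows, that check could be scripted. Below is a rough sketch, assuming the output of `kubectl get nodes -o json` is available as a string. The helper names are hypothetical, and the check only looks at whether nodes exist and are not cordoned; it ignores taints and node readiness.

```python
import json
from typing import List


def count_schedulable_nodes(nodes_json: str) -> int:
    """Count nodes that are not cordoned, given `kubectl get nodes -o json` output."""
    items = json.loads(nodes_json).get("items", [])
    return sum(
        1 for node in items
        if not node.get("spec", {}).get("unschedulable", False)
    )


def build_uninstall_cmd(release: str, namespace: str, nodes_json: str) -> List[str]:
    """Build the `helm uninstall` command, skipping hooks when nothing can run them."""
    cmd = ["helm", "uninstall", release, "-n", namespace]
    if count_schedulable_nodes(nodes_json) == 0:
        # No node can run the pre-delete job, so do not wait for it.
        cmd.append("--no-hooks")
    return cmd


# Example wiring (not executed here):
#   import subprocess
#   nodes = subprocess.run(["kubectl", "get", "nodes", "-o", "json"],
#                          capture_output=True, text=True, check=True).stdout
#   subprocess.run(build_uninstall_cmd("tigera-operator", "tigera-operator", nodes),
#                  check=True)
```

Note that with `--no-hooks` the uninstall job never runs, so whatever cleanup it normally performs is skipped.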
