Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: expire webhook cert generation Jobs #346

Closed
wants to merge 1 commit into from

Conversation

rainest
Copy link
Contributor

@rainest rainest commented Jun 18, 2024

What this PR does / why we need it:

Set a finished TTL on webhook cert Jobs, to clean them after they are finished.

Which issue this PR fixes

Related to https://github.com/Kong/gateway-operator-archive/issues/429

Special notes for your reviewer:

Jobs may continue to produce failures and associated alerts, but should at least clean up their failures after.

There's ongoing discussion with the cloud gateways team re the TTL value. Ideally this doesn't need to be configurable, but we're unsure what a reasonable value is to allow Jobs of interest (essentially Jobs that are finished because they've failed and could warrant human review--successful Jobs should always be okay to clean up).

I initially chose 60s with the rationale that actively failing Jobs won't reach finished for a while, and will spend a while in a fail->backoff->retry loop before reaching their final finished state after they give up retrying. Upped it to 600s after initial discussion; 3600s/1h was the other value I thought would be reasonable.

PR Readiness Checklist:

Complete these before marking the PR as ready to review:

  • the CHANGELOG.md release notes have been updated to reflect significant changes

@rainest rainest marked this pull request as ready for review June 18, 2024 17:50
@rainest rainest requested a review from a team as a code owner June 18, 2024 17:50
@rainest rainest enabled auto-merge (squash) June 18, 2024 17:52
@@ -75,6 +75,7 @@ func newWebhookCertificateConfigJobCommon(namespace, serviceAccountName string,
Namespace: namespace,
},
Spec: batchv1.JobSpec{
TTLSecondsAfterFinished: lo.ToPtr(int32(600)),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's hold off with this one for now as this will mask the underlying problem: the jobs (with the failing Pods will get deleted and there will be no sign of them and no sign of failure whereas the webhook - and related resources - won't get created).

The TTL seems to be counted after the job is considered completed or failed.

status:
  active: 1
  ready: 0
  startTime: "2024-06-19T15:03:26Z"
  terminating: 0
  uncountedTerminatedPods: {}

...

status:
  conditions:
  - lastProbeTime: "2024-06-19T15:04:12Z"
    lastTransitionTime: "2024-06-19T15:04:12Z"
    message: Job has reached the specified backoff limit
    reason: BackoffLimitExceeded
    status: "True"
    type: Failed
  failed: 1
  ready: 0
  startTime: "2024-06-19T15:03:26Z"
  terminating: 0
  uncountedTerminatedPods: {}

@rainest rainest closed this Jun 20, 2024
auto-merge was automatically disabled June 20, 2024 20:15

Pull request was closed

@rainest rainest deleted the feat/ttl-webhook-job branch June 20, 2024 20:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants