fix(backend): stop heartbeat status updates for ScheduledWorkflows. Fixes #8757 #1

Closed
@demarna1 demarna1 commented Nov 7, 2024

Description of your changes:

Closes issue: kubeflow#8757

Every time the ScheduledWorkflow (SWF) controller syncs an SWF resource, it sets the Last Heartbeat Time and Last Transition Time in the status block to the current time, even when nothing else about the resource has changed.

Status:
  Conditions:
    Last Heartbeat Time:   2024-11-07T11:16:33Z
    Last Transition Time:  2024-11-07T11:16:33Z
    Message:               The schedule is disabled.
    Reason:                Disabled
    Status:                True
    Type:                  Disabled

These heartbeat updates result in an infinite reconciliation loop:

  • SWF is added to controller work queue.
  • Controller processes the SWF and sets the status's LastProbeTime (shown above as Last Heartbeat Time) and LastTransitionTime to the current time.
  • Object is re-written to ETCD and the resourceVersion is updated.
  • Shared informer detects that the resourceVersion has changed.
  • Controller event handler re-adds the SWF to the work queue.

This reconciliation loop repeats every 10 seconds for every SWF resource on the cluster. The interval is 10s rather than 1s because the controller has a default queue backoff of 10s, so events always wait in the queue for at least 10s.
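
The loop described above can be broken by writing the status only when something other than the timestamps has changed. The following is a minimal sketch of that idea under simplified assumptions, not the controller's actual code: `Condition`, `sameState`, and `nextCondition` are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// Condition is a simplified, hypothetical stand-in for an SWF status condition.
type Condition struct {
	Type               string
	Status             string
	Reason             string
	Message            string
	LastProbeTime      time.Time
	LastTransitionTime time.Time
}

// sameState reports whether two conditions agree on everything except timestamps.
func sameState(a, b Condition) bool {
	return a.Type == b.Type && a.Status == b.Status &&
		a.Reason == b.Reason && a.Message == b.Message
}

// nextCondition returns the condition to store and whether a write is needed.
// Timestamps are refreshed only when the state actually changes, so an
// unchanged resource produces no write and no new resourceVersion.
func nextCondition(current, desired Condition, now time.Time) (Condition, bool) {
	if sameState(current, desired) {
		return current, false
	}
	desired.LastProbeTime = now
	desired.LastTransitionTime = now
	return desired, true
}

func main() {
	t0 := time.Date(2024, 11, 7, 11, 16, 33, 0, time.UTC)
	stored := Condition{
		Type: "Disabled", Status: "True", Reason: "Disabled",
		Message:       "The schedule is disabled.",
		LastProbeTime: t0, LastTransitionTime: t0,
	}

	// A later sync of an unchanged resource should not require a write.
	_, write := nextCondition(stored, stored, t0.Add(10*time.Second))
	fmt.Println("write needed:", write)

	// A real state change should require a write with fresh timestamps.
	enabled := Condition{Type: "Enabled", Status: "True", Reason: "Enabled",
		Message: "The schedule is enabled."}
	_, write = nextCondition(stored, enabled, t0.Add(20*time.Second))
	fmt.Println("write needed:", write)
}
```

With no write on an unchanged sync, the resourceVersion stays the same, the informer sees no update, and the SWF is not re-queued.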

ETCD performance before & after

I measured ETCD bytes written for all resources on our cluster over a 10-minute time span. After applying this fix, we saw a dramatic decrease in ETCD write volume.

[Image: etcd]

The chart agrees with the back-of-the-napkin math:

  • The average size of our SWF objects is 270 KB.
  • Controller re-writes the object every 10 seconds (6x/min).
  • Bytes written to ETCD per minute = 270 KB x 6/min ≈ 1.6 MB/min per SWF.
  • Our cluster had 54 SWFs at the time of the analysis.
  • ETCD write throughput is 54 x 1.6 MB/min ≈ 86 MB/min ≈ 430 MB every 5 min.
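
The same estimate can be reproduced in a few lines of code. This is only a sketch of the arithmetic above; `writeRateMBPerMin` is a hypothetical helper, and it assumes 1 MB = 1000 KB.

```go
package main

import "fmt"

// writeRateMBPerMin estimates write throughput in MB/min for `count` objects
// of `objectKB` kilobytes each, rewritten once every `intervalSec` seconds.
// Assumes 1 MB = 1000 KB.
func writeRateMBPerMin(objectKB, intervalSec float64, count int) float64 {
	writesPerMin := 60.0 / intervalSec
	return objectKB * writesPerMin * float64(count) / 1000.0
}

func main() {
	perSWF := writeRateMBPerMin(270, 10, 1)
	cluster := writeRateMBPerMin(270, 10, 54)
	fmt.Printf("per SWF: %.2f MB/min\n", perSWF)
	fmt.Printf("cluster: %.1f MB/min, %.0f MB per 5 min\n", cluster, cluster*5)
}
```

Without rounding the per-SWF figure to 1.6 first, this gives about 87 MB/min and roughly 437 MB per 5 minutes, in line with the rounded numbers above.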


@demarna1 demarna1 closed this Nov 7, 2024