
[backend] Performance issue: ScheduledWorkflow is taking significant amount of etcd storage #8757

Open
deepk2u opened this issue Jan 25, 2023 · 20 comments · May be fixed by #11363
@deepk2u
Contributor

deepk2u commented Jan 25, 2023

Environment

  • How did you deploy Kubeflow Pipelines (KFP)?
    Full Kubeflow deployment using manifests
  • KFP version: 2.0.0b6
  • KFP SDK version: 2.0.0b6

Steps to reproduce

We have around 125 recurring runs within a single namespace. After a few months of historical runs, we have started seeing performance issues in the k8s cluster.

After digging deeper, we found that calls to etcd were timing out. When we checked the etcd database for objects, we found that one particular namespace, which has 125 recurring runs, is taking 996 MB of etcd space.

Some data to look at:

Entries by 'KEY GROUP' (total 1.6 GB):
+---------------------------------------------------------+--------------------+--------+
|                         KEY GROUP                       |        KIND        |  SIZE  |
+---------------------------------------------------------+--------------------+--------+
| /registry/kubeflow.org/scheduledworkflows/<namespace1>  | ScheduledWorkflow  | 996 MB |
| /registry/kubeflow.org/scheduledworkflows/<namespace2>  | ScheduledWorkflow  | 211 MB |
| /registry/kubeflow.org/scheduledworkflows/<namespace3>  | ScheduledWorkflow  | 118 MB |

.....

namespace1 has 123 recurring runs
namespace2 has 40 recurring runs
namespace3 has 63 recurring runs
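
For anyone trying to reproduce these numbers, here is a minimal sketch (not the exact tool that produced the table above) that sums stored value sizes per namespace under the ScheduledWorkflow prefix with the etcd v3 Go client. It assumes direct read access to etcd; the endpoint is a placeholder and TLS is omitted, and on a managed control plane such as EKS you would need provider support to reach etcd at all.

// etcd_swf_usage.go: rough sketch that groups stored ScheduledWorkflow bytes by namespace.
package main

import (
	"context"
	"fmt"
	"strings"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://127.0.0.1:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Fetch every key/value pair under the ScheduledWorkflow registry prefix.
	const prefix = "/registry/kubeflow.org/scheduledworkflows/"
	resp, err := cli.Get(ctx, prefix, clientv3.WithPrefix())
	if err != nil {
		panic(err)
	}

	// Group the stored bytes by namespace (the path segment after the prefix).
	sizeByNamespace := map[string]int{}
	for _, kv := range resp.Kvs {
		rest := strings.TrimPrefix(string(kv.Key), prefix)
		ns := strings.SplitN(rest, "/", 2)[0]
		sizeByNamespace[ns] += len(kv.Value)
	}
	for ns, size := range sizeByNamespace {
		fmt.Printf("%-40s %8.1f MB\n", ns, float64(size)/1e6)
	}
}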

Expected result

It looks like we are storing a lot of unnecessary information in the ScheduledWorkflow objects, which ends up taking space in the etcd database and causing these performance issues.

Materials and Reference


Impacted by this bug? Give it a 👍.

@connor-mccarthy
Member

/assign @gkcalat

@gkcalat
Member

gkcalat commented Jan 27, 2023

Hi @deepk2u!
It may be due to insufficient resource provisioning or a lack of etcd maintenance (see here). How long did it take you to reach these numbers?

@deepk2u
Contributor Author

deepk2u commented Jan 31, 2023

It's an EKS cluster. We are in contact with AWS support; they maintain the cluster, run defragmentation, and handle the other etcd maintenance.

The oldest object I have on the list is from 14 July 2022.

@gkcalat
Member

gkcalat commented Feb 1, 2023

Can you check how large the pipeline manifests used in these recurring runs are?

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Aug 26, 2023

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

@kuldeepjain

/reopen


@kuldeepjain: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@deepk2u
Contributor Author

deepk2u commented Jan 5, 2024

/reopen

@google-oss-prow google-oss-prow bot reopened this Jan 5, 2024

@deepk2u: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@github-actions github-actions bot removed the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Jan 5, 2024
@rimolive
Member

rimolive commented Apr 3, 2024

Closing this issue. No activity for more than a year.

/close


@rimolive: Closing this issue.

In response to this:

Closing this issue. No activity for more than a year.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@rimolive
Member

/reopen

We found this issue in KFP 2.0.5. We'll work on a pruning mechanism for pipeline run k8s objects.

@google-oss-prow google-oss-prow bot reopened this Jun 17, 2024

@rimolive: Reopened this issue.

In response to this:

/reopen

We found this issue in KFP 2.0.5. We'll work on a pruning mechanism for pipeline run k8s objects.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.


This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Aug 17, 2024

github-actions bot commented Sep 9, 2024

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

@github-actions github-actions bot closed this as completed Sep 9, 2024
@demarna1

/reopen

I am seeing this issue in KFP 2.2.0. Our cluster has ~50 ScheduledWorkflows, and we are seeing ~400 MB written to etcd every 10 minutes under the /registry/kubeflow.org/scheduledworkflows prefix.

The culprit seems to be the heartbeat status updates, e.g.:

Status:
  Conditions:
    Last Heartbeat Time:   2024-09-19T11:16:33Z
    Last Transition Time:  2024-09-19T11:16:33Z
    Message:               The schedule is disabled.
    Reason:                Disabled
    Status:                True
    Type:                  Disabled

These status updates mean the SWF objects never finish reconciling, resulting in the following reconciliation loop:

  1. The SWF is added to the controller work queue.
  2. The controller processes the SWF and updates the status heartbeat and transition time to the current time.
  3. The object is rewritten to etcd and its resourceVersion is updated.
  4. The controller's event handler re-adds the SWF to the work queue.

This reconciliation loop occurs every 10 seconds for every SWF on the cluster (note: the reason it's 10s rather than 1s is the controller's default queue backoff, so events are always queued for a minimum of 10s).

How are the heartbeat time and transition time used in Kubeflow? If they are not used, one possible fix here would be to remove them from the status block.
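
To make the proposal concrete, here is a minimal, self-contained sketch of the idea (illustrative only; the Condition type and helper below are stand-ins, not KFP's actual ScheduledWorkflow controller code): compare the old and new conditions while ignoring the heartbeat timestamp, and skip the status write when nothing else changed, so an otherwise-identical SWF is not rewritten to etcd and requeued.

package main

import (
	"fmt"
	"time"
)

// Minimal stand-in for a status condition. The field names mirror the
// condition shape shown above, but the type is illustrative, not the
// actual ScheduledWorkflow API type.
type Condition struct {
	Type               string
	Status             string
	Reason             string
	Message            string
	LastHeartbeatTime  time.Time
	LastTransitionTime time.Time
}

// conditionsEquivalent deliberately ignores LastHeartbeatTime, so a
// heartbeat-only refresh does not count as a change worth persisting.
func conditionsEquivalent(prev, next []Condition) bool {
	if len(prev) != len(next) {
		return false
	}
	for i := range prev {
		p, n := prev[i], next[i]
		if p.Type != n.Type || p.Status != n.Status ||
			p.Reason != n.Reason || p.Message != n.Message ||
			!p.LastTransitionTime.Equal(n.LastTransitionTime) {
			return false
		}
	}
	return true
}

func main() {
	now := time.Now()
	prev := []Condition{{
		Type: "Disabled", Status: "True", Reason: "Disabled",
		Message:            "The schedule is disabled.",
		LastHeartbeatTime:  now.Add(-10 * time.Second),
		LastTransitionTime: now.Add(-time.Hour),
	}}
	next := []Condition{{
		Type: "Disabled", Status: "True", Reason: "Disabled",
		Message:            "The schedule is disabled.",
		LastHeartbeatTime:  now,
		LastTransitionTime: now.Add(-time.Hour),
	}}

	// Only the heartbeat moved, so the controller could skip the update
	// instead of rewriting the object to etcd and triggering a requeue.
	if conditionsEquivalent(prev, next) {
		fmt.Println("skip status update: only the heartbeat changed")
	} else {
		fmt.Println("persist status update: a meaningful field changed")
	}
}

In practice a check like this would sit in the controller's sync path right before the status update call; alternatively, as suggested above, the heartbeat field could simply be dropped from the status block if nothing consumes it.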

cc @droctothorpe


@demarna1: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@droctothorpe
Contributor

/reopen


@droctothorpe: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@google-oss-prow google-oss-prow bot reopened this Sep 24, 2024
@github-actions github-actions bot removed the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Sep 25, 2024
demarna1 added a commit to demarna1/pipelines that referenced this issue Nov 7, 2024