
[backend] Performance issue: ScheduledWorkflow is taking significant amount of etcd storage #8757

Open
deepk2u opened this issue Jan 25, 2023 · 20 comments · May be fixed by #11363
@deepk2u
Contributor

deepk2u commented Jan 25, 2023

Environment

  • How did you deploy Kubeflow Pipelines (KFP)?
    Full Kubeflow deployment using manifests
  • KFP version: 2.0.0b6
  • KFP SDK version: 2.0.0b6

Steps to reproduce

We have around 125 recurring runs within a single namespace. After a few months of historical runs, we have started seeing performance issues in the k8s cluster.

After digging deeper, we found that calls to etcd were timing out. When we checked the etcd database for objects, we found that one particular namespace, which has 125 recurring runs, is taking 996 MB of etcd space.

Some data to look at:

Entries by 'KEY GROUP' (total 1.6 GB):
+---------------------------------------------------------+--------------------+--------+
|                         KEY GROUP                       |        KIND        |  SIZE  |
+---------------------------------------------------------+--------------------+--------+
| /registry/kubeflow.org/scheduledworkflows/<namespace1>  | ScheduledWorkflow  | 996 MB |
| /registry/kubeflow.org/scheduledworkflows/<namespace2>  | ScheduledWorkflow  | 211 MB |
| /registry/kubeflow.org/scheduledworkflows/<namespace3>  | ScheduledWorkflow  | 118 MB |

.....

namespace1 has 123 recurring runs
namespace2 has 40 recurring runs
namespace3 has 63 recurring runs
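
For anyone trying to reproduce these numbers, here is a minimal sketch (not the exact tool that produced the table above) that sums stored value sizes per namespace under the ScheduledWorkflow prefix with the etcd v3 Go client. It assumes direct read access to etcd; the endpoint is a placeholder and TLS is omitted, and on a managed control plane such as EKS you would need provider support to reach etcd at all.

// etcd_swf_usage.go: rough sketch that groups stored ScheduledWorkflow bytes by namespace.
package main

import (
	"context"
	"fmt"
	"strings"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://127.0.0.1:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Fetch every key/value pair under the ScheduledWorkflow registry prefix.
	const prefix = "/registry/kubeflow.org/scheduledworkflows/"
	resp, err := cli.Get(ctx, prefix, clientv3.WithPrefix())
	if err != nil {
		panic(err)
	}

	// Group the stored bytes by namespace (the path segment after the prefix).
	sizeByNamespace := map[string]int{}
	for _, kv := range resp.Kvs {
		rest := strings.TrimPrefix(string(kv.Key), prefix)
		ns := strings.SplitN(rest, "/", 2)[0]
		sizeByNamespace[ns] += len(kv.Value)
	}
	for ns, size := range sizeByNamespace {
		fmt.Printf("%-40s %8.1f MB\n", ns, float64(size)/1e6)
	}
}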

Expected result

It looks like we are storing a lot of unnecessary information in the ScheduledWorkflow objects, which ends up taking space in the etcd database and causing these performance issues.

Materials and Reference


Impacted by this bug? Give it a 👍.

@connor-mccarthy
Member

/assign @gkcalat

@gkcalat
Member

gkcalat commented Jan 27, 2023

Hi @deepk2u!
It may be due to insufficient resource provisioning or a lack of etcd maintenance (see here). How long did it take you to reach these numbers?

@deepk2u
Contributor Author

deepk2u commented Jan 31, 2023

It's an EKS cluster. We are in contact with AWS support; they maintain the cluster, run defragmentation, and handle the other etcd maintenance.

The oldest object I have on the list is from 14 July 2022.

@gkcalat
Member

gkcalat commented Feb 1, 2023

Can you check how large the pipeline manifests used in these recurring runs are?

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Aug 26, 2023

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

@kuldeepjain

/reopen


@kuldeepjain: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@deepk2u
Contributor Author

deepk2u commented Jan 5, 2024

/reopen

@google-oss-prow google-oss-prow bot reopened this Jan 5, 2024

@deepk2u: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@github-actions github-actions bot removed the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Jan 5, 2024
@rimolive
Member

rimolive commented Apr 3, 2024

Closing this issue. No activity for more than a year.

/close


@rimolive: Closing this issue.

In response to this:

Closing this issue. No activity for more than a year.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@rimolive
Member

/reopen

We found this issue in KFP 2.0.5. We'll work on a pruning mechanism for pipeline run k8s objects.

@google-oss-prow google-oss-prow bot reopened this Jun 17, 2024

@rimolive: Reopened this issue.

In response to this:

/reopen

We found this issue in KFP 2.0.5. We'll work on a pruning mechanism for pipeline run k8s objects.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.


This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Aug 17, 2024

github-actions bot commented Sep 9, 2024

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

@github-actions github-actions bot closed this as completed Sep 9, 2024
@demarna1

/reopen

I am seeing this issue in KFP 2.2.0. Our cluster has ~50 ScheduledWorkflows, and we are seeing ~400 MB written to etcd every 10 minutes under the /registry/kubeflow.org/scheduledworkflows prefix.

The culprit seems to be the heartbeat status updates, e.g.:

Status:
  Conditions:
    Last Heartbeat Time:   2024-09-19T11:16:33Z
    Last Transition Time:  2024-09-19T11:16:33Z
    Message:               The schedule is disabled.
    Reason:                Disabled
    Status:                True
    Type:                  Disabled

These status updates mean the SWF objects never finish reconciling, resulting in the following reconciliation loop:

  1. The SWF is added to the controller work queue.
  2. The controller processes the SWF and updates the status heartbeat and transition time to the current time.
  3. The object is rewritten to etcd and its resourceVersion is updated.
  4. The controller's event handler re-adds the SWF to the work queue.

This reconciliation loop occurs every 10 seconds for every SWF on the cluster (note: the reason it's 10s rather than 1s is the controller's default queue backoff, so events are always queued for a minimum of 10s).

How are the heartbeat time and transition time used in Kubeflow? If they are not used, one possible fix here would be to remove them from the status block.
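
To make the proposal concrete, here is a minimal, self-contained sketch of the idea (illustrative only; the Condition type and helper below are stand-ins, not KFP's actual ScheduledWorkflow controller code): compare the old and new conditions while ignoring the heartbeat timestamp, and skip the status write when nothing else changed, so an otherwise-identical SWF is not rewritten to etcd and requeued.

package main

import (
	"fmt"
	"time"
)

// Minimal stand-in for a status condition. The field names mirror the
// condition shape shown above, but the type is illustrative, not the
// actual ScheduledWorkflow API type.
type Condition struct {
	Type               string
	Status             string
	Reason             string
	Message            string
	LastHeartbeatTime  time.Time
	LastTransitionTime time.Time
}

// conditionsEquivalent deliberately ignores LastHeartbeatTime, so a
// heartbeat-only refresh does not count as a change worth persisting.
func conditionsEquivalent(prev, next []Condition) bool {
	if len(prev) != len(next) {
		return false
	}
	for i := range prev {
		p, n := prev[i], next[i]
		if p.Type != n.Type || p.Status != n.Status ||
			p.Reason != n.Reason || p.Message != n.Message ||
			!p.LastTransitionTime.Equal(n.LastTransitionTime) {
			return false
		}
	}
	return true
}

func main() {
	now := time.Now()
	prev := []Condition{{
		Type: "Disabled", Status: "True", Reason: "Disabled",
		Message:            "The schedule is disabled.",
		LastHeartbeatTime:  now.Add(-10 * time.Second),
		LastTransitionTime: now.Add(-time.Hour),
	}}
	next := []Condition{{
		Type: "Disabled", Status: "True", Reason: "Disabled",
		Message:            "The schedule is disabled.",
		LastHeartbeatTime:  now,
		LastTransitionTime: now.Add(-time.Hour),
	}}

	// Only the heartbeat moved, so the controller could skip the update
	// instead of rewriting the object to etcd and triggering a requeue.
	if conditionsEquivalent(prev, next) {
		fmt.Println("skip status update: only the heartbeat changed")
	} else {
		fmt.Println("persist status update: a meaningful field changed")
	}
}

In practice a check like this would sit in the controller's sync path right before the status update call; alternatively, as suggested above, the heartbeat field could simply be dropped from the status block if nothing consumes it.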

cc @droctothorpe


@demarna1: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@droctothorpe
Contributor

/reopen


@droctothorpe: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@google-oss-prow google-oss-prow bot reopened this Sep 24, 2024
@github-actions github-actions bot removed the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Sep 25, 2024
demarna1 added a commit to demarna1/pipelines that referenced this issue Nov 7, 2024