[backend] Performance issue: ScheduledWorkflow is taking significant amount of etcd storage #8757
Comments
/assign @gkcalat
It's an EKS cluster. We connected with AWS support; they maintain the cluster, run defragmentation, and do all kinds of maintenance for etcd. The oldest object I have on the list is from July 14th, 2022.
Can you check how large the pipeline manifests used in these recurring runs are?
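One quick way to gauge this (a sketch, not from the original thread; it assumes the ScheduledWorkflow CRD's plural is `scheduledworkflows` and that you have kubectl access to the namespace):

```sh
# Approximate serialized size of all ScheduledWorkflow objects in a
# namespace, as returned by the API server (not the exact etcd footprint).
kubectl get scheduledworkflows -n <namespace> -o yaml | wc -c
```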
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
/reopen
@kuldeepjain: You can't reopen an issue/PR unless you authored it or you are a collaborator. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/reopen
@deepk2u: Reopened this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Closing this issue. No activity for more than a year. /close
@rimolive: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/reopen We found this issue in KFP 2.0.5. We'll work on a pruning mechanism for pipeline run k8s objects.
@rimolive: Reopened this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
/reopen I am seeing this issue in KFP 2.2.0. Our cluster has ~50 ScheduledWorkflows, and we are seeing ~400MB written to etcd every 10 minutes for these objects. The culprit seems to be the heartbeat status updates, e.g.:
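The example snippet did not survive in this copy of the thread. Based on the heartbeat and transition fields discussed below, such an update looks roughly like this (field names and values are illustrative and may differ by KFP version):

```yaml
# Illustrative no-op update: only lastHeartbeatTime changes between writes.
status:
  conditions:
  - type: Enabled
    status: "True"
    reason: Enabled
    message: The schedule is enabled.
    lastHeartbeatTime: "2024-06-01T12:10:00Z"   # bumped on every reconcile
    lastTransitionTime: "2024-06-01T00:00:00Z"  # unchanged since enabling
```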
These status updates mean the SWF objects never settle into a steady state, resulting in the following reconciliation loop:
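The loop itself is not preserved in this copy; a minimal runnable mock of the feedback cycle being described (names and timings are illustrative, not the actual KFP controller code):

```go
package main

import (
	"fmt"
	"time"
)

// Mock of the self-sustaining reconcile loop: a heartbeat-only status
// update generates a watch event, which re-enqueues the object after
// the workqueue's ~10s backoff, which triggers another status update.
func main() {
	queue := make(chan string, 1)
	queue <- "namespace1/recurring-run-1" // initial enqueue (hypothetical key)

	for key := range queue {
		// Reconcile: nothing changed, but the controller still bumps
		// lastHeartbeatTime, so the object is written back to etcd.
		fmt.Printf("%s reconcile %s: heartbeat-only status write\n",
			time.Now().Format(time.RFC3339), key)

		go func(k string) {
			time.Sleep(10 * time.Second) // default queue backoff
			queue <- k                   // watch event re-enqueues the object
		}(key)
	}
}
```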
This reconciliation loop occurs every 10 seconds for every SWF on the cluster (note: the reason it's 10s and not 1s is the controller's default queue backoff, so events are always queued for a minimum of 10s). How are the heartbeat time and transition time used in Kubeflow? If they are not used, then one possible fix would be to remove them from the status block.
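If the heartbeat field turns out to be unused, a natural shape for such a fix is to compare statuses with the heartbeat zeroed out and skip the write when nothing else changed. A sketch, using a simplified stand-in for the SWF condition type (not the actual type from the KFP swf API package):

```go
package swfutil

import (
	"reflect"
	"time"
)

// Condition is a simplified stand-in for the ScheduledWorkflow
// condition described in this thread; it is not the real KFP type.
type Condition struct {
	Type               string
	Status             string
	Reason             string
	Message            string
	LastHeartbeatTime  time.Time
	LastTransitionTime time.Time
}

// conditionsChanged reports whether two condition lists differ in
// anything other than LastHeartbeatTime, so a controller could skip
// Update calls that would only bump the heartbeat timestamp.
func conditionsChanged(prev, cur []Condition) bool {
	if len(prev) != len(cur) {
		return true
	}
	for i := range prev {
		a, b := prev[i], cur[i]
		// Zero out the heartbeat on local copies before comparing.
		a.LastHeartbeatTime, b.LastHeartbeatTime = time.Time{}, time.Time{}
		if !reflect.DeepEqual(a, b) {
			return true
		}
	}
	return false
}
```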
@demarna1: You can't reopen an issue/PR unless you authored it or you are a collaborator. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/reopen
@droctothorpe: Reopened this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
…roller. Fixes kubeflow#8757 Signed-off-by: demarna1 <[email protected]>
Environment
Full Kubeflow deployment using manifests
Steps to reproduce
We have around 125 recurring runs within a single namespace. After a few months of historical runs, we have started seeing performance issues in the k8s cluster.
After digging deeper, we found that we are seeing timeouts in calls to etcd. When we checked the etcd database, we found that one particular namespace, which has 125 recurring runs, is taking 996MB of etcd space (a measurement sketch follows the list below).
Some data to look at:
namespace1 has 123 recurring runs
namespace2 has 40 recurring runs
namespace3 has 63 recurring runs
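A sketch for reproducing these measurements, assuming direct etcdctl access to the etcd endpoints (which managed control planes like EKS typically do not expose, so a provider's support team may have to run it) and the default etcd key layout for CRDs:

```sh
# Total bytes of stored ScheduledWorkflow values for one namespace.
ETCDCTL_API=3 etcdctl get \
  /registry/kubeflow.org/scheduledworkflows/namespace1 \
  --prefix --print-value-only | wc -c
```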
Expected result
It looks like we are storing a lot of unnecessary information in the ScheduledWorkflow objects, which takes up space in the etcd database and results in all of these performance issues.
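For a sense of where that space goes, a hypothetical sketch of a bloated SWF status; the field names approximate the SWF API from memory and may not match a given KFP version exactly:

```yaml
# Hypothetical: per-run bookkeeping and churning timestamps accumulate
# in the status of each ScheduledWorkflow object.
status:
  conditions:
  - type: Enabled
    status: "True"
    lastHeartbeatTime: "2022-07-14T00:00:10Z"   # rewritten on every reconcile
    lastTransitionTime: "2022-07-14T00:00:00Z"
  workflowHistory:
    completed:                                  # one entry per past run
    - name: recurring-run-1-1657756800
      scheduledAt: "2022-07-14T00:00:00Z"
    - name: recurring-run-1-1657843200
      scheduledAt: "2022-07-15T00:00:00Z"
```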
Materials and Reference
Impacted by this bug? Give it a 👍.