[Core feature] Delete terminated workflows in chunks during garbage collection #2160
Comments
@jeevb to recap: there are a large number of workflows, and FlytePropeller's management of them overwhelms the k8s API server, resulting in timeouts. These affect not only the informer watch on new FlyteWorkflow CRDs but also the deletion of terminated CRDs. The problem is that this issue will persist until deletion works, and the failure in deletion is cyclic.

Do you have the specific GC error? The GitHub permalink you provided looks like it's listing the namespaces so that it can iterate over each and perform the actual deletion here. Given the timeouts you're experiencing, either of these calls could time out, but it might make things easier if we knew whether just one of them is.

I think your idea of limiting the number of CRDs deleted on each tick makes sense. However, the limit field in ListOptions is not supported through the delete API, so we can either delete all CRDs matching a query on label selectors or delete a single one. This makes things a little more difficult.

While looking for a solution I discovered k8s FlowSchema. It allows assigning priorities to API calls based on simple matching rules, and it is enabled by default in v1.20 (and can be enabled manually on previous versions). The idea is that we could create a FlowSchema that assigns a high priority to FlyteWorkflow CRD deletion API calls; those calls would then experience fewer (hopefully no) timeouts, breaking the cycle you're seeing. Do you think this is something that might work? I would be happy to run some tests.
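For reference, a rough sketch of what such a FlowSchema could look like, expressed with the typed Go client rather than a manifest. The schema name, service account, namespace, verbs, and precedence below are illustrative assumptions, not a tested configuration:

```go
package apfsketch

import (
	"context"

	flowcontrolv1beta1 "k8s.io/api/flowcontrol/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createGCFlowSchema registers a FlowSchema that routes FlyteWorkflow API calls
// made by FlytePropeller's service account into the built-in "workload-high"
// priority level, so they are less likely to be queued or rejected when the API
// server is overloaded. The schema name, service account, namespace, verbs, and
// precedence are illustrative assumptions.
func createGCFlowSchema(ctx context.Context, client kubernetes.Interface) error {
	fs := &flowcontrolv1beta1.FlowSchema{
		ObjectMeta: metav1.ObjectMeta{Name: "flyteworkflow-gc"},
		Spec: flowcontrolv1beta1.FlowSchemaSpec{
			// "workload-high" is one of the default PriorityLevelConfigurations.
			PriorityLevelConfiguration: flowcontrolv1beta1.PriorityLevelConfigurationReference{
				Name: "workload-high",
			},
			MatchingPrecedence: 500,
			Rules: []flowcontrolv1beta1.PolicyRulesWithSubjects{{
				Subjects: []flowcontrolv1beta1.Subject{{
					Kind: flowcontrolv1beta1.SubjectKindServiceAccount,
					ServiceAccount: &flowcontrolv1beta1.ServiceAccountSubject{
						Namespace: "flyte",          // assumed propeller namespace
						Name:      "flytepropeller", // assumed service account
					},
				}},
				ResourceRules: []flowcontrolv1beta1.ResourcePolicyRule{{
					Verbs:      []string{"list", "delete", "deletecollection"},
					APIGroups:  []string{"flyte.lyft.com"},
					Resources:  []string{"flyteworkflows"},
					Namespaces: []string{flowcontrolv1beta1.NamespaceEvery},
				}},
			}},
		},
	}
	_, err := client.FlowcontrolV1beta1().FlowSchemas().Create(ctx, fs, metav1.CreateOptions{})
	return err
}
```

The flowcontrol API group version used here is v1beta1, matching the v1.20 era mentioned above; newer clusters expose the same object under flowcontrol.apiserver.k8s.io/v1.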
Hello 👋, This issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will close the issue if we detect no activity in the next 7 days. Thank you for your contribution and understanding! 🙏
Commenting to keep open.
Hello 👋, this issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will engage on it to decide if it is still applicable. |
I do not think this is valid anymore, as we simply apply label-based deletions and let them
Motivation: Why do you think this is important?
A large number of FlyteWorkflow objects may overwhelm Flyte's garbage collection routine. This is because the garbage collector works by first listing all objects in the respective namespaces. This operation will time out in the event that there is a large number of objects in a given namespace.
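For illustration, such a namespace-wide cleanup boils down to a single bulk call against the FlyteWorkflow CRD per namespace, roughly like the sketch below (the dynamic-client wiring, function name, and label selector are assumptions, not FlytePropeller's actual code), and it is this one large request that can exceed the API server timeout:

```go
package gcsketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// flyteWorkflowGVR identifies the FlyteWorkflow CRD; group/version/resource are
// taken from the public CRD definition and may differ between Flyte releases.
var flyteWorkflowGVR = schema.GroupVersionResource{
	Group:    "flyte.lyft.com",
	Version:  "v1alpha1",
	Resource: "flyteworkflows",
}

// deleteTerminatedInNamespace removes every FlyteWorkflow matching the label
// selector in a namespace with a single DeleteCollection request. With very
// many objects this one call can run past the API server timeout, which is the
// failure mode described above.
func deleteTerminatedInNamespace(ctx context.Context, client dynamic.Interface, namespace, selector string) error {
	return client.Resource(flyteWorkflowGVR).Namespace(namespace).DeleteCollection(
		ctx,
		metav1.DeleteOptions{},
		metav1.ListOptions{LabelSelector: selector},
	)
}
```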
Given that FlytePropeller watches for new or updated FlyteWorkflow objects in the namespaces assigned to it, when any of these namespaces has a large number of objects, the `ListAndWatch` operation will time out as well. This causes the whole workflow engine to grind to a halt when the number of FlyteWorkflow objects blows up beyond what the garbage collector can handle! See below for the log:

Goal: What should the final outcome look like, ideally?
The garbage collector should limit the number of terminated workflows it lists/deletes every tick. This will avoid timeouts in the event that there is a large number of objects in a given namespace, so garbage collection will be able to complete successfully, even if it takes longer.
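A minimal sketch of that chunked behaviour, continuing the gcsketch snippet from the Motivation section (same package, same flyteWorkflowGVR and dynamic client); the function name, chunk size, and selector are illustrative assumptions rather than the actual implementation:

```go
// deleteTerminatedInChunks lists at most chunkSize terminated workflows and then
// deletes them one by one, so no single request has to cover the whole namespace.
// Each GC tick processes one chunk; subsequent ticks pick up the remainder.
func deleteTerminatedInChunks(ctx context.Context, client dynamic.Interface, namespace, selector string, chunkSize int64) (int, error) {
	list, err := client.Resource(flyteWorkflowGVR).Namespace(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: selector,
		Limit:         chunkSize, // honored by List, unlike the delete-collection path
	})
	if err != nil {
		return 0, err
	}

	deleted := 0
	for _, item := range list.Items {
		if err := client.Resource(flyteWorkflowGVR).Namespace(namespace).Delete(ctx, item.GetName(), metav1.DeleteOptions{}); err != nil {
			return deleted, err
		}
		deleted++
	}
	return deleted, nil
}
```

Deleting items individually trades a few extra API calls for bounded request sizes; if one chunk per tick proves too slow, the Continue token returned by the List call could be carried across ticks to page through the backlog.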
Describe alternatives you've considered
We considered a cronjob that manually performs chunked deletion of terminated workflows to work around this issue, but we believe this is better fixed in FlytePropeller.
Propose: Link/Inline OR Additional context
No response
Are you sure this issue hasn't been raised already?
Have you read the Code of Conduct?