
[Core feature] Delete terminated workflows in chunks during garbage collection #2160

Open · 2 tasks done
jeevb opened this issue Feb 12, 2022 · 5 comments
Assignees: hamersaw
Labels: enhancement (New feature or request)

Comments

@jeevb
Contributor

jeevb commented Feb 12, 2022

Motivation: Why do you think this is important?

A large number of FlyteWorkflow objects may overwhelm Flyte's garbage collection routine. The garbage collector works by first listing all objects in the respective namespaces, and this operation times out when a namespace contains a large number of objects.

Because FlytePropeller also watches for new or updated FlyteWorkflow objects in the namespaces assigned to it, the ListAndWatch operation will time out as well when any of these namespaces holds a large number of objects. As a result, the whole workflow engine grinds to a halt once the number of FlyteWorkflow objects grows beyond what the garbage collector can handle. See the log below:

I0124 16:52:23.295697       1 trace.go:205] Trace[1562460260]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167 (24-Jan-2022 16:51:53.294) (total time: 30000ms):
Trace[1562460260]: [30.000707142s] [30.000707142s] END
E0124 16:52:23.295723       1 reflector.go:138] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1alpha1.FlyteWorkflow: failed to list *v1alpha1.FlyteWorkflow: Get "https://192.168.3.1:443/apis/flyte.lyft.com/v1alpha1/flyteworkflows?labelSelector=termination-status+notin+%28terminated%29&limit=500&resourceVersion=0": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Goal: What should the final outcome look like, ideally?

The garbage collector should limit the number of terminated workflows it lists and deletes on each tick. This avoids timeouts when a namespace contains a large number of objects, so garbage collection can complete successfully even if it takes longer overall.
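
For illustration only, here is a minimal sketch of what chunked collection could look like using client-go's dynamic client against the FlyteWorkflow GVR from the log above. The function name, label selector, and page size are assumptions for this sketch, not FlytePropeller's actual GC code:

```go
package gc

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// flyteWorkflowGVR identifies the FlyteWorkflow CRD seen in the log above.
var flyteWorkflowGVR = schema.GroupVersionResource{
	Group:    "flyte.lyft.com",
	Version:  "v1alpha1",
	Resource: "flyteworkflows",
}

// deleteTerminatedChunk asks the API server for at most chunkSize terminated
// workflows in one namespace and deletes them one by one. Calling this once
// per GC tick bounds the size of every List request, so a namespace with a
// huge backlog is drained over several ticks instead of timing out a single
// unbounded List. The "termination-status" label selector mirrors the one in
// the log above and is an assumption about how terminated CRDs are labeled.
func deleteTerminatedChunk(ctx context.Context, client dynamic.Interface, namespace string, chunkSize int64) (int, error) {
	list, err := client.Resource(flyteWorkflowGVR).Namespace(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: "termination-status in (terminated)",
		Limit:         chunkSize,
	})
	if err != nil {
		return 0, err
	}
	deleted := 0
	for _, item := range list.Items {
		if err := client.Resource(flyteWorkflowGVR).Namespace(namespace).
			Delete(ctx, item.GetName(), metav1.DeleteOptions{}); err != nil {
			return deleted, err
		}
		deleted++
	}
	return deleted, nil
}
```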

Describe alternatives you've considered

We considered a cron job that manually performs chunked deletion of terminated workflows as a workaround, but we believe this is better fixed in FlytePropeller.

Propose: Link/Inline OR Additional context

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@jeevb jeevb added enhancement New feature or request untriaged This issue has not yet been looked at by the Maintainers labels Feb 12, 2022
@hamersaw
Contributor

hamersaw commented Mar 2, 2022

@jeevb to recap: there are a large number of workflows, and FlytePropeller's management of them overwhelms the k8s API server, resulting in timeouts. These affect not only the informer watch on new FlyteWorkflow CRDs but also the deletion of terminated CRDs. The problem is that this issue will persist until deletion succeeds, and the failure in deletion is cyclic.

Do you have the specific GC error? The GitHub permalink you provided looks like it lists the namespaces so that it can iterate over each one and perform the actual deletion here. Given the timeouts you're experiencing, I think either of these calls could time out, but it might make things easier if we know that only one of them is.

I think your idea of limiting the number of CRDs deleted on each tick makes sense. However, the limit field in ListOptions is not supported through the delete API, so we can either delete all CRDs matching a label selector or delete a single one. This makes things a little more difficult.

In looking for a solution I discovered k8s FlowSchema. It allows assigning priorities to API calls based on simple matching rules, and it is enabled by default in v1.20 (and can be enabled manually on earlier versions). The idea is that we could create a FlowSchema that gives FlyteWorkflow CRD deletion API calls a high priority; those calls would then experience fewer (hopefully no) timeouts, breaking the cycle you're seeing. Do you think this might work? I would be happy to run some tests.
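
To make this concrete, here is a rough sketch of such a FlowSchema expressed with the k8s flowcontrol/v1beta1 Go types (in practice it would be applied as the equivalent YAML manifest). The priority level, matching precedence, service account, and namespace names are assumptions for illustration:

```go
package gc

import (
	flowcontrolv1beta1 "k8s.io/api/flowcontrol/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// flyteWorkflowDeleteSchema routes FlyteWorkflow delete calls from an assumed
// FlytePropeller service account to a higher-priority level so they are less
// likely to be throttled or time out under API server load.
var flyteWorkflowDeleteSchema = &flowcontrolv1beta1.FlowSchema{
	ObjectMeta: metav1.ObjectMeta{Name: "flyteworkflow-deletes"},
	Spec: flowcontrolv1beta1.FlowSchemaSpec{
		PriorityLevelConfiguration: flowcontrolv1beta1.PriorityLevelConfigurationReference{
			Name: "workload-high", // suggested built-in priority level; assumption
		},
		MatchingPrecedence: 500,
		Rules: []flowcontrolv1beta1.PolicyRulesWithSubjects{{
			Subjects: []flowcontrolv1beta1.Subject{{
				Kind: flowcontrolv1beta1.SubjectKindServiceAccount,
				ServiceAccount: &flowcontrolv1beta1.ServiceAccountSubject{
					Namespace: "flyte",          // assumed deployment namespace
					Name:      "flytepropeller", // assumed service account name
				},
			}},
			ResourceRules: []flowcontrolv1beta1.ResourcePolicyRule{{
				Verbs:      []string{"delete", "deletecollection"},
				APIGroups:  []string{"flyte.lyft.com"},
				Resources:  []string{"flyteworkflows"},
				Namespaces: []string{flowcontrolv1beta1.NamespaceEvery},
			}},
		}},
	},
}
```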

@hamersaw hamersaw self-assigned this Mar 2, 2022
@hamersaw hamersaw removed the untriaged This issue has not yet been looked at by the Maintainers label Mar 2, 2022
@github-actions

Hello 👋, This issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will close the issue if we detect no activity in the next 7 days. Thank you for your contribution and understanding! 🙏

@github-actions github-actions bot added the stale label Aug 28, 2023
@hamersaw
Contributor

Commenting to keep open.

@github-actions github-actions bot removed the stale label Aug 31, 2023
@github-actions

Hello 👋, this issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will engage on it to decide if it is still applicable.
Thank you for your contribution and understanding! 🙏

@github-actions github-actions bot added the stale label May 28, 2024
@kumare3
Contributor

kumare3 commented May 30, 2024

I do not think this is valid anymore, as we now simply apply label-based deletions and let them propagate in the background.
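
For reference, a label-scoped deletion with background propagation along these lines might look roughly like this with the dynamic client; the label selector and propagation policy are assumptions rather than the exact calls FlytePropeller makes:

```go
package gc

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// deleteTerminatedByLabel issues a single DeleteCollection call scoped by a
// label selector and lets the API server propagate the deletions in the
// background, instead of listing and deleting objects individually.
func deleteTerminatedByLabel(ctx context.Context, client dynamic.Interface, namespace string) error {
	propagation := metav1.DeletePropagationBackground
	gvr := schema.GroupVersionResource{Group: "flyte.lyft.com", Version: "v1alpha1", Resource: "flyteworkflows"}
	return client.Resource(gvr).Namespace(namespace).DeleteCollection(
		ctx,
		metav1.DeleteOptions{PropagationPolicy: &propagation},
		metav1.ListOptions{LabelSelector: "termination-status in (terminated)"},
	)
}
```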

@github-actions github-actions bot removed the stale label May 31, 2024