
[Core feature] Delete terminated workflows in chunks during garbage collection #2160

Open · 2 tasks done
jeevb opened this issue Feb 12, 2022 · 5 comments
Assignees: hamersaw
Labels: enhancement (New feature or request)

Comments

@jeevb
Contributor

jeevb commented Feb 12, 2022

Motivation: Why do you think this is important?

A large number of FlyteWorkflow objects may overwhelm Flyte's garbage collection routine. The garbage collector works by first listing all objects in the respective namespaces, and this operation times out when a namespace contains a large number of objects.

Because FlytePropeller also watches for new or updated FlyteWorkflow objects in the namespaces assigned to it, the ListAndWatch operation will time out as well when any of these namespaces holds a large number of objects. As a result, the whole workflow engine grinds to a halt once the number of FlyteWorkflow objects grows beyond what the garbage collector can handle. See the log below:

I0124 16:52:23.295697       1 trace.go:205] Trace[1562460260]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167 (24-Jan-2022 16:51:53.294) (total time: 30000ms):
Trace[1562460260]: [30.000707142s] [30.000707142s] END
E0124 16:52:23.295723       1 reflector.go:138] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1alpha1.FlyteWorkflow: failed to list *v1alpha1.FlyteWorkflow: Get "https://192.168.3.1:443/apis/flyte.lyft.com/v1alpha1/flyteworkflows?labelSelector=termination-status+notin+%28terminated%29&limit=500&resourceVersion=0": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Goal: What should the final outcome look like, ideally?

The garbage collector should limit the number of terminated workflows it lists and deletes on each tick. This avoids timeouts when a namespace contains a large number of objects, so garbage collection can complete successfully even if it takes longer overall.
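
For illustration only, here is a minimal sketch of what chunked collection could look like using client-go's dynamic client against the FlyteWorkflow GVR from the log above. The function name, label selector, and page size are assumptions for this sketch, not FlytePropeller's actual GC code:

```go
package gc

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// flyteWorkflowGVR identifies the FlyteWorkflow CRD seen in the log above.
var flyteWorkflowGVR = schema.GroupVersionResource{
	Group:    "flyte.lyft.com",
	Version:  "v1alpha1",
	Resource: "flyteworkflows",
}

// deleteTerminatedChunk asks the API server for at most chunkSize terminated
// workflows in one namespace and deletes them one by one. Calling this once
// per GC tick bounds the size of every List request, so a namespace with a
// huge backlog is drained over several ticks instead of timing out a single
// unbounded List. The "termination-status" label selector mirrors the one in
// the log above and is an assumption about how terminated CRDs are labeled.
func deleteTerminatedChunk(ctx context.Context, client dynamic.Interface, namespace string, chunkSize int64) (int, error) {
	list, err := client.Resource(flyteWorkflowGVR).Namespace(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: "termination-status in (terminated)",
		Limit:         chunkSize,
	})
	if err != nil {
		return 0, err
	}
	deleted := 0
	for _, item := range list.Items {
		if err := client.Resource(flyteWorkflowGVR).Namespace(namespace).
			Delete(ctx, item.GetName(), metav1.DeleteOptions{}); err != nil {
			return deleted, err
		}
		deleted++
	}
	return deleted, nil
}
```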

Describe alternatives you've considered

We considered a cron job that manually performs chunked deletion of terminated workflows as a workaround, but we believe this is better fixed in FlytePropeller.

Propose: Link/Inline OR Additional context

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@jeevb jeevb added enhancement New feature or request untriaged This issue has not yet been looked at by the Maintainers labels Feb 12, 2022
@hamersaw
Contributor

hamersaw commented Mar 2, 2022

@jeevb to recap: there are a large number of workflows, and FlytePropeller's management of them overwhelms the k8s API server, resulting in timeouts. These affect not only the informer watch on new FlyteWorkflow CRDs but also the deletion of terminated CRDs. The problem is that this issue will persist until deletion succeeds, and the failure in deletion is cyclic.

Do you have the specific GC error? The GitHub permalink you provided looks like it lists the namespaces so that it can iterate over each one and perform the actual deletion here. Given the timeouts you're experiencing, I think either of these calls could time out, but it might make things easier if we know that only one of them is.

I think your idea of limiting the number of CRDs deleted on each tick makes sense. However, the limit field in ListOptions is not supported through the delete API, so we can either delete all CRDs matching a label selector or delete a single one. This makes things a little more difficult.

In looking for a solution I discovered k8s FlowSchema. It allows assigning priorities to API calls based on simple matching rules, and it is enabled by default in v1.20 (and can be enabled manually on earlier versions). The idea is that we could create a FlowSchema that gives FlyteWorkflow CRD deletion API calls a high priority; those calls would then experience fewer (hopefully no) timeouts, breaking the cycle you're seeing. Do you think this might work? I would be happy to run some tests.
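
To make this concrete, here is a rough sketch of such a FlowSchema expressed with the k8s flowcontrol/v1beta1 Go types (in practice it would be applied as the equivalent YAML manifest). The priority level, matching precedence, service account, and namespace names are assumptions for illustration:

```go
package gc

import (
	flowcontrolv1beta1 "k8s.io/api/flowcontrol/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// flyteWorkflowDeleteSchema routes FlyteWorkflow delete calls from an assumed
// FlytePropeller service account to a higher-priority level so they are less
// likely to be throttled or time out under API server load.
var flyteWorkflowDeleteSchema = &flowcontrolv1beta1.FlowSchema{
	ObjectMeta: metav1.ObjectMeta{Name: "flyteworkflow-deletes"},
	Spec: flowcontrolv1beta1.FlowSchemaSpec{
		PriorityLevelConfiguration: flowcontrolv1beta1.PriorityLevelConfigurationReference{
			Name: "workload-high", // suggested built-in priority level; assumption
		},
		MatchingPrecedence: 500,
		Rules: []flowcontrolv1beta1.PolicyRulesWithSubjects{{
			Subjects: []flowcontrolv1beta1.Subject{{
				Kind: flowcontrolv1beta1.SubjectKindServiceAccount,
				ServiceAccount: &flowcontrolv1beta1.ServiceAccountSubject{
					Namespace: "flyte",          // assumed deployment namespace
					Name:      "flytepropeller", // assumed service account name
				},
			}},
			ResourceRules: []flowcontrolv1beta1.ResourcePolicyRule{{
				Verbs:      []string{"delete", "deletecollection"},
				APIGroups:  []string{"flyte.lyft.com"},
				Resources:  []string{"flyteworkflows"},
				Namespaces: []string{flowcontrolv1beta1.NamespaceEvery},
			}},
		}},
	},
}
```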

@hamersaw hamersaw self-assigned this Mar 2, 2022
@hamersaw hamersaw removed the untriaged This issue has not yet been looked at by the Maintainers label Mar 2, 2022
@github-actions

Hello 👋, This issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will close the issue if we detect no activity in the next 7 days. Thank you for your contribution and understanding! 🙏

@github-actions github-actions bot added the stale label Aug 28, 2023
@hamersaw
Contributor

Commenting to keep open.

@github-actions github-actions bot removed the stale label Aug 31, 2023
@github-actions

Hello 👋, this issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will engage on it to decide if it is still applicable.
Thank you for your contribution and understanding! 🙏

@github-actions github-actions bot added the stale label May 28, 2024
@kumare3
Contributor

kumare3 commented May 30, 2024

I do not think this is valid anymore, as we now simply apply label-based deletions and let them propagate in the background.
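
For reference, a label-scoped deletion with background propagation along these lines might look roughly like this with the dynamic client; the label selector and propagation policy are assumptions rather than the exact calls FlytePropeller makes:

```go
package gc

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// deleteTerminatedByLabel issues a single DeleteCollection call scoped by a
// label selector and lets the API server propagate the deletions in the
// background, instead of listing and deleting objects individually.
func deleteTerminatedByLabel(ctx context.Context, client dynamic.Interface, namespace string) error {
	propagation := metav1.DeletePropagationBackground
	gvr := schema.GroupVersionResource{Group: "flyte.lyft.com", Version: "v1alpha1", Resource: "flyteworkflows"}
	return client.Resource(gvr).Namespace(namespace).DeleteCollection(
		ctx,
		metav1.DeleteOptions{PropagationPolicy: &propagation},
		metav1.ListOptions{LabelSelector: "termination-status in (terminated)"},
	)
}
```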

@github-actions github-actions bot removed the stale label May 31, 2024