[BUG] Flyte array plugin fails with "ResourceDeletedExternally" #5272
Comments
To address the "ResourceDeletedExternally" error in Flyte when using the array plugin, consider enabling finalizers in your Flyte deployment. Finalizers prevent Kubernetes from garbage collecting resources prematurely, allowing Flyte to check a pod's final status before it is cleaned up. This issue often arises when pods are deleted externally by Kubernetes operations such as node draining or resource-pressure eviction, which finalizers can mitigate. (This is an AI-generated response from RunLLM.)
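For reference, finalizer injection is controlled by flytepropeller's k8s plugin configuration. Below is a minimal sketch; where exactly this lives (the propeller configmap or your Helm values) depends on your deployment, and per the follow-up below the reporter already appears to have it enabled.

# flytepropeller k8s plugin configuration (sketch; placement depends on your deployment)
plugins:
  k8s:
    # Ask propeller to add a finalizer to every Pod it creates, so the Pod is not
    # garbage collected before propeller has recorded its terminal status.
    inject-finalizer: true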
IIUC you already have finalizers enabled and they are being set on the Pods created by the map task, per this conversation. The finalizer is meant to ensure that the Pod is not deleted until flytepropeller removes the finalizer. If Pods are being cleaned up even though the finalizer exists, this suggests that k8s is under very heavy load and is not respecting the finalizers. Do you have a sense of the order of magnitude of concurrent Pod executions when this behavior occurs? It would explain why reproducing this is difficult.
Is it possible? I think that even with a very heavy load k8s should respect the finalizers.
Thousands of pods, more or less.
Yes, we see it relatively frequently with a massive number of Pods in a k8s cluster. My intuition is that this is exactly what is happening here. A potential solution is to configure Flyte to execute over multiple k8s clusters to reduce the number of Pods per cluster.
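For context, "multiple k8s clusters" here refers to Flyte's multi-cluster deployment mode, where a single control plane (flyteadmin) routes executions to several data-plane clusters so that no single cluster has to run all the Pods. A rough sketch of the flyteadmin clusters configuration follows; the cluster name, label, endpoint, and credential paths are placeholders, and the exact schema should be checked against the Flyte multi-cluster documentation for your version.

# flyteadmin multi-cluster configuration (sketch with placeholder values)
clusters:
  labelClusterMap:
    # Executions labeled "team1" are routed to dataplane_1; spreading labels across
    # clusters reduces the number of concurrent Pods per cluster.
    team1:
      - id: dataplane_1
        weight: 1
  clusterConfigs:
    - name: dataplane_1
      endpoint: dns:///dataplane-1.example.com:443   # placeholder data-plane endpoint
      enabled: true
      auth:
        type: file_path
        tokenPath: /var/run/credentials/token   # token for a service account on the data plane
        certPath: /var/run/credentials/cacert   # CA cert for the data-plane API server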
Describe the bug
In some workflow executions using map_task, the workflow fails with errors like:
[498]: code:"ResourceDeletedExternally" message:"resource not found, name [ingestion-pipeline-production/jtep6yfsp4re7teyxnfs-n4-0-498]. reason: pods \"jtep6yfsp4re7teyxnfs-n4-0-498\" not found"
[500]: code:"ResourceDeletedExternally" message:"resource not found, name [ingestion-pipeline-production/jtep6yfsp4re7teyxnfs-n4-0-500]. reason: pods \"jtep6yfsp4re7teyxnfs-n4-0-500\" not found"
[501]: code:"ResourceDeletedExternally" message:"resource not found, name [ingestion-pipeline-production/jtep6yfsp4re7teyxnfs-n4-0-501]. reason: pods \"jtep6yfsp4re7teyxnfs-n4-0-501\" not found"
[502]: code:"ResourceDeletedExternally" message:"resource not found, name [ingestion-pipeline-production/jtep6yfsp4re7teyxnfs-n4-0-502]. reason: pods \"jtep6yfsp4re7teyxnfs-n4-0-502\" not found"
[503]: code:"ResourceDeletedExternally" message:"resource not found, name [ingestion-pipeline-production/jtep6yfsp4re7teyxnfs-n4-0-503]. reason: pods \"jtep6yfsp4re7teyxnfs-n4-0-503\" not found"
... and many more.
This seems to be happening because the pod is being removed before the final status is checked by the propeller.
The pods mentioned didn't fail. They ran normally, but it seems they are being cleaned up before the propeller reads their Succeeded final status.
Expected behavior
The Flyte array plugin should not fail because of this. The pod's final status should be checked before the pod is cleaned up.
Additional context to reproduce
No response
Screenshots
No response
Are you sure this issue hasn't been raised already?
Have you read the Code of Conduct?