Add _request_timeout to Kubernetes watches #15744

Merged: 3 commits, Oct 17, 2024

@kevingrismore kevingrismore commented Oct 17, 2024

Closes #15622

This is a workaround for an issue in kubernetes_asyncio, where requests default to a 5-minute timeout. There is an open PR upstream to resolve it, but it seems unlikely to be released before December. In the meantime, we can override the 5-minute timeout by passing a ClientTimeout into _request_timeout on our watches.

This enables timeout_seconds to once again do its job of enforcing timeouts on watches.
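
For readers unfamiliar with the two timeouts, here is a minimal sketch of the idea, not the actual diff in this PR. It assumes that an `aiohttp.ClientTimeout()` constructed with no arguments imposes no client-side limit, so the server-side `timeout_seconds` becomes the only bound on the watch. The `watch_kwargs` and `watch_pods` helpers are hypothetical names used only for illustration.

```python
import asyncio
from typing import Any


def watch_kwargs(namespace: str, timeout_seconds: int, request_timeout: Any) -> dict[str, Any]:
    """Build keyword arguments for a pod watch (hypothetical helper).

    timeout_seconds bounds the watch on the server side; _request_timeout is
    the client-side HTTP timeout that otherwise falls back to the library
    default described above.
    """
    return {
        "namespace": namespace,
        "timeout_seconds": timeout_seconds,
        "_request_timeout": request_timeout,
    }


async def watch_pods(namespace: str, timeout_seconds: int) -> None:
    # Third-party imports are kept local so the helper above can be used
    # without a cluster or these packages installed.
    import aiohttp
    from kubernetes_asyncio import client, config, watch

    await config.load_kube_config()
    async with client.ApiClient() as api:
        core = client.CoreV1Api(api)
        # Assumption: ClientTimeout() with no arguments sets no client-side cap.
        kwargs = watch_kwargs(namespace, timeout_seconds, aiohttp.ClientTimeout())
        async with watch.Watch().stream(core.list_namespaced_pod, **kwargs) as stream:
            async for event in stream:
                print(event["type"], event["object"].metadata.name)


if __name__ == "__main__":
    # 1200 seconds matches the 20-minute example used in this PR description.
    asyncio.run(watch_pods("default", 1200))
```

Running it requires a reachable cluster and a local kubeconfig; the namespace `"default"` is only a placeholder.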

Logs from a flow run started with a CPU request that could not be scheduled on my cluster and pod_watch_timeout_seconds set to 20 minutes:

23:44:49.737 | INFO    | prefect.flow_runs.worker - Completed submission of flow run 'd957920d-ef3d-4c26-a148-a287d037e0d8'
00:04:49.626 | ERROR   | prefect.flow_runs.worker - Job 'proud-coati-jh4qw': Pod never started.
00:04:49.657 | INFO    | prefect.flow_runs.worker - Job event 'SuccessfulCreate' at 2024-10-16 23:44:49+00:00: Created pod: proud-coati-jh4qw-dsmxl
00:04:49.661 | INFO    | prefect.flow_runs.worker - Pod event 'NotTriggerScaleUp' (92 times) at 2024-10-16 23:59:59+00:00: pod didn't trigger scale-up: 1 max node group size reached
00:04:49.666 | INFO    | prefect.flow_runs.worker - Pod event 'FailedScheduling' (5 times) at 2024-10-17 00:00:18+00:00: 0/3 nodes are available: 3 Insufficient cpu, 3 Insufficient memory. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod.
00:04:49.830 | INFO    | prefect.flow_runs.worker - Reported flow run 'd957920d-ef3d-4c26-a148-a287d037e0d8' as crashed: Flow run infrastructure exited with non-zero status code -1.
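
As a quick sanity check on the logs above, the gap between the submission line and the "Pod never started" error can be computed from the two timestamps. This is a small sketch; `elapsed_seconds` is a hypothetical helper, and it assumes the second timestamp falls on the following day when it is numerically earlier than the first.

```python
from datetime import datetime, timedelta


def elapsed_seconds(start: str, end: str) -> float:
    """Return seconds between two HH:MM:SS.mmm timestamps, allowing one midnight rollover."""
    fmt = "%H:%M:%S.%f"
    t0 = datetime.strptime(start, fmt)
    t1 = datetime.strptime(end, fmt)
    if t1 < t0:
        # The log has no date, so an earlier clock time is treated as the next day.
        t1 += timedelta(days=1)
    return (t1 - t0).total_seconds()


# Timestamps copied from the submission and error lines in the log above.
elapsed = elapsed_seconds("23:44:49.737", "00:04:49.626")
print(f"{elapsed:.3f} s, about {elapsed / 60:.1f} minutes")
```

The result is just under 1200 seconds, which is consistent with the configured 20-minute timeout rather than the 5-minute default.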

Checklist

  • This pull request references any related issue by including "closes <link to issue>"
    • If no issue exists and your change is not a small fix, please create an issue first.
  • If this pull request adds new functionality, it includes unit tests that cover the changes
  • If this pull request removes docs files, it includes redirect settings in mint.json.
  • If this pull request adds functions or classes, it includes helpful docstrings.

@github-actions github-actions bot added the bug Something isn't working label Oct 17, 2024
@kevingrismore kevingrismore marked this pull request as ready for review October 17, 2024 18:08
@zzstoatzz zzstoatzz left a comment

lgtm!

@kevingrismore kevingrismore merged commit 55380bc into main Oct 17, 2024
18 checks passed
@kevingrismore kevingrismore deleted the k8s-async-timeout branch October 17, 2024 18:24
@NicholasFiorentini
@kevingrismore FYI: https://github.com/tomplus/kubernetes_asyncio/releases/tag/31.1.1 has been released. Should Prefect upgrade the dependencies and roll back to the previous code? This PR may be related to #16210.

Development

Successfully merging this pull request may close these issues.

Kubernetes worker: TimeoutError during pod creation