Prefect-GCP: Implement Retry Logic for Transient Errors with Cloud Run V2 #16448
Labels
enhancement
An improvement of an existing feature
good first issue
This issue is good for newcomers
integrations
Related to integrations with other services
Describe the current behavior
When the Cloud Run V2 worker submits a job for a Prefect flow run and the submission hits a transient error, such as an HTTP 503 status code, the entire flow run is marked as crashed. No retry is attempted, which leads to false alerts and unnecessary manual intervention.
Describe the proposed behavior
It would be beneficial to implement automatic retries for the callers of JobV2.create, for example _create_job_and_wait_for_registration within the Prefect-GCP integration. This should include a mechanism to handle transient errors by retrying the submission a configurable number of times with exponential backoff. This approach aligns with best practices for managing resources prone to transient errors.
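As a rough illustration of the proposal, here is a minimal sketch of a retry helper with exponential backoff. It is not the Prefect-GCP implementation: `retry_with_backoff`, `TransientSubmissionError`, and all parameter names are hypothetical, and the real integration would need to decide which exceptions from the Google API client count as transient (for example, those carrying a 503 status).

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


class TransientSubmissionError(Exception):
    """Hypothetical stand-in for a retryable error such as an HTTP 503."""


def retry_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    jitter: float = 0.1,
    retry_on: Tuple[Type[BaseException], ...] = (TransientSubmissionError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func(), retrying on the given exception types.

    The wait before retry k (1-based) is base_delay * 2**(k - 1), capped at
    max_delay, plus up to `jitter` times that value of random extra delay.
    Errors not listed in retry_on propagate immediately.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                # Retries exhausted: surface the last error to the caller.
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay * jitter))

    raise AssertionError("unreachable")
```

Under these assumptions, a caller such as _create_job_and_wait_for_registration could wrap the job-creation call in a zero-argument function and pass it to the helper, with `max_attempts` and `base_delay` exposed as worker configuration. The `sleep` parameter is injectable only so the backoff schedule can be checked without actually waiting.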
Example Use
No response
Additional context
We are experiencing a high number of flow crashes attributed to 503 errors. Our current workaround is an "Automation" that restarts these crashed flows, but it does not differentiate between actual job failures and these infrastructure-related interruptions. It also creates false alert notifications for crashed jobs. Implementing this feature would significantly reduce false alarms, improve the robustness of our automated workflows, and avoid "negative data engineering"!