Prefect-GCP: Implement Retry Logic for Transient Errors with Cloud Run V2 #16448

IsaacDayan · 2024-12-19T10:07:46Z

Describe the current behavior

When utilizing the Cloud Run V2 worker for Prefect flows, if the job submission encounters a transient error such as a 503 HTTP status code, the entire flow run is marked as crashed. This situation does not trigger any retries, leading to false alerts and unnecessary manual intervention.

Describe the proposed behavior

It would be beneficial to implement automatic retries for the callers of JobV2.create, for example _create_job_and_wait_for_registration within the Prefect-GCP integration. This should include a mechanism to handle transient errors by retrying the submission a configurable number of times with exponential backoff. This approach aligns with best practices for managing resources prone to transient errors.

Example Use

No response

Additional context

We are experiencing a high number of flow crashes attributed to 503 errors. Our current workaround involves an "Automation" to restart these crashed flows, but it does not differentiate between actual job failures and these infrastructure-related interruptions. It also create false alert notification on crashed jobs. Implementing this feature would significantly reduce false alarms and improve the robustness of our automated workflows - and avoid "Negative data engineering"!

tav-singh · 2024-12-23T23:12:00Z

Hi, I can pick this up.

IsaacDayan added the enhancement An improvement of an existing feature label Dec 19, 2024

zzstoatzz added integrations Related to integrations with other services good first issue This issue is good for newcomers labels Dec 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Prefect-GCP: Implement Retry Logic for Transient Errors with Cloud Run V2 #16448

Prefect-GCP: Implement Retry Logic for Transient Errors with Cloud Run V2 #16448

IsaacDayan commented Dec 19, 2024

tav-singh commented Dec 23, 2024

Prefect-GCP: Implement Retry Logic for Transient Errors with Cloud Run V2 #16448

Prefect-GCP: Implement Retry Logic for Transient Errors with Cloud Run V2 #16448

Comments

IsaacDayan commented Dec 19, 2024

Describe the current behavior

Describe the proposed behavior

Example Use

Additional context

tav-singh commented Dec 23, 2024