Prefect-GCP: Implement Retry Logic for Transient Errors with Cloud Run V2 #16448

Open
IsaacDayan opened this issue Dec 19, 2024 · 1 comment
Labels
enhancement An improvement of an existing feature good first issue This issue is good for newcomers integrations Related to integrations with other services


@IsaacDayan

Describe the current behavior

When utilizing the Cloud Run V2 worker for Prefect flows, if the job submission encounters a transient error such as a 503 HTTP status code, the entire flow run is marked as crashed. This situation does not trigger any retries, leading to false alerts and unnecessary manual intervention.

Describe the proposed behavior

It would be beneficial to implement automatic retries for the callers of JobV2.create, for example _create_job_and_wait_for_registration within the Prefect-GCP integration. This should include a mechanism to handle transient errors by retrying the submission a configurable number of times with exponential backoff. This approach aligns with best practices for managing resources prone to transient errors.
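To make the request concrete, here is a minimal sketch of the kind of retry wrapper I have in mind. It is not taken from the prefect-gcp codebase: `HttpStatusError`, `retry_transient`, and all parameter names and defaults are hypothetical, and the real integration would need to catch whatever exception the Google API client actually raises.

```python
import random
import time
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


class HttpStatusError(Exception):
    """Hypothetical stand-in for the HTTP error raised by the API client."""

    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


def retry_transient(
    func: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    retry_statuses: Iterable[int] = (429, 500, 502, 503, 504),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on transient HTTP statuses with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    statuses = frozenset(retry_statuses)
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except HttpStatusError as exc:
            # Re-raise non-transient errors, and give up after the last attempt.
            if exc.status not in statuses or attempt == max_attempts:
                raise
            # Delay doubles each attempt (1s, 2s, 4s, ...), capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Add up to 10% jitter so concurrent workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("unreachable")
```

A caller would wrap the submission in a zero-argument callable, for example `retry_transient(lambda: submit_job(), max_attempts=5)`, where `submit_job` is a placeholder for the code path that ends up calling JobV2.create. The attempt count and delays are the values I would expect to be configurable on the worker.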

Example Use

No response

Additional context

We are experiencing a high number of flow crashes attributed to 503 errors. Our current workaround involves an "Automation" to restart these crashed flows, but it does not differentiate between actual job failures and these infrastructure-related interruptions. It also creates false alert notifications for crashed jobs. Implementing this feature would significantly reduce false alarms, improve the robustness of our automated workflows, and avoid "Negative data engineering"!

@IsaacDayan IsaacDayan added the enhancement An improvement of an existing feature label Dec 19, 2024
@zzstoatzz zzstoatzz added integrations Related to integrations with other services good first issue This issue is good for newcomers labels Dec 20, 2024
@tav-singh

Hi, I can pick this up.
