Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missed jobs scheduler not working for retries #21

Open
StoneFrog opened this issue Jun 15, 2023 · 6 comments
Open

Missed jobs scheduler not working for retries #21

StoneFrog opened this issue Jun 15, 2023 · 6 comments
Labels
bug Something isn't working documentation Improvements or additions to documentation

Comments

@StoneFrog
Copy link
Collaborator

StoneFrog commented Jun 15, 2023

According to docs:

Missed Jobs Periodical Handler
Recommended especially when you don't use Sidekiq Pro's super_fetch. If you dequeue job from Redis and the process is killed (by OOM, for example) then good luck with having the job finished. However, if the job is stored in Postgres, this is not an issue. You can just look for the jobs that look as if they were missed and re-enqueue them. Periodically.

That being said it works only for the case when OOM would happen on first attempt.
If first attempt fails, then failed_at will be set. Then on retry OOM happens. Missed Jobs Scheduler uses:

def missed_jobs(missed_job_policy:)
      jobs_database
        .where(completed_at: nil, dropped_at: nil, failed_at: nil)
        .select { |potentially_missed_job| missed_job_policy.call(potentially_missed_job) }
    end

to find missed jobs, so as failed_at was assigned - it's never rescheduled.

As I understand the biggest challenge is to figure out which missed job qualifies due to backoff, so perhaps this scenario should be addressed on specific application knowing it's internals. On the other hand whenever it's supposed to be retried we could bump execute_at to match next attempt. That way we could easily find jobs that were supposed to be executed already?
That being said - we should at least mention that in documentation as right now it can give false confidence in reliability of this gem.

@StoneFrog StoneFrog added bug Something isn't working documentation Improvements or additions to documentation labels Jun 15, 2023
@Azdaroth
Copy link
Member

@StoneFrog How about explicitly assigning "missed" label when rescued for the first time, so that we could use or query, like jobs_database.where(completed_at: nil, dropped_at: nil, failed_at: nil).or(jobs_database.where(completed_at: nil, dropped_at: nil, missed_job: true))? looking for missed jobs is executed every few hours, so having some potential conflicts and risk of duplication and processing the same job twice etc. is very low

@StoneFrog
Copy link
Collaborator Author

@Azdaroth what do you mean with "when rescued"? When failed_at is set for the first time?

Assuming it was actually a legit non-recoverable failure won't that interfere with retry? i.e. even though regular sidekiq would push that to retry queue with exponential backoff, we will be executing that every time missed job scheduler is executed?

@Azdaroth
Copy link
Member

@StoneFrog By "rescued" I mean the one that were picked Missed Jobs Handler.

Maybe it would interfere, but what's the risk here? Some jobs might be executed a couple of times and fail each time I guess (otherwise, it would be executed just once) and it would happen only to the one that were labeled as such.

This of course depends on the frequency of the OOMs but they should not be that common to make more than few % of all jobs (which would already be massive but probably we could live with that).

@StoneFrog
Copy link
Collaborator Author

@Azdaroth But that problem is that Missed Job Handler never picked them.

@Azdaroth
Copy link
Member

@StoneFrog that's why we would need to introduce some extra labelling and change the query. Wouldn't that be enough ?

@StoneFrog
Copy link
Collaborator Author

@Azdaroth Maybe I don't fully get you proposal, but you have said that:
you want to label it when "rescued" and by rescued you mean the one that were picked Missed Jobs Handler.

But in described case they will never be picked by handler, so they will never be labeled and this query won't change a thing 😛

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working documentation Improvements or additions to documentation
Projects
None yet
Development

No branches or pull requests

2 participants