Missed jobs scheduler not working for retries #21
Comments
@StoneFrog How about explicitly assigning a "missed" label when a job is rescued for the first time, so that we could use
@Azdaroth What do you mean by "when rescued"? When Assuming it was actually a legitimate non-recoverable failure, won't that interfere with retries? I.e., even though regular Sidekiq would push the job to the retry queue with exponential backoff, we would be executing it every time the missed jobs scheduler runs?
@StoneFrog By "rescued" I mean the jobs that were picked up by the Missed Jobs Handler. Maybe it would interfere, but what's the risk here? Some jobs might be executed a couple of times and fail each time, I guess (otherwise, they would be executed just once), and it would happen only to the ones that were labeled as such. This of course depends on the frequency of the OOMs, but they should not be common enough to affect more than a few % of all jobs (which would already be massive, but we could probably live with that).
@Azdaroth But the problem is that the Missed Jobs Handler never picked them up.
@StoneFrog That's why we would need to introduce some extra labelling and change the query. Wouldn't that be enough?
@Azdaroth Maybe I don't fully get your proposal, but you have said that: But in the described case they will never be picked up by the handler, so they will never be labeled, and this query won't change a thing 😛
According to docs:
That said, this works only when the OOM happens on the first attempt.
If the first attempt fails, `failed_at` is set. Suppose an OOM then happens on the retry. Missed Jobs Scheduler uses: to find missed jobs, so once `failed_at` has been assigned, the job is never rescheduled.
As I understand it, the biggest challenge is figuring out which missed jobs qualify given the backoff, so perhaps this scenario should be addressed in the specific application, which knows its own internals. On the other hand, whenever a job is supposed to be retried, we could bump `execute_at` to match the next attempt. That way we could easily find jobs that were supposed to have been executed already?

That being said, we should at least mention this in the documentation, as right now it can give false confidence in the reliability of this gem.
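
To make the `execute_at` idea concrete, here is a minimal plain-Ruby sketch. It is not the gem's actual code: the `Job` struct, `backoff_seconds`, `register_failure`, `missed_jobs`, the `retry_count` and `finished_at` fields, and the 60-second grace period are all hypothetical names and values chosen for illustration, and the backoff is a simplified, jitter-free stand-in rather than Sidekiq's real formula.

```ruby
# Hypothetical in-memory model of a tracked job. Only failed_at and
# execute_at come from this issue; the other fields are assumptions.
Job = Struct.new(:id, :execute_at, :failed_at, :retry_count, :finished_at,
                 keyword_init: true)

# Simplified, deterministic stand-in for an exponential backoff (no jitter).
# Assumption: the real delay would have to mirror whatever Sidekiq applies.
def backoff_seconds(retry_count)
  (retry_count**4) + 15
end

# On failure, record the failure AND bump execute_at to the time of the
# next expected attempt, instead of leaving it at the original schedule.
def register_failure(job, now)
  job.failed_at = now
  job.retry_count += 1
  job.execute_at = now + backoff_seconds(job.retry_count)
  job
end

# A job counts as missed when it is unfinished and overdue by more than the
# grace period. failed_at is deliberately not part of the condition, so a
# job that dies during a retry is still found.
def missed_jobs(jobs, now, grace: 60)
  jobs.select { |job| job.finished_at.nil? && job.execute_at + grace < now }
end
```

With this shape, a job whose retry was killed by an OOM keeps an `execute_at` in the past and no `finished_at`, so it shows up in `missed_jobs` once the grace period has elapsed, while a job that is still waiting out its backoff does not.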