queue worker jobs error handling strategy. #14
Comments
Comment by woodsaj
As far as I am concerned, any job that completes processing should be acked, regardless of whether an error was encountered or not. The purpose of the alerting jobs is to determine the state of something at a specific time. I think logging that you could not determine the state is more important than back-filling data; that is the whole reason for having an "unknown" state.
If we do want to handle brief transient errors, then we can either perform the retry inline, X number of times, before the error state escalates to fatal (e.g. perhaps we got a timeout because a backend node just died; retrying immediately will connect us to a different backend). Alternatively, instead of NACK'ing a job on error, we can queue a new job that includes some metadata about the previous errors and how many times the job has been tried, and ACK the original job.
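(A minimal sketch of the inline-retry option described above, in Go. `executeJob`, the retry limit, and the backoff are hypothetical stand-ins, not code from this repo.)

```go
package alerting

import (
	"fmt"
	"time"
)

const maxInlineRetries = 3

// executeJob is a stand-in for the real alerting job executor
// (query graphite, evaluate the expression, record the state).
func executeJob(jobID string) error {
	// ...
	return nil
}

// runWithRetries retries the job a fixed number of times before the error
// escalates to fatal.
func runWithRetries(jobID string) error {
	var err error
	for attempt := 1; attempt <= maxInlineRetries; attempt++ {
		if err = executeJob(jobID); err == nil {
			return nil
		}
		// A brief pause before retrying; an immediate retry may simply be
		// load-balanced to a different backend node.
		time.Sleep(time.Duration(attempt) * time.Second)
	}
	return fmt.Errorf("job %s failed after %d attempts: %v", jobID, maxInlineRetries, err)
}
```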
Comment by Dieterbe
I like these points. So the question then becomes which types of errors should cause the job to "complete" with status=unknown? Currently there's a whole class of errors (like graphite query failures) that bail out early with an error. Are you saying that basically any error encountered should cause us to try exiting with status=unknown as opposed to bailing out?
The problem with putting the retry in the actual job executor is that it blocks the pipeline for other jobs that need processing. We typically want to keep this timeframe short, which also makes it less likely that certain types of errors get resolved within the short retry interval (I know that if it's just an issue on one backend, getting load-balanced to a different one should fix it, but that's not a robust solution covering all kinds of failure scenarios, especially ones that need time to resolve). The re-queue-with-updated-metadata approach lets us build in a solid retry interval during which we can work on other jobs. But then we have to worry about what happens when we can't store a new job in the queue, or when we lose the connection right after putting in the new job but before we can ack the current message: then we'll have duplicates in the queue. (Why don't work queues have a built-in mechanism for incrementing attempt numbers and retrying a few times? Or do they?)
- use new rabbitmq pubsub code. fixes #34, fixes #27
- remove executor goroutines in favor of just spawning as many goroutines as are needed to handle the alerting load.
- don't retry failed jobs. The code that was in place didn't work due to the LRU cache: retried jobs would just be skipped because they were already in the cache. issue #14
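(A sketch of why the old retry path was a no-op, assuming a hashicorp/golang-lru-style dedup cache keyed by job ID; the actual cache and dispatcher code in the repo may differ.)

```go
package alerting

import lru "github.com/hashicorp/golang-lru"

type dispatcher struct {
	seen *lru.Cache // keyed by job ID, used to drop duplicate deliveries
}

func newDispatcher(size int) (*dispatcher, error) {
	c, err := lru.New(size)
	if err != nil {
		return nil, err
	}
	return &dispatcher{seen: c}, nil
}

func (d *dispatcher) dispatch(jobID string) {
	if d.seen.Contains(jobID) {
		// A retried job carries the same ID as the failed attempt, so it is
		// treated as a duplicate and silently skipped.
		return
	}
	d.seen.Add(jobID, struct{}{})
	// ... hand the job to a worker goroutine ...
}
```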
Issue by Dieterbe
Friday Jul 24, 2015 at 15:16 GMT
Originally opened as raintank/grafana#367
What follows is partly a rant about the state of worker queue systems (to the extent I'm familiar with the space; hopefully I'm wrong), but also partly a (hopefully well-written) piece in which I highlight issues I've seen around worker queue semantics and their effects, which, while often subtle, can have significant impacts. I want to get us on the same page and design for deliberate behavior that is well understood across the various failure scenarios, and as hassle-free as possible.
I tried to keep it short but if it's still too long, you can skip to the last section.
My examples are around worker jobs for alerting that need to query graphite via the bosun library for data, but obviously the same concepts apply with other examples.
the 3 error categories
3 classes of errors that can appear during execution of a worker job:
A: can't connect to graphite. Could be fatal (misconfigured connection string) or non-fatal (the graphite host is temporarily down and needs restarting, manually or by an init system).
B: can't parse the JSON response from graphite. If it's due to an error such as "series does not exist", it is pretty fatal when the series name is badly specified, but the job might succeed later once the series gets created. Or, if the error is caused by a bug in the worker or in graphite that could be fixed and deployed fairly quickly, you can see this error as non-fatal.
C: a function specified in a bosun/graphite expression does not exist. Perhaps there is a typo (fatal), or perhaps we forgot to deploy the code that provides the new function (non-fatal).
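(A hedged sketch of classifying job errors along these categories so the executor can decide between retrying later and finishing with status=unknown. The error values and the transient/fatal split are illustrative only, not the project's actual policy.)

```go
package alerting

import (
	"errors"
	"net"
)

var (
	errSeriesNotFound  = errors.New("series does not exist")    // category B
	errUnknownFunction = errors.New("unknown function in expr") // category C
)

// isTransient reports whether a job error looks worth retrying later.
func isTransient(err error) bool {
	var netErr net.Error
	switch {
	case errors.As(err, &netErr):
		// category A: connection or timeout problems may clear up on retry.
		return true
	case errors.Is(err, errSeriesNotFound):
		// category B: may resolve once the series gets created.
		return true
	case errors.Is(err, errUnknownFunction):
		// category C: illustrative choice here is to treat it as fatal.
		return false
	default:
		return false
	}
}
```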
message acknowledgement behavior based on error category
The traditional notion that I'm used to is simply: ack the job when it succeeds, nack it (so it gets requeued and retried) when it fails.
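(For reference, a minimal sketch of that traditional pattern, assuming a streadway/amqp-style consumer; `processJob` is hypothetical.)

```go
package alerting

import "github.com/streadway/amqp"

// consume acks successful jobs and nacks failed ones back onto the queue.
func consume(deliveries <-chan amqp.Delivery, processJob func([]byte) error) {
	for d := range deliveries {
		if err := processJob(d.Body); err != nil {
			// Nack with requeue=true: the broker hands the job out again.
			d.Nack(false, true)
			continue
		}
		d.Ack(false)
	}
}
```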
But this breaks down for a few reasons:
I don't recall ever seeing a worker queueing system that takes all these issues into account and provides insights (statistical overviews and breakdowns of which jobs have been retried, for how long or how many times) and tools (dropping a bunch of jobs based on fine-grained criteria, e.g. scheduled between x and y AND whose resulting error was foo). It sounds like I want to store structured metadata for jobs and have an SQL interface to that metadata. Currently we all just seem to log errors to a log, which is completely decoupled from the metadata about the job inside the queue. For example, it might make more sense to store "bad-smtp-server: can't connect" in the job metadata in the queue, so that we could
select * from job where error = "bad-smtp-server: can't connect"
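(A sketch of what such structured, queryable job metadata could look like; the field set below is invented for illustration, not taken from any existing schema.)

```go
package alerting

import "time"

// JobRecord is the kind of row the hypothetical "job" table above might hold,
// so failures can be filtered with queries like the one shown.
type JobRecord struct {
	ID          int64
	ScheduledAt time.Time
	Attempts    int
	LastError   string // e.g. "bad-smtp-server: can't connect"
	State       string // e.g. "pending", "ok", "unknown", "failed"
}
```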
So where does this leave raintank and Grafana?
Luckily, for the case of alerting jobs, and with our current simplistic system, stale jobs are not very useful: we mostly care about the latest state because we only do critical notifications, and stale jobs are only useful for filling in gaps in the health metrics.
So we can basically do one of two things:
Also, this shortcut no longer provides solace once we support warning emails for older data.
Luckily, at least the workers have access to a rich SQL store with extensive data that lets us make better-educated decisions (like the latest state seen, etc.).
I'm personally a fan of the latter approach.