Better failure reporting from database queue #119
I'm curious for your thoughts on #119 and #118.
We recently encountered a heavy load of `EXCLUSIVE` table locks on PPM. We have a complicated workflow for completing a queue job. Basically, it boils down to something like this:

[...] `queue_failure` table. If the job ended in error, insert a new record into `queue_failure`.

Meanwhile, the node that requested the asset from the cache is running a polling loop. This loop:
a. A notification saying that work with address is complete. When that's received, it polls for the asset again.
b. A notification saying that chunks of the work for address are completed. When that's received, it doesn't poll again, but simply returns so we can start serving the chunks.
c. A `time.Ticker` to fire. We added this just in case we never get the completion notification. In this case, we poll again.

In the above polling loop, each time we check the cache for the asset, we:
[...] `queue_failure` table to see if a failure is recorded for the work. If so, return the error.

I feel like our notification mechanism is robust and proven enough that we may be able to eliminate all of this complication, including the `queue_failure` table. The PRs above update the "work complete" notification to also include the error if the work ended in error. If a node misses the notification, it'll still periodically poll and pick up the asset if it gets created without an error (same as before).

The only difference will be if the work completes in error, but the notification isn't received. In that case, an actor that's waiting on the asset may poll and find that the work is completed, but the asset won't be found. We already handle this scenario with an error that reports something like "the queue reported that x work is complete, but the item was not found in the cache".
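
To make the proposal concrete, here is a rough sketch of what the waiting side could look like once the completion notification carries the error. This is not the code from #119 or #118. `Completion`, `Cache`, `mapCache`, `ErrNotInCache`, and `waitForAsset` are hypothetical names, and delivering notifications over Go channels is an assumption made for illustration only. The explicit timeout is also an addition so the example always terminates.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Completion is a hypothetical "work complete" notification payload.
// Err is empty on success and carries the failure message otherwise.
type Completion struct {
	Address string
	Err     string
}

// Cache is a hypothetical stand-in for the asset cache.
type Cache interface {
	Get(address string) ([]byte, bool)
}

// mapCache is a trivial in-memory Cache used only for the demo below.
type mapCache map[string][]byte

func (m mapCache) Get(address string) ([]byte, bool) {
	asset, ok := m[address]
	return asset, ok
}

// ErrNotInCache mirrors the existing "complete, but not found" error.
var ErrNotInCache = errors.New("the queue reported that work is complete, but the item was not found in the cache")

// waitForAsset polls the cache, then waits for a completion notification,
// a chunk notification, or a ticker. No failure table is consulted: a
// failure arrives inside the notification itself. The boolean result
// reports that chunks are ready to be served.
func waitForAsset(c Cache, address string, done <-chan Completion, chunks <-chan string,
	interval, timeout time.Duration) ([]byte, bool, error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	deadline := time.After(timeout)

	for {
		if asset, ok := c.Get(address); ok {
			return asset, false, nil
		}
		select {
		case n := <-done:
			if n.Address != address {
				continue // notification for some other work
			}
			if n.Err != "" {
				return nil, false, fmt.Errorf("work for %s failed: %s", address, n.Err)
			}
			// Work is reported complete: poll once more.
			if asset, ok := c.Get(address); ok {
				return asset, false, nil
			}
			return nil, false, ErrNotInCache
		case a := <-chunks:
			if a == address {
				return nil, true, nil // don't poll again, start serving chunks
			}
		case <-ticker.C:
			// Fallback in case the notification was missed: poll again.
		case <-deadline:
			return nil, false, fmt.Errorf("timed out waiting for %s", address)
		}
	}
}

func main() {
	done := make(chan Completion, 1)
	done <- Completion{Address: "abc", Err: "render failed"}
	_, _, err := waitForAsset(mapCache{}, "abc", done, nil, 50*time.Millisecond, time.Second)
	fmt.Println(err) // prints "work for abc failed: render failed"
}
```

In this sketch the ticker path only ever finds a successfully created asset. A failure whose notification was missed would have to be detected by whatever check reports that the work is complete, which is left out here.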