Better failure reporting from database queue #119

Open
wants to merge 1 commit into
base: jon-queue-failure

jonyoder
Collaborator

@jonyoder jonyoder commented Aug 1, 2023

I'm curious to hear your thoughts on #119 and #118.

We recently encountered a heavy load of EXCLUSIVE table locks on PPM. The workflow for completing a queue job is complicated. It boils down to something like this:

  1. When the queue agent reaches the end of a job, and the job is for addressed work (typically work for the cache), it
  2. calls a method to finalize/complete the work, which
  3. grabs an exclusive lock, clears any existing failure records for that address in the queue_failure table, and, if the job ended in error, inserts a new record into queue_failure.
  4. After that, a notification goes out to all nodes indicating that work for the address is done.
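
The steps above can be sketched roughly as follows. This is a minimal in-memory illustration, not the actual implementation: `failureStore`, `completeWork`, `lookup`, and `jobError` are hypothetical names, a `sync.Mutex` stands in for the exclusive table lock, and a map stands in for the queue_failure table.

```go
package main

import (
	"fmt"
	"sync"
)

// jobError is a hypothetical error type used only for this illustration.
type jobError string

func (e jobError) Error() string { return string(e) }

// failureStore is a hypothetical in-memory stand-in for the queue_failure table.
type failureStore struct {
	mu       sync.Mutex        // stands in for the exclusive table lock
	failures map[string]string // address -> recorded error message
}

func newFailureStore() *failureStore {
	return &failureStore{failures: make(map[string]string)}
}

// completeWork mirrors steps 2-4: take the lock, clear any existing failure
// for the address, record a new failure if the job ended in error, then
// notify that work for the address is done.
func (s *failureStore) completeWork(address string, jobErr error, notify func(address string)) {
	s.mu.Lock()
	delete(s.failures, address)
	if jobErr != nil {
		s.failures[address] = jobErr.Error()
	}
	s.mu.Unlock()

	// The notification goes out only after the failure record is settled.
	notify(address)
}

// lookup reports the failure recorded for an address, if any.
func (s *failureStore) lookup(address string) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	msg, ok := s.failures[address]
	return msg, ok
}

func main() {
	store := newFailureStore()
	notify := func(address string) { fmt.Println("work complete for", address) }

	// A failed job leaves a failure record behind.
	store.completeWork("asset-1", jobError("render failed"), notify)
	msg, ok := store.lookup("asset-1")
	fmt.Println(msg, ok)

	// A later successful job for the same address clears it.
	store.completeWork("asset-1", nil, notify)
	_, ok = store.lookup("asset-1")
	fmt.Println(ok)
}
```

Every completion, successful or not, passes through the lock in this sketch, which is the contention point described above.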

Meanwhile, the node that requested the asset from the cache is running a polling loop. This loop:

  1. At the start, preemptively checks the cache to see if the asset is already there. If so, it simply returns.
  2. Waits for one of the following:
    a. A notification saying that work for the address is complete. When that's received, it polls for the asset again.
    b. A notification saying that chunks of the work for the address are completed. When that's received, it doesn't poll again, but simply returns so we can start serving the chunks.
    c. A time.Ticker to fire. We added this just in case we never get the completion notification. In this case, we poll again.
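
As a rough sketch, the loop described above could look like the following. The names (`waitForAsset`, `outcome`, the channel parameters) are assumptions made for illustration, and the `stop` channel is an addition so the sketch can be cancelled; the real code may use a different mechanism.

```go
package main

import (
	"fmt"
	"time"
)

// outcome describes why waitForAsset returned (hypothetical type).
type outcome int

const (
	assetReady  outcome = iota // the poll found the asset
	chunksReady                // chunks are available to serve
	stopped                    // the caller asked us to stop
	failed                     // the poll returned an error
)

// waitForAsset polls once up front, then re-polls whenever a work-complete
// notification arrives or the fallback ticker fires. A chunks notification
// returns immediately without polling. interval must be positive.
func waitForAsset(
	poll func() (bool, error),
	workDone, chunksDone, stop <-chan struct{},
	interval time.Duration,
) (outcome, error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		found, err := poll()
		if err != nil {
			return failed, err
		}
		if found {
			return assetReady, nil
		}

		select {
		case <-workDone:
			// Work finished: loop around and poll again.
		case <-ticker.C:
			// Fallback in case the notification never arrives: poll again.
		case <-chunksDone:
			return chunksReady, nil
		case <-stop:
			return stopped, nil
		}
	}
}

func main() {
	workDone := make(chan struct{}, 1)
	workDone <- struct{}{} // simulate a completion notification

	calls := 0
	poll := func() (bool, error) {
		calls++
		return calls > 1, nil // the asset appears on the second poll
	}

	result, err := waitForAsset(poll, workDone, nil, nil, time.Minute)
	fmt.Println(result == assetReady, err, calls)
}
```

Passing a nil channel disables that case in the `select`, since a receive from a nil channel blocks forever.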

In the above polling loop, each time we check the cache for the asset, we:

  1. Check the queue_failure table to see if a failure is recorded for the work. If so, return the error.
  2. If no failure is recorded, return the asset.
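
A minimal sketch of that check, assuming an in-memory `cache` type with hypothetical `failures` and `assets` maps in place of the queue_failure table and the real cache. The `errNotFound` sentinel is also an assumption, covering the case where the asset simply isn't there yet.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical sentinel for "no failure, but no asset yet".
var errNotFound = errors.New("asset not found in cache")

// cache is an illustrative stand-in: failures models the queue_failure
// table, assets models the cached items, both keyed by address.
type cache struct {
	failures map[string]string
	assets   map[string][]byte
}

// checkAsset consults the failure records first, then the cache itself.
func (c *cache) checkAsset(address string) ([]byte, error) {
	if msg, hasFailure := c.failures[address]; hasFailure {
		return nil, fmt.Errorf("work for %s failed: %s", address, msg)
	}
	asset, ok := c.assets[address]
	if !ok {
		return nil, errNotFound
	}
	return asset, nil
}

func main() {
	c := &cache{
		failures: map[string]string{"bad": "render failed"},
		assets:   map[string][]byte{"good": []byte("data")},
	}
	for _, address := range []string{"good", "bad", "missing"} {
		asset, err := c.checkAsset(address)
		fmt.Println(address, string(asset), err)
	}
}
```

Note that every poll pays for the failure lookup in this shape, even when the asset is already cached.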

I feel like our notification mechanism is robust and proven enough that we may be able to eliminate all of this complication, including the queue_failure table. The PRs above update the "work complete" notification to also include the error if the work ended in error.

If a node misses the notification, it'll still poll periodically and pick up the asset if it was created without an error (same as before). The only difference arises when the work completes in error but the notification isn't received. In that case, an actor that's waiting on the asset may poll and find that the work is completed, but the asset won't be found. We already handle this scenario with an error that reports something like "the queue reported that x work is complete, but the item was not found in the cache".
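
To illustrate the idea, here is a hedged sketch of a notification that carries the error. The `workComplete` struct, its fields, and `onWorkComplete` are hypothetical and do not reflect the actual message format in the PRs; the "not found" message is modeled on the existing error quoted above.

```go
package main

import "fmt"

// workComplete is a hypothetical shape for the "work complete" notification.
// Err is empty when the work succeeded.
type workComplete struct {
	Address string
	Err     string
}

// onWorkComplete handles a received notification. If the notification carries
// an error, it is returned directly and no failure table is consulted.
// Otherwise the cache is checked via lookup.
func onWorkComplete(n workComplete, lookup func(address string) ([]byte, bool)) ([]byte, error) {
	if n.Err != "" {
		return nil, fmt.Errorf("work for %s failed: %s", n.Address, n.Err)
	}
	asset, ok := lookup(n.Address)
	if !ok {
		return nil, fmt.Errorf(
			"the queue reported that %s work is complete, but the item was not found in the cache",
			n.Address,
		)
	}
	return asset, nil
}

func main() {
	assets := map[string][]byte{"asset-1": []byte("data")}
	lookup := func(address string) ([]byte, bool) {
		a, ok := assets[address]
		return a, ok
	}

	asset, err := onWorkComplete(workComplete{Address: "asset-1"}, lookup)
	fmt.Println(string(asset), err)

	_, err = onWorkComplete(workComplete{Address: "asset-2", Err: "render failed"}, lookup)
	fmt.Println(err)
}
```

With the error travelling in the notification, the completion path in this sketch no longer needs a lock or a failure record.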
