Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix concurrency issue between AbandonPendingBacklog() and CheckBacklogForTimeouts(), and remove backlog locking #2430

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

kornelpal
Copy link
Contributor

There is only one ProcessBacklogAsync() thread running at a time and all current backlog locks are within that thread, so there is no need for these locks. On the other hand AbandonPendingBacklog() can run concurrently with the ProcessBacklogAsync() thread, that runs CheckBacklogForTimeouts(), but AbandonPendingBacklog() is not locking the backlog that can result in concurrency issues. This can result in CheckBacklogForTimeouts() leaving the dequeued message abandoned in an uncompleted (hung) state. This fix the resolves the concurrency issue by introducing an _abandonPendingBacklogException field that also enables removing the lock. The "failed" message is completed with the thrown exception to make any potential concurrency issues more visible.

@mgravell
Copy link
Collaborator

mgravell commented Apr 5, 2023 via email

@kornelpal
Copy link
Contributor Author

I've created this lock-free fix inspired by this comment.

At one point #2397 had the following fix to the same problem using a lock, if you prefer that:

private void AbandonPendingBacklog(Exception ex)
{
    while (true)
    {
        Message? next;
        lock (_backlog)
        {
            if (!BacklogTryDequeue(out next)) break;
        }

        Multiplexer?.OnMessageFaulted(next, ex);
        next.SetExceptionAndComplete(ex, this);
    }
}

@mgravell
Copy link
Collaborator

mgravell commented Apr 5, 2023

that can result in concurrency issues

Please can you be very explicit about what concurrency issue we're discussing? what is the actual symptom/issue that we're looking at resolving here? To understand whether this resolves them, first I need to have a clear vision of what that "them" are.

So: talk me through it; what scenario are we discussing?

@kornelpal
Copy link
Contributor Author

Although PhysicalBridge._backlog is a ConcurrentQueue, PhysicalBridge.CheckBacklogForTimeouts() is using it in a non-thread-safe way. The existing comment from that method describes it best:

Because peeking at the backlog, checking message and then dequeuing, is not thread-safe, we do have to use
a lock here, for mutual exclusion of backlog DEQUEUERS. Unfortunately.

When not all dequeuers are locking the backlog then CheckBacklogForTimeouts() can dequeue a message then abandon it without ever being completed.

Code from inside the lock in CheckBacklogForTimeouts() annotated by me for the problematic scenario:

// There is a message in the backlog, so no break.
if (!_backlog.TryPeek(out message)) break;
// The message peeked at has timed out, so no break.
if (!message.HasTimedOut(now, timeout, out var _)) break;
// Another thread without locking the backlog already dequeued the previous message
// between the TryPeek() and BacklogTryDequeue() calls.
// Scenario 1; there were no messages left: This is not really an issue.
// Scenario 2; another message (message2) was dequeued: It may or may not be timed out,
// but the current logic does not care, just abandons the message and it will not be completed
// as it is not stored anywhere else. This is a problem for async messages only,
// not for sync (wait timeout), or F+F (not completed otherwise either).
if (!BacklogTryDequeue(out var message2) || (message != message2))
{
    // In both Scenario 1 and 2 the backlog processing thread fails,
    // but a new one will be started by the heartbeat or by adding a message to the backlog.
    throw new RedisException("Thread safety bug detected! A queue message disappeared while we had the backlog lock");
}

Methods dequeuing from the backlog:

  • CheckBacklogForTimeouts(): Properly locks the backlog, and only runs on the backlog processing thread.
  • ProcessBacklogAsync(): Properly locks the backlog, and only runs on the backlog processing thread.
  • AbandonPendingBacklog(): Does not lock the backlog and can run concurrently with the backlog processing thread. Called from PhysicalBridge.Dispose(), ~PhysicalBridge() and PhysicalBridge.OnConnectionFailed().

Since the two methods that actually lock the backlog cannot run concurrently, the current lock is just an overhead.

On the other hand not locking the backlog in AbandonPendingBacklog() can cause the concurrency issue described in the annotated code above that can cause one task per occurrence to be left in a hung state.

@kornelpal
Copy link
Contributor Author

I've added a test for the Dispose() case. It fails without the fix and succeeds with the fix. Should be possible to cause the issue for BacklogPolicy.AbortPendingOnConnectionFailure = true too, but I don't know how to simulate a connection failure with a large backlog.

@kornelpal
Copy link
Contributor Author

I just realized that clearing _abandonPendingBacklogException at the end of AbandonPendingBacklog() can result in CheckBacklogForTimeouts() failing when AbandonPendingBacklog() is running on multiple threads in parallel, so more complexity (like a wrapper for the backlog) would be needed reliable bug detection in CheckBacklogForTimeouts().

@kornelpal
Copy link
Contributor Author

kornelpal commented Apr 12, 2023

I have one more idea, inspired by PhysicalBridge.HasPendingCallerFacingItems(); Instead of removing items, CheckBacklogForTimeouts() could be changed to enumerate the items, and ProcessBridgeBacklogAsync() could be changed to ignore completed items. This way the concurrency issue was eliminated and there was no need for a lock or the exception field. Although adds some more compute overhead, checking for timeout in ProcessBridgeBacklogAsync() again might be simpler than adding tweaks at other places to complete timed out sync messages and identify timed out F+F messages (that never have a result box). Update: It might not be a good option as it keeps all the messages when there is an extended outage.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants