
[Internal] Upgrade Resiliency: Refactors LoadBalancingPartition to fix race conditions caused by Dispose() ing too early. #4005

Conversation

kundadebdatta (Member)

Pull Request Template

Description

Problem: Recently, in the encryption test project, I started observing some strange behavior: a few of the tests started failing when replica validation is turned on. The exception is below:

[screenshot: exception stack trace]

It appears that a System.InvalidOperationException ("Collection was modified") was thrown while calling LoadBalancingPartition.Dispose(). I analyzed this and figured out that a potential cause could be thread contention: one thread accessing this.openChannels while another modifies it at the same time:

    public void Dispose()
    {
        this.capacityLock.EnterWriteLock();
        try
        {
            // Enumerates this.openChannels; List<T> throws InvalidOperationException
            // if another thread mutates the collection while the enumeration runs.
            foreach (LbChannelState channelState in this.openChannels)
            {
                channelState.Dispose();
            }
        }
        finally
        {
            this.capacityLock.ExitWriteLock();
        }
        try
        {
            this.capacityLock.Dispose();
        }
        catch (SynchronizationLockException)
        {
            // SynchronizationLockException is thrown if there are in-flight requests during the
            // disposal of capacityLock. Suppress this exception to avoid crashing the disposal
            // of other partitions/channels in hierarchical calls.
            return;
        }
    }
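For context, here is a minimal stand-alone repro of the failure mode (not SDK code; the names are hypothetical stand-ins for this.openChannels and the pending channel-open work): enumerating a List<T> while another thread mutates it typically raises exactly this "Collection was modified" InvalidOperationException.

    using System;
    using System.Collections.Generic;
    using System.Threading.Tasks;

    class CollectionModifiedRepro
    {
        static void Main()
        {
            // Hypothetical stand-in for this.openChannels.
            List<int> openChannels = new List<int>();

            // Stand-in for a pending channel-open task that keeps mutating the list
            // after the write lock protecting it has already been released.
            Task backgroundOpen = Task.Run(() =>
            {
                for (int i = 0; i < 1_000_000; i++)
                {
                    openChannels.Add(i);
                }
            });

            // Dispose()-style enumeration, racing with the background mutation.
            // List<T>'s enumerator detects the concurrent Add and typically throws
            // "Collection was modified; enumeration operation may not execute."
            while (!backgroundOpen.IsCompleted)
            {
                try
                {
                    foreach (int channel in openChannels)
                    {
                        _ = channel;
                    }
                }
                catch (InvalidOperationException ex)
                {
                    Console.WriteLine(ex.Message);
                    break;
                }
            }

            backgroundOpen.Wait();
        }
    }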

My Analysis and Solution: This is happening because, during the OpenChannelAsync stage, it is possible that we exit the write lock too early, before the connection-opening task has finished. That gives another thread, potentially the disposing thread, the opportunity to modify the collection.
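A minimal stand-alone sketch of that shape (hypothetical names, not the SDK code): because the channel-open task is returned from inside the write lock, the finally block releases the lock as soon as the method returns, while the task is still in flight, leaving a window in which another thread can modify openChannels or dispose the partition.

    using System;
    using System.Threading;
    using System.Threading.Tasks;

    class EarlyLockRelease
    {
        private static readonly ReaderWriterLockSlim capacityLock = new ReaderWriterLockSlim();

        // Mirrors the existing shape: the method returns the still-running task,
        // and the finally block exits the write lock immediately, well before the
        // channel-open work has completed.
        static Task OpenChannelAndIncrementCapacityStandIn()
        {
            capacityLock.EnterWriteLock();
            try
            {
                return Task.Delay(200); // stand-in for the real channel-open work
            }
            finally
            {
                capacityLock.ExitWriteLock(); // runs as the method returns, not when the task completes
            }
        }

        static async Task Main()
        {
            Task pending = OpenChannelAndIncrementCapacityStandIn();

            // The write lock is already released here even though the open is still
            // pending, which is the window in which Dispose() can run.
            Console.WriteLine($"Write lock held while the open is pending: {capacityLock.IsWriteLockHeld}");

            await pending;
        }
    }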

Please take a look at my proposed code changes below; the left side is the existing code and the right side is my new code change:

In theory, instead of returning this.OpenChannelAndIncrementCapacity(...), we could await this.OpenChannelAndIncrementCapacity(...); that would make sure we wait for the operation to finish before releasing the write lock. But this brings in another problem.

The new problem it creates is that this.capacityLock.EnterWriteLock(), like every other operation on ReaderWriterLockSlim, cannot be used across asynchronous operations: the lock is thread-affine, so the thread that releases it must be the same thread that acquired it.

Our call to OpenChannelAndIncrementCapacity during rntbd connection opening (both for replica validation and for create-and-initialize async) waits on the background channel-opening task via await newChannel.OpenChannelAsync(activityId);. I have observed that the thread which acquired the write lock can be switched out across this background task, and a second thread, which never acquired the lock in the first place, then tries to release it, which ends up creating a deadlock situation.
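A stand-alone sketch of the thread-affinity problem (hypothetical names, not the SDK code): after the await, the continuation typically resumes on a different thread pool thread, so ExitWriteLock throws SynchronizationLockException on that thread, while the thread that actually entered the lock never releases it and any other thread waiting on the lock is stuck.

    using System;
    using System.Threading;
    using System.Threading.Tasks;

    class AwaitInsideReaderWriterLockSlim
    {
        private static readonly ReaderWriterLockSlim capacityLock = new ReaderWriterLockSlim();

        static async Task Main()
        {
            capacityLock.EnterWriteLock();
            Console.WriteLine($"Acquired write lock on thread {Environment.CurrentManagedThreadId}");
            try
            {
                // Stand-in for: await newChannel.OpenChannelAsync(activityId);
                await Task.Delay(100);
            }
            finally
            {
                // The continuation typically runs on a different thread pool thread,
                // which never entered the lock, so ExitWriteLock throws
                // SynchronizationLockException here. Meanwhile the original thread
                // still owns the write lock, so any other thread waiting to enter it
                // is blocked: the deadlock described above.
                Console.WriteLine($"Releasing write lock on thread {Environment.CurrentManagedThreadId}");
                try
                {
                    capacityLock.ExitWriteLock();
                }
                catch (SynchronizationLockException ex)
                {
                    Console.WriteLine(ex.Message);
                }
            }
        }
    }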

To better understand this, please take a look at the trace logs below:

You will notice that the "Exiting write lock" statement was printed only once; that is because the thread id (i.e. 8) that acquired the lock was the same one that tried to release it.

So, it appears that ReaderWriterLockSlim is not helpful in this case, where an async call sits inside the locked region.
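For contrast, here is a minimal sketch of an async-compatible guard using SemaphoreSlim, which has no thread affinity and can therefore be held across an await. This is only an illustration of the constraint described above (the names are hypothetical); it is not the approach this PR takes, and SemaphoreSlim is not a drop-in replacement for ReaderWriterLockSlim since it has no reader/writer distinction.

    using System;
    using System.Threading;
    using System.Threading.Tasks;

    class AsyncCompatibleGuard
    {
        private static readonly SemaphoreSlim gate = new SemaphoreSlim(1, 1);

        static async Task OpenChannelStandInAsync()
        {
            await gate.WaitAsync();
            try
            {
                // Stand-in for the channel-open work; the gate stays held across the
                // await regardless of which thread runs the continuation.
                await Task.Delay(100);
            }
            finally
            {
                // Release does not care which thread calls it, unlike ExitWriteLock.
                gate.Release();
            }
        }

        static Task Main() => OpenChannelStandInAsync();
    }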

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)

Closing issues

To automatically close an issue: closes #IssueNumber


@github-actions github-actions bot left a comment

All good!

@kundadebdatta kundadebdatta changed the title [Internal] Upgrade Resiliency: Code changes to fix race conditions. [Internal] Upgrade Resiliency: Refactors 'LoadBalancingPartition' to fix race conditions caused by Dispose() ing too early. Jul 25, 2023
@kundadebdatta kundadebdatta self-assigned this Jul 25, 2023
@kundadebdatta kundadebdatta changed the title [Internal] Upgrade Resiliency: Refactors 'LoadBalancingPartition' to fix race conditions caused by Dispose() ing too early. [Internal] Upgrade Resiliency: Refactors LoadBalancingPartition to fix race conditions caused by Dispose() ing too early. Jul 25, 2023
@@ -243,7 +253,10 @@ internal Task OpenChannelAsync(Guid activityId)
}
finally
{
this.capacityLock.ExitWriteLock();
if (this.capacityLock.IsUpgradeableReadLockHeld)

Member: Assert on IsUpgradeableReadLockHeld - there is no thread switch possible - should always be true

Member Author: Done.

if (this.capacityLock.IsWriteLockHeld)
{
this.capacityLock.ExitWriteLock();
Interlocked.Exchange(ref LoadBalancingPartition.openChannelsPending, 0);

Member: I think this should still happen under the write lock.

Member Author: Done.

}
finally
{
if (this.capacityLock.IsWriteLockHeld)

Member: Again - assert - should never be false - no thread switch possible

Member Author: Done.
