
[Internal] Upgrade Resiliency: Refactors LoadBalancingPartition to fix race conditions caused by Dispose() ing too early. #4005

Conversation

kundadebdatta (Member)

Pull Request Template

Description

Problem: Recently, in the encryption test project, I started observing some strange behavior: a few of the tests started failing when replica validation is turned on. The exception is below:

[screenshot: exception stack trace]

It appears that a System.InvalidOperationException ("Collection was modified") was thrown while calling LoadBalancingPartition.Dispose(). I analyzed this and figured out that a potential cause could be thread contention: one thread accessing this.openChannels while another modifies it at the same time:

    public void Dispose()
    {
        this.capacityLock.EnterWriteLock();
        try
        {
            // Enumerates this.openChannels; List<T> throws InvalidOperationException
            // if another thread mutates the collection while the enumeration runs.
            foreach (LbChannelState channelState in this.openChannels)
            {
                channelState.Dispose();
            }
        }
        finally
        {
            this.capacityLock.ExitWriteLock();
        }
        try
        {
            this.capacityLock.Dispose();
        }
        catch (SynchronizationLockException)
        {
            // SynchronizationLockException is thrown if there are in-flight requests during the
            // disposal of capacityLock. Suppress this exception to avoid crashing the disposal
            // of other partitions/channels in hierarchical calls.
            return;
        }
    }
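For context, here is a minimal stand-alone repro of the failure mode (not SDK code; the names are hypothetical stand-ins for this.openChannels and the pending channel-open work): enumerating a List<T> while another thread mutates it typically raises exactly this "Collection was modified" InvalidOperationException.

    using System;
    using System.Collections.Generic;
    using System.Threading.Tasks;

    class CollectionModifiedRepro
    {
        static void Main()
        {
            // Hypothetical stand-in for this.openChannels.
            List<int> openChannels = new List<int>();

            // Stand-in for a pending channel-open task that keeps mutating the list
            // after the write lock protecting it has already been released.
            Task backgroundOpen = Task.Run(() =>
            {
                for (int i = 0; i < 1_000_000; i++)
                {
                    openChannels.Add(i);
                }
            });

            // Dispose()-style enumeration, racing with the background mutation.
            // List<T>'s enumerator detects the concurrent Add and typically throws
            // "Collection was modified; enumeration operation may not execute."
            while (!backgroundOpen.IsCompleted)
            {
                try
                {
                    foreach (int channel in openChannels)
                    {
                        _ = channel;
                    }
                }
                catch (InvalidOperationException ex)
                {
                    Console.WriteLine(ex.Message);
                    break;
                }
            }

            backgroundOpen.Wait();
        }
    }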

My Analysis and Solution: This is happening because, during the OpenChannelAsync stage, it is possible that we exit the write lock too early, before the connection-opening task has finished. That gives another thread, potentially the disposing thread, the opportunity to modify the collection.
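A minimal stand-alone sketch of that shape (hypothetical names, not the SDK code): because the channel-open task is returned from inside the write lock, the finally block releases the lock as soon as the method returns, while the task is still in flight, leaving a window in which another thread can modify openChannels or dispose the partition.

    using System;
    using System.Threading;
    using System.Threading.Tasks;

    class EarlyLockRelease
    {
        private static readonly ReaderWriterLockSlim capacityLock = new ReaderWriterLockSlim();

        // Mirrors the existing shape: the method returns the still-running task,
        // and the finally block exits the write lock immediately, well before the
        // channel-open work has completed.
        static Task OpenChannelAndIncrementCapacityStandIn()
        {
            capacityLock.EnterWriteLock();
            try
            {
                return Task.Delay(200); // stand-in for the real channel-open work
            }
            finally
            {
                capacityLock.ExitWriteLock(); // runs as the method returns, not when the task completes
            }
        }

        static async Task Main()
        {
            Task pending = OpenChannelAndIncrementCapacityStandIn();

            // The write lock is already released here even though the open is still
            // pending, which is the window in which Dispose() can run.
            Console.WriteLine($"Write lock held while the open is pending: {capacityLock.IsWriteLockHeld}");

            await pending;
        }
    }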

Please take a look at my proposed code changes below; the left side is the existing code and the right side is my new code change:

In theory, instead of returning this.OpenChannelAndIncrementCapacity(...), we could await this.OpenChannelAndIncrementCapacity(...); that would make sure we wait for the operation to finish before releasing the write lock. But this brings in another problem.

The new problem it creates is that this.capacityLock.EnterWriteLock(), like every other operation on ReaderWriterLockSlim, cannot be used across asynchronous operations: the lock is thread-affine, so the thread that releases it must be the same thread that acquired it.

Our call to OpenChannelAndIncrementCapacity during rntbd connection opening (both for replica validation and for create-and-initialize async) waits on the background channel-opening task via await newChannel.OpenChannelAsync(activityId);. I have observed that the thread which acquired the write lock can be switched out across this background task, and a second thread, which never acquired the lock in the first place, then tries to release it, which ends up creating a deadlock situation.
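A stand-alone sketch of the thread-affinity problem (hypothetical names, not the SDK code): after the await, the continuation typically resumes on a different thread pool thread, so ExitWriteLock throws SynchronizationLockException on that thread, while the thread that actually entered the lock never releases it and any other thread waiting on the lock is stuck.

    using System;
    using System.Threading;
    using System.Threading.Tasks;

    class AwaitInsideReaderWriterLockSlim
    {
        private static readonly ReaderWriterLockSlim capacityLock = new ReaderWriterLockSlim();

        static async Task Main()
        {
            capacityLock.EnterWriteLock();
            Console.WriteLine($"Acquired write lock on thread {Environment.CurrentManagedThreadId}");
            try
            {
                // Stand-in for: await newChannel.OpenChannelAsync(activityId);
                await Task.Delay(100);
            }
            finally
            {
                // The continuation typically runs on a different thread pool thread,
                // which never entered the lock, so ExitWriteLock throws
                // SynchronizationLockException here. Meanwhile the original thread
                // still owns the write lock, so any other thread waiting to enter it
                // is blocked: the deadlock described above.
                Console.WriteLine($"Releasing write lock on thread {Environment.CurrentManagedThreadId}");
                try
                {
                    capacityLock.ExitWriteLock();
                }
                catch (SynchronizationLockException ex)
                {
                    Console.WriteLine(ex.Message);
                }
            }
        }
    }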

To better understand this, please take a look at the trace logs below:

You will notice that the "Exiting write lock" statement was printed only once; that is because the thread id (i.e. 8) that acquired the lock was the same one that tried to release it.

So, it appears that ReaderWriterLockSlim is not helpful in this case, where an async call sits inside the locked region.
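For contrast, here is a minimal sketch of an async-compatible guard using SemaphoreSlim, which has no thread affinity and can therefore be held across an await. This is only an illustration of the constraint described above (the names are hypothetical); it is not the approach this PR takes, and SemaphoreSlim is not a drop-in replacement for ReaderWriterLockSlim since it has no reader/writer distinction.

    using System;
    using System.Threading;
    using System.Threading.Tasks;

    class AsyncCompatibleGuard
    {
        private static readonly SemaphoreSlim gate = new SemaphoreSlim(1, 1);

        static async Task OpenChannelStandInAsync()
        {
            await gate.WaitAsync();
            try
            {
                // Stand-in for the channel-open work; the gate stays held across the
                // await regardless of which thread runs the continuation.
                await Task.Delay(100);
            }
            finally
            {
                // Release does not care which thread calls it, unlike ExitWriteLock.
                gate.Release();
            }
        }

        static Task Main() => OpenChannelStandInAsync();
    }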

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)

Closing issues

To automatically close an issue: closes #IssueNumber


@github-actions github-actions bot left a comment

All good!

@kundadebdatta kundadebdatta changed the title [Internal] Upgrade Resiliency: Code changes to fix race conditions. [Internal] Upgrade Resiliency: Refactors 'LoadBalancingPartition' to fix race conditions caused by Dispose() ing too early. Jul 25, 2023
@kundadebdatta kundadebdatta self-assigned this Jul 25, 2023
@kundadebdatta kundadebdatta changed the title [Internal] Upgrade Resiliency: Refactors 'LoadBalancingPartition' to fix race conditions caused by Dispose() ing too early. [Internal] Upgrade Resiliency: Refactors LoadBalancingPartition to fix race conditions caused by Dispose() ing too early. Jul 25, 2023
@@ -243,7 +253,10 @@ internal Task OpenChannelAsync(Guid activityId)
}
finally
{
this.capacityLock.ExitWriteLock();
if (this.capacityLock.IsUpgradeableReadLockHeld)

Member: Assert on IsUpgradeableReadLockHeld - there is no thread switch possible - should always be true

Member Author: Done.

if (this.capacityLock.IsWriteLockHeld)
{
this.capacityLock.ExitWriteLock();
Interlocked.Exchange(ref LoadBalancingPartition.openChannelsPending, 0);

Member: I think this should still happen under the write lock.

Member Author: Done.

}
finally
{
if (this.capacityLock.IsWriteLockHeld)

Member: Again - assert - should never be false - no thread switch possible

Member Author: Done.
