[Internal] Upgrade Resiliency: Refactors LoadBalancingPartition to fix race conditions caused by Dispose()ing too early #4005

Conversation
All good!
@@ -243,7 +253,10 @@ internal Task OpenChannelAsync(Guid activityId)
             }
             finally
             {
-                this.capacityLock.ExitWriteLock();
+                if (this.capacityLock.IsUpgradeableReadLockHeld)
Assert on IsUpgradeableReadLockHeld - there is no thread switch possible - should always be true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
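For illustration, here is a minimal self-contained sketch of the pattern the reviewer is asking for (hypothetical class and method names, not the SDK's actual code): assert that the upgradeable read lock is still held in the finally block instead of guarding the release with an if, since no await or thread switch can occur between entering and exiting the lock.

```csharp
using System.Diagnostics;
using System.Threading;

internal sealed class CapacityGuardExample
{
    private readonly ReaderWriterLockSlim capacityLock = new ReaderWriterLockSlim();

    internal void AdjustCapacity()
    {
        this.capacityLock.EnterUpgradeableReadLock();
        try
        {
            // ... inspect state, optionally upgrade to a write lock ...
        }
        finally
        {
            // Assert instead of guarding with an 'if': this should never be false,
            // because the same thread reaches this point with no await in between.
            Debug.Assert(
                this.capacityLock.IsUpgradeableReadLockHeld,
                "Upgradeable read lock must still be held; no thread switch is possible here.");
            this.capacityLock.ExitUpgradeableReadLock();
        }
    }
}
```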
                if (this.capacityLock.IsWriteLockHeld)
                {
                    this.capacityLock.ExitWriteLock();
                    Interlocked.Exchange(ref LoadBalancingPartition.openChannelsPending, 0);
I think this should still happen under the write lock.
Done.
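A minimal sketch of the reordered code under the same assumptions (hypothetical class shape; the field names are taken from the diff above): the pending-channels flag is reset while the write lock is still held, and the lock state is asserted rather than checked, since no thread switch is possible between entering and exiting the lock.

```csharp
using System.Diagnostics;
using System.Threading;

internal sealed class OpenChannelsExample
{
    private static int openChannelsPending;
    private readonly ReaderWriterLockSlim capacityLock = new ReaderWriterLockSlim();

    internal void OpenChannels()
    {
        this.capacityLock.EnterWriteLock();
        try
        {
            // ... open channels and update capacity ...
        }
        finally
        {
            // Should never be false: the lock is entered and exited on the same thread.
            Debug.Assert(this.capacityLock.IsWriteLockHeld, "Write lock must still be held here.");

            // Reset the flag before releasing the lock, so other threads never
            // observe the lock released while a stale 'pending' value remains.
            Interlocked.Exchange(ref OpenChannelsExample.openChannelsPending, 0);
            this.capacityLock.ExitWriteLock();
        }
    }
}
```

The same reasoning applies to the IsWriteLockHeld check discussed next: an assertion documents the invariant instead of silently skipping the release.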
                }
                finally
                {
                    if (this.capacityLock.IsWriteLockHeld)
Again - assert - should never be false - no thread switch possible
Done.
Pull Request Template
Description
Problem: Recently, in the encryption test project, I started observing some strange behavior where a few of the tests started failing when replica validation is turned on. The exception is below:
It appears that a System.InvalidOperationException ("Collection was modified") was thrown while calling LoadBalancingPartition.Dispose(). I analyzed this and figured out that a potential cause could be thread contention while accessing and modifying this.openChannels at the same time.

My Analysis and Solution: This happens because, during the OpenChannelAsync stage, it is possible that we exit the write lock too early, before the connection-opening task has finished, which gives another thread, potentially the disposing thread, the opportunity to modify the collection.
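To make the failure mode concrete, here is a small self-contained repro sketch (not SDK code; the names are illustrative) of what happens when one thread enumerates a List&lt;T&gt;, the way Dispose() walks this.openChannels, while another thread is still adding to it. It may occasionally complete without hitting the race, but it typically throws the same InvalidOperationException.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

internal static class CollectionModifiedRepro
{
    private static readonly List<int> openChannels = new List<int>();

    private static async Task Main()
    {
        // Writer: keeps adding entries, like OpenChannelAsync adding channels.
        Task writer = Task.Run(() =>
        {
            for (int i = 0; i < 1_000_000; i++)
            {
                openChannels.Add(i);
            }
        });

        // Reader: enumerates concurrently, like Dispose() walking openChannels.
        Task reader = Task.Run(() =>
        {
            try
            {
                while (!writer.IsCompleted)
                {
                    foreach (int channel in openChannels)
                    {
                        // No-op; enumeration alone trips List<T>'s version check.
                    }
                }
            }
            catch (InvalidOperationException ex)
            {
                // Typically: "Collection was modified; enumeration operation may not execute."
                Console.WriteLine(ex.Message);
            }
        });

        await Task.WhenAll(writer, reader);
    }
}
```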
Please take a look at my proposed code changes below; the left side is the existing code and the right side is my new change:
In theory, instead of returning this.OpenChannelAndIncrementCapacity, we could await this.OpenChannelAndIncrementCapacity, which would ensure we wait for the operation to finish before releasing the write lock. But this introduces another problem.

The new problem is that this.capacityLock.EnterWriteLock(), like every other operation on ReaderWriterLockSlim, is not usable across asynchronous operations. Our call to OpenChannelAndIncrementCapacity during RNTBD connection opening (for both replica validation and create-and-initialize async) is deterministic in nature, so we wait on the background channel-opening task with await newChannel.OpenChannelAsync(activityId);. I have observed that the thread that acquired the write lock can get lost because of the background task, and a second thread then tries to release a lock it never acquired in the first place, which ends up creating a deadlock situation.
To better understand this, please take a look at the trace logs below:
You will notice that the "Exiting write lock" statement was printed only once; that is because, in that instance, the thread id (i.e. 8) that acquired the lock was the same as the one that tried to release it.
So it appears that ReaderWriterLockSlim is not helpful in this case, where we have the async call.
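One common async-friendly alternative, shown here only as a general pattern and not necessarily what this PR ends up doing, is to coordinate with SemaphoreSlim.WaitAsync, which has no thread affinity and can therefore be released safely after an await:

```csharp
using System.Threading;
using System.Threading.Tasks;

internal sealed class AsyncCapacityGuard
{
    private readonly SemaphoreSlim capacitySemaphore = new SemaphoreSlim(1, 1);

    internal async Task OpenChannelLikeAsync()
    {
        // WaitAsync does not block a thread and has no thread affinity.
        await this.capacitySemaphore.WaitAsync().ConfigureAwait(false);
        try
        {
            // Safe to await while "holding" the semaphore; the Release below
            // works no matter which thread runs this continuation.
            await Task.Delay(100).ConfigureAwait(false);
        }
        finally
        {
            this.capacitySemaphore.Release();
        }
    }

    private static Task Main() => new AsyncCapacityGuard().OpenChannelLikeAsync();
}
```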
Type of change
Please delete options that are not relevant.
Closing issues
To automatically close an issue: closes #IssueNumber