Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix deadlock in InboundLedgers and NetworkOPs #5124

Merged
merged 6 commits into from
Nov 6, 2024

Conversation

Bronek
Copy link
Collaborator

@Bronek Bronek commented Sep 9, 2024

High Level Overview of Change

Fixes a deadlock bug in 2.2.2 release.

Context of Change

Functions acquireAsync and NetworkOPsImp::recvValidation contains lock.lock() which will deadlock if an exception is thrown while the (non-recursive) mutex is owned by the calling thread, that is before lock.unlock()

In 2.2.2 release we switched some operations from synchronous to asynchronous, guarded by mutexes, but failed to account for exception safety of the critical section, which may result in a deadlock if an exception is thrown before lock.unlock() inside the try block. It is extremely unlikely that a deadlock will actually occur in practice, since the possible exceptions in this section of code are scarce.

The solution adopted in this PR is to move ScopedUnlock from LedgerMaster.cpp to basics/scope.h (adjusting casing to match other utilities in this file, to scope_unlock) and to use this type to create a RAII unlock, removing the problematic lock.lock().

An alternative solution might be to abstract away the "pending operation" with the corresponding mutex, and then use it where appropriate, but I wanted this change to generalise an existing lower-level utility (i.e. scoped unlock) rather than write a new one. The "pending operation" approach could be further generalised into structured asynchronous operations, which would place it in a much larger PR.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)

@Bronek Bronek marked this pull request as ready for review September 9, 2024 10:08
@Bronek Bronek marked this pull request as draft September 9, 2024 10:49
@Bronek Bronek marked this pull request as ready for review September 9, 2024 12:13
Copy link

codecov bot commented Sep 9, 2024

Codecov Report

Attention: Patch coverage is 80.00000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 77.9%. Comparing base (c5c0e70) to head (b19505a).
Report is 2 commits behind head on develop.

Files with missing lines Patch % Lines
src/xrpld/app/misc/NetworkOPs.cpp 0.0% 2 Missing ⚠️
src/xrpld/app/ledger/detail/LedgerMaster.cpp 75.0% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff            @@
##           develop   #5124     +/-   ##
=========================================
+ Coverage     77.7%   77.9%   +0.2%     
=========================================
  Files          779     782      +3     
  Lines        66015   66614    +599     
  Branches      8156    8140     -16     
=========================================
+ Hits         51261   51887    +626     
+ Misses       14754   14727     -27     
Files with missing lines Coverage Δ
include/xrpl/basics/scope.h 100.0% <100.0%> (ø)
src/xrpld/app/ledger/detail/InboundLedgers.cpp 38.6% <100.0%> (ø)
src/xrpld/app/ledger/detail/LedgerMaster.cpp 43.9% <75.0%> (-0.4%) ⬇️
src/xrpld/app/misc/NetworkOPs.cpp 70.1% <0.0%> (+<0.1%) ⬆️

... and 12 files with indirect coverage changes

Impacted file tree graph

Copy link
Collaborator

@thejohnfreeman thejohnfreeman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it just me, or do these changes not actually change any behavior?

In InboundLedgers::acquireAsync(), we have this sequence before:

  • 142: unlock().
  • 143: call acquire(). If it does not throw, continue to line 157. If it does throw, one of the exception handlers is entered (no exception can skip both handlers), neither throws (they just log warnings), and execution continues to line 157. Either way, we end up at line 157.
  • 157: lock().

Now, I prefer a RAII type like scope_unlock, but I think the only difference in behavior it produces here is that the call to lock() is moved before the exception handler is entered (if one is entered). But that is not a material difference. Is that right?

Very similar story for NetworkOPsImp::recvValidation.

include/xrpl/basics/scope.h Outdated Show resolved Hide resolved
Review feedback

Co-authored-by: John Freeman <[email protected]>
@Bronek
Copy link
Collaborator Author

Bronek commented Sep 11, 2024

Is it just me, or do these changes not actually change any behavior?

In InboundLedgers::acquireAsync(), we have this sequence before:

  • 142: unlock().
  • 143: call acquire(). If it does not throw, continue to line 157. If it does throw, one of the exception handlers is entered (no exception can skip both handlers), neither throws (they just log warnings), and execution continues to line 157. Either way, we end up at line 157.
  • 157: lock().

The change in behaviour is if we have an exception before unlock() at the line 142, e.g. inside insert(). If this happens, then the lock() outside of the try-catch block will deadlock (because the mutex is already owned, by the very same thread).

Now, I prefer a RAII type like scope_unlock, but I think the only difference in behavior it produces here is that the call to lock() is moved before the exception handler is entered (if one is entered). But that is not a material difference. Is that right?

Very similar story for NetworkOPsImp::recvValidation.

Yes, same applies - there's an insert() before unlock() and if it throws, then we will deadlock.

The RAII unlock_scope fixes both possible deadlocks, because it removes the problematic lock() below the try-catch blocks.

Copy link
Collaborator

@ximinez ximinez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be redundant with the CanProcess class in #5126. I'm biased, but I prefer my solution because it hides all the locking in a single class. I'm open to being convinced otherwise, though.

@Bronek
Copy link
Collaborator Author

Bronek commented Sep 13, 2024

This may be redundant with the CanProcess class in #5126. I'm biased, but I prefer my solution because it hides all the locking in a single class. I'm open to being convinced otherwise, though.

you are right that your PR solves this problem, however given the number of other issues that your PR solves, and the associated code churn, it might be a little until it is tested and approved. In the meantime, if we get this merged and, some time after, your PR merged as well, we will eventually end up with unlock_scope moved from LedgerMaster.cpp to scope.h which is in itself (I think) an improvement.

Copy link
Collaborator

@ximinez ximinez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Edit: I left one small suggested change to a comment, but whether you take it or not, this is good to go.

Comment on lines +221 to +224
} // mut gets locked here.

} // mut gets unlocked here
@endcode
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to drive the point home:

Suggested change
} // mut gets locked here.
} // mut gets unlocked here
@endcode
} // mut gets locked here.
... do some more stuff with it locked ...
} // mut gets unlocked here
@endcode

@ximinez ximinez added the Passed Passed code review & PR owner thinks it's ready to merge. Perf sign-off may still be required. label Nov 6, 2024
@ximinez ximinez merged commit 9e48fc0 into XRPLF:develop Nov 6, 2024
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Passed Passed code review & PR owner thinks it's ready to merge. Perf sign-off may still be required.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants