
node: close ledger and part keys on node shutdown #6039

Merged 3 commits into algorand:master on Jun 25, 2024

Conversation

@algorandskiy (Contributor) commented on Jun 25, 2024

Summary

  • node_test runs multiple nodes in a single process, which leads to a file descriptor leak (see "tests: close open file descriptors from make full node" #5057 for more details).
  • Just closing the ledger is not enough, because the transaction pool still performs concurrent evaluation operations against it.
  • Made the transaction pool shutdown-able so it can be stopped before the ledger is closed (see the sketch below).
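Below is a minimal sketch of what "shutdown-able" means here, not the PR's actual code: a quit channel plus a WaitGroup let shutdown refuse new evaluation work and drain in-flight OnNewBlock calls before the ledger is closed. All identifiers are illustrative.

```go
// Minimal sketch of a shutdown-able transaction pool (illustrative names,
// not go-algorand's real implementation).
package main

import (
	"fmt"
	"sync"
)

type TransactionPool struct {
	mu       sync.Mutex
	quit     chan struct{}
	wg       sync.WaitGroup
	shutdown bool
}

func NewTransactionPool() *TransactionPool {
	return &TransactionPool{quit: make(chan struct{})}
}

// OnNewBlock re-evaluates pending transactions; it refuses new work once a
// shutdown has been requested, so no evaluation can race with ledger close.
func (p *TransactionPool) OnNewBlock(round uint64) {
	p.mu.Lock()
	if p.shutdown {
		p.mu.Unlock()
		return
	}
	p.wg.Add(1)
	p.mu.Unlock()
	defer p.wg.Done()

	select {
	case <-p.quit:
		return // shutting down: skip the expensive re-evaluation
	default:
		fmt.Printf("recomputing block evaluator for round %d\n", round)
	}
}

// Shutdown marks the pool as stopped and waits for in-flight work, so a
// subsequent ledger.Close() cannot deadlock against pool reads of the ledger.
func (p *TransactionPool) Shutdown() {
	p.mu.Lock()
	p.shutdown = true
	p.mu.Unlock()
	close(p.quit)
	p.wg.Wait()
}

func main() {
	pool := NewTransactionPool()
	pool.OnNewBlock(1) // normally invoked from the blockNotifier worker
	pool.Shutdown()    // after this, closing the ledger is safe
}
```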
Deadlock example, as seen in a p2p build:
POTENTIAL DEADLOCK:
Previous place where the lock was grabbed
goroutine 412 lock 0xc0006fe300
node.go:446 node.(*AlgorandFullNode).Stop { node.mu.Lock() } <<<<<
node_test.go:938 node.TestNodeHybridTopology { } }

Have been trying to lock it again for more than 30s
goroutine 17114 lock 0xc0006fe300
node.go:1125 node.(*AlgorandFullNode).oldKeyDeletionThread { node.mu.Lock() } <<<<<

Here is what goroutine 412 doing now
goroutine 412 [semacquire]:
sync.runtime_Semacquire(0xc0094269b0?)
/opt/cibuild/.gimme/versions/go1.21.10.linux.amd64/src/runtime/sema.go:62 +0x25
sync.(*WaitGroup).Wait(0xc0094269a8)
/opt/cibuild/.gimme/versions/go1.21.10.linux.amd64/src/sync/waitgroup.go:116 +0xa5
github.com/algorand/go-algorand/ledger.(*blockNotifier).close(0xc009426958)
/opt/cibuild/project/ledger/notifier.go:82 +0x97
github.com/algorand/go-algorand/ledger.(*trackerRegistry).close(0xc009426a80)
/opt/cibuild/project/ledger/tracker.go:506 +0x104
github.com/algorand/go-algorand/ledger.(*Ledger).Close(0xc009426000)
/opt/cibuild/project/ledger/ledger.go:416 +0x111
github.com/algorand/go-algorand/node.(*AlgorandFullNode).Stop(0xc0006fe300)
/opt/cibuild/project/node/node.go:473 +0x717
github.com/algorand/go-algorand/node.TestNodeHybridTopology(0xc006d53860)
/opt/cibuild/project/node/node_test.go:938 +0xe2a
testing.tRunner(0xc006d53860, 0x30e34c8)
/opt/cibuild/.gimme/versions/go1.21.10.linux.amd64/src/testing/testing.go:1595 +0x262
created by testing.(*T).Run in goroutine 1
/opt/cibuild/.gimme/versions/go1.21.10.linux.amd64/src/testing/testing.go:1648 +0x846
Other goroutines holding locks:
goroutine 617 lock 5a4
../data/pools/transactionPool.go:530 pools.(*TransactionPool).OnNewBlock { pool.mu.Lock() } <<<<<
../ledger/notifier.go:67 ledger.(*blockNotifier).worker { listener.OnNewBlock(blk.block, blk.delta) }

goroutine 617 lock 3f
../data/pools/transactionPool.go:781 pools.(*TransactionPool).recomputeBlockEvaluator { pool.assemblyMu.Lock() } <<<<<
../data/pools/transactionPool.go:560 pools.(*TransactionPool).OnNewBlock { stats = pool.recomputeBlockEvaluator(committedTxids, knownCommitted) }
../ledger/notifier.go:67 ledger.(*blockNotifier).worker { listener.OnNewBlock(blk.block, blk.delta) }

goroutine 412 lock 32
../ledger/ledger.go:412 ledger.(*Ledger).Close { l.trackerMu.Lock() } <<<<<
node.go:473 node.(*AlgorandFullNode).Stop { node.ledger.Close() }
node_test.go:938 node.TestNodeHybridTopology { } }

POTENTIAL DEADLOCK:
Previous place where the lock was grabbed
goroutine 412 lock 0xc009426fe8
../ledger/ledger.go:412 ledger.(*Ledger).Close { l.trackerMu.Lock() } <<<<<
node.go:473 node.(*AlgorandFullNode).Stop { node.ledger.Close() }
node_test.go:938 node.TestNodeHybridTopology { } }

Have been trying to lock it again for more than 30s
goroutine 617 lock 0xc009426fe8
../ledger/ledger.go:636 ledger.(*Ledger).LookupWithoutRewards { l.trackerMu.RLock() } <<<<<
../ledger/eval/eval.go:200 eval.(*roundCowBase).lookup { ad, _, err := x.l.LookupWithoutRewards(x.rnd, addr) }
../ledger/eval/cow.go:186 eval.(*roundCowState).lookup { return cb.lookupParent.lookup(addr) }
../ledger/eval/eval.go:1910 eval.(*BlockEvaluator).GenerateBlock { acct, err := eval.state.lookup(addrs[i]) }
../data/pools/transactionPool.go:794 pools.(*TransactionPool).recomputeBlockEvaluator { lvb, err := pool.pendingBlockEvaluator.GenerateBlock(pool.getVotingAccountsForRound(evalRnd)) }
../data/pools/transactionPool.go:560 pools.(*TransactionPool).OnNewBlock { stats = pool.recomputeBlockEvaluator(committedTxids, knownCommitted) }
../ledger/notifier.go:67 ledger.(*blockNotifier).worker { listener.OnNewBlock(blk.block, blk.delta) }
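
Read together, the two reports show the cycle: Stop() holds node.mu and blocks inside ledger.Close() waiting for the blockNotifier worker to finish, while that worker is inside the pool's OnNewBlock evaluation, blocked on the trackerMu that Close() already holds. Below is a sketch of the shutdown ordering that breaks the cycle; the types are stubs standing in for the real ones:

```go
// Sketch of a deadlock-free shutdown order (stub types, illustrative only):
// stop the pool before closing the ledger, so no OnNewBlock evaluation can
// wait on trackerMu while Close() holds it.
package main

import "sync"

type Ledger struct{ trackerMu sync.RWMutex }

func (l *Ledger) Close() {
	l.trackerMu.Lock() // deadlocks if a pool goroutine still reads the ledger
	defer l.trackerMu.Unlock()
}

type TransactionPool struct{ wg sync.WaitGroup }

func (p *TransactionPool) Shutdown() { p.wg.Wait() } // drain in-flight work

type AlgorandFullNode struct {
	mu              sync.Mutex
	transactionPool *TransactionPool
	ledger          *Ledger
}

func (node *AlgorandFullNode) Stop() {
	node.mu.Lock()
	defer node.mu.Unlock()
	node.transactionPool.Shutdown() // first: no more evaluation work
	node.ledger.Close()             // now safe: no reader awaits trackerMu
}

func main() {
	n := &AlgorandFullNode{transactionPool: &TransactionPool{}, ledger: &Ledger{}}
	n.Stop()
}
```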

Additionally, participation key handles are now closed on shutdown (a sketch follows).
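A hypothetical sketch of what closing participation keys on shutdown looks like; PersistedParticipation and its field are stand-ins, not go-algorand's exact API:

```go
// Hypothetical sketch: each participation key owns an open database file,
// so releasing the handle on shutdown directly lowers the fd count.
package main

import (
	"fmt"
	"os"
)

type PersistedParticipation struct {
	dbFile *os.File // stand-in for the open participation key database
}

func (p *PersistedParticipation) Close() error {
	return p.dbFile.Close() // release the fd so it stops counting in lsof
}

func closePartKeys(keys []*PersistedParticipation) {
	for _, part := range keys {
		if err := part.Close(); err != nil {
			fmt.Fprintf(os.Stderr, "closing part key: %v\n", err)
		}
	}
}

func main() {
	f, _ := os.CreateTemp("", "partkey-*.sqlite") // simulate an open key DB
	closePartKeys([]*PersistedParticipation{{dbFile: f}})
}
```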

Monitored with `while pgrep node.test > /dev/null; do lsof -p $(pgrep node.test) | wc -l; sleep 1; done` on the p2p branch:

| description  | peak fd | avg fd | last 3 avg fd |
| ------------ | ------- | ------ | ------------- |
| no closing   | 1158    | 1060   | 1130          |
| ledger close | 888     | 770    | 653           |
| all closed   | 589     | 498    | 157           |

Test Plan

Existing tests should pass


codecov bot commented on Jun 25, 2024

Codecov Report

Attention: Patch coverage is 31.81818% with 15 lines in your changes missing coverage. Please review.

Project coverage is 54.84%. Comparing base (052ceb2) to head (e2a110f).
Report is 1 commit behind head on master.

| Files                         | Patch % | Lines                        |
| ----------------------------- | ------- | ---------------------------- |
| data/pools/transactionPool.go | 0.00%   | 8 Missing and 2 partials ⚠️  |
| node/node.go                  | 50.00%  | 4 Missing and 1 partial ⚠️   |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6039      +/-   ##
==========================================
- Coverage   55.86%   54.84%   -1.03%     
==========================================
  Files         482      482              
  Lines       68576    68593      +17     
==========================================
- Hits        38311    37617     -694     
- Misses      27659    28325     +666     
- Partials     2606     2651      +45     


@gmalouf (Contributor) left a comment:

Asked two questions, other than that makes sense.

data/pools/transactionPool.go: 2 review threads (resolved)
@jasonpaulos (Contributor) previously approved these changes on Jun 25, 2024 and commented:

Makes sense

@algorandskiy changed the title from "node: close ledger on node shutdown" to "node: close ledger and part keys on node shutdown" on Jun 25, 2024
node/node.go: 1 review thread (resolved)
data/pools/transactionPool.go: 1 review thread (resolved)
@cce previously approved these changes on Jun 25, 2024.
node/node.go: 1 review thread (resolved)
@algorandskiy merged commit 24382d8 into algorand:master on Jun 25, 2024
18 checks passed
algorandskiy added commits to algorandskiy/go-algorand referencing this pull request on Jun 26, 2024:

* algorand#6039 discovered that blockNotifier preserves state between
  ledger reloads, which breaks assumptions about how trackers work
* Made the node clear and re-register block listeners explicitly
  on fast catchup.
* Also removed the unused blockListeners argument from data.Ledger
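
For illustration, a hypothetical sketch of the clear-and-re-register pattern those commits describe; the names are assumptions, not the actual notifier API:

```go
// Hypothetical sketch: reset the block listener set around a ledger reload
// so the notifier carries no stale tracker state across fast catchup.
package main

import (
	"fmt"
	"sync"
)

type BlockListener interface {
	OnNewBlock(round uint64)
}

type blockNotifier struct {
	mu        sync.Mutex
	listeners []BlockListener
}

// Reset drops all previously registered listeners and installs a fresh set.
func (bn *blockNotifier) Reset(fresh []BlockListener) {
	bn.mu.Lock()
	defer bn.mu.Unlock()
	bn.listeners = append([]BlockListener(nil), fresh...)
}

type printListener struct{}

func (printListener) OnNewBlock(round uint64) { fmt.Println("block", round) }

func main() {
	bn := &blockNotifier{}
	bn.Reset([]BlockListener{printListener{}}) // re-register after catchup
	for _, l := range bn.listeners {
		l.OnNewBlock(100)
	}
}
```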