-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Fix] Coupling block sync to DAG state #3268
Conversation
@raychu86 is this PR ready to go (assuming all tests pass)? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM for code quality. Please ensure stress tests pass prior to proceeding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did some light testing w/ a 4 validator devnet with these changes merged into the latest mainnet-staging
and every time I restarted a validator or resynced a validator from genesis I encountered various errors:
dev0.log - Syncing the BFT to block 40..... Syncing the BFT to block 54..
over and over
dev0.log - Failed to speculate on transactions - Failed to post-ratify
after a restart after syncing
dev0.log - BFT failed to advance the subdag for round 420 - Leader certificate has an incorrect committee ID
+ Malicious peer - proposed batch has a different committee ID
I'm not sure how many of these are related to this specific PR or just validator sync issues in general from devnet.
My devnet setup: (git checkout fix/dag-syncing && git merge mainnet-staging && cargo build --release
)
./target/release/snarkos start --network 0 --validator --nodisplay --dev 0 --no-dev-txs --dev-num-validators 4 --validators 127.0.0.1:5000,127.0.0.1:5001,127.0.0.1:5002,127.0.0.1:5003 --logfile dev0.log
./target/release/snarkos start --network 0 --validator --nodisplay --dev 1 --validators 127.0.0.1:5000,127.0.0.1:5001,127.0.0.1:5002,127.0.0.1:5003 --logfile dev1.log
./target/release/snarkos start --network 0 --validator --nodisplay --dev 2 --validators 127.0.0.1:5000,127.0.0.1:5001,127.0.0.1:5002,127.0.0.1:5003 --logfile dev2.log
./target/release/snarkos start --network 0 --validator --nodisplay --dev 3 --validators 127.0.0.1:5000,127.0.0.1:5001,127.0.0.1:5002,127.0.0.1:5003 --logfile dev3.log
Reproduction steps:
- spin up all 4 devnet validators
- wait about 5-10 minutes
- stop validator 0, delete its ledger + proposal cache, start validator 0
- errors occur within 2 minutes about 50-60% of the time
|
Since reported errors have not been addressed, holding off on merge for now, and therefore holding off on adding this prior to code freeze. If we want this, it will have to be after launch. |
@mdelle1 Have you had a chance to take a look at the issues being observed with the change? |
Superseded by #3386 |
Motivation
This PR focuses on coupling block sync to DAG state replication. When a node is syncing via block responses, it will sync its storage and DAG with the certificates contained in the block and attempt to update its ledger. Previously, there were scenarios where a node would commit certificates in its DAG without advancing blocks. Instead, the committal of certificates and advancement of blocks during sync should be coupled. This PR commits certificates in the DAG only when blocks are advanced to in the sync module and creates a channel to the BFT to ensure that the leader certificate of the block being added was recently committed in the BFT.