[Fix] Coupling block sync to DAG state #3268

Closed
wants to merge 10 commits into from

Contributor

@mdelle1 mdelle1 commented May 22, 2024

Motivation

This PR couples block sync to DAG state replication. When a node syncs via block responses, it syncs its storage and DAG with the certificates contained in each block and attempts to update its ledger. Previously, there were scenarios in which a node would commit certificates in its DAG without advancing blocks. The commitment of certificates and the advancement of blocks during sync should instead be coupled. This PR commits certificates in the DAG only when the sync module advances to the corresponding block, and it creates a channel to the BFT to ensure that the leader certificate of the block being added was recently committed in the BFT.
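The coupling described above can be illustrated with a minimal, standalone sketch. This is not the code in this PR: every type and function name (`SyncedBlock`, `CommitRequest`, `SyncState`, `run_bft`, `advance_with_block`, `demo`) is hypothetical, the BFT is reduced to a thread that tracks the last committed leader round, and the request/reply shape of the channel is an assumption. It only shows the ordering: the sync side asks the BFT about the leader certificate first, and records certificates and advances the height together only if the BFT confirms.

```rust
use std::collections::BTreeMap;
use std::sync::mpsc;
use std::thread;

// NOTE: all names here are hypothetical and are not taken from the snarkOS codebase.

/// Stand-in for a block received during sync.
struct SyncedBlock {
    height: u32,
    leader_round: u64,
    certificate_rounds: Vec<u64>,
}

/// Request sent from the sync side to the BFT over the channel.
struct CommitRequest {
    leader_round: u64,
    /// Reply channel on which the BFT reports whether the leader was committed.
    reply: mpsc::Sender<bool>,
}

/// Toy BFT loop: accepts a leader round only if it is newer than the last committed one.
fn run_bft(requests: mpsc::Receiver<CommitRequest>) -> u64 {
    let mut last_committed_round = 0u64;
    // The loop ends once every sender has been dropped.
    for request in requests {
        let accepted = request.leader_round > last_committed_round;
        if accepted {
            last_committed_round = request.leader_round;
        }
        // Ignore the error if the requester has already gone away.
        let _ = request.reply.send(accepted);
    }
    last_committed_round
}

/// Toy sync state: the ledger height plus the committed certificate rounds.
struct SyncState {
    height: u32,
    /// Maps a committed certificate round to the block height that committed it.
    committed: BTreeMap<u64, u32>,
    bft: mpsc::Sender<CommitRequest>,
}

impl SyncState {
    /// Commits certificates and advances the height together, or does neither.
    fn advance_with_block(&mut self, block: &SyncedBlock) -> Result<(), String> {
        if block.height != self.height + 1 {
            return Err(format!("expected block {}, got {}", self.height + 1, block.height));
        }
        // Ask the BFT to confirm the leader certificate before touching any state.
        let (reply_tx, reply_rx) = mpsc::channel();
        self.bft
            .send(CommitRequest { leader_round: block.leader_round, reply: reply_tx })
            .map_err(|_| "BFT channel closed".to_string())?;
        let confirmed = reply_rx.recv().map_err(|_| "BFT dropped the reply".to_string())?;
        if !confirmed {
            return Err(format!("leader round {} was not committed", block.leader_round));
        }
        // Only now record the certificates and advance the ledger height.
        for round in &block.certificate_rounds {
            self.committed.insert(*round, block.height);
        }
        self.height = block.height;
        Ok(())
    }
}

/// Runs one accepted block and one stale block through the sketch.
fn demo() -> (u32, Vec<u64>, bool) {
    let (bft_tx, bft_rx) = mpsc::channel();
    let bft = thread::spawn(move || run_bft(bft_rx));
    let mut sync = SyncState { height: 0, committed: BTreeMap::new(), bft: bft_tx };

    let first = SyncedBlock { height: 1, leader_round: 2, certificate_rounds: vec![1, 2] };
    // Reuses leader round 2, so the toy BFT refuses it.
    let stale = SyncedBlock { height: 2, leader_round: 2, certificate_rounds: vec![3] };

    sync.advance_with_block(&first).expect("first block should be accepted");
    let stale_rejected = sync.advance_with_block(&stale).is_err();

    let height = sync.height;
    let rounds: Vec<u64> = sync.committed.keys().copied().collect();
    drop(sync); // Closes the channel so the BFT thread exits.
    bft.join().expect("BFT thread panicked");
    (height, rounds, stale_rejected)
}
```

Calling `demo()` from a `main` function runs both blocks through the sketch. The rejected block leaves both the height and the committed rounds untouched, which is the property this PR aims for during sync.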

@mdelle1 mdelle1 requested a review from raychu86 May 22, 2024 19:31
@raychu86 raychu86 marked this pull request as ready for review May 23, 2024 00:35
@vicsn
Contributor

vicsn commented Jun 3, 2024

@raychu86 is this PR ready to go (assuming all tests pass)?

Contributor

@howardwu howardwu left a comment

LGTM for code quality. Please ensure stress tests pass prior to proceeding.

Contributor

@Meshiest Meshiest left a comment

I did some light testing w/ a 4 validator devnet with these changes merged into the latest mainnet-staging and every time I restarted a validator or resynced a validator from genesis I encountered various errors:

dev0.log - Syncing the BFT to block 40..... Syncing the BFT to block 54.. over and over

dev0.log - Failed to speculate on transactions - Failed to post-ratify after a restart after syncing

dev0.log - BFT failed to advance the subdag for round 420 - Leader certificate has an incorrect committee ID + Malicious peer - proposed batch has a different committee ID

I'm not sure how many of these are related to this specific PR or just validator sync issues in general from devnet.

My devnet setup: (git checkout fix/dag-syncing && git merge mainnet-staging && cargo build --release)

./target/release/snarkos start --network 0 --validator --nodisplay --dev 0 --no-dev-txs --dev-num-validators 4 --validators 127.0.0.1:5000,127.0.0.1:5001,127.0.0.1:5002,127.0.0.1:5003 --logfile dev0.log
./target/release/snarkos start --network 0 --validator --nodisplay --dev 1 --validators 127.0.0.1:5000,127.0.0.1:5001,127.0.0.1:5002,127.0.0.1:5003 --logfile dev1.log
./target/release/snarkos start --network 0 --validator --nodisplay --dev 2 --validators 127.0.0.1:5000,127.0.0.1:5001,127.0.0.1:5002,127.0.0.1:5003 --logfile dev2.log
./target/release/snarkos start --network 0 --validator --nodisplay --dev 3 --validators 127.0.0.1:5000,127.0.0.1:5001,127.0.0.1:5002,127.0.0.1:5003 --logfile dev3.log

Reproduction steps:

  1. spin up all 4 devnet validators
  2. wait about 5-10 minutes
  3. stop validator 0, delete its ledger + proposal cache, start validator 0
  4. errors occur within 2 minutes about 50-60% of the time

@vicsn
Contributor

vicsn commented Jun 6, 2024

> I did some light testing w/ a 4 validator devnet with these changes merged into the latest mainnet-staging and every time I restarted a validator or resynced a validator from genesis I encountered various errors:

Interesting findings. I'm not able to reproduce after 5 tries on an M2 Max. Can you share the machine specs used? I assume a slower machine may induce these conditions. Correction: I can reproduce on an M2 Max.

  • On mainnet-staging (81ca9cf), after 3 tries of waiting 10 minutes each, there was no issue.
  • On this branch with mainnet-staging merged in (resulting commit c4c9fb305), with the batch proposal cache deleted, I was also able to trigger it after waiting 4 minutes...
  • On this branch with mainnet-staging merged in (resulting commit c4c9fb305), without deleting the batch proposal cache, I was also able to trigger it after 2 tries of waiting 5 minutes each...

@apruden2008
Contributor

Since the reported errors have not been addressed, we are holding off on merging for now, and therefore on adding this before the code freeze. If we want this, it will have to be after launch.

@raychu86
Contributor

@mdelle1 Have you had a chance to take a look at the issues being observed with the change?

@vicsn
Contributor

vicsn commented Sep 5, 2024

Superseded by #3386

@vicsn vicsn closed this Sep 5, 2024
6 participants