Refactor data column reconstruction and avoid blocking processing #6403
Conversation
… code and avoids having to pass `DataColumnsToPublish` around and blocking other processing.
LGTM.
Just needs some conflicts fixed
good to go modulo conflicts
# Conflicts:
#   beacon_node/beacon_chain/src/beacon_chain.rs
#   beacon_node/beacon_chain/src/data_availability_checker.rs
#   beacon_node/beacon_chain/src/data_availability_checker/overflow_lru_cache.rs
#   beacon_node/network/src/network_beacon_processor/sync_methods.rs
Thanks @michaelsproul and @dapplion! I've resolved conflicts.
# Conflicts:
#   beacon_node/network/src/network_beacon_processor/mod.rs
let Some(pending_components) = self
    .availability_cache
    .peek_pending_components(block_root, |pending_components_opt| {
        pending_components_opt.cloned()
    })
else {
    // Block may have been imported as it does not exist in availability cache.
    return Ok(None);
};

if !self.should_reconstruct(&pending_components) {
    return Ok(None);
}

self.availability_cache
    .set_reconstruction_started(block_root);
There's a potential race condition here. We clone an old view of the pending components, check if `reconstruction_started` is set, then acquire the lock again in `set_reconstruction_started` and set `reconstruction_started` to true. We could have N parallel reconstruction attempts happening:
- thread 1: peek_pending_components + clone + drop lock
- thread 2: peek_pending_components + clone + drop lock
- thread 1: should_reconstruct returns true
- thread 2: should_reconstruct returns true

So either hold the lock until you set_reconstruction_started, or restructure the check to be atomic.
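For illustration, a minimal sketch of the atomic check-and-set approach, holding the write lock across both the check and the flag update so only one caller can win. The types and method names here are simplified, hypothetical stand-ins, not the actual Lighthouse API:

    use std::collections::HashMap;
    use std::sync::RwLock;

    // Simplified stand-ins for the real pending-components cache (hypothetical).
    struct PendingComponents {
        reconstruction_started: bool,
        verified_data_columns: Vec<u64>,
    }

    struct AvailabilityCache {
        critical: RwLock<HashMap<u64, PendingComponents>>,
    }

    impl AvailabilityCache {
        /// Decide whether to reconstruct and mark reconstruction as started in a
        /// single critical section, so concurrent callers cannot both proceed.
        fn check_and_set_reconstruction_started(&self, block_root: u64) -> Option<Vec<u64>> {
            let mut cache = self.critical.write().expect("lock poisoned");
            let pending = cache.get_mut(&block_root)?;
            if pending.reconstruction_started || pending.verified_data_columns.is_empty() {
                return None;
            }
            pending.reconstruction_started = true;
            // Clone what reconstruction needs so the lock is released before the
            // expensive KZG work happens.
            Some(pending.verified_data_columns.clone())
        }
    }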
Good catch!
I think this is fixed in b0c3379.
I've also added error handling to reconstruction failure, so we can recover from potentially bad columns that somehow made it through to the DA checker.
);

self.availability_cache
    .put_kzg_verified_data_columns(
We have the same issue here: we take a snapshot of an old view of the shared state in `imported_custody_column_indexes` and then later acquire the write lock again in `put_kzg_verified_data_columns`. However, if the bug above is fixed and reconstruction strictly happens once we should be fine? Also, worst case we publish the same thing twice. However, now that the entry in the da_checker is not dropped until after import, this can still be problematic and lead to double imports?
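For illustration, a sketch of de-duplicating against already-stored column indices inside the same write lock as the insert, so columns received via gossip in the meantime are not re-published. This reuses the simplified, hypothetical `AvailabilityCache` types from the sketch above and is not the actual `put_kzg_verified_data_columns` signature:

    impl AvailabilityCache {
        /// Insert reconstructed columns and return only the ones that were not
        /// already present, deciding this under the same write lock as the insert.
        fn put_reconstructed_columns(&self, block_root: u64, columns: Vec<u64>) -> Vec<u64> {
            let mut cache = self.critical.write().expect("lock poisoned");
            let pending = cache.entry(block_root).or_insert_with(|| PendingComponents {
                reconstruction_started: true,
                verified_data_columns: vec![],
            });
            let mut newly_added = Vec::new();
            for column in columns {
                if !pending.verified_data_columns.contains(&column) {
                    pending.verified_data_columns.push(column);
                    newly_added.push(column);
                }
            }
            // Only `newly_added` needs to be published, avoiding duplicates.
            newly_added
        }
    }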
> However, if the bug above is fixed and reconstruction strictly happens once we should be fine

Yep I think so. The double import issue has been raised here, I think it might be best to address it separately: #6439
# Conflicts:
#   beacon_node/beacon_chain/src/data_availability_checker/overflow_lru_cache.rs
…y Lion. Also added error handling to reconstruction failure.
pub fn handle_reconstruction_failure(&self, block_root: &Hash256) {
    if let Some(pending_components_mut) = self.critical.write().get_mut(block_root) {
        // Drop the (potentially invalid) columns and clear the flag so reconstruction can be retried.
        pending_components_mut.verified_data_columns = vec![];
        pending_components_mut.reconstruction_started = false;
    }
}
Nice! Will sleep better with this fallback :)
LGTM! Thanks for the iterations, looks solid now.
I checked out all the DA checker / overflow_lru_cache changes. It looks good to me although I can't really comment on race conditions with block import until I have a conversation with you guys about it.
{
    ReconstructColumnsDecision::Yes(pending_components) => pending_components,
    ReconstructColumnsDecision::No(reason) => {
        return Ok(DataColumnReconstructionResult::NotRequired(reason));
I wanted to point out that the Error variant here is called `NotRequired` but `ReconstructionDecision` includes "not enough columns" which (unlike every other variant) doesn't actually mean it's not required, only that it can't be done yet. I thought you might instead change things to:
#[derive(Debug)]
pub enum RejectionCriteria {
    BlockAlreadyImported,
    AlreadyStarted,
    NotRequiredForFullNode,
    AllColumnsReceived,
    NotEnoughColumns,
}

pub enum DataColumnReconstructionResult<E: EthSpec> {
    Success(AvailabilityAndReconstructedColumns<E>),
    NotRequired(RejectionCriteria),
    Pending(RejectionCriteria),
}

pub(crate) enum ReconstructColumnsDecision<E: EthSpec> {
    Yes(PendingComponents<E>),
    No(RejectionCriteria),
}

impl<E: EthSpec> From<RejectionCriteria> for DataColumnReconstructionResult<E> {
    fn from(criteria: RejectionCriteria) -> Self {
        match criteria {
            RejectionCriteria::NotEnoughColumns => DataColumnReconstructionResult::Pending(criteria),
            _ => DataColumnReconstructionResult::NotRequired(criteria),
        }
    }
}
Ah yeah, I thought about having an enum for all the reasons, and I decided to go with a string instead because we don't really need to handle each variant differently. The main usage of the `reason` is for the metric label, and that requires a string, hence I went with it for simplicity and efficiency:

lighthouse/beacon_node/beacon_chain/src/beacon_chain.rs, lines 3199 to 3202 in 0e6eaa2:
metrics::inc_counter_vec(
    &metrics::KZG_DATA_COLUMN_RECONSTRUCTION_INCOMPLETE_TOTAL,
    &[reason],
);
> the Error variant here is called NotRequired but ReconstructionDecision includes "not enough columns" which (unlike every other variant) doesn't actually mean it's not required, only that it can't be done yet.

Yes that's right, if this is confusing I could rename `DataColumnReconstructionResult::NotRequired` to something else?
I've renamed `NotRequired` to `NotStarted`.
// Check indices from cache again to make sure we don't publish components we've already received.
let Some(existing_column_indices) = self.cached_data_column_indexes(block_root) else {
    return Ok(DataColumnReconstructionResult::RecoveredColumnsNotImported(
Just checking my understanding here... but shouldn't this never happen due to the `reconstruction_started` gate?
Yeah good question - this scenario can happen if we receive the columns via gossip, and we no longer need to import/publish them.
@mergify queue
✅ The pull request has been merged automatically at ee7fca3
@mergify requeue
❌ This pull request head commit has not been previously disembarked from queue.
@mergify refresh
✅ Pull request refreshed
Issue Addressed
Continuation of #5990.
I've taken the changes from #5990 with some cleanups. This should simplify the code a bit, reduce supernode bandwidth, and improve performance (reconstruction no longer blocks DA processing for the block).
Proposed Changes
- Move data column reconstruction into `overflow_lru_cache`; this simplifies the code and avoids having to pass `DataColumnsToPublish` around.
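As a rough illustration of the resulting flow, the caller gets any recovered columns back in the reconstruction result instead of threading `DataColumnsToPublish` through. The signatures below are simplified and hypothetical (the real types are generic over `EthSpec` and live in the DA checker); only the variant names mirror the ones discussed above:

    // Simplified, hypothetical shape of the reconstruction result.
    enum DataColumnReconstructionResult<Column> {
        Success(Vec<Column>),
        NotStarted(&'static str),
        RecoveredColumnsNotImported(&'static str),
    }

    // The network processor asks the DA checker to reconstruct after a column is
    // processed; reconstruction no longer blocks availability processing, and any
    // newly recovered columns come back in the result for publishing.
    fn handle_reconstruction_result<Column>(
        result: DataColumnReconstructionResult<Column>,
        publish: impl FnOnce(Vec<Column>),
    ) {
        match result {
            DataColumnReconstructionResult::Success(columns) => publish(columns),
            DataColumnReconstructionResult::NotStarted(_reason) => {
                // Reconstruction was not needed or could not start yet; nothing to publish.
            }
            DataColumnReconstructionResult::RecoveredColumnsNotImported(_reason) => {
                // Columns were recovered but are no longer needed (e.g. already
                // received via gossip), so skip publishing.
            }
        }
    }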