Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[audit] fix: avoid reprocessing withdrawals #20

Merged
merged 6 commits into from
Jan 4, 2024

Conversation

bendanzhentan
Copy link
Contributor

@bendanzhentan bendanzhentan commented Dec 6, 2023

This PR introduces additional checks to prevent redundant processing of withdrawTo transactions in the ProcessBotDelegatedWithdrawals function.

  1. We now compare the chain nonce with the pending nonce prior to processing. A discrepancy implies existence of transactions from the same address currently in the node’s tx-pool. Therefore, processing is deferred until the chain and pending nonces match, circumventing unnecessary reprocessing of already proven or finalized withdrawals.
  2. A local LRU table is implemented as a tombstone, storing recently processed transactions timestamp. For each transaction, we check if it was processed within the past 5 minutes. If this is true, the transaction is skipped.

Implementing these checks enhances our code efficiency and reduces the risk of errors derived from reprocessing transactions

@bendanzhentan bendanzhentan changed the base branch from main to migrate-to-mysql December 6, 2023 02:05
cmd/bot/run.go Outdated
@@ -100,6 +115,11 @@ func ProcessUnprovenBotDelegatedWithdrawals(ctx context.Context, log log.Logger,
}

for _, unproven := range unprovens {
// In order to avoid re-processing the same withdrawal, we use a tombstone to mark the withdrawal as processed.
if hasWithdrawalRecentlyProcessed(&unproven, tombstone) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we already have isPendingAndChainNonceEqual check before calling ProcessUnprovenBotDelegatedWithdrawals, this check should be redundant.

Since there is no pending txns, and this loop will only send at most once for the same ID.
Without this check, the event will get the error OptimismPortal: withdrawal hash has already been proven in dry run, and marked proven then.
But with this check, we have to wait at least 10 min before marking it as proven.

What we want to ensure is that there are no pending txns which handle the same ID. But this check can't achieve this goal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with your analysis. At first, I only implemented the nonce check, but during testing, I found that re-processing still existed. I have no clue that why re-processing still happens even with nonce check, just suspect it’s related to the L1 node(?)

I didn't spend time to troubleshoot this issue, just only add the local tombstone check to check more strictly.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but during testing, I found that re-processing still existed. I have no clue that why re-processing still happens even with nonce check, just suspect it’s related to the L1 node(?)

The pending status is inconsistent and varies between different nodes when called.

One easy and reliable way is checking the nonce of the transaction and the on-chain latest nonce.
If the on-chain latest nonce >= the txn nonce, then either the txn is mined, or replaced by another txn, which leads this txns to be dropped. What we can confirm is this txn is not pending anymore.

Consider the implementation below:

type PendingTxnCheck struct {
  pendingTxns map[uint]nonce
}

// add event id and associated Txn hash to the map
func (c *PendingTxnCheck) AddPendingTx(uint id, hash nonce)

// check whether there is pending Txn for specific event id
// call this function before send tx for event in both approve and finalize process, skip the event if it returns true
func (c *PendingTxnCheck) ExistsPendingTxn(uint id) bool {
  return if the id exists in the map
}

func (c *PendingTxnCheck) CheckLoop() {
  // loop the map, check whether the txn is still pending, if not, remove it from the map
  for {
      // get the latest on-chain nonce
      for id, nonce := range map {       
          // remove the item whose nonce is <= on-chain nonce
      }
      sleep()
  }
}

Another thing we can change is we can always manage the nonce manually.

After we ensured the pending nonce and latest on-chain nonce is equal, we keep the latest on-chain nonce, use it to send next txn and +1 after we do this.

Thus we don't depend on the the underlying function to fetch the nonce. And even if something is wrong(e.g. some txn stuck), we can replace it with new txns.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @owen-reorg, thanks for your review and sorry for late reply.

I have removed hasWithdrawalRecentlyProcessed check and added PendingTxnCheckcheck at c18eb3c

Thanks again.

cmd/bot/run.go Outdated
case <-ctx.Done():
return
}
}
}

func ProcessUnprovenBotDelegatedWithdrawals(ctx context.Context, log log.Logger, db *gorm.DB, l1Client *core.ClientExt, l2Client *core.ClientExt, cfg core.Config) {
func ProcessUnprovenBotDelegatedWithdrawals(ctx context.Context, log log.Logger, db *gorm.DB, l1Client *core.ClientExt, l2Client *core.ClientExt, cfg core.Config, tombstone *lru.Cache[uint, time.Time]) {
processor := core.NewProcessor(log, l1Client, l2Client, cfg)
limit := 1000
maxBlockTime := time.Now().Unix() - cfg.Misc.ProposeTimeWindow
Copy link
Collaborator

@owen-reorg owen-reorg Dec 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use time.Now().Unix() - cfg.Misc.ProposeTimeWindow to filter events to prove is unreliable.
Most of the time it will be faster than this, while when op-proposer fails to submit output root in time, it will result in we try the case and get error in every loop.

Consider using a global variable to track the latestBlockNumber in oracle contract and update it periodically(say every minute) in a separate goroutine.
Then you can compare the block number of the event with this number to ensure the events you take from the db can be proven.

CleanShot 2023-12-06 at 20 50 47@2x

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comparing with the latestBlockNumber could be more accurate!

Will apply this suggestion soon. Thanks

Copy link
Contributor Author

@bendanzhentan bendanzhentan Dec 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apply this suggestion at 5b2740b

There is one difference, instead of maintaining a latestBlockNumber in background, I call L2OutputOracle.latestBlockNumber() every time.

Copy link
Contributor Author

@bendanzhentan bendanzhentan Dec 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mention that there are some changes for DB type at ef5844c

cmd/bot/run.go Outdated
}
}
}

func ProcessUnfinalizedBotDelegatedWithdrawals(ctx context.Context, log log.Logger, db *gorm.DB, l1Client *core.ClientExt, l2Client *core.ClientExt, cfg core.Config) {
func ProcessUnfinalizedBotDelegatedWithdrawals(ctx context.Context, log log.Logger, db *gorm.DB, l1Client *core.ClientExt, l2Client *core.ClientExt, cfg core.Config, tombstone *lru.Cache[uint, time.Time]) {
processor := core.NewProcessor(log, l1Client, l2Client, cfg)
limit := 1000
maxBlockTime := time.Now().Unix() - cfg.Misc.ChallengeTimeWindow
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ChallengeTimeWindow starts from the time when the txn is proven. So in the hour between eventTime + cfg.Misc.ChallengeTimeWindow to proveTime + cfg.Misc.ChallengeTimeWindow, we'll keep trying and get error for the event.

I suggest when we mark the event as proven, we also update the timestamp to the the time it happens.
But this way the field should not be block_time. Maybe we can add a new proven_time field for this.

The accurate way is to monitor the prove event and use the timestamp in the event. But it will take more effect. The solution above should be good enough to reduce the error cases(reduce the dry run call).

			if strings.Contains(err.Error(), "OptimismPortal: withdrawal hash has already been proven") {
				// The withdrawal has already proven, mark it
				result := db.Model(&unproven).Update("proven", true)
				if result.Error != nil {
					log.Error("failed to update proven l2_contract_events", "error", result.Error)
				}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense. This optimization can reduce the re-try times. Will update soon.

Copy link
Contributor Author

@bendanzhentan bendanzhentan Dec 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated at 5b2740b
PTAL, thanks.

@bendanzhentan bendanzhentan force-pushed the migrate-to-mysql branch 2 times, most recently from 78c88d4 to 66af473 Compare December 8, 2023 03:39
Base automatically changed from migrate-to-mysql to develop December 8, 2023 03:40
@bendanzhentan bendanzhentan force-pushed the avoid-reprocessing-withdrawals branch 3 times, most recently from 094a157 to c6292a3 Compare December 8, 2023 07:25
1. check pending nonce and chain nonce before processing
2. check recently processed using local records before processing
1. Rename L2ContractEvent to BotDelegatedWithdrawal
2. Add unique constraint idx_bot_delegated_withdrawals_transaction_hash_log_index_key
3. Add new field `InitiatedBlockNumber int64` to indicate the L2 number
   of initiated withdrawal transaction
3. Add new fields `ProvenTime *Time` and `FinalizedTime *Time` to
   indicate the local time of L1 proven transaction and finalized
   transaction
4. Modify the `FailureReason` to type `FailureReason *string`
1. Determine the proven timing based on the `L2OutputOracle.latestBlockNumber`
2. Determine the finalized timing based on the db `proven_time`
Copy link

@welkin22 welkin22 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@bendanzhentan bendanzhentan merged commit 3c81f35 into develop Jan 4, 2024
2 checks passed
@bendanzhentan bendanzhentan deleted the avoid-reprocessing-withdrawals branch January 4, 2024 03:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants