[audit] fix: avoid reprocessing withdrawals #20
Conversation
cmd/bot/run.go
Outdated
@@ -100,6 +115,11 @@ func ProcessUnprovenBotDelegatedWithdrawals(ctx context.Context, log log.Logger,
 	}
 
+	for _, unproven := range unprovens {
+		// In order to avoid re-processing the same withdrawal, we use a tombstone to mark the withdrawal as processed.
+		if hasWithdrawalRecentlyProcessed(&unproven, tombstone) {
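For context, the tombstone check in this diff might look roughly like the following (a hypothetical sketch; the helper body, the model type, and the 10-minute window mentioned in the discussion below are assumptions, not the PR's actual code):

// hasWithdrawalRecentlyProcessed reports whether this withdrawal was handled
// within the last few minutes, using an in-memory LRU cache as a local tombstone.
func hasWithdrawalRecentlyProcessed(unproven *core.L2ContractEvent, tombstone *lru.Cache[uint, time.Time]) bool {
	processedAt, ok := tombstone.Get(unproven.ID)
	if !ok {
		return false
	}
	return time.Since(processedAt) < 10*time.Minute
}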
If we already have the isPendingAndChainNonceEqual check before calling ProcessUnprovenBotDelegatedWithdrawals, this check should be redundant: since there are no pending txns, this loop will send at most one txn for the same ID.
Without this check, the event will hit the error OptimismPortal: withdrawal hash has already been proven in the dry run and will then be marked as proven.
But with this check, we have to wait at least 10 min before marking it as proven.
What we actually want to ensure is that there are no pending txns handling the same ID, and this check can't achieve that goal.
I agree with your analysis. At first, I only implemented the nonce check, but during testing I found that re-processing still occurred. I have no clue why re-processing still happens even with the nonce check; I just suspect it's related to the L1 node(?)
I didn't spend time troubleshooting this issue, and only added the local tombstone check as a stricter guard.
> but during testing, I found that re-processing still existed. I have no clue why re-processing still happens even with the nonce check; I just suspect it's related to the L1 node(?)

The pending status is inconsistent and varies between different nodes when queried.
One easy and reliable way is to check the nonce of the transaction against the latest on-chain nonce.
If the latest on-chain nonce >= the txn nonce, then either the txn is mined, or it was replaced by another txn, which causes this txn to be dropped. Either way, we can confirm the txn is no longer pending.
Consider the implementation below (sketched here as runnable Go; the latestNonce field is a placeholder for however the sender's latest on-chain nonce is fetched):

type PendingTxnCheck struct {
	mu          sync.Mutex
	pendingTxns map[uint]uint64        // event id -> nonce of the txn sent for that event
	latestNonce func() (uint64, error) // fetches the sender's latest on-chain nonce
}

// AddPendingTx records the event id and the nonce of the txn sent for it.
func (c *PendingTxnCheck) AddPendingTx(id uint, nonce uint64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.pendingTxns[id] = nonce
}

// ExistsPendingTxn reports whether there is a pending txn for the given event id.
// Call it before sending a txn in both the prove and finalize paths, and skip the event if it returns true.
func (c *PendingTxnCheck) ExistsPendingTxn(id uint) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	_, ok := c.pendingTxns[id]
	return ok
}

// CheckLoop periodically walks the map and removes entries whose txns are no longer pending.
func (c *PendingTxnCheck) CheckLoop() {
	for {
		chainNonce, err := c.latestNonce() // latest on-chain nonce of the sender
		if err == nil {
			c.mu.Lock()
			for id, nonce := range c.pendingTxns {
				// a txn whose nonce is already covered by the chain nonce is either mined or dropped
				if nonce <= chainNonce {
					delete(c.pendingTxns, id)
				}
			}
			c.mu.Unlock()
		}
		time.Sleep(time.Minute)
	}
}
Another thing we can change is to always manage the nonce manually.
After we've ensured the pending nonce and the latest on-chain nonce are equal, we keep the latest on-chain nonce locally, use it to send the next txn, and increment it by 1 after we do so.
Thus we don't depend on the underlying function to fetch the nonce. And even if something goes wrong (e.g. some txn gets stuck), we can replace it with a new txn.
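As a rough illustration of that idea (not code from this PR; NonceManager and its methods are hypothetical names, sync is assumed to be imported, and the caller is assumed to call Sync once the pending and latest on-chain nonces match):

type NonceManager struct {
	mu    sync.Mutex
	nonce uint64 // the nonce to use for the next outgoing txn
}

// Sync resets the locally tracked nonce to the latest on-chain nonce,
// e.g. after confirming the pending and latest nonces are equal.
func (m *NonceManager) Sync(latestOnChain uint64) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.nonce = latestOnChain
}

// Next returns the nonce for the next txn and advances the local counter,
// so sending does not depend on the node's view of the pending nonce.
func (m *NonceManager) Next() uint64 {
	m.mu.Lock()
	defer m.mu.Unlock()
	n := m.nonce
	m.nonce++
	return n
}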
Hi @owen-reorg, thanks for your review and sorry for the late reply.
I have removed the hasWithdrawalRecentlyProcessed check and added the PendingTxnCheck check at c18eb3c.
Thanks again.
cmd/bot/run.go
Outdated
 		case <-ctx.Done():
 			return
 		}
 	}
 }
 
-func ProcessUnprovenBotDelegatedWithdrawals(ctx context.Context, log log.Logger, db *gorm.DB, l1Client *core.ClientExt, l2Client *core.ClientExt, cfg core.Config) {
+func ProcessUnprovenBotDelegatedWithdrawals(ctx context.Context, log log.Logger, db *gorm.DB, l1Client *core.ClientExt, l2Client *core.ClientExt, cfg core.Config, tombstone *lru.Cache[uint, time.Time]) {
 	processor := core.NewProcessor(log, l1Client, l2Client, cfg)
 	limit := 1000
 	maxBlockTime := time.Now().Unix() - cfg.Misc.ProposeTimeWindow
Using time.Now().Unix() - cfg.Misc.ProposeTimeWindow to filter the events to prove is unreliable.
Most of the time the output root will land faster than this, but when op-proposer fails to submit the output root in time, we will keep retrying the case and getting an error on every loop.
Consider using a global variable to track the latestBlockNumber of the oracle contract and updating it periodically (say every minute) in a separate goroutine, as sketched below.
Then you can compare the block number of the event with this number to ensure the events you take from the db can actually be proven.
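A minimal sketch of that background tracker, assuming the op-bindings L2OutputOracleCaller wrapper and the context, time, sync/atomic, and bind imports (the repo's actual contract binding and wiring may differ):

// latestProposedBlockNumber holds the most recent L2 block number covered by a
// submitted output root; other goroutines read it before querying the db.
var latestProposedBlockNumber atomic.Int64

func trackLatestBlockNumber(ctx context.Context, oracle *bindings.L2OutputOracleCaller) {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			num, err := oracle.LatestBlockNumber(&bind.CallOpts{Context: ctx})
			if err != nil {
				continue // transient RPC failure; keep the previous value
			}
			latestProposedBlockNumber.Store(num.Int64())
		}
	}
}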
Comparing with the latestBlockNumber could be more accurate!
Will apply this suggestion soon. Thanks
Applied this suggestion at 5b2740b.
There is one difference: instead of maintaining a latestBlockNumber in the background, I call L2OutputOracle.latestBlockNumber() every time.
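For illustration, the per-loop variant might look roughly like this (a hypothetical sketch; the l2OutputOracle binding and the BotDelegatedWithdrawal / initiated_block_number names follow the later commit description and may not match the actual code):

// Only pick up withdrawals whose initiating L2 block is already covered by a
// submitted output root, so the prove call cannot fail for being too early.
latest, err := l2OutputOracle.LatestBlockNumber(&bind.CallOpts{Context: ctx})
if err != nil {
	log.Error("failed to call L2OutputOracle.latestBlockNumber", "error", err)
	return
}

var unprovens []core.BotDelegatedWithdrawal
// (other filters, e.g. not-yet-proven, omitted for brevity)
result := db.Where("initiated_block_number <= ?", latest.Int64()).
	Order("id asc").
	Limit(limit).
	Find(&unprovens)
if result.Error != nil {
	log.Error("failed to query unproven withdrawals", "error", result.Error)
	return
}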
Note that there are also some changes to the DB types at ef5844c.
cmd/bot/run.go
Outdated
 		}
 	}
 }
 
-func ProcessUnfinalizedBotDelegatedWithdrawals(ctx context.Context, log log.Logger, db *gorm.DB, l1Client *core.ClientExt, l2Client *core.ClientExt, cfg core.Config) {
+func ProcessUnfinalizedBotDelegatedWithdrawals(ctx context.Context, log log.Logger, db *gorm.DB, l1Client *core.ClientExt, l2Client *core.ClientExt, cfg core.Config, tombstone *lru.Cache[uint, time.Time]) {
 	processor := core.NewProcessor(log, l1Client, l2Client, cfg)
 	limit := 1000
 	maxBlockTime := time.Now().Unix() - cfg.Misc.ChallengeTimeWindow
The ChallengeTimeWindow starts from the time the txn is proven. So in the period between eventTime + cfg.Misc.ChallengeTimeWindow and proveTime + cfg.Misc.ChallengeTimeWindow, we'll keep retrying the event and getting an error.
I suggest that when we mark the event as proven, we also update the timestamp to the time that happened. But then the field should not be block_time; maybe we can add a new proven_time field for this.
The accurate way is to monitor the prove event and use the timestamp from that event, but it would take more effort. The solution above should be good enough to reduce the error cases (i.e. reduce the dry-run calls).
if strings.Contains(err.Error(), "OptimismPortal: withdrawal hash has already been proven") {
	// The withdrawal has already been proven; mark it
	result := db.Model(&unproven).Update("proven", true)
	if result.Error != nil {
		log.Error("failed to update proven l2_contract_events", "error", result.Error)
	}
}
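The suggested change could look roughly like this (a hypothetical sketch; the proven_time column name follows the suggestion above and may not match the final schema):

// Record both the proven flag and the local time we observed it, so the
// finalize loop can wait ChallengeTimeWindow from proven_time rather than block_time.
now := time.Now()
result := db.Model(&unproven).Updates(map[string]interface{}{
	"proven":      true,
	"proven_time": &now,
})
if result.Error != nil {
	log.Error("failed to update proven withdrawal", "error", result.Error)
}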
Makes sense. This optimization can reduce the number of retries. Will update soon.
Updated at 5b2740b
PTAL, thanks.
Force-pushed from 78c88d4 to 66af473
Force-pushed from 094a157 to c6292a3
Force-pushed from c6292a3 to 5b2740b
1. Check the pending nonce against the chain nonce before processing 2. Check recently processed withdrawals using local records before processing
1. Rename L2ContractEvent to BotDelegatedWithdrawal 2. Add unique constraint idx_bot_delegated_withdrawals_transaction_hash_log_index_key 3. Add new field `InitiatedBlockNumber int64` to indicate the L2 block number of the initiating withdrawal transaction 4. Add new fields `ProvenTime *Time` and `FinalizedTime *Time` to indicate the local time of the L1 proven transaction and finalized transaction 5. Change `FailureReason` to type `FailureReason *string`
1. Determine the proven timing based on `L2OutputOracle.latestBlockNumber` 2. Determine the finalized timing based on the db `proven_time`
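For orientation, the renamed model described in the commits above might look roughly like this (a hypothetical sketch; the actual Go types, gorm tags, and the custom Time type may differ):

type BotDelegatedWithdrawal struct {
	ID                   uint       `gorm:"primarykey"`
	TransactionHash      string     `gorm:"uniqueIndex:idx_bot_delegated_withdrawals_transaction_hash_log_index_key"`
	LogIndex             int        `gorm:"uniqueIndex:idx_bot_delegated_withdrawals_transaction_hash_log_index_key"`
	InitiatedBlockNumber int64      // L2 block number of the initiating withdrawal transaction
	ProvenTime           *time.Time // local time the L1 prove transaction was observed
	FinalizedTime        *time.Time // local time the L1 finalize transaction was observed
	FailureReason        *string    // last failure reason, if any
}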
Force-pushed from 5b2740b to c18eb3c
Force-pushed from c18eb3c to 494b941
LGTM
This PR introduces additional checks to prevent redundant processing of withdrawTo transactions in the ProcessBotDelegatedWithdrawals function.
Implementing these checks improves our code's efficiency and reduces the risk of errors arising from reprocessing transactions.