bug: Operation status labels are reset across relayer restarts #5060

Open
daniel-savu opened this issue Dec 20, 2024 · 1 comment · May be fixed by #5182
@daniel-savu
Contributor

Problem

The relayer defines an enum of the statuses a pending operation (i.e. a pending message) can have. New statuses are assigned based on where a message is in its submission lifecycle:

  • all messages are created with a FirstPrepareAttempt status;
  • if metadata is fetched successfully (e.g. validator signatures for the multisig ISM), the status is updated to ReadyToSubmit;
  • if the transaction on the destination chain is successfully dispatched, the status is updated to Confirm(SubmittedBySelf);
  • after the message is finalized on the destination, it is removed from the in-memory queues.
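To make the lifecycle above concrete, here is a minimal sketch of it as a state machine. The variant names come from this issue; the `Status` and `Event` types and the `next_status` function are simplified, hypothetical stand-ins and not the relayer's actual definitions.

```rust
// Simplified stand-ins for the relayer's status types (assumption: the real
// enum has more variants and reasons than the ones named in this issue).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ReprepareReason {
    CouldNotFetchMetadata,
    GasPaymentNotFound,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ConfirmReason {
    SubmittedBySelf,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    FirstPrepareAttempt,
    Retry(ReprepareReason),
    ReadyToSubmit,
    Confirm(ConfirmReason),
}

// Hypothetical events that move a message through its lifecycle.
enum Event {
    MetadataFetched,
    MetadataUnavailable,
    GasPaymentMissing,
    TxDispatched,
}

// Returns the status a message should have after `event` occurs.
fn next_status(current: Status, event: Event) -> Status {
    match (current, event) {
        // Preparation succeeded: metadata (e.g. validator signatures) was fetched.
        (Status::FirstPrepareAttempt | Status::Retry(_), Event::MetadataFetched) => {
            Status::ReadyToSubmit
        }
        // Preparation failed: record why, so the message is retried later.
        (Status::FirstPrepareAttempt | Status::Retry(_), Event::MetadataUnavailable) => {
            Status::Retry(ReprepareReason::CouldNotFetchMetadata)
        }
        (Status::FirstPrepareAttempt | Status::Retry(_), Event::GasPaymentMissing) => {
            Status::Retry(ReprepareReason::GasPaymentNotFound)
        }
        // The destination-chain transaction was dispatched by this relayer.
        (Status::ReadyToSubmit, Event::TxDispatched) => {
            Status::Confirm(ConfirmReason::SubmittedBySelf)
        }
        // Any other combination leaves the status unchanged in this sketch.
        (unchanged, _) => unchanged,
    }
}
```

For example, `next_status(Status::FirstPrepareAttempt, Event::MetadataFetched)` yields `Status::ReadyToSubmit`, and a further `Event::TxDispatched` yields `Status::Confirm(ConfirmReason::SubmittedBySelf)`.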

The status is also persisted to the local db whenever it is updated, so after a restart all old messages should be loaded into the prep queue with their previous statuses (such as Retry(GasPaymentNotFound)).
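A rough sketch of the expected behaviour, assuming a key-value store keyed by message id. `StatusDb`, `PendingOp`, the `u64` id, and the flattened `Status` enum are all hypothetical, and an in-memory `HashMap` stands in for the local db; none of this is the relayer's actual code.

```rust
use std::collections::HashMap;

// Flattened, hypothetical status enum, just enough for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    FirstPrepareAttempt,
    RetryGasPaymentNotFound,
    ReadyToSubmit,
}

// In-memory stand-in for the relayer's local db (assumption).
#[derive(Default)]
struct StatusDb {
    rows: HashMap<u64, Status>,
}

impl StatusDb {
    fn store(&mut self, id: u64, status: Status) {
        self.rows.insert(id, status);
    }

    fn retrieve(&self, id: u64) -> Option<Status> {
        self.rows.get(&id).copied()
    }
}

struct PendingOp {
    id: u64,
    status: Status,
}

impl PendingOp {
    // Every status update is written through to the db.
    fn set_status(&mut self, db: &mut StatusDb, status: Status) {
        self.status = status;
        db.store(self.id, status);
    }

    // On startup, prefer the persisted status and fall back to
    // FirstPrepareAttempt only for messages that were never seen before.
    fn load(db: &StatusDb, id: u64) -> Self {
        let status = db.retrieve(id).unwrap_or(Status::FirstPrepareAttempt);
        PendingOp { id, status }
    }
}
```

With this shape, a message whose status was set to `RetryGasPaymentNotFound` before a restart is reloaded with that same status, as long as the db itself survives the restart. The behaviour reported in this issue corresponds to `load` always producing `FirstPrepareAttempt`.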

The problem is that a relayer restart causes all old messages to be loaded into the prep queue with a FirstPrepareAttempt status, as can be observed below for celo. The red chunk is always reset upon restart (you may need to zoom in to see it), even though all of the messages showing up as green have been attempted before.

[Screenshot 2024-12-20 at 13 40 21]

https://abacusworks.grafana.net/goto/8cJ2XfIHg?orgId=1

Solution

Statuses should stay consistent across restarts by being stored in and retrieved from the local db. Investigate why this is not currently the case and fix it.

@daniel-savu changed the title from "Operation status labels are reset across relayer restarts" to "bug: Operation status labels are reset across relayer restarts" on Dec 20, 2024
@tkporter moved this to In Progress in Hyperlane Tasks on Jan 2, 2025
@kamiyaa
Collaborator

kamiyaa commented Jan 6, 2025

Discussed this with @daniel-savu

Takeaways:

  • Potentially add a method to State that returns a list of Validator tasks to be dropped.
  • We need a way to drop the relayer task without dropping its tmpdir.
    • This will allow us to restart the relayer and check that message statuses are correctly loaded from the db.
  1. Use kathy to send some messages after the Validators are killed, so we end up with a bunch of ReprepareReason::CouldNotFetchMetadata messages.
  2. Restart the relayer.
  3. Call GET /list_messages and check that the messages are in their correct statuses.
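The check in step 3 could be sketched as a comparison of two status snapshots, one taken before the restart and one after. The response format of GET /list_messages is not described in this thread, so the snapshots are modelled here as hypothetical maps from a message id to a status label; `changed_across_restart` is an illustrative helper and not an existing function.

```rust
use std::collections::HashMap;

// Returns the ids (sorted) of messages whose status label differs between
// the snapshot taken before the restart and the one taken after it.
// A message missing from `after` also counts as changed.
fn changed_across_restart(
    before: &HashMap<u64, String>,
    after: &HashMap<u64, String>,
) -> Vec<u64> {
    let mut changed: Vec<u64> = before
        .iter()
        .filter(|(id, status)| after.get(*id) != Some(*status))
        .map(|(id, _)| *id)
        .collect();
    changed.sort_unstable();
    changed
}
```

The test would pass when the returned list is empty. With the bug described in this issue, every message that was in Retry(CouldNotFetchMetadata) before the restart would be reported, because it comes back as FirstPrepareAttempt.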

@cmcewen moved this from In Progress to In Review in Hyperlane Tasks on Jan 15, 2025