
Prod Release 19/07/24 #897

Merged · 6 commits · Jul 18, 2024

Commits on Jul 18, 2024

  1. refactor: Spawn dedicated control loops per Indexer (#866)

    This PR introduces dedicated, self-contained control loops per Indexer,
    replacing the single combined control loop. The motivation for this work
    is described in #811; you can read more about it there.
    
    Overall, there is plenty of clean-up still to be done, but I wanted to
    get this out the door as quickly as possible so as not to block the
    features that need to be built on top of it. I've discussed some of the
    major concerns below.
    
    ## `LifecycleManager`
    These dedicated control loops are managed by the `LifecycleManager`
    struct. This is a state machine which progresses the Indexer through
    different states depending on the context. The different states and
    their transitions are described on the `LifecycleState` enum:
    ```rust
    /// Represents the different lifecycle states of an Indexer
    #[derive(Default, Debug, Clone, serde::Serialize, serde::Deserialize, PartialEq, Eq)]
    pub enum LifecycleState {
        /// Pre-requisite resources, i.e. Data Layer, are being created.
        ///
        /// Transitions:
        /// - `Running` on success
        /// - `Repairing` on Data Layer provisioning failure
        #[default]
        Initializing,
        /// Indexer is functional; the Block Stream and Executors are continuously monitored to
        /// ensure they are running the latest version of the Indexer.
        ///
        /// Transitions:
        /// - `Stopping` if suspended
        /// - `Running` if Block Stream or Executor fails to synchronise, essentially triggering a
        /// retry
        /// - `Running` on success
        Running,
        /// Indexer is being stopped, Block Stream and Executors are being stopped.
        ///
        /// Transitions:
        /// - `Stopping` on failure, triggering a retry
        /// - `Stopped` on success
        Stopping,
        /// Indexer is stopped, Block Stream and Executors are not running.
        ///
        /// Transitions:
        /// - `Running` if unsuspended
        Stopped,
        /// Indexer is in a bad state; it currently requires manual intervention, but should
        /// eventually self-heal. For now, this is a dead-end state.
        ///
        /// Transitions:
        /// - `Repairing` continuously
        Repairing, // TODO: Add `error` to enable repair
        /// Indexer is being deleted, all resources are being cleaned up
        ///
        /// Transitions:
        /// - `Deleting` on failure, triggering a retry
        /// - `Deleted` on success
        Deleting,
        /// Indexer is deleted, all resources are cleaned up, lifecycle manager will exit
        Deleted,
    }
    ```
    
    The logic of this `struct` is very light: it triggers the high-level
    actions required within each state and then returns the next desired
    state. Most of the "doing" logic has been encapsulated in the other
    related `structs`, as discussed below.
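
    As a rough illustration of how "return the next desired state" maps onto
    the enum above, here is a minimal, hypothetical transition function. The
    real `LifecycleManager` performs the actual work inside each arm
    (provisioning, synchronising, stopping, and so on); the `succeeded` and
    `suspended` inputs are stand-ins for that context:
    ```rust
    /// Hypothetical sketch: mirrors the transitions documented on `LifecycleState`.
    fn next_state(current: &LifecycleState, succeeded: bool, suspended: bool) -> LifecycleState {
        match current {
            LifecycleState::Initializing if succeeded => LifecycleState::Running,
            LifecycleState::Initializing => LifecycleState::Repairing,
            LifecycleState::Running if suspended => LifecycleState::Stopping,
            // Both success and a failed synchronisation remain in `Running`,
            // the latter effectively triggering a retry on the next iteration.
            LifecycleState::Running => LifecycleState::Running,
            LifecycleState::Stopping if succeeded => LifecycleState::Stopped,
            LifecycleState::Stopping => LifecycleState::Stopping,
            LifecycleState::Stopped if !suspended => LifecycleState::Running,
            LifecycleState::Stopped => LifecycleState::Stopped,
            LifecycleState::Repairing => LifecycleState::Repairing,
            LifecycleState::Deleting if succeeded => LifecycleState::Deleted,
            LifecycleState::Deleting => LifecycleState::Deleting,
            // `Deleted` is terminal; the real control loop exits here.
            LifecycleState::Deleted => LifecycleState::Deleted,
        }
    }
    ```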
    
    The lifecycle state is stored in Redis so that the Indexer can pick up
    where it left off. A migration has been added to accommodate this new
    field, which replaces the existing `provisioned_state` field.
    
    ## `Handler`s
    Previously, the "handlers", e.g. `BlockStreamsHandler`, were lightweight
    `structs` which wrapped the gRPC client/methods. In this PR, I've moved
    all "synchronisation" logic into these structs. So rather than calling,
    for example, the `data_layer_handler.start_provisioning_task()` method,
    we can call `ensure_provisioned()`, which manages all related logic. I
    feel this improves encapsulation, and allows the `LifecycleManager` to
    stay light.
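
    A rough sketch of the shape this takes; the parameter and helper names
    here are illustrative, only `ensure_provisioned()` and
    `start_provisioning_task()` come from this PR:
    ```rust
    // Illustrative only: the real handler wraps the gRPC client, and the
    // internal steps/signatures may differ from the final code.
    impl DataLayerHandler {
        /// Idempotently drives the Data Layer to a provisioned state, so the
        /// `LifecycleManager` never deals with individual provisioning tasks.
        pub async fn ensure_provisioned(&self, config: &IndexerConfig) -> anyhow::Result<()> {
            if self.is_provisioned(config).await? {
                return Ok(());
            }

            self.start_provisioning_task(config).await?;
            self.wait_for_provisioning_completion(config).await
        }
    }
    ```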
    
    I've had to remove `automock`, so we don't have mocked versions for this
    right now. Cloning mocked versions is tricky, and requires manual
    mocking. Rather than bloat this PR, I've left this out. Eventually, I'll
    separate the "sync" logic from the "client" logic, so that the latter
    can be easily mocked, and the sync logic covered by unit tests.
    
    Additionally, I've added `get` methods to both the Block Streamer and
    Executors RPCs, as listing is no longer convenient now that we manage
    Indexers individually. The getters use `account_id` and `function_name`
    rather than IDs. I'm considering moving away from IDs entirely, since
    the only way to obtain them is via the list call, which isn't helpful.
    Right now it's somewhat of a transitional state.
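
    For illustration, the getter might look roughly like this, assuming a
    tonic-generated client and `GetStreamRequest`/`StreamInfo` messages (the
    actual proto, message, and method names may differ):
    ```rust
    impl BlockStreamsHandler {
        /// Hypothetical getter keyed by account and function name rather
        /// than by stream ID; returns `None` if no Block Stream exists.
        pub async fn get(
            &self,
            account_id: &str,
            function_name: &str,
        ) -> anyhow::Result<Option<StreamInfo>> {
            let request = GetStreamRequest {
                account_id: account_id.to_string(),
                function_name: function_name.to_string(),
            };

            match self.client.clone().get_stream(request).await {
                Ok(response) => Ok(Some(response.into_inner())),
                Err(status) if status.code() == tonic::Code::NotFound => Ok(None),
                Err(status) => Err(status.into()),
            }
        }
    }
    ```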
    morgsmccauley committed Jul 18, 2024 · f2cdc78
  2. feat: Expose health info from Block Stream/Executor RPC (#889)

    This PR exposes a `health` field on the Block Stream/Executor info,
    which can be accessed via RPC. The intent of this field is for
    Coordinator to monitor it and act accordingly. I wanted to raise this
    work first, so that the overall change is not too large.
    
    Essentially, `health` contains only a single `enum` describing the
    "state" of the process, but this can be expanded over time as needed.
    morgsmccauley committed Jul 18, 2024 · 29bde3c
  3. feat: Restart stalled Block Streams & Executors (#891)

    Coordinator will now continuously monitor Block Stream and Executor
    health, and restart them if they are "stalled". The goal is to avoid the
    need for manual intervention on stopped processes.
    
    "Stalled" means slightly different things for each (a rough sketch of
    the reaction follows the list):
    - Block Stream - not actively processing blocks despite being able to
    - Executor - an uncaught error was encountered, causing the thread to
    exit
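
    Something along these lines, where the handler methods,
    `ProcessingState`, and `IndexerConfig` are assumed names rather than the
    exact code:
    ```rust
    // Illustrative only: how Coordinator's monitoring might react to the
    // health reported over the Block Streamer RPC.
    async fn ensure_block_stream_healthy(
        handler: &BlockStreamsHandler,
        config: &IndexerConfig,
    ) -> anyhow::Result<()> {
        match handler.get(&config.account_id, &config.function_name).await? {
            // No stream running at all: start one.
            None => handler.start(config).await,
            // Stream exists but reports itself as stalled: restart it.
            Some(info) if info.health.processing_state == ProcessingState::Stalled => {
                handler.restart(config).await
            }
            // Otherwise it is healthy, nothing to do.
            Some(_) => Ok(()),
        }
    }
    ```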
    morgsmccauley committed Jul 18, 2024 · ee5845b
  4. 7dfb2aa (committed Jul 18, 2024)
  5. feat: In-Memory DmlHandler Test Fixture (#809)

    In order to enable quick testing of indexer code during local indexer
    development, there needs to be a way to functionally mock the
    dependencies used by the Indexer code. These dependencies are roughly
    encapsulated under the context object. In particular, context.db is of
    high importance, as it is the means through which Indexers interact with
    their persistent data. This PR focuses on creating an MVP of an
    in-memory DML Handler Test Fixture which can be used interchangeably
    with an actual DmlHandler. This allows context.db calls to pretend to be
    real calls to an actual Postgres DB without an instance actually running
    (which would be hard to integrate into unit testing).
    
    This PR is mainly focused on getting a basic DmlHandler test fixture out
    the door. It exposes the same functions, which behave roughly as they
    would if called against an actual DB. There is the added benefit of
    being extremely quick to tear down, as all data is held in memory,
    making it suitable for fast iteration through unit testing. However,
    since I am essentially mocking Postgres responses, the correctness
    standard is much lower than with an actual PG DB; but since DmlHandler
    simplifies user interactions with the DB anyway, we can mock a great
    deal of Indexer context.db use cases well enough to be useful for end
    users.
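
    To illustrate the pattern (shown in Rust to match the rest of this
    document; the actual DmlHandler lives in the Runner codebase and none of
    these names are the real API), the fixture implements the same interface
    as the real handler, so code under test can use either interchangeably:
    ```rust
    use std::collections::HashMap;

    /// Hypothetical interface shared by the real handler and the fixture.
    pub trait Dml {
        fn insert(&mut self, table: &str, row: HashMap<String, String>);
        fn select(&self, table: &str) -> Vec<HashMap<String, String>>;
    }

    /// In-memory stand-in: rows live in a map, so setup/teardown is instant
    /// and no Postgres instance is required.
    #[derive(Default)]
    pub struct InMemoryDmlHandler {
        tables: HashMap<String, Vec<HashMap<String, String>>>,
    }

    impl Dml for InMemoryDmlHandler {
        fn insert(&mut self, table: &str, row: HashMap<String, String>) {
            self.tables.entry(table.to_string()).or_default().push(row);
        }

        fn select(&self, table: &str) -> Vec<HashMap<String, String>> {
            self.tables.get(table).cloned().unwrap_or_default()
        }
    }
    ```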
    darunrs committed Jul 18, 2024 · f1c1757
  6. fix: Avoid restarting block streams which have just started (#895)

    On Block Streamer startup, all Block Streams are started in one big
    herd, and some of them can take a while to start processing. These end
    up getting marked as "Stalled", and are therefore restarted.
    
    This PR increases the monitoring scrape interval, so that Block Streams
    have a longer window to prove they are "Processing" and therefore do not
    get restarted. It also bumps the "stale" threshold in Coordinator, so
    streams are less likely to be marked as Stale.
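
    The relationship between the two values is roughly as follows; the
    constant names and numbers here are purely illustrative, not the values
    used in the PR:
    ```rust
    use std::time::Duration;

    // Hypothetical tuning: the stale threshold should comfortably exceed the
    // health scrape interval, otherwise a freshly started Block Stream can
    // be flagged as stalled before it has had a chance to report progress.
    const HEALTH_SCRAPE_INTERVAL: Duration = Duration::from_secs(30);
    const STALE_THRESHOLD: Duration = Duration::from_secs(120);

    /// A stream is considered stale only once it has gone longer than the
    /// threshold without reporting progress.
    fn is_stale(time_since_last_progress: Duration) -> bool {
        time_since_last_progress > STALE_THRESHOLD
    }
    ```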
    morgsmccauley committed Jul 18, 2024 · 0177c26