fix: Add metrics for reliably measuring Block Stream/Executor health #843
The current methods for determining both Block Stream and Executor health are flawed. This PR addresses these flaws by adding new, more reliable metrics for use within Grafana.
Block Streams
A Block Stream is considered healthy if LAST_PROCESSED_BLOCK is continuously incremented, i.e. we are continuously downloading blocks from S3. This is flawed for the following reasons:
To address these flaws, I've introduced a new dedicated metric: BLOCK_STREAM_UP, which:
Executors
An Executor is considered unhealthy if it has messages in the Redis Stream and no reported execution durations, the latter being recorded only on success. The inverse of this is used to determine "healthy". This is flawed for the following reasons:
To address these flaws, I have added the following metrics:
EXECUTOR_UP, which is incremented on every Executor loop; like above, a static value means unhealthy.
SUCCESSFUL_EXECUTIONS/FAILED_EXECUTIONS, which track successful/failed executions directly, rather than inferring them from durations. This will be useful for tracking the health of specific Indexers, e.g. the staking indexer should never have failed executions.
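To illustrate the idea behind these metrics, here is a minimal sketch, not the code in this PR. It assumes a toy in-memory `Counter` class in place of a real metrics client, and the helper names (`executor_iteration`, `is_healthy`) and the per-indexer label are hypothetical. The heartbeat counter is incremented on every loop iteration regardless of outcome, and success/failure are counted directly rather than derived from durations.

```python
from collections import defaultdict


class Counter:
    """Toy monotonic counter keyed by a label (stand-in for a real metrics client)."""

    def __init__(self):
        self._values = defaultdict(int)

    def inc(self, label="", amount=1):
        self._values[label] += amount

    def value(self, label=""):
        return self._values[label]


# Hypothetical stand-ins for the metrics named in this description.
EXECUTOR_UP = Counter()
SUCCESSFUL_EXECUTIONS = Counter()
FAILED_EXECUTIONS = Counter()


def executor_iteration(indexer, handler):
    """One pass of an executor loop: bump the heartbeat, then record the outcome."""
    EXECUTOR_UP.inc(indexer)  # incremented on every loop, success or failure
    try:
        handler()
    except Exception:
        FAILED_EXECUTIONS.inc(indexer)
    else:
        SUCCESSFUL_EXECUTIONS.inc(indexer)


def is_healthy(previous_sample, current_sample):
    """A heartbeat that did not move between two samples is treated as unhealthy."""
    return current_sample > previous_sample


def failing():
    raise RuntimeError("simulated indexer failure")


# Usage: sample the heartbeat, run two iterations (one failing), sample again.
before = EXECUTOR_UP.value("staking")
executor_iteration("staking", lambda: None)
executor_iteration("staking", failing)
after = EXECUTOR_UP.value("staking")
```

Comparing two samples of the heartbeat separates "the loop is stuck" from "the loop runs but executions fail", which the duration-based check could not distinguish.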