-
Notifications
You must be signed in to change notification settings - Fork 671
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Detect node is behind #3373
base: master
Are you sure you want to change the base?
Detect node is behind #3373
Conversation
73769f0
to
4c8303c
Compare
4c8303c
to
9ef0c38
Compare
b2c9a19
to
91cc26e
Compare
f732a03
to
670aa53
Compare
|
||
validators := nodeWeights(sd.connectedValidators().List()) | ||
|
||
nodeWeight2lastAccepted := make(nodeWeights2Blocks, len(validators)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we change the name of this variable as well eg nodeWeightToLastAccepted
?
|
||
ratio := float64(totalWeightWeKnowItsLastAcceptedBlock) / float64(totalValidatorWeight) | ||
|
||
if ratio < 0.8 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we use the constant here as well?
zap.Float64("ratio", ratio)) | ||
return nil, 0, 0 | ||
} | ||
return nodeWeight2lastAccepted, totalValidatorWeight, totalWeightWeKnowItsLastAcceptedBlock |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems we do the calculation of requiring 0.8 of the stake both in getNetworkSnapshot
and getNetworkInfo
, could this computation be simplified somehow?
f func(time.Duration) | ||
} | ||
|
||
func NewDecoratedEngine(e *Engine, time func() time.Time, f func(time.Duration)) common.Engine { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we call this something else like WithStragglerDetection
or NewStragglerDetector
, sine Decorated seems more generic.
Also seems we could unexport the decoratedEngine
struct, since we're returning common.Engine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems we do the calculation of requiring 0.8 of the stake both in getNetworkSnapshot and getNetworkInfo, could this computation be simplified somehow?
I can do it if you feel strongly about it, it's quite easy to do that.
However after thinking about it, I'm not sure that's ideal. I know the code can be made simpler by not repeating the computation, but, I structured it because I wanted the logs to distinguish between the following two cases:
- We do not know yet the last accepted block for enough nodes (stake)
- We know the last accepted block for enough nodes (stake) but most of them is exactly like our own last accepted block.
I can easily remove the code handling the first case, but then if we don't proceed checking if we're behind the rest, the logs may falsely state it is because our last accepted block is as the rest, but it might be that we simply haven't gathered enough info.
f func(time.Duration) | ||
} | ||
|
||
func NewDecoratedEngineWithStragglerDetector(e *Engine, time func() time.Time, f func(time.Duration)) common.Engine { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we use mockable.Clock
instead of time func() time.Time
if we require it for testing? We might not even need it to be here and mockable.Clock can be just a field in stragglerDetectorConfig
and test can set it directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I changed the tests to use mockable.Clock
, good idea.
However I kept the function pointer func() time.Time()
as I don't want a mock to be part of the production code.
de.Engine.Config.Ctx.Log.Info("We are behind the rest of the network", zap.Float64("seconds", behindDuration.Seconds())) | ||
} | ||
de.Engine.metrics.stragglingDuration.Set(float64(behindDuration)) | ||
de.f(behindDuration) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this for testing only? If so, I feel like this hook is equivalent of directly testing CheckIfWeAreStragglingBehind()
. So not sure if we really need it here. If we really want to test this maybe we should test the set metrics?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not for testing. This is a decorator, so it invokes the method it decorates but also invokes a function, which is pluggable. The pluggable function it invokes, if this f
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still slightly think this is a bit extra (since we don't actually use it in production). but if you think we should keep it, can we rename it and add comment so that it's visible for consumers?
getTime func() time.Time, | ||
log logging.Logger, | ||
minConfirmationThreshold float64, | ||
lastAccepted func() (ids.ID, uint64), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we rather take tracker.Accepted
interface here? (or any sub-interface that provides this function)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is an interface superior to a function pointer in this case? If I put an interface I would need to make a mock structure for it in the tests. Can you please explain?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally interfaces provides more "defined" and flexible structures rather than pointer functions. I don't particularly like function pointers because they lock the function signature and all functions expecting the function pointer needs to be changed if it is modified. Also Mocking complex functions as a single interface can be much more easily done through mockgens. So in the end it's all readability, code standards and flexibility. IMHO having one or very specific pointer function is fine, but repeating it for the sake of easy testing is generally an issue.
log logging.Logger, | ||
minConfirmationThreshold float64, | ||
lastAccepted func() (ids.ID, uint64), | ||
connectedValidators func() set.Set[ids.NodeWeight], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we take a single tracker.Peers
interface for this and below (connectedPercent) funcs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made some changes to the struct, let me know if it looks better now.
newProcessingWeight, err := safemath.Add(processingWeight, nw.Weight) | ||
if err != nil { | ||
sd.log.Error("Cumulative weight overflow", zap.Uint64("cumulative", processingWeight), zap.Uint64("added", nw.Weight)) | ||
return false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we instead just return err and handle it in either CheckIfWeAreStragglingBehind
or again return err from there and handle wherever CheckIfWeAreStragglingBehind is called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This error means the total weight of the subnet has overflowed uint64. I don't think it's worth to change the API to propagate this further up, but if you feel strongly about it, I'll change it accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say we can return err from failedCatchingUp
and handle in CheckIfWeAreStragglingBehind
, but really does not super matter if we dont want to return err from CheckIfWeAreStragglingBehind
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the failedCatchingUp
method (now called areWeBehindTheRest
) returns true or false whether it managed to detect the caller node is behind. Handling an error returned by it by the calling method is simply logging it. Therefore I don't see why we need to propagate the error further upwards, if we can handle it at a lower level and simplify the method signature of failedCatchingUp
.
|
||
res.connStakePercent = sd.connectedPercent() | ||
|
||
validators := nodeWeights(sd.connectedValidators().List()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems racy with connectedPercent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it does not matter if the validators change after the connectedPercent was read.
We only use the connected stake to know if to abort early or continue the assessment.
Let's look at two possibilities:
- The validators have more stake than connected percent, and we abort early in
validateNetInfo
because connected percent was below the threshold: Then, we will simply not detect whether we are behind in this probe. This scheduling is equivalent to a scheduling where we read the connected percent and the validators in closer temporal proximity. - We read a connected percent above the threshold, but afterwards, we read validators such that the connected percent is below the threshold: This is equivalent to the logic ignoring the connected percent altogether, and just relying on the data from
connectedValidators()
.
Initially, in the first version of this PR, the connected percent was the first thing we compute, and we wouldn't continue computing anything if it was too low.
However, after the first code review, I changed the computation to first collect data, and then validate it.
I can remove the connected percent, as it might make the implementation simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd not oppose to have a simpler implementation if you say this is not ultra necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed it in the follow-up commit
return nodeWeights(maps.Keys(nwb)).totalWeight() | ||
} | ||
|
||
func dropHeight(f func() (ids.ID, uint64)) func() ids.ID { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this function is a bit difficult to follow, and not sure if we %100 need it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add comment or prefably drop the function? IMHO we should be using tracker.Accepted interface anyway
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we remove this function? Seems like overkill to use in a single place
78901b7
to
ce5cb7c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few comments
I think the case we'd care about most is when the chain has fallen sufficiently far behind that the last accepted block reported by most peers is a block that we have not verified or marked as processing. This seems to only catch the case that we have a long processing chain that is up to date with the latest accepted block of the network.
stragglingDuration: prometheus.NewGauge(prometheus.GaugeOpts{ | ||
Name: "straggling_duration", | ||
Help: "For how long we have been straggling behind the rest, in nano-seconds.", | ||
}), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: could we put this at the end ?
@@ -37,10 +36,10 @@ type Peers interface { | |||
SampleValidator() (ids.NodeID, bool) | |||
// GetValidators returns the set of all validators | |||
// known to this peer manager | |||
GetValidators() set.Set[ids.NodeID] | |||
GetValidators() set.Set[ids.NodeWeight] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we return a map instead of set if we need both? The nodeIDs should be unique within the set, but this change makes it possible to add the nodeID twice with two different weights. This shouldn't happen because we're creating it on demand, but it seems like an odd use of set to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to use the validator set type on the snowman config type rather than making this change to Peers
? https://github.com/ava-labs/avalanchego/blob/master/snow/engine/snowman/config.go#L23
engine := smeng.NewDecoratedEngineWithStragglerDetector(sme, time.Now, func(_ time.Duration) {}) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we include a comment on why the anonymous function here is empty?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why make this a decorator instead of part of the engine?
return nodeWeights(maps.Keys(nwb)).totalWeight() | ||
} | ||
|
||
func dropHeight(f func() (ids.ID, uint64)) func() ids.ID { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we remove this function? Seems like overkill to use in a single place
|
||
// Ensure we have collected last accepted blocks that are not our own last accepted block. | ||
nodeWeightToLastAccepted = nodeWeightToLastAccepted.filter(func(id ids.ID) bool { | ||
return ourLastAcceptedBlock.Compare(id) != 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return ourLastAcceptedBlock.Compare(id) != 0 | |
return ourLastAcceptedBlock != id |
|
||
totalValidatorWeight, err := validators.totalWeight() | ||
if err != nil { | ||
s.log.Error("Failed computing total weight", zap.Error(err)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than emitting error logs in each place where we return an error, would it make sense to leave logging the error to the caller? Seems this is only called from one place.
return res, nil | ||
} | ||
|
||
type netInfo struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we include some comments explaining the phases of filtering?
The process seems to be:
- Gather the last accepted block from each validator
- Filter out any blocks that match our own last accepted block in
getNetworkInfo
- Filter out any blocks that we have marked as processing in
areWeBehindTheRest
but this requires reading from the bottom to top of the file to understand the flow.
How do we handle blocks that are accepted by the network, but we have not added to processing yet because we have not fetched the full ancestry?
Since we are using consensus.Processing(...)
, we'll only count blocks that we have already verified, but I thought the intent behind this PR was to detect scenarios where we are sufficiently far behind that the blocks reported as the last accepted block by our peers are far enough ahead that we have not verified them.
This commit adds a mechanism that detects that our node is behind the majority of the stake. The intent is to later have this mechanism be the trigger for the bootstrapping mechanism. Currently, the bootstrapping mechanism is only active upon node boot, but not at a later point. The mechanism works in the following manner: - It intercepts the snowman engine's Chits message handling, and upon every reception of the Chits message, the mechanism that detects if the node is a straggler (a node with a ledger height behind the rest) may be invoked, if it wasn't invoked too recently. - The mechanism draws statistics from the validators known to it, and computes the latest accepted block for each validator. - The mechanism then proceeds to determine which blocks are pending to be processed (a block pending to be processed was not accepted). - The mechanism then collects a snapshot of all blocks it hasn't accepted yet, and the amount of stake that has accepted this block. - The mechanism then waits for its next invocation, in order to see if it has accepted blocks correlated with enough stake. - If there is too much stake that has accepted blocks by other nodes correlated to it that the node hasn't accepted, then the mechanism announces the node is behind, and returns the time period between the two invocations. - The mechanism sums the total time it has detected the node is behind, until a sampling concludes it is not behind, and then the total time is nullified. Signed-off-by: Yacov Manevich <[email protected]>
Signed-off-by: Yacov Manevich <[email protected]>
Signed-off-by: Yacov Manevich <[email protected]>
Signed-off-by: Yacov Manevich <[email protected]>
Signed-off-by: Yacov Manevich <[email protected]>
ce5cb7c
to
d2aecae
Compare
Signed-off-by: Yacov Manevich <[email protected]>
d2aecae
to
438768d
Compare
This PR has become stale because it has been open for 30 days with no activity. Adding the |
Why this should be merged
Currently, the bootstrapping mechanism is only active upon node boot, but not at a later point.
This commit adds a mechanism that detects that our node is behind the majority of the stake.
The intent is to later have this mechanism be the trigger for the bootstrapping mechanism.
How this works
The mechanism works in the following manner:
the mechanism that detects if the node is a straggler (a node with a ledger height behind the rest) may be invoked,
if it wasn't invoked too recently.
the mechanism announces the node is behind, and returns the time period between the two invocations.
How this was tested
Apart from unit tests, I conducted an experiment by running a node on Fuji.
I added delay when verifying a block:
and let the nodes run a few minutes in trace logging level.
Although snowman devolving back into bootstrapping and then returning back to snowman is a subsequent commit, the above log shows that the detection mechanism works.