We currently fall back to the slow quorum path if any replica fails. We already have heartbeat information from all replicas, so we should instead use it to detect which replicas are healthy and wait only for those.
https://github.com/pytorch-labs/torchft/blob/main/src/lighthouse.rs#L386
The heartbeat threshold should be configurable, though since we currently heartbeat every 100ms, a 1s timeout seems fine.
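A minimal sketch of the health check, assuming the lighthouse keeps a map from replica ID to the time of its last heartbeat. The function name `healthy_replicas` and its parameters are hypothetical and not taken from `lighthouse.rs`; the timeout is passed in so it stays configurable.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical helper (not part of torchft): returns the sorted IDs of
/// replicas whose last heartbeat arrived less than `timeout` before `now`.
pub fn healthy_replicas(
    heartbeats: &HashMap<String, Instant>,
    now: Instant,
    timeout: Duration,
) -> Vec<String> {
    let mut healthy: Vec<String> = heartbeats
        .iter()
        // saturating_duration_since returns zero if `last` is later than `now`,
        // so a heartbeat recorded "in the future" still counts as fresh.
        .filter(|(_, last)| now.saturating_duration_since(**last) < timeout)
        .map(|(id, _)| id.clone())
        .collect();
    // Sort so the result does not depend on HashMap iteration order.
    healthy.sort();
    healthy
}
```

With a 100ms heartbeat interval, a 1s timeout tolerates roughly nine missed heartbeats before a replica is treated as unhealthy.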
We may also want to extract the quorum algorithm into separate configurable/pluggable strategies so we can switch between the old and the new logic.
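One possible shape for this, as a rough sketch: a trait with one implementation per policy. All names here (`QuorumInput`, `QuorumStrategy`, `RequireAllPrevious`, `RequireHealthyPrevious`) are hypothetical, and both implementations are deliberate simplifications of the fast-path decision rather than the actual logic in `lighthouse.rs`.

```rust
use std::collections::BTreeSet;

/// Hypothetical snapshot of the state a strategy needs to decide.
pub struct QuorumInput {
    /// Members of the previous quorum.
    pub prev_quorum: BTreeSet<String>,
    /// Replicas that have asked to join the next quorum.
    pub joined: BTreeSet<String>,
    /// Replicas with a recent heartbeat.
    pub healthy: BTreeSet<String>,
}

pub trait QuorumStrategy {
    /// Returns true if the quorum can be formed now without the slow path.
    fn fast_quorum_ready(&self, input: &QuorumInput) -> bool;
}

/// Simplified stand-in for the old behavior: every previous member must rejoin.
pub struct RequireAllPrevious;

impl QuorumStrategy for RequireAllPrevious {
    fn fast_quorum_ready(&self, input: &QuorumInput) -> bool {
        !input.prev_quorum.is_empty() && input.prev_quorum.is_subset(&input.joined)
    }
}

/// Proposed behavior: only previous members that are still healthy must rejoin.
pub struct RequireHealthyPrevious;

impl QuorumStrategy for RequireHealthyPrevious {
    fn fast_quorum_ready(&self, input: &QuorumInput) -> bool {
        let expected: BTreeSet<String> = input
            .prev_quorum
            .intersection(&input.healthy)
            .cloned()
            .collect();
        !expected.is_empty() && expected.is_subset(&input.joined)
    }
}
```

Holding the chosen policy as a `Box<dyn QuorumStrategy>` would let a config option select between the two at startup.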
Relevant existing tests: