Recovery line may recover parts of different global checkpoints #2

Open
ArcadeMode opened this issue Aug 16, 2021 · 0 comments

@ArcadeMode
Contributor

In coordinated mode, the ChandyLamportBarrierSource may generate barriers while a failure is present (but undetected) in the system.
The issue is rarely observed with checkpoint intervals of 20-30 seconds or more, but at lower intervals the failure detection time becomes roughly equal to or higher than the checkpoint interval, meaning it is almost guaranteed that barriers are generated while a failure is in the system. This causes a partial global checkpoint to be taken (barriers won't move past the failed instance). The recovery line calculation then computes a consistent global checkpoint that may include checkpoints from the previous global checkpoint, resulting in part of the latest global checkpoint and part of the second-latest global checkpoint being restored.
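
To make the failure mode concrete, here is a toy model in Python (not the project's actual recovery line algorithm). The worker names, the `checkpoints` layout, and all function names are hypothetical and exist only for illustration. It assumes each instance records the IDs of the global checkpoints it took part in, and that a barrier for checkpoint 2 was emitted but never moved past the failed instance.

```python
from typing import Dict, List, Optional


def latest_per_worker(checkpoints: Dict[str, List[int]]) -> Dict[str, int]:
    """Naive recovery line: restore every worker from its own latest checkpoint."""
    return {worker: max(ids) for worker, ids in checkpoints.items()}


def spans_multiple_global_checkpoints(line: Dict[str, int]) -> bool:
    """True if the recovery line mixes parts of different global checkpoints."""
    return len(set(line.values())) > 1


def latest_complete_global_checkpoint(
    checkpoints: Dict[str, List[int]]
) -> Optional[int]:
    """Latest global checkpoint ID that every worker has completed, if any."""
    common = set.intersection(*(set(ids) for ids in checkpoints.values()))
    return max(common) if common else None


# Hypothetical pipeline: the barrier for checkpoint 2 reached "source" and
# "map", but never got past "failed_join", so "sink" never saw it either.
checkpoints = {
    "source": [1, 2],
    "map": [1, 2],
    "failed_join": [1],
    "sink": [1],
}

line = latest_per_worker(checkpoints)
mixed = spans_multiple_global_checkpoints(line)
complete = latest_complete_global_checkpoint(checkpoints)
```

In this model `line` restores `source` and `map` from checkpoint 2 and the other two instances from checkpoint 1, which is the mixed restore described above. Whether restricting recovery to the latest complete global checkpoint is the right fix for this project is an open question; the sketch only shows how the mix can be detected.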

A hotfix has been put in place that preemptively stops the barrier generation timer when a connection from the coordinator to a worker fails. This is not a proper fix, but it does reduce the occurrence of this behavior significantly. Since the behavior is easily observed in rollback distance metrics, cases where it is triggered can be re-run, with a high probability that the re-run will be successful.
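
The idea behind the hotfix can be sketched roughly as follows. This is a simplified Python model, not the project's code: `BarrierTimer`, `Coordinator`, `tick`, and `on_worker_connection_failed` are hypothetical names, and the timer is driven by explicit `tick()` calls instead of a real clock so the behavior is easy to follow.

```python
from typing import Callable, List


class BarrierTimer:
    """Emits one barrier per tick until it is stopped."""

    def __init__(self, emit_barrier: Callable[[], None]) -> None:
        self._emit_barrier = emit_barrier
        self.running = True

    def tick(self) -> None:
        # Assumed to be called once per checkpoint interval by a scheduler.
        if self.running:
            self._emit_barrier()

    def stop(self) -> None:
        self.running = False


class Coordinator:
    """Stops barrier generation as soon as a worker connection fails."""

    def __init__(self, timer: BarrierTimer) -> None:
        self._timer = timer
        self.failed_workers: List[str] = []

    def on_worker_connection_failed(self, worker_id: str) -> None:
        # Hotfix behavior: do not wait for full failure detection,
        # stop generating barriers right away.
        self.failed_workers.append(worker_id)
        self._timer.stop()


emitted: List[int] = []
timer = BarrierTimer(lambda: emitted.append(len(emitted) + 1))
coordinator = Coordinator(timer)

timer.tick()                                       # barrier 1 is emitted
coordinator.on_worker_connection_failed("worker-3")
timer.tick()                                       # no barrier: timer stopped
```

A barrier can still be emitted between the actual failure and the moment the coordinator notices the broken connection, which is why this only reduces the problem and does not remove it.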
