-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Naive Fault Tolerance Support #697
Conversation
huangworld
commented
Nov 19, 2023
- Added naive mock of worker crashes through controlled sleep via worker_timeout and worker_timeout_choice flags.
- Added naive detection of worker crashes in scheduler based on a retry and backoff logic.
- Added naive fault tolerance logic in scheduler - re-split the original IR and distribute to healthy workers to (re)-execute again.
- Upreved benchmarks based on data branch (though require some additional modifications to make these updates suitable for distr context (will be updated in a separate PR in dish repo).
OS:ubuntu-20.04 |
c624267
to
8544ab8
Compare
OS:ubuntu-20.04 |
1. Added naive mock of worker crashes through controlled sleep via worker_timeout and worker_timeout_choice flags. 2. Added naive detection of worker crashes in scheduler based on a retry and backoff logic 3. Added naive fault tolerance logic in scheduler - re-split the original IR and distribute to healthy workers to (re)-execute again. 4. Upreved benchmarks based on data branch (though require some additional modifications to make these updates suitable for distr context (will be updated in a separate PR in dish repo).
8544ab8
to
f8ac509
Compare
OS:ubuntu-20.04 |
@@ -174,6 +174,12 @@ def add_common_arguments(parser): | |||
parser.add_argument("--version", | |||
action='version', | |||
version='%(prog)s {version}'.format(version=__version__)) | |||
parser.add_argument("--worker_timeout", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be OK for now, but it would be good if these flags are marked as experimental (in their help) and maybe at some point in the future hidden a bit, or put behind a subgroup so that pash users dont consider using them.
@@ -1,12 +1,12 @@ | |||
#!/bin/bash | |||
|
|||
sed 's;^;http://ndr.md/data/noaa/;' | | |||
sed 's;^;atlas-group.cs.brown.edu/data/noaa/;' | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these changes were in a different PR, why are they visible here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you rebase your branch to the correct one (or cherrypick your commits only?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'll separate the changes corresponding to benchmark updates to its own PR!
@@ -45,21 +45,34 @@ def get_running_processes(self): | |||
# answer = self.socket.recv(1024) | |||
return self._running_processes | |||
|
|||
def send_graph_exec_request(self, graph, shell_vars, functions, debug=False) -> bool: | |||
def send_graph_exec_request(self, graph, shell_vars, functions, debug=False, worker_timeout=0) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a little bit concerned that the killing code (sending the worker timeout) is so deeply engrained in the main flow of the pash code. That is OK for now, but is there a way to somehow move this code out of pash? For example, by having a separate entity that is the chaos_manager/fault_orchestrator that sends messages to workers to be killed and so on. This entity can be started when pash starts, but it would make it much simpler to separate the fault causing code from the fault recovery code, and also would make it much easier to remove the faults for pash production use. Let's discuss this in the next meeting!!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have left some comments, which we should try to discuss in a next meeting! In general feel free to merge PRs on non mainline branches (main/future) without waiting for review (especially if you want to keep moving and we are busy :)
This merge into dspash-ft is reverted. In a separate pr: #707, I retargeted it to ft-future. |