Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Naive Fault Tolerance Support #697

Merged
merged 2 commits into from
Nov 29, 2023
Merged

Conversation

huangworld
Copy link
Collaborator

  1. Added naive mock of worker crashes through controlled sleep via worker_timeout and worker_timeout_choice flags.
  2. Added naive detection of worker crashes in scheduler based on a retry and backoff logic.
  3. Added naive fault tolerance logic in scheduler - re-split the original IR and distribute to healthy workers to (re)-execute again.
  4. Upreved benchmarks based on data branch (though require some additional modifications to make these updates suitable for distr context (will be updated in a separate PR in dish repo).

Copy link

OS:ubuntu-20.04
Sun Nov 19 21:43:06 UTC 2023
intro: 2/2 tests passed.
interface: 35/36 tests passed.
compiler: 54/54 tests passed.
agg: 109/109 tests passed.
test_set are not identical

Copy link

OS:ubuntu-20.04
Mon Nov 20 00:01:34 UTC 2023
intro: 2/2 tests passed.
interface: 36/36 tests passed.
compiler: 54/54 tests passed.
agg: 109/109 tests passed.

1. Added naive mock of worker crashes through controlled sleep via worker_timeout and worker_timeout_choice flags.
2. Added naive detection of worker crashes in scheduler based on a retry and backoff logic
3. Added naive fault tolerance logic in scheduler - re-split the original IR and distribute to healthy workers to (re)-execute again.
4. Upreved benchmarks based on data branch (though require some additional modifications to make these updates suitable for distr context (will be updated in a separate PR in dish repo).
Copy link

OS:ubuntu-20.04
Tue Nov 21 04:10:49 UTC 2023
intro: 2/2 tests passed.
interface: 36/36 tests passed.
compiler: 54/54 tests passed.
agg: 109/109 tests passed.

@@ -174,6 +174,12 @@ def add_common_arguments(parser):
parser.add_argument("--version",
action='version',
version='%(prog)s {version}'.format(version=__version__))
parser.add_argument("--worker_timeout",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be OK for now, but it would be good if these flags are marked as experimental (in their help) and maybe at some point in the future hidden a bit, or put behind a subgroup so that pash users dont consider using them.

@@ -1,12 +1,12 @@
#!/bin/bash

sed 's;^;http://ndr.md/data/noaa/;' |
sed 's;^;atlas-group.cs.brown.edu/data/noaa/;' |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these changes were in a different PR, why are they visible here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you rebase your branch to the correct one (or cherrypick your commits only?)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'll separate the changes corresponding to benchmark updates to its own PR!

@@ -45,21 +45,34 @@ def get_running_processes(self):
# answer = self.socket.recv(1024)
return self._running_processes

def send_graph_exec_request(self, graph, shell_vars, functions, debug=False) -> bool:
def send_graph_exec_request(self, graph, shell_vars, functions, debug=False, worker_timeout=0) -> bool:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a little bit concerned that the killing code (sending the worker timeout) is so deeply engrained in the main flow of the pash code. That is OK for now, but is there a way to somehow move this code out of pash? For example, by having a separate entity that is the chaos_manager/fault_orchestrator that sends messages to workers to be killed and so on. This entity can be started when pash starts, but it would make it much simpler to separate the fault causing code from the fault recovery code, and also would make it much easier to remove the faults for pash production use. Let's discuss this in the next meeting!!

Copy link
Member

@angelhof angelhof left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have left some comments, which we should try to discuss in a next meeting! In general feel free to merge PRs on non mainline branches (main/future) without waiting for review (especially if you want to keep moving and we are busy :)

@huangworld huangworld merged commit 5b924ec into binpash:dspash-ft Nov 29, 2023
0 of 7 checks passed
@huangworld
Copy link
Collaborator Author

This merge into dspash-ft is reverted. In a separate pr: #707, I retargeted it to ft-future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants