Naive Fault Tolerance Support #697

huangworld · 2023-11-19T21:39:57Z

Added a naive mock of worker crashes through a controlled sleep, configured via the worker_timeout and worker_timeout_choice flags.
Added naive detection of worker crashes in the scheduler, based on retry-and-backoff logic.
Added naive fault tolerance logic in the scheduler: re-split the original IR and distribute it to healthy workers for re-execution.
Updated the benchmarks based on the data branch. These updates still need additional modifications to suit the distributed context; those will come in a separate PR in the dish repo.
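For readers unfamiliar with the detection approach, here is a minimal sketch of retry-and-backoff crash detection. It is not the code from this PR: `worker_is_responsive`, `healthy_workers`, the `probe` callables, and the retry and delay defaults are all hypothetical names and values chosen for illustration.

```python
import time
from typing import Callable, Dict, List


def worker_is_responsive(probe: Callable[[], bool],
                         max_retries: int = 3,
                         base_delay: float = 0.5,
                         sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return True if probe() succeeds within max_retries attempts.

    `probe` is a hypothetical callable that contacts one worker and
    returns True when the worker answers.
    """
    for attempt in range(max_retries):
        try:
            if probe():
                return True
        except (ConnectionError, TimeoutError):
            pass  # a failed connection counts as a missed attempt
        if attempt < max_retries - 1:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * (2 ** attempt))
    return False


def healthy_workers(probes: Dict[str, Callable[[], bool]], **kwargs) -> List[str]:
    """Names of the workers that still respond, in insertion order."""
    return [name for name, probe in probes.items()
            if worker_is_responsive(probe, **kwargs)]
```

Under these assumptions, a scheduler would re-split the original IR across the names returned by `healthy_workers` and resend the pieces. The `sleep` parameter exists only so the backoff can be tested without waiting.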

github-actions · 2023-11-19T21:43:07Z

OS:ubuntu-20.04
Sun Nov 19 21:43:06 UTC 2023
intro: 2/2 tests passed.
interface: 35/36 tests passed.
compiler: 54/54 tests passed.
agg: 109/109 tests passed.
test_set are not identical

github-actions · 2023-11-20T00:01:34Z

OS:ubuntu-20.04
Mon Nov 20 00:01:34 UTC 2023
intro: 2/2 tests passed.
interface: 36/36 tests passed.
compiler: 54/54 tests passed.
agg: 109/109 tests passed.


github-actions · 2023-11-21T04:10:49Z

OS:ubuntu-20.04
Tue Nov 21 04:10:49 UTC 2023
intro: 2/2 tests passed.
interface: 36/36 tests passed.
compiler: 54/54 tests passed.
agg: 109/109 tests passed.

angelhof · 2023-11-21T14:42:44Z

compiler/config.py

@@ -174,6 +174,12 @@ def add_common_arguments(parser):
    parser.add_argument("--version",
                        action='version',
                        version='%(prog)s {version}'.format(version=__version__))
+    parser.add_argument("--worker_timeout",


I think this should be OK for now, but it would be good to mark these flags as experimental (in their help text) and, at some point in the future, to hide them or put them behind a subgroup so that pash users don't consider using them.
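One way this suggestion could look with the standard `argparse` module is sketched below. The flag names come from this PR; the helper name `add_experimental_ft_arguments`, the argument types, the `worker_timeout_choice` default, and the help strings are assumptions made for illustration.

```python
import argparse


def add_experimental_ft_arguments(parser: argparse.ArgumentParser,
                                  hidden: bool = False) -> None:
    """Register the fault-injection flags as experimental.

    With hidden=False the flags appear in their own help section;
    with hidden=True they still parse but are omitted from --help.
    """
    if hidden:
        target = parser
    else:
        target = parser.add_argument_group(
            "experimental fault injection",
            "Testing-only flags; not intended for regular use.")

    def describe(text: str) -> str:
        # argparse.SUPPRESS removes the option from help and usage output.
        return argparse.SUPPRESS if hidden else "(experimental) " + text

    # Types, defaults, and help wording below are assumptions.
    target.add_argument("--worker_timeout", type=int, default=0,
                        help=describe("delay used to simulate a worker crash"))
    target.add_argument("--worker_timeout_choice", type=str, default="",
                        help=describe("which worker receives the simulated crash"))
```

Calling `add_experimental_ft_arguments(parser)` from `add_common_arguments` would keep the flags visible but clearly labelled; passing `hidden=True` later would hide them without breaking existing scripts.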

angelhof · 2023-11-21T14:43:57Z

evaluation/benchmarks/max-temp/max-temp-preprocess.sh

@@ -1,12 +1,12 @@
 #!/bin/bash

-sed 's;^;http://ndr.md/data/noaa/;' |
+sed 's;^;atlas-group.cs.brown.edu/data/noaa/;' |


These changes were in a different PR; why are they visible here?

Can you rebase your branch onto the correct one (or cherry-pick only your commits)?

huangworld · Nov 21, 2023

Yes, I'll move the benchmark update changes into their own PR!

angelhof · 2023-11-21T14:49:31Z

compiler/dspash/worker_manager.py

@@ -45,21 +45,34 @@ def get_running_processes(self):
        #     answer = self.socket.recv(1024)
        return self._running_processes

-    def send_graph_exec_request(self, graph, shell_vars, functions, debug=False) -> bool:
+    def send_graph_exec_request(self, graph, shell_vars, functions, debug=False, worker_timeout=0) -> bool:


I am a little concerned that the fault-injection code (sending the worker timeout) is so deeply ingrained in the main flow of the pash code. That is OK for now, but is there a way to move this code out of pash? For example, a separate entity (a chaos_manager/fault_orchestrator) could send messages to workers telling them to be killed, and so on. This entity could be started when pash starts. That would make it much simpler to separate the fault-causing code from the fault-recovery code, and much easier to remove the faults for pash production use. Let's discuss this in the next meeting!
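To make the proposed separation concrete, here is a rough sketch of what such an entity might look like. Nothing here exists in pash: the `FaultOrchestrator` class, its `kill_worker` callback, and the worker names are hypothetical, and a real design would be settled in the meeting.

```python
import random
from typing import Callable, List, Optional, Sequence


class FaultOrchestrator:
    """Hypothetical component that owns every fault-injection decision.

    The scheduler never sees a timeout parameter; it only observes that
    a worker stopped responding and runs its normal recovery path.
    """

    def __init__(self, kill_worker: Callable[[str], None],
                 rng: Optional[random.Random] = None) -> None:
        self._kill_worker = kill_worker      # e.g. sends a "crash" message
        self._rng = rng or random.Random()
        self.injected: List[str] = []        # record of injected faults

    def inject(self, workers: Sequence[str],
               choice: Optional[str] = None) -> Optional[str]:
        """Crash `choice` if it names a known worker, else a random one."""
        if not workers:
            return None
        target = choice if choice in workers else self._rng.choice(list(workers))
        self._kill_worker(target)
        self.injected.append(target)
        return target
```

With a layout like this, removing fault injection for production use would mean not starting the orchestrator, while `send_graph_exec_request` could keep its original signature.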

angelhof

I have left some comments, which we should try to discuss in our next meeting! In general, feel free to merge PRs on non-mainline branches (anything other than main/future) without waiting for review, especially if you want to keep moving and we are busy :)

huangworld · 2023-11-29T22:21:09Z

This merge into dspash-ft has been reverted. I retargeted the change to ft-future in a separate PR, #707.

huangworld force-pushed the dspash-future branch from c624267 to 8544ab8 Compare November 19, 2023 23:58

huangworld force-pushed the dspash-future branch from 8544ab8 to f8ac509 Compare November 21, 2023 04:07

angelhof reviewed Nov 21, 2023


Rewinded evaluation link updates to include in a separate PR

d4512b0

huangworld merged commit 5b924ec into binpash:dspash-ft Nov 29, 2023
0 of 7 checks passed
