
Implement dora run command #703

Merged
merged 7 commits from dora-run into main on Nov 14, 2024

Conversation

phil-opp
Collaborator

phil-opp commented Nov 6, 2024

Runs a dataflow locally, without requiring any daemon or coordinator processes. Multi-machine dataflows are not supported. The default log level is set to INFO; it can be overridden by setting the RUST_LOG environment variable.

This exposes the internal `dora daemon --run-dataflow` command that we use for testing.

This addition was proposed in https://github.com/orgs/dora-rs/discussions/698#discussioncomment-11125465 .

The second commit adds a ctrl-c handler. On first ctrl-c, we send a stop command to all nodes. On second ctrl-c, we exit immediately and kill all spawned nodes. On third ctrl-c, we abort the process directly without waiting (child processes keep running).
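As a rough sketch (not the actual dora code), the staged handler could look roughly like this under a tokio runtime; `stop_all_nodes` and `kill_all_nodes` are hypothetical placeholders:

```rust
use tokio::signal;

// Staged ctrl-c handling as described above. `stop_all_nodes` and
// `kill_all_nodes` are hypothetical placeholders, not dora APIs.
async fn handle_ctrl_c() {
    for presses in 1u32.. {
        if signal::ctrl_c().await.is_err() {
            break; // could not listen for the signal
        }
        match presses {
            // first ctrl-c: send a stop command to all nodes
            1 => stop_all_nodes().await,
            // second ctrl-c: kill all spawned nodes and exit immediately
            2 => {
                kill_all_nodes().await;
                std::process::exit(1);
            }
            // third ctrl-c: abort without waiting (children keep running)
            _ => std::process::abort(),
        }
    }
}

async fn stop_all_nodes() { /* ... */ }
async fn kill_all_nodes() { /* ... */ }
```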

@haixuanTao
Collaborator

This makes sense to me.

But it could be good to:

  • Not print an error on SIGINT (Ctrl-C), as it is the expected way to stop
  • Make Ctrl-C return a result faster. I'm slightly confused why Ctrl-C-ing a process takes a couple of seconds to clear, as it should be almost instant.

@haixuanTao
Collaborator

haixuanTao commented Nov 7, 2024

Screencast.from.2024-11-07.05-28-42.webm

In this example, I hit Ctrl-C at t=10 and the dataflow finishes 10s later.

Runs a dataflow locally, without requiring any daemon or coordinator processes.

This exposes the internal `dora daemon --run-dataflow` command that we use for testing.

This addition was proposed in https://github.com/orgs/dora-rs/discussions/698#discussioncomment-11125465 .
On first ctrl-c, send a stop command to all nodes. On second ctrl-c, exit immediately and kill all spawned nodes. On third ctrl-c, abort the process directly without waiting (child processes keep running).

This change affects both `dora run` and `dora daemon` commands.
@phil-opp
Collaborator Author

Rebased original commits on top of main.

@phil-opp
Collaborator Author

But it could be good to:

* Not print an error on SIGINT (Ctrl-C), as it is the expected way to stop

* Make Ctrl-C return a result faster. I'm slightly confused why Ctrl-C-ing a process takes a couple of seconds to clear, as it should be almost instant.

See the description of the second commit:

On first ctrl-c, send a stop command to all nodes. On second ctrl-c, exit immediately and kill all spawned nodes. On third ctrl-c, abort the process directly without waiting (child processes keep running).

The reason that the ctrl-c takes multiple seconds in your case is that your dataflow doesn't react to the stop command properly for some reason (the warning message indicates a connection with #625). The nodes are then killed when the grace period runs out. This is considered an abnormal exit, so we print an error.
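For context, a rough sketch of that grace-period behavior (assuming a tokio runtime and a `tokio::process::Child` handle per node; the real daemon code differs in detail):

```rust
use std::time::Duration;
use tokio::{process::Child, time::timeout};

const GRACE_PERIOD: Duration = Duration::from_secs(15);

// After the stop command was sent, wait up to GRACE_PERIOD for the node to
// exit on its own; kill it otherwise and report that as an abnormal exit.
async fn wait_or_kill(name: &str, child: &mut Child) -> Result<(), String> {
    match timeout(GRACE_PERIOD, child.wait()).await {
        Ok(Ok(status)) if status.success() => Ok(()),
        Ok(Ok(status)) => Err(format!("node `{name}` exited abnormally: {status}")),
        Ok(Err(err)) => Err(format!("failed to wait for node `{name}`: {err}")),
        Err(_elapsed) => {
            // node didn't react to the stop message in time -> SIGKILL
            let _ = child.kill().await;
            Err(format!(
                "node `{name}` was killed due to not stopping within the grace period"
            ))
        }
    }
}
```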

Could you try pressing ctrl-c twice? This should lead to an immediate kill and no additional error.

@haixuanTao
Collaborator

So, when I press Ctrl-C, all processes receive a SIGINT on the first Ctrl-C, and therefore they return an error.

Isn't that what you're experiencing as well?

@phil-opp
Collaborator Author

Usually only the foreground process receives the Ctrl-C; the spawned child processes should continue running. That's also how it behaves on my machine.

Could you paste the complete terminal output here? In your video I only see the end of it.

For me it looks like this when I do a single ctrl-c with the python-dataflow example:

2024-11-13T14:07:36.656457Z  INFO run_inner: dora_daemon: received ctrlc signal -> stopping all dataflows self.machine_id=
2024-11-13T14:07:51.792383Z  WARN dora_daemon: object-detection was killed due to not stopping within the 15s grace period
2024-11-13T14:07:51.810161Z  WARN dora_daemon: plot was killed due to not stopping within the 15s grace period
[...]
[ERROR]
Dataflow failed:

Node `camera` failed: exited because of signal SIGINT with stderr output:
[...]
Node `plot` failed: node was killed by dora because it didn't react to a stop message in time (SIGKILL)
Node `object-detection` failed: node was killed by dora because it didn't react to a stop message in time (SIGKILL)

@haixuanTao
Collaborator

Ah, I see!

So, it might be an OS thing. This is what I have:

(lebai) ~/D/w/d/e/rust-dataflow ❯❯❯ dora run dataflow.yml                                                                                    (lebai) dora-run ✭ ◼
2024-11-14T01:04:43.443567Z  INFO run_inner: dora_daemon::spawn: spawning: /home/peter/Documents/work/dora/target/debug/rust-dataflow-example-node self.machine_id=
2024-11-14T01:04:43.444772Z  INFO run_inner: dora_daemon::spawn: spawning: /home/peter/Documents/work/dora/target/debug/rust-dataflow-example-status-node self.machine_id=
2024-11-14T01:04:43.445278Z  INFO run_inner: dora_daemon::spawn: spawning: /home/peter/Documents/work/dora/target/debug/rust-dataflow-example-sink self.machine_id=
2024-11-14T01:04:43.445739Z  INFO run_inner: dora_daemon::spawn: spawning: /home/peter/Documents/work/dora/target/debug/dora-record self.machine_id=
2024-11-14T01:04:43.468757Z  INFO run_inner: dora_daemon: all nodes are ready, starting dataflow `01932833-7db0-7b0b-bcfe-dc0c3016a200` self.machine_id=
^C2024-11-14T01:04:44.148733Z  INFO run_inner: dora_daemon: received ctrlc signal -> stopping all dataflows self.machine_id=
2024-11-14T01:04:44.154888Z  WARN dora_daemon::node_communication: failed to send NextFinishedDropTokens reply: NextDropEvents([])

Caused by:
   0: failed to send reply to node `rust-status-node`
   1: failed to send DaemonReply
   2: Broken pipe (os error 32)

Location:
    binaries/daemon/src/node_communication/tcp.rs:91:14
2024-11-14T01:04:44.156431Z  WARN dora_daemon::node_communication: failed to send NextFinishedDropTokens reply: NextDropEvents([])

Caused by:
   0: failed to send reply to node `dora-record`
   1: failed to send DaemonReply
   2: Broken pipe (os error 32)

Location:
    binaries/daemon/src/node_communication/tcp.rs:91:14
2024-11-14T01:04:44.296294Z  WARN dora_daemon::node_communication: failed to send NextFinishedDropTokens reply: NextDropEvents([])

Caused by:
   0: failed to send reply to node `rust-node`
   1: failed to send DaemonReply
   2: Broken pipe (os error 32)

Location:
    binaries/daemon/src/node_communication/tcp.rs:91:14
2024-11-14T01:04:44.297052Z  INFO run_inner: dora_daemon: Dataflow `01932833-7db0-7b0b-bcfe-dc0c3016a200` finished on machine `` self.machine_id=
2024-11-14T01:04:44.297065Z  INFO run_inner: dora_daemon: exiting daemon because all required dataflows are finished self.machine_id=
2024-11-14T01:04:44.297066Z  WARN dora_daemon::node_communication: failed to send NextFinishedDropTokens reply: NextDropEvents([])

Caused by:
   0: failed to send reply to node `rust-sink`
   1: failed to send DaemonReply
   2: Broken pipe (os error 32)

Location:
    binaries/daemon/src/node_communication/tcp.rs:91:14


[ERROR]
Dataflow failed:

Node `rust-status-node` failed: exited because of signal SIGINT
Node `dora-record` failed: exited because of signal SIGINT
Node `rust-node` failed: exited because of signal SIGINT
Node `rust-sink` failed: exited because of signal SIGINT
Screencast.from.2024-11-14.02-04-27.webm

I'm using the Rust nodes as they might be easier to debug.

@phil-opp
Collaborator Author

So, it might be an OS thing.

I'm on an Ubuntu-based Linux distro. You?

@haixuanTao
Collaborator

I have fixed my issue with #712

phil-opp merged commit 70ca30d into main on Nov 14, 2024
51 of 52 checks passed
phil-opp deleted the dora-run branch on November 14, 2024 16:25
@haixuanTao
Collaborator

So FYI, there was a race condition: the node process stopped before the daemon could send the stop message, and so the daemon got stuck.

By splitting the process groups, this race condition no longer happens.
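As a rough illustration of that process-group split (not necessarily the exact change in #712), spawning each node in its own process group on Unix keeps the terminal's Ctrl-C, which is delivered to the foreground process group, from reaching the spawned nodes directly:

```rust
use std::os::unix::process::CommandExt;
use std::process::{Child, Command};

// Spawn a node binary in its own process group (Unix-only). process_group(0)
// makes the child call setpgid(0, 0), so its process group ID becomes its own
// PID and it no longer shares the terminal's foreground process group.
fn spawn_in_own_process_group(program: &str) -> std::io::Result<Child> {
    Command::new(program).process_group(0).spawn()
}
```

With that, a Ctrl-C in the terminal only reaches the `dora run` process itself, which can then forward a stop command to the nodes as described above.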
