Should Barrier.wait() return as soon as a task fails? #272

emiltin · 2023-08-24T07:44:22Z

emiltin
Aug 24, 2023

When calling wait() on a barrier, it will re-raise exceptions from any of the tasks in the barrier, but apparently not until all tasks added before the failing task have completed.

Here two seconds pass before the barrier reports the error:

require 'async'
require 'async/barrier'
Async do
  barrier = Async::Barrier.new
  barrier.async { sleep 2 }
  barrier.async { raise RuntimeError }
  barrier.wait # => 2 seconds later.... RuntimeError
ensure
  barrier.stop
end

But just by changing the order in which the tasks are created, the barrier now returns the error immediately:

require 'async'
require 'async/barrier'
Async do
  barrier = Async::Barrier.new
  barrier.async { raise RuntimeError }
  barrier.async { sleep 2 }
  barrier.wait # => immediately: RuntimeError
ensure
  barrier.stop
end

I would prefer that it returns exceptions as soon as any task fails.
It also seems odd to me that the order of the tasks matters, even though they run asynchronously.

ioquatix · 2023-08-24T08:21:03Z

ioquatix
Aug 24, 2023
Maintainer

A barrier is supposed to be a synchronisation point of multiple tasks.

Probably Barrier#wait should ignore errors, and just wait until all tasks are completed (success or fail).

However, I suppose what I was thinking was that if a task has failed with an unhandled error, the entire process might be considered failed. Order can matter, since you might have inter-dependencies.

I don't mind changing this semantic, but I suppose we just need to be mindful of what problems we are trying to solve and the best way to organise the code/solutions.


emiltin · 2023-08-24T08:50:20Z

emiltin
Aug 24, 2023
Author

It's common to start several tasks and wait for them all to complete. This is what a barrier handles.

But if one of the tasks fails with an unhandled error, then there is often no need to wait for all tasks to complete, because we know that we're not going to be able to get a complete result once one of the tasks has failed. So we might want to handle this by e.g. stopping all remaining tasks and starting over, failing the parent task, or perhaps aborting entirely.

In general I think handling errors is one of the most difficult aspects of using Async. My goal is to create fault tolerant code. I'm currently inspired by Erlang/Elixir and supervisor trees, and the idea of restarting failed parts from a known good state.


emiltin · 2023-08-24T14:45:03Z

emiltin
Aug 24, 2023
Author

Perhaps you could choose whether the barrier:

- re-raises an uncaught error in a task as soon as it happens, or
- waits for all tasks to complete (ignoring failures).

I think the current behaviour is somewhere between these, but to me seems a bit hard to use because it depends on the ordering of tasks.

Async do
  barrier = Async::Barrier.new
  
  barrier.async { raise 'ups' }
  barrier.async { sleep 1 }

  barrier.wait # raise as soon as any task fails
  barrier.complete # wait for all tasks, ignoring errors

rescue StandardError => e
  # what task caused the error?
ensure
  barrier.stop
end

Task#wait raises if the task fails, so it seems appropriate that Barrier#wait does the same, but for all tasks.

Is there a way to glean from the error which task originally raised it?
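One library-agnostic way to approach the "which task raised it?" question is to tag the error before it leaves the task. The sketch below is plain Ruby and not a feature of Async: `LabelledTaskError` and `labelled` are hypothetical names made up for illustration. It relies only on Ruby setting `Exception#cause` automatically when an error is raised inside a `rescue` clause.

```ruby
# Hypothetical wrapper error that remembers which task it came from.
class LabelledTaskError < StandardError
  attr_reader :label

  def initialize(label, message)
    super("#{label}: #{message}")
    @label = label
  end
end

# Run the block and re-raise any failure tagged with the given label.
# Ruby sets #cause to the original exception automatically.
def labelled(label)
  yield
rescue StandardError => error
  raise LabelledTaskError.new(label, error.message)
end

begin
  labelled(:task2) { raise 'ups' }
rescue LabelledTaskError => error
  puts "#{error.label} failed: #{error.cause.message}"
end
```

Presumably the same wrapper could be called inside each block passed to `barrier.async`, so that whatever the barrier re-raises carries the label.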


emiltin · 2023-08-30T10:39:04Z

emiltin
Aug 30, 2023
Author

I tried with this implementation of Async::Barrier#wait:

def wait
  condition = Async::Condition.new
  guard = Async do
    until @tasks.empty?
      result = condition.wait
      raise result if result.is_a? StandardError
    end
  end

  @tasks.each do |waiting|
    Async do
      begin
        task = waiting.task
        task.wait
      ensure
        @tasks.remove?(waiting) unless task.alive?
      end
      condition.signal :ok
    rescue StandardError => e
      condition.signal e
    end
  end

  guard.wait
end

Now the barrier will abort and re-raise as soon as any task fails:

require 'async'
require 'async/barrier'
Async do
  barrier = Async::Barrier.new

  barrier.async do |task1|
    task1.annotate(:task1)
    sleep 1000
  end

  barrier.async do |task2|
    task2.annotate(:task2)
    sleep 0.1
    raise RuntimeError, 'boom!'
  end

  barrier.wait
  puts 'All tasks completed'
rescue StandardError => e
  puts "Task error: #{e}"
ensure
  barrier.stop
end

Instead of waiting for each task in turn, we run a separate guard task that waits on a condition for as long as there are tasks remaining. When a task completes or fails, it signals the condition and removes itself. The guard can then abort the wait if a task failed.

But having to run each task inside another task feels clunky; there's probably a better way to do it?

2 replies

ioquatix Aug 30, 2023
Maintainer

A barrier does not impose any order constraints - only that it's a synchronisation point. I think handling errors should be seen as exceptional. There will always be ambiguity in the order of error handling - as soon as you have non-determinism, you cannot make any guarantees about order, even if we do try to be as "predictable" as possible in Async.

The general model for Async::Barrier is this:

  def Barrier(parent: nil, &block)
    Barrier.new(parent: parent).tap do |barrier|
      yield barrier
      barrier.wait
    ensure
      barrier.stop
    end
  end

As I've said before, tasks that raise exceptions are exceptional and the flow control is also exceptional. Applications that use Barrier#wait should probably not raise exceptions as part of normal code execution.

That being said, I do understand your use case and the ideas about robustness. Your point is, as soon as one of the tasks fails, the entire request is essentially a failure, so why wait?

There are several patterns (with overlap), for a given set of tasks:

1. All tasks failed = total failure.
2. At least one failure = total failure.
3. At least one success = total success.
4. All tasks success = total success.

I think yours is (2), but I've also seen (3) and (4) in real code.

I think a barrier implements (2) and (4). It's true that a common pattern might be: fan out and fail fast; or fan out and succeed fast, and cancel the remainder. Or fan out, and wait for at least N successful responses.
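The "succeed fast" and "wait for at least N successes" variants can be sketched without depending on any particular Async API. The example below is only an illustration of the pattern using plain Ruby threads and a `Queue` as a stand-in for tasks; `first_successes` is a hypothetical helper, not something Async provides. With `count` set to 1 it behaves like "at least one success", and with `count` equal to the number of jobs it behaves like "all must succeed, fail fast".

```ruby
# Hypothetical helper: run each job in a thread and return the first
# `count` successful results, in completion order.
def first_successes(jobs, count)
  raise ArgumentError, 'count out of range' unless count.between?(1, jobs.size)

  done = Queue.new
  threads = jobs.map do |job|
    Thread.new do
      begin
        done << [:ok, job.call]
      rescue StandardError => error
        done << [:error, error]
      end
    end
  end

  results = []
  failures = []
  jobs.size.times do
    status, value = done.pop
    (status == :ok ? results : failures) << value
    # Enough successes: return early and let `ensure` cancel the rest.
    return results if results.size == count
    # Too many failures for `count` successes to still be possible.
    raise failures.last if failures.size > jobs.size - count
  end
ensure
  # Cancel whatever is still running.
  threads&.each(&:kill)
end

jobs = [-> { sleep 2; :slow }, -> { raise 'bad' }, -> { :fast }]
p first_successes(jobs, 1)
```

Here the slow job is cancelled as soon as one job has succeeded, and the failing job is tolerated because one success is still reachable.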

I think we can change barrier to fail fast without impacting the interface the user expects. If a task fails, whether it fails now or later is not specified by barrier - just that Barrier#wait will eventually re-raise the exception.

The best way to implement this, is for the task to notify the barrier that it's done - success or failure, and for Barrier#wait to wait on that condition. We can make a few small changes to make this more ergonomic: #276
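To make the notification idea concrete, here is a minimal sketch in plain Ruby. It uses threads and a `Queue` rather than Async tasks and conditions, so it is not how Async::Barrier is implemented; `FailFastGroup` is a hypothetical class that only illustrates the shape: every job reports its own completion, and `wait` handles completions in the order they happen instead of the order the jobs were added.

```ruby
# Hypothetical stand-in for a fail-fast barrier, using threads rather than Async tasks.
class FailFastGroup
  def initialize
    @threads = []
    @pending = 0
    @done = Queue.new
  end

  # Start a job. When it finishes it pushes [:ok, result] or [:error, exception].
  def async(&block)
    @pending += 1
    @threads << Thread.new do
      begin
        @done << [:ok, block.call]
      rescue StandardError => error
        @done << [:error, error]
      end
    end
  end

  # Handle completions in the order they happen and re-raise the first failure.
  def wait
    while @pending > 0
      status, value = @done.pop
      @pending -= 1
      raise value if status == :error
    end
  end

  # Kill any jobs that are still running.
  def stop
    @threads.each(&:kill)
  end
end

group = FailFastGroup.new
begin
  group.async { sleep 2 }
  group.async { raise 'ups' }
  group.wait
rescue RuntimeError => error
  puts "Failed fast: #{error.message}"
ensure
  group.stop
end
```

Because `wait` only ever looks at the completion queue, the order in which the jobs were started no longer affects when a failure is reported.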

emiltin Aug 30, 2023
Author

"as soon as one of the tasks fails, the entire request is essentially a failure, so why wait?"
Exactly.

Aren't 1 and 3 the same? A total success requires just one to succeed; if all fail, it's a total failure.
And aren't 2 and 4 the same? A total success requires all to succeed; if just one fails, it's a total failure.

I think the current implementation is close to 2 and 4, but might not fail fast, depending on the ordering of tasks.

emiltin · 2023-08-30T12:23:16Z

emiltin
Aug 30, 2023
Author

"The best way to implement this, is for the task to notify the barrier that it's done - success or failure, and for Barrier#wait to wait on that condition. We can make a few small changes to make this more ergonomic"

I think this is what I attempted: use a notification to let tasks inform the barrier whether they succeed or fail, so the barrier can fail fast. But I'm sure you can improve on it :-)
