TUI crashes when resetting while running #2

Open
rrbutani opened this issue Feb 12, 2020 · 6 comments

@rrbutani

To reproduce:

  • Run (r)
  • Reset (alt + r)
@rrbutani

This is because of our impl of `EventFutureSharedState::reset`; as expected, our current strategy does not work and needs to be revisited.

@rrbutani

If we want to properly resolve this we need to change EventFutureSharedState to be able to handle Futures that are no longer polled.

Ultimately I think this means we have to abandon the "single batch" model that we have now.

Right now, we assume that all the futures registered with us are waiting on the next event, and that once the event occurs (i.e. when the batch is sealed) no more futures will be registered until a new batch starts, which happens only once all the futures have polled and acknowledged that an event happened.
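
For reference, a minimal sketch of the single-batch model as described above; the names and exact shape are illustrative, not the actual `EventFutureSharedState` implementation:

    // Illustrative sketch only: a simplified picture of the single-batch
    // model, not the real `EventFutureSharedState` implementation.

    /// Placeholder for the crate's event type.
    #[derive(Clone, Copy)]
    pub struct Event;

    pub enum BatchState {
        /// No futures are registered; a new batch can begin.
        Empty,
        /// Futures are registered and waiting on the next event.
        WaitingOnEvent { registered: usize },
        /// The event fired and the batch is sealed: no new futures may
        /// register until every registered future has polled, seen the
        /// event, and been counted off.
        Sealed { event: Event, remaining: usize },
    }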

This is a decent assumption to make if the Control user is actually using the Futures produced by run_until_event: such users will likely know not to reset until the returned future resolves. However, in cases where Control users aren't doing anything with the returned future (like our own TUI at the moment), users don't know to wait until the future resolves before calling reset, and our assumption is violated.

One simple fix is to have the reset function trigger a Halt or a Pause event as a way to manually resolve the futures. Unfortunately this does not work: while it seals the batch, we cannot force the Futures we have handed out (or, more precisely, their executors) to poll themselves, so now we have a synchronization issue: we can't reset the shared state until all the futures have been resolved.

Spinning in reset until this condition is true also does not work, for a couple of reasons (a sketch of this broken approach follows the list):

  • First, as part of the RPC setup, the device polls the future that it can potentially have pending within the tick function, so we'd need to call tick repeatedly instead of just spinning so we aren't deadlocked. This isn't a deal breaker; just something to keep track of.
  • Second — and this is a deal breaker — we can't guarantee that the futures we have out will ever be polled. In the above example with the TUI, the future is never sent to an executor so we'd just be stuck on that batch forever.
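
To make the spinning approach concrete, here's a rough sketch; every type and method below is a stand-in (not the real API), and the point is just that nothing guarantees the loop terminates:

    // Stand-in types for illustration; not the real crate API.
    pub struct SharedState;
    pub struct Device;

    impl SharedState {
        pub fn seal_batch_with_halt(&self) { /* force an event so the batch seals */ }
        pub fn outstanding_futures(&self) -> usize { /* futures that haven't acked yet */ 0 }
        pub fn finish_reset(&self) { /* actually clear the shared state */ }
    }

    impl Device {
        pub fn tick(&mut self) { /* service RPC requests; polls the device-side future */ }
    }

    // The broken approach: spin until every outstanding future has acknowledged
    // the event, while still calling tick so that the future the device itself
    // holds keeps getting polled (otherwise we'd deadlock ourselves). If a
    // client-side future was handed out but is never polled (like the TUI's),
    // the count never reaches zero and this loop never exits.
    pub fn reset_by_spinning(device: &mut Device, state: &SharedState) {
        state.seal_batch_with_halt();
        while state.outstanding_futures() > 0 {
            device.tick();
        }
        state.finish_reset();
    }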

One way to fix this is to try to detect cases like the above where the future is not actually being polled. A decent proxy for this is adding hooks to Drop: we can have each EventFuture take a wrapper type that holds a reference to an impl of EventFutureSharedState (or the porcelain) and have that wrapper notify the impl when it gets dropped (via a custom Drop impl); a sketch of this wrapper follows the list below. There are two problems with this:

  • This custom Drop would get run on futures that are never used or abandoned and also on futures that resolve! We'd need some way of not counting the futures that resolve. We'd also want to disallow Clone on this wrapper type, but that's another thing.
  • Not all abandoned futures get dropped. Having an API that just crashes if/when users forget to drop a thing might not be the best, but this is likely to be enough of an edge case that I think it'd be reasonable to call this good enough for now.
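
Here's that Drop-hook wrapper sketched out; the trait and method names (`SharedState`, `note_future_dropped`, the `resolved` flag) are made up for illustration and are not the actual crate API:

    use core::future::Future;
    use core::pin::Pin;
    use core::task::{Context, Poll};

    /// Placeholder event type for the sketch.
    #[derive(Clone, Copy)]
    pub struct Event;

    /// Hypothetical stand-in for (part of) `EventFutureSharedState`.
    pub trait SharedState {
        /// Called when a future is dropped without having resolved.
        fn note_future_dropped(&self);
    }

    /// Wrapper future that notifies the shared state when it is abandoned.
    /// (No `Clone` impl, as noted above.)
    pub struct EventFuture<'a, S: SharedState, F: Future<Output = Event>> {
        inner: F,
        state: &'a S,
        /// Set once the inner future resolves so that `Drop` doesn't count
        /// futures that completed normally.
        resolved: bool,
    }

    impl<'a, S: SharedState, F: Future<Output = Event> + Unpin> Future for EventFuture<'a, S, F> {
        type Output = Event;

        fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Event> {
            let this = self.get_mut();
            match Pin::new(&mut this.inner).poll(cx) {
                Poll::Ready(event) => {
                    this.resolved = true; // don't double count on `Drop`
                    Poll::Ready(event)
                }
                Poll::Pending => Poll::Pending,
            }
        }
    }

    impl<'a, S: SharedState, F: Future<Output = Event>> Drop for EventFuture<'a, S, F> {
        fn drop(&mut self) {
            if !self.resolved {
                // Only abandoned (never-resolved) futures get counted here.
                self.state.note_future_dropped();
            }
        }
    }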

On the other hand, the above solution could still suffer from weird race conditions that happen because of the way events get re-sequenced once they go through RPC. More importantly, it will still leave some edge cases, like the abandoned-but-never-dropped futures above.

Another approach is to stop trying to stick to the single-batch model and to support multiple batches. The downside is complexity (particularly while maintaining no-std support for the EventFutureSharedState impl). We'd also need to decide how to handle cases where the batch counter rolls over, but that is definitely an extreme edge case.

@rrbutani

Here's a note from EventFutureSharedState::reset:

    //
    // TODO: This is problematic since the way to 'drop' a Future from the
    // client side is to simply stop polling it: when this happens we will not
    // be notified. Instead, users will either be unable to call reset or will
    // encounter a panic when they try to call reset after dropping a
    // RunUntilEvent future.
    //
    // A better system would be to have every batch have a number. Each future
    // would hold this number in addition to a reference to the
    // EventFutureSharedState instance. When futures that aren't in the current
    // batch try to poll, they should error out. The edge case that exists is
    // that if the batch number loops around, it's possible that a very old
    // future comes along and takes the place of a new future.
    //
    // An alternative solution is to use `Drop` to have futures that aren't
    // being used anymore signal to us that they're dead by calling a function
    // on us decrementing the count. A side-effect of this approach is that we
    // won't know whether the waker we currently have a reference to belongs
    // to the future that just told us that it's dead. So, we can't unregister
    // the current (if we have one) waker that we know about.
    //
    // As a result any waker that gets used with an `EventFutureSharedState`
    // backed Future must be able to handle being called on a future that is
    // no longer around.
    //
    // Actually, this has an issue: `Drop` is called eventually even when the
    // future resolves. In these cases, the count will be decremented twice:
    // once through the `get_event` function and once through the `decrement`
    // function (when `Drop` occurs).
    //
    // We could just not decrement the count in `get_event` but then we'd have
    // to assume that executors go and `Drop` their futures as soon as they
    // resolve (a reasonable assumption but not required -- afaik the invariant
    // they must uphold is not _polling_ Futures once they've resolved).
    //
    // So, this won't work.

Eventually we should do something that looks like the first thing (batches with numbers).
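
For reference, a rough sketch of what the numbered-batch approach could look like (the names here are illustrative, not the real API, and it sticks to `core` so no-std isn't a problem):

    use core::sync::atomic::{AtomicUsize, Ordering};

    // Illustrative sketch of the numbered-batch idea; not the actual
    // `EventFutureSharedState` API.
    pub struct SharedState {
        /// Incremented on every reset; wraps around eventually (the extreme
        /// edge case noted above).
        current_batch: AtomicUsize,
        // ... waker, event slot, waiter count, etc.
    }

    pub struct EventFuture<'a> {
        state: &'a SharedState,
        /// The batch this future was registered in.
        batch: usize,
    }

    pub enum PollError {
        /// This future belongs to a batch that was reset out from under it.
        StaleBatch,
    }

    impl SharedState {
        pub fn reset(&self) {
            // Invalidate every outstanding future in one step: anything still
            // holding the old batch number will now error out instead of
            // hanging or corrupting the new batch's bookkeeping.
            self.current_batch.fetch_add(1, Ordering::SeqCst);
            // ... clear the event slot, waiter count, etc.
        }

        pub fn check_batch(&self, fut: &EventFuture<'_>) -> Result<(), PollError> {
            if fut.batch == self.current_batch.load(Ordering::SeqCst) {
                Ok(())
            } else {
                Err(PollError::StaleBatch)
            }
        }
    }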

@rrbutani

For now let's do this:

  • We won't force all the Futures to have resolved when Control::reset is called.
  • Good Control impls will produce a halt event (`interp.halt(); self.step()`) when triggering a reset; see the sketch after this list.
  • RPC's device will catch this before processing any other requests (on the next invocation of tick which is how requests even arrive), resolving the one future that it may have out. This means that even if the very next request it processes is a RunUntilEvent, there will be no problems (i.e. no race conditions) because the old batch will have closed first. This is maintained so long as Device is the only client and sequential consistency is assumed (both of these are things we assume and need anyways).
    • This should work even with multiple RPC stages.
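
A sketch of what the stopgap looks like on the Control side; `interp.halt()` and `self.step()` are the calls mentioned above, while the surrounding types and method bodies are placeholders:

    // Placeholder interpreter type for the sketch.
    pub struct Interp;

    impl Interp {
        pub fn halt(&mut self) { /* raise the halt event */ }
        pub fn reset(&mut self) { /* clear registers, memory, etc. */ }
    }

    // Stand-in for a `Control` implementation.
    pub struct Simulator {
        interp: Interp,
    }

    impl Simulator {
        pub fn step(&mut self) { /* run one machine step; fires any pending events */ }

        /// The stopgap: produce a halt event (and take a step so it actually
        /// fires) before doing the real reset, so that any outstanding
        /// `run_until_event` batch gets sealed with an event first.
        pub fn reset(&mut self) {
            self.interp.halt();
            self.step();
            self.interp.reset();
        }
    }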

This leaves the user's end: there we still don't have any way of ensuring that the Future we produce is polled and that the Controller's batch is empty (we can and do ensure that it is sealed) before the next RunUntilEvent comes along.

When the user does this anyway (i.e. issues another RunUntilEvent before the previous batch has actually emptied), the program will crash.

I think this is acceptable for now. I'll update the TUI to poll the future it gets within its inner loop.
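
For the TUI, something along these lines in the inner loop (this uses the `futures` crate's no-op waker; the function shape and names are hypothetical):

    use core::pin::Pin;
    use core::task::{Context, Poll};
    use futures::task::noop_waker_ref;

    /// Poll the outstanding `run_until_event` future (if any) once per pass
    /// through the TUI's inner loop so the Controller's batch actually drains.
    /// `event_future` is whatever the last `run_until_event` call returned.
    pub fn tick_event_future<F: futures::Future + Unpin>(event_future: &mut Option<F>) {
        if let Some(fut) = event_future.as_mut() {
            let mut cx = Context::from_waker(noop_waker_ref());
            if let Poll::Ready(_event) = Pin::new(fut).poll(&mut cx) {
                // The batch resolved; drop our handle so a later reset is safe.
                *event_future = None;
            }
        }
    }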

@rrbutani

It's worth noting that another approach would be to just not resolve any existing futures when a reset occurs but to have them carry over. Calling run_until_event again wouldn't be a problem because we'd just have that future (in addition to any current ones) be waiting on the next event.

The problem is that if a user were to block on a future that they got before a reset, without starting the machine again (i.e. without calling run_until_event again), they'd be stuck. Or at least, their future would never resolve since the machine is paused.

But is this a problem?

I think it depends on your answer to this: Does the existence of an unresolved EventFuture imply that the machine is running?

In crafting the API I definitely thought it did. I still do think the above would be surprising behaviour for a user but it's a decidedly better failure mode than just crashing.

I think I'm still going to go with the stopgap outlined in the last comment but I'm now less sure what a proper solution to this looks like.

@AmrEl-Azizi @pranav12321 @jer-zhang @gipsond Thoughts?

@rrbutani

(Notes for when we eventually have tests for this)

These are some of the things to test:

  • Run -> Reset -> Run -> Reset -> Run
  • Run -> Run -> Run ...
  • Reset -> Reset -> Reset ...
  • Reset -> Pause
  • Run -> Pause -> Run (seems trivial but one of the 5 ways I tried to patch the TUI as described above broke this)

Also that last one but with breakpoints/watchpoints instead of pauses, maybe.

rrbutani referenced this issue in ut-utp/core Feb 16, 2020
rrbutani referenced this issue in ut-utp/core Mar 10, 2020
rrbutani referenced this issue in ut-utp/core Mar 16, 2020

Kinda hokey, but it works. And look! We found a use for `Control::get_state()`!

re: #48
jer-zhang transferred this issue from ut-utp/core May 21, 2020