TUI crashes when resetting while running #2

Open
rrbutani opened this issue Feb 12, 2020 · 6 comments

@rrbutani

To reproduce:

  • Run (r)
  • Reset (alt + r)
@rrbutani

This is because of our impl of `EventFutureSharedState::reset`; as expected, our current strategy does not work and needs to be revisited.

@rrbutani

If we want to properly resolve this we need to change EventFutureSharedState to be able to handle Futures that are no longer polled.

Ultimately I think this means we have to abandon the "single batch" model that we have now.

Right now, we assume that all the futures registered with us are waiting on the next event, and that once the event occurs (i.e. when the batch is sealed) no more futures will be registered until a new batch starts, which happens only once all the futures have polled and acknowledged that an event happened.
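
For reference, a minimal sketch of the single-batch model as described above; the names and exact shape are illustrative, not the actual `EventFutureSharedState` implementation:

    // Illustrative sketch only: a simplified picture of the single-batch
    // model, not the real `EventFutureSharedState` implementation.

    /// Placeholder for the crate's event type.
    #[derive(Clone, Copy)]
    pub struct Event;

    pub enum BatchState {
        /// No futures are registered; a new batch can begin.
        Empty,
        /// Futures are registered and waiting on the next event.
        WaitingOnEvent { registered: usize },
        /// The event fired and the batch is sealed: no new futures may
        /// register until every registered future has polled, seen the
        /// event, and been counted off.
        Sealed { event: Event, remaining: usize },
    }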

This is a decent assumption to make if the Control user is actually using the Futures produced by run_until_event: such users will likely know not to reset until the returned future resolves. However, in cases where Control users aren't doing anything with the returned future (like our own TUI at the moment), users don't know to wait until the future resolves before calling reset, and our assumption is violated.

One simple fix is to have the reset function trigger a Halt or a Pause event as a way to manually resolve the futures. Unfortunately this does not work: while it seals the batch, we cannot force the Futures we have handed out (or, more precisely, their executors) to poll themselves, so now we have a synchronization issue: we can't reset the shared state until all the futures have been resolved.

Spinning in reset until this condition is true also does not work, for a couple of reasons (a sketch of this broken approach follows the list):

  • First, as part of the RPC setup, the device polls the future that it can potentially have pending within the tick function, so we'd need to call tick repeatedly instead of just spinning so we aren't deadlocked. This isn't a deal breaker; just something to keep track of.
  • Second — and this is a deal breaker — we can't guarantee that the futures we have out will ever be polled. In the above example with the TUI, the future is never sent to an executor so we'd just be stuck on that batch forever.
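
To make the spinning approach concrete, here's a rough sketch; every type and method below is a stand-in (not the real API), and the point is just that nothing guarantees the loop terminates:

    // Stand-in types for illustration; not the real crate API.
    pub struct SharedState;
    pub struct Device;

    impl SharedState {
        pub fn seal_batch_with_halt(&self) { /* force an event so the batch seals */ }
        pub fn outstanding_futures(&self) -> usize { /* futures that haven't acked yet */ 0 }
        pub fn finish_reset(&self) { /* actually clear the shared state */ }
    }

    impl Device {
        pub fn tick(&mut self) { /* service RPC requests; polls the device-side future */ }
    }

    // The broken approach: spin until every outstanding future has acknowledged
    // the event, while still calling tick so that the future the device itself
    // holds keeps getting polled (otherwise we'd deadlock ourselves). If a
    // client-side future was handed out but is never polled (like the TUI's),
    // the count never reaches zero and this loop never exits.
    pub fn reset_by_spinning(device: &mut Device, state: &SharedState) {
        state.seal_batch_with_halt();
        while state.outstanding_futures() > 0 {
            device.tick();
        }
        state.finish_reset();
    }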

One way to fix this is to try to detect cases like the above where the future is not actually being polled. A decent proxy for this is adding hooks to Drop: we can have each EventFuture take a wrapper type that holds a reference to an impl of EventFutureSharedState (or the porcelain) and have that wrapper notify the impl when it gets dropped (via a custom Drop impl); a sketch of this wrapper follows the list below. There are two problems with this:

  • This custom Drop would get run on futures that are never used or abandoned and also on futures that resolve! We'd need some way of not counting the futures that resolve. We'd also want to disallow Clone on this wrapper type, but that's another thing.
  • Not all abandoned futures get dropped. Having an API that just crashes if/when users forget to drop a thing might not be the best, but this is likely to be enough of an edge case that I think it'd be reasonable to call this good enough for now.
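
Here's that Drop-hook wrapper sketched out; the trait and method names (`SharedState`, `note_future_dropped`, the `resolved` flag) are made up for illustration and are not the actual crate API:

    use core::future::Future;
    use core::pin::Pin;
    use core::task::{Context, Poll};

    /// Placeholder event type for the sketch.
    #[derive(Clone, Copy)]
    pub struct Event;

    /// Hypothetical stand-in for (part of) `EventFutureSharedState`.
    pub trait SharedState {
        /// Called when a future is dropped without having resolved.
        fn note_future_dropped(&self);
    }

    /// Wrapper future that notifies the shared state when it is abandoned.
    /// (No `Clone` impl, as noted above.)
    pub struct EventFuture<'a, S: SharedState, F: Future<Output = Event>> {
        inner: F,
        state: &'a S,
        /// Set once the inner future resolves so that `Drop` doesn't count
        /// futures that completed normally.
        resolved: bool,
    }

    impl<'a, S: SharedState, F: Future<Output = Event> + Unpin> Future for EventFuture<'a, S, F> {
        type Output = Event;

        fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Event> {
            let this = self.get_mut();
            match Pin::new(&mut this.inner).poll(cx) {
                Poll::Ready(event) => {
                    this.resolved = true; // don't double count on `Drop`
                    Poll::Ready(event)
                }
                Poll::Pending => Poll::Pending,
            }
        }
    }

    impl<'a, S: SharedState, F: Future<Output = Event>> Drop for EventFuture<'a, S, F> {
        fn drop(&mut self) {
            if !self.resolved {
                // Only abandoned (never-resolved) futures get counted here.
                self.state.note_future_dropped();
            }
        }
    }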

On the other hand, the above solution could still suffer from weird race conditions that happen because of the way events get re-sequenced once they go through RPC. More importantly, it will still leave some edge cases, like the abandoned-but-never-dropped futures above.

Another approach is to stop trying to stick to the single-batch model and to support multiple batches. The downside is complexity (particularly while maintaining no-std support for the EventFutureSharedState impl). We'd also need to decide how to handle cases where the batch counter rolls over, but that is definitely an extreme edge case.

@rrbutani

Here's a note from EventFutureSharedState::reset:

    //
    // TODO: This is problematic since the way to 'drop' a Future from the
    // client side is to simply stop polling it: when this happens we will not
    // be notified. Instead, users will either be unable to call reset or will
    // encounter a panic when they try to call reset after dropping a
    // RunUntilEvent future.
    //
    // A better system would be to have every batch have a number. Each future
    // would hold this number in addition to a reference to the
    // EventFutureSharedState instance. When futures that aren't in the current
    // batch try to poll, they should error out. The edge case that exists is
    // that if the batch number loops around, it's possible that a very old
    // future comes along and takes the place of a new future.
    //
    // An alternative solution is to use `Drop` to have futures that aren't
    // being used anymore signal to us that they're dead by calling a function
    // on us decrementing the count. A side-effect of this approach is that we
    // won't know whether the waker we currently have a reference to belongs
    // to the future that just told us that it's dead. So, we can't unregister
    // the current (if we have one) waker that we know about.
    //
    // As a result any waker that gets used with an `EventFutureSharedState`
    // backed Future must be able to handle being called on a future that is
    // no longer around.
    //
    // Actually, this has an issue: `Drop` is called eventually even when the
    // future resolves. In these cases, the count will be decremented twice:
    // once through the `get_event` function and once through the `decrement`
    // function (when `Drop` occurs).
    //
    // We could just not decrement the count in `get_event` but then we'd have
    // to assume that executors go and `Drop` their futures as soon as they
    // resolve (a reasonable assumption but not required -- afaik the invariant
    // they must uphold is not _polling_ Futures once they've resolved).
    //
    // So, this won't work.

Eventually we should do something that looks like the first thing (batches with numbers).
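
For reference, a rough sketch of what the numbered-batch approach could look like (the names here are illustrative, not the real API, and it sticks to `core` so no-std isn't a problem):

    use core::sync::atomic::{AtomicUsize, Ordering};

    // Illustrative sketch of the numbered-batch idea; not the actual
    // `EventFutureSharedState` API.
    pub struct SharedState {
        /// Incremented on every reset; wraps around eventually (the extreme
        /// edge case noted above).
        current_batch: AtomicUsize,
        // ... waker, event slot, waiter count, etc.
    }

    pub struct EventFuture<'a> {
        state: &'a SharedState,
        /// The batch this future was registered in.
        batch: usize,
    }

    pub enum PollError {
        /// This future belongs to a batch that was reset out from under it.
        StaleBatch,
    }

    impl SharedState {
        pub fn reset(&self) {
            // Invalidate every outstanding future in one step: anything still
            // holding the old batch number will now error out instead of
            // hanging or corrupting the new batch's bookkeeping.
            self.current_batch.fetch_add(1, Ordering::SeqCst);
            // ... clear the event slot, waiter count, etc.
        }

        pub fn check_batch(&self, fut: &EventFuture<'_>) -> Result<(), PollError> {
            if fut.batch == self.current_batch.load(Ordering::SeqCst) {
                Ok(())
            } else {
                Err(PollError::StaleBatch)
            }
        }
    }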

@rrbutani

For now let's do this:

  • We won't force all the Futures to have resolved when Control::reset is called.
  • Good Control impls will produce a halt event (`interp.halt(); self.step()`) when triggering a reset; see the sketch after this list.
  • RPC's device will catch this before processing any other requests (on the next invocation of tick which is how requests even arrive), resolving the one future that it may have out. This means that even if the very next request it processes is a RunUntilEvent, there will be no problems (i.e. no race conditions) because the old batch will have closed first. This is maintained so long as Device is the only client and sequential consistency is assumed (both of these are things we assume and need anyways).
    • This should work even with multiple RPC stages.
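
A sketch of what the stopgap looks like on the Control side; `interp.halt()` and `self.step()` are the calls mentioned above, while the surrounding types and method bodies are placeholders:

    // Placeholder interpreter type for the sketch.
    pub struct Interp;

    impl Interp {
        pub fn halt(&mut self) { /* raise the halt event */ }
        pub fn reset(&mut self) { /* clear registers, memory, etc. */ }
    }

    // Stand-in for a `Control` implementation.
    pub struct Simulator {
        interp: Interp,
    }

    impl Simulator {
        pub fn step(&mut self) { /* run one machine step; fires any pending events */ }

        /// The stopgap: produce a halt event (and take a step so it actually
        /// fires) before doing the real reset, so that any outstanding
        /// `run_until_event` batch gets sealed with an event first.
        pub fn reset(&mut self) {
            self.interp.halt();
            self.step();
            self.interp.reset();
        }
    }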

This leaves the user's end: there we still don't have any way of ensuring that the Future we produce is polled and that the Controller's batch is empty (we can and do ensure that it is sealed) before the next RunUntilEvent comes along.

When the user does this anyway (i.e. issues another RunUntilEvent before the previous batch has actually emptied), the program will crash.

I think this is acceptable for now. I'll update the TUI to poll the future it gets within its inner loop.
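
For the TUI, something along these lines in the inner loop (this uses the `futures` crate's no-op waker; the function shape and names are hypothetical):

    use core::pin::Pin;
    use core::task::{Context, Poll};
    use futures::task::noop_waker_ref;

    /// Poll the outstanding `run_until_event` future (if any) once per pass
    /// through the TUI's inner loop so the Controller's batch actually drains.
    /// `event_future` is whatever the last `run_until_event` call returned.
    pub fn tick_event_future<F: futures::Future + Unpin>(event_future: &mut Option<F>) {
        if let Some(fut) = event_future.as_mut() {
            let mut cx = Context::from_waker(noop_waker_ref());
            if let Poll::Ready(_event) = Pin::new(fut).poll(&mut cx) {
                // The batch resolved; drop our handle so a later reset is safe.
                *event_future = None;
            }
        }
    }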

@rrbutani

It's worth noting that another approach would be to just not resolve any existing futures when a reset occurs but to have them carry over. Calling run_until_event again wouldn't be a problem because we'd just have that future (in addition to any current ones) be waiting on the next event.

The problem is that if a user were to block on a future that they got before a reset, without starting the machine again (i.e. without calling run_until_event again), they'd be stuck. Or at least, their future would never resolve since the machine is paused.

But is this a problem?

I think it depends on your answer to this: Does the existence of an unresolved EventFuture imply that the machine is running?

In crafting the API I definitely thought it did. I still do think the above would be surprising behaviour for a user but it's a decidedly better failure mode than just crashing.

I think I'm still going to go with the stopgap outlined in the last comment but I'm now less sure what a proper solution to this looks like.

@AmrEl-Azizi @pranav12321 @jer-zhang @gipsond Thoughts?

@rrbutani

(Notes for when we eventually have tests for this)

These are some of the things to test:

  • Run -> Reset -> Run -> Reset -> Run
  • Run -> Run -> Run ...
  • Reset -> Reset -> Reset ...
  • Reset -> Pause
  • Run -> Pause -> Run (seems trivial but one of the 5 ways I tried to patch the TUI as described above broke this)

Also that last one but with breakpoints/watchpoints instead of pauses, maybe.

rrbutani referenced this issue in ut-utp/core Feb 16, 2020
rrbutani referenced this issue in ut-utp/core Mar 10, 2020
rrbutani referenced this issue in ut-utp/core Mar 16, 2020

Kinda hokey, but it works. And look! We found a use for `Control::get_state()`!

re: #48
jer-zhang transferred this issue from ut-utp/core May 21, 2020