Commit
[sled-agent] Give `InstanceManager` a gun (#6915)
Stacked on top of #6913

Presently, sled-agent sends requests to terminate an instance to the `InstanceRunner` task over the same `tokio::sync::mpsc` request channel as all other requests sent to that instance. This means that the `InstanceRunner` will attempt to terminate the instance only once other requests received before the termination request have been processed, and an instance cannot be terminated if its request channel has filled up. Similarly, if an instance's `InstanceRunner` task is waiting for an in-flight request to the VMM to complete, the request to terminate the instance will not be seen until the current request to Propolis has returned.

This means that if the instance has gotten stuck for some reason (e.g., because it is attempting a Crucible snapshot that cannot complete because a physical disk has gone missing, as seen in #6911), the instance cannot be terminated. Sadly, in this case, the only way to resolve the stuck request is to terminate the instance, but we cannot do so *because* the instance is stuck.

This seems unfortunate: if we try to kill an instance because it's doing something that it will never be able to finish, it shouldn't be able to say "no, you can't kill me, I'm too *busy* to die!". Instead, requests to terminate the instance should be prioritized over other requests. This commit does that.

Rather than sending termination requests to the `InstanceRunner` over the same channel as all other requests, we instead introduce a separate channel that's *just* for termination requests, which is preferred over the request channel in the biased `tokio::select!` in the `InstanceRunner` run loop. This means that a full request channel cannot stop a termination request from being sent.

When a request to the VMM is in flight, the future that awaits that request's completion is now one branch of a similar `tokio::select!` with the termination channel. This way, if a termination request comes in while the `InstanceRunner` is awaiting an in-flight instance operation, it will still be notified immediately of the termination request, cancel whatever operation it's waiting for, and go ahead and terminate the VMM immediately.

This is the correct behavior here, since the terminate operation is intended to forcefully terminate the VMM *now*, and is used internally for purposes such as `use_only_these_disks` killing instances that are using a no-longer-extant disk, or the control plane requesting that the sled-agent forcibly unregister the instance. "Normal" requests to stop the instance gracefully will go through the `instance_put_state` API instead, sending requests through the normal request channel and allowing in-flight operations to complete.
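
The prioritization described above can be illustrated with a small, synchronous sketch. This is not the sled-agent code: it uses bounded `std::sync::mpsc` channels as a stand-in for `tokio::sync::mpsc`, and it models the biased `tokio::select!` by polling the termination channel before the request channel on every loop iteration. The `Request` and `Event` types and the `run` and `demo` functions are hypothetical names invented for the illustration.

```rust
use std::sync::mpsc::{self, Receiver, TryRecvError, TrySendError};

/// Hypothetical request type; the real request enum is not shown in this commit message.
#[derive(Debug, PartialEq)]
enum Request {
    PutState(&'static str),
}

/// What the illustrative run loop did, in order.
#[derive(Debug, PartialEq)]
enum Event {
    Handled(Request),
    Terminated,
}

/// Illustrative run loop: the termination channel is always polled first,
/// which is the ordering a biased select gives when termination is the
/// first branch.
fn run(requests: Receiver<Request>, terminate: Receiver<()>) -> Vec<Event> {
    let mut log = Vec::new();
    loop {
        // Termination wins over any queued request.
        match terminate.try_recv() {
            Ok(()) => {
                log.push(Event::Terminated);
                return log;
            }
            // No termination pending (or no terminator left): fall through.
            Err(TryRecvError::Empty) | Err(TryRecvError::Disconnected) => {}
        }
        // Only then look at the ordinary request queue.
        match requests.try_recv() {
            Ok(req) => log.push(Event::Handled(req)),
            Err(_) => return log,
        }
    }
}

/// Fills the request channel, then shows that a termination request can
/// still be sent and is observed before the queued requests.
fn demo() -> (bool, Vec<Event>) {
    let (req_tx, req_rx) = mpsc::sync_channel::<Request>(2);
    let (term_tx, term_rx) = mpsc::sync_channel::<()>(1);

    req_tx.try_send(Request::PutState("running")).unwrap();
    req_tx.try_send(Request::PutState("stopped")).unwrap();
    // The request channel is now full, so a third request is rejected.
    let request_channel_full = matches!(
        req_tx.try_send(Request::PutState("rebooting")),
        Err(TrySendError::Full(_))
    );

    // The dedicated termination channel is unaffected by the full queue.
    term_tx.try_send(()).unwrap();

    (request_channel_full, run(req_rx, term_rx))
}
```

In this sketch, `demo()` reports that the request channel was full and returns a log containing only `Event::Terminated`, because the two queued requests are never reached. The sketch does not model cancelling an in-flight operation; in the commit that is handled by racing the in-flight future against the termination channel in a second `tokio::select!`.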
Showing 1 changed file with 199 additions and 74 deletions.