sled-agent not serving vmm requests #6911
Probably related to #6904. I found a relevant-looking sled agent log snippet (from
I suspect what's happened is that this specific snapshot request is stuck. It's not timing out because it's being serviced via a call to Propolis, and sled-agent's Propolis clients don't configure a timeout (see omicron/sled-agent/src/instance.rs, lines 1654 to 1660 at ca63e9f).
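To illustrate the timeout point, here's a minimal sketch, assuming the generated Propolis client wraps a `reqwest::Client`; the function name and the 30-second value are placeholders, not taken from the actual sled-agent code:

```rust
use std::time::Duration;

fn build_http_clients() -> Result<(), reqwest::Error> {
    // A default `reqwest::Client` has no overall request timeout, so a call
    // to a Propolis server that never responds will wait indefinitely; that
    // is roughly the stall described above.
    let _without_timeout = reqwest::Client::new();

    // Setting a timeout bounds how long any single HTTP call can block the
    // caller. The 30-second value is an arbitrary placeholder, not a value
    // taken from sled-agent.
    let _with_timeout = reqwest::Client::builder()
        .timeout(Duration::from_secs(30))
        .build()?;

    Ok(())
}
```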
Things to follow up on:
I should add that I also took a very quick peek at the core file, but all it reveals at first glance is that all of the
I think this seems quite doable and is almost certainly a good idea. The only thing that should be synchronous in the

Happy to take a crack at this change unless you were planning to do so imminently.
It's all yours if you want it! Let me know if you'd like me to pitch in.
One additional note re this and #6904: Although transferring requests from the
Yes, it indeed has a downstairs on sled 9 (BRM42220063) so that would explain why the snapshot task was stuck:
Before we have a fix to avoid the blocking behavior, what would be a reasonable workaround in this case? Would restarting
Good call--I think this should work, at least as far as instance management is concerned. All the existing Propolis zones will be reaped when the new sled agent starts, but any affected instances should go to Failed and then be restarted. The new sled agent will otherwise have a brand new instance manager task with an empty instance table.
The next snapshot saga retry should produce a permanent error, since this request is directed at a particular VMM, and sled agent will now return "no such VMM" instead of stalling and timing out. Assuming the saga recognizes that it now has a permanent error, it should unwind at that point.
Thanks @gjcolombo. @augustuswm - There is no rush but it sounds like we can put sled 13 back in service with a sled-agent restart.
Sounds good. I will do that now.
That looks to have worked. VMs on sled 13 were replaced, and then stop + start operations afterwards were successful. |
Sled-agent's `InstanceManager` task is responsible for managing the table of all instances presently running on the sled. When the sled-agent receives a request relating to an individual instance on the sled, it's sent to the `InstanceManager` over a `tokio::sync::mpsc` channel, and is then dispatched by the `InstanceManager` to the `InstanceRunner` task responsible for that individual instance by sending it over a *second* `tokio::sync::mpsc` channel. This is where things start to get interesting.[^1]

`tokio::sync::mpsc` is a *bounded* channel: there is a maximum number of messages which may be queued by a given MPSC channel at any given time. The `mpsc::Sender::send` method is an `async fn`, and if the channel is at capacity, that method will _wait_ until there is once again space in the channel to send the message being sent. Presently, `mpsc::Sender::send` is called by the `InstanceManager`'s main run loop when dispatching a request to an individual instance. As you may have already started to piece together, this means that if a given `InstanceRunner` task is not able to process requests fast enough to drain its channel, the entire `InstanceManager` loop will wait when dispatching a request to that instance until the queue has been drained. This means that if one instance's runner task has gotten stuck on something, like waiting for a Crucible flush that will never complete (as seen in #6911), that instance will prevent requests being dispatched to *any other instance* managed by the sled-agent. This is quite unfortunate!

This commit fixes this behavior by changing the functions that send requests to an individual instance's task to instead *shed load* when that instance's request queue is full. We now use the `mpsc::Sender::try_send` method, rather than `mpsc::Sender::send`, which does not wait and instead immediately returns an error when the channel is full. This allows the `InstanceManager` to instead return an error to the client indicating the channel is full, and move on to processing requests to other instances which may not be stuck. Thus, a single stuck instance can no longer block requests from being dispatched to other, perfectly fine instances.

The error returned when the channel is at capacity is converted to an HTTP 503 Service Unavailable error by the API. This indicates to the client that their request to that instance was not able to be processed at this time, but that it may be processed successfully in the future.[^2] Now, we can shed load while allowing clients to retry later, which seems much better than the present situation.

[^1]: In the sense of "may you live in interesting times", naturally.

[^2]: I also considered returning 429 Too Many Requests here, but my understanding is that that status code is supposed to indicate that too many requests have been received from *that specific client*. In this case, we haven't hit a per-client rate limit; we're just overloaded by requests more broadly, so it's not that particular client's fault.
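A minimal sketch of the load-shedding pattern described above, using simplified placeholder types (`InstanceRequest`, `Error`); the real sled-agent definitions differ:

```rust
use tokio::sync::mpsc;
use tokio::sync::mpsc::error::TrySendError;

/// Placeholder for sled-agent's per-instance request type.
struct InstanceRequest;

/// Placeholder error type; `Busy` is the case the API layer would map to
/// HTTP 503 Service Unavailable.
enum Error {
    Busy,
    InstanceGone,
}

/// Dispatch a request to an instance's runner task without ever waiting for
/// channel capacity, so one stuck `InstanceRunner` cannot stall the
/// `InstanceManager`'s dispatch loop on behalf of every other instance.
fn dispatch(
    runner_tx: &mpsc::Sender<InstanceRequest>,
    request: InstanceRequest,
) -> Result<(), Error> {
    runner_tx.try_send(request).map_err(|err| match err {
        // Queue full: shed load and let the client retry later.
        TrySendError::Full(_) => Error::Busy,
        // The runner task has exited and dropped its receiver.
        TrySendError::Closed(_) => Error::InstanceGone,
    })
}
```

Here the `Busy` case is the one surfaced as a 503, signaling a retryable overload condition rather than a per-client rate limit.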
Stacked on top of #6913.

Presently, sled-agent sends requests to terminate an instance to the `InstanceRunner` task over the same `tokio::sync::mpsc` request channel as all other requests sent to that instance. This means that the `InstanceRunner` will attempt to terminate the instance only once other requests received before the termination request have been processed, and an instance cannot be terminated if its request channel has filled up. Similarly, if an instance's `InstanceRunner` task is waiting for an in-flight request to the VMM to complete, the request to terminate the instance will not be seen until the current request to Propolis has returned.

This means that if the instance has gotten stuck for some reason --- e.g., because it is attempting a Crucible snapshot that cannot complete because a physical disk has gone missing, as seen in #6911 --- the instance cannot be terminated. Sadly, in this case, the only way to resolve the stuck request is to terminate the instance, but we cannot do so *because* the instance is stuck. This seems unfortunate: if we try to kill an instance because it's doing something that it will never be able to finish, it shouldn't be able to say "no, you can't kill me, I'm too *busy* to die!". Instead, requests to terminate the instance should be prioritized over other requests.

This commit does that. Rather than sending termination requests to the `InstanceRunner` over the same channel as all other requests, we instead introduce a separate channel that's *just* for termination requests, which is preferred over the request channel in the biased `tokio::select!` in the `InstanceRunner` run loop. This means that a full request channel cannot stop a termination request from being sent. When a request to the VMM is in flight, the future that awaits that request's completion is now one branch of a similar `tokio::select!` with the termination channel. This way, if a termination request comes in while the `InstanceRunner` is awaiting an in-flight instance operation, it will still be notified immediately of the termination request, cancel whatever operation it's waiting for, and go ahead and terminate the VMM immediately.

This is the correct behavior here, since the terminate operation is intended to forcefully terminate the VMM *now*, and is used internally for purposes such as `use_only_these_disks` killing instances that are using a no-longer-extant disk, or the control plane requesting that the sled-agent forcibly unregister the instance. "Normal" requests to stop the instance gracefully will go through the `instance_put_state` API instead, sending requests through the normal request channel and allowing in-flight operations to complete.
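A hedged sketch of the two-channel arrangement described above; the types, names, and run-loop structure are simplified placeholders rather than the actual `InstanceRunner` implementation:

```rust
use tokio::sync::mpsc;

// Simplified placeholder types.
struct TerminateRequest;
struct InstanceRequest;

async fn run_instance(
    mut terminate_rx: mpsc::Receiver<TerminateRequest>,
    mut request_rx: mpsc::Receiver<InstanceRequest>,
) {
    loop {
        tokio::select! {
            // `biased` polls the branches in declaration order, so a pending
            // termination request is always observed before any queued
            // request, even if the request channel is completely full.
            biased;

            Some(_term) = terminate_rx.recv() => {
                // Tear down the VMM immediately and exit the run loop.
                break;
            }
            Some(request) = request_rx.recv() => {
                // While servicing a (possibly long-running) request, race it
                // against the termination channel so a stuck operation can
                // still be interrupted by a terminate request.
                tokio::select! {
                    biased;
                    Some(_term) = terminate_rx.recv() => break,
                    _ = handle_request(request) => {}
                }
            }
            // Both channels closed: nothing left to do.
            else => break,
        }
    }
}

async fn handle_request(_request: InstanceRequest) {
    // Placeholder for dispatching the request to Propolis, etc.
}
```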
PR #6913 fixes the issue where a sled-agent with one instance which has gotten stuck will not handle requests for any other instances on the sled. PR #6915 (which just merged) fixes the sled-agent not being able to forcefully terminate instances when it is waiting for an in-flight operation to complete. Collectively, those two changes fix this issue as far as the sled-agent is concerned. Unless we want to keep this issue open until the underlying Crucible issue is resolved, I think we can close this --- and I think #6932 picked up the Crucible fix, but I'd want @leftwo to confirm?
The underlying issue was that a sled had gone away (potentially an electrical issue), not a Crucible problem. We can close this ticket.
Hm, I thought it was a Crucible problem insofar as the instance had gotten stuck trying to take a snapshot while in a degraded state (one of its downstairs replicas was on the sled that had gone away), and that condition was being handled by retrying the snapshot in a loop forever? In any case, happy to close this!
The situation was a bit unusual as we chose not to expunge the sled. If we had gone through expungement, the missing region downstairs would have been replaced on another sled and brought the disk back to a healthy state, which in turn would have allowed the snapshot to complete. If @leftwo thinks we should handle the prolonged degraded state in a better way, we probably want to have a new Crucible ticket filed for that (so closing this ticket is still the right thing to do!).
Yup, that makes sense! Thanks for clarifying!
The issue was seen on sled 13 of rack3. There was a problem with mupdate during that time and we're not sure if it's related at all. Other sleds do not have this particular issue AFAICT. https://github.com/oxidecomputer/colo/issues/88#issuecomment-2423940805 is where we noted the problem, also cloned below:
The recent sled-agent logs and core files have been uploaded to /staff/rack3/colo-88.