[sled-agent] Give `InstanceManager` a gun #6915

hawkw · 2024-10-22T19:31:20Z

Stacked on top of #6913

Presently, sled-agent sends requests to terminate an instance to the InstanceRunner task over the same request channel as all other requests sent to that instance. This means that the InstanceRunner will attempt to terminate the instance only once all requests received before the termination request have been processed, and an instance cannot be terminated at all if its request channel has filled up.

This seems unfortunate. If an instance gets stuck, the fact that it's stuck should not prevent it from being stopped. Instead, requests to terminate the instance should be prioritized over other requests. This commit does that.
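As a rough illustration of the idea (not the code in this PR), the sketch below gives termination its own channel and checks it before the ordinary request queue on every loop iteration. It uses only `std` threads and channels so that it is dependency-free; the real `InstanceRunner` is an async task, and `Request`, `Event`, and `run` are hypothetical names invented for this example.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Hypothetical stand-ins for the ordinary requests a runner handles.
#[derive(Debug, PartialEq)]
enum Request {
    PutState(&'static str),
    ZoneBundle,
}

/// What the runner did, recorded so the ordering can be inspected.
#[derive(Debug, PartialEq)]
enum Event {
    Handled(Request),
    Terminated,
}

/// Runs until termination is requested or the request channel closes.
fn run(requests: &Receiver<Request>, terminate: &Receiver<()>) -> Vec<Event> {
    let mut log = Vec::new();
    loop {
        // Check the dedicated termination channel first on every iteration,
        // so a backlog of ordinary requests cannot delay termination.
        if terminate.try_recv().is_ok() {
            log.push(Event::Terminated);
            return log;
        }
        // Wait only briefly for an ordinary request, then re-check above.
        match requests.recv_timeout(Duration::from_millis(10)) {
            Ok(req) => log.push(Event::Handled(req)),
            Err(RecvTimeoutError::Timeout) => {}
            Err(RecvTimeoutError::Disconnected) => return log,
        }
    }
}
```

Because termination does not share the bounded request queue, a full queue cannot block it: the sender can still signal `terminate`, and `run` returns before draining the backlog.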

hawkw · 2024-10-22T19:32:33Z

This is presently a draft, partially because it depends on #6911, but more so because I'm wondering if it's actually the complete solution. Potentially, we might want to make a best-effort attempt to finish processing other requests, like creating a zone bundle, with a timeout so that the terminate request is always honored eventually. I'll keep working on this.

gjcolombo · 2024-10-22T20:44:16Z

sled-agent/src/instance.rs

                },
+                // Requests to terminate the instance take priority over any


Maybe a silly question, but: What happens if this task is already stuck awaiting a response to one of the commands below? IIUC that's what we're seeing in #6911--the InstanceRunner loop for some instance is stuck waiting on a Propolis request, so it's not looking at any of its message queues.

hawkw · 2024-10-22

Ah, yeah, you're right, this will also need to select over both any operation we do in any of these branches and the termination channel firing. I'll fix that.

hawkw · 2024-10-23

Okay, @gjcolombo, commit 2a331ab changes this so that the termination channel firing takes priority over any in-flight request to the instance, so even if that request gets stuck, we will still pull the plug immediately.

I wondered if we wanted to add a grace period to allow the in-flight operation to finish, but after talking to @smklein, I don't think that's actually necessary, because terminate is only used by the most forceful attempts to stop an instance (the vmm_unregister API and killing an instance that was using an expunged disk). A "normal" attempt to stop the instance should go through instance_put_state with the Stopped state, which goes in the normal request queue.
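To make the "termination interrupts an in-flight operation" behavior concrete, here is a minimal sketch under stated assumptions. It is not the code from 2a331ab: it approximates an async select with a `std` helper thread and a short polling loop, and `Outcome` and `run_interruptible` are hypothetical names. There is deliberately no grace period, matching the reasoning above.

```rust
use std::sync::mpsc::{self, Receiver, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// How an interruptible operation ended.
#[derive(Debug, PartialEq)]
enum Outcome {
    Completed(u32),
    Terminated,
    /// The operation's thread exited without producing a result.
    Failed,
}

/// Runs `op` on a helper thread and waits for either its result or a
/// termination signal. Termination is checked first on every iteration,
/// so a stuck operation cannot prevent it from being observed.
fn run_interruptible<F>(op: F, terminate: &Receiver<()>) -> Outcome
where
    F: FnOnce() -> u32 + Send + 'static,
{
    let (done_tx, done_rx) = mpsc::channel();
    thread::spawn(move || {
        // Ignore a send error: the caller may already have given up on us.
        let _ = done_tx.send(op());
    });
    loop {
        if terminate.try_recv().is_ok() {
            // Stop waiting immediately; the operation is simply abandoned.
            return Outcome::Terminated;
        }
        match done_rx.recv_timeout(Duration::from_millis(5)) {
            Ok(value) => return Outcome::Completed(value),
            Err(RecvTimeoutError::Timeout) => {}
            Err(RecvTimeoutError::Disconnected) => return Outcome::Failed,
        }
    }
}
```

In this sketch the abandoned operation keeps running on its helper thread until it finishes on its own; an async implementation could instead drop the in-flight future when the termination branch wins.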

hawkw requested a review from smklein October 22, 2024 19:31

gjcolombo reviewed Oct 22, 2024

hawkw force-pushed the eliza/give-instance-manager-a-gun branch from 7be3c00 to 2a331ab Compare October 23, 2024 18:34

Base automatically changed from eliza/instance-mangling to main October 25, 2024 17:22

smklein approved these changes Oct 28, 2024

hawkw added 2 commits October 28, 2024 10:39

[sled-agent] Give InstanceManager a gun

176ea5a

allow termination to interrupt stuck instance ops

da53ce9

hawkw force-pushed the eliza/give-instance-manager-a-gun branch from 2a331ab to da53ce9 Compare October 28, 2024 17:39

factor out termination channel handling

608b336

as suggested by @smklein

hawkw marked this pull request as ready for review October 29, 2024 22:28

hawkw requested a review from gjcolombo October 29, 2024 22:28

gjcolombo approved these changes Oct 29, 2024

hawkw merged commit e313b65 into main Oct 30, 2024
16 checks passed

hawkw deleted the eliza/give-instance-manager-a-gun branch October 30, 2024 18:55

hawkw mentioned this pull request Oct 30, 2024

sled-agent not serving vmm requests #6911

Closed
