[sled-agent] Give InstanceManager a gun #6915

Merged
hawkw merged 3 commits into main from eliza/give-instance-manager-a-gun on Oct 30, 2024

Member

@hawkw hawkw commented Oct 22, 2024

Stacked on top of #6913

Presently, sled-agent sends requests to terminate an instance to the InstanceRunner task over the same request channel as all other requests sent to that instance. This means that the InstanceRunner will attempt to terminate the instance only after it has processed every request received before the termination request, and an instance cannot be terminated at all if its request channel has filled up.

This seems unfortunate. If an instance gets stuck, the fact that it's stuck should not prevent it from being stopped. Instead, requests to terminate the instance should be prioritized over other requests. This commit does that.
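A minimal sketch of the prioritization described above, using only `std::sync::mpsc` rather than the async channels sled-agent actually uses. The `Request` and `Next` types and the `next_work` function are hypothetical names invented for illustration; they are not taken from `sled-agent/src/instance.rs`. The idea is that termination gets its own channel, and the loop checks that channel before looking at the ordinary request queue.

```rust
use std::sync::mpsc::{Receiver, TryRecvError};

/// Hypothetical stand-in for the ordinary requests an instance task handles.
#[derive(Debug, PartialEq)]
enum Request {
    PutState(&'static str),
    ZoneBundle,
}

/// What the loop should do next.
#[derive(Debug, PartialEq)]
enum Next {
    Terminate,
    Handle(Request),
    Idle,
    Closed,
}

/// Pick the next piece of work, always checking the dedicated termination
/// channel before the ordinary request queue.
fn next_work(terminate_rx: &Receiver<()>, request_rx: &Receiver<Request>) -> Next {
    // Termination wins even if ordinary requests were queued earlier.
    if terminate_rx.try_recv().is_ok() {
        return Next::Terminate;
    }
    match request_rx.try_recv() {
        Ok(req) => Next::Handle(req),
        Err(TryRecvError::Empty) => Next::Idle,
        Err(TryRecvError::Disconnected) => Next::Closed,
    }
}
```

Because termination no longer shares the bounded request queue, a full queue cannot block it: a sender can still deliver the termination signal when the request channel rejects new requests.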

@hawkw hawkw requested a review from smklein October 22, 2024 19:31
Member Author

hawkw commented Oct 22, 2024

This is presently a draft, partly because it depends on #6911, but more so because I'm wondering if it's actually the complete solution. Potentially, we might want to make a best-effort attempt to finish processing other requests, like creating a zone bundle, with a timeout so that the terminate request is always honored eventually. I'll keep working on this.

},
// Requests to terminate the instance take priority over any
Contributor

Maybe a silly question, but: What happens if this task is already stuck awaiting a response to one of the commands below? IIUC that's what we're seeing in #6911--the InstanceRunner loop for some instance is stuck waiting on a Propolis request, so it's not looking at any of its message queues.

Member Author

Ah, yeah, you're right, this will need to also select over any operation we do in any of these branches and the termination channel firing. I'll fix that.

Member Author

Okay, @gjcolombo, commit 2a331ab changes this so that the termination channel firing takes priority over any in-flight request to the instance, so even if that request gets stuck, we will still pull the plug immediately.

I wondered if we wanted to add a grace period to allow the in-flight operation to finish, but after talking to @smklein, I don't think that's actually necessary, because terminate is only used by the most forceful attempts to stop an instance (the vmm_unregister API and killing an instance that was using an expunged disk). A "normal" attempt to stop the instance should go through instance_put_state with the Stopped state, which goes in the normal request queue.
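A rough sketch of the behavior discussed in this thread, where a stuck in-flight operation must not prevent termination. The thread talks about selecting over the operation and the termination channel; this sketch only approximates that with standard-library threads and a shared event channel, so it is an analogy rather than the code in commit 2a331ab. `Outcome` and `race_with_termination` are hypothetical names, and unlike a prioritized select, this version simply returns whichever event arrives first.

```rust
use std::sync::mpsc::{self, Receiver};
use std::thread;

/// How the wait ended (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
enum Outcome {
    Finished(u32),
    Terminated,
}

/// Run `op` on a worker thread and return as soon as either the operation
/// completes or a termination signal arrives, whichever comes first.
fn race_with_termination<F>(op: F, terminate_rx: Receiver<()>) -> Outcome
where
    F: FnOnce() -> u32 + Send + 'static,
{
    let (event_tx, event_rx) = mpsc::channel();

    // Worker: reports the result if the operation ever completes.
    let op_tx = event_tx.clone();
    thread::spawn(move || {
        let _ = op_tx.send(Outcome::Finished(op()));
    });

    // Forwarder: turns a termination signal into an event on the same channel.
    thread::spawn(move || {
        if terminate_rx.recv().is_ok() {
            let _ = event_tx.send(Outcome::Terminated);
        }
    });

    // The caller never waits on the operation directly, so a stuck
    // operation cannot hide the termination signal.
    event_rx
        .recv()
        .expect("both the worker and the forwarder exited without reporting")
}
```

In this sketch the abandoned operation keeps running on its detached thread after termination wins; a real implementation would also need to decide how to cancel or clean up that work.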

@hawkw hawkw force-pushed the eliza/give-instance-manager-a-gun branch from 7be3c00 to 2a331ab October 23, 2024 18:34
Base automatically changed from eliza/instance-mangling to main October 25, 2024 17:22
Review thread on sled-agent/src/instance.rs (outdated, resolved)
@hawkw hawkw force-pushed the eliza/give-instance-manager-a-gun branch from 2a331ab to da53ce9 October 28, 2024 17:39
@hawkw hawkw marked this pull request as ready for review October 29, 2024 22:28
@hawkw hawkw requested a review from gjcolombo October 29, 2024 22:28
@hawkw hawkw merged commit e313b65 into main Oct 30, 2024
16 checks passed
@hawkw hawkw deleted the eliza/give-instance-manager-a-gun branch October 30, 2024 18:55
3 participants