
Handle unexpected node reboots #336

Open
empovit opened this issue Dec 16, 2024 · 2 comments

@empovit
Contributor

empovit commented Dec 16, 2024

On NVIDIA GPUs, MIG partitions are not persisted between node reboots.

If a node crashes and InstaSlice does not get a chance to gracefully de-allocate and delete the MIG partitions, the InstaSlice object becomes outdated: it still lists "dangling" allocations, and the controller will try to assign them to the restarted pods. Since those partitions no longer exist on the GPU, the workloads will fail to resume running.
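For illustration only, here is a minimal sketch (not InstaSlice code) of one way the controller could notice an unexpected reboot: Kubernetes exposes the kernel boot ID in the node status, and a changed bootID means every MIG partition created before the reboot is gone. The `lastSeenBootID` argument is a hypothetical piece of state the controller would have to record when it creates allocations.

```go
// Sketch of reboot detection via the node's boot ID; not InstaSlice code.
package rebootcheck

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// nodeRebooted compares the node's current boot ID against lastSeenBootID, a
// hypothetical value the controller would persist (e.g. in the InstaSlice
// object) at allocation time. A mismatch means the node restarted and any
// MIG partitions created before the reboot no longer exist.
func nodeRebooted(ctx context.Context, c client.Client, nodeName, lastSeenBootID string) (bool, error) {
	var node corev1.Node
	if err := c.Get(ctx, client.ObjectKey{Name: nodeName}, &node); err != nil {
		return false, err
	}
	return node.Status.NodeInfo.BootID != lastSeenBootID, nil
}
```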

@rphillips
Contributor

Can the daemonset remove all the MIG partitions on boot?

@empovit
Contributor Author

empovit commented Dec 22, 2024

@rphillips if you mean MIG partitions on the GPU, that's not needed as they don't survive reboots.

What I meant was deleting all the existing allocation entries in InstaSlice (precisely because the respective MIG partitions don't exist anymore), and updating all previously running MIG workloads to assign new partitions so that the pods can resume running. I think that should be the controller's job.
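For illustration, a rough sketch of the cleanup I have in mind, using hypothetical simplified types rather than the real InstaSlice CRD fields: once a reboot is detected, drop every allocation recorded for that node and hand the affected pod UIDs back so new partitions can be created and the pods can resume.

```go
// Rough sketch only; Allocation is a simplified stand-in for an InstaSlice
// allocation entry, not the actual API type.
package main

import "fmt"

// Allocation is a placeholder for one allocation recorded in the InstaSlice object.
type Allocation struct {
	PodUID  string
	MIGUUID string // MIG partition created before the reboot; it no longer exists
}

// handleNodeReboot clears all dangling allocations for a rebooted node and
// returns the pod UIDs that need fresh MIG partitions assigned.
func handleNodeReboot(allocations map[string]Allocation) []string {
	var requeue []string
	for podUID := range allocations {
		requeue = append(requeue, podUID)
		delete(allocations, podUID) // safe: deleting during range is allowed in Go
	}
	return requeue
}

func main() {
	allocs := map[string]Allocation{
		"pod-a": {PodUID: "pod-a", MIGUUID: "MIG-stale-1"},
		"pod-b": {PodUID: "pod-b", MIGUUID: "MIG-stale-2"},
	}
	fmt.Println("pods to re-allocate:", handleNodeReboot(allocs))
}
```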

Sorry if the issue description is confusing. Feel free to suggest a better wording.
