
Handle unexpected node reboots #336

Open
empovit opened this issue Dec 16, 2024 · 2 comments

@empovit
Contributor

empovit commented Dec 16, 2024

On NVIDIA GPUs, MIG partitions are not persisted between node reboots.

If a node crashes and InstaSlice does not get a chance to gracefully de-allocate and delete the MIG partitions, the InstaSlice object becomes outdated: it still lists "dangling" allocations, and the controller will try to assign them to the restarted pods. Since those partitions no longer exist on the GPU, the workloads will fail to resume running.
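For illustration only, here is a minimal sketch (not InstaSlice code) of one way the controller could notice an unexpected reboot: Kubernetes exposes the kernel boot ID in the node status, and a changed bootID means every MIG partition created before the reboot is gone. The `lastSeenBootID` argument is a hypothetical piece of state the controller would have to record when it creates allocations.

```go
// Sketch of reboot detection via the node's boot ID; not InstaSlice code.
package rebootcheck

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// nodeRebooted compares the node's current boot ID against lastSeenBootID, a
// hypothetical value the controller would persist (e.g. in the InstaSlice
// object) at allocation time. A mismatch means the node restarted and any
// MIG partitions created before the reboot no longer exist.
func nodeRebooted(ctx context.Context, c client.Client, nodeName, lastSeenBootID string) (bool, error) {
	var node corev1.Node
	if err := c.Get(ctx, client.ObjectKey{Name: nodeName}, &node); err != nil {
		return false, err
	}
	return node.Status.NodeInfo.BootID != lastSeenBootID, nil
}
```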

@rphillips
Contributor

Can the daemonset remove all the MIG partitions on boot?

@empovit
Contributor Author

empovit commented Dec 22, 2024

@rphillips if you mean MIG partitions on the GPU, that's not needed as they don't survive reboots.

What I meant was deleting all the existing allocation entries in InstaSlice (precisely because the respective MIG partitions don't exist anymore), and updating all previously running MIG workloads to assign new partitions so that the pods can resume running. I think that should be the controller's job.
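For illustration, a rough sketch of the cleanup I have in mind, using hypothetical simplified types rather than the real InstaSlice CRD fields: once a reboot is detected, drop every allocation recorded for that node and hand the affected pod UIDs back so new partitions can be created and the pods can resume.

```go
// Rough sketch only; Allocation is a simplified stand-in for an InstaSlice
// allocation entry, not the actual API type.
package main

import "fmt"

// Allocation is a placeholder for one allocation recorded in the InstaSlice object.
type Allocation struct {
	PodUID  string
	MIGUUID string // MIG partition created before the reboot; it no longer exists
}

// handleNodeReboot clears all dangling allocations for a rebooted node and
// returns the pod UIDs that need fresh MIG partitions assigned.
func handleNodeReboot(allocations map[string]Allocation) []string {
	var requeue []string
	for podUID := range allocations {
		requeue = append(requeue, podUID)
		delete(allocations, podUID) // safe: deleting during range is allowed in Go
	}
	return requeue
}

func main() {
	allocs := map[string]Allocation{
		"pod-a": {PodUID: "pod-a", MIGUUID: "MIG-stale-1"},
		"pod-b": {PodUID: "pod-b", MIGUUID: "MIG-stale-2"},
	}
	fmt.Println("pods to re-allocate:", handleNodeReboot(allocs))
}
```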

Sorry if the issue description is confusing. Feel free to suggest a better wording.
