On NVIDIA GPUs, MIG partitions are not persisted between node reboots.
If a node crashes before InstaSlice has a chance to gracefully de-allocate and delete its MIG partitions, the InstaSlice object ends up listing "dangling" allocations, and the controller tries to assign these now-nonexistent partitions to the restarted pods. As a result, the workloads fail to resume running.
@rphillips if you mean deleting the MIG partitions on the GPU, that's not needed, as they don't survive reboots.
What I meant was deleting all the existing allocation entries in InstaSlice (precisely because the respective MIG partitions no longer exist) and updating all previously running MIG workloads with newly created partitions so that the pods can resume running. I think that should be the controller's job; a rough sketch of the idea is below.
Sorry if the issue description is confusing. Feel free to suggest a better wording.
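For what it's worth, here is one way the controller could detect this situation. This is purely illustrative and not based on InstaSlice's actual code: compare the node's kubelet-reported boot ID (`node.Status.NodeInfo.BootID`) against a boot ID recorded when the allocations were made; a mismatch means the node has rebooted, every recorded allocation is dangling, and they can be dropped before the pods are re-admitted. The `recordedBootID` parameter and the `clearStaleAllocations` callback are hypothetical stand-ins for whatever state and cleanup logic the controller actually keeps.

```go
package controller

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// reconcileNodeReboot drops stale MIG allocations when the node has rebooted
// since they were recorded, i.e. when the partitions they refer to no longer
// exist on the GPU.
func reconcileNodeReboot(
	ctx context.Context,
	clientset kubernetes.Interface,
	nodeName string,
	recordedBootID string, // boot ID stored alongside the allocations (hypothetical)
	clearStaleAllocations func(ctx context.Context) error, // controller-specific cleanup (hypothetical)
) error {
	node, err := clientset.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return fmt.Errorf("getting node %s: %w", nodeName, err)
	}

	// node.Status.NodeInfo.BootID changes on every reboot, so a mismatch means
	// all MIG partitions recorded before the reboot are gone and every
	// corresponding allocation entry is dangling.
	if node.Status.NodeInfo.BootID == recordedBootID {
		return nil // same boot: the recorded allocations are still valid
	}
	return clearStaleAllocations(ctx)
}
```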