This repository has been archived by the owner on Sep 28, 2024. It is now read-only.

[Question] Distribution, Packaging and DKMS support? #703

Open
ernstae opened this issue May 5, 2019 · 3 comments

Comments

@ernstae

ernstae commented May 5, 2019

I run multiple Azure instances on Standard_NV and Standard_NC series virtual machines, each with more than one GPU device assigned to the VM. Without the LIS RPMs, the VM doesn't see all of the NVIDIA GPU devices assigned to the guest, and if there's a kernel+LIS mismatch, the results can be unpredictable (e.g. 0 GPU devices, or 1 GPU device).

My patching strategy is to adopt new kernels within a week of release, but LIS packages that match a new kernel are usually not available that quickly (I'm in CentOS 7 land).

I was using the OpenLogic repository to manage the LIS installation process, but that repository hasn't been updated since version 4.2.6. It's no longer possible to reliably run yum install kmod-microsoft-hyper-v microsoft-hyper-v to install the LIS RPMs, because there is a specific set of RPMs for each small patch level of every kernel.
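To make the patch-level problem concrete, here is a minimal sketch of how a script might detect that the running kernel no longer matches the kernel a set of kmod RPMs was built for. The function names `parse_kernel_release` and `kernels_match` are hypothetical, and the sketch assumes RHEL/CentOS-style release strings such as `3.10.0-957.12.1.el7.x86_64`.

```python
import re


def parse_kernel_release(release):
    """Split a RHEL/CentOS-style kernel release into comparable integer tuples.

    Only the leading numeric "version-build" part is used, so suffixes such as
    ".el7.x86_64" are ignored.
    """
    match = re.match(r"(\d+(?:\.\d+)*)-(\d+(?:\.\d+)*)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    version = tuple(int(part) for part in match.group(1).split("."))
    build = tuple(int(part) for part in match.group(2).split("."))
    return version, build


def kernels_match(built_for, running):
    """Return True when both strings name the same kernel patch level."""
    return parse_kernel_release(built_for) == parse_kernel_release(running)
```

On a live system the running release could be read with `platform.release()` from the standard library and compared against whatever kernel version the installed kmod package was built for (how that version is obtained is left open here).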

That brings me to the impracticality of having to download the >400 MB .tar or ISO and run the shell scripts to install this set of packages. (It also complicates things in an air-gapped environment where http://aka.ms/LIS is not reachable.)

My questions about LIS and how it relates to Azure VMs running Linux...

  • Can LIS be distributed as a single set of RPMs for each operating system distribution?
  • If so, can the packages be added to the microsoft-prod yum repository?
  • Can you rely on DKMS to recompile automatically when the kernel changes (e.g. 3.10.0-957.10.1 vs 3.10.0-957.12.1), so that the current installation process could be streamlined?
  • Can you improve testing for Azure GPU VMs, ensuring that a Standard_NC12 host reports 2 GPUs when LIS has been installed? (For example, LIS 4.3.0 was broken and revealed only 1 GPU.)
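The GPU check in the last question could be sketched roughly as follows. This is only an illustration: it assumes the NVIDIA driver is installed so that `nvidia-smi -L` is available (it prints one `GPU <index>: ...` line per visible device), and the `EXPECTED_GPUS` table and function names are hypothetical. Only the Standard_NC12 entry comes from this thread.

```python
import re
import subprocess

# Expected GPU count per VM size. Only Standard_NC12 = 2 is taken from the
# issue text; any other sizes would have to be filled in by the operator.
EXPECTED_GPUS = {"Standard_NC12": 2}


def count_gpus(listing):
    """Count the 'GPU <n>: ...' lines in `nvidia-smi -L` style output."""
    return sum(
        1 for line in listing.splitlines() if re.match(r"GPU \d+:", line.strip())
    )


def check_gpu_count(vm_size, listing):
    """Return (ok, found, expected) for the given VM size and GPU listing."""
    expected = EXPECTED_GPUS[vm_size]
    found = count_gpus(listing)
    return found == expected, found, expected


def read_gpu_listing():
    """Run `nvidia-smi -L`; a failing run is treated as an empty listing."""
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    return result.stdout if result.returncode == 0 else ""
```

A post-install test could then call `check_gpu_count("Standard_NC12", read_gpu_listing())` and fail the run when the first element of the result is false.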
@ernstae
Author

ernstae commented May 5, 2019

I did some follow-up testing today, and it looks like adding this CentOS repository solves many of the problems I was experiencing:

http://mirror.centos.org/centos/7/virt/x86_64/azure/

It does require switching to the kernel-azure package and the associated kernel-azure-tools package (which appears to be the LIS software).
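For anyone automating this, a repository definition for that URL could be generated along these lines. This is a rough sketch, not an official configuration: the repository id `centos-virt-azure`, the display name, and the helper `render_repo` are made up for illustration, and only the base URL comes from this thread. It leaves `gpgcheck` on and assumes the signing key is either already imported or passed in by the caller.

```python
import configparser
import io


def render_repo(repo_id, name, baseurl, gpgkey=None):
    """Render a yum .repo stanza as text (all arguments are caller-supplied)."""
    config = configparser.ConfigParser(interpolation=None)
    options = {
        "name": name,
        "baseurl": baseurl,
        "enabled": "1",
        "gpgcheck": "1",
    }
    if gpgkey is not None:
        # Location of the signing key; the caller must provide the real path.
        options["gpgkey"] = gpgkey
    config[repo_id] = options
    buffer = io.StringIO()
    # yum .repo files conventionally use "key=value" without spaces.
    config.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()


# Hypothetical id and name; the URL is the one quoted above.
repo_text = render_repo(
    "centos-virt-azure",
    "CentOS 7 Virt SIG - Azure",
    "http://mirror.centos.org/centos/7/virt/x86_64/azure/",
)
```

The resulting text could be written to a file under /etc/yum.repos.d/, after which kernel-azure and kernel-azure-tools would be installable with yum as described above.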

I was able to see multiple GPUs with no fuss on the latest 3.10.0-957.12.1 kernel release, which did not work with a vanilla kernel + LIS 4.3.1, so that's an improvement. Is this the proposed path forward for CentOS/RHEL users?

@santoshx
Contributor

FYI, the latest LIS release (lis-rpms-4.3.3.tar.gz) is size-optimized and is 170 MB.

@ernstae
Author

ernstae commented Jun 27, 2019

@santoshx - I appreciate the smaller download. That is definitely helpful. However, I have found that relying on the CentOS Virtualization SIG kernel-azure package is a much more reliable way to ensure support for multi-GPU Linux VMs on Azure. I now have a more predictable (and less manual) approach. It removes my dependency on a separate package and has been a cleaner experience, and it lets me patch much faster after a CVE kernel update (a few extra days, as opposed to a couple of weeks).
