Clarify what Kata mgr does

Signed-off-by: Mike McKiernan <[email protected]>
NVIDIA · Sep 10, 2024 · ae396a3 · ae396a3
1 parent 2c5c3ac
commit ae396a3
Showing 1 changed file with 40 additions and 90 deletions.
diff --git a/gpu-operator/gpu-operator-kata.rst b/gpu-operator/gpu-operator-kata.rst
@@ -69,56 +69,19 @@ The following diagram shows the software components that Kubernetes uses to run
 
 NVIDIA supports Kata Containers by using Helm to run a daemon set that installs the Kata runtime and QEMU.
 
-The daemon set runs the `kata-deploy.sh` script and configures each worker node with a runtime class, ``kata-qemu-nvidia-gpu``,
-and configures containerd for the runtime class.
+The daemon set runs the ``kata-deploy.sh`` script and configures each worker node with a runtime class, ``kata-qemu-nvidia-gpu``.
 
 About NVIDIA Kata Manager
 =========================
 
 When you configure the GPU Operator for Kata Containers, the Operator
 deploys NVIDIA Kata Manager as an operand.
 
-The manager downloads an NVIDIA optimized Linux kernel image and initial RAM disk that
-provides the lightweight operating system for the virtual machines that run in QEMU.
-These artifacts are downloaded from the NVIDIA container registry, nvcr.io, on each worker node.
-
-.. comment
-
-   NVIDIA Kata Manager Configuration
-   =================================
-
-   The following part of the cluster policy shows the fields related to the manager:
-
-   .. code-block:: yaml
-
-      kataManager:
-        enabled: true
-        config:
-          artifactsDir: /opt/nvidia-gpu-operator/artifacts/runtimeclasses
-          runtimeClasses:
-          - artifacts:
-              pullSecret: ""
-              url: nvcr.io/nvidia/cloud-native/kata-gpu-artifacts:ubuntu22.04-525
-            name: kata-qemu-nvidia-gpu
-            nodeSelector: {}
-          - artifacts:
-              pullSecret: ""
-              url: nvcr.io/nvidia/cloud-native/kata-gpu-artifacts:ubuntu22.04-535-snp
-            name: kata-qemu-nvidia-gpu-snp
-            nodeSelector: {}
-        repository: nvcr.io/nvidia/cloud-native
-        image: k8s-kata-manager
-        version: v0.1.0
-        imagePullPolicy: IfNotPresent
-        imagePullSecrets: []
-        env: []
-        resources: {}
-
-   The ``kata-qemu-nvidia-gpu`` runtime class is used with Kata Containers.
-
-   The ``kata-qemu-nvidia-gpu-snp`` runtime class is used with Confidential Containers
-   and is installed by default even though it is not used with this configuration.
+The manager performs the following actions on each node that is labeled to run Kata Containers:
 
+- Configure containerd with the ``kata-qemu-nvidia-gpu`` runtime class.
+- Create a CDI specification, ``/var/run/cdi/nvidia.com-pgpu.yaml``, for each GPU on the node.
+- Loads the vhost-sock and vhost-net Linux kernel modules.
 
 *********************************
 Benefits of Using Kata Containers
@@ -135,6 +98,7 @@ The primary benefits of Kata Containers are as follows:
 
 * Transparent deployment of unmodified containers.
 
+
 ****************************
 Limitations and Restrictions
 ****************************
@@ -148,8 +112,8 @@ Limitations and Restrictions
 * Support for Kata Containers is limited to the implementation described on this page.
   The Operator does not support Red Hat OpenShift sandbox containers.
 
-* Uninstalling the GPU Operator or the NVIDIA Kata Manager does not remove the files
-  that the manager downloads and installs in the ``/opt/nvidia-gpu-operator/artifacts/runtimeclasses/kata-qemu-nvidia-gpu/``
+* Uninstalling the GPU Operator or the NVIDIA Kata Manager does not remove the
+  ``/opt/nvidia-gpu-operator/artifacts/runtimeclasses/``
   directory on the worker nodes.
 
 * NVIDIA supports the Operator and Kata Containers with the containerd runtime only.
@@ -196,7 +160,7 @@ Prerequisites
 
 * Your hosts are configured to support IOMMU.
 
-  If the output from running ``ls /sys/kernel/iommu_groups`` includes a value greater than ``0``,
+  If the output from running ``ls -1 /sys/kernel/iommu_groups | wc -l`` includes a value greater than ``0``,
   then your host is configured for IOMMU.
 
   If a host is not configured or you are unsure, add the ``intel_iommu=on`` Linux kernel command-line argument.
@@ -260,7 +224,7 @@ The following table shows the configurable values from the Kata Deploy Helm char
 
    * - ``kataDeploy.createRuntimeClasses``
      - When set to ``true``, the ``kata-deploy.sh`` script installs the runtime classes on the nodes.
-     - ``true``
+     - ``false``
 
    * - ``kataDeploy.createDefaultRuntimeClass``
      - When set to ``true``, the ``kata-deploy.sh`` script sets the runtime class specified in the ``defaultShim`` field as the default Kata runtime class.
@@ -273,14 +237,17 @@ The following table shows the configurable values from the Kata Deploy Helm char
    * - ``kataDeploy.defaultShim``
      - Specifies the shim to set as the default Kata runtime class.
        This field is ignored unless you specify ``createDefaultRuntimeClass: true``.
-     - ``qemu-nvidia-gpu``
+     - None
 
    * - ``kataDeploy.imagePullPolicy``
      - Specifies the image pull policy for the ``kata-deploy`` container.
      - ``Always``
 
    * - ``kataDeploy.k8sDistribution``
-     - FIXME
+     - Specifies the Kubernetes platform.
+       The Helm chart uses the value to set the platform-specific location of the containerd configuration file.
+
+       Supported values are ``k8s``, ``k3s``, ``rke2``, and ``k0s``.
      - ``k8s``
 
    * - ``kataDeploy.repository``
@@ -303,28 +270,35 @@ Install the Kata Deploy Helm Chart
 
 Perform the following steps to install the Helm chart:
 
+#. Label the nodes to run virtual machines in containers. Label only the nodes that you want to run with Kata Containers:
+
+   ```console
+   $ kubectl label node <node-name> nvidia.com/gpu.workload.config=vm-passthrough
+   ```
+
 #. Add and update the NVIDIA Helm repository:
 
    .. code-block:: console
 
       $ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
          && helm repo update
 
-#. Specify at least the following options when you install the chart.
+#. Specify at least the following options when you install the chart:
 
    .. code-block:: console
 
       $ helm install --wait --generate-name \
          -n kube-system \
-         nvidia/kata-deploy
+         nvidia/kata-deploy \
+         --set kataDeploy.createRuntimeClasses=true
 
 #. Optional: Verify the installation.
 
    - Confirm the ``kata-deploy`` containers are running:
 
      .. code-block:: console
 
-        $ kubectl get pods -n kube-system -l FIXME
+        $ kubectl get pods -n kube-system -l name=kata-deploy
 
    - Confirm the runtime class is installed:
 
@@ -336,7 +310,8 @@ Perform the following steps to install the Helm chart:
 
      .. code-block:: output
 
-        FIXME
+        NAME                   HANDLER                AGE
+        kata-qemu-nvidia-gpu   kata-qemu-nvidia-gpu   23s
 
 *******************************
 Install the NVIDIA GPU Operator
@@ -363,7 +338,8 @@ Perform the following steps to install the Operator for use with Kata Containers
          -n gpu-operator --create-namespace \
          nvidia/gpu-operator \
          --set sandboxWorkloads.enabled=true \
-         --set kataManager.enabled=true
+         --set kataManager.enabled=true \
+         --set kataManager.config.runtimeClasses=null
 
    *Example Output*
 
@@ -400,7 +376,7 @@ Verification
       nvidia-sandbox-validator-9wjm4                               1/1     Running     0          2m37s
       nvidia-vfio-manager-vg4wp                                    1/1     Running     0          3m36s
 
-#. Verify that the ``kata-qemu-nvidia-gpu`` and ``kata-qemu-nvidia-gpu-snp`` runtime classes are available:
+#. Verify that the ``kata-qemu-nvidia-gpu`` runtime classes is available:
 
    .. code-block:: console
 
@@ -409,53 +385,27 @@ Verification
    *Example Output*
 
    .. code-block:: output
-      :emphasize-lines: 6, 7
 
       NAME                       HANDLER                    AGE
-      kata                       kata                       37m
-      kata-clh                   kata-clh                   37m
-      kata-clh-tdx               kata-clh-tdx               37m
-      kata-qemu                  kata-qemu                  37m
       kata-qemu-nvidia-gpu       kata-qemu-nvidia-gpu       96s
-      kata-qemu-nvidia-gpu-snp   kata-qemu-nvidia-gpu-snp   96s
-      kata-qemu-sev              kata-qemu-sev              37m
-      kata-qemu-snp              kata-qemu-snp              37m
-      kata-qemu-tdx              kata-qemu-tdx              37m
       nvidia                     nvidia                     97s
 
 
-#. Optional: If you have host access to the worker node, you can perform the following steps:
-
-   #. Confirm that the host uses the ``vfio-pci`` device driver for GPUs:
-
-      .. code-block:: console
-
-         $ lspci -nnk -d 10de:
-
-      *Example Output*
-
-      .. code-block:: output
-         :emphasize-lines: 3
+#. Optional: If you have host access to the worker node, confirm that the host uses the ``vfio-pci`` device driver for GPUs:
 
-         65:00.0 3D controller [0302]: NVIDIA Corporation GA102GL [A10] [10de:2236] (rev a1)
-                 Subsystem: NVIDIA Corporation GA102GL [A10] [10de:1482]
-                 Kernel driver in use: vfio-pci
-                 Kernel modules: nvidiafb, nouveau
-
-   #. Confirm that NVIDIA Kata Manager installed the ``kata-qemu-nvidia-gpu`` runtime class files:
-
-      .. code-block:: console
+   .. code-block:: console
 
-         $ ls -1 /opt/nvidia-gpu-operator/artifacts/runtimeclasses/kata-qemu-nvidia-gpu/
+      $ lspci -nnk -d 10de:
 
-      *Example Output*
+   *Example Output*
 
-      .. code-block:: output
+   .. code-block:: output
+      :emphasize-lines: 3
 
-         configuration-nvidia-gpu-qemu.toml
-         kata-ubuntu-jammy-nvidia-gpu.initrd
-         vmlinuz-5.xx.x-xxx-nvidia-gpu
-         ...
+      65:00.0 3D controller [0302]: NVIDIA Corporation GA102GL [A10] [10de:2236] (rev a1)
+              Subsystem: NVIDIA Corporation GA102GL [A10] [10de:1482]
+              Kernel driver in use: vfio-pci
+              Kernel modules: nvidiafb, nouveau
 
 
 *********************