Clarify what Kata mgr does
Signed-off-by: Mike McKiernan <[email protected]>
mikemckiernan committed Sep 10, 2024
1 parent 2c5c3ac commit ae396a3
Showing 1 changed file with 40 additions and 90 deletions.
130 changes: 40 additions & 90 deletions gpu-operator/gpu-operator-kata.rst
@@ -69,56 +69,19 @@ The following diagram shows the software components that Kubernetes uses to run

NVIDIA supports Kata Containers by using Helm to run a daemon set that installs the Kata runtime and QEMU.

The daemon set runs the `kata-deploy.sh` script and configures each worker node with a runtime class, ``kata-qemu-nvidia-gpu``,
and configures containerd for the runtime class.
The daemon set runs the ``kata-deploy.sh`` script and configures each worker node with a runtime class, ``kata-qemu-nvidia-gpu``.

About NVIDIA Kata Manager
=========================

When you configure the GPU Operator for Kata Containers, the Operator
deploys NVIDIA Kata Manager as an operand.

The manager downloads an NVIDIA optimized Linux kernel image and initial RAM disk that
provides the lightweight operating system for the virtual machines that run in QEMU.
These artifacts are downloaded from the NVIDIA container registry, nvcr.io, on each worker node.

.. comment
NVIDIA Kata Manager Configuration
=================================
The following part of the cluster policy shows the fields related to the manager:
.. code-block:: yaml
kataManager:
enabled: true
config:
artifactsDir: /opt/nvidia-gpu-operator/artifacts/runtimeclasses
runtimeClasses:
- artifacts:
pullSecret: ""
url: nvcr.io/nvidia/cloud-native/kata-gpu-artifacts:ubuntu22.04-525
name: kata-qemu-nvidia-gpu
nodeSelector: {}
- artifacts:
pullSecret: ""
url: nvcr.io/nvidia/cloud-native/kata-gpu-artifacts:ubuntu22.04-535-snp
name: kata-qemu-nvidia-gpu-snp
nodeSelector: {}
repository: nvcr.io/nvidia/cloud-native
image: k8s-kata-manager
version: v0.1.0
imagePullPolicy: IfNotPresent
imagePullSecrets: []
env: []
resources: {}
The ``kata-qemu-nvidia-gpu`` runtime class is used with Kata Containers.
The ``kata-qemu-nvidia-gpu-snp`` runtime class is used with Confidential Containers
and is installed by default even though it is not used with this configuration.
The manager performs the following actions on each node that is labeled to run Kata Containers:

- Configure containerd with the ``kata-qemu-nvidia-gpu`` runtime class.
- Create a CDI specification, ``/var/run/cdi/nvidia.com-pgpu.yaml``, for each GPU on the node.
- Load the vhost-vsock and vhost-net Linux kernel modules.
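
The node-level results of the actions listed above could be spot-checked with a short script such as the following. This is a minimal sketch, not part of the Operator: the function names ``module_loaded`` and ``check_kata_node`` are hypothetical, and the module names are assumed to appear in ``/proc/modules`` as ``vhost_vsock`` and ``vhost_net``. The CDI specification path is the one named in the list.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; they are not shipped with the Operator.

# Return 0 if the named kernel module is listed in the modules file.
# Hyphens are converted to underscores because /proc/modules uses underscores.
module_loaded() {
    name="$(printf '%s' "$1" | sed 's/-/_/g')"
    grep -q "^${name} " "${2:-/proc/modules}" 2>/dev/null
}

# Print one status line for the CDI specification and for each kernel module.
# Both arguments are optional and default to the paths used on a real node.
check_kata_node() {
    cdi_spec="${1:-/var/run/cdi/nvidia.com-pgpu.yaml}"
    modules_file="${2:-/proc/modules}"
    if [ -f "$cdi_spec" ]; then
        echo "cdi-spec: present"
    else
        echo "cdi-spec: missing"
    fi
    # Assumed module names; adjust if your kernel lists them differently.
    for mod in vhost-vsock vhost-net; do
        if module_loaded "$mod" "$modules_file"; then
            echo "$mod: loaded"
        else
            echo "$mod: not loaded"
        fi
    done
}

check_kata_node
```

Run the script on a worker node that is labeled for Kata Containers. A ``missing`` or ``not loaded`` line suggests that the manager has not finished configuring that node.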

*********************************
Benefits of Using Kata Containers
@@ -135,6 +98,7 @@ The primary benefits of Kata Containers are as follows:

* Transparent deployment of unmodified containers.


****************************
Limitations and Restrictions
****************************
@@ -148,8 +112,8 @@ Limitations and Restrictions
* Support for Kata Containers is limited to the implementation described on this page.
The Operator does not support Red Hat OpenShift sandbox containers.

* Uninstalling the GPU Operator or the NVIDIA Kata Manager does not remove the files
that the manager downloads and installs in the ``/opt/nvidia-gpu-operator/artifacts/runtimeclasses/kata-qemu-nvidia-gpu/``
* Uninstalling the GPU Operator or the NVIDIA Kata Manager does not remove the
``/opt/nvidia-gpu-operator/artifacts/runtimeclasses/``
directory on the worker nodes.

* NVIDIA supports the Operator and Kata Containers with the containerd runtime only.
@@ -196,7 +160,7 @@ Prerequisites

* Your hosts are configured to support IOMMU.

If the output from running ``ls /sys/kernel/iommu_groups`` includes a value greater than ``0``,
If the output from running ``ls -1 /sys/kernel/iommu_groups | wc -l`` includes a value greater than ``0``,
then your host is configured for IOMMU.

If a host is not configured or you are unsure, add the ``intel_iommu=on`` Linux kernel command-line argument.
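
The IOMMU check described above can be wrapped in a small script. This is a minimal sketch: the function name ``count_iommu_groups`` and its optional directory argument are assumptions made for illustration, and the default path is the one used in the check above.

```shell
#!/bin/sh
# Count the IOMMU groups that the kernel exposes under a sysfs directory.
# The optional argument lets you point the function at another directory.
count_iommu_groups() {
    dir="${1:-/sys/kernel/iommu_groups}"
    if [ -d "$dir" ]; then
        # tr strips the padding that some wc implementations add.
        ls -1 "$dir" | wc -l | tr -d ' '
    else
        echo 0
    fi
}

n_groups="$(count_iommu_groups)"
if [ "$n_groups" -gt 0 ]; then
    echo "IOMMU appears to be enabled: $n_groups groups found"
else
    echo "IOMMU does not appear to be enabled"
fi
```

A count of ``0`` means the kernel exposes no IOMMU groups, which is the case in which the kernel command-line argument is needed.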
@@ -260,7 +224,7 @@ The following table shows the configurable values from the Kata Deploy Helm char

* - ``kataDeploy.createRuntimeClasses``
- When set to ``true``, the ``kata-deploy.sh`` script installs the runtime classes on the nodes.
- ``true``
- ``false``

* - ``kataDeploy.createDefaultRuntimeClass``
- When set to ``true``, the ``kata-deploy.sh`` script sets the runtime class specified in the ``defaultShim`` field as the default Kata runtime class.
@@ -273,14 +237,17 @@ The following table shows the configurable values from the Kata Deploy Helm char
* - ``kataDeploy.defaultShim``
- Specifies the shim to set as the default Kata runtime class.
This field is ignored unless you specify ``createDefaultRuntimeClass: true``.
- ``qemu-nvidia-gpu``
- None

* - ``kataDeploy.imagePullPolicy``
- Specifies the image pull policy for the ``kata-deploy`` container.
- ``Always``

* - ``kataDeploy.k8sDistribution``
- FIXME
- Specifies the Kubernetes platform.
The Helm chart uses the value to set the platform-specific location of the containerd configuration file.

Supported values are ``k8s``, ``k3s``, ``rke2``, and ``k0s``.
- ``k8s``

* - ``kataDeploy.repository``
@@ -303,28 +270,35 @@ Install the Kata Deploy Helm Chart

Perform the following steps to install the Helm chart:

#. Label the nodes to run virtual machines in containers. Label only the nodes that you want to run with Kata Containers:

```console
$ kubectl label node <node-name> nvidia.com/gpu.workload.config=vm-passthrough
```

#. Add and update the NVIDIA Helm repository:

.. code-block:: console
$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
&& helm repo update
#. Specify at least the following options when you install the chart.
#. Specify at least the following options when you install the chart:

.. code-block:: console
$ helm install --wait --generate-name \
-n kube-system \
nvidia/kata-deploy
nvidia/kata-deploy \
--set kataDeploy.createRuntimeClasses=true
#. Optional: Verify the installation.

- Confirm the ``kata-deploy`` containers are running:

.. code-block:: console
$ kubectl get pods -n kube-system -l FIXME
$ kubectl get pods -n kube-system -l name=kata-deploy
- Confirm the runtime class is installed:

@@ -336,7 +310,8 @@ Perform the following steps to install the Helm chart:

.. code-block:: output
FIXME
NAME HANDLER AGE
kata-qemu-nvidia-gpu kata-qemu-nvidia-gpu 23s
*******************************
Install the NVIDIA GPU Operator
@@ -363,7 +338,8 @@ Perform the following steps to install the Operator for use with Kata Containers
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--set sandboxWorkloads.enabled=true \
--set kataManager.enabled=true
--set kataManager.enabled=true \
--set kataManager.config.runtimeClasses=null
*Example Output*

@@ -400,7 +376,7 @@ Verification
nvidia-sandbox-validator-9wjm4 1/1 Running 0 2m37s
nvidia-vfio-manager-vg4wp 1/1 Running 0 3m36s
#. Verify that the ``kata-qemu-nvidia-gpu`` and ``kata-qemu-nvidia-gpu-snp`` runtime classes are available:
#. Verify that the ``kata-qemu-nvidia-gpu`` runtime class is available:

.. code-block:: console
@@ -409,53 +385,27 @@ Verification
*Example Output*

.. code-block:: output
:emphasize-lines: 6, 7
NAME HANDLER AGE
kata kata 37m
kata-clh kata-clh 37m
kata-clh-tdx kata-clh-tdx 37m
kata-qemu kata-qemu 37m
kata-qemu-nvidia-gpu kata-qemu-nvidia-gpu 96s
kata-qemu-nvidia-gpu-snp kata-qemu-nvidia-gpu-snp 96s
kata-qemu-sev kata-qemu-sev 37m
kata-qemu-snp kata-qemu-snp 37m
kata-qemu-tdx kata-qemu-tdx 37m
nvidia nvidia 97s
#. Optional: If you have host access to the worker node, you can perform the following steps:

#. Confirm that the host uses the ``vfio-pci`` device driver for GPUs:

.. code-block:: console
$ lspci -nnk -d 10de:
*Example Output*

.. code-block:: output
:emphasize-lines: 3
#. Optional: If you have host access to the worker node, confirm that the host uses the ``vfio-pci`` device driver for GPUs:

65:00.0 3D controller [0302]: NVIDIA Corporation GA102GL [A10] [10de:2236] (rev a1)
Subsystem: NVIDIA Corporation GA102GL [A10] [10de:1482]
Kernel driver in use: vfio-pci
Kernel modules: nvidiafb, nouveau
#. Confirm that NVIDIA Kata Manager installed the ``kata-qemu-nvidia-gpu`` runtime class files:

.. code-block:: console
.. code-block:: console
$ ls -1 /opt/nvidia-gpu-operator/artifacts/runtimeclasses/kata-qemu-nvidia-gpu/
$ lspci -nnk -d 10de:
*Example Output*
*Example Output*

.. code-block:: output
.. code-block:: output
:emphasize-lines: 3
configuration-nvidia-gpu-qemu.toml
kata-ubuntu-jammy-nvidia-gpu.initrd
vmlinuz-5.xx.x-xxx-nvidia-gpu
...
65:00.0 3D controller [0302]: NVIDIA Corporation GA102GL [A10] [10de:2236] (rev a1)
Subsystem: NVIDIA Corporation GA102GL [A10] [10de:1482]
Kernel driver in use: vfio-pci
Kernel modules: nvidiafb, nouveau
*********************