
Commit

Add alias steps
Signed-off-by: Mike McKiernan <[email protected]>
mikemckiernan committed Oct 24, 2024
1 parent 8dc1a54 commit 0eb3e6c
Showing 1 changed file with 43 additions and 3 deletions.
46 changes: 43 additions & 3 deletions gpu-operator/gpu-operator-kata.rst
@@ -69,7 +69,11 @@ The following diagram shows the software components that Kubernetes uses to run

NVIDIA supports Kata Containers by using Helm to run a daemon set that installs the Kata runtime and QEMU.

The daemon set runs the ``kata-deploy.sh`` script and configures each worker node with a runtime class, ``kata-qemu-nvidia-gpu``.
The daemon set runs the ``kata-deploy.sh`` script, which performs the following actions on each node that is labeled to run Kata Containers (a verification sketch follows the list):

- Downloads an NVIDIA-optimized Linux kernel image and an initial RAM disk that provide the lightweight operating system for the virtual machines that run in QEMU.
  These artifacts are downloaded to each worker node from the NVIDIA container registry, nvcr.io.
- Configures each worker node with a runtime class, ``kata-qemu-nvidia-gpu``.
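
A quick way to confirm the result on a configured cluster is a check like the following; this is a sketch that assumes the default class name:

.. code-block:: console

   $ kubectl get runtimeclass kata-qemu-nvidia-gpu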

About NVIDIA Kata Manager
=========================
@@ -79,8 +83,8 @@ deploys NVIDIA Kata Manager as an operand.

The manager performs the following actions on each node that is labeled to run Kata Containers (a spot-check sketch follows the list):

- Configure containerd with the ``kata-qemu-nvidia-gpu`` runtime class.
- Create a CDI specification, ``/var/run/cdi/nvidia.com-pgpu.yaml``, for each GPU on the node.
- Configures containerd with the ``kata-qemu-nvidia-gpu`` runtime class.
- Creates a CDI specification, ``/var/run/cdi/nvidia.com-pgpu.yaml``, for each GPU on the node.
- Loads the vhost-sock and vhost-net Linux kernel modules.
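
To spot-check these actions on a node, you can run commands like the following sketch; the containerd configuration path assumes the default, ``/etc/containerd/config.toml``:

.. code-block:: console

   $ grep -A 2 'kata-qemu-nvidia-gpu' /etc/containerd/config.toml
   $ ls /var/run/cdi/
   $ lsmod | grep vhost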

*********************************
@@ -488,6 +492,42 @@ A pod specification for a Kata container requires the following:
$ kubectl delete -f cuda-vectoradd-kata.yaml
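
For reference, a minimal sketch of such a pod specification, assuming the ``kata-qemu-nvidia-gpu`` runtime class and the GPU resource name shown later in this section (the pod name and image tag are illustrative, not prescriptive):

.. code-block:: yaml

   apiVersion: v1
   kind: Pod
   metadata:
     name: cuda-vectoradd-kata
   spec:
     runtimeClassName: kata-qemu-nvidia-gpu
     containers:
       - name: cuda-vectoradd
         # Illustrative sample image; substitute your workload image.
         image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1
         resources:
           limits:
             nvidia.com/GA102GL_A10: 1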
******************************************
Optional: Configuring a GPU Resource Alias
******************************************

By default, GPU resources are exposed on nodes with a name like ``nvidia.com/GA102GL_A10``.
You can configure the NVIDIA Sandbox Device Plugin so that nodes also expose GPUs with an alias like ``nvidia.com/pgpu``.

#. Patch the cluster policy with a command like the following example:

   .. code-block:: console

      $ kubectl patch clusterpolicies.nvidia.com/cluster-policy --type=merge \
          -p '{"spec": {"sandboxDevicePlugin": {"env":[{"name": "P_GPU_ALIAS", "value":"pgpu"}]}}}'

   The sandbox device plugin daemon set pods restart.
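
   To watch the restart, you can list the plugin pods with a command like the following; the ``gpu-operator`` namespace is an assumption based on a default installation:

   .. code-block:: console

      $ kubectl get pods -n gpu-operator | grep sandbox-device-plugin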

#. Optional: Describe a node to confirm the alias:

   .. code-block:: console

      $ kubectl describe node <node-name>

   *Partial Output*

   .. code-block:: output

      ...
      Capacity:
        cpu:                    16
        ephemeral-storage:      1922145660Ki
        hugepages-1Gi:          0
        hugepages-2Mi:          0
        memory:                 65488292Ki
        nvidia.com/GA102GL_A10: 1
        nvidia.com/pgpu:        1
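
After the alias is configured, a workload can request the GPU by either name.
A minimal sketch of a container resource request that uses the alias:

.. code-block:: yaml

   resources:
     limits:
       nvidia.com/pgpu: 1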
Troubleshooting Workloads
=========================