Merge branch 'cnt-docs-TELCODOCS-1044' into 'master'
TELCODOCS-1044

See merge request nvidia/cloud-native/cnt-docs!327
mikemckiernan committed Oct 3, 2023
2 parents fdc8320 + 2f9547e commit d747d55
Showing 5 changed files with 281 additions and 0 deletions.
279 changes: 279 additions & 0 deletions openshift/gpu-operator-with-precompiled-drivers.rst
@@ -0,0 +1,279 @@
.. Date: Apr192023
.. Author: stesmith
.. headings are # * - =
.. _gpu-operator-with-precompiled-drivers:

###############################################################
Precompiled Drivers for the NVIDIA GPU Operator for RHCOS
###############################################################

.. contents::
   :depth: 2
   :local:
   :backlinks: none

.. note:: Technology Preview features are not supported in production environments and are not functionally complete. Technology Preview features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process. These releases may not have any documentation, and testing is limited.


***********************************
About Precompiled Driver Containers
***********************************

By default, NVIDIA GPU drivers are built on the cluster nodes when you deploy the GPU Operator.
Driver compilation and packaging are performed on every Kubernetes node, which leads to bursts of compute demand, wasted resources, and long provisioning times.
In contrast, using container images with precompiled drivers makes the drivers immediately available on all nodes, resulting in faster provisioning and cost savings in public cloud deployments.

***********************************
Limitations and Restrictions
***********************************

* NVIDIA does not provide precompiled driver images for Red Hat OpenShift. Such images have to be custom built and hosted in a public or private image registry.

* NVIDIA provides limited support for custom driver container images.

* Precompiled driver containers do not support NVIDIA vGPU or GPUDirect Storage (GDS).

* Precompiled driver container images for Red Hat OpenShift Container Platform have been tested with the following versions:

  * 4.12 (RHEL 8.6)
  * 4.13 (RHEL 9.2)

***********************************
Building a Precompiled Driver Image
***********************************

Perform the following steps to build a custom driver image for use with Red Hat OpenShift Container Platform.

.. rubric:: Prerequisites

* You have access to a container registry, such as NVIDIA NGC Private Registry, Red Hat Quay, or the OpenShift internal container registry, and can push container images to the registry.

* You have a valid Red Hat subscription with an activation key.

* You have a Red Hat OpenShift pull secret.

* Your build machine has access to the internet to download operating system packages.

* You know a CUDA version, such as ``12.1.0``, that you want to use.

  One way to find a supported CUDA version for your operating system is to access the NVIDIA GPU Cloud registry at `CUDA | NVIDIA NGC <https://catalog.ngc.nvidia.com/orgs/nvidia/containers/cuda/tags>`_ and view the tags. Use the search field to filter the tags, such as ``base-ubi8`` for RHEL 8 and ``base-ubi9`` for RHEL 9. The filtered results show the CUDA versions, such as ``12.1.0``, ``12.0.1``, ``12.0.0``, and so on. A command-line alternative is sketched after this list.

* You know the GPU driver version, such as ``525.105.17``, that you want to use.
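
As an optional alternative to browsing the tags in a web browser, you can list them from the command line. This is a convenience sketch only; it assumes that ``skopeo`` and ``jq`` are installed on your workstation:

.. code-block:: console

   $ # List CUDA image tags and keep only the RHEL 8 base images.
   $ skopeo list-tags docker://nvcr.io/nvidia/cuda | jq -r '.Tags[]' | grep 'base-ubi8'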

.. rubric:: Procedure

#. Clone the driver container repository:

   .. code-block:: console

      $ git clone https://gitlab.com/nvidia/container-images/driver

#. Change the directory to ``rhel8/precompiled`` under the cloned repository. You can build precompiled driver images for versions 8 and 9 of RHEL from this directory:

   .. code-block:: console

      $ cd driver/rhel8/precompiled

#. Create a Red Hat Customer Portal activation key and note your Red Hat Subscription Management (RHSM) organization ID. These values are used to install packages during the image build. Save them to files, for example, ``$HOME/rhsm_org`` and ``$HOME/rhsm_activationkey``:

   .. code-block:: console

      export RHSM_ORG_FILE=$HOME/rhsm_org
      export RHSM_ACTIVATIONKEY_FILE=$HOME/rhsm_activationkey

#. Download your Red Hat OpenShift pull secret and store it in a file, for example, ``${HOME}/pull-secret.txt``:

   .. code-block:: console

      export PULL_SECRET_FILE=$HOME/pull-secret.txt

#. Set the Red Hat OpenShift version and target architecture of your cluster, for example, ``x86_64``:

   .. code-block:: console

      export OPENSHIFT_VERSION="4.12.13"
      export TARGET_ARCH="x86_64"

#. Determine the Driver Toolkit (DTK) image for your target Red Hat OpenShift version and architecture:

   .. code-block:: console

      export DRIVER_TOOLKIT_IMAGE=$(oc adm release info -a $HOME/pull-secret.txt --image-for=driver-toolkit quay.io/openshift-release-dev/ocp-release:${OPENSHIFT_VERSION}-${TARGET_ARCH})

#. Determine the RHEL and kernel versions of your target OpenShift cluster:

   .. code-block:: console

      export RHEL_VERSION=$(podman run --authfile $HOME/pull-secret.txt --rm -ti ${DRIVER_TOOLKIT_IMAGE} cat /etc/driver-toolkit-release.json | jq -r '.RHEL_VERSION')
      export RHEL_MAJOR=$(echo "${RHEL_VERSION}" | cut -d '.' -f 1)
      export KERNEL_VERSION=$(podman run --authfile $HOME/pull-secret.txt --rm -ti ${DRIVER_TOOLKIT_IMAGE} cat /etc/driver-toolkit-release.json | jq -r '.KERNEL_VERSION')

#. Set environment variables for the CUDA version, the driver version, and the image tag:

   .. code-block:: console

      export CUDA_VERSION=12.1.0
      export CUDA_DIST=ubi${RHEL_MAJOR}
      export DRIVER_EPOCH=1
      export DRIVER_VERSION=525.105.17
      export OS_TAG=rhcos4.12

#. Build and push the image:

   .. code-block:: console

      make image image-push

   Optionally, override the ``IMAGE_REGISTRY``, ``IMAGE_NAME``, and ``CONTAINER_TOOL`` variables. You can also override ``BUILDER_USER`` and ``BUILDER_EMAIL``; otherwise, your Git user name and email address are used. See the Makefile for all available variables; an illustrative invocation with overrides follows this procedure.

.. note:: Do not set the ``DRIVER_TYPE``. The only supported value is currently ``passthrough``, which is set by default.
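
The following invocation illustrates the overrides described in the build step. It is a sketch only; the registry and organization names are placeholders, so substitute a registry that you can push to:

.. code-block:: console

   $ # Override the image destination and the container build tool.
   $ make image image-push \
       IMAGE_REGISTRY=quay.io/example-org \
       IMAGE_NAME=nvidia-gpu-driver \
       CONTAINER_TOOL=podman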

*********************************************
Enabling Precompiled Driver Container Support
*********************************************

.. rubric:: Prerequisites

* You installed the NVIDIA GPU Operator. Refer to :doc:`install-gpu-ocp`.
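
Optionally, you can confirm that the Operator is installed before you continue. The following check assumes the default ``nvidia-gpu-operator`` namespace; look for a ``Succeeded`` value in the ``PHASE`` column of the ``gpu-operator-certified`` entry:

.. code-block:: console

   $ oc get csv -n nvidia-gpu-operator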

---------------------
Using the Web Console
---------------------

#. In the OpenShift Container Platform web console, from the side menu, select **Operators** > **Installed Operators**, and click **NVIDIA GPU Operator**.

#. Select the **ClusterPolicy** tab, then click **Create ClusterPolicy**. The platform assigns the default name *gpu-cluster-policy*.

#. Open the **Driver** section.

#. Check the **usePrecompiled** checkbox.

#. Specify values for **repository**, **version**, and **image**.

   .. image:: graphics/precompiled_driver_config_repository.png
      :width: 600

   .. image:: graphics/precompiled_driver_config_version_and_image.png
      :width: 600

#. Select **Create**.

-------------------
Using the YAML File
-------------------

#. Use the procedure :ref:`create-cluster-policy-web-console` to create a cluster policy. Switch to the YAML view while creating the ``ClusterPolicy`` resource.

#. Add precompiled driver image properties:

   .. code-block:: yaml

      spec:
        driver:
          usePrecompiled: true
          image: <image_name>
          repository: <image_registry>
          version: <driver_version>

#. Provide values for ``image``, ``repository``, and ``version``. For example:

   .. code-block:: yaml

      spec:
        driver:
          usePrecompiled: true
          image: nvidia-gpu-driver
          repository: quay.io/nvidia-gpu-driver-example
          version: 525.105.17

-------------
Using the CLI
-------------

#. Create a template for the ``ClusterPolicy`` resource. Replace the NVIDIA GPU Operator version with the version that is installed in your cluster:

   .. code-block:: console

      $ oc get csv -n nvidia-gpu-operator gpu-operator-certified.v23.6.1 -ojsonpath={.metadata.annotations.alm-examples} | jq '.[0]' > clusterpolicy.json

#. Modify the ``clusterpolicy.json`` file to specify values for ``driver.usePrecompiled``, ``driver.repository``, ``driver.image``, and ``driver.version``. For example:

   .. code-block:: json

      "driver": {
         "usePrecompiled": true,
         "repository": "quay.io/nvidia-gpu-driver-example",
         "image": "nvidia-gpu-driver",
         "version": "525.105.17"
      }

#. Create a ``ClusterPolicy`` resource from the modified ``clusterpolicy.json`` file:

   .. code-block:: console

      $ oc apply -f clusterpolicy.json

   *Example Output*

   .. code-block:: console

      clusterpolicy.nvidia.com/gpu-cluster-policy created

#. Confirm that the driver container pods are running:

   .. code-block:: console

      $ oc get pods -l app=nvidia-driver-daemonset -n nvidia-gpu-operator

   *Example Output*

   .. code-block:: console

      NAME                                                            READY   STATUS    RESTARTS   AGE
      nvidia-driver-daemonset-4.18.0-372.51.1.el8_6-rhcos4.12-mlpd4   1/1     Running   0          44s

   Ensure that the pod names include a Linux kernel version number, such as ``4.18.0-372.51.1.el8_6``.
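
As an optional cross-check, you can compare the kernel version in the pod names with the ``KERNEL-VERSION`` column that the following command reports for your GPU worker nodes:

.. code-block:: console

   $ oc get nodes -o wide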

***************************************************
Disabling Support for Precompiled Driver Containers
***************************************************

Perform the following steps to disable support for precompiled driver containers.

#. Disable precompiled driver support by modifying the cluster policy:

   .. code-block:: console

      $ oc patch clusterpolicy/gpu-cluster-policy --type='json' \
          -p='[{"op": "replace", "path": "/spec/driver/usePrecompiled", "value":false},{"op": "remove", "path": "/spec/driver/version"},{"op": "remove", "path": "/spec/driver/image"},{"op": "remove", "path": "/spec/driver/repository"}]'

   *Example Output*

   .. code-block:: console

      clusterpolicy.nvidia.com/gpu-cluster-policy patched

#. Confirm that the conventional driver container pods are running:

   .. code-block:: console

      $ oc get pods -l openshift.driver-toolkit=true -n nvidia-gpu-operator

   *Example Output*

   .. code-block:: console

      NAME                                                  READY   STATUS    RESTARTS   AGE
      nvidia-driver-daemonset-412.86.202303241612-0-f7v4t   2/2     Running   0          4m20s

   Ensure that the pod names do not include a Linux kernel semantic version number.
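
Optionally, you can verify that the patch took effect by reading the field back from the cluster policy; this is a convenience check only. The expected output is ``false``:

.. code-block:: console

   $ oc get clusterpolicy/gpu-cluster-policy -o jsonpath='{.spec.driver.usePrecompiled}'
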
1 change: 1 addition & 0 deletions openshift/index.rst
@@ -17,5 +17,6 @@ NVIDIA GPU Operator on Red Hat OpenShift Container Platform
enable-gpu-monitoring-dashboard.rst
time-slicing-gpus-in-openshift.rst
openshift-virtualization.rst
gpu-operator-with-precompiled-drivers.rst
troubleshooting-gpu-ocp.rst
appendix-ocp.rst
1 change: 1 addition & 0 deletions openshift/install-gpu-ocp.rst
@@ -207,6 +207,7 @@ When you install the **NVIDIA GPU Operator** in the OpenShift Container Platform
As a cluster administrator, you can create a ClusterPolicy using the OpenShift Container Platform CLI or the web console. Also, these steps differ
when using **NVIDIA vGPU**. Please refer to appropriate sections below.

.. _create-cluster-policy-web-console:

Create the cluster policy using the web console
-----------------------------------------------
