Merge branch 'rn-23.9.1' into 'master'
v23.9.1 RNs

See merge request nvidia/cloud-native/cnt-docs!350
mikemckiernan committed Dec 8, 2023
2 parents 235a3b9 + 4dd1c77 commit 4aa1220
Showing 1 changed file with 143 additions and 0 deletions: gpu-operator/release-notes.rst

----

.. _v23.9.1:

23.9.1
======

.. _v23.9.1-new-features:

New Features
------------

* Added support for NVIDIA GH200 Grace Hopper Superchip.
Refer to :ref:`supported-nvidia-gpus-and-systems`.

The following prerequisites are required for using the Operator with GH200:

- Run Ubuntu 22.04 and an NVIDIA Linux kernel, such as one provided with a ``linux-nvidia-<x.x>`` package.
  - Add ``init_on_alloc=0`` and ``memhp_default_state=online_movable`` as Linux kernel boot parameters, as shown in the example after this list.
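
  For example, on Ubuntu 22.04 you can append the parameters to the kernel command line in
  ``/etc/default/grub`` and regenerate the GRUB configuration.
  This is a sketch of one common approach; the exact steps depend on your distribution and boot loader:

  .. code-block:: console

     $ sudo vi /etc/default/grub
     # Append the parameters, for example:
     # GRUB_CMDLINE_LINUX_DEFAULT="init_on_alloc=0 memhp_default_state=online_movable"
     $ sudo update-grub
     $ sudo reboot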

* Added support for configuring the driver container to use the NVIDIA open kernel modules.
  For clusters that use GPUDirect Storage (GDS), CUDA toolkit 12.2.2 and later and the
  NVIDIA GPUDirect Storage kernel driver version v2.17.5 and later are supported only
  with the open kernel modules.

  - Refer to :ref:`gpu-operator-helm-chart-options` for information about setting
    ``useOpenKernelModules`` if you manage the driver containers with the NVIDIA cluster policy
    custom resource definition; a sample Helm command follows this list.
- Refer to :doc:`gpu-driver-configuration` for information about setting ``spec.useOpenKernelModules``
if you manage the driver containers with the technology preview NVIDIA driver custom resource.
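
  As an illustration, a minimal Helm invocation might look like the following sketch.
  The ``driver.useOpenKernelModules`` key path, chart reference, and namespace are assumptions;
  confirm the exact option name in :ref:`gpu-operator-helm-chart-options`.

  .. code-block:: console

     $ helm upgrade --install gpu-operator nvidia/gpu-operator \
         --namespace gpu-operator \
         --set driver.useOpenKernelModules=true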

* Added support for the following software component versions:

- NVIDIA Data Center GPU Driver version 535.129.03
- NVIDIA Driver Manager for Kubernetes v0.6.5
- NVIDIA Kubernetes Device Plugin v0.14.3
- NVIDIA DCGM Exporter 3.3.0-3.2.0
- NVIDIA Data Center GPU Manager (DCGM) v3.3.0-1
- NVIDIA KubeVirt GPU Device Plugin v1.2.4
- NVIDIA GPUDirect Storage (GDS) Driver v2.17.5

.. important::

This version, and newer versions of the NVIDIA GDS kernel driver, require that you use the NVIDIA open kernel modules.

Refer to the :ref:`GPU Operator Component Matrix`
on the platform support page.

.. _v23.9.1-improvements:

Improvements
------------

* The ``must-gather.sh`` script that is used for support is enhanced to collect logs
from NVIDIA vGPU Manager pods.

.. _v23.9.1-fixed-issues:

Fixed Issues
------------

* Previously, the technical preview NVIDIA driver custom resource did not support adding
custom labels, annotations, or tolerations to the pods that run as part of the driver daemon set.
This limitation prevented scheduling the driver daemon set in some environments.
Refer to GitHub `issue #602 <https://github.com/NVIDIA/gpu-operator/issues/602>`_ for more details.

* Previously, when you specified the ``operator.upgradeCRD=true`` argument to the ``helm upgrade``
command, the pre-upgrade hook ran with the ``gpu-operator`` service account.
Now, the Operator is enhanced to run the hook with a new service account, ``gpu-operator-upgrade-crd-hook-sa``.
This fix creates the new service account, a new cluster role, and a new cluster role binding.
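
  For reference, a typical upgrade command that triggers the pre-upgrade hook looks like the
  following sketch; the release name, chart reference, and namespace are illustrative:

  .. code-block:: console

     $ helm upgrade gpu-operator nvidia/gpu-operator \
         --namespace gpu-operator \
         --set operator.upgradeCRD=true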

* Previously, duplicate image pull secrets were added to some daemon sets and caused an error
  like the following when a node was deleted and the controller manager deleted the pods.

  .. code-block:: output

     I1031 00:09:44.553742 1 gc_controller.go:329] "PodGC is force deleting Pod" pod="gpu-operator/nvidia-driver-daemonset-k69f2"
     E1031 00:09:44.556500 1 gc_controller.go:255] failed to create manager for existing fields: failed to convert new object (gpu-operator/nvidia-driver-daemonset-k69f2; /v1, Kind=Pod) to smd typed: .spec.imagePullSecrets: duplicate entries for key [name="ngc-secret"]

* Previously, when you added an NVIDIA driver custom resource with a node selector that conflicted
  with another driver custom resource, the controller failed to set the error condition in the
  custom resource status.
  The issue produced an error message like the following example:

  .. code-block:: output

     {"level":"error","ts":1698702848.8472972,"msg":"NVIDIADriver.nvidia.com \"<conflicting-cr-name>\" is invalid: state: Unsupported value: \"\": supported values: \"ignored\", \"ready\", \"notReady\"","controller":"nvidia-driver-controller","object":{"name":"<conflicting-cr-name>"},"namespace":"","name":"<conflicting-cr-name>","reconcileID":"78d58d7b-cd94-4849-a292-391da9a0b049"}

* Previously, the NVIDIA KubeVirt GPU Device Plugin could encounter a GLIBC mismatch error and
  produce a log message like the following example:

  .. code-block:: output

     nvidia-kubevirt-gpu-device-plugin: /lib64/libc.so.6: version `GLIBC_2.32` not found (required by nvidia-kubevirt-gpu-device-plugin)

  This issue is fixed by including v1.2.4 of the plugin in this release.

* Previously, on some machines and Linux kernel versions, GPU Feature Discovery was unable to determine
  the machine type because the ``/sys/class/dmi/id/product_name`` file did not exist on the host.
  Now, the file is accessed by mounting ``/sys`` instead of the fully qualified path, and
  if the file does not exist, GPU Feature Discovery labels the node with ``nvidia.com/gpu.machine=unknown``.
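
  As a quick check, you can list the nodes that received the fallback label; this example assumes
  ``kubectl`` access to the cluster:

  .. code-block:: console

     $ kubectl get nodes -l nvidia.com/gpu.machine=unknown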

* Previously, enabling GPUDirect RDMA on Red Hat OpenShift Container Platform clusters could
  fail with an error in the nvidia-peermem container.
  The error was related to the ``RHEL_VERSION`` variable being unbound.

.. _v23.9.1-known-limitations:

Known Limitations
------------------

* The ``1g.12gb`` MIG profile does not operate as expected on the NVIDIA GH200 GPU when the MIG configuration is set to ``all-balanced``.
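
  For context, the MIG configuration is typically selected by labeling the node; the node name is a placeholder:

  .. code-block:: console

     $ kubectl label nodes <node-name> nvidia.com/mig.config=all-balanced --overwrite
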
* The GPU Driver container does not run on hosts that have a custom kernel with the SEV-SNP CPU feature
because of the missing ``kernel-headers`` package within the container.
With a custom kernel, NVIDIA recommends pre-installing the NVIDIA drivers on the host if you want to
run traditional container workloads with NVIDIA GPUs.
* If you cordon a node while the GPU driver upgrade process is already in progress,
the Operator uncordons the node and upgrades the driver on the node.
You can determine if an upgrade is in progress by checking the node label
``nvidia.com/gpu-driver-upgrade-state != upgrade-done``.
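
  For example, the following selector lists nodes on which an upgrade has not completed; note that
  it also matches nodes that never had the label set:

  .. code-block:: console

     $ kubectl get nodes -l 'nvidia.com/gpu-driver-upgrade-state!=upgrade-done'
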
* NVIDIA vGPU is incompatible with KubeVirt v0.58.0, v0.58.1, and v0.59.0, as well
  as OpenShift Virtualization 4.12.0 through 4.12.2.
* Using NVIDIA vGPU on bare metal nodes and NVSwitch is not supported.
* When installing the Operator on Amazon EKS and using Kubernetes versions lower than
``1.25``, specify the ``--set psp.enabled=true`` Helm argument because EKS enables
pod security policy (PSP).
If you use Kubernetes version ``1.25`` or higher, do not specify the ``psp.enabled``
argument so that the default value, ``false``, is used.
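
  For example, on EKS with Kubernetes ``1.24`` or lower, an install command would include the flag
  as in this sketch; the release name and namespace are illustrative:

  .. code-block:: console

     $ helm install gpu-operator nvidia/gpu-operator \
         --namespace gpu-operator \
         --set psp.enabled=true
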
* All worker nodes in the Kubernetes cluster must run the same operating system version to use the NVIDIA GPU Driver container.
Alternatively, if you pre-install the NVIDIA GPU Driver on the nodes, then you can run different operating systems.
The technical preview feature that provides :doc:`gpu-driver-configuration` is also an alternative.
* NVIDIA GPUDirect Storage (GDS) is not supported with secure boot enabled systems.
* Driver Toolkit images are broken with Red Hat OpenShift version ``4.11.12``; in this case,
  cluster-level entitlements must be enabled for the driver installation to succeed.
* The NVIDIA GPU Operator can only be used to deploy a single NVIDIA GPU Driver type and version.
The NVIDIA vGPU and Data Center GPU Driver cannot be used within the same cluster.
The technical preview feature that provides :doc:`gpu-driver-configuration` is an alternative.
* The ``nouveau`` driver must be blacklisted when using NVIDIA vGPU.
  Otherwise, the driver fails to initialize the GPU with the error ``Failed to enable MSI-X`` in the system journal logs.
  Additionally, all GPU Operator pods become stuck in the ``Init`` state.
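
  On Ubuntu, one common approach is a ``modprobe`` configuration file followed by rebuilding the
  initramfs; this is a sketch, and the steps vary by distribution:

  .. code-block:: console

     $ cat /etc/modprobe.d/blacklist-nouveau.conf
     blacklist nouveau
     options nouveau modeset=0
     $ sudo update-initramfs -u
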
* When you use RHEL 8 with containerd as the runtime and SELinux is enabled (in either permissive
  or enforcing mode) at the host level, containerd must also be configured for SELinux, for example
  by setting the ``enable_selinux=true`` configuration option.
  Additionally, network-restricted environments are not supported.
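
  For illustration, the option belongs in the CRI plugin section of the containerd configuration;
  this excerpt assumes the default ``/etc/containerd/config.toml`` location:

  .. code-block:: console

     $ cat /etc/containerd/config.toml
     ...
     [plugins."io.containerd.grpc.v1.cri"]
       enable_selinux = true
     ...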


23.9.0
======

