Merge branch 'update-inst-cmd-rdma' into 'master'
Add GDS specific arg to sample commands

See merge request nvidia/cloud-native/cnt-docs!380
mikemckiernan committed Jan 26, 2024
2 parents 55f69a7 + 77a6b76 commit 7e6a0bb
Showing 2 changed files with 35 additions and 25 deletions.
49 changes: 25 additions & 24 deletions gpu-operator/gpu-operator-rdma.rst
@@ -3,6 +3,9 @@
.. headings (h1/h2/h3/h4/h5) are # * = -
.. _net-op: https://docs.nvidia.com/networking/display/cokan10/network+operator
.. |net-op| replace:: *NVIDIA Network Operator Deployment Guide*

.. _operator-rdma:

####################################
@@ -31,6 +34,11 @@ To support GPUDirect RDMA, userspace CUDA APIs and kernel mode drivers are req
new kernel module ``nvidia-peermem`` is included in the standard NVIDIA driver installers (e.g. ``.run``). The
kernel module provides Mellanox Infiniband-based HCAs direct peer-to-peer read and write access to the GPU's memory.

Starting with v23.9.1 of the Operator, the Operator uses GDS driver version 2.17.5 or newer.
GDS driver 2.17.5 and later are supported only with the NVIDIA open kernel driver.
The sample commands for installing the Operator include the ``--set driver.useOpenKernelModules=true``
command-line argument for Helm.
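
As a minimal sketch, assuming the NVIDIA Helm repository is already added and that installing into the ``gpu-operator`` namespace is acceptable, the flag is passed like any other Helm value:

.. code-block:: console

   $ helm install --wait --generate-name \
        -n gpu-operator --create-namespace \
        nvidia/gpu-operator \
        --set driver.useOpenKernelModules=true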

In conjunction with the `Network Operator <https://github.com/Mellanox/network-operator>`_, the GPU Operator can be used to
set up the networking related components such as Mellanox drivers, ``nvidia-peermem`` and Kubernetes device plugins to enable
workloads to take advantage of GPUDirect RDMA and GPUDirect Storage. Refer to the Network Operator `documentation <https://docs.nvidia.com/networking/display/COKAN10>`_
@@ -366,7 +374,8 @@ See :ref:`Support for GPUDirect Storage` on the platform support page.
Prerequisites
===============

Make sure that `MOFED <https://github.com/Mellanox/ofed-docker>`_ drivers are installed through `Network Operator <https://github.com/Mellanox/network-operator>`_.
Make sure that the MLNX_OFED drivers are installed by the NVIDIA Network Operator.
Refer to the |net-op|_.
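
As a quick sanity check, which is not part of the official procedure and assumes that the MOFED driver pods follow the usual ``mofed-<os>-ds`` naming, you can confirm that the driver pods are running before installing the GPU Operator:

.. code-block:: console

   $ kubectl get pods -A | grep -i mofed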


Installation
@@ -376,38 +385,37 @@ The following section is applicable to the following configurations and describe

* Kubernetes on bare metal and on vSphere VMs with GPU passthrough and vGPU.


Starting with v22.9.1, the GPU Operator provides an option to load the ``nvidia-fs`` kernel module during the bootstrap of the NVIDIA driver daemonset.
Please refer to below install commands based on Mellanox OFED (MOFED) drivers are installed through Network-Operator.
Starting with v23.9.1, the GPU Operator deploys a version of GDS that requires using the NVIDIA open kernel driver.


MOFED drivers installed with Network-Operator:
The following sample command applies to clusters that use the Network Operator to install the MLNX_OFED drivers.

.. code-block:: console
$ helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--set driver.rdma.enabled=true
--set driver.rdma.enabled=true \
--set driver.useOpenKernelModules=true \
--set gds.enabled=true
For detailed information about deploying the Network Operator and the GPU Operator for GPUDirect Storage, refer to this `deployment guide <https://docs.nvidia.com/ai-enterprise/deployment-guide-bare-metal/0.1.0/gds-overview.html>`_.


Verification
==============

During the installation, an `initContainer` is used with the driver daemonset to wait on the Mellanox OFED (MOFED) drivers to be ready.
This initContainer checks for Mellanox NICs on the node and ensures that the necessary kernel symbols are exported MOFED kernel drivers.
Once everything is in place, the containers nvidia-peermem-ctr and nvidia-fs-ctr will be instantiated inside the driver daemonset.

During the installation, an init container is used with the driver daemon set to wait on the Mellanox OFED (MLNX_OFED) drivers to be ready.
This init container checks for Mellanox NICs on the node and ensures that the necessary kernel symbols are exported by the MLNX_OFED kernel drivers.
After the verification completes, the nvidia-peermem-ctr and nvidia-fs-ctr containers start inside the driver pods.
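
After you identify the driver pod from the listing below, one way to confirm those two containers specifically is to print the container names of the pod. This is a sketch, not part of the documented procedure, and it reuses the placeholder pod name from the later example:

.. code-block:: console

   $ kubectl get pod -n gpu-operator nvidia-driver-daemonset-xxxx \
        -o jsonpath='{range .spec.containers[*]}{.name}{"\n"}{end}'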


.. code-block:: console
$ kubectl get pod -n gpu-operator
*Example Output*

.. code-block:: output
gpu-operator gpu-feature-discovery-pktzg 1/1 Running 0 11m
gpu-operator gpu-operator-1672257888-node-feature-discovery-master-7ccb7txmc 1/1 Running 0 12m
gpu-operator gpu-operator-1672257888-node-feature-discovery-worker-bqhrl 1/1 Running 0 11m
@@ -422,12 +430,9 @@ Once everything is in place, the containers nvidia-peermem-ctr and nvidia-fs-ctr
gpu-operator nvidia-operator-validator-b8nz2 1/1 Running 0 11m
.. code-block:: console
$ kubectl describe pod -n <Operator Namespace> nvidia-driver-daemonset-xxxx
$ kubectl describe pod -n gpu-operator nvidia-driver-daemonset-xxxx
<snip>
Init Containers:
mofed-validation:
@@ -474,13 +479,9 @@ Lastly, verify that NVIDIA kernel modules have been successfully loaded on the w
drm 491520 6 drm_kms_helper,drm_vram_helper,nvidia,mgag200,ttm
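
As a compact version of that check, and assuming the standard module names loaded by the driver container, you can filter the loaded modules directly on the worker node:

.. code-block:: console

   $ lsmod | grep -E 'nvidia_fs|nvidia_peermem'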
*****************
Further Reading
*****************
*******************
Related Information
*******************

Refer to the following resources for more information:

11 changes: 10 additions & 1 deletion gpu-operator/life-cycle-policy.rst
@@ -90,6 +90,9 @@ The product life cycle and versioning are subject to change in the future.
GPU Operator Component Matrix
*****************************

.. _gds: #gds-open-kernel
.. |gds| replace:: :sup:`1`

The following table shows the operands and default operand versions that correspond to a GPU Operator version.

When post-release testing confirms support for newer versions of operands, these updates are identified as *recommended updates* to a GPU Operator version.
@@ -143,7 +146,7 @@ Refer to :ref:`Upgrading the NVIDIA GPU Operator` for more information.
* - NVIDIA vGPU Device Manager
- v0.2.4

* - NVIDIA GDS Driver
* - NVIDIA GDS Driver |gds|_
- `2.17.5 <https://github.com/NVIDIA/gds-nvidia-fs/releases>`_

* - NVIDIA Kata Manager for Kubernetes
@@ -153,6 +156,12 @@ Refer to :ref:`Upgrading the NVIDIA GPU Operator` for more information.
| Manager for Kubernetes
- v0.1.1

.. _gds-open-kernel:

:sup:`1`
This release of the GDS driver requires that you use the NVIDIA open kernel driver for the GPUs.
Refer to :doc:`gpu-operator-rdma` for more information.

.. note::

- Driver version could be different with NVIDIA vGPU, as it depends on the driver
