
Performance Tests for Multinode NGC.Ready Certification

Prerequisites

  • numactl
  • jq
  • iperf3 version >= 3.5
  • sysstat
  • perftest version >= 4.5-0.11 (if running RDMA tests)
  • nvidia-utils (if running RDMA test with CUDA)
  • mlnx-tools (or at least the common_irq_affinity.sh and set_irq_affinity_cpulist.sh scripts from it). This package is available on GitHub (Mellanox/mlnx-tools), and is also installed as part of OFED.

Access requirements

To run these tests, you will need passwordless root access to all the involved servers and DPUs. This can be achieved in several ways:

First, generate a passwordless SSH key (e.g. using ssh-keygen) and copy it to all the entities involved (e.g. using ssh-copy-id).
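
A minimal sketch of that key setup, assuming four machines (all hostnames below are placeholders):

ssh-keygen -t ed25519 -N "" -f ~/.ssh/id_ed25519
for host in client1 server1 client-bf server-bf; do
    ssh-copy-id -i ~/.ssh/id_ed25519.pub "$host"
done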

  • If running as root: no further action is required.
  • If running as a non-root user:
    • Make sure that your non-root user name is the same on the server, the client, and both the DPUs.
    • Make sure that your non-root user is able to use sudo (is in an administrative group, such as sudo or wheel, or is mentioned directly in the sudoers file).
    • Make sure that your non-root user can use sudo without a password. For example, the following sudoers entry will do, assuming your user is a member of a group named sudo: %sudo ALL=(ALL:ALL) NOPASSWD: ALL. Alternatively, for a more granular approach, the following line allows the non-root user to run only the needed binaries without a password:
      %sudo ALL=(ALL) ALL, NOPASSWD: /usr/bin/bash,/usr/sbin/ip,/opt/mellanox/iproute2/sbin/ip,/usr/bin/mlxprivhost,/usr/bin/mst,/usr/bin/systemctl,/usr/sbin/ethtool,/usr/sbin/set_irq_affinity_cpulist.sh,/usr/bin/tee,/usr/bin/numactl,/usr/bin/awk,/usr/bin/taskset,/usr/bin/setpci,/usr/bin/rm -f /tmp/*
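
A quick sanity check of the setup above (username and hostname are placeholders): every machine involved should pass this with no password prompt.

ssh admin@server1 sudo -n true && echo "passwordless sudo OK"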
      

RDMA test

Will automatically detect the device's local NUMA node and run write/read/send bidirectional tests. The pass criterion is 90% of the port link speed.

Usage:

./ngc_rdma_test.sh \
    [<client username>@]<client hostname/ip> <client ib device>[,<client ib device2>,...] \
    [<server username>@]<server hostname/ip> <server ib device>[,<server ib device2>,...] \
    [--use_cuda] \
    [--qp=<num of QPs, default: total 4>] \
    [--all_connection_types | --conn=<list of connection types>] \
    [--tests=<list of ib perftests>] \
    [--duration=<time in seconds, default: 30 per test>] \
    [--bw_message_size_list=<list of message sizes>] \
    [--lat_message_size_list=<list of message sizes>] \
    [--server_cuda=<cuda_device>] \
    [--client_cuda=<cuda_device>] \
    [--unidir] \
    [--sd] \
    [--ipsec <list of DPU clients> <list of PFs associated to list of DPU clients> \
    <list of DPU servers> <list of PFs associated to list of DPU servers>]
  • If running with CUDA:
    • The nvidia-peermem driver should be loaded.
    • Perftest should be built with CUDA support.
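
For example, a hypothetical invocation running unidirectional write and read bandwidth tests on two devices per host, 60 seconds per test (usernames, hostnames, and device names are placeholders):

./ngc_rdma_test.sh admin@client1 mlx5_0,mlx5_1 admin@server1 mlx5_0,mlx5_1 \
    --tests=ib_write_bw,ib_read_bw --unidir --duration=60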

RDMA Wrapper

Automatically detects HCAs on the host(s) and runs ngc_rdma_test.sh for each device.

Usage:

./ngc_rdma_wrapper.sh [<client username>@]<client hostname/ip> [<server username>@]<server hostname/ip> \
    [--with_cuda, default: without cuda] \
    [--cuda_only] \
    [--write] \
    [--read] \
    [--vm] \
    [--aff <file>] \
    [--pairs <file>]

./ngc_internal_lb_rdma_wrapper.sh <hostname/ip> \
    [--with_cuda, default: without cuda] \
    [--cuda_only] \
    [--write] \
    [--read] \
    [--vm] \
    [--aff <file>]
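
For example, a hypothetical internal loopback run on a single host, limited to write tests with CUDA (hostname is a placeholder):

./ngc_internal_lb_rdma_wrapper.sh server1 --with_cuda --write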

TCP test

Will automatically detect the device's local NUMA node, disable the IRQ balancer, increase the MTU to the maximum, and run iperf3 on the closest NUMA nodes. Reports aggregated throughput in Gb/s.

Usage:

./ngc_tcp_test.sh \
    [<client username>@]<client hostname/ip> <client ib device1>[,<client ib device2>,...] \
    [<server username>@]<server hostname/ip> <server ib device1>[,<server ib device2>,...] \
    [--duplex=<"HALF" (default) or "FULL">] \
    [--change_mtu=<"CHANGE" (default) or "DONT_CHANGE">] \
    [--duration=<in seconds, default: 120>] \
    [--ipsec <list of DPU clients> <list of PFs associated to list of DPU clients> \
    <list of DPU servers> <list of PFs associated to list of DPU servers>]
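
For example, a hypothetical full-duplex run between two hosts (usernames, hostnames, and device names are placeholders):

./ngc_tcp_test.sh admin@client1 mlx5_0 admin@server1 mlx5_0 \
    --duplex=FULL --duration=60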

IPsec full offload test

  • This test currently supports single port only.

Will configure IPsec full offload on both the client and server DPUs, and then run a TCP test.

Usage:

./ngc_ipsec_full_offload_tcp_test.sh [<client username>@]<client hostname/ip> <client ib device> \
    [<server username>@]<server hostname/ip> <server ib device> <client bluefield hostname/ip> \
    <server bluefield hostname/ip> [--mtu=<mtu size>] \
    [--duration=<in seconds, default: 120>]
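
For example, a hypothetical invocation following the positional order above (all usernames, hostnames, and device names are placeholders):

./ngc_ipsec_full_offload_tcp_test.sh admin@client1 mlx5_0 admin@server1 mlx5_0 \
    client-bf server-bf --duration=60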

IPsec crypto offload test

  • Relevant for newer HCAs (ConnectX-6 Dx and above).

Will configure IPsec crypto offload on both the client and the server, run a TCP test, and then remove the IPsec configuration.

Usage:

./ngc_ipsec_crypto_offload_tcp_test.sh [<client username>@]<client hostname/ip> <client ib device> \
    [<server username>@]<server hostname/ip> <server ib device> <number of tunnels>
  • The number of tunnels should not exceed the number of IPs configured on the NICs.
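
For example, a hypothetical invocation with four tunnels (usernames, hostnames, and the device name are placeholders; the NICs would need at least four IPs configured):

./ngc_ipsec_crypto_offload_tcp_test.sh admin@client1 mlx5_0 admin@server1 mlx5_0 4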

Download the latest stable version

Besides cloning and checking out the latest stable release, you can also use the following helper script:

curl -Lfs https://raw.githubusercontent.com/Mellanox/ngc_multinode_perf/main/helpers/dl_nmp.sh | bash

And to download the latest 'experimental' (rc) version:

curl -Lfs https://raw.githubusercontent.com/Mellanox/ngc_multinode_perf/main/helpers/dl_nmp.sh | bash -s -- rc

Tuning instructions and HW/FW requirements

  • HCA firmware version: Latest GA
  • MLNX_OFED version: Latest GA
  • Eth switch ports:
    • Set MTU to 9216
    • Enable PFC and ECN using the single "Do ROCE" command
  • IB switch (OpenSM): change the IPoIB MTU to 4K:

    [standalone: master] > en
    [standalone: master] # conf t
    [standalone: master] (config) # ib partition Default mtu 4K force

  • AMD CPUs (EPYC 7002 and 7003 series):
    • BIOS settings:
      • CPU Power Management → Maximum Performance
      • Memory Frequency → Maximum Performance
      • Alg. Performance Boost Disable (ApbDis) → Enabled
      • ApbDis Fixed Socket P-State → P0
      • NUMA Nodes Per Socket → 2
      • L3 cache as NUMA Domain → Enabled
      • x2APIC Mode → Enabled
      • PCIe ACS → Disabled
      • Preferred IO → Disabled
      • Enhanced Preferred IO → Enabled
    • Boot grub settings: iommu=pt numa_balancing=disable processor.max_cstate=0
  • Intel CPUs (Xeon Gold and Platinum):
    • BIOS settings: out of the box
    • Boot grub settings: intel_idle.max_cstate=0 processor.max_cstate=0 intel_pstate=disable
  • NIC PCIe settings: for each NIC PCIe function, change PCI MaxReadReq to 4096B:
    • Run setpci -s $PCI_FUNCTION CAP_EXP+8.w; it returns 4 hex digits ABCD
    • Run setpci -s $PCI_FUNCTION CAP_EXP+8.w=5BCD
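
A minimal sketch of that MaxReadReq change (the PCI address below is a placeholder; the leading hex digit 5 selects a 4096B maximum read request):

PCI_FUNCTION=0000:3b:00.0                          # example address; use your NIC's function
cur=$(setpci -s "$PCI_FUNCTION" CAP_EXP+8.w)       # returns 4 hex digits, e.g. 2936
setpci -s "$PCI_FUNCTION" CAP_EXP+8.w="5${cur:1}"  # keep BCD, set the first digit to 5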
