v4.1.5 UCX_NET_DEVICES not selecting TCP devices correctly #12785

Open

bertiethorpe opened this issue Aug 30, 2024 · 4 comments

@bertiethorpe

Details of the problem

  • OS version (e.g. Linux distro)
    • Rocky Linux release 9.4 (Blue Onyx)
  • Driver version:
    • rdma-core-2404mlnx51-1.2404066.x86_64
    • MLNX_OFED_LINUX-24.04-0.6.6.0

Setting UCX_NET_DEVICES to target only TCP devices when RoCE is available seems to be ignored in favour of some fallback.

I'm running a 2-node IMB-MPI1 PingPong to benchmark RoCE against regular TCP Ethernet.

Setting UCX_NET_DEVICES=all or mlx5_0:1 gives optimal performance and uses RDMA as expected.
Setting UCX_NET_DEVICES=eth0, eth1, or anything else still appears to use RoCE, with only slightly higher latency.
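
For reference, one way to double-check which transports and devices UCX can see, and which one it actually selects, is sketched below (assuming the ucx_info tool from the same UCX installation is on PATH; the exact log format varies between UCX versions):

# List the transports and devices UCX has detected on this node
ucx_info -d | grep -E 'Transport|Device'

# Re-run the pingpong with UCX's own logging raised to "info"; the log
# should show which transport/device is chosen for each endpoint
mpirun -np 2 -x UCX_NET_DEVICES=eth0 -x UCX_LOG_LEVEL=info \
    IMB-MPI1 pingpong -iter_policy off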

HW information from the ibstat or ibv_devinfo -vv command:

        hca_id: mlx5_0
        transport:                      InfiniBand (0)
        fw_ver:                         20.36.1010
        node_guid:                      fa16:3eff:fe4f:f5e9
        sys_image_guid:                 0c42:a103:0003:5d82
        vendor_id:                      0x02c9
        vendor_part_id:                 4124
        hw_ver:                         0x0
        board_id:                       MT_0000000224
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             1024 (3)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

How ompi is configured, from ompi_info | grep Configure:

 Configured architecture: x86_64-pc-linux-gnu
 Configured by: abuild
 Configured on: Thu Aug  3 14:25:15 UTC 2023
 Configure command line: '--prefix=/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5'
                                             '--disable-static' '--enable-builtin-atomics'
                                             '--with-sge' '--enable-mpi-cxx'
                                             '--with-hwloc=/opt/ohpc/pub/libs/hwloc'
                                             '--with-libfabric=/opt/ohpc/pub/mpi/libfabric/1.18.0'
                                             '--with-ucx=/opt/ohpc/pub/mpi/ucx-ohpc/1.14.0'
                                             '--without-verbs' '--with-tm=/opt/pbs/'

Following the advice from here, this is apparently due to the higher priority of Open MPI's btl/openib component, but I don't think that can be the case here, since Open MPI was built with --without-verbs and openib does not show up in ompi_info | grep btl.

As suggested in the UCX issue, adding -mca pml_ucx_tls any -mca pml_ucx_devices any to my mpirun command has fixed the problem, but I was wondering what exactly in the MCA causes this behaviour.
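
For anyone hitting the same thing, the defaults of those two MCA parameters can be inspected with ompi_info, and they can also be set through the environment rather than on the command line (a minimal sketch; the exact default strings depend on how Open MPI was built):

# Show the pml/ucx MCA parameters, including pml_ucx_tls and
# pml_ucx_devices, together with their current values
ompi_info --param pml ucx --level 9 | grep -E 'pml_ucx_(tls|devices)'

# Equivalent to passing "-mca pml_ucx_tls any -mca pml_ucx_devices any"
export OMPI_MCA_pml_ucx_tls=any
export OMPI_MCA_pml_ucx_devices=any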

Here's my batch script:

#!/usr/bin/env bash

#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=1
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.out
#SBATCH --exclusive
#SBATCH --partition=standard

module load gnu12 openmpi4 imb

export UCX_NET_DEVICES=mlx5_0:1

echo SLURM_JOB_NODELIST: $SLURM_JOB_NODELIST
echo SLURM_JOB_ID: $SLURM_JOB_ID
echo UCX_NET_DEVICES: $UCX_NET_DEVICES

export UCX_LOG_LEVEL=data
mpirun -mca pml_ucx_tls any -mca pml_ucx_devices any IMB-MPI1 pingpong -iter_policy off
@evgeny-leksikov

@bertiethorpe I can't reproduce the described behavior with ompi and ucx built from sources (see below); what am I missing?

  1. I removed libfabric and pbs
  2. I used OSU instead of IMB, but it should not make a difference:
$ <path>/ompi_install/bin/ompi_info | grep Configure
 Configured architecture: x86_64-pc-linux-gnu
           Configured by: evgenylek
           Configured on: Tue Oct  1 17:07:14 UTC 2024
  Configure command line: '--prefix=<path>/ompi_install' '--disable-static' '--enable-builtin-atomics' '--with-sge' '--enable-mpi-cxx' '--without-verbs'

$ ibdev2netdev | grep Up
mlx5_0 port 1 ==> ib0 (Up)
mlx5_2 port 1 ==> ib2 (Up)
mlx5_3 port 1 ==> enp129s0f1np1 (Up)
mlx5_4 port 1 ==> ib3 (Up)

$ mpirun -H host1,host2 -n 2 /osu-micro-benchmarks-5.8/mpi/pt2pt/osu_latency -m 0:128                                 
# OSU MPI Latency Test v5.8                                                                                                                                                                     
# Size          Latency (us)                                                                                                                                                                    
0                       0.89                                                                                                                                                                    
1                       0.89                                                                                                                                                                    
2                       0.89                                                                                                                                                                    
4                       0.89                                                                                                                                                                    
8                       0.88                                                                                                                                                                    
16                      0.89                                                                                                                                                                    
32                      0.91                                                                                                                                                                    
64                      1.03                                                                                                                                                                    
128                     1.07                                                                                                                                                                    
$ mpirun -x UCX_NET_DEVICES=mlx5_0:1 -H host1,host2 -n 2 /osu-micro-benchmarks-5.8/mpi/pt2pt/osu_latency -m 0:128     
# OSU MPI Latency Test v5.8                                                                                                                                                                     
# Size          Latency (us)                                                                                                                                                                    
0                       0.89                                                                                                                                                                    
1                       0.89                                                                                                                                                                    
2                       0.88                                                                                                                                                                    
4                       0.88                                                                                                                                                                    
8                       0.88                                                                                                                                                                    
16                      0.89                                                                                                                                                                    
32                      0.91                                                                                                                                                                    
64                      1.02                                                                                                                                                                    
128                     1.07                                                                                                                                                                    
$ mpirun -x UCX_NET_DEVICES=mlx5_3:1 -H host1,host2 -n 2 /osu-micro-benchmarks-5.8/mpi/pt2pt/osu_latency -m 0:128     
# OSU MPI Latency Test v5.8                                                                                                                                                                     
# Size          Latency (us)                                                                                                                                                                    
0                       1.33                                                                                                                                                                    
1                       1.34                                                                                                                                                                    
2                       1.34                                                                                                                                                                    
4                       1.34                                                                                                                                                                    
8                       1.34                                                                                                                                                                    
16                      1.34                                                                                                                                                                    
32                      1.38                                                                                                                                                                    
64                      1.60                                                                                                                                                                    
128                     1.67                                                                                                                                                                    
$ mpirun -x UCX_NET_DEVICES=enp129s0f1np1 -H host1,host2 -n 2 /osu-micro-benchmarks-5.8/mpi/pt2pt/osu_latency -m 0:128
# OSU MPI Latency Test v5.8                                                                                                                                                                     
# Size          Latency (us)                                                                                                                                                                    
0                      55.89                                                                                                                                                                    
1                      56.11                                                                                                                                                                    
2                      56.15                                                                                                                                                                    
4                      56.29                                                                                                                                                                    
8                      56.09                                                                                                                                                                    
16                     56.12                                                                                                                                                                    
32                     56.14                                                                                                                                                                    
64                     56.62                                                                                                                                                                    
128                    56.86                                                                                                                                                                    
$ mpirun -x UCX_NET_DEVICES=eno1 -H host1,host2 -n 2 /osu-micro-benchmarks-5.8/mpi/pt2pt/osu_latency -m 0:128         
# OSU MPI Latency Test v5.8                                                                                                                                                                     
# Size          Latency (us)                                                                                                                                                                    
0                      60.95                                                                                                                                                                    
1                      61.04                                                                                                                                                                    
2                      61.11                                                                                                                                                                    
4                      61.12                                                                                                                                                                    
8                      61.05                                                                                                                                                                    
16                     61.10                                                                                                                                                                    
32                     61.16                                                                                                                                                                    
64                     61.43                                                                                                                                                                    
128                    61.69                                                                                                                                                                    

@yosefe
Contributor

yosefe commented Oct 6, 2024

@bertiethorpe can you please increase the verbosity of Open MPI by adding -mca pml_ucx_verbose 99 after mpirun (along with -x UCX_NET_DEVICES=eth0), and post the resulting output?
Thanks!
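
For reference, the requested invocation would look roughly like this (adapted from the batch script above; only the two extra options are new):

# Pingpong with pml/ucx verbosity turned up and the TCP device forced
mpirun -mca pml_ucx_verbose 99 -x UCX_NET_DEVICES=eth0 \
    IMB-MPI1 pingpong -iter_policy off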

@abeltre1

abeltre1 commented Jan 12, 2025

@yosefe, @evgeny-leksikov, @jsquyres and @janjust: I am not sure if the following helps, but here it is.

I built the latest version of Open MPI (OMPI) and older versions (5.0.2 and 5.0.5) to reproduce both a non-working and a working setup. It appears that something is not working as expected in the newer versions of OMPI. However, as presented below, I have a build that works with 5.0.2. I did not test any further versions.

  1. ompi_info:
                Open MPI: 5.0.2
  Open MPI repo revision: v5.0.2
   Open MPI release date: Unreleased developer copy
                 MPI API: 3.1.0
            Ident string: 5.0.2
                  Prefix: /usr/local
 Configured architecture: x86_64-pc-linux-gnu
           Configured by: root
           Configured on: Sun Jan 12 04:07:13 UTC 2025
          Configure host: 744947182c1f
  Configure command line: '--prefix=/usr/local' '--with-ucx=/usr/local'
                          '--enable-orterun-prefix-by-default'
                          '--enable-mca-no-build=btl-uct'
                Built by:
                Built on: Sun Jan 12 04:12:58 UTC 2025
              Built host: 744947182c1f
              C bindings: yes
             Fort mpif.h: yes (all)
            Fort use mpi: yes (full: ignore TKR)
       Fort use mpi size: deprecated-ompi-info-value
        Fort use mpi_f08: yes
 Fort mpi_f08 compliance: The mpi_f08 module is available, but due to
                          limitations in the gfortran compiler and/or Open
                          MPI, does not support the following: array
                          subsections, direct passthru (where possible) to
                          underlying Open MPI's C functionality
  Fort mpi_f08 subarrays: no
           Java bindings: no
  Wrapper compiler rpath: runpath
              C compiler: gcc
     C compiler absolute: /bin/gcc
  C compiler family name: GNU
      C compiler version: 8.5.0
            C++ compiler: g++
   C++ compiler absolute: /bin/g++
           Fort compiler: gfortran
       Fort compiler abs: /bin/gfortran
         Fort ignore TKR: yes (!GCC$ ATTRIBUTES NO_ARG_CHECK ::)
   Fort 08 assumed shape: yes
      Fort optional args: yes
          Fort INTERFACE: yes
    Fort ISO_FORTRAN_ENV: yes
       Fort STORAGE_SIZE: yes
      Fort BIND(C) (all): yes
      Fort ISO_C_BINDING: yes
 Fort SUBROUTINE BIND(C): yes
       Fort TYPE,BIND(C): yes
 Fort T,BIND(C,name="a"): yes
            Fort PRIVATE: yes
           Fort ABSTRACT: yes
       Fort ASYNCHRONOUS: yes
          Fort PROCEDURE: yes
         Fort USE...ONLY: yes
           Fort C_FUNLOC: yes
 Fort f08 using wrappers: yes
         Fort MPI_SIZEOF: yes
             C profiling: yes
   Fort mpif.h profiling: yes
  Fort use mpi profiling: yes
   Fort use mpi_f08 prof: yes
          Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes,
                          OMPI progress: no, Event lib: yes)
           Sparse Groups: no
  Internal debug support: no
  MPI interface warnings: yes
     MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
              dl support: yes
   Heterogeneous support: no
       MPI_WTIME support: native
     Symbol vis. support: yes
   Host topology support: yes
            IPv6 support: no
          MPI extensions: affinity, cuda, ftmpi, rocm
 Fault Tolerance support: yes
          FT MPI support: yes
  MPI_MAX_PROCESSOR_NAME: 256
    MPI_MAX_ERROR_STRING: 256
     MPI_MAX_OBJECT_NAME: 64
        MPI_MAX_INFO_KEY: 36
        MPI_MAX_INFO_VAL: 256
       MPI_MAX_PORT_NAME: 1024
  MPI_MAX_DATAREP_STRING: 128
         MCA accelerator: null (MCA v2.1.0, API v1.0.0, Component v5.0.2)
           MCA allocator: basic (MCA v2.1.0, API v2.0.0, Component v5.0.2)
           MCA allocator: bucket (MCA v2.1.0, API v2.0.0, Component v5.0.2)
           MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component v5.0.2)
                 MCA btl: self (MCA v2.1.0, API v3.3.0, Component v5.0.2)
                 MCA btl: sm (MCA v2.1.0, API v3.3.0, Component v5.0.2)
                 MCA btl: tcp (MCA v2.1.0, API v3.3.0, Component v5.0.2)
                  MCA dl: dlopen (MCA v2.1.0, API v1.0.0, Component v5.0.2)
                  MCA if: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component
                          v5.0.2)
                  MCA if: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component
                          v5.0.2)
         MCA installdirs: env (MCA v2.1.0, API v2.0.0, Component v5.0.2)
         MCA installdirs: config (MCA v2.1.0, API v2.0.0, Component v5.0.2)
              MCA memory: patcher (MCA v2.1.0, API v2.0.0, Component v5.0.2)
               MCA mpool: hugepage (MCA v2.1.0, API v3.1.0, Component v5.0.2)
             MCA patcher: overwrite (MCA v2.1.0, API v1.0.0, Component
                          v5.0.2)
              MCA rcache: grdma (MCA v2.1.0, API v3.3.0, Component v5.0.2)
           MCA reachable: weighted (MCA v2.1.0, API v2.0.0, Component v5.0.2)
               MCA shmem: mmap (MCA v2.1.0, API v2.0.0, Component v5.0.2)
               MCA shmem: posix (MCA v2.1.0, API v2.0.0, Component v5.0.2)
               MCA shmem: sysv (MCA v2.1.0, API v2.0.0, Component v5.0.2)
                MCA smsc: cma (MCA v2.1.0, API v1.0.0, Component v5.0.2)
             MCA threads: pthreads (MCA v2.1.0, API v1.0.0, Component v5.0.2)
               MCA timer: linux (MCA v2.1.0, API v2.0.0, Component v5.0.2)
                 MCA bml: r2 (MCA v2.1.0, API v2.1.0, Component v5.0.2)
                MCA coll: adapt (MCA v2.1.0, API v2.4.0, Component v5.0.2)
                MCA coll: basic (MCA v2.1.0, API v2.4.0, Component v5.0.2)
                MCA coll: han (MCA v2.1.0, API v2.4.0, Component v5.0.2)
                MCA coll: inter (MCA v2.1.0, API v2.4.0, Component v5.0.2)
                MCA coll: libnbc (MCA v2.1.0, API v2.4.0, Component v5.0.2)
                MCA coll: self (MCA v2.1.0, API v2.4.0, Component v5.0.2)
                MCA coll: sync (MCA v2.1.0, API v2.4.0, Component v5.0.2)
                MCA coll: tuned (MCA v2.1.0, API v2.4.0, Component v5.0.2)
                MCA coll: ftagree (MCA v2.1.0, API v2.4.0, Component v5.0.2)
                MCA coll: hcoll (MCA v2.1.0, API v2.4.0, Component v5.0.2)
                MCA coll: monitoring (MCA v2.1.0, API v2.4.0, Component
                          v5.0.2)
                MCA coll: sm (MCA v2.1.0, API v2.4.0, Component v5.0.2)
                MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v5.0.2)
               MCA fcoll: dynamic (MCA v2.1.0, API v2.0.0, Component v5.0.2)
               MCA fcoll: dynamic_gen2 (MCA v2.1.0, API v2.0.0, Component
                          v5.0.2)
               MCA fcoll: individual (MCA v2.1.0, API v2.0.0, Component
                          v5.0.2)
               MCA fcoll: vulcan (MCA v2.1.0, API v2.0.0, Component v5.0.2)
                  MCA fs: ufs (MCA v2.1.0, API v2.0.0, Component v5.0.2)
                MCA hook: comm_method (MCA v2.1.0, API v1.0.0, Component
                          v5.0.2)
                  MCA io: ompio (MCA v2.1.0, API v2.0.0, Component v5.0.2)
                  MCA io: romio341 (MCA v2.1.0, API v2.0.0, Component v5.0.2)
                  MCA op: avx (MCA v2.1.0, API v1.0.0, Component v5.0.2)
                 MCA osc: sm (MCA v2.1.0, API v3.0.0, Component v5.0.2)
                 MCA osc: monitoring (MCA v2.1.0, API v3.0.0, Component
                          v5.0.2)
                 MCA osc: rdma (MCA v2.1.0, API v3.0.0, Component v5.0.2)
                 MCA osc: ucx (MCA v2.1.0, API v3.0.0, Component v5.0.2)
                MCA part: persist (MCA v2.1.0, API v4.0.0, Component v5.0.2)
                 MCA pml: cm (MCA v2.1.0, API v2.1.0, Component v5.0.2)
                 MCA pml: monitoring (MCA v2.1.0, API v2.1.0, Component
                          v5.0.2)
                 MCA pml: ob1 (MCA v2.1.0, API v2.1.0, Component v5.0.2)
                 MCA pml: ucx (MCA v2.1.0, API v2.1.0, Component v5.0.2)
                 MCA pml: v (MCA v2.1.0, API v2.1.0, Component v5.0.2)
            MCA sharedfp: individual (MCA v2.1.0, API v2.0.0, Component
                          v5.0.2)
            MCA sharedfp: lockedfile (MCA v2.1.0, API v2.0.0, Component
                          v5.0.2)
            MCA sharedfp: sm (MCA v2.1.0, API v2.0.0, Component v5.0.2)
                MCA topo: basic (MCA v2.1.0, API v2.2.0, Component v5.0.2)
                MCA topo: treematch (MCA v2.1.0, API v2.2.0, Component
                          v5.0.2)
           MCA vprotocol: pessimist (MCA v2.1.0, API v2.0.0, Component
                          v5.0.2)
  2. Runs on OMPI 5.0.2:
-mca pml ucx --mca btl ^vader,tcp,openib,uct -x UCX_NET_DEVICES=mlx5_0:1

# OSU MPI All-to-All Personalized Exchange Latency Test v7.1
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
1                       4.20
2                       5.09
4                       3.68
8                       4.03
16                      4.24
32                      5.48
64                      7.37
128                     7.10
256                     7.54
512                    10.86
1024                   13.46
2048                   16.65
4096                   26.47
8192                   46.12
16384                  80.96
32768                 152.56
65536                 310.15
131072                636.13
262144               1312.73
524288               2727.59
1048576              5604.98
 -x UCX_NET_DEVICES=lo --mca coll ^hcoll --mca btl ^vader,self,tcp,openib,uct

# OSU MPI All-to-All Personalized Exchange Latency Test v7.1
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
1                      92.76
2                      94.03
4                      71.23
8                      71.11
16                     75.24
32                     71.16
64                     80.31
128                    89.17
256                   140.97
512                   109.10
1024                  125.49
2048                  176.34
4096                  256.48
8192                  393.21
16384                 777.16
32768                1532.98
65536                3991.32
131072               7831.29
262144              15324.53
524288              30227.76
1048576             60535.43
 --mca coll ^hcoll

# OSU MPI All-to-All Personalized Exchange Latency Test v7.1
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
1                       4.51
2                       4.42
4                       9.16
8                       9.14
16                      9.48
32                     10.74
64                     10.11
128                    10.99
256                    12.66
512                     8.53
1024                    9.82
2048                   17.10
4096                   25.33
8192                   43.60
16384                  77.65
32768                 146.68
65536                 290.46
131072                620.92
262144               1299.81
524288               2719.89
1048576              5549.75
-x UCX_NET_DEVICES=mlx5_0:1  --mca routed direct   --mca coll ^hcoll

# OSU MPI All-to-All Personalized Exchange Latency Test v7.1
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
1                       4.54
2                       4.48
4                       9.20
8                       9.11
16                      9.43
32                      9.92
64                     10.30
128                    11.05
256                    12.90
512                     8.68
1024                    9.80
2048                   14.37
4096                   25.72
8192                   44.33
16384                  78.81
32768                 148.41
65536                 293.32
131072                621.30
262144               1301.58
524288               2713.73
1048576              5542.62
-x UCX_NET_DEVICES=all  --mca routed direct   --mca coll ^hcoll

# OSU MPI All-to-All Personalized Exchange Latency Test v7.1
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
1                       4.48
2                       4.49
4                       9.23
8                       9.31
16                      9.51
32                      9.83
64                     10.35
128                    11.10
256                    12.70
512                     8.65
1024                    9.78
2048                   14.21
4096                   25.27
8192                   45.32
16384                  77.48
32768                 146.71
65536                 292.03
131072                619.11
262144               1297.05
524288               2728.60
1048576              5541.79
-x UCX_NET_DEVICES=eth0  --mca routed direct   --mca coll ^hcoll

# OSU MPI All-to-All Personalized Exchange Latency Test v7.1
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
1                      88.47
2                      86.85
4                      73.91
8                      70.82
16                     76.02
32                     75.62
64                     91.96
128                   103.55
256                   139.06
512                   108.88
1024                  125.95
2048                  176.67
4096                  256.12
8192                  392.61
16384                 776.01
32768                1523.09
65536                3982.52
131072               7862.55
262144              15454.59
524288              30260.65
1048576             60475.30
-x UCX_NET_DEVICES=lo  --mca routed direct   --mca coll ^hcoll

# OSU MPI All-to-All Personalized Exchange Latency Test v7.1
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
1                      92.04
2                      91.74
4                      71.86
8                      71.72
16                     76.30
32                     78.89
64                     92.81
128                   107.37
256                   141.70
512                   109.66
1024                  126.11
2048                  177.17
4096                  255.36
8192                  395.78
16384                 785.76
32768                1557.99
65536                4035.93
131072               7849.01
262144              15691.52
524288              32492.75
1048576             60601.98
--mca btl_tcp_if_include eth0,lo  --mca routed direct --mca coll ^hcoll

# OSU MPI All-to-All Personalized Exchange Latency Test v7.1
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
1                      93.87
2                      92.88
4                      71.79
8                      73.78
16                     76.53
32                     79.56
64                     92.14
128                   105.27
256                   143.71
512                   107.98
1024                  125.89
2048                  177.95
4096                  258.96
8192                  398.15
16384                 804.27
32768                1524.64
65536                3975.23
131072               7806.18
262144              15415.16
524288              30361.39
1048576             64038.22
--mca btl tcp,self,vader --mca pml ^ucx --mca coll ^hcoll

# OSU MPI All-to-All Personalized Exchange Latency Test v7.1
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
1                      99.36
2                      94.99
4                      72.44
8                      74.38
16                     79.04
32                     80.30
64                     92.98
128                   105.93
256                   143.32
512                   111.50
1024                  128.55
2048                  178.13
4096                  259.11
8192                  394.63
16384                 789.31
32768                1541.04
65536                4032.19
131072               7845.18
262144              15384.49
524288              30367.15
1048576             60630.55

@bosilca
Member

bosilca commented Jan 13, 2025

All this looks good to me. Allowing UCX to pick the communication device gives you IB (and a latency of about 4 us for the all-to-all), while enforcing a specific device (mostly TCP in these examples) works but gives much higher latency. What exactly is the question we are trying to answer here?
