v4.1.5 UCX_NET_DEVICES not selecting TCP devices correctly #12785
Comments
@bertiethorpe I can't reproduce the described behavior with ompi and ucx built from sources (see below); what am I missing?
@bertiethorpe can you please increase the verbosity of OpenMPI, by adding …
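(The exact flag being suggested is cut off above. Purely as an illustration, and not necessarily what was asked for, verbosity for the UCX PML and for UCX itself can usually be raised like this:)
# illustrative only: raise Open MPI's UCX PML verbosity and UCX's own log level
mpirun --mca pml ucx --mca pml_ucx_verbose 100 -x UCX_LOG_LEVEL=debug <application>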
@yosefe, @evgeny-leksikov, @jsquyres and @janjust: I am not sure if the following helps, but here it is. I built the latest and an older version of OpenMPI (OMPI) (5.0.5 and 5.0.2) to reproduce a non-working and a working setup. Something does not behave as expected with the newer versions of OMPI; however, as shown below, I have a build of 5.0.2 that works. I did not test any further versions. Each benchmark heading below lists only the mpirun options that were varied; a sketch of the full command form appears after the results.
-mca pml ucx --mca btl ^vader,tcp,openib,uct -x UCX_NET_DEVICES=mlx5_0:1
# OSU MPI All-to-All Personalized Exchange Latency Test v7.1
# Datatype: MPI_CHAR.
# Size Avg Latency(us)
1 4.20
2 5.09
4 3.68
8 4.03
16 4.24
32 5.48
64 7.37
128 7.10
256 7.54
512 10.86
1024 13.46
2048 16.65
4096 26.47
8192 46.12
16384 80.96
32768 152.56
65536 310.15
131072 636.13
262144 1312.73
524288 2727.59
1048576 5604.98
-x UCX_NET_DEVICES=lo --mca coll ^hcoll --mca btl ^vader,self,tcp,openib,uct
# OSU MPI All-to-All Personalized Exchange Latency Test v7.1
# Datatype: MPI_CHAR.
# Size Avg Latency(us)
1 92.76
2 94.03
4 71.23
8 71.11
16 75.24
32 71.16
64 80.31
128 89.17
256 140.97
512 109.10
1024 125.49
2048 176.34
4096 256.48
8192 393.21
16384 777.16
32768 1532.98
65536 3991.32
131072 7831.29
262144 15324.53
524288 30227.76
1048576 60535.43
--mca coll ^hcoll
# OSU MPI All-to-All Personalized Exchange Latency Test v7.1
# Datatype: MPI_CHAR.
# Size Avg Latency(us)
1 4.51
2 4.42
4 9.16
8 9.14
16 9.48
32 10.74
64 10.11
128 10.99
256 12.66
512 8.53
1024 9.82
2048 17.10
4096 25.33
8192 43.60
16384 77.65
32768 146.68
65536 290.46
131072 620.92
262144 1299.81
524288 2719.89
1048576 5549.75
-x UCX_NET_DEVICES=mlx5_0:1 --mca routed direct --mca coll ^hcoll
# OSU MPI All-to-All Personalized Exchange Latency Test v7.1
# Datatype: MPI_CHAR.
# Size Avg Latency(us)
1 4.54
2 4.48
4 9.20
8 9.11
16 9.43
32 9.92
64 10.30
128 11.05
256 12.90
512 8.68
1024 9.80
2048 14.37
4096 25.72
8192 44.33
16384 78.81
32768 148.41
65536 293.32
131072 621.30
262144 1301.58
524288 2713.73
1048576 5542.62
-x UCX_NET_DEVICES=all --mca routed direct --mca coll ^hcoll
# OSU MPI All-to-All Personalized Exchange Latency Test v7.1
# Datatype: MPI_CHAR.
# Size Avg Latency(us)
1 4.48
2 4.49
4 9.23
8 9.31
16 9.51
32 9.83
64 10.35
128 11.10
256 12.70
512 8.65
1024 9.78
2048 14.21
4096 25.27
8192 45.32
16384 77.48
32768 146.71
65536 292.03
131072 619.11
262144 1297.05
524288 2728.60
1048576 5541.79
-x UCX_NET_DEVICES=eth0 --mca routed direct --mca coll ^hcoll
# OSU MPI All-to-All Personalized Exchange Latency Test v7.1
# Datatype: MPI_CHAR.
# Size Avg Latency(us)
1 88.47
2 86.85
4 73.91
8 70.82
16 76.02
32 75.62
64 91.96
128 103.55
256 139.06
512 108.88
1024 125.95
2048 176.67
4096 256.12
8192 392.61
16384 776.01
32768 1523.09
65536 3982.52
131072 7862.55
262144 15454.59
524288 30260.65
1048576 60475.30
-x UCX_NET_DEVICES=lo --mca routed direct --mca coll ^hcoll
# OSU MPI All-to-All Personalized Exchange Latency Test v7.1
# Datatype: MPI_CHAR.
# Size Avg Latency(us)
1 92.04
2 91.74
4 71.86
8 71.72
16 76.30
32 78.89
64 92.81
128 107.37
256 141.70
512 109.66
1024 126.11
2048 177.17
4096 255.36
8192 395.78
16384 785.76
32768 1557.99
65536 4035.93
131072 7849.01
262144 15691.52
524288 32492.75
1048576 60601.98
--mca btl_tcp_if_include eth0,lo --mca routed direct --mca coll ^hcoll
# OSU MPI All-to-All Personalized Exchange Latency Test v7.1
# Datatype: MPI_CHAR.
# Size Avg Latency(us)
1 93.87
2 92.88
4 71.79
8 73.78
16 76.53
32 79.56
64 92.14
128 105.27
256 143.71
512 107.98
1024 125.89
2048 177.95
4096 258.96
8192 398.15
16384 804.27
32768 1524.64
65536 3975.23
131072 7806.18
262144 15415.16
524288 30361.39
1048576 64038.22
--mca btl tcp,self,vader --mca pml ^ucx --mca coll ^hcoll
# OSU MPI All-to-All Personalized Exchange Latency Test v7.1
# Datatype: MPI_CHAR.
# Size Avg Latency(us)
1 99.36
2 94.99
4 72.44
8 74.38
16 79.04
32 80.30
64 92.98
128 105.93
256 143.32
512 111.50
1024 128.55
2048 178.13
4096 259.11
8192 394.63
16384 789.31
32768 1541.04
65536 4032.19
131072 7845.18
262144 15384.49
524288 30367.15
1048576 60630.55
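(For reference, each heading above lists only the options that were varied; a full invocation would look roughly like the following. Host names, rank count and the benchmark path are placeholders, not taken from this report:)
mpirun -np 2 --host node1,node2 \
  -mca pml ucx --mca btl ^vader,tcp,openib,uct \
  -x UCX_NET_DEVICES=mlx5_0:1 \
  ./osu_alltoall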
All this looks good to me. Allowing UCX to pick the communication device gives you IB (and a latency of about 4 us for the all-to-all), while enforcing a specific device (mostly TCP in these examples) works but gives a much higher latency. What exactly is the question we are trying to answer here?
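(One way to confirm which device and transport UCX actually selects for a run, sketched here rather than taken from this thread, is to list the devices UCX can see and raise its log level:)
# show the transports and devices UCX detects on a node
ucx_info -d
# log UCX's transport selection while running the benchmark
mpirun -np 2 --mca pml ucx -x UCX_NET_DEVICES=eth0 -x UCX_LOG_LEVEL=info ./osu_alltoall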
Details of the problem
Setting UCX_NET_DEVICES to target only TCP devices when RoCE is available seems to be ignored in favour of some fallback.
I'm running a 2-node IMB-MPI PingPong to benchmark RoCE against regular TCP Ethernet.
Setting UCX_NET_DEVICES=all or mlx5_0:1 gives the optimal performance and uses RDMA as expected. Setting UCX_NET_DEVICES=eth0, eth1, or anything else still appears to use RoCE, at only a slightly longer latency.
HW information from the ibstat or ibv_devinfo -vv commands:
How OMPI is configured, from ompi_info | grep Configure:
Following the advice from here, this is apparently due to a higher priority of OpenMPI's btl/openib component, but I don't think that can be the case, since the build uses --without-verbs and openib does not appear when searching ompi_info | grep btl.
As suggested in the UCX issue, adding -mca pml_ucx_tls any -mca pml_ucx_devices any to my mpirun has fixed this problem, but I was wondering what in the MCA precisely causes this behaviour.
Here's my batch script:
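(The batch script itself is not shown above. Purely as an illustration, with node names, rank count and benchmark path as assumptions rather than the reporter's actual script, the workaround would be applied roughly like this:)
mpirun -np 2 --host node1,node2 \
  --mca pml ucx \
  -mca pml_ucx_tls any -mca pml_ucx_devices any \
  -x UCX_NET_DEVICES=eth0 \
  ./IMB-MPI1 PingPong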