Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Address Registration Error in CUDA Aware MPICH 4.2.2 + UCX 1.17.0 Application #10085

Open
cl3to opened this issue Aug 23, 2024 · 9 comments
Open
Assignees

Comments

@cl3to
Copy link

cl3to commented Aug 23, 2024

I'm running an application on a cluster that uses CUDA Aware MPICH (v4.2.2) and UCX (v1.17.0). My application consists of two binaries, a server and a client, so I use the MPMD mode of mpirun to execute it: mpirun -np 1 server : -np 1 client. The problem is that when I try to run the application, either intra-node or inter-node, I get the following error and the application hangs:

[1724426336.207066] [c066:48733:0]           ib_md.c:293  UCX  ERROR ibv_reg_mr(address=0x55bace2c02a0, length=49792, access=0xf) failed: Bad address
[1724426336.207083] [c066:48733:0]          ucp_mm.c:70   UCX  ERROR failed to register address 0x55bace2c02a0 (host) length 49792 on md[8]=mlx5_1: Input/output error (md supports: host|cuda)

After some research, I found that setting the environment variable UCX_RCACHE_ENABLE=n allows my application to run without errors. However, the application’s runtime performance is not as expected. Profiling the application revealed that most of the time is spent on data transfer between the nodes.

When running the OSU 7.4 benchmark, I observed that the bandwidth between nodes using InfiniBand is approximately 5.75 times slower when I set the variable UCX_RCACHE_ENABLE=n.

export UCX_RCACHE_ENABLE=y
mpirun -ppn 1 -np 2 osu_bw -m 100000000:1000000000 or
mpirun -ppn 1 -np 1 osu_bw -m 100000000:1000000000 : -np 1 osu_bw -m 100000000:1000000000

# OSU MPI Bandwidth Test v7.4
# Datatype: MPI_CHAR.
# Size      Bandwidth (MB/s)
100000000           23073.19
200000000           23037.84
400000000           23001.02
800000000           22948.29
export UCX_RCACHE_ENABLE=n
mpirun -ppn 1 -np 2 osu_bw -m 100000000:1000000000 or
mpirun -ppn 1 -np 1 osu_bw -m 100000000:1000000000 : -np 1 osu_bw -m 100000000:1000000000

# OSU MPI Bandwidth Test v7.4
# Datatype: MPI_CHAR.
# Size      Bandwidth (MB/s)
100000000            4083.05
200000000            4078.84
400000000            4076.13
800000000            4076.14

Any suggestions on why the application might be failing to register addresses?

Setup and versions

OS version:

  • cat /etc/redhat-release: Red Hat Enterprise Linux Server release 7.9 (Maipo)
  • Kernel uname -r: 3.10.0-1160.49.1.el7.x86_64

RDMA/IB version:

  • rpm -q libibverbs: libibverbs-54mlnx1-1.54310.x86_64
  • rpm -q rdma-core: rdma-core-devel-54mlnx1-1.54310.x86_64

IB HW:

  • Each node has 2 IB NIC.
  • ibstat:
CA 'mlx5_0'
        CA type: MT4115
        Number of ports: 1
        Firmware version: 12.27.1016
        Hardware version: 0
        Node GUID: 0x0800380300b49dac
        System image GUID: 0x0800380300b49dac
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 115
                LMC: 0
                SM lid: 1
                Capability mask: 0x2651e848
                Port GUID: 0x0800380300b49dac
                Link layer: InfiniBand
CA 'mlx5_1'
        CA type: MT4115
        Number of ports: 1
        Firmware version: 12.27.1016
        Hardware version: 0
        Node GUID: 0x0800380300b49da0
        System image GUID: 0x0800380300b49da0
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 177
                LMC: 0
                SM lid: 1
                Capability mask: 0x2651e848
                Port GUID: 0x0800380300b49da0
                Link layer: InfiniBand

CUDA 12.0:

  • Each node has four 32GB V100 GPUs

  • cuda libraries: cuda-toolkit-12-0-12.0.0-1.x86_64

  • cuda drivers: cuda-driver-devel-12-0-12.0.107-1.x86_64

  • lsmod |grep nv_peer_mem:

nv_peer_mem            13369  0 
ib_core               358225  11 rdma_cm,ib_cm,iw_cm,beegfs,nv_peer_mem,ko2iblnd,mlx5_ib,ib_umad,ib_uverbs,rdma_ucm,ib_ipoib
nvidia              56056886  55 nv_peer_mem,gdrdrv,nvidia_modeset,nvidia_uvm
  • lsmod|grep gdrdrv:
gdrdrv                 18183  0 
nvidia              56056886  55 nv_peer_mem,gdrdrv,nvidia_modeset,nvidia_uvm

ucx_info -v:

# Library version: 1.17.0
# Library path: /home/jhonatan.cleto/spack/opt/spack/linux-rhel7-cascadelake/gcc-11.4.0/ucx-1.17.0-qq5l5fowibcomrutchar7maekewkiloo/lib/libucs.so.0
# API headers version: 1.17.0
# Git branch '', revision 4ef9a09
# Configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=/home/jhonatan.cleto/spack/opt/spack/linux-rhel7-cascadelake/gcc-11.4.0/ucx-1.17.0-qq5l5fowibcomrutchar7maekewkiloo --without-go --disable-doxygen-doc --disable-assertions --enable-compiler-opt=3 --without-java --enable-shared --enable-static --disable-logging --disable-mt --with-openmp --enable-optimizations --disable-params-check --disable-gtest --with-pic --with-cuda=/home/jhonatan.cleto/spack/opt/spack/linux-rhel7-cascadelake/gcc-11.4.0/cuda-12.4.0-tddfkicmflo4uydz5vvubsl5233hiasi --enable-cma --without-dc --without-dm --with-gdrcopy=/home/jhonatan.cleto/spack/opt/spack/linux-rhel7-cascadelake/gcc-11.4.0/gdrcopy-2.4.1-i7vxfrthjgn7ojewfj5a4pwsspcsg4te --with-ib-hw-tm --with-knem=/home/jhonatan.cleto/spack/opt/spack/linux-rhel7-cascadelake/gcc-11.4.0/knem-1.1.4-bhkutyn7invsbjv3e32yg3k5fiusiah6 --without-mlx5-dv --with-rc --with-ud --with-xpmem=/home/jhonatan.cleto/spack/opt/spack/linux-rhel7-cascadelake/gcc-11.4.0/xpmem-2.6.5-36-oeerzcdtxg5h6qhtv7s2nmmsh5imj4xl --without-fuse3 --without-bfd --with-rdmacm=/home/jhonatan.cleto/spack/opt/spack/linux-rhel7-cascadelake/gcc-11.4.0/rdma-core-52.0-frbk7sgqzmo2vjgu642ryhq26e3dxma7 --with-verbs=/home/jhonatan.cleto/spack/opt/spack/linux-rhel7-cascadelake/gcc-11.4.0/rdma-core-52.0-frbk7sgqzmo2vjgu642ryhq26e3dxma7 --with-avx --without-rocm
@brminich
Copy link
Contributor

is ulimit -l set to unlimited? If yes, can you pls try to use UCX_RCACHE_MAX_REGIONS=2048 instead of UCX_RCACHE_ENABLE=n

@cl3to
Copy link
Author

cl3to commented Aug 26, 2024

Yes, it's enabled.

I tried using the UCX_RCACHE_MAX_REGIONS=2048, but I got the same error.

[jhonatan.cleto@c066 experiments]$ mpirun -hosts c099,c066 -ppn 1 -np 2 bash -c 'echo "$(hostname) $(ulimit -l)"'
c066 unlimited
c099 unlimited
[jhonatan.cleto@c066 experiments]$ unset UCX_RCACHE_MAX_REGIONS 
[jhonatan.cleto@c066 experiments]$ mpirun -hosts c066,c099 -ppn 1 -np 1 server : -np 1 client
[1724681697.899525] [c099:33611:0]           ib_md.c:293  UCX  ERROR ibv_reg_mr(address=0x5651ce0ada60, length=22800, access=0xf) failed: Bad address
[1724681697.899552] [c099:33611:0]          ucp_mm.c:70   UCX  ERROR failed to register address 0x5651ce0ada60 (host) length 22800 on md[4]=mlx5_0: Input/output error (md supports: host|cuda)
Abort(1703183) on node 1 (rank 1 in comm 464): Fatal error in internal_Isend: Other MPI error, error stack:
internal_Isend(59760): MPI_Isend(buf=0x5651ce0ada68, count=22784, MPI_BYTE, 0, 1, comm=0x84000002, request=0x5651d5dba1e0) failed
MPID_Isend(63).......: 
MPIDI_isend(35)......: 
MPIDI_UCX_send(83)...:  returned failed request in UCX netmod(ucx_send.h 83 MPIDI_UCX_send Input/output error)
[jhonatan.cleto@c066 experiments]$ export UCX_RCACHE_MAX_REGIONS=2048
[jhonatan.cleto@c066 experiments]$ mpirun -hosts c066,c099 -ppn 1 -np 1 server : -np 1 client
[1724681755.437054] [c099:36942:0]           ib_md.c:293  UCX  ERROR ibv_reg_mr(address=0x55ecbd573a60, length=22800, access=0xf) failed: Bad address
[1724681755.437077] [c099:36942:0]          ucp_mm.c:70   UCX  ERROR failed to register address 0x55ecbd573a60 (host) length 22800 on md[4]=mlx5_0: Input/output error (md supports: host|cuda)
Abort(874118415) on node 1 (rank 1 in comm 464): Fatal error in internal_Isend: Other MPI error, error stack:
internal_Isend(59760): MPI_Isend(buf=0x55ecbd573a68, count=22784, MPI_BYTE, 0, 1, comm=0x84000002, request=0x55ecc2e53700) failed
MPID_Isend(63).......: 
MPIDI_isend(35)......: 
MPIDI_UCX_send(83)...:  returned failed request in UCX netmod(ucx_send.h 83 MPIDI_UCX_send Input/output error)

@yosefe
Copy link
Contributor

yosefe commented Aug 27, 2024

@cl3to

  1. can you pls check if the buffer that failed to register is a valid GPU buffer at the time of the registration (and it wasn't released)?
  2. how is the application allocating Cuda memory?
  3. Can you pls rebuild UCX with debug logs enabled (using ./contrib/configure-devel script, run the test with UCX_LOG_LEVEL=req, and upload the logs?

@cl3to
Copy link
Author

cl3to commented Aug 31, 2024

@yosefe

To provide more context, I'm testing the implementation of an MPI plugin for LLVM/offload (The first PR of the plugin is available here). The goal is to offload OpenMP tasks to remote devices. The application I'm testing is XSBench, compiled with clang + the MPI plugin. In the test, XSBench acts as the client that offloads tasks to the remote server.

1 and 2, The example I'm running does not use CUDA Aware MPI; all the buffers being sent are in host memory. I'm using MPI_ISend and MPI_Irecv, and I do round-robin in a pool of communicators and tags.

While debugging the application, I noticed that the error occurs on the first MPI_Isend that sends a buffer larger than a few hundred bytes, for example, a 70KB buffer. The buffer is not released during the entire communication.

3, I uploaded the logs obtained from UCX here.

Here are the last lines of the logs:

[1725140201.870751] [c105:27187:0]       eager_rcv.c:27   UCX  REQ   found req 0x556b68803f40
[1725140201.870754] [c105:27187:0]   ucp_request.inl:588  UCX  REQ   req 0x556b68803f40: unpack recv_data req_len 4 data_len 4 offset 0 last
[1725140201.870756] [c105:27187:0]   ucp_request.inl:261  UCX  REQ   completing receive request 0x556b68803f40 (0x556b68804050) ---cr- stag 0x151000000000002 len 4, Success
[1725140201.870759] [c105:27187:0]     ucp_request.c:215  UCX  REQ   release request 0x556b68803f40 (0x556b68804050) d--cr-
[1725140201.870761] [c105:27187:0]   ucp_request.inl:236  UCX  REQ   put request 0x556b68803f40
[1725140201.870807] [c105:27187:0]        tag_send.c:250  UCX  REQ   send_nbx buffer 0x556b688067a8 count 12 tag 1f0000100000000 to c120:43443
[1725140201.870878] [c105:27187:0]        tag_send.c:250  UCX  REQ   send_nbx buffer 0x556b68806898 count 8 tag 1d0000100000001 to c120:43443
[1725140201.870883] [c105:27187:0]        tag_send.c:250  UCX  REQ   send_nbx buffer 0x556b67818ff8 count 70464 tag 1d0000100000001 to c120:43443
[1725140201.870888] [c105:27187:0]        tag_send.c:287  UCX  REQ   allocated request 0x556b68803f40
[1725140201.870907] [c105:27187:0]    proto_common.c:692  UCX  REQ   req 0x556b68803f40: tag_send from host memory length 70464 rendezvous zero-copy read from remote 50% on dc_mlx5/mlx5_0:1 and 50% on dc_mlx5/mlx5_1:
[1725140201.873830] [c105:27187:0]  proto_common.inl:172  UCX  REQ   req 0x556b68803f40: set to stage 0, progress function 'ucp_tag_rndv_rts_progress'
[1725140201.873843] [c105:27187:0]     ucp_request.c:740  UCX  REQ   req 0x556b68803f40: progress tag/rndv {ucp_tag_rndv_rts_progress} ep_cfg[0] rkey_cfg[255] offset 0/70464
[1725140201.873856] [c105:27187:0]          wireup.c:1696 UCX  TRACE   ep 0x7fd861c40000: connect lane 0 to remote peer
[1725140201.873864] [c105:27187:0]          wireup.c:1711 UCX  TRACE   ep 0x7fd861c40000: connect lane 0 to remote peer with wireup ep
[1725140201.873878] [c105:27187:0]       wireup_ep.c:408  UCX  TRACE   ep 0x7fd861c40000: created wireup ep 0x556b688068c0 to c120:43443 
[1725140201.873897] [c105:27187:0]       wireup_ep.c:530  UCX  DEBUG   ep 0x7fd861c40000: wireup_ep 0x556b688068c0 set next_ep 0x556b68855560
[1725140201.873906] [c105:27187:0]          wireup.c:1664 UCX  DEBUG   ep 0x7fd861c40000: send wireup request (flags=0x51)
[1725140201.873910] [c105:27187:0]   ucp_request.inl:282  UCX  REQ     allocated request 0x556b68806bc0 (wireup_msg_req)
[1725140201.873922] [c105:27187:0]         address.c:1442 UCX  TRACE   pack addr[0] : dc_mlx5/mlx5_0:1 sysdev 1 paths 1 eps 0 md_flags 0x2 tl_flags 0x11000000067a bw 0.00+11794.23/nMBs ovh 40ns lat_ovh 815ns dev_priority 30 a32 0xf/0x3f a64 0xf/0x3f
[1725140201.873929] [c105:27187:0]         address.c:1442 UCX  TRACE   pack addr[1] : dc_mlx5/mlx5_1:1 sysdev 2 paths 1 eps 0 md_flags 0x2 tl_flags 0x11000000067a bw 0.00+11794.23/nMBs ovh 40ns lat_ovh 815ns dev_priority 30 a32 0xf/0x3f a64 0xf/0x3f
[1725140201.873939] [c105:27187:0]         uct_mem.c:105  UCX  TRACE   allocating rc_send_desc: host memory length 8585328 flags 0x304
[1725140201.873944] [c105:27187:0]         uct_mem.c:110  UCX  TRACE     trying allocation method thp
[1725140201.873967] [c105:27187:0]         uct_mem.c:301  UCX  TRACE     allocated 10485760 bytes at 0x7fd861200000 using thp
[1725140201.876099] [c105:27187:0]           ib_md.c:508  UCX  TRACE   ibv_reg_mr(pd=0x556b67ccf9d0 addr=0x7fd861200000 len=10485760 fd=-1 offset=0 access=0xf): mr=0x556b688046d0 lkey=0x3acf5e retry=0 took 2.117 ms
[1725140201.876109] [c105:27187:0]           mpool.c:281  UCX  DEBUG   mpool rc_send_desc: allocated chunk 0x7fd861200018 of 10485736 bytes with 1250 elements
[1725140201.876149] [c105:27187:0]   ucp_request.inl:293  UCX  REQ     freed request 0x556b68806bc0
[1725140201.876177] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd868e0e000..0x7fd868e0ffff
[1725140201.876189] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd868e0f000..0x7fd868e0ffff
[1725140201.876192] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd868e0f000..0x7fd868e0ffff
[1725140201.876195] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd868e0f000..0x7fd868e0ffff
[1725140201.876197] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd868e0f000..0x7fd868e0ffff
[1725140201.876199] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd868e0f000..0x7fd868e0ffff
[1725140201.876201] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd866e76000..0x7fd866e77fff
[1725140201.876204] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd8645f0000..0x7fd8645f1fff
[1725140201.876205] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd86456a000..0x7fd86456bfff
[1725140201.876207] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861d6a000..0x7fd861d6afff
[1725140201.876209] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861888000..0x7fd861c5dfff
[1725140201.876211] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5e000..0x7fd861c5ffff
[1725140201.876213] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876217] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876219] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876221] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876223] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876225] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876227] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876229] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876232] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876234] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876235] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876238] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876240] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876242] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876244] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876246] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876248] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876250] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876253] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876255] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876257] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876259] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876261] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876263] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876265] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876267] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876268] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876270] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876272] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876274] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876276] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876280] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876281] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876283] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876287] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876288] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876290] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876292] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876294] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876297] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876300] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876302] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876303] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876307] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876309] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876311] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876313] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876315] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876317] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876320] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876322] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876324] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876326] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876328] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876330] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876332] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876334] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876336] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876338] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876340] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876342] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876344] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876346] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5f000..0x7fd861c5ffff
[1725140201.876348] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c5e000..0x7fd861c5ffff
[1725140201.876350] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd859c40000..0x7fd85bffffff
[1725140201.876352] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd860000000..0x7fd861c3ffff
[1725140201.876355] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x7fd861c3e000..0x7fd861c3ffff
[1725140201.876358] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x556b67818ff0..0x556b6782a33f
[1725140201.876360] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x556b67818ff0..0x556b67818fef
[1725140201.876362] [c105:27187:0]          rcache.c:368  UCX  TRACE   ucp_rcache: find regions in 0x556b6782a340..0x556b6782a33f
[1725140201.876469] [c105:27187:0]           ib_md.c:293  UCX  DEBUG   ibv_reg_mr(address=0x556b67818ff0, length=70480, access=0xf) failed: Bad address
[1725140201.876477] [c105:27187:0]          ucp_mm.c:70   UCX  DIAG    failed to register address 0x556b67818ff0 (host) length 70480 on md[3]=mlx5_0: Input/output error (md supports: host|cuda)
[1725140201.876497] [c105:27187:0]           ib_md.c:293  UCX  DEBUG   ibv_reg_mr(address=0x556b67818ff0, length=70480, access=0xf) failed: Bad address
[1725140201.876502] [c105:27187:0]          ucp_mm.c:70   UCX  DIAG    failed to register address 0x556b67818ff0 (host) length 70480 on md[4]=mlx5_1: Input/output error (md supports: host|cuda)
[1725140201.876507] [c105:27187:0]          ucp_mm.c:558  UCX  TRACE   memh 0x556b68806be0: registered dt_iter address 0x556b67818ff0 length 70480 md_map 0
[1725140201.876515] [c105:27187:0]          rcache.c:1050 UCX  TRACE   ucp_rcache: created region 0x556b68806be0 [0x556b67818ff0..0x556b6782a340] gt rw ref 2 no mds
[1725140201.876520] [c105:27187:0]          ucp_mm.c:861  UCX  TRACE   memh_get_slow: dt_iter address 0x556b67818ff8/0x556b67818ff0 length 70464/70480 host md_map 18 flags 0x368
[1725140201.876541] [c105:27187:0]           ib_md.c:293  UCX  DEBUG   ibv_reg_mr(address=0x556b67818ff0, length=70480, access=0xf) failed: Bad address
[1725140201.876547] [c105:27187:0]          ucp_mm.c:70   UCX  DIAG    failed to register address 0x556b67818ff0 (host) length 70480 on md[3]=mlx5_0: Input/output error (md supports: host|cuda)
[1725140201.876559] [c105:27187:0]           ib_md.c:293  UCX  DEBUG   ibv_reg_mr(address=0x556b67818ff0, length=70480, access=0xf) failed: Bad address
[1725140201.876565] [c105:27187:0]          ucp_mm.c:70   UCX  DIAG    failed to register address 0x556b67818ff0 (host) length 70480 on md[4]=mlx5_1: Input/output error (md supports: host|cuda)
[1725140201.876568] [c105:27187:0]          ucp_mm.c:558  UCX  TRACE   memh 0x556b68806be0: registered dt_iter address 0x556b67818ff0 length 70480 md_map 0
[1725140201.876574] [c105:27187:0]  proto_common.inl:371  UCX  TRACE   dt_iter_md_map=0x0 md_map=0x18
[1725140201.876580] [c105:27187:0]        ucp_rkey.c:149  UCX  TRACE   packing rkey type host md_map 0x0 dev_map 0x6
[1725140201.876586] [c105:27187:0]     ucp_request.c:751  UCX  REQ     req 0x556b68803f40: progress protocol tag/rndv returned: No resources are available to initiate the operation lane 0
[1725140201.876602] [c105:27187:0]  proto_common.inl:271  UCX  REQ   returning send request 0x556b68803f40: tag_send 0/70464 contiguous buffer:0x556b67818ff8
[1725140201.876620] [c105:27187:0]        tag_recv.c:240  UCX  REQ   allocated request 0x556b68803e00
[1725140201.876625] [c105:27187:0]        tag_recv.c:67   UCX  REQ   req 0x556b68803e00: recv_nbx buffer 0x7fff23cbfb60 dt 0x8 count 1 tag 1d0000000000001/ffffffff9fffffff
[1725140201.876628] [c105:27187:0]        tag_recv.c:163  UCX  REQ   recv_nbx returning expected request 0x556b68803e00 (0x556b68803f10)
[1725140201.878689] [c105:27187:0]         address.c:1651 UCX  TRACE unpacking address version 0 dst version 0 flags 0x3
[1725140201.878695] [c105:27187:0]          wireup.c:766  UCX  TRACE ep 0x7fd861c40000: got wireup reply src_ep_id 0x7fcb900451e0 dst_ep_id 0x7fd861c40000 sn 0
[1725140201.878704] [c105:27187:0]      conn_match.c:304  UCX  TRACE match_ctx 0x556b67fe13b0: remove expected conn_match 0x556b68673608 address 39006077544042546 conn_sn 0
[1725140201.878710] [c105:27187:0]        ucp_ep.inl:223  UCX  TRACE ep 0x7fd861c40000: set remote_id to 0x7fcb900451e0
[1725140201.878716] [c105:27187:0]          wireup.c:535  UCX  TRACE ep 0x7fd861c40000: remote connected, ep_cfg[0]
[1725140201.878721] [c105:27187:0]          wireup.c:523  UCX  TRACE ep 0x7fd861c40000: wireup_ep=0x556b688068c0 flags=0x2 new_flags=0x5
[1725140201.878734] [c105:27187:0]          wireup.c:495  UCX  TRACE ep 0x7fd861c40000: switching wireup_ep 0x556b688068c0 to ready state
[1725140201.878746] [c105:27187:0]       wireup_ep.c:421  UCX  DEBUG ep 0x7fd861c40000: destroy wireup ep 0x556b688068c0
[1725140201.878752] [c105:27187:0]          wireup.c:981  UCX  REQ   req 0x556b68803f40: replay proto tag/rndv lane 0
[1725140201.878758] [c105:27187:0]     ucp_request.c:740  UCX  REQ   req 0x556b68803f40: progress tag/rndv {ucp_tag_rndv_rts_progress} ep_cfg[0] rkey_cfg[255] offset 0/70464
[1725140201.878760] [c105:27187:0]  proto_common.inl:371  UCX  TRACE   dt_iter_md_map=0x0 md_map=0x18
[1725140201.878763] [c105:27187:0]        ucp_rkey.c:149  UCX  TRACE   packing rkey type host md_map 0x0 dev_map 0x6
[1725140201.878766] [c105:27187:0]     ucp_request.c:754  UCX  REQ     req 0x556b68803f40: progress protocol tag/rndv returned: Success
[1725140201.879208] [c105:27187:0]      proto_rndv.c:879  UCX  REQ   req 0x556b68803f40: recv RTR offset 0 length 70464/70464 req 0x556b68803f40
[1725140201.879219] [c105:27187:0]        ucp_rkey.c:820  UCX  TRACE ep 0x7fd861c40000: unpacking rkey buffer 0x7fd866db312a length 27
[1725140201.879222] [c105:27187:0]           mpool.c:281  UCX  DEBUG   mpool ucp_rkeys: allocated chunk 0x556b68806ff0 of 16472 bytes with 128 elements
[1725140201.879228] [c105:27187:0]           ib_md.c:742  UCX  TRACE   unpacked rkey 0xffffffff0013c4b4: direct 0x13c4b4 atomic 0xffffffff
[1725140201.879233] [c105:27187:0]        ucp_rkey.c:892  UCX  TRACE   rkey[0] for remote md 5 is 0xffffffff0013c4b4
[1725140201.879236] [c105:27187:0]           ib_md.c:742  UCX  TRACE   unpacked rkey 0xffffffff00078f7f: direct 0x78f7f atomic 0xffffffff
[1725140201.879238] [c105:27187:0]        ucp_rkey.c:892  UCX  TRACE   rkey[1] for remote md 6 is 0xffffffff00078f7f
[1725140201.879245] [c105:27187:0]        ucp_rkey.c:918  UCX  TRACE   ep 0x7fd861c40000: unpacked rkey 0x556b6880afc0 md_map 0x60 type host
[1725140201.879254] [c105:27187:0]    proto_common.c:692  UCX  REQ   req 0x556b68803f40: rndv_send from host memory length 70464 zero-copy fenced write to remote 50% on dc_mlx5/mlx5_0:1 and 50% on dc_mlx5/mlx5_1:1 12
[1725140201.880456] [c105:27187:0]  proto_common.inl:172  UCX  REQ   req 0x556b68803f40: set to stage 0, progress function 'ucp_proto_rndv_put_zcopy_send_progress'
[1725140201.880463] [c105:27187:0]      proto_rndv.c:729  UCX  REQ   req 0x556b68803f40: rndv_send rva 0x7fcb680008d0 length 70464 rreq_id 0x560285a70e00 with protocol rndv/put/zcopy
[1725140201.880468] [c105:27187:0]     ucp_request.c:740  UCX  REQ   req 0x556b68803f40: progress rndv/put/zcopy {ucp_proto_rndv_put_zcopy_send_progress} ep_cfg[0] rkey_cfg[0] offset 0/70464
[1725140201.880474] [c105:27187:0]  proto_common.inl:73   UCX  REQ     req 0x556b68803f40: ucp_proto_request_zcopy_init for rndv/put/zcopy
[1725140201.880509] [c105:27187:0]           ib_md.c:293  UCX  ERROR   ibv_reg_mr(address=0x556b67818ff0, length=70480, access=0xf) failed: Bad address
[1725140201.880521] [c105:27187:0]          ucp_mm.c:70   UCX  ERROR   failed to register address 0x556b67818ff0 (host) length 70480 on md[3]=mlx5_0: Input/output error (md supports: host|cuda)
[1725140201.880533] [c105:27187:0]    proto_common.c:797  UCX  DEBUG   abort request 0x556b68803f40 proto rndv/put/zcopy status Input/output error
[1725140201.880541] [c105:27187:0]        rndv_put.c:43   UCX  REQ     req 0x556b68803f40: rndv_put_common_complete
[1725140201.880551] [c105:27187:0]        ucp_mm.inl:87   UCX  TRACE   memh 0x556b68806be0: release address 0x556b67818ff0 length 70480 md_map 0
[1725140201.880558] [c105:27187:0]        rcache.inl:29   UCX  TRACE   ucp_rcache: lru add region 0x556b68806be0 [0x556b67818ff0..0x556b6782a340] gt rw ref 2 no mds
[1725140201.880563] [c105:27187:0]   ucp_request.inl:246  UCX  REQ     completing send request 0x556b68803f40 (0x556b68804050) ---c-- Input/output error
[1725140201.880570] [c105:27187:0]     ucp_request.c:215  UCX  REQ     release request 0x556b68803f40 (0x556b68804050) d--c--
[1725140201.880572] [c105:27187:0]   ucp_request.inl:236  UCX  REQ     put request 0x556b68803f40
[1725140201.880576] [c105:27187:0]     ucp_request.c:754  UCX  REQ     req 0x556b68803f40: progress protocol rndv/put/zcopy returned: Success

@yosefe
Copy link
Contributor

yosefe commented Sep 22, 2024

@cl3to can you pls share the application code, or at least the part that allocates the buffer?
Can you try setting "UCX_IB_FORK_INIT=n" instead of "UCX_RCACHE_ENABLE=n"?
is it possible to capture the contents of /proc/<pid>/maps, along with the UCX trace logs, when the issue happens?

@cl3to
Copy link
Author

cl3to commented Sep 26, 2024

@yosefe

I’ve found the root cause of the issue. The address causing the failure in the UCX ibv_reg_mr call is marked as read-only. Since ibv_reg_mr requests write permissions for that address, it fails because the address can’t be written to.

For context, my application sends the __tgt_device_image structure from LLVM/OpenMP Offloading to the server process so it can run kernels from the image. The issue arises because the section holding this __tgt_device_image is loaded as read-only in the client's process memory, which leads to the failure with ibv_reg_mr.

I worked around this by copying the __tgt_device_image to a buffer with write access, which resolved the issue for rcache.

However, I noticed that when the rcache operation fails, UCX returns UCS_ERR_IO_ERROR, but it seems like MPICH is ignoring that error in MPI_Testall. This leaves my application hanging, waiting for the request to complete, which never happens due to the ignored error. It seems like this might be an MPICH issue. Any thoughts on that?

@yosefe
Copy link
Contributor

yosefe commented Oct 6, 2024

@raffenet WDYT about MPICH MPI_Testall handling (see above)?

@amastbaum
Copy link
Collaborator

amastbaum commented Nov 25, 2024

Hi @cl3to

Could you please provide the output when running the application with UCX_LOG_LEVEL_TRIGGER=error?

Thanks

@yosefe
Copy link
Contributor

yosefe commented Dec 15, 2024

Should be addressed by #10341

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants