OpenMPI+UCX with multiple GPUs error: "named symbol not found" #10304

Open
pascal-boeschoten-hapteon opened this issue Nov 15, 2024 · 3 comments

@pascal-boeschoten-hapteon

I'm trying to use OpenMPI+UCX with multiple CUDA devices within the same rank but quickly ran into a "named symbol not found" error:

cuda_copy_md.c:375  UCX  ERROR cuMemGetAddressRange(0x7f7553400000) error: named symbol not found
cuda_copy_md.c:375  UCX  ERROR cuMemGetAddressRange(0x7f7553400000) error: named symbol not found
			   ib_md.c:293  UCX  ERROR ibv_reg_mr(address=0x7f7553400000, length=33554432, access=0xf) failed: Bad address
			  ucp_mm.c:70   UCX  ERROR failed to register address 0x7f7553400000 (host) length 33554432 on md[6]=mlx5_bond_0: Input/output error (md supports: host)

This was with OpenMPI 5.0.5 and UCX 1.17.
Could this be because during the progression of a transfer, the associated CUDA device must be the current one, set with cudaSetDevice()? And if so, is there any way to make this work with multiple devices doing transfers in parallel?
I also came across a PR that looks like it may fix the issue I'm having: #9645

@yosefe (Contributor) commented Nov 17, 2024

This error could be asynchronous, coming from a previous failure. Can you please provide more details on the test case and the UCX/CUDA versions?

@judicaelclair

@yosefe - We (@pascal-boeschoten-hapteon and I) are using UCX 1.17.0 (built from source from the tagged release) alongside CUDA 12.1.105. We hit the above issue when using MPI_Isend/MPI_Irecv such that, within the same rank, some in-flight requests point to buffers on one GPU while other requests point to buffers on another GPU. Pseudo-code below:

auto buf_on_cuda_dev_0;                    // buffer allocated on CUDA device 0
auto buf_on_cuda_dev_1;                    // buffer allocated on CUDA device 1
cudaSetDevice(0);
MPI_Isend/MPI_Irecv(buf_on_cuda_dev_0);    // request on device 0 left in flight
cudaSetDevice(1);
MPI_Isend/MPI_Irecv(buf_on_cuda_dev_1);    // request on device 1 posted while device 0's is still in flight
MPI_Waitall();                             // both requests progressed with device 1 current
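
For reference, a minimal self-contained version of this failing pattern might look like the sketch below. The peer rank, tag, message size, and function name are placeholders rather than details taken from the actual application, and the buffers are assumed to have been allocated with cudaMalloc on devices 0 and 1:

#include <mpi.h>
#include <cuda_runtime.h>

// Two in-flight sends from the same rank, one buffer per GPU.
void send_from_two_devices(void* buf_on_dev0, void* buf_on_dev1,
                           int count, int peer)
{
    MPI_Request reqs[2];

    cudaSetDevice(0);  // device 0 is current when its request is posted
    MPI_Isend(buf_on_dev0, count, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &reqs[0]);

    cudaSetDevice(1);  // switch devices while the first request is still in flight
    MPI_Isend(buf_on_dev1, count, MPI_BYTE, peer, 1, MPI_COMM_WORLD, &reqs[1]);

    // Progressing both requests together is where the cuMemGetAddressRange
    // error was observed.
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}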

If instead, for a given rank, we only use one device at any given time, the CUDA error disappears and everything works correctly. That is, the previous pseudo-code would be changed to:

auto buf_on_cuda_dev_0;
auto buf_on_cuda_dev_1;
cudaSetDevice(0);
MPI_Isend/MPI_Irecv(buf_on_cuda_dev_0);
MPI_Waitall();
cudaSetDevice(1);
MPI_Isend/MPI_Irecv(buf_on_cuda_dev_1);
MPI_Waitall();
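
In terms of the sketch above, this working variant amounts to calling MPI_Waitall on the device-0 request before switching to device 1 and posting the next one, so that only one device's buffers are ever in flight at a time.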

@rakhmets (Collaborator)

Yes, the reason for the error is that the CUDA device was changed.
UCX tries to detect the memory using the cuMemGetAddressRange Driver API call. The function returns an error because the current device is different from the one on which the memory was allocated.
So UCX currently doesn't support this case, and #9645 solves this issue.
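
As a rough illustration (a simplified sketch, not the actual cuda_copy_md.c code), the pointer query in question looks like the following; when it fails, the buffer ends up being classified as host memory, which is consistent with the subsequent ibv_reg_mr/ucp_mm errors in the log above:

#include <cuda.h>
#include <cstdio>

// Simplified illustration of the pointer query that UCX's CUDA memory
// detection relies on (not the actual UCX source).
bool looks_like_cuda_memory(void* ptr)
{
    CUdeviceptr base = 0;
    size_t size = 0;
    CUresult res = cuMemGetAddressRange(&base, &size, (CUdeviceptr)ptr);
    if (res != CUDA_SUCCESS) {
        const char* err = nullptr;
        cuGetErrorString(res, &err);
        // An error such as "named symbol not found" would be reported here,
        // and the buffer would then be treated as host memory.
        std::fprintf(stderr, "cuMemGetAddressRange(%p) error: %s\n", ptr, err);
        return false;
    }
    return true;
}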
