OpenMPI+UCX with multiple GPUs error: "named symbol not found" #10304

Open
pascal-boeschoten-hapteon opened this issue Nov 15, 2024 · 3 comments

@pascal-boeschoten-hapteon

I'm trying to use OpenMPI+UCX with multiple CUDA devices within the same rank but quickly ran into a "named symbol not found" error:

cuda_copy_md.c:375  UCX  ERROR cuMemGetAddressRange(0x7f7553400000) error: named symbol not found
cuda_copy_md.c:375  UCX  ERROR cuMemGetAddressRange(0x7f7553400000) error: named symbol not found
			   ib_md.c:293  UCX  ERROR ibv_reg_mr(address=0x7f7553400000, length=33554432, access=0xf) failed: Bad address
			  ucp_mm.c:70   UCX  ERROR failed to register address 0x7f7553400000 (host) length 33554432 on md[6]=mlx5_bond_0: Input/output error (md supports: host)

This was with OpenMPI 5.0.5 and UCX 1.17.
Could this be because during the progression of a transfer, the associated CUDA device must be the current one, set with cudaSetDevice()? And if so, is there any way to make this work with multiple devices doing transfers in parallel?
I also came across a PR that looks like it may fix the issue I'm having: #9645

@yosefe (Contributor) commented Nov 17, 2024

This error could be asynchronous, coming from a previous failure. Can you please provide more details on the test case and the UCX/CUDA versions?

@judicaelclair

@yosefe - We (@pascal-boeschoten-hapteon and I) are using UCX 1.17.0 (built from source from the tagged release) alongside CUDA 12.1.105. We hit the above issue when using MPI_Isend/MPI_Irecv such that, within the same rank, some in-flight requests point to buffers on one GPU while other requests point to buffers on another GPU. Pseudo-code below:

auto buf_on_cuda_dev_0;                    // buffer allocated on CUDA device 0
auto buf_on_cuda_dev_1;                    // buffer allocated on CUDA device 1
cudaSetDevice(0);
MPI_Isend/MPI_Irecv(buf_on_cuda_dev_0);    // request on device 0 left in flight
cudaSetDevice(1);
MPI_Isend/MPI_Irecv(buf_on_cuda_dev_1);    // request on device 1 posted while device 0's is still in flight
MPI_Waitall();                             // both requests progressed with device 1 current
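
For reference, a minimal self-contained version of this failing pattern might look like the sketch below. The peer rank, tag, message size, and function name are placeholders rather than details taken from the actual application, and the buffers are assumed to have been allocated with cudaMalloc on devices 0 and 1:

#include <mpi.h>
#include <cuda_runtime.h>

// Two in-flight sends from the same rank, one buffer per GPU.
void send_from_two_devices(void* buf_on_dev0, void* buf_on_dev1,
                           int count, int peer)
{
    MPI_Request reqs[2];

    cudaSetDevice(0);  // device 0 is current when its request is posted
    MPI_Isend(buf_on_dev0, count, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &reqs[0]);

    cudaSetDevice(1);  // switch devices while the first request is still in flight
    MPI_Isend(buf_on_dev1, count, MPI_BYTE, peer, 1, MPI_COMM_WORLD, &reqs[1]);

    // Progressing both requests together is where the cuMemGetAddressRange
    // error was observed.
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}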

If instead, for a given rank, we only use one device at any given time, the CUDA error disappears and everything works correctly. That is, the previous pseudo-code would be changed to:

auto buf_on_cuda_dev_0;
auto buf_on_cuda_dev_1;
cudaSetDevice(0);
MPI_Isend/MPI_Irecv(buf_on_cuda_dev_0);
MPI_Waitall();
cudaSetDevice(1);
MPI_Isend/MPI_Irecv(buf_on_cuda_dev_1);
MPI_Waitall();
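
In terms of the sketch above, this working variant amounts to calling MPI_Waitall on the device-0 request before switching to device 1 and posting the next one, so that only one device's buffers are ever in flight at a time.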

@rakhmets (Collaborator)

Yes, the reason for the error is that the CUDA device was changed.
UCX tries to detect the memory using the cuMemGetAddressRange Driver API call. The function returns an error because the current device is different from the one on which the memory was allocated.
So UCX currently doesn't support this case, and #9645 solves this issue.
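
As a rough illustration (a simplified sketch, not the actual cuda_copy_md.c code), the pointer query in question looks like the following; when it fails, the buffer ends up being classified as host memory, which is consistent with the subsequent ibv_reg_mr/ucp_mm errors in the log above:

#include <cuda.h>
#include <cstdio>

// Simplified illustration of the pointer query that UCX's CUDA memory
// detection relies on (not the actual UCX source).
bool looks_like_cuda_memory(void* ptr)
{
    CUdeviceptr base = 0;
    size_t size = 0;
    CUresult res = cuMemGetAddressRange(&base, &size, (CUdeviceptr)ptr);
    if (res != CUDA_SUCCESS) {
        const char* err = nullptr;
        cuGetErrorString(res, &err);
        // An error such as "named symbol not found" would be reported here,
        // and the buffer would then be treated as host memory.
        std::fprintf(stderr, "cuMemGetAddressRange(%p) error: %s\n", ptr, err);
        return false;
    }
    return true;
}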
