I'm trying to use OpenMPI+UCX with multiple CUDA devices within the same rank but quickly ran into a "named symbol not found" error:
cuda_copy_md.c:375 UCX ERROR cuMemGetAddressRange(0x7f7553400000) error: named symbol not found
cuda_copy_md.c:375 UCX ERROR cuMemGetAddressRange(0x7f7553400000) error: named symbol not found
ib_md.c:293 UCX ERROR ibv_reg_mr(address=0x7f7553400000, length=33554432, access=0xf) failed: Bad address
ucp_mm.c:70 UCX ERROR failed to register address 0x7f7553400000 (host) length 33554432 on md[6]=mlx5_bond_0: Input/output error (md supports: host)
This was with OpenMPI 5.0.5 and UCX 1.17.
Could this be because during the progression of a transfer, the associated CUDA device must be the current one, set with cudaSetDevice()? And if so, is there any way to make this work with multiple devices doing transfers in parallel?
I also came across a PR that looks like it may fix the issue I'm having: #9645
@yosefe - We (@pascal-boeschoten-hapteon and I) are using UCX 1.17.0 (built from source from the tagged release) alongside CUDA 12.1.105. We encounter the above issue when using MPI_Isend/MPI_Irecv such that, within the same rank, some in-flight requests point to buffers on one GPU while other requests point to buffers on another GPU. Pseudo-code below:
auto buf_on_cuda_dev_0; // allocated while device 0 was current
auto buf_on_cuda_dev_1; // allocated while device 1 was current
cudaSetDevice(0);
MPI_Isend/MPI_Irecv(buf_on_cuda_dev_0, ...);
cudaSetDevice(1);
MPI_Isend/MPI_Irecv(buf_on_cuda_dev_1, ...);
MPI_Waitall(...); // requests for both devices in flight at once -> error
If instead, for a given rank, we only use one device at any given time, then the CUDA error disappears and everything works correctly. I.e., the previous pseudo-code would be changed to:
auto buf_on_cuda_dev_0; // allocated while device 0 was current
auto buf_on_cuda_dev_1; // allocated while device 1 was current
cudaSetDevice(0);
MPI_Isend/MPI_Irecv(buf_on_cuda_dev_0, ...);
MPI_Waitall(...); // drain device 0's requests before switching
cudaSetDevice(1);
MPI_Isend/MPI_Irecv(buf_on_cuda_dev_1, ...);
MPI_Waitall(...);
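For reference, the serialized pattern above can be written out as a compilable sketch. The function name, buffer arguments, tags, and peer rank are placeholders, and error checking is omitted for brevity:

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* Hypothetical sketch: transfer one buffer per GPU, but keep only one
 * device's requests in flight at a time, so UCX progresses each transfer
 * while the matching device is current. */
static void send_serialized(void *buf_dev0, void *buf_dev1,
                            int count, int peer)
{
    MPI_Request req;

    cudaSetDevice(0);                  /* device 0 stays current...        */
    MPI_Isend(buf_dev0, count, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE); /* ...until its transfer completes  */

    cudaSetDevice(1);                  /* only now switch to device 1      */
    MPI_Isend(buf_dev1, count, MPI_BYTE, peer, 1, MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}
```

This trades concurrency across GPUs for correctness, which is why it only works around, rather than fixes, the underlying limitation.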
Yes, the reason for the error is that the CUDA device was changed.
UCX tries to detect the memory type with the cuMemGetAddressRange driver API call. The call returns an error because the current device differs from the one on which the memory was allocated.
So currently UCX doesn't support this case, and #9645 solves it.
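The failure mode can be illustrated without MPI at all. The sketch below (an assumption-laden reproducer, not UCX's actual code path: it assumes two visible GPUs and abbreviates error handling) allocates on device 0, then queries the address range while device 1 is current, which is where the reporter's "named symbol not found" (CUDA_ERROR_NOT_FOUND) error surfaces:

```c
#include <stdio.h>
#include <cuda.h>
#include <cuda_runtime.h>

int main(void)
{
    void *buf = NULL;
    CUdeviceptr base;
    size_t size;

    cudaSetDevice(0);
    cudaMalloc(&buf, 1 << 20);   /* allocated in device 0's primary context */

    cudaSetDevice(1);
    cudaFree(0);                 /* force creation of device 1's context     */

    /* Same query UCX makes to detect the memory type, but with a different
     * device current than the one the buffer was allocated on. */
    CUresult res = cuMemGetAddressRange(&base, &size, (CUdeviceptr)buf);

    const char *msg = NULL;
    cuGetErrorString(res, &msg);
    printf("cuMemGetAddressRange: %s\n", msg ? msg : "unknown error");
    return 0;
}
```

Running this on a multi-GPU box with the setup described above should reproduce the lookup failure independently of Open MPI.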