UCT/CUDA: Runtime CUDA >= 12.3 to enable VMM #10396

Open: wants to merge 5 commits into master


tvegas1
Contributor

@tvegas1 tvegas1 commented Dec 20, 2024

What?

Do not use cuCtxSetFlags() if the CUDA driver does not support it.

Why?

An unresolved cuCtxSetFlags symbol on CUDA drivers older than 12.1 causes a crash.

How?

Assumptions:

  • cuCtxSetFlags is only needed for VMM, which UCX supports starting from CUDA driver 12.3
  • cuCtxSetFlags is not strictly needed for malloc async
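
The version thresholds in these assumptions can be sketched as plain predicates. This is a minimal illustration rather than the code in this PR: the helper names are hypothetical, and it assumes the driver version is available in the integer form reported by cuDriverGetVersion(), which encodes a release as 1000 * major + 10 * minor.

```c
#include <stdbool.h>

/* CUDA encodes versions as 1000 * major + 10 * minor, so 12.3 is 12030. */
#define CUDA_VERSION_ENCODE(major, minor) ((major) * 1000 + (minor) * 10)

/* Hypothetical helper: per the description above, the cuCtxSetFlags symbol
 * is unresolved on drivers older than 12.1. */
static bool driver_has_ctx_set_flags(int driver_version)
{
    return driver_version >= CUDA_VERSION_ENCODE(12, 1);
}

/* Hypothetical helper: per the description above, the VMM path is only
 * considered from driver 12.3 onwards. */
static bool driver_allows_vmm(int driver_version)
{
    return driver_version >= CUDA_VERSION_ENCODE(12, 3);
}
```

A caller would obtain `driver_version` once at MD open time and keep the VMM path disabled when `driver_allows_vmm()` returns false.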

Testing

Tested locally; still needs final testing on a platform with actual older drivers.

UCX_IB_GPU_DIRECT_RDMA=no ./rfs/bin/ucx_perftest -t tag_bw -m cuda 

@tvegas1 tvegas1 force-pushed the cuda_ctx_set_flags_runtime branch from 1ce967f to 68a5f51 Compare December 20, 2024 10:46
Comment on lines 891 to 896
    status = uct_cuda_copy_md_check_is_ctx_set_flags_supported();
    if ((status != UCS_OK) && (md->config.enable_fabric != UCS_NO)) {
        ucs_warn("disabled fabric memory allocations as cuda driver "
                 "library does not support cuCtxSetFlags()");
        md->config.enable_fabric = UCS_NO;
    }
Collaborator

Suggested change
-    status = uct_cuda_copy_md_check_is_ctx_set_flags_supported();
-    if ((status != UCS_OK) && (md->config.enable_fabric != UCS_NO)) {
-        ucs_warn("disabled fabric memory allocations as cuda driver "
-                 "library does not support cuCtxSetFlags()");
-        md->config.enable_fabric = UCS_NO;
-    }
+    if (md->config.enable_fabric != UCS_NO) {
+        status = uct_cuda_copy_md_check_is_ctx_set_flags_supported();
+        if (status != UCS_OK) {
+            if (md->config.enable_fabric == UCS_YES) {
+                ucs_error("fabric memory allocation requested but cuda driver "
+                          "library does not support cuCtxSetFlags()");
+                goto err_free_md;
+            } else {
+                ucs_diag("disabled fabric memory allocations as cuda driver "
+                         "library does not support cuCtxSetFlags()");
+                md->config.enable_fabric = UCS_NO;
+            }
+        }
+    }
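
The decision logic in this suggestion can be isolated as a small pure function, which makes the three outcomes easier to see. The sketch below is only an illustration: the enum and function names are hypothetical stand-ins for the UCX types, and the third configuration value (here `FABRIC_TRY`) is an assumption for "neither explicitly enabled nor disabled".

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for the tri-state config value and status codes. */
typedef enum { FABRIC_NO, FABRIC_YES, FABRIC_TRY } fabric_mode_t;
typedef enum { RESULT_OK = 0, RESULT_UNSUPPORTED = -1 } result_t;

/* Resolve the requested fabric mode against driver support:
 *  - nothing to do when fabric is off or the driver supports cuCtxSetFlags()
 *  - hard error when the user explicitly asked for fabric
 *  - silent downgrade to FABRIC_NO otherwise */
static result_t resolve_fabric_mode(fabric_mode_t *mode,
                                    bool ctx_set_flags_supported)
{
    if ((*mode == FABRIC_NO) || ctx_set_flags_supported) {
        return RESULT_OK;
    }

    if (*mode == FABRIC_YES) {
        fprintf(stderr, "fabric memory allocation requested but cuda driver "
                        "library does not support cuCtxSetFlags()\n");
        return RESULT_UNSUPPORTED;
    }

    /* Best-effort setting: disable fabric allocations and continue. */
    *mode = FABRIC_NO;
    return RESULT_OK;
}
```

Keeping the check behind `*mode != FABRIC_NO` means the support probe only matters when fabric allocations could actually be used.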

Contributor Author

This would not work, as we would still try to set the context flags even if fabric is not enabled.

Collaborator

Maybe we can check md->config.enable_fabric instead of ctx_set_flags_func in uct_cuda_copy_sync_memops.

src/uct/cuda/cuda_copy/cuda_copy_md.c (outdated)
    CUresult cu_err;

    if (status == UCS_ERR_INVALID_ADDR) {
        pthread_mutex_lock(&lock);
Collaborator

Why do you need a mutex here?

Contributor Author

Theoretical: multiple workers could write the function pointer concurrently.
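
For comparison, one lock-free way to guard a lazily resolved function pointer is to publish it through C11 atomics instead of a mutex. This is a swapped-in technique, not what the PR does: every name below is hypothetical, the resolver is a stand-in for a real driver symbol lookup, and the sketch assumes the lookup is idempotent, so two threads racing through it store the same value.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical signature standing in for the driver entry point. */
typedef int (*set_flags_fn_t)(unsigned int flags);

enum { LOOKUP_PENDING = 0, LOOKUP_FOUND = 1, LOOKUP_MISSING = 2 };

static _Atomic(set_flags_fn_t) cached_fn = NULL;
static atomic_int lookup_state           = LOOKUP_PENDING;
static atomic_int resolve_calls          = 0;

/* Hypothetical stand-in for the resolved driver function. */
static int fake_set_flags(unsigned int flags)
{
    return (int)flags;
}

/* Hypothetical resolver; a real one would query the driver library.
 * It must be idempotent because racing threads may each call it. */
static set_flags_fn_t resolve_set_flags(void)
{
    atomic_fetch_add(&resolve_calls, 1);
    return fake_set_flags;
}

/* Return the cached pointer, resolving it on first use. The release store
 * on lookup_state pairs with the acquire load, so a thread that observes a
 * non-pending state also observes the stored pointer. */
static set_flags_fn_t get_set_flags(void)
{
    if (atomic_load_explicit(&lookup_state, memory_order_acquire) ==
        LOOKUP_PENDING) {
        set_flags_fn_t fn = resolve_set_flags();
        atomic_store_explicit(&cached_fn, fn, memory_order_relaxed);
        atomic_store_explicit(&lookup_state,
                              (fn != NULL) ? LOOKUP_FOUND : LOOKUP_MISSING,
                              memory_order_release);
    }
    return atomic_load_explicit(&cached_fn, memory_order_relaxed);
}
```

A NULL return means the symbol is unavailable, and the separate state flag prevents the lookup from being repeated after a failed attempt.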

src/uct/cuda/cuda_copy/cuda_copy_md.c (outdated)
@yosefe
Contributor

yosefe commented Dec 20, 2024

We have tests for different CUDA versions, which include CUDA memory hooks (for example, Test Cuda Docker ubuntu18_cuda_12_0). Can we add a test that would have caught the new API usage?

3 participants