About CUDA kernels written in Rust #27
Gui-Yom started this conversation in Show and tell
Hey, it does not look like there is another place where we can discuss CUDA, so I'm posting here.
I just wanted to share https://github.com/Gui-Yom/turbo-metrics, a tool to accelerate video metrics computation using CUDA (I'm the author).
I've been closely following the recent (not so recent anymore) work by @kjetilkjeka on the nvptx backend in rustc (more specifically the LLVM bitcode linker). It allows linking different kernel crates into a single PTX file using no tool other than cargo. Linking LLVM bitcode is also really useful because we can use the CUDA libdevice for optimized math functions and link against any other compiled LLVM bitcode as an escape hatch.
The project I'm sharing showcases that: it mixes host-side and device-side crates in the same cargo workspace and uses build scripts to make host crates depend on device crates. Everything can be built with a single cargo command. Device-side crates are standard `#![no_std]` crates.

Example device side kernel:
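Something along these lines (a simplified sketch, not the actual turbo-metrics kernels; the crate layout and the `add` kernel are made up for illustration, and the exact feature gates depend on the nightly you use):

```rust
// src/lib.rs of a hypothetical `kernels` crate, built for nvptx64-nvidia-cuda.
// Cargo.toml uses crate-type = ["cdylib"] so the linker emits a single .ptx file.
#![no_std]
#![feature(abi_ptx, stdarch_nvptx)]

use core::arch::nvptx;

// A no_std crate needs its own panic handler; on the device we just spin.
#[panic_handler]
fn panic(_info: &core::panic::PanicInfo) -> ! {
    loop {}
}

/// Illustrative element-wise add: out[i] = a[i] + b[i].
#[no_mangle]
pub unsafe extern "ptx-kernel" fn add(a: *const f32, b: *const f32, out: *mut f32, n: u32) {
    let i = nvptx::_block_idx_x() as u32 * nvptx::_block_dim_x() as u32
        + nvptx::_thread_idx_x() as u32;
    if i < n {
        *out.add(i as usize) = *a.add(i as usize) + *b.add(i as usize);
    }
}
```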
Example host side build script:
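A sketch of what the build script can look like (paths, crate names and the `KERNELS_PTX` variable are made up; it assumes the `nvptx64-nvidia-cuda` target plus the `llvm-bitcode-linker` and `llvm-tools` rustup components are installed):

```rust
// build.rs of a host crate that depends on a sibling `kernels` device crate.
use std::env;
use std::process::Command;

fn main() {
    let out_dir = env::var("OUT_DIR").unwrap();

    // Rebuild the PTX whenever the device crate changes (path is illustrative).
    println!("cargo:rerun-if-changed=../kernels/src");

    // Build the device crate for the NVPTX target. A separate --target-dir avoids
    // contending for the lock on the outer build's target directory.
    let status = Command::new(env::var("CARGO").unwrap())
        .args([
            "build",
            "--release",
            "--manifest-path",
            "../kernels/Cargo.toml",
            "--target",
            "nvptx64-nvidia-cuda",
            "--target-dir",
            out_dir.as_str(),
        ])
        .status()
        .expect("failed to spawn cargo for the device crate");
    assert!(status.success(), "device crate build failed");

    // Hand the PTX location to the host crate (KERNELS_PTX is a made-up name).
    println!("cargo:rustc-env=KERNELS_PTX={out_dir}/nvptx64-nvidia-cuda/release/kernels.ptx");
}
```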
Host side PTX loading:
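This is not the loader turbo-metrics actually uses; as a stand-in, here is a sketch with the cudarc driver API (the module/kernel names and the `KERNELS_PTX` variable come from the hypothetical examples above, and the exact cudarc calls depend on the version you pin):

```rust
use cudarc::driver::{CudaDevice, LaunchAsync, LaunchConfig};
use cudarc::nvrtc::Ptx;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let dev = CudaDevice::new(0)?;

    // Load the PTX produced by the build script and register the kernels it contains.
    let ptx = Ptx::from_file(env!("KERNELS_PTX"));
    dev.load_ptx(ptx, "kernels", &["add"])?;
    let add = dev.get_func("kernels", "add").expect("kernel not found");

    // Launch the illustrative element-wise add on 1024 floats.
    let n = 1024u32;
    let a = dev.htod_copy(vec![1.0f32; n as usize])?;
    let b = dev.htod_copy(vec![2.0f32; n as usize])?;
    let mut out = dev.alloc_zeros::<f32>(n as usize)?;
    unsafe { add.launch(LaunchConfig::for_num_elems(n), (&a, &b, &mut out, n))? };

    let host_out = dev.dtoh_sync_copy(&out)?;
    assert!(host_out.iter().all(|&x| x == 3.0));
    Ok(())
}
```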
There are many downsides to this approach:

- Depending on device crates (built for the `nvptx64-nvidia-cuda` target) through build scripts requires the crates to be checked out locally; no crates.io dependency is possible at the boundary. This also means no crates.io publishing.

Anyway, I just wanted to share my approach to building full Rust CUDA applications; maybe some people here will be interested. Also, I would love to help with development in any way I can.