
Faster strided-batched to batched wrapper #2592

Open · THargreaves opened this issue Dec 15, 2024 · 4 comments
Labels: cuda libraries, enhancement, performance

Comments

THargreaves commented Dec 15, 2024

CUBLAS only implements strided-batched methods (acting on an M x N x B array) for a subset of operations. For the other cases, batched operations are available that act on a vector of 2D CuArrays.

CUDA.jl offers a convenient wrapper that takes a strided-batched array and creates a vector of pointers to its individual matrices, which can then be passed into the batched kernel. An example of this is CUDA.CUBLAS.getrf_strided_batched!, which calls CUDA.CUBLAS.getrf_batched! under the hood.
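For concreteness, a rough sketch of that pattern (not verbatim library code; only unsafe_strided_batch is taken from the actual implementation, shown below):

using CUDA

A = CUDA.rand(Float32, 4, 4, 1000)           # a strided batch of 1000 4x4 matrices

# getrf_strided_batched!(A, true) roughly boils down to:
Aptrs = CUDA.CUBLAS.unsafe_strided_batch(A)  # one device pointer per 4x4 slice
# ...which is then handed to the batched CUBLAS kernel, getrf_batched!.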

The pointer conversion function is CUDA.CUBLAS.unsafe_strided_batch and is defined as

@inline function unsafe_strided_batch(strided::DenseCuArray{T}) where {T}
    batchsize = last(size(strided))
    stride = prod(size(strided)[1:end-1])
    ptrs = [pointer(strided, (i-1)*stride + 1) for i in 1:batchsize]
    return CuArray(ptrs)
end

This operation is very slow, especially in the setting of large batches of small matrices.

using CUDA, BenchmarkTools

D = 4;
N = 1000000;
A = CUDA.rand(Float32, D, D, N);
Aptrs = CUDA.CUBLAS.unsafe_strided_batch(A);  # pointer table reused by the second benchmark

println("Naive pointer conversion")
@benchmark CUDA.@sync CUDA.CUBLAS.unsafe_strided_batch($A)

println("Batched getrf")
@benchmark CUDA.CUBLAS.getrf_batched!($D, $Aptrs, $D, true, nothing)

gives

Naive pointer conversion

BenchmarkTools.Trial: 23 samples with 1 evaluation.
 Range (min … max):  203.631 ms … 269.869 ms  ┊ GC (min … max): 0.90% … 9.20%
 Time  (median):     219.725 ms               ┊ GC (median):    1.97%
 Time  (mean ± σ):   224.935 ms ±  19.125 ms  ┊ GC (mean ± σ):  4.69% ± 4.81%

  █▁ ▁          █              ▁    ▁                            
  ██▁█▁▁▁▁▁▁▁▁▁▁█▁▁▆▁▆▁▁▁▁▁▁▁▁▆█▁▆▁▁█▁▁▁▁▁▁▆▁▁▁▁▁▁▁▁▁▁▆▁▁▁▁▁▁▁▆ ▁
  204 ms           Histogram: frequency by time          270 ms <

 Memory estimate: 83.92 MiB, allocs estimate: 5000013.

Batched getrf

BenchmarkTools.Trial: 3695 samples with 8 evaluations.
 Range (min … max):    3.923 μs … 436.164 μs  ┊ GC (min … max): 0.00% … 36.24%
 Time  (median):     171.996 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   168.987 μs ±  32.501 μs  ┊ GC (mean ± σ):  0.66% ±  3.31%

  ▂                                                           █  
  █▄▄▅▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▃▁▁▃▁▁▁▁▁▃▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▃▁▄▃▃▄▃▄▄▄█ █
  3.92 μs       Histogram: log(frequency) by time        173 μs <

 Memory estimate: 672 bytes, allocs estimate: 26.

As you can see, the pointer creation time is roughly 1000 times slower than the operation itself.

I'm not exactly sure why this is, but I assume it comes down to the repeated access of strided when creating the pointers. Since the matrices of a strided batch sit at regularly spaced memory addresses, a faster way would be

stride = D * D;  # number of elements per matrix
first_address = UInt64(pointer(A));
println("Pointer creation time (no reference to A)")
@benchmark [CuPtr{Float32}($first_address + (i - 1) * $stride * sizeof(Float32)) for i in 1:($N)]

giving

Pointer creation time (no reference to A)

BenchmarkTools.Trial: 6672 samples with 1 evaluation.
 Range (min … max):  149.006 μs … 3.956 ms  ┊ GC (min … max):  0.00% … 24.57%
 Time  (median):     180.368 μs             ┊ GC (median):     0.00%
 Time  (mean ± σ):   746.292 μs ± 1.021 ms  ┊ GC (mean ± σ):  20.54% ± 22.02%

  █▄▅                        ▁▂▁       ▁▁        ▃▃       ▁▃▂ ▁
  ███▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅███▇▅▄▁▁▆███▇▆▃▁▃▁▆████▆▃▁▃▁▆███ █
  149 μs       Histogram: log(frequency) by time      3.08 ms <

 Memory estimate: 7.63 MiB, allocs estimate: 3.

This is now within the same order of magnitude as the kernel itself.
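For reference, here is a rough sketch of how this could fold back into the wrapper, deriving the base pointer only once (the name unsafe_strided_batch_cpu is purely illustrative, and the raw-integer arithmetic simply mirrors the benchmark above):

# Sketch only: a single pointer derivation, then plain integer offsets on the CPU.
@inline function unsafe_strided_batch_cpu(strided::DenseCuArray{T}) where {T}
    batchsize = last(size(strided))
    stride = prod(size(strided)[1:end-1])    # number of elements per matrix
    base = UInt64(pointer(strided))          # derive the (costly) pointer once
    ptrs = [CuPtr{T}(base + (i - 1) * stride * sizeof(T)) for i in 1:batchsize]
    return CuArray(ptrs)                     # upload the pointer table to the GPU
end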

Better yet, the GPU can be used for this.

Aptrs_gpu = CuArray{CuPtr{Float32}}(undef, N);

function create_ptrs_kernel!(Aptrs_gpu, initial_ptr_address::UInt64, stride)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    grid_stride = gridDim().x * blockDim().x  # loop step; keep distinct from the matrix stride
    for i in index:grid_stride:length(Aptrs_gpu)
        Aptrs_gpu[i] = CuPtr{Float32}(initial_ptr_address + (i - 1) * stride * sizeof(Float32))
    end
    return nothing
end

println("Pointer creation time (GPU)")
@benchmark CUDA.@sync @cuda threads = $threads blocks = $blocks create_ptrs_kernel!(
    $Aptrs_gpu, UInt64(pointer($A)), $stride
)

at which point the pointer creation time is negligible:

Pointer creation time (GPU)

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  9.023 μs …  44.357 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     9.795 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.836 μs ± 475.868 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                              ▂▃▅█▅▅▇▂▁▂                       
  ▂▂▂▂▂▂▂▂▂▂▁▂▂▁▂▂▂▂▂▂▂▂▂▂▂▃▄▆██████████▇▆▆▄▄▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  9.02 μs         Histogram: frequency by time        10.4 μs <

 Memory estimate: 256 bytes, allocs estimate: 12.
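To make the proposed integration concrete, a hedged sketch of how the wrapper could launch this kernel instead of building the pointers on the CPU (unsafe_strided_batch_gpu and the launch configuration are illustrative rather than a final API; it is restricted to Float32 because create_ptrs_kernel! above hard-codes CuPtr{Float32}):

# Sketch only: build the pointer table directly on the GPU, reusing create_ptrs_kernel!.
function unsafe_strided_batch_gpu(strided::DenseCuArray{Float32})
    batchsize = last(size(strided))
    stride = prod(size(strided)[1:end-1])    # number of elements per matrix
    ptrs = CuArray{CuPtr{Float32}}(undef, batchsize)
    base = UInt64(pointer(strided))          # single pointer derivation on the CPU
    threads = 256                            # illustrative launch configuration
    blocks = cld(batchsize, threads)
    @cuda threads=threads blocks=blocks create_ptrs_kernel!(ptrs, base, stride)
    return ptrs
end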

I'd be happy to implement these changes but wanted to run it by the team first to make sure there weren't any potential issues with this approach.

THargreaves added the enhancement label Dec 15, 2024
maleadt added the cuda libraries and performance labels Dec 16, 2024
maleadt (Member) commented Dec 16, 2024

I'm not exactly sure why this is but I assume it comes down to the repeated access of strided when creating the pointers.

Deriving a pointer is a fairly costly operation:

CUDA.jl/src/memory.jl, lines 530 to 598 at 4e9513b:

function Base.convert(::Type{CuPtr{T}}, managed::Managed{M}) where {T,M}
    # let null pointers pass through as-is
    ptr = convert(CuPtr{T}, managed.mem)
    if ptr == CU_NULL
        return ptr
    end

    # accessing memory during stream capture: taint the memory so that we always synchronize
    state = active_state()
    if is_capturing(state.stream)
        managed.captured = true
    end

    # accessing memory on another device: ensure the data is ready and accessible
    if M == DeviceMemory && state.context != managed.mem.ctx
        maybe_synchronize(managed)
        source_device = managed.mem.dev

        # enable peer-to-peer access
        if maybe_enable_peer_access(state.device, source_device) != 1
            throw(ArgumentError(
                """cannot take the GPU address of inaccessible device memory.
                   You are trying to use memory from GPU $(deviceid(source_device)) on GPU $(deviceid(state.device)).
                   P2P access between these devices is not possible; either switch to GPU $(deviceid(source_device))
                   by calling `CUDA.device!($(deviceid(source_device)))`, or copy the data to an array allocated on device $(deviceid(state.device))."""))
        end

        # set pool visibility
        if stream_ordered(source_device)
            pool = pool_create(source_device)
            access!(pool, state.device, ACCESS_FLAGS_PROT_READWRITE)
        end
    end

    # accessing memory on another stream: ensure the data is ready and take ownership
    if managed.stream != state.stream
        maybe_synchronize(managed)
        managed.stream = state.stream
    end

    managed.dirty = true
    return ptr
end

function Base.convert(::Type{Ptr{T}}, managed::Managed{M}) where {T,M}
    # let null pointers pass through as-is
    ptr = convert(Ptr{T}, managed.mem)
    if ptr == C_NULL
        return ptr
    end

    # accessing memory on the CPU: only allowed for host or unified allocations
    if M == DeviceMemory
        throw(ArgumentError(
            """cannot take the CPU address of GPU memory.
               You are probably falling back to or otherwise calling CPU functionality
               with GPU array inputs. This is not supported by regular device memory;
               ensure this operation is supported by CUDA.jl, and if it isn't, try to
               avoid it or rephrase it in terms of supported operations. Alternatively,
               you can consider using GPU arrays backed by unified memory by
               allocating using `cu(...; unified=true)`."""))
    end

    # make sure any work on the memory has finished.
    maybe_synchronize(managed)
    return ptr
end

Since you can perform arithmetic with pointers, it's probably better to call that function only once, as you suggested, and inline the offset calculation (maybe by calling Base._memory_offset(x, i) to avoid duplicating logic).
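A rough sketch of that variant, assuming CuPtr supports integer addition (as CUDA.jl's pointer arithmetic does) and that Base._memory_offset works on a CuArray; the function name is again only illustrative:

# Sketch only: differs from the earlier CPU sketch in how the offset is computed.
@inline function unsafe_strided_batch_offset(strided::DenseCuArray{T}) where {T}
    batchsize = last(size(strided))
    stride = prod(size(strided)[1:end-1])  # number of elements per matrix
    base = pointer(strided)                # single pointer derivation
    ptrs = [base + Base._memory_offset(strided, (i - 1) * stride + 1) for i in 1:batchsize]
    return CuArray(ptrs)
end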

Better yet, the GPU can be used for this.

Hah, fascinating. What is your batch size? I wonder if there's a threshold to be determined here, where a CPU-based calculation (necessitating a memory copy) is still faster than a kernel.

THargreaves (Author) commented

Batch size here is 1000000, but intuition tells me it should never be that much slower to always do it on the GPU. Since you need the pointers on the GPU and the FLOPs required to compute a pointer are negligible, the operation is bound by GPU memory bandwidth regardless of which way you do it.

E.g. for N = 128, and including a conversion to CuArray in the CPU benchmark, the times are 4.5 μs on CPU and 5.7 μs on GPU.

I actually would have expected the GPU to be at least as fast as the CPU by my logic, so maybe there's something I'm not getting, but in practice, it seems reasonable to use the GPU regardless of batch size.

maleadt (Member) commented Dec 20, 2024

You're not considering the launch overhead of a kernel, which I normally think of as taking around 20 μs. You can process a lot of items during that time span (it's curious that on your system the launch overhead is closer to 5 μs). Then again, by doing the processing on the GPU, you don't have to memcpy the pointer values to the GPU (which I presume takes a similar time to queue as a kernel launch does), so in this specific case it's probably fine to always compute on the GPU.

Thinking some more though, we need to be careful that the "slow" code performed by pointer on the CPU is at least triggered once, because it's responsible for configuring the memory pool and exposing memory when using unified memory and/or multiple devices. Passing the source array should do that (the conversion to a device array when passing as an argument requires a pointer conversion), but this should be double checked.

THargreaves (Author) commented

Thinking some more though, we need to be careful that the "slow" code performed by pointer on the CPU is at least triggered once, because it's responsible for configuring the memory pool and exposing memory when using unified memory and/or multiple devices. Passing the source array should do that (the conversion to a device array when passing as an argument requires a pointer conversion), but this should be double checked.

I don't really follow you here.

Either way, I've started a PR. Very happy for you to make any changes as you see fit.
