Faster strided-batched to batched wrapper #2592
Comments
Deriving a pointer is a fairly costly operation (see lines 530 to 598 in 4e9513b). As you can perform arithmetic with pointers, it's probably better to only call that function once, as you suggested, and to inline the offset calculation (maybe by calling …).
Hah, fascinating. What is your batch size? I wonder if there's a threshold to be determined here, below which a CPU-based calculation (necessitating a memory copy) is still faster than a kernel.
Batch size here is 1,000,000, but intuition tells me it should never be much slower to always do it on the GPU. Since you need the pointers on the GPU anyway, and the FLOPS required to compute a pointer are negligible, the operation is bound by GPU memory bandwidth regardless of which way you do it. E.g., for N = 128, and including the conversion to a CuArray in the CPU benchmark, the times are 4.5 μs on the CPU and 5.7 μs on the GPU. By my logic I actually would have expected the GPU to be at least as fast as the CPU, so maybe there's something I'm not getting, but in practice it seems reasonable to use the GPU regardless of batch size.
You're not considering the launch overhead of a kernel, which I normally think of as taking around 20 μs. You can process a lot of items during that time span (it's curious that on your system the launch overhead is closer to 5 μs). Then again, by doing the processing on the GPU you don't have to copy the pointer array over afterwards. Thinking some more, though, we need to be careful that the "slow" code performed by …
I don't really follow you here. Either way, I've started a PR. Very happy for you to make any changes as you see fit.
CUBLAS only implements strided-batched methods (acting on an M x N x B array) for a subset of operations. For the other cases, batched operations are available that act on a vector of 2D CuArrays.
CUDA.jl offers a convenient wrapper that takes a strided-batched array and creates a vector of pointers to its individual matrices, which can then be passed into the batched kernel. An example of this is CUDA.CUBLAS.getrf_strided_batched!, which calls CUDA.CUBLAS.getrf_batched! under the hood. The pointer conversion function is CUDA.CUBLAS.unsafe_strided_batch. This operation is very slow, especially in the setting of large batches of small matrices.
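For reference, its definition looks roughly like this (a paraphrase from memory of the CUDA.jl source, not a verbatim copy):

```julia
# Paraphrased sketch of CUDA.CUBLAS.unsafe_strided_batch:
# derive a fresh device pointer for every matrix in the batch.
function unsafe_strided_batch(strided::DenseCuArray{T}) where {T}
    batchsize = last(size(strided))
    stride = prod(size(strided)[1:end-1])
    # pointer(strided, i) re-derives a pointer for each batch entry,
    # which is the costly part.
    ptrs = [pointer(strided, (i-1)*stride + 1) for i in 1:batchsize]
    return CuArray(ptrs)  # copy the host-side pointer array to the device
end
```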
In my benchmarks, the pointer creation is roughly 1000 times slower than the operation itself.
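A benchmark along these lines shows the gap (the matrix sizes and the getrf call are my own illustration, not the exact code from the original report):

```julia
using CUDA, BenchmarkTools

# A large batch of small matrices: the worst case for per-matrix pointer derivation.
A = CUDA.rand(Float32, 4, 4, 1_000_000)

# Time the pointer-array construction on its own ...
@btime CUDA.@sync CUDA.CUBLAS.unsafe_strided_batch($A)

# ... versus the full factorization, which performs that construction internally
# (signature as I recall it; the factorization mutates its input, hence the setup).
@btime CUDA.@sync CUDA.CUBLAS.getrf_strided_batched!(B, true) setup=(B = copy($A)) evals=1
```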
I'm not exactly sure why this is, but I assume it comes down to the repeated access of strided when creating the pointers. Since strided matrices have regularly spaced memory addresses, a faster way would be to derive the base pointer once and compute the offsets arithmetically on the CPU (see the sketch below), which brings the pointer creation down to the same order of magnitude of speed as the kernel itself.
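Something along these lines is what I have in mind (an illustrative sketch; the name strided_batch_cpu is mine):

```julia
# Derive the base pointer once, then offset it arithmetically on the CPU.
# CuPtr arithmetic is in bytes, hence the sizeof(T) factor.
function strided_batch_cpu(strided::DenseCuArray{T}) where {T}
    batchsize = last(size(strided))
    stride = prod(size(strided)[1:end-1]) * sizeof(T)
    base = pointer(strided)                       # single (costly) pointer derivation
    ptrs = [base + (i-1)*stride for i in 1:batchsize]
    return CuArray(ptrs)                          # still requires a host-to-device copy
end
```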
Better yet, the GPU can be used for this (see the sketch below), at which point the pointer creation time is negligible.
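For example, via a broadcast (again an illustrative sketch, assuming CuPtr arithmetic compiles in device code; the name strided_batch_gpu is mine):

```julia
# Compute the pointer array directly on the device with a broadcast,
# avoiding both the host-side loop and the host-to-device copy.
function strided_batch_gpu(strided::DenseCuArray{T}) where {T}
    batchsize = last(size(strided))
    stride = prod(size(strided)[1:end-1]) * sizeof(T)  # byte stride between matrices
    offsets = CuArray(0:batchsize-1) .* stride         # offsets live on the GPU
    return pointer(strided) .+ offsets                 # yields a CuArray of CuPtr{T}
end
```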
I'd be happy to implement these changes, but wanted to run them by the team first to make sure there aren't any potential issues with this approach.