Add conv_transpose_1d_gemm #940

Open · wants to merge 6 commits into master
Conversation

@smeso (Contributor) commented Aug 28, 2024

I wanted to add support for batching, padding and dilation to conv_transpose_1d and I decided to do it by re-implementing the operator using mulmat+col2im.
col2im at the moment only supports 1 dimension, but it can be easily extended to support 2d without breaking its API (it can be done in a future PR).
I also created the CUDA and SYCL kernels, and I would like to add the Vulkan shader as well (maybe in a different PR).

Implementing conv_transpose_1d via mulmat+col2im generally uses more memory but is also much faster, though neither holds in every case.

Because of the increased memory usage, it has been implemented as a new operator, so that people concerned about memory can still use the old version.

Here is a plot that compares the performance and memory usage of the two approaches:

[plot: performance and memory usage comparison of the two implementations]

Long story short: the bigger $IC$ is (compared to the other dimensions) the better.
For very small kernels with $IC \leq 2$ the old version is better.
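
To make the formulation concrete, here is a minimal CPU reference sketch of the mulmat+col2im approach (my own layout and naming, single batch element; this is not the code from this PR and does not follow ggml's tensor layout):

```c
/* Illustrative reference only: transposed 1D convolution expressed as a
 * GEMM followed by a col2im scatter-add, with stride/padding/dilation. */
#include <stdlib.h>
#include <string.h>

/* x: [IC][IW]      input
 * w: [IC][OC][KW]  kernel
 * y: [OC][OW]      output, OW = (IW-1)*s + (KW-1)*d + 1 - 2*p
 */
void conv_transpose_1d_gemm_ref(const float *x, const float *w, float *y,
                                int IC, int IW, int OC, int KW,
                                int s, int p, int d) {
    const int OW = (IW - 1)*s + (KW - 1)*d + 1 - 2*p;

    /* 1) GEMM: cols (OC*KW x IW) = W^T (OC*KW x IC) * X (IC x IW) */
    float *cols = calloc((size_t)OC*KW*IW, sizeof(float));
    for (int oc = 0; oc < OC; ++oc)
        for (int kw = 0; kw < KW; ++kw)
            for (int iw = 0; iw < IW; ++iw) {
                float acc = 0.0f;
                for (int ic = 0; ic < IC; ++ic)
                    acc += w[(ic*OC + oc)*KW + kw] * x[ic*IW + iw];
                cols[(oc*KW + kw)*IW + iw] = acc;
            }

    /* 2) col2im: scatter-add each column entry into the output row */
    memset(y, 0, (size_t)OC*OW*sizeof(float));
    for (int oc = 0; oc < OC; ++oc)
        for (int kw = 0; kw < KW; ++kw)
            for (int iw = 0; iw < IW; ++iw) {
                const int ow = iw*s + kw*d - p;
                if (ow >= 0 && ow < OW)
                    y[oc*OW + ow] += cols[(oc*KW + kw)*IW + iw];
            }

    free(cols);
}
```

The GEMM step is where the existing matrix-multiplication optimizations kick in; the col2im step is a comparatively cheap scatter-add.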

I noticed that, since I started working on this, a new im2col_back operator was added.
While col2im and im2col_back sound similar, the code and the way they work are a bit different: they can't be used as drop-in replacements for each other. They could perhaps be merged into a single function, but I don't know whether that makes sense.

@JohannesGaessler being the author of im2col_back, do you have any opinion on this PR?
Unfortunately I didn't think that my work had a chance to clash with someone else's, so I didn't coordinate first.

@JohannesGaessler (Collaborator) commented

I only added the IM2COL_BACK operator in order to enable training of the convolutional network in the MNIST example. If you can make that work with your COL2IM implementation I have no issue whatsoever with removing my code again.

@JohannesGaessler (Collaborator) commented Aug 29, 2024

> Implementing conv_transpose_1d via mulmat+col2im uses more memory but it can be ~80% faster (based on a very simple benchmark I did) than the current approach when the input data is relatively big.

The only reason why this is currently faster is a lack of optimization in the convolution kernels while a comparatively large amount of work has gone into optimizing general matrix multiplication. And I'm not convinced that the code in this PR is necessarily an upgrade over the code on master because the extra memory needed for IM2COL/COL2IM is quite substantial and for I/O-bound applications the speed will likely be worse as well. I currently do not use ggml convolutions for anything though so I personally do not care much about this. If I can get the training costs cheap enough I will maybe try to train an image model in 2025 using ggml. At that point I would likely overhaul the convolution code regardless of whether or not this PR is merged.

> While col2im and im2col_back sound similar, the code and the way they work is a bit different. They can't work just as drop-in replacement for each other, but I think that they could be merged in a single function, but I don't know if this makes sense.

I think it would be better to have a single operator; COL2IM would be the better name.

@JohannesGaessler (Collaborator) commented

In tests/test-grad0 there is a test for IM2COL. If you can make that test pass using your code then from my end it would be fine to remove my code.

@smeso (Contributor, Author) commented Aug 29, 2024

> > Implementing conv_transpose_1d via mulmat+col2im uses more memory but it can be ~80% faster (based on a very simple benchmark I did) than the current approach when the input data is relatively big.
>
> The only reason why this is currently faster is a lack of optimization in the convolution kernels while a comparatively large amount of work has gone into optimizing general matrix multiplication.

Yes, of course. I didn't mean to imply that I did any magic trick here: mulmat+col2im is a well-established method for transposed convolutions precisely because so many optimizations exist for GEMM. And of course, there is a memory/speed trade-off here. Also, for smaller kernels, the Winograd algorithm should work even better (I don't think ggml currently has this optimization).

> And I'm not convinced that the code in this PR is necessarily an upgrade over the code on master because the extra memory needed for IM2COL/COL2IM is quite substantial and for I/O-bound applications the speed will likely be worse as well.

Yes, it does use more memory: in the worst-case scenario, in my tests, it was ~800MB vs ~1MB.
In relative terms that is a lot, but it works well for my use-case (and hopefully for others as well, especially if smaller/quantized types are used).
Why do you think that the speed would be worse?
I'm sure that the code in this PR can be improved, but it's already much faster than the code in master, and it gets faster and faster as the kernel/input size grows.
Here are a couple of plots of the speed of the two versions as the kernel+input size grows (the size scale is log2 and time is linear). I'm sure there are better ways to benchmark these functions, but this was quick:
[plots: CPU and CUDA timings of the two implementations as the size grows]

The CUDA test is much shorter (i.e. I stopped it at a much lower size than the CPU test) because the old code was so slow that it was, in fact, much slower than using the CPU.

> > While col2im and im2col_back sound similar, the code and the way they work is a bit different. They can't work just as drop-in replacement for each other, but I think that they could be merged in a single function, but I don't know if this makes sense.
>
> I think it would be better to have a single operator, COL2IM would be a better name I think.

After reviewing your code, I realized that these two functions don't have much in common beyond the similar name. It should be possible to obtain one function's result from the other by multiplying/dividing (element-wise) by some matrix... but I don't think it's worth trying: it would likely be much worse, performance-wise, in both cases.

@JohannesGaessler (Collaborator) commented Aug 30, 2024

> Why do you think that the speed would be worse?

Assuming two square matrices, GEMM needs $O(N^3)$ compute and $O(N^2)$ I/O. GEMM is compute bound if the dimensions of all matrices are large; if one of the dimensions is very thin, however, it is instead I/O bound. It should be comparatively easier to write a convolution kernel that utilizes I/O well than one that utilizes compute well. So for the I/O-bound case the speed difference between convolution and GEMM should be comparatively smaller (if there is one at all), while IM2COL also increases the I/O even further due to redundant values.
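
(A back-of-the-envelope version of the same argument, under the idealised assumption that every matrix element is moved exactly once: for an $M \times K$ by $K \times N$ GEMM,

$$AI = \frac{2MKN}{MK + KN + MN} \approx \begin{cases} \tfrac{2}{3}N & \text{if } M = K = N \\ 2K & \text{if } K \ll M, N \end{cases}$$

so a thin dimension pins the arithmetic intensity at a small constant and the kernel stays I/O bound no matter how well the GEMM itself is optimized.)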

@smeso (Contributor, Author) commented Aug 30, 2024

Oh wait. I thought that your

> I'm not convinced that the code in this PR is necessarily an upgrade over the code on master

was specifically referring to something wrong with my code.
Now it's clearer: you were talking about GEMM in general.
Yeah, you are right in the general case, but...
In the specific case of conv_transpose_1d, the multiplication is performed between $(IC, KW \cdot OC)$ and $(IC, IW \cdot N)$.
Now, even if you have no control over $IC$, $KW$ and $OC$, you can control the value of $IW \cdot N$ and pick one that maintains $AI = 1$ (arithmetic intensity).
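
(Using the usual idealised estimate of GEMM arithmetic intensity, i.e. FLOPs per matrix element moved, with $A = KW \cdot OC$, $B = IW \cdot N$ and $IC$ as the contracted dimension:

$$AI \approx \frac{2 \cdot IC \cdot A \cdot B}{IC \cdot A + IC \cdot B + A \cdot B},$$

which increases with $B = IW \cdot N$ towards the limit $2 \cdot IC \cdot A / (IC + A)$, so $IW \cdot N$ can be chosen to hit a target $AI$ whenever that limit allows it.)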

A direct approach will have a big advantage if a large padding is used: in that case you can skip a lot of calculations and memory accesses, while GEMM still has to do all of them.
But with a small padding it will basically have to do the same memory accesses and operations (asymptotically), if I'm not mistaken.

Anyway, this PR wasn't really about performance: I'm doing it to add support for batching, dilation and padding (which I need to use). The better performance is a nice side effect, but it wasn't what I was looking for. The increased memory usage isn't nice, but for my use-case I'm happy to trade some memory for speed.

Of course, if in the future you can make it even faster, I'll be even happier!

P.S.
Here is another plot with various kernel sizes and $IW$ values chosen appropriately to maintain $AI=1$ (the shaded regions are $3\sigma$ intervals):

[plot: benchmark results at various kernel sizes with $IW$ chosen to maintain $AI = 1$]

@JohannesGaessler (Collaborator) commented

> Anyway this PR wasn't really about performance, I'm doing it to add support for batching, dilation and padding (that I need to use).

In a similar way, I work mainly on performance (which in turn ends up being mostly about matrix multiplications), so that is what a lot of my commentary tends to be about :)

@smeso (Contributor, Author) commented Aug 31, 2024

BTW, if the increased memory usage is a concern, instead of re-implementing the existing conv_transpose_1d, I could add a new conv_transpose_1d_gemm and let users decide for themselves.
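
For illustration, such an entry point could look like the following (a hypothetical sketch only: the name follows the PR title and the parameters mirror the existing ggml_conv_transpose_1d; the actual signature in this PR may differ):

```c
// Hypothetical API sketch, not the actual code from this PR.
// Mirrors ggml_conv_transpose_1d(ctx, a, b, s0, p0, d0), but builds the
// result out of ggml_mul_mat plus a col2im scatter instead of a direct kernel.
GGML_API struct ggml_tensor * ggml_conv_transpose_1d_gemm(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,   // convolution kernel
        struct ggml_tensor  * b,   // input data
        int                   s0,  // stride
        int                   p0,  // padding
        int                   d0); // dilation
```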

@smeso smeso changed the title Add support for padding and dilate to conv_transpose_1d via col2im operator Add conv_transpose_1d_gemm Sep 1, 2024
@smeso smeso marked this pull request as draft September 1, 2024 21:15
6 commits, each Signed-off-by: Salvatore Mesoraca <[email protected]>
@smeso smeso marked this pull request as ready for review September 7, 2024 10:02