Add conv_transpose_1d_gemm #940

Open · wants to merge 6 commits into master
Conversation

@smeso (Contributor) commented Aug 28, 2024

I wanted to add support for batching, padding and dilation to conv_transpose_1d and I decided to do it by re-implementing the operator using mulmat+col2im.
col2im at the moment only supports 1 dimension, but it can be easily extended to support 2d without breaking its API (it can be done in a future PR).
I also created the CUDA and SYCL kernels, and I would like to add the Vulkan shader as well (maybe in a different PR).

Implementing conv_transpose_1d via mulmat+col2im generally uses more memory but is also much faster, though neither holds in every case.

Because of the increased memory usage, it has been implemented as a new operator, so that people concerned about memory can still use the old version.

Here is a plot that compares the performance and memory usage of the two approaches:

[plot: performance and memory usage comparison of the two implementations]

Long story short: the bigger $IC$ is (compared to the other dimensions) the better.
For very small kernels with $IC \leq 2$ the old version is better.
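
To make the formulation concrete, here is a minimal CPU reference sketch of the mulmat+col2im approach (my own layout and naming, single batch element; this is not the code from this PR and does not follow ggml's tensor layout):

```c
/* Illustrative reference only: transposed 1D convolution expressed as a
 * GEMM followed by a col2im scatter-add, with stride/padding/dilation. */
#include <stdlib.h>
#include <string.h>

/* x: [IC][IW]      input
 * w: [IC][OC][KW]  kernel
 * y: [OC][OW]      output, OW = (IW-1)*s + (KW-1)*d + 1 - 2*p
 */
void conv_transpose_1d_gemm_ref(const float *x, const float *w, float *y,
                                int IC, int IW, int OC, int KW,
                                int s, int p, int d) {
    const int OW = (IW - 1)*s + (KW - 1)*d + 1 - 2*p;

    /* 1) GEMM: cols (OC*KW x IW) = W^T (OC*KW x IC) * X (IC x IW) */
    float *cols = calloc((size_t)OC*KW*IW, sizeof(float));
    for (int oc = 0; oc < OC; ++oc)
        for (int kw = 0; kw < KW; ++kw)
            for (int iw = 0; iw < IW; ++iw) {
                float acc = 0.0f;
                for (int ic = 0; ic < IC; ++ic)
                    acc += w[(ic*OC + oc)*KW + kw] * x[ic*IW + iw];
                cols[(oc*KW + kw)*IW + iw] = acc;
            }

    /* 2) col2im: scatter-add each column entry into the output row */
    memset(y, 0, (size_t)OC*OW*sizeof(float));
    for (int oc = 0; oc < OC; ++oc)
        for (int kw = 0; kw < KW; ++kw)
            for (int iw = 0; iw < IW; ++iw) {
                const int ow = iw*s + kw*d - p;
                if (ow >= 0 && ow < OW)
                    y[oc*OW + ow] += cols[(oc*KW + kw)*IW + iw];
            }

    free(cols);
}
```

The GEMM step is where the existing matrix-multiplication optimizations kick in; the col2im step is a comparatively cheap scatter-add.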

I noticed that, since I started working on this, a new im2col_back operator was added.
While col2im and im2col_back sound similar, the code and the way they work are a bit different: they can't be used as drop-in replacements for each other. They could perhaps be merged into a single function, but I don't know whether that makes sense.

@JohannesGaessler being the author of im2col_back, do you have any opinion on this PR?
Unfortunately I didn't think that my work had a chance to clash with someone else's, so I didn't coordinate first.

@JohannesGaessler (Collaborator) commented

I only added the IM2COL_BACK operator in order to enable training of the convolutional network in the MNIST example. If you can make that work with your COL2IM implementation I have no issue whatsoever with removing my code again.

@JohannesGaessler (Collaborator) commented Aug 29, 2024

> Implementing conv_transpose_1d via mulmat+col2im uses more memory but it can be ~80% faster (based on a very simple benchmark I did) than the current approach when the input data is relatively big.

The only reason why this is currently faster is a lack of optimization in the convolution kernels while a comparatively large amount of work has gone into optimizing general matrix multiplication. And I'm not convinced that the code in this PR is necessarily an upgrade over the code on master because the extra memory needed for IM2COL/COL2IM is quite substantial and for I/O-bound applications the speed will likely be worse as well. I currently do not use ggml convolutions for anything though so I personally do not care much about this. If I can get the training costs cheap enough I will maybe try to train an image model in 2025 using ggml. At that point I would likely overhaul the convolution code regardless of whether or not this PR is merged.

> While col2im and im2col_back sound similar, the code and the way they work is a bit different. They can't work just as drop-in replacement for each other, but I think that they could be merged in a single function, but I don't know if this makes sense.

I think it would be better to have a single operator; COL2IM would be the better name.

@JohannesGaessler (Collaborator) commented

In tests/test-grad0 there is a test for IM2COL. If you can make that test pass using your code then from my end it would be fine to remove my code.

@smeso (Contributor, Author) commented Aug 29, 2024

> > Implementing conv_transpose_1d via mulmat+col2im uses more memory but it can be ~80% faster (based on a very simple benchmark I did) than the current approach when the input data is relatively big.
>
> The only reason why this is currently faster is a lack of optimization in the convolution kernels while a comparatively large amount of work has gone into optimizing general matrix multiplication.

Yes, of course. I didn't mean to imply that I did any magic trick here: mulmat+col2im is a well-established method for transposed convolutions precisely because so many optimizations exist for GEMM. And of course, there is a memory/speed trade-off here. Also, for smaller kernels, the Winograd algorithm should work even better (I don't think ggml currently has this optimization).

> And I'm not convinced that the code in this PR is necessarily an upgrade over the code on master because the extra memory needed for IM2COL/COL2IM is quite substantial and for I/O-bound applications the speed will likely be worse as well.

Yes, it does use more memory: in the worst-case scenario, in my tests, it was ~800MB vs ~1MB.
In relative terms that is a lot, but it works well for my use-case (and hopefully for others as well, especially if smaller/quantized types are used).
Why do you think that the speed would be worse?
I'm sure that the code in this PR can be improved, but it's already much faster than the code in master, and it gets faster and faster as the kernel/input size grows.
Here are a couple of plots of the speed of the two versions as the kernel+input size grows (the size scale is log2 and time is linear). I'm sure there are better ways to benchmark these functions, but this was quick:
[plots: CPU and CUDA timings of the two implementations as the size grows]

The CUDA test is much shorter (i.e. I stopped it at a much lower size than the CPU test) because the old code was so slow that it was, in fact, much slower than using the CPU.

> > While col2im and im2col_back sound similar, the code and the way they work is a bit different. They can't work just as drop-in replacement for each other, but I think that they could be merged in a single function, but I don't know if this makes sense.
>
> I think it would be better to have a single operator, COL2IM would be a better name I think.

After reviewing your code, I realized that these two functions don't have much in common beyond the similar name. It should be possible to obtain one function's result from the other by multiplying/dividing (element-wise) by some matrix... but I don't think it's worth trying: it would likely be much worse, performance-wise, in both cases.

@JohannesGaessler (Collaborator) commented Aug 30, 2024

> Why do you think that the speed would be worse?

Assuming two square matrices, GEMM needs $O(N^3)$ compute and $O(N^2)$ I/O. GEMM is compute bound if the dimensions of all matrices are large; if one of the dimensions is very thin, however, it is instead I/O bound. It should be comparatively easier to write a convolution kernel that utilizes I/O well than one that utilizes compute well. So for the I/O-bound case the speed difference between convolution and GEMM should be comparatively smaller (if there is one at all), while IM2COL also increases the I/O even further due to redundant values.
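
(A back-of-the-envelope version of the same argument, under the idealised assumption that every matrix element is moved exactly once: for an $M \times K$ by $K \times N$ GEMM,

$$AI = \frac{2MKN}{MK + KN + MN} \approx \begin{cases} \tfrac{2}{3}N & \text{if } M = K = N \\ 2K & \text{if } K \ll M, N \end{cases}$$

so a thin dimension pins the arithmetic intensity at a small constant and the kernel stays I/O bound no matter how well the GEMM itself is optimized.)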

@smeso (Contributor, Author) commented Aug 30, 2024

Oh wait. I thought that your

> I'm not convinced that the code in this PR is necessarily an upgrade over the code on master

was specifically referring to something wrong with my code.
Now it's clearer: you were talking about GEMM in general.
Yeah, you are right in the general case, but...
In the specific case of conv_transpose_1d, the multiplication is performed between $(IC, KW \cdot OC)$ and $(IC, IW \cdot N)$.
Now, even if you have no control over $IC$, $KW$ and $OC$, you can control the value of $IW \cdot N$ and pick one that maintains $AI = 1$ (arithmetic intensity).
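
(Using the usual idealised estimate of GEMM arithmetic intensity, i.e. FLOPs per matrix element moved, with $A = KW \cdot OC$, $B = IW \cdot N$ and $IC$ as the contracted dimension:

$$AI \approx \frac{2 \cdot IC \cdot A \cdot B}{IC \cdot A + IC \cdot B + A \cdot B},$$

which increases with $B = IW \cdot N$ towards the limit $2 \cdot IC \cdot A / (IC + A)$, so $IW \cdot N$ can be chosen to hit a target $AI$ whenever that limit allows it.)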

A direct approach will have a big advantage if a large padding is used: in that case you can skip a lot of calculations and memory accesses, while GEMM still has to do all of them.
But with a small padding it will basically have to do the same memory accesses and operations (asymptotically), if I'm not mistaken.

Anyway, this PR wasn't really about performance: I'm doing it to add support for batching, dilation and padding (which I need to use). The better performance is a nice side effect, but it wasn't what I was looking for. The increased memory usage isn't nice, but for my use-case I'm happy to trade some memory for speed.

Of course, if in the future you can make it even faster, I'll be even happier!

P.S.
Here is another plot with various kernel sizes and $IW$ values chosen appropriately to maintain $AI=1$ (the shaded regions are $3\sigma$ intervals):

[plot: benchmark results at various kernel sizes with $IW$ chosen to maintain $AI = 1$]

@JohannesGaessler (Collaborator) commented

> Anyway this PR wasn't really about performance, I'm doing it to add support for batching, dilation and padding (that I need to use).

In a similar way, I work mainly on performance (which in turn ends up being mostly about matrix multiplications), so that is what a lot of my commentary tends to be about :)

@smeso (Contributor, Author) commented Aug 31, 2024

BTW, if the increased memory usage is a concern, instead of re-implementing the existing conv_transpose_1d, I could add a new conv_transpose_1d_gemm and let users decide for themselves.
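
For illustration, such an entry point could look like the following (a hypothetical sketch only: the name follows the PR title and the parameters mirror the existing ggml_conv_transpose_1d; the actual signature in this PR may differ):

```c
// Hypothetical API sketch, not the actual code from this PR.
// Mirrors ggml_conv_transpose_1d(ctx, a, b, s0, p0, d0), but builds the
// result out of ggml_mul_mat plus a col2im scatter instead of a direct kernel.
GGML_API struct ggml_tensor * ggml_conv_transpose_1d_gemm(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,   // convolution kernel
        struct ggml_tensor  * b,   // input data
        int                   s0,  // stride
        int                   p0,  // padding
        int                   d0); // dilation
```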

@smeso smeso changed the title Add support for padding and dilate to conv_transpose_1d via col2im operator Add conv_transpose_1d_gemm Sep 1, 2024
@smeso smeso marked this pull request as draft September 1, 2024 21:15
6 commits, each Signed-off-by: Salvatore Mesoraca <[email protected]>
@smeso smeso marked this pull request as ready for review September 7, 2024 10:02