Add conv_transpose_1d_gemm #940
base: master
Conversation
I only added the
The only reason why this is currently faster is a lack of optimization in the convolution kernels, while a comparatively large amount of work has gone into optimizing general matrix multiplication. And I'm not convinced that the code in this PR is necessarily an upgrade over the code on master, because the extra memory needed for
I think it would be better to have a single operator,
In
Assuming two square matrices, GEMM needs
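For context, the standard figures for multiplying two square $N \times N$ matrices (a general fact, not a reconstruction of the truncated comment above):

$$
\text{FLOPs} \approx 2N^3, \qquad \text{data} \approx 3N^2 \ \text{elements} \quad\Rightarrow\quad \text{arithmetic intensity} = O(N),
$$

which is why a well-tuned GEMM kernel can stay compute-bound.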
Oh wait. I thought that your

was specifically referring to something wrong with my code. A direct approach will have a big advantage if a large padding is used: in that case you will be able to skip a lot of calculations and memory accesses, while GEMM still has to do all of them. Anyway, this PR wasn't really about performance; I'm doing it to add support for batching, dilation and padding (which I need to use). The better performance is a nice side effect, but it wasn't what I was looking for. The increased memory usage isn't nice, but for my use case I'm happy to trade some memory for speed. Of course, if in the future you can make it even faster, I'll be even happier!

P.S.
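For reference, the output length of a transposed 1D convolution with stride $s$, padding $p$, dilation $d$ and kernel size $K$ is usually defined as (this is the PyTorch `ConvTranspose1d` convention; whether this PR uses exactly the same convention is an assumption):

$$
L_{\text{out}} = (L_{\text{in}} - 1)\,s - 2p + d\,(K - 1) + 1 .
$$

With a large $p$, many of the per-column contributions fall outside $[0, L_{\text{out}})$; a direct kernel can skip them up front, while the GEMM formulation computes the full column buffer first and only discards them during the `col2im` pass.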
In a similar way, I work mainly on performance (which in turn ends up being mostly matrix multiplications), so that is what a lot of my commentary tends to be about :)
BTW, if the increased memory usage is a concern, instead of re-implementing the existing |
force-pushed from 282a5ca to f872788
force-pushed from f872788 to 1b4c19b
I wanted to add support for batching, padding and dilation to `conv_transpose_1d`, and I decided to do it by re-implementing the operator using `mul_mat` + `col2im`. `col2im` at the moment only supports 1 dimension, but it can easily be extended to support 2D without breaking its API (this can be done in a future PR). I also created the CUDA and SYCL kernels, and I would also like to add the Vulkan shader (maybe in a different PR).

Implementing `conv_transpose_1d` via `mul_mat` + `col2im` uses more memory (not always) but is much faster (not always). Because of the increased memory usage, it has been implemented as a new operator, so that people concerned about memory can still use the old version.
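To make the approach more concrete, here is a minimal plain-C reference sketch of the idea, assuming PyTorch-style tensor layouts and stride/padding/dilation conventions; the function name and layouts are illustrative and do not reflect the actual ggml code in this PR:

```c
#include <stdlib.h>
#include <string.h>

/* Reference sketch: conv_transpose_1d as GEMM + col2im.
 * x: [IC][L_in] input, w: [IC][OC][K] kernel, y: [OC][L_out] output (caller-allocated).
 * L_out = (L_in - 1)*stride - 2*pad + dil*(K - 1) + 1. */
static void conv_transpose_1d_gemm_ref(
        const float *x, const float *w, float *y,
        int IC, int OC, int K, int L_in,
        int stride, int pad, int dil) {
    const int L_out = (L_in - 1)*stride - 2*pad + dil*(K - 1) + 1;

    /* Step 1 - GEMM: cols = W^T * X, with W^T of shape (OC*K x IC) and X of shape (IC x L_in).
     * This column buffer of OC*K*L_in floats is the extra memory the GEMM route needs. */
    float *cols = malloc((size_t)OC*K*L_in*sizeof(float));
    if (!cols) return;
    for (int oc = 0; oc < OC; ++oc) {
        for (int k = 0; k < K; ++k) {
            for (int l = 0; l < L_in; ++l) {
                float acc = 0.0f;
                for (int ic = 0; ic < IC; ++ic) {
                    acc += w[(ic*OC + oc)*K + k] * x[ic*L_in + l];
                }
                cols[(oc*K + k)*L_in + l] = acc;
            }
        }
    }

    /* Step 2 - col2im: scatter-add every column entry into the output position it overlaps,
     * skipping positions that fall outside the output because of the padding. */
    memset(y, 0, (size_t)OC*L_out*sizeof(float));
    for (int oc = 0; oc < OC; ++oc) {
        for (int k = 0; k < K; ++k) {
            for (int l = 0; l < L_in; ++l) {
                const int pos = l*stride - pad + k*dil;
                if (pos >= 0 && pos < L_out) {
                    y[oc*L_out + pos] += cols[(oc*K + k)*L_in + l];
                }
            }
        }
    }
    free(cols);
}
```

In an actual backend the naive loops in step 1 would be a single call into the existing, highly optimized `mul_mat` kernels, which is where the speedup over a direct `conv_transpose_1d` kernel comes from; batching amounts to running the same two steps per batch element.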
Here is a plot that compares performance and memory usage of the 2 approaches:
Long story short: the bigger $IC$ is (compared to the other dimensions), the better. For very small kernels with $IC \leq 2$, the old version is better.
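One way to read this trade-off (an interpretation based on the usual im2col-style formulation, not a claim from the PR itself): both the direct kernel and the GEMM route perform roughly the same number of multiply-accumulates, but the GEMM route materializes an intermediate column buffer, and the GEMM's reduction dimension is $IC$:

$$
\text{MACs} \approx IC \cdot OC \cdot K \cdot L_{\text{in}}, \qquad \text{extra memory} \approx OC \cdot K \cdot L_{\text{in}} \ \text{elements per batch}.
$$

Each element of that buffer costs $IC$ multiply-accumulates to produce, so with a large $IC$ the highly optimized GEMM dominates the runtime and wins, while with $IC \leq 2$ the GEMM degenerates to little more than a scaled copy and the extra memory traffic of writing and re-reading the buffer dominates.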
I noticed that, since I started working on this, a new `im2col_back` operator was added. While `col2im` and `im2col_back` sound similar, the code and the way they work are a bit different. They can't work as drop-in replacements for each other, but I think they could be merged into a single function; I don't know if this makes sense.

@JohannesGaessler, being the author of `im2col_back`, do you have any opinion on this PR? Unfortunately I didn't think that my work had a chance to clash with someone else's, so I didn't coordinate first.