substantial performance issues when combining Tullio and Zygote #154
Unfortunately this is a known problem, e.g. #30. I have not worked much on making this faster since then.
OK, thanks. Turns out, I needed to use `CUDA.@time`; with that change, the data makes more sense and both pullback and gradient are about equally slow, at roughly 20x-50x of the forward computation. I've updated the bug description.
For the forward pass it has a loop nest like this:
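(What follows is a hand-written sketch in the shape of the `dist_kernel` example from later in the thread, not Tullio's actual generated code, which also handles threading, tiling and bounds checks.)

```julia
# Hand-written sketch of a forward loop nest for
#   @tullio result[i, j] := (w[k, i] - x[k, j])^2
# Each output element is accumulated in a register and written once.
function dist_forward!(result, w, x)
    @inbounds for j in axes(x, 2), i in axes(w, 2)
        acc = zero(eltype(result))
        for k in axes(w, 1)
            acc += (w[k, i] - x[k, j])^2
        end
        result[i, j] = acc
    end
    return result
end
```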
The gradient looks more like this:
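(Again a hand-written sketch for the same expression: a single loop nest that accumulates into both gradient arrays with scattered read-modify-write updates.)

```julia
# Sketch of the combined backward loop nest: Δw and Δx are both updated
# inside one nest, so every iteration does two scattered accumulations
# instead of writing each output element once.
function dist_backward!(Δw, Δx, Δresult, w, x)
    @inbounds for j in axes(x, 2), i in axes(w, 2), k in axes(w, 1)
        d = 2 * Δresult[i, j] * (w[k, i] - x[k, j])
        Δw[k, i] += d
        Δx[k, j] -= d
    end
    return Δw, Δx
end
```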
One thing which could be investigated here is how inefficient this write-to-two-arrays loop nest is, compared with the same expression computing just one gradient array at a time.
If the pullback is twice as fast that's no help, but if it's 10x faster, that would motivate changing the macro to generate a separate loop nest per gradient. Not sure I ever did this comparison. Another step would be to see how much it matters to change the pattern of the generated loops themselves; that's harder to check, as you'd have to edit the code it generates to try it.
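For comparison, a sketch of the separate-loop-nest-per-gradient variant, under the same `dist_kernel` assumptions as the sketches above; note that each nest can now accumulate into a register and pick its own loop order:

```julia
# Sketch: one loop nest per gradient array. Each nest owns its output
# elements, accumulates in a register, and writes each element once.
function dist_backward_split!(Δw, Δx, Δresult, w, x)
    @inbounds for i in axes(w, 2), k in axes(w, 1)
        acc = zero(eltype(Δw))
        for j in axes(x, 2)
            acc += 2 * Δresult[i, j] * (w[k, i] - x[k, j])
        end
        Δw[k, i] = acc
    end
    @inbounds for j in axes(x, 2), k in axes(x, 1)
        acc = zero(eltype(Δx))
        for i in axes(w, 2)
            acc -= 2 * Δresult[i, j] * (w[k, i] - x[k, j])
        end
        Δx[k, j] = acc
    end
    return Δw, Δx
end
```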
Changing the macro to implement either of these is unfortunately not a small job. It's not as modular as it ought to be... I knew precisely how to do a better job around the time I stopped working much on this, of course. Less ambitiously, notice that the order of loops changes in these expressions. There is some chance it's picking a bad one, and this would be relatively easy to solve. To time this you should probably
Tullio returns symbolic derivatives that look like they are almost directly usable as Tullio expressions; why aren't those used directly? I seem to get decent performance now with a simple rrule using Tullio itself:

```julia
function dist_kernel(w::AbstractArray{Float32}, x::AbstractArray{Float32})
    @tullio grad=false result[i, j] := (w[k, i] - x[k, j])^2
end

function dist_kernel_pb(w, x)
    @tullio grad=false result[i, j] := (w[k, i] - x[k, j])^2
    return result, (Δresult) -> begin
        @tullio grad=false Δw[k, i] := + Δresult[i, j] * 2 * (w[k, i] - x[k, j])
        @tullio grad=false Δx[k, j] := - Δresult[i, j] * 2 * (w[k, i] - x[k, j])
        return (ChainRules.NoTangent(), Δw, Δx)
    end
end

ChainRules.rrule(::typeof(dist_kernel), w, x) = dist_kernel_pb(w, x)
```
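With the `rrule` in place, a minimal usage sketch (array sizes are made up) in which Zygote dispatches to it rather than Tullio's own gradient machinery:

```julia
using ChainRules, Tullio, Zygote

w = rand(Float32, 64, 128)
x = rand(Float32, 64, 256)

# Zygote picks up the ChainRules rrule defined above, so the hand-written
# Tullio pullback is used instead of the generated gradient loop nest.
gw, gx = Zygote.gradient((w, x) -> sum(dist_kernel(w, x)), w, x)
```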
I think the reason it doesn't directly write the gradients as separate calls is that it doesn't know how to solve for the shifts necessary for convolutions etc. The present rewriting just accumulates into arrays at weird indices, but it will tend to get things wrong if used more directly, like in #136. Maybe that could be solved somehow.
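A concrete illustration of the shift problem, using a made-up convolution-style expression:

```julia
using Tullio

x = randn(10); k = randn(3)

# Forward: a convolution-style contraction with a shifted index.
@tullio y[i] := x[i + j - 1] * k[j]

# The generated reverse pass scatters into x at shifted positions:
#   Δx[i + j - 1] += Δy[i] * k[j]
# Writing Δx as its own standalone einsum would mean solving
# i + j - 1 = a for i, i.e. Δx[a] = Σ_j Δy[a - j + 1] * k[j], and then
# working out the valid range of j for each a near the boundaries --
# that index-solving is what the macro doesn't attempt in general.
```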
Wouldn't that be fixable by either disallowing index computations on the LHS, or at least by special-casing expressions with no index computations on the LHS and warning about the rest? Is there some formal specification/analysis of the expressions that Tullio accepts? It's perhaps not surprising that it's difficult to reason about these expressions otherwise.
They should be disallowed on the LHS; that they aren't is a bug. But the gradient calculation always wants to move them to the left, as you're now writing into the gradient of an input array. Agree you could special-case the easy cases. So far this package hasn't done that at all, just to avoid duplication... it's already far too complicated. I have not tried to write anything up. The whole package was an exploration of what's possible / what's fast. Ideally v2 would have a more fixed scope, and could be much cleaner, but I'd need to clone myself first... I believe #70 is a logic mistake in how it handles
I was able to speed up the CPU variants ~5x by enabling threading on the Julia side (i.e. starting Julia with multiple threads).
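(Concretely, that just means launching Julia with more than one thread, e.g.:)

```julia
# Start Julia with threads, e.g. `julia -t auto`, or set JULIA_NUM_THREADS,
# then check what Tullio's threaded loop nests can actually use:
Threads.nthreads()   # needs to be > 1 for the multithreaded loops to help
```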
Also, perhaps this and/or #30 should be updated to reflect that the behaviour occurs with all reverse-mode ADs Tullio is compatible with? Then people who might be using e.g. ReverseDiff wouldn't accidentally miss it.
@ToucheSir Oops, sorry, I thought I had updated the bug report with `CUDA.@sync` etc. I have updated the gist link now. In any case, I still get a 40x ratio of dist-gpu-gradient vs dist-gpu-forward on my A6000. It looks like you only get a 7.5x ratio on your GPU. I wonder whether that's due to performance differences of the GPUs.
Make sure you're using the macros from https://github.com/JuliaCI/BenchmarkTools.jl instead of a bare `@time` / `CUDA.@time`.
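For the GPU timings that typically looks like this (a sketch, reusing `dist_kernel` and made-up sizes):

```julia
using BenchmarkTools, CUDA

w = CUDA.rand(Float32, 64, 128)
x = CUDA.rand(Float32, 64, 256)

# Interpolate arguments with `$` and synchronise the device, so that
# asynchronous kernel launches don't distort the measurement.
@btime CUDA.@sync dist_kernel($w, $x)
```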
I'm comparing performance of the built-in Dense layer with an equivalent Linear layer and a Dist layer. There are a couple of problems.
Full code, output, Manifest, Project: https://gist.github.com/12b7cfc21482419aa39e7e9dfbb7cc32
The most important definitions are these:
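(The real definitions are in the gist; purely as a hypothetical reconstruction of the kind of kernels involved, based on the `dist_kernel` posted elsewhere in this thread:)

```julia
using Tullio

# Hypothetical sketch -- the actual code is in the linked gist.
linear_kernel(w, x) = @tullio y[i, j] := w[i, k] * x[k, j]       # Tullio equivalent of Dense's W*x
dist_kernel(w, x)   = @tullio y[i, j] := (w[k, i] - x[k, j])^2   # squared-distance "Dist" layer
```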
PROBLEM 1 -- CPU PERFORMANCE
The Tullio-based Linear layer is significantly slower than the built-in Dense layer, but that is perhaps understandable, and it's fast enough to be useful.
But while the pullback/gradient are within a factor of two of the forward computation for the Dense layer, for the Tullio-based layers they are both roughly 20x slower than the forward computation.
PROBLEM 2 -- GPU PERFORMANCE
On GPU, the pattern is different: the Zygote pullback computation is within a factor of two of the forward computation, but the Zygote gradient computation is three orders of magnitude slower than the forward computation. It is baffling to me that a fast pullback can result in such a slow gradient computation.
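For concreteness, a sketch of how the pullback and gradient timings can be measured on the GPU without asynchronous execution distorting the comparison, reusing `w`, `x` and `dist_kernel` (and the made-up sizes) from the snippets above:

```julia
using BenchmarkTools, CUDA, Zygote

Δ = CUDA.ones(Float32, 128, 256)   # cotangent seed, shaped like dist_kernel(w, x)

# "Pullback" timing: forward pass, then call the returned closure on the seed.
@btime CUDA.@sync begin
    y, back = Zygote.pullback(dist_kernel, $w, $x)
    back($Δ)
end

# "Gradient" timing: the same work driven through Zygote.gradient.
@btime CUDA.@sync Zygote.gradient((w, x) -> sum(dist_kernel(w, x)), $w, $x)
```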