Inconsistent results across platforms and datatypes on CPU backend #468
I tried to run this on AMDGPU and Metal; on both backends I get something very similar to the following -
The only change I made was
This is #330 I think. On the CPU this currently won't work.
Use a tuple.
Thanks! That works, but I am still getting inconsistent results on GPU. I'll close this issue in favor of #330.
Hm, on the GPU you shouldn't get inconsistent results; on the CPU, yes.
So, on AMDGPU, using the following snippet, here is what I get -

```julia
using Pkg;
Pkg.add("AMDGPU");
Pkg.add("KernelAbstractions");
Pkg.add("Atomix");

using AMDGPU
using KernelAbstractions
using Atomix: @atomic

@kernel function self_dot_kernel!(result, @Const(vec))
    BLOCK_SIZE = @groupsize()[1]
    I = @index(Global, Linear)
    li = @index(Local, Linear)
    intermediateresults = @localmem eltype(vec) BLOCK_SIZE
    if I <= length(vec)
        intermediateresults[li] = vec[I] * vec[I]
    else
        intermediateresults[li] = 0
    end
    @synchronize()
    s = 1
    while (li + s) <= BLOCK_SIZE
        intermediateresults[li] += intermediateresults[li+s]
        @synchronize()
        s *= 2
    end
    @synchronize()
    if li == 1
        @atomic result[1] += intermediateresults[1]
    end
end

backend = ROCBackend()
for T in [Float32, Float64, Int32, Int64]
    println("===== Testing $T =====")
    for dims in [100, 24354, 12323, 13454, 32512, 314, 235, 51234, 5312]
        x = KernelAbstractions.ones(backend, T, dims)
        dotresult = KernelAbstractions.zeros(backend, T, 1)
        for i in 1:6
            kernel! = self_dot_kernel!(backend, 32)
            kernel!(dotresult, x; ndrange=length(x))
            KernelAbstractions.synchronize(backend)
            println(dotresult)
            dotresult .= 0
        end
    end
end
```

Output -
On Metal, I get -
So the only thing that seems fishy to me is this part. Do you not have a read-write race here? E.g.
You are trying to do something similar to https://github.com/JuliaGPU/CUDA.jl/blob/cb14a637e0b7b7be9ae01005ea9bdcf79b320189/src/mapreduce.jl#L62, IIUC? My long-term goal is to provide primitives for these kinds of block reductions (#234). The difference I see is that in your first iteration all workitems in the block are active and are accessing their neighbors' data (and being racy about it). In the CUDA code, half the workitems are active in the first iteration.
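For reference, a minimal sketch of that non-racy pattern in KernelAbstractions syntax, reusing the names from the kernel above and assuming a power-of-two workgroup size (32 in these snippets). It is an illustration of the sequential-addressing idea, not the exact CUDA.jl code, and (as discussed further below) it still places `@synchronize` inside a loop, which only the GPU backends tolerate:

```julia
# Sequential-addressing tree reduction: in each step only workitems with li <= s
# are active; they write slots 1:s and read slots s+1:2s, so no slot is read and
# written by different workitems within the same step.
s = BLOCK_SIZE ÷ 2      # assumes BLOCK_SIZE is a power of two
while s >= 1
    if li <= s
        intermediateresults[li] += intermediateresults[li + s]
    end
    @synchronize()
    s ÷= 2
end
```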
Yes, thanks! There is overlap in how I was indexing, which does not occur with the CUDA approach. I can confirm it works -
Some things still don't make sense to me. This works:

```julia
d = 1
while d < BLOCK_SIZE
    @synchronize()
    if 2 * d * (li - 1) + li <= BLOCK_SIZE
        intermediateresults[li] += intermediateresults[2*d*(li-1)+li]
    end
    d *= 2
end
```

This outputs only zeros:

```julia
d = 1
while d < BLOCK_SIZE
    @synchronize()
    lhsIndex = 2 * d * (li - 1) # or @private lhsIndex = 2 * d * (li - 1)
    if lhsIndex + li <= BLOCK_SIZE
        intermediateresults[lhsIndex] += intermediateresults[lhsIndex+li]
    end
    d *= 2
end
```

Assigning variables outside the loop is fine, but inside the loop is troublesome.
Are you back to testing on the CPU? I should add a pass that errors when synchronize is used inside control flow; KA's current design simply can't support that.
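For what it's worth, one way to keep every `@synchronize` at the top level of the kernel body (outside any loop or branch, which is the shape the CPU backend can handle) is to unroll the tree reduction by hand for the fixed workgroup size of 32 used in these snippets. This is just a sketch built from the kernel above, not an official KernelAbstractions recommendation:

```julia
# Hand-unrolled tree reduction for a workgroup of exactly 32 workitems.
# The adds are guarded by `if`, but every @synchronize sits at the top level.
if li <= 16; intermediateresults[li] += intermediateresults[li + 16]; end
@synchronize()
if li <= 8;  intermediateresults[li] += intermediateresults[li + 8];  end
@synchronize()
if li <= 4;  intermediateresults[li] += intermediateresults[li + 4];  end
@synchronize()
if li <= 2;  intermediateresults[li] += intermediateresults[li + 2];  end
@synchronize()
if li == 1;  intermediateresults[1]  += intermediateresults[2];       end
```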
Sorry, I forgot to mention: my previous message was based on the Metal backend.

```julia
# using Pkg;
# Pkg.add("AMDGPU");
# Pkg.add("KernelAbstractions");
# Pkg.add("Atomix");

using AMDGPU
using KernelAbstractions
using Atomix: @atomic

@kernel function self_dot_kernel!(result, @Const(vec))
    BLOCK_SIZE = @groupsize()[1]
    I = @index(Global, Linear)
    li = @index(Local, Linear)
    intermediateresults = @localmem eltype(vec) BLOCK_SIZE
    if I <= length(vec)
        intermediateresults[li] = vec[I] * vec[I]
    else
        intermediateresults[li] = 0
    end
    d = 1
    while d < BLOCK_SIZE
        @synchronize()
        lhsIndex = 2 * d * (li - 1) # or @private lhsIndex = 2 * d * (li - 1)
        if lhsIndex + li <= BLOCK_SIZE
            intermediateresults[lhsIndex] += intermediateresults[lhsIndex+li]
        end
        d *= 2
    end
    @synchronize()
    if li == 1
        @atomic result[1] += intermediateresults[1]
    end
end

backend = ROCBackend()
for T in [Float32, Float64, Int32, Int64]
    println("===== Testing $T =====")
    for dims in [100, 24354, 12323, 13454, 32512, 314, 235, 51234, 5312]
        x = KernelAbstractions.ones(backend, T, dims)
        dotresult = KernelAbstractions.zeros(backend, T, 1)
        for i in 1:6
            kernel! = self_dot_kernel!(backend, 32)
            kernel!(dotresult, x; ndrange=length(x))
            KernelAbstractions.synchronize(backend)
            println(dotresult)
            dotresult .= 0
        end
    end
end
```
On Metal, the same kernel outputs only 0.0s.
Are you not missing a
For
No, actually, it was supposed to be
A pass that errors when synchronize is used inside control flow would be great!
I am trying to write a simple reduction kernel, which seems to only work for certain datatypes, and the set of datatypes it works for is inconsistent across platforms.
Here is the kernel, which computes the sum of squares for a vector of 6000 ones -
And here are some outputs, run using Julia 1.10.2 -
Float32 and Float64 always work on apple-aarch64, which makes me think that there's nothing wrong with the algorithm. And on windows-x64 -
On windows-x64, only Float64 consistently works. What is going on?
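To make the per-datatype comparison concrete, here is a small host-side check one could run against the kernel. This is only a sketch: it assumes the `self_dot_kernel!` and `backend` defined in the snippets above, and `check_self_dot` is an illustrative name, not part of KernelAbstractions:

```julia
using KernelAbstractions

# Compare the kernel's result with a plain CPU reference for one element type.
# For a vector of n ones, the sum of squares should be exactly T(n).
function check_self_dot(backend, T; n = 6000, groupsize = 32)
    x = KernelAbstractions.ones(backend, T, n)
    result = KernelAbstractions.zeros(backend, T, 1)
    self_dot_kernel!(backend, groupsize)(result, x; ndrange = n)
    KernelAbstractions.synchronize(backend)
    expected = sum(abs2, ones(T, n))   # host-side reference
    observed = Array(result)[1]        # copy the device result back to the host
    return observed == expected
end

# e.g. check_self_dot(backend, Float32)
```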