WIP: kernels #314
Conversation
Reactant.jl Benchmarks
| Benchmark suite | Current: b7303e5 | Previous: 45ae14f | Ratio |
|---|---|---|---|
| ViT base (256 x 256 x 3 x 32)/forward/CUDA/Reactant (optimize = :after_enzyme) | 1449157594 ns | 1287700343 ns | 1.13 |
| ViT base (256 x 256 x 3 x 32)/forward/CUDA/Reactant | 1301919790 ns | 1271515659 ns | 1.02 |
| ViT base (256 x 256 x 3 x 32)/forward/CUDA/Reactant (optimize = :before_enzyme) | 1339557972 ns | 1253394269 ns | 1.07 |
| ViT base (256 x 256 x 3 x 32)/forward/CUDA/Reactant (optimize = :only_enzyme) | 3312079307 ns | 3106663633 ns | 1.07 |
| ViT base (256 x 256 x 3 x 32)/forward/CUDA/Lux | 206606524 ns | 217499591 ns | 0.95 |
| ViT base (256 x 256 x 3 x 32)/forward/CPU/Reactant (optimize = :after_enzyme) | 5262646551 ns | 6749076193 ns | 0.78 |
| ViT base (256 x 256 x 3 x 32)/forward/CPU/Reactant | 5233063986 ns | 5078740247 ns | 1.03 |
| ViT base (256 x 256 x 3 x 32)/forward/CPU/Reactant (optimize = :before_enzyme) | 5084455177 ns | 5013817961 ns | 1.01 |
| ViT base (256 x 256 x 3 x 32)/forward/CPU/Reactant (optimize = :only_enzyme) | 7686400566 ns | 7197691815 ns | 1.07 |
| ViT base (256 x 256 x 3 x 32)/forward/CPU/Lux | 26339246221 ns | 35464964244 ns | 0.74 |
| ViT small (256 x 256 x 3 x 4)/forward/CUDA/Reactant (optimize = :after_enzyme) | 1300005635 ns | 1257317145 ns | 1.03 |
| ViT small (256 x 256 x 3 x 4)/forward/CUDA/Reactant | 1278041149 ns | 1424374803 ns | 0.90 |
| ViT small (256 x 256 x 3 x 4)/forward/CUDA/Reactant (optimize = :before_enzyme) | 1261990698 ns | 1350049098 ns | 0.93 |
| ViT small (256 x 256 x 3 x 4)/forward/CUDA/Reactant (optimize = :only_enzyme) | 3125146586 ns | 3052800629 ns | 1.02 |
| ViT small (256 x 256 x 3 x 4)/forward/CUDA/Lux | 8879631 ns | 8862682 ns | 1.00 |
| ViT small (256 x 256 x 3 x 4)/forward/CPU/Reactant (optimize = :after_enzyme) | 1550527051 ns | 1572590140 ns | 0.99 |
| ViT small (256 x 256 x 3 x 4)/forward/CPU/Reactant | 1552400963 ns | 1559474266 ns | 1.00 |
| ViT small (256 x 256 x 3 x 4)/forward/CPU/Reactant (optimize = :before_enzyme) | 1552125020 ns | 1557501067 ns | 1.00 |
| ViT small (256 x 256 x 3 x 4)/forward/CPU/Reactant (optimize = :only_enzyme) | 3310850083 ns | 3290628669 ns | 1.01 |
| ViT small (256 x 256 x 3 x 4)/forward/CPU/Lux | 2775956032 ns | 2876354148 ns | 0.97 |
| ViT tiny (256 x 256 x 3 x 32)/forward/CUDA/Reactant (optimize = :after_enzyme) | 1303015586 ns | 1231219515 ns | 1.06 |
| ViT tiny (256 x 256 x 3 x 32)/forward/CUDA/Reactant | 1272928755 ns | 1441159242 ns | 0.88 |
| ViT tiny (256 x 256 x 3 x 32)/forward/CUDA/Reactant (optimize = :before_enzyme) | 1311413197 ns | 1282010253 ns | 1.02 |
| ViT tiny (256 x 256 x 3 x 32)/forward/CUDA/Reactant (optimize = :only_enzyme) | 3028555629 ns | 3051584957 ns | 0.99 |
| ViT tiny (256 x 256 x 3 x 32)/forward/CUDA/Lux | 22655396 ns | 22776746 ns | 0.99 |
| ViT tiny (256 x 256 x 3 x 32)/forward/CPU/Reactant (optimize = :after_enzyme) | 2140398211 ns | 2154505585 ns | 0.99 |
| ViT tiny (256 x 256 x 3 x 32)/forward/CPU/Reactant | 2200393344 ns | 2139776302 ns | 1.03 |
| ViT tiny (256 x 256 x 3 x 32)/forward/CPU/Reactant (optimize = :before_enzyme) | 2142222871 ns | 2123332313 ns | 1.01 |
| ViT tiny (256 x 256 x 3 x 32)/forward/CPU/Reactant (optimize = :only_enzyme) | 3897215106 ns | 3879039560 ns | 1.00 |
| ViT tiny (256 x 256 x 3 x 32)/forward/CPU/Lux | 5312568392 ns | 5729200009 ns | 0.93 |
| ViT tiny (256 x 256 x 3 x 4)/forward/CUDA/Reactant (optimize = :after_enzyme) | 1307990936 ns | 1259798635 ns | 1.04 |
| ViT tiny (256 x 256 x 3 x 4)/forward/CUDA/Reactant | 1301819826 ns | 1262851193 ns | 1.03 |
| ViT tiny (256 x 256 x 3 x 4)/forward/CUDA/Reactant (optimize = :before_enzyme) | 1284427966 ns | 1266665882 ns | 1.01 |
| ViT tiny (256 x 256 x 3 x 4)/forward/CUDA/Reactant (optimize = :only_enzyme) | 3169837598 ns | 3319553871 ns | 0.95 |
| ViT tiny (256 x 256 x 3 x 4)/forward/CUDA/Lux | 7453064 ns | 7445203.5 ns | 1.00 |
| ViT tiny (256 x 256 x 3 x 4)/forward/CPU/Reactant (optimize = :after_enzyme) | 1409136279 ns | 1424258021 ns | 0.99 |
| ViT tiny (256 x 256 x 3 x 4)/forward/CPU/Reactant | 1409545691 ns | 1421721118 ns | 0.99 |
| ViT tiny (256 x 256 x 3 x 4)/forward/CPU/Reactant (optimize = :before_enzyme) | 1414236404 ns | 1420742881 ns | 1.00 |
| ViT tiny (256 x 256 x 3 x 4)/forward/CPU/Reactant (optimize = :only_enzyme) | 3151606700 ns | 3162578762 ns | 1.00 |
| ViT tiny (256 x 256 x 3 x 4)/forward/CPU/Lux | 1654006772.5 ns | 2138106366 ns | 0.77 |
| ViT tiny (256 x 256 x 3 x 16)/forward/CUDA/Reactant (optimize = :after_enzyme) | 1291669432 ns | 1297050944 ns | 1.00 |
| ViT tiny (256 x 256 x 3 x 16)/forward/CUDA/Reactant | 1265833403 ns | 1403907055 ns | 0.90 |
| ViT tiny (256 x 256 x 3 x 16)/forward/CUDA/Reactant (optimize = :before_enzyme) | 1278433111 ns | 1269229731 ns | 1.01 |
| ViT tiny (256 x 256 x 3 x 16)/forward/CUDA/Reactant (optimize = :only_enzyme) | 3126809956 ns | 3063143344 ns | 1.02 |
| ViT tiny (256 x 256 x 3 x 16)/forward/CUDA/Lux | 12328188 ns | 12347497 ns | 1.00 |
| ViT tiny (256 x 256 x 3 x 16)/forward/CPU/Reactant (optimize = :after_enzyme) | 1741906628 ns | 1721006513 ns | 1.01 |
| ViT tiny (256 x 256 x 3 x 16)/forward/CPU/Reactant | 1731592537 ns | 1711405549 ns | 1.01 |
| ViT tiny (256 x 256 x 3 x 16)/forward/CPU/Reactant (optimize = :before_enzyme) | 1720273302 ns | 1704835369 ns | 1.01 |
| ViT tiny (256 x 256 x 3 x 16)/forward/CPU/Reactant (optimize = :only_enzyme) | 3450588571 ns | 3443971150 ns | 1.00 |
| ViT tiny (256 x 256 x 3 x 16)/forward/CPU/Lux | 2948602836 ns | 3110298785 ns | 0.95 |
| ViT small (256 x 256 x 3 x 16)/forward/CUDA/Reactant (optimize = :after_enzyme) | 1494612899 ns | 1266729302 ns | 1.18 |
| ViT small (256 x 256 x 3 x 16)/forward/CUDA/Reactant | 1311317968 ns | 1308873395 ns | 1.00 |
| ViT small (256 x 256 x 3 x 16)/forward/CUDA/Reactant (optimize = :before_enzyme) | 1492915221 ns | 1275958493 ns | 1.17 |
| ViT small (256 x 256 x 3 x 16)/forward/CUDA/Reactant (optimize = :only_enzyme) | 3115105513 ns | 3081413477 ns | 1.01 |
| ViT small (256 x 256 x 3 x 16)/forward/CUDA/Lux | 27412509 ns | 27435162 ns | 1.00 |
| ViT small (256 x 256 x 3 x 16)/forward/CPU/Reactant (optimize = :after_enzyme) | 2228730818 ns | 2169947879 ns | 1.03 |
| ViT small (256 x 256 x 3 x 16)/forward/CPU/Reactant | 2334825207 ns | 2163945294 ns | 1.08 |
| ViT small (256 x 256 x 3 x 16)/forward/CPU/Reactant (optimize = :before_enzyme) | 2310305349 ns | 2151891950 ns | 1.07 |
| ViT small (256 x 256 x 3 x 16)/forward/CPU/Reactant (optimize = :only_enzyme) | 3944966197 ns | 3946269320 ns | 1.00 |
| ViT small (256 x 256 x 3 x 16)/forward/CPU/Lux | 6131212634 ns | 6287057122 ns | 0.98 |
| ViT small (256 x 256 x 3 x 32)/forward/CUDA/Reactant (optimize = :after_enzyme) | 1303567764 ns | 1260705673 ns | 1.03 |
| ViT small (256 x 256 x 3 x 32)/forward/CUDA/Reactant | 1424871003 ns | 1369717954 ns | 1.04 |
| ViT small (256 x 256 x 3 x 32)/forward/CUDA/Reactant (optimize = :before_enzyme) | 1275689864 ns | 1281076652 ns | 1.00 |
| ViT small (256 x 256 x 3 x 32)/forward/CUDA/Reactant (optimize = :only_enzyme) | 3045934410 ns | 3130042297 ns | 0.97 |
| ViT small (256 x 256 x 3 x 32)/forward/CUDA/Lux | 52971586 ns | 53036705.5 ns | 1.00 |
| ViT small (256 x 256 x 3 x 32)/forward/CPU/Reactant (optimize = :after_enzyme) | 3055665974 ns | 3050356994 ns | 1.00 |
| ViT small (256 x 256 x 3 x 32)/forward/CPU/Reactant | 3021313773 ns | 3082997102 ns | 0.98 |
| ViT small (256 x 256 x 3 x 32)/forward/CPU/Reactant (optimize = :before_enzyme) | 3053225043 ns | 2965563203 ns | 1.03 |
| ViT small (256 x 256 x 3 x 32)/forward/CPU/Reactant (optimize = :only_enzyme) | 4887749197 ns | 4841087626 ns | 1.01 |
| ViT small (256 x 256 x 3 x 32)/forward/CPU/Lux | 11183611226 ns | 8484129480 ns | 1.32 |
| ViT base (256 x 256 x 3 x 16)/forward/CUDA/Reactant (optimize = :after_enzyme) | 1300865042 ns | 1260921375 ns | 1.03 |
| ViT base (256 x 256 x 3 x 16)/forward/CUDA/Reactant | 1295735580 ns | 1253872568 ns | 1.03 |
| ViT base (256 x 256 x 3 x 16)/forward/CUDA/Reactant (optimize = :before_enzyme) | 1232925244 ns | 1479498539 ns | 0.83 |
| ViT base (256 x 256 x 3 x 16)/forward/CUDA/Reactant (optimize = :only_enzyme) | 2922815725 ns | 3113671601 ns | 0.94 |
| ViT base (256 x 256 x 3 x 16)/forward/CUDA/Lux | 71283297 ns | 71338519.5 ns | 1.00 |
| ViT base (256 x 256 x 3 x 16)/forward/CPU/Reactant (optimize = :after_enzyme) | 3270546818 ns | 3125511597 ns | 1.05 |
| ViT base (256 x 256 x 3 x 16)/forward/CPU/Reactant | 3230464036 ns | 3098530069 ns | 1.04 |
| ViT base (256 x 256 x 3 x 16)/forward/CPU/Reactant (optimize = :before_enzyme) | 3254041312 ns | 3115589553 ns | 1.04 |
| ViT base (256 x 256 x 3 x 16)/forward/CPU/Reactant (optimize = :only_enzyme) | 5162220727 ns | 5036626230 ns | 1.02 |
| ViT base (256 x 256 x 3 x 16)/forward/CPU/Lux | 15170850681 ns | 11289651474 ns | 1.34 |
| ViT base (256 x 256 x 3 x 4)/forward/CUDA/Reactant (optimize = :after_enzyme) | 1278655290 ns | 1339569725 ns | 0.95 |
| ViT base (256 x 256 x 3 x 4)/forward/CUDA/Reactant | 1229847740 ns | 1259019883 ns | 0.98 |
| ViT base (256 x 256 x 3 x 4)/forward/CUDA/Reactant (optimize = :before_enzyme) | 1439473418 ns | 1254828379 ns | 1.15 |
| ViT base (256 x 256 x 3 x 4)/forward/CUDA/Reactant (optimize = :only_enzyme) | 2922143773 ns | 2975337456 ns | 0.98 |
| ViT base (256 x 256 x 3 x 4)/forward/CUDA/Lux | 20699816 ns | 20758936 ns | 1.00 |
| ViT base (256 x 256 x 3 x 4)/forward/CPU/Reactant (optimize = :after_enzyme) | 1963950807 ns | 1859519475 ns | 1.06 |
| ViT base (256 x 256 x 3 x 4)/forward/CPU/Reactant | 2218798778 ns | 1869845638 ns | 1.19 |
| ViT base (256 x 256 x 3 x 4)/forward/CPU/Reactant (optimize = :before_enzyme) | 2058391749 ns | 1850101657 ns | 1.11 |
| ViT base (256 x 256 x 3 x 4)/forward/CPU/Reactant (optimize = :only_enzyme) | 3614980515 ns | 3593739548 ns | 1.01 |
| ViT base (256 x 256 x 3 x 4)/forward/CPU/Lux | 3206903233.5 ns | 3325189113.5 ns | 0.96 |
This comment was automatically generated by workflow using github-action-benchmark.
Branch force-pushed from be52876 to b8e3570.
Remaining comments which cannot be posted as a review comment to avoid GitHub Rate Limit
JuliaFormatter
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 371 in 7c7c8ed
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 376 in 7c7c8ed
LLVMFunc{F,job.source.specTypes}(f, modstr, image, LLVM.name(entry))
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Lines 386 to 389 in 7c7c8ed
f::F
mod::String
image
entry::String
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Lines 392 to 393 in 7c7c8ed
function (func::LLVMFunc{F,tt})(args...; blocks::CUDA.CuDim=1, threads::CUDA.CuDim=1,
    shmem::Integer=0) where{F, tt}
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Lines 403 to 418 in 7c7c8ed
@show a
@assert a isa CuDeviceArray
ta = Base.pointer_to_objref(a.ptr)::TracedRArray
arg = ta.mlir_data
arg = Reactant.Compiler.transpose_val(arg)
push!(restys, MLIR.IR.Type(arg))
push!(aliases,
    MLIR.IR.Dialects.stablehlo.stablehloOutputOperandAliasGet(
        MLIR.IR.context(),
        len(args) == 1 ? 0 : 1,
        len(args) == 1 ? C_NULL : Ref{Int64}(i-1),
        i-1,
        0,
        C_NULL
    )
)
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Lines 421 to 422 in 7c7c8ed
output_operand_aliases=MLIR.ArrayAttr.get(MLIR.IR.context(), aliases)
MLIR.IR.Dialects.stablehlo.custom_call(mlir_args; result_0=restys, call_target_name="reactant_gpu_call", output_operand_aliases)
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 430 in 7c7c8ed
const _compiler_caches = Dict{MLIR.IR.Context, Dict{Any, LLVMFunc}}();
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 434 in 7c7c8ed
cache = Dict{Any, LLVMFunc}()
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 443 in 7c7c8ed
cache = compiler_cache(MLIR.IR.context())
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 446 in 7c7c8ed
cuda = CUDA.active_state()
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 450 in 7c7c8ed
res
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 470 in 7c7c8ed
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Lines 480 to 489 in 7c7c8ed
arginfo2 = ArgInfo(
    if fargs isa Nothing
        nothing
    else
        [:($(recufunction)), fargs[2:end]...]
    end,
    [Core.Const(recufunction), argtypes[2:end]...],
)
return abstract_call_known(interp, recufunction, arginfo2, si, sv, max_methods)
Base.unsafe_convert(Core.LLVMPtr{T,A}, x) + Base._memory_offset(x, i)
end
[JuliaFormatter] reported by reviewdog 🐶
unsafe_cached_load(pointer(A), index)
end
[JuliaFormatter] reported by reviewdog 🐶
# deprecated
Base.@propagate_inbounds ldg(A::CuTracedArray, i1::Integer) = const_arrayref(A, i1)
[JuliaFormatter] reported by reviewdog 🐶
return CuTracedArray{T,N,A}(reinterpret(Core.LLVMPtr{T,A}, a.ptr), osize, a.maxsize)
end
[JuliaFormatter] reported by reviewdog 🐶
[JuliaFormatter] reported by reviewdog 🐶
Remaining comments which cannot be posted as a review comment to avoid GitHub Rate Limit
@avik-pal @vchuravy @Pangoraw @jumerckx @mofeing obviously this is presently a Reactant/Enzyme/Cassette frankenstein in utils.jl, but cc'ing for visibility, and in case you see an obvious reason I get the following error:
I realized while doing this that a prerequisite we need is supporting custom abstract interpretation (absint) through indirect function calls, so most of utils.jl tries to do this (and incidentally, in doing so, starts to set up the infrastructure we will need later for automated detection of if/for/etc. into traced versions). Of course, at the moment everything breaks, but anyway.
@aviatesk obviously this is still very much in progress, but I'm wondering if you can give it a look? Essentially, the absint part is trying to emit a call that forces the use of our abstract interpreter for all type-unstable calls (unlike Cassette, we run inference and optimizations first). Eventually we will also do more with the IR, like raising control flow into nice structures, but that's a future TODO.
I need to finish up registration, but kernel compile and execute (phase 1) now works!

#= /home/wsmoses/Reactant.jl/test/cuda.jl:22 =# @code_hlo(square!(A)) = module {
func.func @main(%arg0: tensor<64xi64>) -> tensor<64xi64> {
%0 = stablehlo.custom_call @reactant_gpu_call(%arg0) {output_operand_aliases = [#stablehlo.output_operand_alias<output_tuple_indices = [], operand_index = 0, operand_tuple_indices = []>]} : (tensor<64xi64>) -> tensor<64xi64>
return %0 : tensor<64xi64>
}
}
(res, xs) = (64 cu traced array at 0x00007f57d89256f0, TracedRArray{Int64,1N}(((:args, 1),), size=(64,)))
("recufunction", f, tt) = ("recufunction", square_kernel!, Tuple{ReactantCUDAExt.CuTracedArray{Int64, 1, 1, (64,)}})
args = (64 cu traced array at 0x00007f57d89256f0,)
call_kwargs = Base.Pairs{Symbol, Union{}, Tuple{}, @NamedTuple{}}()
a = 64 cu traced array at 0x00007f57d89256f0
arg = %1 = "stablehlo.transpose"(%0) <{permutation = array<i64: 0>}> : (tensor<64xi64>) -> tensor<64xi64>
terminate called after throwing an instance of 'xla::XlaRuntimeError'
what(): UNIMPLEMENTED: No registered implementation for custom call to reactant_gpu_call for platform CUDA
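For context, the kernel under test is not quoted in this thread. Below is a minimal sketch consistent with the emitted module above and the test snippet further down; the kernel body and launch configuration are assumptions of this sketch, not the PR's exact code:

```julia
using CUDA, Reactant

# Hypothetical reconstruction of the kernel being traced: square each element in place.
function square_kernel!(x)
    i = threadIdx().x      # one thread per element; launched with 64 threads below
    x[i] = x[i] * x[i]
    return nothing
end

# Host-side launcher; Reactant traces this and lowers the launch to the
# stablehlo.custom_call @reactant_gpu_call shown in the module above.
function square!(x)
    @cuda threads = length(x) square_kernel!(x)
    return nothing
end

A = Reactant.to_rarray(collect(Int64, 1:64))
@code_hlo optimize = false square!(A)
```

The runtime error above then comes from executing that lowered call: the reactant_gpu_call custom-call target still has to be registered with XLA for the CUDA platform before the compiled function can run.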
Also cc'ing for visibility @ivanradanov @ftynse: we now [almost] have a single MLIR file with host and device code for an entire Julia application.
ext/ReactantCUDAExt.jl
Outdated
end

output_operand_aliases=MLIR.IR.Attribute(aliases)
call = MLIR.Dialects.stablehlo.custom_call(mlir_args; result_0=restys, call_target_name="reactant_gpu_call", output_operand_aliases)
FYI, to pass some information to the XLA callback of this kernel, use the backend_config
(https://github.com/openxla/stablehlo/blob/ef176a130f28196dcb4a5735d0f2f6ed0f85bd5d/stablehlo/dialect/StablehloOps.td#L2391) string attribute. That attribute gets forwarded to the "opaque" pair of arguments (pointer + size) as a C string.
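A minimal sketch of that suggestion applied at the call-construction site quoted above (mlir_args, restys, and aliases are the surrounding locals from the quoted hunk; kernel_entry_name is a hypothetical placeholder for whatever string the callback should receive):

```julia
# Attach a backend_config string to the custom call; XLA forwards it to the
# registered callback as the (opaque, opaque_len) pair.
output_operand_aliases = MLIR.IR.Attribute(aliases)
call = MLIR.Dialects.stablehlo.custom_call(
    mlir_args;
    result_0=restys,
    call_target_name="reactant_gpu_call",
    output_operand_aliases,
    backend_config=MLIR.IR.Attribute(kernel_entry_name),  # hypothetical: e.g. the mangled kernel symbol
)
```

A later revision of this PR does take this route: the emitted module further down carries the mangled kernel name in backend_config.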
@testset "Square Kernel" begin
    oA = collect(1:1:64)
    A = Reactant.to_rarray(oA)
    @show @code_hlo optimize=false square!(A)
[JuliaFormatter] reported by reviewdog 🐶
@show @code_hlo optimize=false square!(A)
@show @code_hlo optimize = false square!(A)
Remaining comments which cannot be posted as a review comment to avoid GitHub Rate Limit
JuliaFormatter
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 243 in 0c61f5d
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 256 in 0c61f5d
mod, meta = GPUCompiler.compile(:llvm, job; optimize=false, cleanup=false, validate=false) |
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 258 in 0c61f5d
opt_level = 2 |
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 260 in 0c61f5d
LLVM.@dispose pb=LLVM.NewPMPassBuilder() begin |
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 292 in 0c61f5d
modstr = string(mod) |
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Lines 294 to 295 in 0c61f5d
# This is a bit weird since we're taking a module from julia's llvm into reactant's llvm version | |
# it is probably safer to reparse a string using the right llvm module api, so we will do that. |
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Lines 298 to 450 in 0c61f5d
mmod = MLIR.IR.Module(@ccall MLIR.API.mlir_c.ConvertLLVMStrToMLIR(modstr::Cstring, MLIR.IR.context()::MLIR.API.MlirContext)::MLIR.API.MlirModule) | |
@show mmod | |
# check if we'll need the device runtime | |
undefined_fs = filter(collect(CUDA.LLVM.functions(meta.ir))) do f | |
CUDA.LLVM.isdeclaration(f) && !CUDA.LLVM.isintrinsic(f) | |
end | |
intrinsic_fns = ["vprintf", "malloc", "free", "__assertfail", | |
"__nvvm_reflect" #= TODO: should have been optimized away =#] | |
needs_cudadevrt = !isempty(setdiff(CUDA.LLVM.name.(undefined_fs), intrinsic_fns)) | |
# prepare invocations of CUDA compiler tools | |
ptxas_opts = String[] | |
nvlink_opts = String[] | |
## debug flags | |
if Base.JLOptions().debug_level == 1 | |
push!(ptxas_opts, "--generate-line-info") | |
elseif Base.JLOptions().debug_level >= 2 | |
push!(ptxas_opts, "--device-debug") | |
push!(nvlink_opts, "--debug") | |
end | |
## relocatable device code | |
if needs_cudadevrt | |
push!(ptxas_opts, "--compile-only") | |
end | |
ptx = job.config.params.ptx | |
cap = job.config.params.cap | |
arch = "sm_$(cap.major)$(cap.minor)" | |
# validate use of parameter memory | |
argtypes = filter([CUDA.KernelState, job.source.specTypes.parameters...]) do dt | |
!CUDA.isghosttype(dt) && !Core.Compiler.isconstType(dt) | |
end | |
param_usage = sum(sizeof, argtypes) | |
param_limit = 4096 | |
if cap >= v"7.0" && ptx >= v"8.1" | |
param_limit = 32764 | |
end | |
if param_usage > param_limit | |
msg = """Kernel invocation uses too much parameter memory. | |
$(Base.format_bytes(param_usage)) exceeds the $(Base.format_bytes(param_limit)) limit imposed by sm_$(cap.major)$(cap.minor) / PTX v$(ptx.major).$(ptx.minor).""" | |
try | |
details = "\n\nRelevant parameters:" | |
source_types = job.source.specTypes.parameters | |
source_argnames = Base.method_argnames(job.source.def) | |
while length(source_argnames) < length(source_types) | |
# this is probably due to a trailing vararg; repeat its name | |
push!(source_argnames, source_argnames[end]) | |
end | |
for (i, typ) in enumerate(source_types) | |
if CUDA.isghosttype(typ) || Core.Compiler.isconstType(typ) | |
continue | |
end | |
name = source_argnames[i] | |
details *= "\n [$(i-1)] $name::$typ uses $(Base.format_bytes(sizeof(typ)))" | |
end | |
details *= "\n" | |
if cap >= v"7.0" && ptx < v"8.1" && param_usage < 32764 | |
details *= "\nNote: use a newer CUDA to support more parameters on your device.\n" | |
end | |
msg *= details | |
catch err | |
@error "Failed to analyze kernel parameter usage; please file an issue with a reproducer." | |
end | |
error(msg) | |
end | |
# compile to machine code | |
# NOTE: we use tempname since mktemp doesn't support suffixes, and mktempdir is slow | |
ptx_input = tempname(cleanup=false) * ".ptx" | |
ptxas_output = tempname(cleanup=false) * ".cubin" | |
write(ptx_input, asm) | |
# we could use the driver's embedded JIT compiler, but that has several disadvantages: | |
# 1. fixes and improvements are slower to arrive, by using `ptxas` we only need to | |
# upgrade the toolkit to get a newer compiler; | |
# 2. version checking is simpler, we otherwise need to use NVML to query the driver | |
# version, which is hard to correlate to PTX JIT improvements; | |
# 3. if we want to be able to use newer (minor upgrades) of the CUDA toolkit on an | |
# older driver, we should use the newer compiler to ensure compatibility. | |
append!(ptxas_opts, [ | |
"--verbose", | |
"--gpu-name", arch, | |
"--output-file", ptxas_output, | |
ptx_input | |
]) | |
proc, log = CUDA.run_and_collect(`$(CUDA.ptxas()) $ptxas_opts`) | |
log = strip(log) | |
if !success(proc) | |
reason = proc.termsignal > 0 ? "ptxas received signal $(proc.termsignal)" : | |
"ptxas exited with code $(proc.exitcode)" | |
msg = "Failed to compile PTX code ($reason)" | |
msg *= "\nInvocation arguments: $(join(ptxas_opts, ' '))" | |
if !isempty(log) | |
msg *= "\n" * log | |
end | |
msg *= "\nIf you think this is a bug, please file an issue and attach $(ptx_input)" | |
if parse(Bool, get(ENV, "BUILDKITE", "false")) | |
run(`buildkite-agent artifact upload $(ptx_input)`) | |
end | |
error(msg) | |
elseif !isempty(log) | |
@debug "PTX compiler log:\n" * log | |
end | |
rm(ptx_input) | |
# link device libraries, if necessary | |
# | |
# this requires relocatable device code, which prevents certain optimizations and | |
# hurts performance. as such, we only do so when absolutely necessary. | |
# TODO: try LTO, `--link-time-opt --nvvmpath /opt/cuda/nvvm`. | |
# fails with `Ignoring -lto option because no LTO objects found` | |
if needs_cudadevrt | |
nvlink_output = tempname(cleanup=false) * ".cubin" | |
append!(nvlink_opts, [ | |
"--verbose", "--extra-warnings", | |
"--arch", arch, | |
"--library-path", dirname(libcudadevrt), | |
"--library", "cudadevrt", | |
"--output-file", nvlink_output, | |
ptxas_output | |
]) | |
proc, log = run_and_collect(`$(CUDA.nvlink()) $nvlink_opts`) | |
log = strip(log) | |
if !success(proc) | |
reason = proc.termsignal > 0 ? "nvlink received signal $(proc.termsignal)" : | |
"nvlink exited with code $(proc.exitcode)" | |
msg = "Failed to link PTX code ($reason)" | |
msg *= "\nInvocation arguments: $(join(nvlink_opts, ' '))" | |
if !isempty(log) | |
msg *= "\n" * log | |
end | |
msg *= "\nIf you think this is a bug, please file an issue and attach $(ptxas_output)" | |
error(msg) | |
elseif !isempty(log) | |
@debug "PTX linker info log:\n" * log | |
end | |
rm(ptxas_output) | |
image = read(nvlink_output) | |
rm(nvlink_output) | |
else | |
image = read(ptxas_output) | |
rm(ptxas_output) | |
end | |
modstr, image, meta.entry |
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 452 in 0c61f5d
LLVMFunc{job.source.specTypes.parameters[1],job.source.specTypes}(nothing, modstr, image, CUDA.LLVM.name(entry)) |
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Lines 468 to 469 in 0c61f5d
Reactant.@reactant_override @noinline function (func::LLVMFunc{F,tt})(args...; convert=Val(false), blocks::CuDim=1, threads::CuDim=1, | |
cooperative::Bool=false, shmem::Integer=0, call_kwargs...) where{F, tt} |
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Lines 481 to 499 in 0c61f5d
@show a | |
@assert a isa CuTracedArray | |
ta = Base.unsafe_pointer_to_objref(Base.reinterpret(Ptr{Cvoid}, a.ptr))::TracedRArray | |
push!(rarrays, ta) | |
arg = ta.mlir_data | |
arg = transpose_val(arg) | |
@show arg | |
push!(restys, MLIR.IR.type(arg)) | |
push!(mlir_args, arg) | |
push!(aliases, | |
MLIR.IR.Attribute(MLIR.API.stablehloOutputOperandAliasGet( | |
MLIR.IR.context(), | |
length(args) == 1 ? 0 : 1, | |
length(args) == 1 ? C_NULL : Ref{Int64}(i-1), | |
i-1, | |
0, | |
C_NULL | |
)) | |
) |
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Lines 502 to 503 in 0c61f5d
output_operand_aliases=MLIR.IR.Attribute(aliases) | |
call = MLIR.Dialects.stablehlo.custom_call(mlir_args; result_0=restys, call_target_name="reactant_gpu_call", output_operand_aliases, backend_config=MLIR.IR.Attribute("configstr")) |
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 506 in 0c61f5d
res.mlir_data = transpose_val(MLIR.IR.result(call, i)) |
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 515 in 0c61f5d
const _compiler_caches = Dict{MLIR.IR.Context, Dict{Any, LLVMFunc}}(); |
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 519 in 0c61f5d
cache = Dict{Any, LLVMFunc}() |
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 525 in 0c61f5d
Reactant.@reactant_override @noinline function CUDA.cufunction(f::F, tt::TT=Tuple{}; kwargs...) where {F,TT} |
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 529 in 0c61f5d
cache = compiler_cache(MLIR.IR.context()) |
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 532 in 0c61f5d
# cuda = CUDA.active_state() |
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Lines 535 to 543 in 0c61f5d
cuda_cap=v"5.0" | |
cuda_ptx=v"6.3" | |
llvm_cap=v"5.0" | |
llvm_ptx=v"6.3" | |
kernel=true | |
always_inline=false | |
name=nothing | |
debuginfo=false | |
config = CUDA.CompilerConfig(CUDA.PTXCompilerTarget(; cap=llvm_cap, ptx=llvm_ptx, debuginfo), CUDA.CUDACompilerParams(; cap=cuda_cap, ptx=cuda_ptx); kernel, name, always_inline) |
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 547 in 0c61f5d
res |
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 551 in 0c61f5d
[JuliaFormatter] reported by reviewdog 🐶
Lines 370 to 381 in 0c61f5d
for (i, inst) in enumerate(ir.stmts) | |
@static if VERSION < v"1.11" | |
changed, next = rewrite_inst(inst[:inst], ir, interp) | |
Core.Compiler.setindex!(ir.stmts[i], next, :inst) | |
else | |
changed, next = rewrite_inst(inst[:stmt], ir, interp) | |
Core.Compiler.setindex!(ir.stmts[i], next, :stmt) | |
end | |
if changed | |
any_changed = true | |
Core.Compiler.setindex!(ir.stmts[i], Any, :type) | |
end |
using Reactant: | ||
Reactant, TracedRArray, AnyTracedRArray, MLIR, TracedRNumber |
[JuliaFormatter] reported by reviewdog 🐶
using Reactant: | |
Reactant, TracedRArray, AnyTracedRArray, MLIR, TracedRNumber | |
using Reactant: Reactant, TracedRArray, AnyTracedRArray, MLIR, TracedRNumber |
Base.show(io::IO, a::AT) where AT <: CuTracedArray = | ||
CUDA.Printf.@printf(io, "%s cu traced array at %p", join(size(a), '×'), Int(pointer(a))) |
[JuliaFormatter] reported by reviewdog 🐶
Base.show(io::IO, a::AT) where AT <: CuTracedArray = | |
CUDA.Printf.@printf(io, "%s cu traced array at %p", join(size(a), '×'), Int(pointer(a))) | |
function Base.show(io::IO, a::AT) where {AT<:CuTracedArray} | |
CUDA.Printf.@printf(io, "%s cu traced array at %p", join(size(a), '×'), Int(pointer(a))) | |
end |
Base.elsize(::Type{<:CuTracedArray{T}}) where {T} = sizeof(T) | ||
Base.size(g::CuTracedArray{T,N,A,Size}) where {T,N,A,Size} = Size | ||
Base.sizeof(x::CuTracedArray) = Base.elsize(x) * length(x) | ||
Base.pointer(x::CuTracedArray{T,<:Any,A}) where {T,A} = Base.unsafe_convert(Core.LLVMPtr{T,A}, x) |
[JuliaFormatter] reported by reviewdog 🐶
Base.pointer(x::CuTracedArray{T,<:Any,A}) where {T,A} = Base.unsafe_convert(Core.LLVMPtr{T,A}, x) | |
function Base.pointer(x::CuTracedArray{T,<:Any,A}) where {T,A} | |
return Base.unsafe_convert(Core.LLVMPtr{T,A}, x) | |
end |
Base.sizeof(x::CuTracedArray) = Base.elsize(x) * length(x) | ||
Base.pointer(x::CuTracedArray{T,<:Any,A}) where {T,A} = Base.unsafe_convert(Core.LLVMPtr{T,A}, x) | ||
@inline function Base.pointer(x::CuTracedArray{T,<:Any,A}, i::Integer) where {T,A} | ||
Base.unsafe_convert(Core.LLVMPtr{T,A}, x) + Base._memory_offset(x, i) |
[JuliaFormatter] reported by reviewdog 🐶
Base.unsafe_convert(Core.LLVMPtr{T,A}, x) + Base._memory_offset(x, i) | |
return Base.unsafe_convert(Core.LLVMPtr{T,A}, x) + Base._memory_offset(x, i) |
Base.unsafe_convert(::Type{Core.LLVMPtr{T,A}}, x::CuTracedArray{T,<:Any,A}) where {T,A} = | ||
x.ptr | ||
[JuliaFormatter] reported by reviewdog 🐶
Base.unsafe_convert(::Type{Core.LLVMPtr{T,A}}, x::CuTracedArray{T,<:Any,A}) where {T,A} = | |
x.ptr | |
function Base.unsafe_convert( | |
::Type{Core.LLVMPtr{T,A}}, x::CuTracedArray{T,<:Any,A} | |
) where {T,A} | |
return x.ptr | |
end |
Base.@propagate_inbounds Base.getindex(A::CuTracedArray, | ||
I::Union{Integer, CartesianIndex}...) = | ||
A[Base._to_linear_index(A, to_indices(A, I)...)] | ||
Base.@propagate_inbounds Base.setindex!(A::CuTracedArray, x, | ||
I::Union{Integer, CartesianIndex}...) = | ||
A[Base._to_linear_index(A, to_indices(A, I)...)] = x | ||
|
[JuliaFormatter] reported by reviewdog 🐶
Base.@propagate_inbounds Base.getindex(A::CuTracedArray, | |
I::Union{Integer, CartesianIndex}...) = | |
A[Base._to_linear_index(A, to_indices(A, I)...)] | |
Base.@propagate_inbounds Base.setindex!(A::CuTracedArray, x, | |
I::Union{Integer, CartesianIndex}...) = | |
A[Base._to_linear_index(A, to_indices(A, I)...)] = x | |
Base.@propagate_inbounds Base.getindex( | |
A::CuTracedArray, I::Union{Integer,CartesianIndex}... | |
) = A[Base._to_linear_index(A, to_indices(A, I)...)] | |
Base.@propagate_inbounds Base.setindex!( | |
A::CuTracedArray, x, I::Union{Integer,CartesianIndex}... | |
) = A[Base._to_linear_index(A, to_indices(A, I)...)] = x |
err = GPUArrays._reinterpret_exception(T, a) | ||
err === nothing || throw(err) |
[JuliaFormatter] reported by reviewdog 🐶
err = GPUArrays._reinterpret_exception(T, a) | |
err === nothing || throw(err) | |
err = GPUArrays._reinterpret_exception(T, a) | |
err === nothing || throw(err) |
if sizeof(T) == sizeof(S) # fast case | ||
return CuTracedArray{T,N,A}(reinterpret(Core.LLVMPtr{T,A}, a.ptr), size(a), a.maxsize) | ||
end |
[JuliaFormatter] reported by reviewdog 🐶
if sizeof(T) == sizeof(S) # fast case | |
return CuTracedArray{T,N,A}(reinterpret(Core.LLVMPtr{T,A}, a.ptr), size(a), a.maxsize) | |
end | |
if sizeof(T) == sizeof(S) # fast case | |
return CuTracedArray{T,N,A}( | |
reinterpret(Core.LLVMPtr{T,A}, a.ptr), size(a), a.maxsize | |
) | |
end |
isize = size(a) | ||
size1 = div(isize[1]*sizeof(S), sizeof(T)) | ||
osize = tuple(size1, Base.tail(isize)...) | ||
return CuTracedArray{T,N,A}(reinterpret(Core.LLVMPtr{T,A}, a.ptr), osize, a.maxsize) |
[JuliaFormatter] reported by reviewdog 🐶
isize = size(a) | |
size1 = div(isize[1]*sizeof(S), sizeof(T)) | |
osize = tuple(size1, Base.tail(isize)...) | |
return CuTracedArray{T,N,A}(reinterpret(Core.LLVMPtr{T,A}, a.ptr), osize, a.maxsize) | |
isize = size(a) | |
size1 = div(isize[1] * sizeof(S), sizeof(T)) | |
osize = tuple(size1, Base.tail(isize)...) | |
return CuTracedArray{T,N,A}(reinterpret(Core.LLVMPtr{T,A}, a.ptr), osize, a.maxsize) |
if prod(dims) != length(a) | ||
throw(DimensionMismatch("new dimensions (argument `dims`) must be consistent with array size (`size(a)`)")) | ||
end | ||
if N == M && dims == size(a) | ||
return a | ||
end | ||
_derived_array(a, T, dims) |
[JuliaFormatter] reported by reviewdog 🐶
if prod(dims) != length(a) | |
throw(DimensionMismatch("new dimensions (argument `dims`) must be consistent with array size (`size(a)`)")) | |
end | |
if N == M && dims == size(a) | |
return a | |
end | |
_derived_array(a, T, dims) | |
if prod(dims) != length(a) | |
throw( | |
DimensionMismatch( | |
"new dimensions (argument `dims`) must be consistent with array size (`size(a)`)", | |
), | |
) | |
end | |
if N == M && dims == size(a) | |
return a | |
end | |
return _derived_array(a, T, dims) |
I still need to make the nicer op with registration and lowerings (incl. llvm/llvm-project#119998), but we now have host and device code in an MLIR module:

#= /home/wmoses/git/Reactant.jl/test/cuda.jl:22 =# @code_hlo(square!(A)) = #tbaa_root = #llvm.tbaa_root<id = "custom_tbaa">
#tbaa_type_desc = #llvm.tbaa_type_desc<id = "custom_tbaa_addrspace(1)", members = {<#tbaa_root, 0>}>
#tbaa_tag = #llvm.tbaa_tag<base_type = #tbaa_type_desc, access_type = #tbaa_type_desc, offset = 0>
module {
llvm.func internal unnamed_addr fastcc @julia_throw_boundserror_2676() attributes {dso_local, no_inline, sym_visibility = "private"} {
llvm.unreachable
}
llvm.func local_unnamed_addr ptx_kernelcc @_Z14square_kernel_13CuTracedArrayI5Int64Ll1ELl1E5_64__E(%arg0: !llvm.array<1 x ptr<1>>) {
%0 = llvm.mlir.constant(63 : i32) : i32
%1 = nvvm.read.ptx.sreg.tid.x : i32
%2 = llvm.icmp "ugt" %1, %0 : i32
llvm.cond_br %2, ^bb2, ^bb1
^bb1: // pred: ^bb0
%3 = llvm.extractvalue %arg0[0] : !llvm.array<1 x ptr<1>>
%4 = llvm.zext %1 : i32 to i64
%5 = llvm.getelementptr inbounds %3[%4] : (!llvm.ptr<1>, i64) -> !llvm.ptr<1>, i64
%6 = llvm.load %5 {alignment = 1 : i64, tbaa = [#tbaa_tag]} : !llvm.ptr<1> -> i64
%7 = llvm.mul %6, %6 : i64
llvm.store %7, %5 {alignment = 1 : i64, tbaa = [#tbaa_tag]} : i64, !llvm.ptr<1>
llvm.return
^bb2: // pred: ^bb0
llvm.call fastcc @julia_throw_boundserror_2676() : () -> ()
llvm.unreachable
}
func.func @main(%arg0: tensor<64xi64>) -> tensor<64xi64> {
%0 = stablehlo.custom_call @reactant_gpu_call(%arg0) {backend_config = "_Z14square_kernel_13CuTracedArrayI5Int64Ll1ELl1E5_64__E", output_operand_aliases = [#stablehlo.output_operand_alias<output_tuple_indices = [], operand_index = 0, operand_tuple_indices = []>]} : (tensor<64xi64>) -> tensor<64xi64>
return %0 : tensor<64xi64>
}
}
call_kwargs = Base.Pairs{Symbol, Union{}, Tuple{}, @NamedTuple{}}()
blockdim = CuDim3(0x00000001, 0x00000001, 0x00000001)
threaddim = CuDim3(0x00000040, 0x00000001, 0x00000001)
Remaining comments which cannot be posted as a review comment to avoid GitHub Rate Limit
JuliaFormatter
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 401 in d0e5195
res |
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 405 in d0e5195
[JuliaFormatter] reported by reviewdog 🐶
Lines 370 to 381 in d0e5195
for (i, inst) in enumerate(ir.stmts) | |
@static if VERSION < v"1.11" | |
changed, next = rewrite_inst(inst[:inst], ir, interp) | |
Core.Compiler.setindex!(ir.stmts[i], next, :inst) | |
else | |
changed, next = rewrite_inst(inst[:stmt], ir, interp) | |
Core.Compiler.setindex!(ir.stmts[i], next, :stmt) | |
end | |
if changed | |
any_changed = true | |
Core.Compiler.setindex!(ir.stmts[i], Any, :type) | |
end |
res = CuTracedArray{T,N,CUDA.AS.Global, size(xs)}(Base.reinterpret(Core.LLVMPtr{T,CUDA.AS.Global}, Base.pointer_from_objref(xs))) | ||
return res |
[JuliaFormatter] reported by reviewdog 🐶
res = CuTracedArray{T,N,CUDA.AS.Global, size(xs)}(Base.reinterpret(Core.LLVMPtr{T,CUDA.AS.Global}, Base.pointer_from_objref(xs))) | |
return res | |
res = CuTracedArray{T,N,CUDA.AS.Global,size(xs)}( | |
Base.reinterpret(Core.LLVMPtr{T,CUDA.AS.Global}, Base.pointer_from_objref(xs)) | |
) | |
return res |
return res | ||
end | ||
const _kernel_instances = Dict{Any, Any}() |
[JuliaFormatter] reported by reviewdog 🐶
const _kernel_instances = Dict{Any, Any}() | |
const _kernel_instances = Dict{Any,Any}() |
f::Union{F, Nothing} | ||
entry::MLIR.IR.Operation |
[JuliaFormatter] reported by reviewdog 🐶
f::Union{F, Nothing} | |
entry::MLIR.IR.Operation | |
f::Union{F,Nothing} | |
entry::MLIR.IR.Operation |
entry::MLIR.IR.Operation | ||
end | ||
[JuliaFormatter] reported by reviewdog 🐶
GPULowerCPUFeaturesPass() = LLVM.NewPMModulePass("GPULowerCPUFeatures", GPUCompiler.cpu_features!) |
[JuliaFormatter] reported by reviewdog 🐶
GPULowerCPUFeaturesPass() = LLVM.NewPMModulePass("GPULowerCPUFeatures", GPUCompiler.cpu_features!) | |
function GPULowerCPUFeaturesPass() | |
return LLVM.NewPMModulePass("GPULowerCPUFeatures", GPUCompiler.cpu_features!) | |
end |
function compiler_cache(ctx::MLIR.IR.Context) | ||
cache = get(_compiler_caches, ctx, nothing) | ||
if cache === nothing | ||
cache = Dict{Any, LLVMFunc}() |
[JuliaFormatter] reported by reviewdog 🐶
cache = Dict{Any, LLVMFunc}() | |
cache = Dict{Any,LLVMFunc}() |
return cache | ||
end | ||
Reactant.@reactant_override @noinline function CUDA.cufunction(f::F, tt::TT=Tuple{}; kwargs...) where {F,TT} |
[JuliaFormatter] reported by reviewdog 🐶
Reactant.@reactant_override @noinline function CUDA.cufunction(f::F, tt::TT=Tuple{}; kwargs...) where {F,TT} | |
Reactant.@reactant_override @noinline function CUDA.cufunction( | |
f::F, tt::TT=Tuple{}; kwargs... | |
) where {F,TT} |
Reactant.@reactant_override @noinline function CUDA.cufunction(f::F, tt::TT=Tuple{}; kwargs...) where {F,TT} | ||
res = Base.@lock CUDA.cufunction_lock begin | ||
# compile the function | ||
cache = compiler_cache(MLIR.IR.context()) |
[JuliaFormatter] reported by reviewdog 🐶
cache = compiler_cache(MLIR.IR.context()) | |
cache = compiler_cache(MLIR.IR.context()) |
# compile the function | ||
cache = compiler_cache(MLIR.IR.context()) | ||
source = CUDA.methodinstance(F, tt) | ||
# cuda = CUDA.active_state() |
[JuliaFormatter] reported by reviewdog 🐶
# cuda = CUDA.active_state() | |
# cuda = CUDA.active_state() |
cuda_cap=v"5.0" | ||
cuda_ptx=v"6.3" | ||
llvm_cap=v"5.0" | ||
llvm_ptx=v"6.3" | ||
kernel=true | ||
always_inline=false | ||
name=nothing | ||
debuginfo=false | ||
config = CUDA.CompilerConfig(CUDA.PTXCompilerTarget(; cap=llvm_cap, ptx=llvm_ptx, debuginfo), CUDA.CUDACompilerParams(; cap=cuda_cap, ptx=cuda_ptx); kernel, name, always_inline) |
[JuliaFormatter] reported by reviewdog 🐶
cuda_cap=v"5.0" | |
cuda_ptx=v"6.3" | |
llvm_cap=v"5.0" | |
llvm_ptx=v"6.3" | |
kernel=true | |
always_inline=false | |
name=nothing | |
debuginfo=false | |
config = CUDA.CompilerConfig(CUDA.PTXCompilerTarget(; cap=llvm_cap, ptx=llvm_ptx, debuginfo), CUDA.CUDACompilerParams(; cap=cuda_cap, ptx=cuda_ptx); kernel, name, always_inline) | |
cuda_cap = v"5.0" | |
cuda_ptx = v"6.3" | |
llvm_cap = v"5.0" | |
llvm_ptx = v"6.3" | |
kernel = true | |
always_inline = false | |
name = nothing | |
debuginfo = false | |
config = CUDA.CompilerConfig( | |
CUDA.PTXCompilerTarget(; cap=llvm_cap, ptx=llvm_ptx, debuginfo), | |
CUDA.CUDACompilerParams(; cap=cuda_cap, ptx=cuda_ptx); | |
kernel, | |
name, | |
always_inline, | |
) |
config = CUDA.CompilerConfig(CUDA.PTXCompilerTarget(; cap=llvm_cap, ptx=llvm_ptx, debuginfo), CUDA.CUDACompilerParams(; cap=cuda_cap, ptx=cuda_ptx); kernel, name, always_inline) | ||
CUDA.GPUCompiler.cached_compilation(cache, source, config, compile, link) | ||
end | ||
res |
[JuliaFormatter] reported by reviewdog 🐶
res | |
return res |
end | ||
|
||
function __init__() | ||
|
[JuliaFormatter] reported by reviewdog 🐶
for (i, inst) in enumerate(ir.stmts) | ||
@static if VERSION < v"1.11" | ||
changed, next = rewrite_inst(inst[:inst], ir, interp) | ||
Core.Compiler.setindex!(ir.stmts[i], next, :inst) | ||
else | ||
changed, next = rewrite_inst(inst[:stmt], ir, interp) | ||
Core.Compiler.setindex!(ir.stmts[i], next, :stmt) | ||
end | ||
if changed | ||
any_changed = true | ||
Core.Compiler.setindex!(ir.stmts[i], Any, :type) | ||
end |
[JuliaFormatter] reported by reviewdog 🐶
for (i, inst) in enumerate(ir.stmts) | |
@static if VERSION < v"1.11" | |
changed, next = rewrite_inst(inst[:inst], ir, interp) | |
Core.Compiler.setindex!(ir.stmts[i], next, :inst) | |
else | |
changed, next = rewrite_inst(inst[:stmt], ir, interp) | |
Core.Compiler.setindex!(ir.stmts[i], next, :stmt) | |
end | |
if changed | |
any_changed = true | |
Core.Compiler.setindex!(ir.stmts[i], Any, :type) | |
end | |
for (i, inst) in enumerate(ir.stmts) | |
@static if VERSION < v"1.11" | |
changed, next = rewrite_inst(inst[:inst], ir, interp) | |
Core.Compiler.setindex!(ir.stmts[i], next, :inst) | |
else | |
changed, next = rewrite_inst(inst[:stmt], ir, interp) | |
Core.Compiler.setindex!(ir.stmts[i], next, :stmt) | |
end | |
if changed | |
any_changed = true | |
Core.Compiler.setindex!(ir.stmts[i], Any, :type) | |
end |
This still has some more work to go before it is fully working, but it improves the errors and the current state (and resolves other interpreter issues, and is a prerequisite for a stack of downstream optimizations). I'm going to go ahead and merge for now and continue in follow-ups.
No description provided.