WIP: kernels #314

Merged: 23 commits from the cuda branch into main on Dec 17, 2024
Conversation

@wsmoses (Member) commented Nov 29, 2024

No description provided.

@github-actions bot left a comment

Reactant.jl Benchmarks

Benchmark suite Current: b7303e5 Previous: 45ae14f Ratio
ViT base (256 x 256 x 3 x 32)/forward/CUDA/Reactant (optimize = :after_enzyme) 1449157594 ns 1287700343 ns 1.13
ViT base (256 x 256 x 3 x 32)/forward/CUDA/Reactant 1301919790 ns 1271515659 ns 1.02
ViT base (256 x 256 x 3 x 32)/forward/CUDA/Reactant (optimize = :before_enzyme) 1339557972 ns 1253394269 ns 1.07
ViT base (256 x 256 x 3 x 32)/forward/CUDA/Reactant (optimize = :only_enzyme) 3312079307 ns 3106663633 ns 1.07
ViT base (256 x 256 x 3 x 32)/forward/CUDA/Lux 206606524 ns 217499591 ns 0.95
ViT base (256 x 256 x 3 x 32)/forward/CPU/Reactant (optimize = :after_enzyme) 5262646551 ns 6749076193 ns 0.78
ViT base (256 x 256 x 3 x 32)/forward/CPU/Reactant 5233063986 ns 5078740247 ns 1.03
ViT base (256 x 256 x 3 x 32)/forward/CPU/Reactant (optimize = :before_enzyme) 5084455177 ns 5013817961 ns 1.01
ViT base (256 x 256 x 3 x 32)/forward/CPU/Reactant (optimize = :only_enzyme) 7686400566 ns 7197691815 ns 1.07
ViT base (256 x 256 x 3 x 32)/forward/CPU/Lux 26339246221 ns 35464964244 ns 0.74
ViT small (256 x 256 x 3 x 4)/forward/CUDA/Reactant (optimize = :after_enzyme) 1300005635 ns 1257317145 ns 1.03
ViT small (256 x 256 x 3 x 4)/forward/CUDA/Reactant 1278041149 ns 1424374803 ns 0.90
ViT small (256 x 256 x 3 x 4)/forward/CUDA/Reactant (optimize = :before_enzyme) 1261990698 ns 1350049098 ns 0.93
ViT small (256 x 256 x 3 x 4)/forward/CUDA/Reactant (optimize = :only_enzyme) 3125146586 ns 3052800629 ns 1.02
ViT small (256 x 256 x 3 x 4)/forward/CUDA/Lux 8879631 ns 8862682 ns 1.00
ViT small (256 x 256 x 3 x 4)/forward/CPU/Reactant (optimize = :after_enzyme) 1550527051 ns 1572590140 ns 0.99
ViT small (256 x 256 x 3 x 4)/forward/CPU/Reactant 1552400963 ns 1559474266 ns 1.00
ViT small (256 x 256 x 3 x 4)/forward/CPU/Reactant (optimize = :before_enzyme) 1552125020 ns 1557501067 ns 1.00
ViT small (256 x 256 x 3 x 4)/forward/CPU/Reactant (optimize = :only_enzyme) 3310850083 ns 3290628669 ns 1.01
ViT small (256 x 256 x 3 x 4)/forward/CPU/Lux 2775956032 ns 2876354148 ns 0.97
ViT tiny (256 x 256 x 3 x 32)/forward/CUDA/Reactant (optimize = :after_enzyme) 1303015586 ns 1231219515 ns 1.06
ViT tiny (256 x 256 x 3 x 32)/forward/CUDA/Reactant 1272928755 ns 1441159242 ns 0.88
ViT tiny (256 x 256 x 3 x 32)/forward/CUDA/Reactant (optimize = :before_enzyme) 1311413197 ns 1282010253 ns 1.02
ViT tiny (256 x 256 x 3 x 32)/forward/CUDA/Reactant (optimize = :only_enzyme) 3028555629 ns 3051584957 ns 0.99
ViT tiny (256 x 256 x 3 x 32)/forward/CUDA/Lux 22655396 ns 22776746 ns 0.99
ViT tiny (256 x 256 x 3 x 32)/forward/CPU/Reactant (optimize = :after_enzyme) 2140398211 ns 2154505585 ns 0.99
ViT tiny (256 x 256 x 3 x 32)/forward/CPU/Reactant 2200393344 ns 2139776302 ns 1.03
ViT tiny (256 x 256 x 3 x 32)/forward/CPU/Reactant (optimize = :before_enzyme) 2142222871 ns 2123332313 ns 1.01
ViT tiny (256 x 256 x 3 x 32)/forward/CPU/Reactant (optimize = :only_enzyme) 3897215106 ns 3879039560 ns 1.00
ViT tiny (256 x 256 x 3 x 32)/forward/CPU/Lux 5312568392 ns 5729200009 ns 0.93
ViT tiny (256 x 256 x 3 x 4)/forward/CUDA/Reactant (optimize = :after_enzyme) 1307990936 ns 1259798635 ns 1.04
ViT tiny (256 x 256 x 3 x 4)/forward/CUDA/Reactant 1301819826 ns 1262851193 ns 1.03
ViT tiny (256 x 256 x 3 x 4)/forward/CUDA/Reactant (optimize = :before_enzyme) 1284427966 ns 1266665882 ns 1.01
ViT tiny (256 x 256 x 3 x 4)/forward/CUDA/Reactant (optimize = :only_enzyme) 3169837598 ns 3319553871 ns 0.95
ViT tiny (256 x 256 x 3 x 4)/forward/CUDA/Lux 7453064 ns 7445203.5 ns 1.00
ViT tiny (256 x 256 x 3 x 4)/forward/CPU/Reactant (optimize = :after_enzyme) 1409136279 ns 1424258021 ns 0.99
ViT tiny (256 x 256 x 3 x 4)/forward/CPU/Reactant 1409545691 ns 1421721118 ns 0.99
ViT tiny (256 x 256 x 3 x 4)/forward/CPU/Reactant (optimize = :before_enzyme) 1414236404 ns 1420742881 ns 1.00
ViT tiny (256 x 256 x 3 x 4)/forward/CPU/Reactant (optimize = :only_enzyme) 3151606700 ns 3162578762 ns 1.00
ViT tiny (256 x 256 x 3 x 4)/forward/CPU/Lux 1654006772.5 ns 2138106366 ns 0.77
ViT tiny (256 x 256 x 3 x 16)/forward/CUDA/Reactant (optimize = :after_enzyme) 1291669432 ns 1297050944 ns 1.00
ViT tiny (256 x 256 x 3 x 16)/forward/CUDA/Reactant 1265833403 ns 1403907055 ns 0.90
ViT tiny (256 x 256 x 3 x 16)/forward/CUDA/Reactant (optimize = :before_enzyme) 1278433111 ns 1269229731 ns 1.01
ViT tiny (256 x 256 x 3 x 16)/forward/CUDA/Reactant (optimize = :only_enzyme) 3126809956 ns 3063143344 ns 1.02
ViT tiny (256 x 256 x 3 x 16)/forward/CUDA/Lux 12328188 ns 12347497 ns 1.00
ViT tiny (256 x 256 x 3 x 16)/forward/CPU/Reactant (optimize = :after_enzyme) 1741906628 ns 1721006513 ns 1.01
ViT tiny (256 x 256 x 3 x 16)/forward/CPU/Reactant 1731592537 ns 1711405549 ns 1.01
ViT tiny (256 x 256 x 3 x 16)/forward/CPU/Reactant (optimize = :before_enzyme) 1720273302 ns 1704835369 ns 1.01
ViT tiny (256 x 256 x 3 x 16)/forward/CPU/Reactant (optimize = :only_enzyme) 3450588571 ns 3443971150 ns 1.00
ViT tiny (256 x 256 x 3 x 16)/forward/CPU/Lux 2948602836 ns 3110298785 ns 0.95
ViT small (256 x 256 x 3 x 16)/forward/CUDA/Reactant (optimize = :after_enzyme) 1494612899 ns 1266729302 ns 1.18
ViT small (256 x 256 x 3 x 16)/forward/CUDA/Reactant 1311317968 ns 1308873395 ns 1.00
ViT small (256 x 256 x 3 x 16)/forward/CUDA/Reactant (optimize = :before_enzyme) 1492915221 ns 1275958493 ns 1.17
ViT small (256 x 256 x 3 x 16)/forward/CUDA/Reactant (optimize = :only_enzyme) 3115105513 ns 3081413477 ns 1.01
ViT small (256 x 256 x 3 x 16)/forward/CUDA/Lux 27412509 ns 27435162 ns 1.00
ViT small (256 x 256 x 3 x 16)/forward/CPU/Reactant (optimize = :after_enzyme) 2228730818 ns 2169947879 ns 1.03
ViT small (256 x 256 x 3 x 16)/forward/CPU/Reactant 2334825207 ns 2163945294 ns 1.08
ViT small (256 x 256 x 3 x 16)/forward/CPU/Reactant (optimize = :before_enzyme) 2310305349 ns 2151891950 ns 1.07
ViT small (256 x 256 x 3 x 16)/forward/CPU/Reactant (optimize = :only_enzyme) 3944966197 ns 3946269320 ns 1.00
ViT small (256 x 256 x 3 x 16)/forward/CPU/Lux 6131212634 ns 6287057122 ns 0.98
ViT small (256 x 256 x 3 x 32)/forward/CUDA/Reactant (optimize = :after_enzyme) 1303567764 ns 1260705673 ns 1.03
ViT small (256 x 256 x 3 x 32)/forward/CUDA/Reactant 1424871003 ns 1369717954 ns 1.04
ViT small (256 x 256 x 3 x 32)/forward/CUDA/Reactant (optimize = :before_enzyme) 1275689864 ns 1281076652 ns 1.00
ViT small (256 x 256 x 3 x 32)/forward/CUDA/Reactant (optimize = :only_enzyme) 3045934410 ns 3130042297 ns 0.97
ViT small (256 x 256 x 3 x 32)/forward/CUDA/Lux 52971586 ns 53036705.5 ns 1.00
ViT small (256 x 256 x 3 x 32)/forward/CPU/Reactant (optimize = :after_enzyme) 3055665974 ns 3050356994 ns 1.00
ViT small (256 x 256 x 3 x 32)/forward/CPU/Reactant 3021313773 ns 3082997102 ns 0.98
ViT small (256 x 256 x 3 x 32)/forward/CPU/Reactant (optimize = :before_enzyme) 3053225043 ns 2965563203 ns 1.03
ViT small (256 x 256 x 3 x 32)/forward/CPU/Reactant (optimize = :only_enzyme) 4887749197 ns 4841087626 ns 1.01
ViT small (256 x 256 x 3 x 32)/forward/CPU/Lux 11183611226 ns 8484129480 ns 1.32
ViT base (256 x 256 x 3 x 16)/forward/CUDA/Reactant (optimize = :after_enzyme) 1300865042 ns 1260921375 ns 1.03
ViT base (256 x 256 x 3 x 16)/forward/CUDA/Reactant 1295735580 ns 1253872568 ns 1.03
ViT base (256 x 256 x 3 x 16)/forward/CUDA/Reactant (optimize = :before_enzyme) 1232925244 ns 1479498539 ns 0.83
ViT base (256 x 256 x 3 x 16)/forward/CUDA/Reactant (optimize = :only_enzyme) 2922815725 ns 3113671601 ns 0.94
ViT base (256 x 256 x 3 x 16)/forward/CUDA/Lux 71283297 ns 71338519.5 ns 1.00
ViT base (256 x 256 x 3 x 16)/forward/CPU/Reactant (optimize = :after_enzyme) 3270546818 ns 3125511597 ns 1.05
ViT base (256 x 256 x 3 x 16)/forward/CPU/Reactant 3230464036 ns 3098530069 ns 1.04
ViT base (256 x 256 x 3 x 16)/forward/CPU/Reactant (optimize = :before_enzyme) 3254041312 ns 3115589553 ns 1.04
ViT base (256 x 256 x 3 x 16)/forward/CPU/Reactant (optimize = :only_enzyme) 5162220727 ns 5036626230 ns 1.02
ViT base (256 x 256 x 3 x 16)/forward/CPU/Lux 15170850681 ns 11289651474 ns 1.34
ViT base (256 x 256 x 3 x 4)/forward/CUDA/Reactant (optimize = :after_enzyme) 1278655290 ns 1339569725 ns 0.95
ViT base (256 x 256 x 3 x 4)/forward/CUDA/Reactant 1229847740 ns 1259019883 ns 0.98
ViT base (256 x 256 x 3 x 4)/forward/CUDA/Reactant (optimize = :before_enzyme) 1439473418 ns 1254828379 ns 1.15
ViT base (256 x 256 x 3 x 4)/forward/CUDA/Reactant (optimize = :only_enzyme) 2922143773 ns 2975337456 ns 0.98
ViT base (256 x 256 x 3 x 4)/forward/CUDA/Lux 20699816 ns 20758936 ns 1.00
ViT base (256 x 256 x 3 x 4)/forward/CPU/Reactant (optimize = :after_enzyme) 1963950807 ns 1859519475 ns 1.06
ViT base (256 x 256 x 3 x 4)/forward/CPU/Reactant 2218798778 ns 1869845638 ns 1.19
ViT base (256 x 256 x 3 x 4)/forward/CPU/Reactant (optimize = :before_enzyme) 2058391749 ns 1850101657 ns 1.11
ViT base (256 x 256 x 3 x 4)/forward/CPU/Reactant (optimize = :only_enzyme) 3614980515 ns 3593739548 ns 1.01
ViT base (256 x 256 x 3 x 4)/forward/CPU/Lux 3206903233.5 ns 3325189113.5 ns 0.96

This comment was automatically generated by workflow using github-action-benchmark.
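
The Ratio column is the current time divided by the previous time (for example, 1449157594 ns / 1287700343 ns ≈ 1.13 in the first row), so values above 1.00 mean the current commit is slower on that benchmark.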

@wsmoses force-pushed the cuda branch 2 times, most recently from be52876 to b8e3570 on December 7, 2024 23:32
@github-actions bot left a comment

Remaining comments which cannot be posted as a review comment to avoid GitHub Rate Limit

JuliaFormatter

[JuliaFormatter] reported by reviewdog 🐶


[JuliaFormatter] reported by reviewdog 🐶

LLVMFunc{F,job.source.specTypes}(f, modstr, image, LLVM.name(entry))


[JuliaFormatter] reported by reviewdog 🐶

f::F
mod::String
image
entry::String


[JuliaFormatter] reported by reviewdog 🐶

function (func::LLVMFunc{F,tt})(args...; blocks::CUDA.CuDim=1, threads::CUDA.CuDim=1,
shmem::Integer=0) where{F, tt}


[JuliaFormatter] reported by reviewdog 🐶

@show a
@assert a isa CuDeviceArray
ta = Base.pointer_to_objref(a.ptr)::TracedRArray
arg = ta.mlir_data
arg = Reactant.Compiler.transpose_val(arg)
push!(restys, MLIR.IR.Type(arg))
push!(aliases,
MLIR.IR.Dialects.stablehlo.stablehloOutputOperandAliasGet(
MLIR.IR.context(),
len(args) == 1 ? 0 : 1,
len(args) == 1 ? C_NULL : Ref{Int64}(i-1),
i-1,
0,
C_NULL
)
)


[JuliaFormatter] reported by reviewdog 🐶

output_operand_aliases=MLIR.ArrayAttr.get(MLIR.IR.context(), aliases)
MLIR.IR.Dialects.stablehlo.custom_call(mlir_args; result_0=restys, call_target_name="reactant_gpu_call", output_operand_aliases)


[JuliaFormatter] reported by reviewdog 🐶

const _compiler_caches = Dict{MLIR.IR.Context, Dict{Any, LLVMFunc}}();


[JuliaFormatter] reported by reviewdog 🐶

cache = Dict{Any, LLVMFunc}()


[JuliaFormatter] reported by reviewdog 🐶

cache = compiler_cache(MLIR.IR.context())


[JuliaFormatter] reported by reviewdog 🐶

cuda = CUDA.active_state()


[JuliaFormatter] reported by reviewdog 🐶

arginfo2 = ArgInfo(
if fargs isa Nothing
nothing
else
[:($(recufunction)), fargs[2:end]...]
end,
[Core.Const(recufunction), argtypes[2:end]...],
)
return abstract_call_known(interp, recufunction, arginfo2, si, sv, max_methods)

Base.unsafe_convert(Core.LLVMPtr{T,A}, x) + Base._memory_offset(x, i)
end



[JuliaFormatter] reported by reviewdog 🐶

Suggested change

unsafe_cached_load(pointer(A), index)
end



[JuliaFormatter] reported by reviewdog 🐶

Suggested change

# deprecated
Base.@propagate_inbounds ldg(A::CuTracedArray, i1::Integer) = const_arrayref(A, i1)



[JuliaFormatter] reported by reviewdog 🐶

Suggested change

return CuTracedArray{T,N,A}(reinterpret(Core.LLVMPtr{T,A}, a.ptr), osize, a.maxsize)
end



[JuliaFormatter] reported by reviewdog 🐶

Suggested change

Comment on lines +199 to +200



[JuliaFormatter] reported by reviewdog 🐶

Suggested change

@github-actions bot left a comment

Remaining comments which cannot be posted as a review comment to avoid GitHub Rate Limit

JuliaFormatter

[JuliaFormatter] reported by reviewdog 🐶

src = Core.Compiler.ir_to_codeinf!(opt)


[JuliaFormatter] reported by reviewdog 🐶

$(Expr(:meta, :generated, call_with_reactant_generator))

@wsmoses (Member, Author) commented Dec 9, 2024

@avik-pal @vchuravy @Pangoraw @jumerckx @mofeing: this is presently a Reactant/Enzyme/Cassette Frankenstein in utils.jl, but I'm cc'ing you for visibility; let me know if you see an obvious reason why I get the following error:

args = (typeof(typeof), ReactantCUDAExt.CuTracedArray{Int64, 1, 1, (64,)})
Square Kernel: Error During Test at /home/wsmoses/Reactant.jl/test/cuda.jl:20
  Got exception outside of a @test
  AssertionError: Unsupported call_with_reactant of builtin (typeof, 64 cu traced array at 0x00007fa1d6a46680)
  Stacktrace:
    [1] macro expansion
      @ ~/Reactant.jl/src/utils.jl:0 [inlined]
    [2] call_with_reactant(::typeof(typeof), ::ReactantCUDAExt.CuTracedArray{Int64, 1, 1, (64,)})
      @ Reactant ~/Reactant.jl/src/utils.jl:456
    [3] Typeof
      @ ./boot.jl:245 [inlined]
    [4] map
      @ ./tuple.jl:291 [inlined]
    [5] macro expansion
      @ ~/.julia/packages/CUDA/2kjXI/src/compiler/execution.jl:111 [inlined]
    [6] square!
      @ ~/Reactant.jl/test/cuda.jl:15 [inlined]
    [7] call_with_reactant(::typeof(square!), ::Reactant.TracedRArray{Int64, 1})
      @ Reactant ~/Reactant.jl/src/utils.jl:0
    [8] (::Reactant.var"#34#44"{Bool, Bool, typeof(square!), Tuple{ConcreteRArray{Int64, 1}}, Vector{Union{ReactantCore.MissingTracedValue, Reactant.TracedRArray, Reactant.TracedRNumber}}, Tuple{Reactant.TracedRArray{Int64, 1}}})()
      @ Reactant ~/Reactant.jl/src/utils.jl:559
    [9] block!(f::Reactant.var"#34#44"{Bool, Bool, typeof(square!), Tuple{ConcreteRArray{Int64, 1}}, Vector{Union{ReactantCore.MissingTracedValue, Reactant.TracedRArray, Reactant.TracedRNumber}}, Tuple{Reactant.TracedRArray{Int64, 1}}}, blk::Reactant.MLIR.IR.Block)
      @ Reactant.MLIR.IR ~/Reactant.jl/src/mlir/IR/Block.jl:201
   [10] make_mlir_fn(f::Function, args::Tuple{ConcreteRArray{Int64, 1}}, kwargs::Tuple{}, name::String, concretein::Bool; toscalar::Bool, return_dialect::Symbol, no_args_in_result::Bool, construct_function_without_args::Bool, do_transpose::Bool)
      @ Reactant ~/Reactant.jl/src/utils.jl:541
   [11] make_mlir_fn
      @ ~/Reactant.jl/src/utils.jl:461 [inlined]
   [12] #10
      @ ~/Reactant.jl/src/Compiler.jl:295 [inlined]
   [13] block!(f::Reactant.Compiler.var"#10#15"{typeof(square!), Tuple{ConcreteRArray{Int64, 1}}}, blk::Reactant.MLIR.IR.Block)
      @ Reactant.MLIR.IR ~/Reactant.jl/src/mlir/IR/Block.jl:201
   [14] #9
      @ ~/Reactant.jl/src/Compiler.jl:294 [inlined]
   [15] mmodule!(f::Reactant.Compiler.var"#9#14"{Reactant.MLIR.IR.Module, typeof(square!), Tuple{ConcreteRArray{Int64, 1}}}, blk::Reactant.MLIR.IR.Module)
      @ Reactant.MLIR.IR ~/Reactant.jl/src/mlir/IR/Module.jl:92
   [16] compile_mlir!(mod::Reactant.MLIR.IR.Module, f::Function, args::Tuple{ConcreteRArray{Int64, 1}}; optimize::Bool)
      @ Reactant.Compiler ~/Reactant.jl/src/Compiler.jl:291
   [17] compile_mlir!
      @ ~/Reactant.jl/src/Compiler.jl:290 [inlined]
   [18] (::Reactant.Compiler.var"#34#36"{Bool, typeof(square!), Tuple{ConcreteRArray{Int64, 1}}})()
      @ Reactant.Compiler ~/Reactant.jl/src/Compiler.jl:698
   [19] context!(f::Reactant.Compiler.var"#34#36"{Bool, typeof(square!), Tuple{ConcreteRArray{Int64, 1}}}, ctx::Reactant.MLIR.IR.Context)
      @ Reactant.MLIR.IR ~/Reactant.jl/src/mlir/IR/Context.jl:76
   [20] compile_xla(f::Function, args::Tuple{ConcreteRArray{Int64, 1}}; client::Nothing, optimize::Bool)
      @ Reactant.Compiler ~/Reactant.jl/src/Compiler.jl:695
   [21] compile_xla
      @ ~/Reactant.jl/src/Compiler.jl:690 [inlined]
   [22] compile(f::Function, args::Tuple{ConcreteRArray{Int64, 1}}; client::Nothing, optimize::Bool, sync::Bool)
      @ Reactant.Compiler ~/Reactant.jl/src/Compiler.jl:722
   [23] macro expansion
      @ ~/Reactant.jl/src/Compiler.jl:475 [inlined]
   [24] macro expansion
      @ ~/Reactant.jl/test/cuda.jl:23 [inlined]
   [25] macro expansion
      @ ~/.julia/juliaup/julia-1.10.5+0.x64.linux.gnu/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
   [26] top-level scope
      @ ~/Reactant.jl/test/cuda.jl:21
   [27] include(fname::String)
      @ Base.MainInclude ./client.jl:489
   [28] top-level scope
      @ ~/Reactant.jl/test/runtests.jl:44
   [29] include(fname::String)
      @ Base.MainInclude ./client.jl:489
   [30] top-level scope
      @ none:6
   [31] eval
      @ ./boot.jl:385 [inlined]
   [32] exec_options(opts::Base.JLOptions)
      @ Base ./client.jl:291
   [33] _start()
      @ Base ./client.jl:552
Test Summary: | Error  Total     Time
Square Kernel |     1      1  1m54.4s

I realized while doing this that a prerequisite is supporting custom abstract interpretation through indirect function calls, so most of utils.jl tries to do that (and, incidentally, in doing so it starts to set up the infrastructure we will need later for automatically detecting if/for/etc. and converting them into traced versions).

Of course, at the moment everything breaks, but anyway.

@wsmoses (Member, Author) commented Dec 10, 2024

@aviatesk this is still very much in progress, but I'm wondering if you could give it a look?

Essentially, the abstract-interpretation part tries to rewrite code so that every type-unstable call is forced through our abstract interpreter (unlike Cassette, we run inference and optimizations first). Eventually we will also do more with the IR, like raising control flow into nice structures, but that's a future todo.
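
As a rough illustration (a minimal sketch, not the actual utils.jl implementation): the idea is that any dynamic call statement in the optimized IR is redirected through call_with_reactant, so the custom abstract interpreter keeps control across indirect calls. The helper name below is hypothetical.

# Hypothetical sketch of the per-statement rewrite; `call_with_reactant` is
# Reactant's dispatcher that re-enters the custom abstract interpreter.
function rewrite_call_sketch(stmt)
    if Meta.isexpr(stmt, :call)
        # Keep the original callee and arguments, but route the invocation
        # through the dispatcher.
        return true, Expr(:call, :call_with_reactant, stmt.args...)
    end
    return false, stmt  # non-call statements are left untouched
end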

@wsmoses (Member, Author) commented Dec 10, 2024

I need to finish up registration, but kernel compile and execute (phase 1) now works!

#= /home/wsmoses/Reactant.jl/test/cuda.jl:22 =# @code_hlo(square!(A)) = module {
  func.func @main(%arg0: tensor<64xi64>) -> tensor<64xi64> {
    %0 = stablehlo.custom_call @reactant_gpu_call(%arg0) {output_operand_aliases = [#stablehlo.output_operand_alias<output_tuple_indices = [], operand_index = 0, operand_tuple_indices = []>]} : (tensor<64xi64>) -> tensor<64xi64>
    return %0 : tensor<64xi64>
  }
}
(res, xs) = (64 cu traced array at 0x00007f57d89256f0, TracedRArray{Int64,1N}(((:args, 1),), size=(64,)))
("recufunction", f, tt) = ("recufunction", square_kernel!, Tuple{ReactantCUDAExt.CuTracedArray{Int64, 1, 1, (64,)}})
args = (64 cu traced array at 0x00007f57d89256f0,)
call_kwargs = Base.Pairs{Symbol, Union{}, Tuple{}, @NamedTuple{}}()
a = 64 cu traced array at 0x00007f57d89256f0
arg = %1 = "stablehlo.transpose"(%0) <{permutation = array<i64: 0>}> : (tensor<64xi64>) -> tensor<64xi64>
terminate called after throwing an instance of 'xla::XlaRuntimeError'
  what():  UNIMPLEMENTED: No registered implementation for custom call to reactant_gpu_call for platform CUDA

@wsmoses (Member, Author) commented Dec 10, 2024

Also cc'ing for visibility @ivanradanov @ftynse: we now [almost] have a single MLIR file with host and device code for an entire Julia application.

end

output_operand_aliases=MLIR.IR.Attribute(aliases)
call = MLIR.Dialects.stablehlo.custom_call(mlir_args; result_0=restys, call_target_name="reactant_gpu_call", output_operand_aliases)
A collaborator commented:

FYI, to pass some information to the XLA callback of this kernel, use the backend_config (https://github.com/openxla/stablehlo/blob/ef176a130f28196dcb4a5735d0f2f6ed0f85bd5d/stablehlo/dialect/StablehloOps.td#L2391) string attribute. That attribute gets forwarded to the "opaque" pair of arguments (pointer + size) as a C string.
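
A hedged sketch of that suggestion, reusing the custom_call construction that appears later in this PR (mlir_args, restys, and aliases are assumed from the surrounding launch code; the payload string is a placeholder):

# Attach a backend_config so the XLA handler for reactant_gpu_call receives the
# kernel identifier through its opaque (pointer + length) argument pair.
output_operand_aliases = MLIR.IR.Attribute(aliases)
call = MLIR.Dialects.stablehlo.custom_call(
    mlir_args;
    result_0=restys,
    call_target_name="reactant_gpu_call",
    output_operand_aliases,
    backend_config=MLIR.IR.Attribute("kernel_entry_name"),  # placeholder payload
)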

@testset "Square Kernel" begin
oA = collect(1:1:64)
A = Reactant.to_rarray(oA)
@show @code_hlo optimize=false square!(A)

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
@show @code_hlo optimize=false square!(A)
@show @code_hlo optimize = false square!(A)
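
For context, a hedged reconstruction of the kernel this test exercises, inferred from the stack trace and the generated device MLIR above (the actual test/cuda.jl may differ slightly):

using CUDA, Reactant, Test

# One thread per element; each thread squares its own entry in place.
function square_kernel!(x)
    i = threadIdx().x
    x[i] *= x[i]
    return nothing
end

# Host-side wrapper: launch a single block with one thread per element,
# matching the blockdim/threaddim values printed later in this thread.
function square!(x)
    @cuda blocks=1 threads=length(x) square_kernel!(x)
    return nothing
end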

@github-actions bot left a comment

Remaining comments which cannot be posted as a review comment to avoid GitHub Rate Limit

JuliaFormatter

[JuliaFormatter] reported by reviewdog 🐶

mod, meta = GPUCompiler.compile(:llvm, job; optimize=false, cleanup=false, validate=false)


[JuliaFormatter] reported by reviewdog 🐶

opt_level = 2


[JuliaFormatter] reported by reviewdog 🐶

LLVM.@dispose pb=LLVM.NewPMPassBuilder() begin


[JuliaFormatter] reported by reviewdog 🐶

modstr = string(mod)


[JuliaFormatter] reported by reviewdog 🐶

# This is a bit weird since we're taking a module from julia's llvm into reactant's llvm version
# it is probably safer to reparse a string using the right llvm module api, so we will do that.


[JuliaFormatter] reported by reviewdog 🐶

mmod = MLIR.IR.Module(@ccall MLIR.API.mlir_c.ConvertLLVMStrToMLIR(modstr::Cstring, MLIR.IR.context()::MLIR.API.MlirContext)::MLIR.API.MlirModule)
@show mmod
# check if we'll need the device runtime
undefined_fs = filter(collect(CUDA.LLVM.functions(meta.ir))) do f
CUDA.LLVM.isdeclaration(f) && !CUDA.LLVM.isintrinsic(f)
end
intrinsic_fns = ["vprintf", "malloc", "free", "__assertfail",
"__nvvm_reflect" #= TODO: should have been optimized away =#]
needs_cudadevrt = !isempty(setdiff(CUDA.LLVM.name.(undefined_fs), intrinsic_fns))
# prepare invocations of CUDA compiler tools
ptxas_opts = String[]
nvlink_opts = String[]
## debug flags
if Base.JLOptions().debug_level == 1
push!(ptxas_opts, "--generate-line-info")
elseif Base.JLOptions().debug_level >= 2
push!(ptxas_opts, "--device-debug")
push!(nvlink_opts, "--debug")
end
## relocatable device code
if needs_cudadevrt
push!(ptxas_opts, "--compile-only")
end
ptx = job.config.params.ptx
cap = job.config.params.cap
arch = "sm_$(cap.major)$(cap.minor)"
# validate use of parameter memory
argtypes = filter([CUDA.KernelState, job.source.specTypes.parameters...]) do dt
!CUDA.isghosttype(dt) && !Core.Compiler.isconstType(dt)
end
param_usage = sum(sizeof, argtypes)
param_limit = 4096
if cap >= v"7.0" && ptx >= v"8.1"
param_limit = 32764
end
if param_usage > param_limit
msg = """Kernel invocation uses too much parameter memory.
$(Base.format_bytes(param_usage)) exceeds the $(Base.format_bytes(param_limit)) limit imposed by sm_$(cap.major)$(cap.minor) / PTX v$(ptx.major).$(ptx.minor)."""
try
details = "\n\nRelevant parameters:"
source_types = job.source.specTypes.parameters
source_argnames = Base.method_argnames(job.source.def)
while length(source_argnames) < length(source_types)
# this is probably due to a trailing vararg; repeat its name
push!(source_argnames, source_argnames[end])
end
for (i, typ) in enumerate(source_types)
if CUDA.isghosttype(typ) || Core.Compiler.isconstType(typ)
continue
end
name = source_argnames[i]
details *= "\n [$(i-1)] $name::$typ uses $(Base.format_bytes(sizeof(typ)))"
end
details *= "\n"
if cap >= v"7.0" && ptx < v"8.1" && param_usage < 32764
details *= "\nNote: use a newer CUDA to support more parameters on your device.\n"
end
msg *= details
catch err
@error "Failed to analyze kernel parameter usage; please file an issue with a reproducer."
end
error(msg)
end
# compile to machine code
# NOTE: we use tempname since mktemp doesn't support suffixes, and mktempdir is slow
ptx_input = tempname(cleanup=false) * ".ptx"
ptxas_output = tempname(cleanup=false) * ".cubin"
write(ptx_input, asm)
# we could use the driver's embedded JIT compiler, but that has several disadvantages:
# 1. fixes and improvements are slower to arrive, by using `ptxas` we only need to
# upgrade the toolkit to get a newer compiler;
# 2. version checking is simpler, we otherwise need to use NVML to query the driver
# version, which is hard to correlate to PTX JIT improvements;
# 3. if we want to be able to use newer (minor upgrades) of the CUDA toolkit on an
# older driver, we should use the newer compiler to ensure compatibility.
append!(ptxas_opts, [
"--verbose",
"--gpu-name", arch,
"--output-file", ptxas_output,
ptx_input
])
proc, log = CUDA.run_and_collect(`$(CUDA.ptxas()) $ptxas_opts`)
log = strip(log)
if !success(proc)
reason = proc.termsignal > 0 ? "ptxas received signal $(proc.termsignal)" :
"ptxas exited with code $(proc.exitcode)"
msg = "Failed to compile PTX code ($reason)"
msg *= "\nInvocation arguments: $(join(ptxas_opts, ' '))"
if !isempty(log)
msg *= "\n" * log
end
msg *= "\nIf you think this is a bug, please file an issue and attach $(ptx_input)"
if parse(Bool, get(ENV, "BUILDKITE", "false"))
run(`buildkite-agent artifact upload $(ptx_input)`)
end
error(msg)
elseif !isempty(log)
@debug "PTX compiler log:\n" * log
end
rm(ptx_input)
# link device libraries, if necessary
#
# this requires relocatable device code, which prevents certain optimizations and
# hurts performance. as such, we only do so when absolutely necessary.
# TODO: try LTO, `--link-time-opt --nvvmpath /opt/cuda/nvvm`.
# fails with `Ignoring -lto option because no LTO objects found`
if needs_cudadevrt
nvlink_output = tempname(cleanup=false) * ".cubin"
append!(nvlink_opts, [
"--verbose", "--extra-warnings",
"--arch", arch,
"--library-path", dirname(libcudadevrt),
"--library", "cudadevrt",
"--output-file", nvlink_output,
ptxas_output
])
proc, log = run_and_collect(`$(CUDA.nvlink()) $nvlink_opts`)
log = strip(log)
if !success(proc)
reason = proc.termsignal > 0 ? "nvlink received signal $(proc.termsignal)" :
"nvlink exited with code $(proc.exitcode)"
msg = "Failed to link PTX code ($reason)"
msg *= "\nInvocation arguments: $(join(nvlink_opts, ' '))"
if !isempty(log)
msg *= "\n" * log
end
msg *= "\nIf you think this is a bug, please file an issue and attach $(ptxas_output)"
error(msg)
elseif !isempty(log)
@debug "PTX linker info log:\n" * log
end
rm(ptxas_output)
image = read(nvlink_output)
rm(nvlink_output)
else
image = read(ptxas_output)
rm(ptxas_output)
end
modstr, image, meta.entry


[JuliaFormatter] reported by reviewdog 🐶

LLVMFunc{job.source.specTypes.parameters[1],job.source.specTypes}(nothing, modstr, image, CUDA.LLVM.name(entry))


[JuliaFormatter] reported by reviewdog 🐶

Reactant.@reactant_override @noinline function (func::LLVMFunc{F,tt})(args...; convert=Val(false), blocks::CuDim=1, threads::CuDim=1,
cooperative::Bool=false, shmem::Integer=0, call_kwargs...) where{F, tt}


[JuliaFormatter] reported by reviewdog 🐶

@show a
@assert a isa CuTracedArray
ta = Base.unsafe_pointer_to_objref(Base.reinterpret(Ptr{Cvoid}, a.ptr))::TracedRArray
push!(rarrays, ta)
arg = ta.mlir_data
arg = transpose_val(arg)
@show arg
push!(restys, MLIR.IR.type(arg))
push!(mlir_args, arg)
push!(aliases,
MLIR.IR.Attribute(MLIR.API.stablehloOutputOperandAliasGet(
MLIR.IR.context(),
length(args) == 1 ? 0 : 1,
length(args) == 1 ? C_NULL : Ref{Int64}(i-1),
i-1,
0,
C_NULL
))
)


[JuliaFormatter] reported by reviewdog 🐶

output_operand_aliases=MLIR.IR.Attribute(aliases)
call = MLIR.Dialects.stablehlo.custom_call(mlir_args; result_0=restys, call_target_name="reactant_gpu_call", output_operand_aliases, backend_config=MLIR.IR.Attribute("configstr"))


[JuliaFormatter] reported by reviewdog 🐶

res.mlir_data = transpose_val(MLIR.IR.result(call, i))


[JuliaFormatter] reported by reviewdog 🐶

const _compiler_caches = Dict{MLIR.IR.Context, Dict{Any, LLVMFunc}}();


[JuliaFormatter] reported by reviewdog 🐶

cache = Dict{Any, LLVMFunc}()


[JuliaFormatter] reported by reviewdog 🐶

Reactant.@reactant_override @noinline function CUDA.cufunction(f::F, tt::TT=Tuple{}; kwargs...) where {F,TT}


[JuliaFormatter] reported by reviewdog 🐶

cache = compiler_cache(MLIR.IR.context())


[JuliaFormatter] reported by reviewdog 🐶

# cuda = CUDA.active_state()


[JuliaFormatter] reported by reviewdog 🐶

cuda_cap=v"5.0"
cuda_ptx=v"6.3"
llvm_cap=v"5.0"
llvm_ptx=v"6.3"
kernel=true
always_inline=false
name=nothing
debuginfo=false
config = CUDA.CompilerConfig(CUDA.PTXCompilerTarget(; cap=llvm_cap, ptx=llvm_ptx, debuginfo), CUDA.CUDACompilerParams(; cap=cuda_cap, ptx=cuda_ptx); kernel, name, always_inline)


[JuliaFormatter] reported by reviewdog 🐶

Reactant.jl/src/utils.jl

Lines 370 to 381 in 0c61f5d

for (i, inst) in enumerate(ir.stmts)
@static if VERSION < v"1.11"
changed, next = rewrite_inst(inst[:inst], ir, interp)
Core.Compiler.setindex!(ir.stmts[i], next, :inst)
else
changed, next = rewrite_inst(inst[:stmt], ir, interp)
Core.Compiler.setindex!(ir.stmts[i], next, :stmt)
end
if changed
any_changed = true
Core.Compiler.setindex!(ir.stmts[i], Any, :type)
end

Comment on lines +4 to +5
using Reactant:
Reactant, TracedRArray, AnyTracedRArray, MLIR, TracedRNumber

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
using Reactant:
Reactant, TracedRArray, AnyTracedRArray, MLIR, TracedRNumber
using Reactant: Reactant, TracedRArray, AnyTracedRArray, MLIR, TracedRNumber

Comment on lines +14 to +16

Base.show(io::IO, a::AT) where AT <: CuTracedArray =
CUDA.Printf.@printf(io, "%s cu traced array at %p", join(size(a), '×'), Int(pointer(a)))

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
Base.show(io::IO, a::AT) where AT <: CuTracedArray =
CUDA.Printf.@printf(io, "%s cu traced array at %p", join(size(a), '×'), Int(pointer(a)))
function Base.show(io::IO, a::AT) where {AT<:CuTracedArray}
CUDA.Printf.@printf(io, "%s cu traced array at %p", join(size(a), '×'), Int(pointer(a)))
end

Base.elsize(::Type{<:CuTracedArray{T}}) where {T} = sizeof(T)
Base.size(g::CuTracedArray{T,N,A,Size}) where {T,N,A,Size} = Size
Base.sizeof(x::CuTracedArray) = Base.elsize(x) * length(x)
Base.pointer(x::CuTracedArray{T,<:Any,A}) where {T,A} = Base.unsafe_convert(Core.LLVMPtr{T,A}, x)

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
Base.pointer(x::CuTracedArray{T,<:Any,A}) where {T,A} = Base.unsafe_convert(Core.LLVMPtr{T,A}, x)
function Base.pointer(x::CuTracedArray{T,<:Any,A}) where {T,A}
return Base.unsafe_convert(Core.LLVMPtr{T,A}, x)
end

Base.sizeof(x::CuTracedArray) = Base.elsize(x) * length(x)
Base.pointer(x::CuTracedArray{T,<:Any,A}) where {T,A} = Base.unsafe_convert(Core.LLVMPtr{T,A}, x)
@inline function Base.pointer(x::CuTracedArray{T,<:Any,A}, i::Integer) where {T,A}
Base.unsafe_convert(Core.LLVMPtr{T,A}, x) + Base._memory_offset(x, i)

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
Base.unsafe_convert(Core.LLVMPtr{T,A}, x) + Base._memory_offset(x, i)
return Base.unsafe_convert(Core.LLVMPtr{T,A}, x) + Base._memory_offset(x, i)

Comment on lines +31 to +33
Base.unsafe_convert(::Type{Core.LLVMPtr{T,A}}, x::CuTracedArray{T,<:Any,A}) where {T,A} =
x.ptr


[JuliaFormatter] reported by reviewdog 🐶

Suggested change
Base.unsafe_convert(::Type{Core.LLVMPtr{T,A}}, x::CuTracedArray{T,<:Any,A}) where {T,A} =
x.ptr
function Base.unsafe_convert(
::Type{Core.LLVMPtr{T,A}}, x::CuTracedArray{T,<:Any,A}
) where {T,A}
return x.ptr
end

Comment on lines +127 to +133
Base.@propagate_inbounds Base.getindex(A::CuTracedArray,
I::Union{Integer, CartesianIndex}...) =
A[Base._to_linear_index(A, to_indices(A, I)...)]
Base.@propagate_inbounds Base.setindex!(A::CuTracedArray, x,
I::Union{Integer, CartesianIndex}...) =
A[Base._to_linear_index(A, to_indices(A, I)...)] = x


[JuliaFormatter] reported by reviewdog 🐶

Suggested change
Base.@propagate_inbounds Base.getindex(A::CuTracedArray,
I::Union{Integer, CartesianIndex}...) =
A[Base._to_linear_index(A, to_indices(A, I)...)]
Base.@propagate_inbounds Base.setindex!(A::CuTracedArray, x,
I::Union{Integer, CartesianIndex}...) =
A[Base._to_linear_index(A, to_indices(A, I)...)] = x
Base.@propagate_inbounds Base.getindex(
A::CuTracedArray, I::Union{Integer,CartesianIndex}...
) = A[Base._to_linear_index(A, to_indices(A, I)...)]
Base.@propagate_inbounds Base.setindex!(
A::CuTracedArray, x, I::Union{Integer,CartesianIndex}...
) = A[Base._to_linear_index(A, to_indices(A, I)...)] = x

Comment on lines +173 to +174
err = GPUArrays._reinterpret_exception(T, a)
err === nothing || throw(err)

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
err = GPUArrays._reinterpret_exception(T, a)
err === nothing || throw(err)
err = GPUArrays._reinterpret_exception(T, a)
err === nothing || throw(err)

Comment on lines +176 to +178
if sizeof(T) == sizeof(S) # fast case
return CuTracedArray{T,N,A}(reinterpret(Core.LLVMPtr{T,A}, a.ptr), size(a), a.maxsize)
end

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
if sizeof(T) == sizeof(S) # fast case
return CuTracedArray{T,N,A}(reinterpret(Core.LLVMPtr{T,A}, a.ptr), size(a), a.maxsize)
end
if sizeof(T) == sizeof(S) # fast case
return CuTracedArray{T,N,A}(
reinterpret(Core.LLVMPtr{T,A}, a.ptr), size(a), a.maxsize
)
end

Comment on lines +180 to +183
isize = size(a)
size1 = div(isize[1]*sizeof(S), sizeof(T))
osize = tuple(size1, Base.tail(isize)...)
return CuTracedArray{T,N,A}(reinterpret(Core.LLVMPtr{T,A}, a.ptr), osize, a.maxsize)

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
isize = size(a)
size1 = div(isize[1]*sizeof(S), sizeof(T))
osize = tuple(size1, Base.tail(isize)...)
return CuTracedArray{T,N,A}(reinterpret(Core.LLVMPtr{T,A}, a.ptr), osize, a.maxsize)
isize = size(a)
size1 = div(isize[1] * sizeof(S), sizeof(T))
osize = tuple(size1, Base.tail(isize)...)
return CuTracedArray{T,N,A}(reinterpret(Core.LLVMPtr{T,A}, a.ptr), osize, a.maxsize)

Comment on lines +190 to +196
if prod(dims) != length(a)
throw(DimensionMismatch("new dimensions (argument `dims`) must be consistent with array size (`size(a)`)"))
end
if N == M && dims == size(a)
return a
end
_derived_array(a, T, dims)

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
if prod(dims) != length(a)
throw(DimensionMismatch("new dimensions (argument `dims`) must be consistent with array size (`size(a)`)"))
end
if N == M && dims == size(a)
return a
end
_derived_array(a, T, dims)
if prod(dims) != length(a)
throw(
DimensionMismatch(
"new dimensions (argument `dims`) must be consistent with array size (`size(a)`)",
),
)
end
if N == M && dims == size(a)
return a
end
return _derived_array(a, T, dims)

@wsmoses (Member, Author) commented Dec 15, 2024

I still need to make the nicer op with registration and lowerings (including llvm/llvm-project#119998), but we now have host and device code in a single MLIR module:

#= /home/wmoses/git/Reactant.jl/test/cuda.jl:22 =# @code_hlo(square!(A)) = #tbaa_root = #llvm.tbaa_root<id = "custom_tbaa">
#tbaa_type_desc = #llvm.tbaa_type_desc<id = "custom_tbaa_addrspace(1)", members = {<#tbaa_root, 0>}>
#tbaa_tag = #llvm.tbaa_tag<base_type = #tbaa_type_desc, access_type = #tbaa_type_desc, offset = 0>
module {
  llvm.func internal unnamed_addr fastcc @julia_throw_boundserror_2676() attributes {dso_local, no_inline, sym_visibility = "private"} {
    llvm.unreachable
  }
  llvm.func local_unnamed_addr ptx_kernelcc @_Z14square_kernel_13CuTracedArrayI5Int64Ll1ELl1E5_64__E(%arg0: !llvm.array<1 x ptr<1>>) {
    %0 = llvm.mlir.constant(63 : i32) : i32
    %1 = nvvm.read.ptx.sreg.tid.x : i32
    %2 = llvm.icmp "ugt" %1, %0 : i32
    llvm.cond_br %2, ^bb2, ^bb1
  ^bb1:  // pred: ^bb0
    %3 = llvm.extractvalue %arg0[0] : !llvm.array<1 x ptr<1>>
    %4 = llvm.zext %1 : i32 to i64
    %5 = llvm.getelementptr inbounds %3[%4] : (!llvm.ptr<1>, i64) -> !llvm.ptr<1>, i64
    %6 = llvm.load %5 {alignment = 1 : i64, tbaa = [#tbaa_tag]} : !llvm.ptr<1> -> i64
    %7 = llvm.mul %6, %6 : i64
    llvm.store %7, %5 {alignment = 1 : i64, tbaa = [#tbaa_tag]} : i64, !llvm.ptr<1>
    llvm.return
  ^bb2:  // pred: ^bb0
    llvm.call fastcc @julia_throw_boundserror_2676() : () -> ()
    llvm.unreachable
  }
  func.func @main(%arg0: tensor<64xi64>) -> tensor<64xi64> {
    %0 = stablehlo.custom_call @reactant_gpu_call(%arg0) {backend_config = "_Z14square_kernel_13CuTracedArrayI5Int64Ll1ELl1E5_64__E", output_operand_aliases = [#stablehlo.output_operand_alias<output_tuple_indices = [], operand_index = 0, operand_tuple_indices = []>]} : (tensor<64xi64>) -> tensor<64xi64>
    return %0 : tensor<64xi64>
  }
}
call_kwargs = Base.Pairs{Symbol, Union{}, Tuple{}, @NamedTuple{}}()
blockdim = CuDim3(0x00000001, 0x00000001, 0x00000001)
threaddim = CuDim3(0x00000040, 0x00000001, 0x00000001)

@github-actions bot left a comment

Remaining comments which cannot be posted as a review comment to avoid GitHub Rate Limit

JuliaFormatter

[JuliaFormatter] reported by reviewdog 🐶

Reactant.jl/src/utils.jl

Lines 370 to 381 in d0e5195

for (i, inst) in enumerate(ir.stmts)
@static if VERSION < v"1.11"
changed, next = rewrite_inst(inst[:inst], ir, interp)
Core.Compiler.setindex!(ir.stmts[i], next, :inst)
else
changed, next = rewrite_inst(inst[:stmt], ir, interp)
Core.Compiler.setindex!(ir.stmts[i], next, :stmt)
end
if changed
any_changed = true
Core.Compiler.setindex!(ir.stmts[i], Any, :type)
end

Comment on lines +202 to +203
res = CuTracedArray{T,N,CUDA.AS.Global, size(xs)}(Base.reinterpret(Core.LLVMPtr{T,CUDA.AS.Global}, Base.pointer_from_objref(xs)))
return res

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
res = CuTracedArray{T,N,CUDA.AS.Global, size(xs)}(Base.reinterpret(Core.LLVMPtr{T,CUDA.AS.Global}, Base.pointer_from_objref(xs)))
return res
res = CuTracedArray{T,N,CUDA.AS.Global,size(xs)}(
Base.reinterpret(Core.LLVMPtr{T,CUDA.AS.Global}, Base.pointer_from_objref(xs))
)
return res

return res
end

const _kernel_instances = Dict{Any, Any}()

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
const _kernel_instances = Dict{Any, Any}()
const _kernel_instances = Dict{Any,Any}()

Comment on lines +209 to +210
f::Union{F, Nothing}
entry::MLIR.IR.Operation

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
f::Union{F, Nothing}
entry::MLIR.IR.Operation
f::Union{F,Nothing}
entry::MLIR.IR.Operation

entry::MLIR.IR.Operation
end



[JuliaFormatter] reported by reviewdog 🐶

Suggested change

Comment on lines +217 to +218

GPULowerCPUFeaturesPass() = LLVM.NewPMModulePass("GPULowerCPUFeatures", GPUCompiler.cpu_features!)

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
GPULowerCPUFeaturesPass() = LLVM.NewPMModulePass("GPULowerCPUFeatures", GPUCompiler.cpu_features!)
function GPULowerCPUFeaturesPass()
return LLVM.NewPMModulePass("GPULowerCPUFeatures", GPUCompiler.cpu_features!)
end

function compiler_cache(ctx::MLIR.IR.Context)
cache = get(_compiler_caches, ctx, nothing)
if cache === nothing
cache = Dict{Any, LLVMFunc}()

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
cache = Dict{Any, LLVMFunc}()
cache = Dict{Any,LLVMFunc}()

return cache
end

Reactant.@reactant_override @noinline function CUDA.cufunction(f::F, tt::TT=Tuple{}; kwargs...) where {F,TT}

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
Reactant.@reactant_override @noinline function CUDA.cufunction(f::F, tt::TT=Tuple{}; kwargs...) where {F,TT}
Reactant.@reactant_override @noinline function CUDA.cufunction(
f::F, tt::TT=Tuple{}; kwargs...
) where {F,TT}

Reactant.@reactant_override @noinline function CUDA.cufunction(f::F, tt::TT=Tuple{}; kwargs...) where {F,TT}
res = Base.@lock CUDA.cufunction_lock begin
# compile the function
cache = compiler_cache(MLIR.IR.context())

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
cache = compiler_cache(MLIR.IR.context())
cache = compiler_cache(MLIR.IR.context())

# compile the function
cache = compiler_cache(MLIR.IR.context())
source = CUDA.methodinstance(F, tt)
# cuda = CUDA.active_state()

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
# cuda = CUDA.active_state()
# cuda = CUDA.active_state()

Comment on lines +390 to +398
cuda_cap=v"5.0"
cuda_ptx=v"6.3"
llvm_cap=v"5.0"
llvm_ptx=v"6.3"
kernel=true
always_inline=false
name=nothing
debuginfo=false
config = CUDA.CompilerConfig(CUDA.PTXCompilerTarget(; cap=llvm_cap, ptx=llvm_ptx, debuginfo), CUDA.CUDACompilerParams(; cap=cuda_cap, ptx=cuda_ptx); kernel, name, always_inline)

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
cuda_cap=v"5.0"
cuda_ptx=v"6.3"
llvm_cap=v"5.0"
llvm_ptx=v"6.3"
kernel=true
always_inline=false
name=nothing
debuginfo=false
config = CUDA.CompilerConfig(CUDA.PTXCompilerTarget(; cap=llvm_cap, ptx=llvm_ptx, debuginfo), CUDA.CUDACompilerParams(; cap=cuda_cap, ptx=cuda_ptx); kernel, name, always_inline)
cuda_cap = v"5.0"
cuda_ptx = v"6.3"
llvm_cap = v"5.0"
llvm_ptx = v"6.3"
kernel = true
always_inline = false
name = nothing
debuginfo = false
config = CUDA.CompilerConfig(
CUDA.PTXCompilerTarget(; cap=llvm_cap, ptx=llvm_ptx, debuginfo),
CUDA.CUDACompilerParams(; cap=cuda_cap, ptx=cuda_ptx);
kernel,
name,
always_inline,
)

config = CUDA.CompilerConfig(CUDA.PTXCompilerTarget(; cap=llvm_cap, ptx=llvm_ptx, debuginfo), CUDA.CUDACompilerParams(; cap=cuda_cap, ptx=cuda_ptx); kernel, name, always_inline)
CUDA.GPUCompiler.cached_compilation(cache, source, config, compile, link)
end
res

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
res
return res

end

function __init__()


[JuliaFormatter] reported by reviewdog 🐶

Suggested change

Comment on lines +370 to +381
for (i, inst) in enumerate(ir.stmts)
@static if VERSION < v"1.11"
changed, next = rewrite_inst(inst[:inst], ir, interp)
Core.Compiler.setindex!(ir.stmts[i], next, :inst)
else
changed, next = rewrite_inst(inst[:stmt], ir, interp)
Core.Compiler.setindex!(ir.stmts[i], next, :stmt)
end
if changed
any_changed = true
Core.Compiler.setindex!(ir.stmts[i], Any, :type)
end

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
for (i, inst) in enumerate(ir.stmts)
@static if VERSION < v"1.11"
changed, next = rewrite_inst(inst[:inst], ir, interp)
Core.Compiler.setindex!(ir.stmts[i], next, :inst)
else
changed, next = rewrite_inst(inst[:stmt], ir, interp)
Core.Compiler.setindex!(ir.stmts[i], next, :stmt)
end
if changed
any_changed = true
Core.Compiler.setindex!(ir.stmts[i], Any, :type)
end
for (i, inst) in enumerate(ir.stmts)
@static if VERSION < v"1.11"
changed, next = rewrite_inst(inst[:inst], ir, interp)
Core.Compiler.setindex!(ir.stmts[i], next, :inst)
else
changed, next = rewrite_inst(inst[:stmt], ir, interp)
Core.Compiler.setindex!(ir.stmts[i], next, :stmt)
end
if changed
any_changed = true
Core.Compiler.setindex!(ir.stmts[i], Any, :type)
end

@wsmoses (Member, Author) commented Dec 17, 2024

This still has some more work to go before it is fully working, but it improves the errors and the current state (and it resolves other interpreter issues and is a prerequisite for a stack of downstream optimizations). I'm going to go ahead and merge for now and continue in follow-ups.

@wsmoses marked this pull request as ready for review December 17, 2024 00:29
@wsmoses merged commit dadc568 into main Dec 17, 2024
2 of 10 checks passed
@wsmoses deleted the cuda branch December 17, 2024 00:29