WIP: kernels #314
Conversation
Reactant.jl Benchmarks
| Benchmark suite | Current: b7303e5 | Previous: 45ae14f | Ratio |
|---|---|---|---|
| ViT base (256 x 256 x 3 x 32)/forward/CUDA/Reactant (optimize = :after_enzyme) | 1449157594 ns | 1287700343 ns | 1.13 |
| ViT base (256 x 256 x 3 x 32)/forward/CUDA/Reactant | 1301919790 ns | 1271515659 ns | 1.02 |
| ViT base (256 x 256 x 3 x 32)/forward/CUDA/Reactant (optimize = :before_enzyme) | 1339557972 ns | 1253394269 ns | 1.07 |
| ViT base (256 x 256 x 3 x 32)/forward/CUDA/Reactant (optimize = :only_enzyme) | 3312079307 ns | 3106663633 ns | 1.07 |
| ViT base (256 x 256 x 3 x 32)/forward/CUDA/Lux | 206606524 ns | 217499591 ns | 0.95 |
| ViT base (256 x 256 x 3 x 32)/forward/CPU/Reactant (optimize = :after_enzyme) | 5262646551 ns | 6749076193 ns | 0.78 |
| ViT base (256 x 256 x 3 x 32)/forward/CPU/Reactant | 5233063986 ns | 5078740247 ns | 1.03 |
| ViT base (256 x 256 x 3 x 32)/forward/CPU/Reactant (optimize = :before_enzyme) | 5084455177 ns | 5013817961 ns | 1.01 |
| ViT base (256 x 256 x 3 x 32)/forward/CPU/Reactant (optimize = :only_enzyme) | 7686400566 ns | 7197691815 ns | 1.07 |
| ViT base (256 x 256 x 3 x 32)/forward/CPU/Lux | 26339246221 ns | 35464964244 ns | 0.74 |
| ViT small (256 x 256 x 3 x 4)/forward/CUDA/Reactant (optimize = :after_enzyme) | 1300005635 ns | 1257317145 ns | 1.03 |
| ViT small (256 x 256 x 3 x 4)/forward/CUDA/Reactant | 1278041149 ns | 1424374803 ns | 0.90 |
| ViT small (256 x 256 x 3 x 4)/forward/CUDA/Reactant (optimize = :before_enzyme) | 1261990698 ns | 1350049098 ns | 0.93 |
| ViT small (256 x 256 x 3 x 4)/forward/CUDA/Reactant (optimize = :only_enzyme) | 3125146586 ns | 3052800629 ns | 1.02 |
| ViT small (256 x 256 x 3 x 4)/forward/CUDA/Lux | 8879631 ns | 8862682 ns | 1.00 |
| ViT small (256 x 256 x 3 x 4)/forward/CPU/Reactant (optimize = :after_enzyme) | 1550527051 ns | 1572590140 ns | 0.99 |
| ViT small (256 x 256 x 3 x 4)/forward/CPU/Reactant | 1552400963 ns | 1559474266 ns | 1.00 |
| ViT small (256 x 256 x 3 x 4)/forward/CPU/Reactant (optimize = :before_enzyme) | 1552125020 ns | 1557501067 ns | 1.00 |
| ViT small (256 x 256 x 3 x 4)/forward/CPU/Reactant (optimize = :only_enzyme) | 3310850083 ns | 3290628669 ns | 1.01 |
| ViT small (256 x 256 x 3 x 4)/forward/CPU/Lux | 2775956032 ns | 2876354148 ns | 0.97 |
| ViT tiny (256 x 256 x 3 x 32)/forward/CUDA/Reactant (optimize = :after_enzyme) | 1303015586 ns | 1231219515 ns | 1.06 |
| ViT tiny (256 x 256 x 3 x 32)/forward/CUDA/Reactant | 1272928755 ns | 1441159242 ns | 0.88 |
| ViT tiny (256 x 256 x 3 x 32)/forward/CUDA/Reactant (optimize = :before_enzyme) | 1311413197 ns | 1282010253 ns | 1.02 |
| ViT tiny (256 x 256 x 3 x 32)/forward/CUDA/Reactant (optimize = :only_enzyme) | 3028555629 ns | 3051584957 ns | 0.99 |
| ViT tiny (256 x 256 x 3 x 32)/forward/CUDA/Lux | 22655396 ns | 22776746 ns | 0.99 |
| ViT tiny (256 x 256 x 3 x 32)/forward/CPU/Reactant (optimize = :after_enzyme) | 2140398211 ns | 2154505585 ns | 0.99 |
| ViT tiny (256 x 256 x 3 x 32)/forward/CPU/Reactant | 2200393344 ns | 2139776302 ns | 1.03 |
| ViT tiny (256 x 256 x 3 x 32)/forward/CPU/Reactant (optimize = :before_enzyme) | 2142222871 ns | 2123332313 ns | 1.01 |
| ViT tiny (256 x 256 x 3 x 32)/forward/CPU/Reactant (optimize = :only_enzyme) | 3897215106 ns | 3879039560 ns | 1.00 |
| ViT tiny (256 x 256 x 3 x 32)/forward/CPU/Lux | 5312568392 ns | 5729200009 ns | 0.93 |
| ViT tiny (256 x 256 x 3 x 4)/forward/CUDA/Reactant (optimize = :after_enzyme) | 1307990936 ns | 1259798635 ns | 1.04 |
| ViT tiny (256 x 256 x 3 x 4)/forward/CUDA/Reactant | 1301819826 ns | 1262851193 ns | 1.03 |
| ViT tiny (256 x 256 x 3 x 4)/forward/CUDA/Reactant (optimize = :before_enzyme) | 1284427966 ns | 1266665882 ns | 1.01 |
| ViT tiny (256 x 256 x 3 x 4)/forward/CUDA/Reactant (optimize = :only_enzyme) | 3169837598 ns | 3319553871 ns | 0.95 |
| ViT tiny (256 x 256 x 3 x 4)/forward/CUDA/Lux | 7453064 ns | 7445203.5 ns | 1.00 |
| ViT tiny (256 x 256 x 3 x 4)/forward/CPU/Reactant (optimize = :after_enzyme) | 1409136279 ns | 1424258021 ns | 0.99 |
| ViT tiny (256 x 256 x 3 x 4)/forward/CPU/Reactant | 1409545691 ns | 1421721118 ns | 0.99 |
| ViT tiny (256 x 256 x 3 x 4)/forward/CPU/Reactant (optimize = :before_enzyme) | 1414236404 ns | 1420742881 ns | 1.00 |
| ViT tiny (256 x 256 x 3 x 4)/forward/CPU/Reactant (optimize = :only_enzyme) | 3151606700 ns | 3162578762 ns | 1.00 |
| ViT tiny (256 x 256 x 3 x 4)/forward/CPU/Lux | 1654006772.5 ns | 2138106366 ns | 0.77 |
| ViT tiny (256 x 256 x 3 x 16)/forward/CUDA/Reactant (optimize = :after_enzyme) | 1291669432 ns | 1297050944 ns | 1.00 |
| ViT tiny (256 x 256 x 3 x 16)/forward/CUDA/Reactant | 1265833403 ns | 1403907055 ns | 0.90 |
| ViT tiny (256 x 256 x 3 x 16)/forward/CUDA/Reactant (optimize = :before_enzyme) | 1278433111 ns | 1269229731 ns | 1.01 |
| ViT tiny (256 x 256 x 3 x 16)/forward/CUDA/Reactant (optimize = :only_enzyme) | 3126809956 ns | 3063143344 ns | 1.02 |
| ViT tiny (256 x 256 x 3 x 16)/forward/CUDA/Lux | 12328188 ns | 12347497 ns | 1.00 |
| ViT tiny (256 x 256 x 3 x 16)/forward/CPU/Reactant (optimize = :after_enzyme) | 1741906628 ns | 1721006513 ns | 1.01 |
| ViT tiny (256 x 256 x 3 x 16)/forward/CPU/Reactant | 1731592537 ns | 1711405549 ns | 1.01 |
| ViT tiny (256 x 256 x 3 x 16)/forward/CPU/Reactant (optimize = :before_enzyme) | 1720273302 ns | 1704835369 ns | 1.01 |
| ViT tiny (256 x 256 x 3 x 16)/forward/CPU/Reactant (optimize = :only_enzyme) | 3450588571 ns | 3443971150 ns | 1.00 |
| ViT tiny (256 x 256 x 3 x 16)/forward/CPU/Lux | 2948602836 ns | 3110298785 ns | 0.95 |
| ViT small (256 x 256 x 3 x 16)/forward/CUDA/Reactant (optimize = :after_enzyme) | 1494612899 ns | 1266729302 ns | 1.18 |
| ViT small (256 x 256 x 3 x 16)/forward/CUDA/Reactant | 1311317968 ns | 1308873395 ns | 1.00 |
| ViT small (256 x 256 x 3 x 16)/forward/CUDA/Reactant (optimize = :before_enzyme) | 1492915221 ns | 1275958493 ns | 1.17 |
| ViT small (256 x 256 x 3 x 16)/forward/CUDA/Reactant (optimize = :only_enzyme) | 3115105513 ns | 3081413477 ns | 1.01 |
| ViT small (256 x 256 x 3 x 16)/forward/CUDA/Lux | 27412509 ns | 27435162 ns | 1.00 |
| ViT small (256 x 256 x 3 x 16)/forward/CPU/Reactant (optimize = :after_enzyme) | 2228730818 ns | 2169947879 ns | 1.03 |
| ViT small (256 x 256 x 3 x 16)/forward/CPU/Reactant | 2334825207 ns | 2163945294 ns | 1.08 |
| ViT small (256 x 256 x 3 x 16)/forward/CPU/Reactant (optimize = :before_enzyme) | 2310305349 ns | 2151891950 ns | 1.07 |
| ViT small (256 x 256 x 3 x 16)/forward/CPU/Reactant (optimize = :only_enzyme) | 3944966197 ns | 3946269320 ns | 1.00 |
| ViT small (256 x 256 x 3 x 16)/forward/CPU/Lux | 6131212634 ns | 6287057122 ns | 0.98 |
| ViT small (256 x 256 x 3 x 32)/forward/CUDA/Reactant (optimize = :after_enzyme) | 1303567764 ns | 1260705673 ns | 1.03 |
| ViT small (256 x 256 x 3 x 32)/forward/CUDA/Reactant | 1424871003 ns | 1369717954 ns | 1.04 |
| ViT small (256 x 256 x 3 x 32)/forward/CUDA/Reactant (optimize = :before_enzyme) | 1275689864 ns | 1281076652 ns | 1.00 |
| ViT small (256 x 256 x 3 x 32)/forward/CUDA/Reactant (optimize = :only_enzyme) | 3045934410 ns | 3130042297 ns | 0.97 |
| ViT small (256 x 256 x 3 x 32)/forward/CUDA/Lux | 52971586 ns | 53036705.5 ns | 1.00 |
| ViT small (256 x 256 x 3 x 32)/forward/CPU/Reactant (optimize = :after_enzyme) | 3055665974 ns | 3050356994 ns | 1.00 |
| ViT small (256 x 256 x 3 x 32)/forward/CPU/Reactant | 3021313773 ns | 3082997102 ns | 0.98 |
| ViT small (256 x 256 x 3 x 32)/forward/CPU/Reactant (optimize = :before_enzyme) | 3053225043 ns | 2965563203 ns | 1.03 |
| ViT small (256 x 256 x 3 x 32)/forward/CPU/Reactant (optimize = :only_enzyme) | 4887749197 ns | 4841087626 ns | 1.01 |
| ViT small (256 x 256 x 3 x 32)/forward/CPU/Lux | 11183611226 ns | 8484129480 ns | 1.32 |
| ViT base (256 x 256 x 3 x 16)/forward/CUDA/Reactant (optimize = :after_enzyme) | 1300865042 ns | 1260921375 ns | 1.03 |
| ViT base (256 x 256 x 3 x 16)/forward/CUDA/Reactant | 1295735580 ns | 1253872568 ns | 1.03 |
| ViT base (256 x 256 x 3 x 16)/forward/CUDA/Reactant (optimize = :before_enzyme) | 1232925244 ns | 1479498539 ns | 0.83 |
| ViT base (256 x 256 x 3 x 16)/forward/CUDA/Reactant (optimize = :only_enzyme) | 2922815725 ns | 3113671601 ns | 0.94 |
| ViT base (256 x 256 x 3 x 16)/forward/CUDA/Lux | 71283297 ns | 71338519.5 ns | 1.00 |
| ViT base (256 x 256 x 3 x 16)/forward/CPU/Reactant (optimize = :after_enzyme) | 3270546818 ns | 3125511597 ns | 1.05 |
| ViT base (256 x 256 x 3 x 16)/forward/CPU/Reactant | 3230464036 ns | 3098530069 ns | 1.04 |
| ViT base (256 x 256 x 3 x 16)/forward/CPU/Reactant (optimize = :before_enzyme) | 3254041312 ns | 3115589553 ns | 1.04 |
| ViT base (256 x 256 x 3 x 16)/forward/CPU/Reactant (optimize = :only_enzyme) | 5162220727 ns | 5036626230 ns | 1.02 |
| ViT base (256 x 256 x 3 x 16)/forward/CPU/Lux | 15170850681 ns | 11289651474 ns | 1.34 |
| ViT base (256 x 256 x 3 x 4)/forward/CUDA/Reactant (optimize = :after_enzyme) | 1278655290 ns | 1339569725 ns | 0.95 |
| ViT base (256 x 256 x 3 x 4)/forward/CUDA/Reactant | 1229847740 ns | 1259019883 ns | 0.98 |
| ViT base (256 x 256 x 3 x 4)/forward/CUDA/Reactant (optimize = :before_enzyme) | 1439473418 ns | 1254828379 ns | 1.15 |
| ViT base (256 x 256 x 3 x 4)/forward/CUDA/Reactant (optimize = :only_enzyme) | 2922143773 ns | 2975337456 ns | 0.98 |
| ViT base (256 x 256 x 3 x 4)/forward/CUDA/Lux | 20699816 ns | 20758936 ns | 1.00 |
| ViT base (256 x 256 x 3 x 4)/forward/CPU/Reactant (optimize = :after_enzyme) | 1963950807 ns | 1859519475 ns | 1.06 |
| ViT base (256 x 256 x 3 x 4)/forward/CPU/Reactant | 2218798778 ns | 1869845638 ns | 1.19 |
| ViT base (256 x 256 x 3 x 4)/forward/CPU/Reactant (optimize = :before_enzyme) | 2058391749 ns | 1850101657 ns | 1.11 |
| ViT base (256 x 256 x 3 x 4)/forward/CPU/Reactant (optimize = :only_enzyme) | 3614980515 ns | 3593739548 ns | 1.01 |
| ViT base (256 x 256 x 3 x 4)/forward/CPU/Lux | 3206903233.5 ns | 3325189113.5 ns | 0.96 |
This comment was automatically generated by workflow using github-action-benchmark.
Branch force-pushed from be52876 to b8e3570.
Remaining comments which cannot be posted as a review comment to avoid GitHub Rate Limit
JuliaFormatter
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 371 in 7c7c8ed
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 376 in 7c7c8ed
LLVMFunc{F,job.source.specTypes}(f, modstr, image, LLVM.name(entry))
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Lines 386 to 389 in 7c7c8ed
f::F
mod::String
image
entry::String
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Lines 392 to 393 in 7c7c8ed
function (func::LLVMFunc{F,tt})(args...; blocks::CUDA.CuDim=1, threads::CUDA.CuDim=1,
    shmem::Integer=0) where{F, tt}
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Lines 403 to 418 in 7c7c8ed
@show a
@assert a isa CuDeviceArray
ta = Base.pointer_to_objref(a.ptr)::TracedRArray
arg = ta.mlir_data
arg = Reactant.Compiler.transpose_val(arg)
push!(restys, MLIR.IR.Type(arg))
push!(aliases,
    MLIR.IR.Dialects.stablehlo.stablehloOutputOperandAliasGet(
        MLIR.IR.context(),
        len(args) == 1 ? 0 : 1,
        len(args) == 1 ? C_NULL : Ref{Int64}(i-1),
        i-1,
        0,
        C_NULL
    )
)
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Lines 421 to 422 in 7c7c8ed
output_operand_aliases=MLIR.ArrayAttr.get(MLIR.IR.context(), aliases)
MLIR.IR.Dialects.stablehlo.custom_call(mlir_args; result_0=restys, call_target_name="reactant_gpu_call", output_operand_aliases)
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 430 in 7c7c8ed
const _compiler_caches = Dict{MLIR.IR.Context, Dict{Any, LLVMFunc}}();
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 434 in 7c7c8ed
cache = Dict{Any, LLVMFunc}()
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 443 in 7c7c8ed
cache = compiler_cache(MLIR.IR.context())
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 446 in 7c7c8ed
cuda = CUDA.active_state()
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 450 in 7c7c8ed
res
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 470 in 7c7c8ed
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Lines 480 to 489 in 7c7c8ed
arginfo2 = ArgInfo(
    if fargs isa Nothing
        nothing
    else
        [:($(recufunction)), fargs[2:end]...]
    end,
    [Core.Const(recufunction), argtypes[2:end]...],
)
return abstract_call_known(interp, recufunction, arginfo2, si, sv, max_methods)
Base.unsafe_convert(Core.LLVMPtr{T,A}, x) + Base._memory_offset(x, i)
end
[JuliaFormatter] reported by reviewdog 🐶
unsafe_cached_load(pointer(A), index)
end
[JuliaFormatter] reported by reviewdog 🐶
# deprecated
Base.@propagate_inbounds ldg(A::CuTracedArray, i1::Integer) = const_arrayref(A, i1)
[JuliaFormatter] reported by reviewdog 🐶
return CuTracedArray{T,N,A}(reinterpret(Core.LLVMPtr{T,A}, a.ptr), osize, a.maxsize)
end
[JuliaFormatter] reported by reviewdog 🐶
[JuliaFormatter] reported by reviewdog 🐶
Remaining comments which cannot be posted as a review comment to avoid GitHub Rate Limit
@avik-pal @vchuravy @Pangoraw @jumerckx @mofeing obviously this is presently a Reactant/Enzyme/Cassette frankenstein in utils.jl, but cc'ing for visibility, and in case you see an obvious reason I get the following error:
I realized while doing this that a prerequisite we need is supporting custom abstract interpretation (absint) through indirect function calls, so most of utils.jl tries to do this (and incidentally, in doing so, starts to set up the infrastructure we will need later for automated detection of if/for/etc. into traced versions). Of course, at the moment everything breaks, but anyway.
@aviatesk obviously this is still very much in progress, but I'm wondering if you can give it a look? Essentially, the absint part is trying to emit a call that forces the use of our abstract interpreter for all type-unstable calls (unlike Cassette, we run inference and optimizations first). Eventually we will also do more with the IR, like raising control flow into nice structures, but that's a future TODO.
I need to finish up registration, but kernel compile and execute (phase 1) now works!

#= /home/wsmoses/Reactant.jl/test/cuda.jl:22 =# @code_hlo(square!(A)) = module {
func.func @main(%arg0: tensor<64xi64>) -> tensor<64xi64> {
%0 = stablehlo.custom_call @reactant_gpu_call(%arg0) {output_operand_aliases = [#stablehlo.output_operand_alias<output_tuple_indices = [], operand_index = 0, operand_tuple_indices = []>]} : (tensor<64xi64>) -> tensor<64xi64>
return %0 : tensor<64xi64>
}
}
(res, xs) = (64 cu traced array at 0x00007f57d89256f0, TracedRArray{Int64,1N}(((:args, 1),), size=(64,)))
("recufunction", f, tt) = ("recufunction", square_kernel!, Tuple{ReactantCUDAExt.CuTracedArray{Int64, 1, 1, (64,)}})
args = (64 cu traced array at 0x00007f57d89256f0,)
call_kwargs = Base.Pairs{Symbol, Union{}, Tuple{}, @NamedTuple{}}()
a = 64 cu traced array at 0x00007f57d89256f0
arg = %1 = "stablehlo.transpose"(%0) <{permutation = array<i64: 0>}> : (tensor<64xi64>) -> tensor<64xi64>
terminate called after throwing an instance of 'xla::XlaRuntimeError'
what(): UNIMPLEMENTED: No registered implementation for custom call to reactant_gpu_call for platform CUDA
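For context, the kernel under test is not quoted in this thread. Below is a minimal sketch consistent with the emitted module above and the test snippet further down; the kernel body and launch configuration are assumptions of this sketch, not the PR's exact code:

```julia
using CUDA, Reactant

# Hypothetical reconstruction of the kernel being traced: square each element in place.
function square_kernel!(x)
    i = threadIdx().x      # one thread per element; launched with 64 threads below
    x[i] = x[i] * x[i]
    return nothing
end

# Host-side launcher; Reactant traces this and lowers the launch to the
# stablehlo.custom_call @reactant_gpu_call shown in the module above.
function square!(x)
    @cuda threads = length(x) square_kernel!(x)
    return nothing
end

A = Reactant.to_rarray(collect(Int64, 1:64))
@code_hlo optimize = false square!(A)
```

The runtime error above then comes from executing that lowered call: the reactant_gpu_call custom-call target still has to be registered with XLA for the CUDA platform before the compiled function can run.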
Also cc'ing for visibility @ivanradanov @ftynse: we now [almost] have a single MLIR file with host and device code for an entire Julia application.
ext/ReactantCUDAExt.jl
Outdated
end

output_operand_aliases=MLIR.IR.Attribute(aliases)
call = MLIR.Dialects.stablehlo.custom_call(mlir_args; result_0=restys, call_target_name="reactant_gpu_call", output_operand_aliases)
FYI, to pass some information to the XLA callback of this kernel, use the backend_config
(https://github.com/openxla/stablehlo/blob/ef176a130f28196dcb4a5735d0f2f6ed0f85bd5d/stablehlo/dialect/StablehloOps.td#L2391) string attribute. That attribute gets forwarded to the "opaque" pair of arguments (pointer + size) as a C string.
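A minimal sketch of that suggestion applied at the call-construction site quoted above (mlir_args, restys, and aliases are the surrounding locals from the quoted hunk; kernel_entry_name is a hypothetical placeholder for whatever string the callback should receive):

```julia
# Attach a backend_config string to the custom call; XLA forwards it to the
# registered callback as the (opaque, opaque_len) pair.
output_operand_aliases = MLIR.IR.Attribute(aliases)
call = MLIR.Dialects.stablehlo.custom_call(
    mlir_args;
    result_0=restys,
    call_target_name="reactant_gpu_call",
    output_operand_aliases,
    backend_config=MLIR.IR.Attribute(kernel_entry_name),  # hypothetical: e.g. the mangled kernel symbol
)
```

A later revision of this PR does take this route: the emitted module further down carries the mangled kernel name in backend_config.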
@testset "Square Kernel" begin
    oA = collect(1:1:64)
    A = Reactant.to_rarray(oA)
    @show @code_hlo optimize=false square!(A)
[JuliaFormatter] reported by reviewdog 🐶
@show @code_hlo optimize=false square!(A)
@show @code_hlo optimize = false square!(A)
Remaining comments which cannot be posted as a review comment to avoid GitHub Rate Limit
JuliaFormatter
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 243 in 0c61f5d
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 256 in 0c61f5d
mod, meta = GPUCompiler.compile(:llvm, job; optimize=false, cleanup=false, validate=false) |
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 258 in 0c61f5d
opt_level = 2 |
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 260 in 0c61f5d
LLVM.@dispose pb=LLVM.NewPMPassBuilder() begin |
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 292 in 0c61f5d
modstr = string(mod) |
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Lines 294 to 295 in 0c61f5d
# This is a bit weird since we're taking a module from julia's llvm into reactant's llvm version | |
# it is probably safer to reparse a string using the right llvm module api, so we will do that. |
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Lines 298 to 450 in 0c61f5d
mmod = MLIR.IR.Module(@ccall MLIR.API.mlir_c.ConvertLLVMStrToMLIR(modstr::Cstring, MLIR.IR.context()::MLIR.API.MlirContext)::MLIR.API.MlirModule) | |
@show mmod | |
# check if we'll need the device runtime | |
undefined_fs = filter(collect(CUDA.LLVM.functions(meta.ir))) do f | |
CUDA.LLVM.isdeclaration(f) && !CUDA.LLVM.isintrinsic(f) | |
end | |
intrinsic_fns = ["vprintf", "malloc", "free", "__assertfail", | |
"__nvvm_reflect" #= TODO: should have been optimized away =#] | |
needs_cudadevrt = !isempty(setdiff(CUDA.LLVM.name.(undefined_fs), intrinsic_fns)) | |
# prepare invocations of CUDA compiler tools | |
ptxas_opts = String[] | |
nvlink_opts = String[] | |
## debug flags | |
if Base.JLOptions().debug_level == 1 | |
push!(ptxas_opts, "--generate-line-info") | |
elseif Base.JLOptions().debug_level >= 2 | |
push!(ptxas_opts, "--device-debug") | |
push!(nvlink_opts, "--debug") | |
end | |
## relocatable device code | |
if needs_cudadevrt | |
push!(ptxas_opts, "--compile-only") | |
end | |
ptx = job.config.params.ptx | |
cap = job.config.params.cap | |
arch = "sm_$(cap.major)$(cap.minor)" | |
# validate use of parameter memory | |
argtypes = filter([CUDA.KernelState, job.source.specTypes.parameters...]) do dt | |
!CUDA.isghosttype(dt) && !Core.Compiler.isconstType(dt) | |
end | |
param_usage = sum(sizeof, argtypes) | |
param_limit = 4096 | |
if cap >= v"7.0" && ptx >= v"8.1" | |
param_limit = 32764 | |
end | |
if param_usage > param_limit | |
msg = """Kernel invocation uses too much parameter memory. | |
$(Base.format_bytes(param_usage)) exceeds the $(Base.format_bytes(param_limit)) limit imposed by sm_$(cap.major)$(cap.minor) / PTX v$(ptx.major).$(ptx.minor).""" | |
try | |
details = "\n\nRelevant parameters:" | |
source_types = job.source.specTypes.parameters | |
source_argnames = Base.method_argnames(job.source.def) | |
while length(source_argnames) < length(source_types) | |
# this is probably due to a trailing vararg; repeat its name | |
push!(source_argnames, source_argnames[end]) | |
end | |
for (i, typ) in enumerate(source_types) | |
if CUDA.isghosttype(typ) || Core.Compiler.isconstType(typ) | |
continue | |
end | |
name = source_argnames[i] | |
details *= "\n [$(i-1)] $name::$typ uses $(Base.format_bytes(sizeof(typ)))" | |
end | |
details *= "\n" | |
if cap >= v"7.0" && ptx < v"8.1" && param_usage < 32764 | |
details *= "\nNote: use a newer CUDA to support more parameters on your device.\n" | |
end | |
msg *= details | |
catch err | |
@error "Failed to analyze kernel parameter usage; please file an issue with a reproducer." | |
end | |
error(msg) | |
end | |
# compile to machine code | |
# NOTE: we use tempname since mktemp doesn't support suffixes, and mktempdir is slow | |
ptx_input = tempname(cleanup=false) * ".ptx" | |
ptxas_output = tempname(cleanup=false) * ".cubin" | |
write(ptx_input, asm) | |
# we could use the driver's embedded JIT compiler, but that has several disadvantages: | |
# 1. fixes and improvements are slower to arrive, by using `ptxas` we only need to | |
# upgrade the toolkit to get a newer compiler; | |
# 2. version checking is simpler, we otherwise need to use NVML to query the driver | |
# version, which is hard to correlate to PTX JIT improvements; | |
# 3. if we want to be able to use newer (minor upgrades) of the CUDA toolkit on an | |
# older driver, we should use the newer compiler to ensure compatibility. | |
append!(ptxas_opts, [ | |
"--verbose", | |
"--gpu-name", arch, | |
"--output-file", ptxas_output, | |
ptx_input | |
]) | |
proc, log = CUDA.run_and_collect(`$(CUDA.ptxas()) $ptxas_opts`) | |
log = strip(log) | |
if !success(proc) | |
reason = proc.termsignal > 0 ? "ptxas received signal $(proc.termsignal)" : | |
"ptxas exited with code $(proc.exitcode)" | |
msg = "Failed to compile PTX code ($reason)" | |
msg *= "\nInvocation arguments: $(join(ptxas_opts, ' '))" | |
if !isempty(log) | |
msg *= "\n" * log | |
end | |
msg *= "\nIf you think this is a bug, please file an issue and attach $(ptx_input)" | |
if parse(Bool, get(ENV, "BUILDKITE", "false")) | |
run(`buildkite-agent artifact upload $(ptx_input)`) | |
end | |
error(msg) | |
elseif !isempty(log) | |
@debug "PTX compiler log:\n" * log | |
end | |
rm(ptx_input) | |
# link device libraries, if necessary | |
# | |
# this requires relocatable device code, which prevents certain optimizations and | |
# hurts performance. as such, we only do so when absolutely necessary. | |
# TODO: try LTO, `--link-time-opt --nvvmpath /opt/cuda/nvvm`. | |
# fails with `Ignoring -lto option because no LTO objects found` | |
if needs_cudadevrt | |
nvlink_output = tempname(cleanup=false) * ".cubin" | |
append!(nvlink_opts, [ | |
"--verbose", "--extra-warnings", | |
"--arch", arch, | |
"--library-path", dirname(libcudadevrt), | |
"--library", "cudadevrt", | |
"--output-file", nvlink_output, | |
ptxas_output | |
]) | |
proc, log = run_and_collect(`$(CUDA.nvlink()) $nvlink_opts`) | |
log = strip(log) | |
if !success(proc) | |
reason = proc.termsignal > 0 ? "nvlink received signal $(proc.termsignal)" : | |
"nvlink exited with code $(proc.exitcode)" | |
msg = "Failed to link PTX code ($reason)" | |
msg *= "\nInvocation arguments: $(join(nvlink_opts, ' '))" | |
if !isempty(log) | |
msg *= "\n" * log | |
end | |
msg *= "\nIf you think this is a bug, please file an issue and attach $(ptxas_output)" | |
error(msg) | |
elseif !isempty(log) | |
@debug "PTX linker info log:\n" * log | |
end | |
rm(ptxas_output) | |
image = read(nvlink_output) | |
rm(nvlink_output) | |
else | |
image = read(ptxas_output) | |
rm(ptxas_output) | |
end | |
modstr, image, meta.entry |
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 452 in 0c61f5d
LLVMFunc{job.source.specTypes.parameters[1],job.source.specTypes}(nothing, modstr, image, CUDA.LLVM.name(entry)) |
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Lines 468 to 469 in 0c61f5d
Reactant.@reactant_override @noinline function (func::LLVMFunc{F,tt})(args...; convert=Val(false), blocks::CuDim=1, threads::CuDim=1, | |
cooperative::Bool=false, shmem::Integer=0, call_kwargs...) where{F, tt} |
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Lines 481 to 499 in 0c61f5d
@show a | |
@assert a isa CuTracedArray | |
ta = Base.unsafe_pointer_to_objref(Base.reinterpret(Ptr{Cvoid}, a.ptr))::TracedRArray | |
push!(rarrays, ta) | |
arg = ta.mlir_data | |
arg = transpose_val(arg) | |
@show arg | |
push!(restys, MLIR.IR.type(arg)) | |
push!(mlir_args, arg) | |
push!(aliases, | |
MLIR.IR.Attribute(MLIR.API.stablehloOutputOperandAliasGet( | |
MLIR.IR.context(), | |
length(args) == 1 ? 0 : 1, | |
length(args) == 1 ? C_NULL : Ref{Int64}(i-1), | |
i-1, | |
0, | |
C_NULL | |
)) | |
) |
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Lines 502 to 503 in 0c61f5d
output_operand_aliases=MLIR.IR.Attribute(aliases) | |
call = MLIR.Dialects.stablehlo.custom_call(mlir_args; result_0=restys, call_target_name="reactant_gpu_call", output_operand_aliases, backend_config=MLIR.IR.Attribute("configstr")) |
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 506 in 0c61f5d
res.mlir_data = transpose_val(MLIR.IR.result(call, i)) |
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 515 in 0c61f5d
const _compiler_caches = Dict{MLIR.IR.Context, Dict{Any, LLVMFunc}}(); |
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 519 in 0c61f5d
cache = Dict{Any, LLVMFunc}() |
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 525 in 0c61f5d
Reactant.@reactant_override @noinline function CUDA.cufunction(f::F, tt::TT=Tuple{}; kwargs...) where {F,TT} |
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 529 in 0c61f5d
cache = compiler_cache(MLIR.IR.context()) |
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 532 in 0c61f5d
# cuda = CUDA.active_state() |
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Lines 535 to 543 in 0c61f5d
cuda_cap=v"5.0" | |
cuda_ptx=v"6.3" | |
llvm_cap=v"5.0" | |
llvm_ptx=v"6.3" | |
kernel=true | |
always_inline=false | |
name=nothing | |
debuginfo=false | |
config = CUDA.CompilerConfig(CUDA.PTXCompilerTarget(; cap=llvm_cap, ptx=llvm_ptx, debuginfo), CUDA.CUDACompilerParams(; cap=cuda_cap, ptx=cuda_ptx); kernel, name, always_inline) |
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 547 in 0c61f5d
res |
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 551 in 0c61f5d
[JuliaFormatter] reported by reviewdog 🐶
Lines 370 to 381 in 0c61f5d
for (i, inst) in enumerate(ir.stmts) | |
@static if VERSION < v"1.11" | |
changed, next = rewrite_inst(inst[:inst], ir, interp) | |
Core.Compiler.setindex!(ir.stmts[i], next, :inst) | |
else | |
changed, next = rewrite_inst(inst[:stmt], ir, interp) | |
Core.Compiler.setindex!(ir.stmts[i], next, :stmt) | |
end | |
if changed | |
any_changed = true | |
Core.Compiler.setindex!(ir.stmts[i], Any, :type) | |
end |
using Reactant: | ||
Reactant, TracedRArray, AnyTracedRArray, MLIR, TracedRNumber |
[JuliaFormatter] reported by reviewdog 🐶
using Reactant: | |
Reactant, TracedRArray, AnyTracedRArray, MLIR, TracedRNumber | |
using Reactant: Reactant, TracedRArray, AnyTracedRArray, MLIR, TracedRNumber |
Base.show(io::IO, a::AT) where AT <: CuTracedArray = | ||
CUDA.Printf.@printf(io, "%s cu traced array at %p", join(size(a), '×'), Int(pointer(a))) |
[JuliaFormatter] reported by reviewdog 🐶
Base.show(io::IO, a::AT) where AT <: CuTracedArray = | |
CUDA.Printf.@printf(io, "%s cu traced array at %p", join(size(a), '×'), Int(pointer(a))) | |
function Base.show(io::IO, a::AT) where {AT<:CuTracedArray} | |
CUDA.Printf.@printf(io, "%s cu traced array at %p", join(size(a), '×'), Int(pointer(a))) | |
end |
Base.elsize(::Type{<:CuTracedArray{T}}) where {T} = sizeof(T) | ||
Base.size(g::CuTracedArray{T,N,A,Size}) where {T,N,A,Size} = Size | ||
Base.sizeof(x::CuTracedArray) = Base.elsize(x) * length(x) | ||
Base.pointer(x::CuTracedArray{T,<:Any,A}) where {T,A} = Base.unsafe_convert(Core.LLVMPtr{T,A}, x) |
[JuliaFormatter] reported by reviewdog 🐶
Base.pointer(x::CuTracedArray{T,<:Any,A}) where {T,A} = Base.unsafe_convert(Core.LLVMPtr{T,A}, x) | |
function Base.pointer(x::CuTracedArray{T,<:Any,A}) where {T,A} | |
return Base.unsafe_convert(Core.LLVMPtr{T,A}, x) | |
end |
Base.sizeof(x::CuTracedArray) = Base.elsize(x) * length(x) | ||
Base.pointer(x::CuTracedArray{T,<:Any,A}) where {T,A} = Base.unsafe_convert(Core.LLVMPtr{T,A}, x) | ||
@inline function Base.pointer(x::CuTracedArray{T,<:Any,A}, i::Integer) where {T,A} | ||
Base.unsafe_convert(Core.LLVMPtr{T,A}, x) + Base._memory_offset(x, i) |
[JuliaFormatter] reported by reviewdog 🐶
Base.unsafe_convert(Core.LLVMPtr{T,A}, x) + Base._memory_offset(x, i) | |
return Base.unsafe_convert(Core.LLVMPtr{T,A}, x) + Base._memory_offset(x, i) |
Base.unsafe_convert(::Type{Core.LLVMPtr{T,A}}, x::CuTracedArray{T,<:Any,A}) where {T,A} = | ||
x.ptr | ||
[JuliaFormatter] reported by reviewdog 🐶
Base.unsafe_convert(::Type{Core.LLVMPtr{T,A}}, x::CuTracedArray{T,<:Any,A}) where {T,A} = | |
x.ptr | |
function Base.unsafe_convert( | |
::Type{Core.LLVMPtr{T,A}}, x::CuTracedArray{T,<:Any,A} | |
) where {T,A} | |
return x.ptr | |
end |
Base.@propagate_inbounds Base.getindex(A::CuTracedArray, | ||
I::Union{Integer, CartesianIndex}...) = | ||
A[Base._to_linear_index(A, to_indices(A, I)...)] | ||
Base.@propagate_inbounds Base.setindex!(A::CuTracedArray, x, | ||
I::Union{Integer, CartesianIndex}...) = | ||
A[Base._to_linear_index(A, to_indices(A, I)...)] = x | ||
|
[JuliaFormatter] reported by reviewdog 🐶
Base.@propagate_inbounds Base.getindex(A::CuTracedArray, | |
I::Union{Integer, CartesianIndex}...) = | |
A[Base._to_linear_index(A, to_indices(A, I)...)] | |
Base.@propagate_inbounds Base.setindex!(A::CuTracedArray, x, | |
I::Union{Integer, CartesianIndex}...) = | |
A[Base._to_linear_index(A, to_indices(A, I)...)] = x | |
Base.@propagate_inbounds Base.getindex( | |
A::CuTracedArray, I::Union{Integer,CartesianIndex}... | |
) = A[Base._to_linear_index(A, to_indices(A, I)...)] | |
Base.@propagate_inbounds Base.setindex!( | |
A::CuTracedArray, x, I::Union{Integer,CartesianIndex}... | |
) = A[Base._to_linear_index(A, to_indices(A, I)...)] = x |
err = GPUArrays._reinterpret_exception(T, a) | ||
err === nothing || throw(err) |
[JuliaFormatter] reported by reviewdog 🐶
err = GPUArrays._reinterpret_exception(T, a) | |
err === nothing || throw(err) | |
err = GPUArrays._reinterpret_exception(T, a) | |
err === nothing || throw(err) |
if sizeof(T) == sizeof(S) # fast case | ||
return CuTracedArray{T,N,A}(reinterpret(Core.LLVMPtr{T,A}, a.ptr), size(a), a.maxsize) | ||
end |
[JuliaFormatter] reported by reviewdog 🐶
if sizeof(T) == sizeof(S) # fast case | |
return CuTracedArray{T,N,A}(reinterpret(Core.LLVMPtr{T,A}, a.ptr), size(a), a.maxsize) | |
end | |
if sizeof(T) == sizeof(S) # fast case | |
return CuTracedArray{T,N,A}( | |
reinterpret(Core.LLVMPtr{T,A}, a.ptr), size(a), a.maxsize | |
) | |
end |
isize = size(a) | ||
size1 = div(isize[1]*sizeof(S), sizeof(T)) | ||
osize = tuple(size1, Base.tail(isize)...) | ||
return CuTracedArray{T,N,A}(reinterpret(Core.LLVMPtr{T,A}, a.ptr), osize, a.maxsize) |
[JuliaFormatter] reported by reviewdog 🐶
isize = size(a) | |
size1 = div(isize[1]*sizeof(S), sizeof(T)) | |
osize = tuple(size1, Base.tail(isize)...) | |
return CuTracedArray{T,N,A}(reinterpret(Core.LLVMPtr{T,A}, a.ptr), osize, a.maxsize) | |
isize = size(a) | |
size1 = div(isize[1] * sizeof(S), sizeof(T)) | |
osize = tuple(size1, Base.tail(isize)...) | |
return CuTracedArray{T,N,A}(reinterpret(Core.LLVMPtr{T,A}, a.ptr), osize, a.maxsize) |
if prod(dims) != length(a) | ||
throw(DimensionMismatch("new dimensions (argument `dims`) must be consistent with array size (`size(a)`)")) | ||
end | ||
if N == M && dims == size(a) | ||
return a | ||
end | ||
_derived_array(a, T, dims) |
[JuliaFormatter] reported by reviewdog 🐶
if prod(dims) != length(a) | |
throw(DimensionMismatch("new dimensions (argument `dims`) must be consistent with array size (`size(a)`)")) | |
end | |
if N == M && dims == size(a) | |
return a | |
end | |
_derived_array(a, T, dims) | |
if prod(dims) != length(a) | |
throw( | |
DimensionMismatch( | |
"new dimensions (argument `dims`) must be consistent with array size (`size(a)`)", | |
), | |
) | |
end | |
if N == M && dims == size(a) | |
return a | |
end | |
return _derived_array(a, T, dims) |
I still need to make the nicer op with registration and lowerings (incl. llvm/llvm-project#119998), but we now have host and device code in an MLIR module:

#= /home/wmoses/git/Reactant.jl/test/cuda.jl:22 =# @code_hlo(square!(A)) = #tbaa_root = #llvm.tbaa_root<id = "custom_tbaa">
#tbaa_type_desc = #llvm.tbaa_type_desc<id = "custom_tbaa_addrspace(1)", members = {<#tbaa_root, 0>}>
#tbaa_tag = #llvm.tbaa_tag<base_type = #tbaa_type_desc, access_type = #tbaa_type_desc, offset = 0>
module {
llvm.func internal unnamed_addr fastcc @julia_throw_boundserror_2676() attributes {dso_local, no_inline, sym_visibility = "private"} {
llvm.unreachable
}
llvm.func local_unnamed_addr ptx_kernelcc @_Z14square_kernel_13CuTracedArrayI5Int64Ll1ELl1E5_64__E(%arg0: !llvm.array<1 x ptr<1>>) {
%0 = llvm.mlir.constant(63 : i32) : i32
%1 = nvvm.read.ptx.sreg.tid.x : i32
%2 = llvm.icmp "ugt" %1, %0 : i32
llvm.cond_br %2, ^bb2, ^bb1
^bb1: // pred: ^bb0
%3 = llvm.extractvalue %arg0[0] : !llvm.array<1 x ptr<1>>
%4 = llvm.zext %1 : i32 to i64
%5 = llvm.getelementptr inbounds %3[%4] : (!llvm.ptr<1>, i64) -> !llvm.ptr<1>, i64
%6 = llvm.load %5 {alignment = 1 : i64, tbaa = [#tbaa_tag]} : !llvm.ptr<1> -> i64
%7 = llvm.mul %6, %6 : i64
llvm.store %7, %5 {alignment = 1 : i64, tbaa = [#tbaa_tag]} : i64, !llvm.ptr<1>
llvm.return
^bb2: // pred: ^bb0
llvm.call fastcc @julia_throw_boundserror_2676() : () -> ()
llvm.unreachable
}
func.func @main(%arg0: tensor<64xi64>) -> tensor<64xi64> {
%0 = stablehlo.custom_call @reactant_gpu_call(%arg0) {backend_config = "_Z14square_kernel_13CuTracedArrayI5Int64Ll1ELl1E5_64__E", output_operand_aliases = [#stablehlo.output_operand_alias<output_tuple_indices = [], operand_index = 0, operand_tuple_indices = []>]} : (tensor<64xi64>) -> tensor<64xi64>
return %0 : tensor<64xi64>
}
}
call_kwargs = Base.Pairs{Symbol, Union{}, Tuple{}, @NamedTuple{}}()
blockdim = CuDim3(0x00000001, 0x00000001, 0x00000001)
threaddim = CuDim3(0x00000040, 0x00000001, 0x00000001)
Remaining comments which cannot be posted as a review comment to avoid GitHub Rate Limit
JuliaFormatter
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 401 in d0e5195
res |
[JuliaFormatter] reported by reviewdog 🐶
Reactant.jl/ext/ReactantCUDAExt.jl
Line 405 in d0e5195
[JuliaFormatter] reported by reviewdog 🐶
Lines 370 to 381 in d0e5195
for (i, inst) in enumerate(ir.stmts) | |
@static if VERSION < v"1.11" | |
changed, next = rewrite_inst(inst[:inst], ir, interp) | |
Core.Compiler.setindex!(ir.stmts[i], next, :inst) | |
else | |
changed, next = rewrite_inst(inst[:stmt], ir, interp) | |
Core.Compiler.setindex!(ir.stmts[i], next, :stmt) | |
end | |
if changed | |
any_changed = true | |
Core.Compiler.setindex!(ir.stmts[i], Any, :type) | |
end |
res = CuTracedArray{T,N,CUDA.AS.Global, size(xs)}(Base.reinterpret(Core.LLVMPtr{T,CUDA.AS.Global}, Base.pointer_from_objref(xs))) | ||
return res |
[JuliaFormatter] reported by reviewdog 🐶
res = CuTracedArray{T,N,CUDA.AS.Global, size(xs)}(Base.reinterpret(Core.LLVMPtr{T,CUDA.AS.Global}, Base.pointer_from_objref(xs))) | |
return res | |
res = CuTracedArray{T,N,CUDA.AS.Global,size(xs)}( | |
Base.reinterpret(Core.LLVMPtr{T,CUDA.AS.Global}, Base.pointer_from_objref(xs)) | |
) | |
return res |
return res | ||
end | ||
const _kernel_instances = Dict{Any, Any}() |
[JuliaFormatter] reported by reviewdog 🐶
const _kernel_instances = Dict{Any, Any}() | |
const _kernel_instances = Dict{Any,Any}() |
f::Union{F, Nothing} | ||
entry::MLIR.IR.Operation |
[JuliaFormatter] reported by reviewdog 🐶
f::Union{F, Nothing} | |
entry::MLIR.IR.Operation | |
f::Union{F,Nothing} | |
entry::MLIR.IR.Operation |
entry::MLIR.IR.Operation | ||
end | ||
[JuliaFormatter] reported by reviewdog 🐶
GPULowerCPUFeaturesPass() = LLVM.NewPMModulePass("GPULowerCPUFeatures", GPUCompiler.cpu_features!) |
[JuliaFormatter] reported by reviewdog 🐶
GPULowerCPUFeaturesPass() = LLVM.NewPMModulePass("GPULowerCPUFeatures", GPUCompiler.cpu_features!) | |
function GPULowerCPUFeaturesPass() | |
return LLVM.NewPMModulePass("GPULowerCPUFeatures", GPUCompiler.cpu_features!) | |
end |
function compiler_cache(ctx::MLIR.IR.Context) | ||
cache = get(_compiler_caches, ctx, nothing) | ||
if cache === nothing | ||
cache = Dict{Any, LLVMFunc}() |
[JuliaFormatter] reported by reviewdog 🐶
cache = Dict{Any, LLVMFunc}() | |
cache = Dict{Any,LLVMFunc}() |
return cache | ||
end | ||
Reactant.@reactant_override @noinline function CUDA.cufunction(f::F, tt::TT=Tuple{}; kwargs...) where {F,TT} |
[JuliaFormatter] reported by reviewdog 🐶
Reactant.@reactant_override @noinline function CUDA.cufunction(f::F, tt::TT=Tuple{}; kwargs...) where {F,TT} | |
Reactant.@reactant_override @noinline function CUDA.cufunction( | |
f::F, tt::TT=Tuple{}; kwargs... | |
) where {F,TT} |
Reactant.@reactant_override @noinline function CUDA.cufunction(f::F, tt::TT=Tuple{}; kwargs...) where {F,TT} | ||
res = Base.@lock CUDA.cufunction_lock begin | ||
# compile the function | ||
cache = compiler_cache(MLIR.IR.context()) |
[JuliaFormatter] reported by reviewdog 🐶
cache = compiler_cache(MLIR.IR.context()) | |
cache = compiler_cache(MLIR.IR.context()) |
# compile the function | ||
cache = compiler_cache(MLIR.IR.context()) | ||
source = CUDA.methodinstance(F, tt) | ||
# cuda = CUDA.active_state() |
[JuliaFormatter] reported by reviewdog 🐶
# cuda = CUDA.active_state() | |
# cuda = CUDA.active_state() |
cuda_cap=v"5.0" | ||
cuda_ptx=v"6.3" | ||
llvm_cap=v"5.0" | ||
llvm_ptx=v"6.3" | ||
kernel=true | ||
always_inline=false | ||
name=nothing | ||
debuginfo=false | ||
config = CUDA.CompilerConfig(CUDA.PTXCompilerTarget(; cap=llvm_cap, ptx=llvm_ptx, debuginfo), CUDA.CUDACompilerParams(; cap=cuda_cap, ptx=cuda_ptx); kernel, name, always_inline) |
[JuliaFormatter] reported by reviewdog 🐶
cuda_cap=v"5.0" | |
cuda_ptx=v"6.3" | |
llvm_cap=v"5.0" | |
llvm_ptx=v"6.3" | |
kernel=true | |
always_inline=false | |
name=nothing | |
debuginfo=false | |
config = CUDA.CompilerConfig(CUDA.PTXCompilerTarget(; cap=llvm_cap, ptx=llvm_ptx, debuginfo), CUDA.CUDACompilerParams(; cap=cuda_cap, ptx=cuda_ptx); kernel, name, always_inline) | |
cuda_cap = v"5.0" | |
cuda_ptx = v"6.3" | |
llvm_cap = v"5.0" | |
llvm_ptx = v"6.3" | |
kernel = true | |
always_inline = false | |
name = nothing | |
debuginfo = false | |
config = CUDA.CompilerConfig( | |
CUDA.PTXCompilerTarget(; cap=llvm_cap, ptx=llvm_ptx, debuginfo), | |
CUDA.CUDACompilerParams(; cap=cuda_cap, ptx=cuda_ptx); | |
kernel, | |
name, | |
always_inline, | |
) |
config = CUDA.CompilerConfig(CUDA.PTXCompilerTarget(; cap=llvm_cap, ptx=llvm_ptx, debuginfo), CUDA.CUDACompilerParams(; cap=cuda_cap, ptx=cuda_ptx); kernel, name, always_inline) | ||
CUDA.GPUCompiler.cached_compilation(cache, source, config, compile, link) | ||
end | ||
res |
[JuliaFormatter] reported by reviewdog 🐶
res | |
return res |
end | ||
|
||
function __init__() | ||
|
[JuliaFormatter] reported by reviewdog 🐶
for (i, inst) in enumerate(ir.stmts) | ||
@static if VERSION < v"1.11" | ||
changed, next = rewrite_inst(inst[:inst], ir, interp) | ||
Core.Compiler.setindex!(ir.stmts[i], next, :inst) | ||
else | ||
changed, next = rewrite_inst(inst[:stmt], ir, interp) | ||
Core.Compiler.setindex!(ir.stmts[i], next, :stmt) | ||
end | ||
if changed | ||
any_changed = true | ||
Core.Compiler.setindex!(ir.stmts[i], Any, :type) | ||
end |
[JuliaFormatter] reported by reviewdog 🐶
for (i, inst) in enumerate(ir.stmts) | |
@static if VERSION < v"1.11" | |
changed, next = rewrite_inst(inst[:inst], ir, interp) | |
Core.Compiler.setindex!(ir.stmts[i], next, :inst) | |
else | |
changed, next = rewrite_inst(inst[:stmt], ir, interp) | |
Core.Compiler.setindex!(ir.stmts[i], next, :stmt) | |
end | |
if changed | |
any_changed = true | |
Core.Compiler.setindex!(ir.stmts[i], Any, :type) | |
end | |
for (i, inst) in enumerate(ir.stmts) | |
@static if VERSION < v"1.11" | |
changed, next = rewrite_inst(inst[:inst], ir, interp) | |
Core.Compiler.setindex!(ir.stmts[i], next, :inst) | |
else | |
changed, next = rewrite_inst(inst[:stmt], ir, interp) | |
Core.Compiler.setindex!(ir.stmts[i], next, :stmt) | |
end | |
if changed | |
any_changed = true | |
Core.Compiler.setindex!(ir.stmts[i], Any, :type) | |
end |
This still has some more work to go before it is fully working, but it improves the errors and the current state (and resolves other interpreter issues, and is a prerequisite for a stack of downstream optimizations). I'm going to go ahead and merge for now and continue in follow-ups.
No description provided.