Replies: 9 comments 17 replies
-
So the first complexity here is that MIOpen currently expects to be able to ask us for "the kernel count" given, IIRC, a problem description. The second is that we might, for example, have MIGraphX doing our execution: they do their own kernel launches, so what they want is "here's a binary, here are the calls you need to make". I think both of our clients also assume all our kernels share a function signature. If we end up with more complicated graph logic, then we need to expose that to our clients, including the rather strict (IIRC) requirement from MIOpen that kernels be provided in order. And, when we're generating test code, we generate the host harness in ... It's all probably doable, but it's a reworking of the infrastructure. The general thing I'd think of is that we'd want to generate the graph of kernel invocations and either have a way to turn that into host code (for our testing) or to feed it back to clients (for their execution).
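To make the current contract concrete, here is a hedged sketch (all names invented, not actual generated code) of what clients effectively assume today: a flat, ordered list of kernels that all take the same argument list, which the client launches back to back after asking for the kernel count.

```mlir
// Sketch of the assumed client-facing contract only: the client asks
// "how many kernels?", gets N, and launches them in order, passing the
// same arguments to each.
gpu.module @generated_kernels {
  gpu.func @op_kernel_0(%a: memref<128xf32>, %b: memref<128xf32>) kernel {
    gpu.return
  }
  gpu.func @op_kernel_1(%a: memref<128xf32>, %b: memref<128xf32>) kernel {
    gpu.return
  }
}
```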
-
For the discussion on 20/01/2023:
An example of a 1:multi-kernel scenario:

```mlir
#map0 = affine_map<(d0, d1, d2) -> (d0, d1, d2)>
#map1 = affine_map<(d0, d1, d2) -> (d0, d1)>
module {
  func.func private @test_reduce__part_0(%arg0: memref<2x3x1xf32> {func.write_access}) {
    %cst = arith.constant 0.000000e+00 : f32
    linalg.fill ins(%cst : f32) outs(%arg0 : memref<2x3x1xf32>)
    return
  }
  func.func private @test_reduce__part_1(%arg0: memref<2x3x40xf32> {func.read_access}, %arg1: memref<2x3x1xf32> {func.read_access, func.write_access}) {
    %0 = memref.collapse_shape %arg1 [[0], [1, 2]] : memref<2x3x1xf32> into memref<2x3xf32>
    linalg.generic {indexing_maps = [#map0, #map1], iterator_types = ["parallel", "parallel", "reduction"]} ins(%arg0 : memref<2x3x40xf32>) outs(%0 : memref<2x3xf32>) {
    ^bb0(%arg2: f32, %arg3: f32):
      %2 = arith.addf %arg2, %arg3 : f32
      linalg.yield %2 : f32
    }
    return
  }
  func.func @test_reduce(%arg0: memref<2x3x40xf32>, %arg1: memref<2x3x1xf32>) attributes {arch = ""} {
    %token0 = async.launch @test_reduce__part_0 (%arg1) : (memref<2x3x1xf32>) -> ()
    %token1 = async.launch @test_reduce__part_1 [%token0] (%arg0, %arg1) : (memref<2x3x40xf32>, memref<2x3x1xf32>) -> ()
    async.await %token1 : !async.token
    return
  }
  module @__xmodule_gfx90a attributes {xmodel.arch = "gfx90a", xmodel.module} {
    func.func private @test_reduce__part_0(%arg0: memref<2x3x1xf32> {func.write_access}) attributes {kernel, original_func = @test_reduce__part_0, grid_size = 1, block_size = 256} {
      rock.zero_init_kernel %arg0 {arch = "", blockSize = 256 : i32, elemsPerThread = 1 : index, gridSize = 1 : i32} : memref<2x3x1xf32>
      return
    }
    func.func private @test_reduce__part_1(%arg0: memref<2x3x40xf32> {func.read_access}, %arg1: memref<2x3x1xf32> {func.read_access, func.write_access}) attributes {kernel, original_func = @test_reduce__part_1, grid_size = 1, block_size = 256} {
      rock.reduce sum(%arg0, %arg1) {axis = 2 : index, blockSize = 256 : i32, gridSize = 1 : i32} : (memref<2x3x40xf32>, memref<2x3x1xf32>)
      return
    }
  }
}
```
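For testing, something would still have to turn those async.launch ops into host code. A hedged guess at what that could look like using the upstream gpu dialect's async launch form (the constants and the way the kernel module is referenced are assumptions, not what the compiler currently emits):

```mlir
// Assumed lowering sketch only: two ordered launches with a token dependency.
%c1 = arith.constant 1 : index
%c256 = arith.constant 256 : index
%t0 = gpu.launch_func async @__xmodule_gfx90a::@test_reduce__part_0
        blocks in (%c1, %c1, %c1) threads in (%c256, %c1, %c1)
        args(%arg1 : memref<2x3x1xf32>)
%t1 = gpu.launch_func async [%t0] @__xmodule_gfx90a::@test_reduce__part_1
        blocks in (%c1, %c1, %c1) threads in (%c256, %c1, %c1)
        args(%arg0 : memref<2x3x40xf32>, %arg1 : memref<2x3x1xf32>)
gpu.wait [%t1]
```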
-
Ok, so, a very rough proposal for what we should do. Instead of the current structure, something like this (edited to show a token argument and memory ops):

```mlir
%token, [%result if in a tensor context] = async.offloadable %token1(%token2, %in1, %in2) [-> return type]
  cpu(%token0, %arg0, %arg1) {
    %token = async.launch %token0, @cpu_kernel_func
    // or
    %token = async.execute %token0, { tosa.conv2d }
    // or
    async.yield // if any
  }
  gpu(%token0, %arg0, %arg1) {
    %token_alloc_in, %alloc_in = alloc_device_memory()
    %token_alloc_out, %alloc_out = alloc_device_memory()
    %token_copy = async.device_copy %token_alloc_in, %token0, %arg0 to %alloc_in
    %token_launch = async.bundled_launch %token_alloc_out, %token_copy @gpu_kernel_func
    // or
    %token_launch = async.bundled_launch {targets = {"amdgcn-amd-amdhsa:gfx90a" = {kernel_func = @kernel_func_gfx90a}, "amdgcn-amd-amdhsa:gfx908" = {kernel_func = @gpu_kernel_func_gfx908}}}
    // these'll lower to a bunch of compilations, one per GPU target, and eventually become
    %token_launch = async.bundled_launch %token_alloc_out, %token_copy {targets = {"amdgcn-amd-amdhsa:gfx90a" => {blockSize = B, gridSize = G, binary = <long string here>}, "amdgcn-amd-amdhsa:gfx908:..."}} // or `gpu.bundled_launch`
    %token_copy_back = async.device_copy %token_launch %alloc_out, %arg1
    async.yield %token_copy_back
  } ... (more targets)
  // and maybe even
  original(%token0, %arg0, %arg1) {
    async.await %token0
    tosa.conv2d
    async.yield %fake_token
  }
```
-
I don't want ... Also, your proposed module setup is something I don't like: I want to keep the common structure for GPU code and then have a bundled launch.
-
**Update from the upstream merge:** Just noticed new ops in the Async dialect: llvm/llvm-project@145d2a5#diff-6eff94f8f04c403c11293b0077c3e734581bffdaf730621ddb0a6afb61cfad9b
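For comparison with the proposals above, here is a minimal reminder of how the pre-existing upstream async ops compose (this is the standard execute/yield/await pattern, not the new ops from the linked commit):

```mlir
// Existing upstream async dialect usage, shown for reference only.
%token, %result = async.execute -> !async.value<f32> {
  %cst = arith.constant 1.0 : f32
  async.yield %cst : f32
}
// Block until the async region has produced its value.
%value = async.await %result : !async.value<f32>
async.await %token : !async.token
```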
-
Ok, so, to pull things out from the discussion with Manupa, who pointed out some of the complexities of the GPU case, how about this:

```mlir
%tokens, ... = async.offloadable cpu {
  async.await %in_ready
  ...
  async.yield %finished_computation
} "amdgcn-amd-amdhsa:gfx90a" {
  %dev_in_ready, %dev_in = alloc()
  %dev_out_ready, %dev_out = alloc()
  %memcpy_ready = memcpy [%dev_in_ready, %in_ready] %in, %dev_in
  %launch_done = [gpu.launch/async.launch/...] {block_size = N, grid_size = G, ...} [%memcpy_ready, %dev_out_ready] @gpu_kernel_gfx90a(%dev_in, %dev_out) // or @gfx90a::@gpu_kernel
  %done = memcpy [%launch_done] %dev_out, %out
  async.yield %done
} "amdgcn-amd-amdhsa:gfx1030" {
  // same as above
}
```

and then, if you needed to add an extra kernel, you'd rewrite the "amdgcn-amd-amdhsa:gfx90a" region to:

```mlir
"amdgcn-amd-amdhsa:gfx90a" {
  %dev_in_ready, %dev_in = alloc()
  %dev_out_ready, %dev_out = alloc()
  %zero_init_ready = launch [%dev_out_ready] @gfx90a::@memset_kernel(%dev_out)
  %memcpy_ready = memcpy [%dev_in_ready, %in_ready] %in, %dev_in
  %launch_done = [gpu.launch/async.launch/...] {block_size = N, grid_size = G, ...} [%memcpy_ready, %zero_init_ready] @gpu_kernel_gfx90a(%dev_in, %dev_out) // or @gfx90a::@gpu_kernel
  %done = memcpy [%launch_done] %dev_out, %out
  async.yield %done
}
```
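One possible reading of that gfx90a region in terms of today's upstream gpu dialect async ops, as a hedged sketch: the shapes, launch sizes, and the @gfx90a_kernels module holding @memset_kernel and @gpu_kernel are all assumptions, and the enclosing module would need the gpu.container_module attribute.

```mlir
// Sketch only: allocate device buffers, zero-init the output, copy the
// input over, run the main kernel, copy the result back.
%c1 = arith.constant 1 : index
%c256 = arith.constant 256 : index
%dev_in, %t_in = gpu.alloc async () : memref<2x3x40xf32>
%dev_out, %t_out = gpu.alloc async () : memref<2x3x1xf32>
%t_zero = gpu.launch_func async [%t_out] @gfx90a_kernels::@memset_kernel
            blocks in (%c1, %c1, %c1) threads in (%c256, %c1, %c1)
            args(%dev_out : memref<2x3x1xf32>)
%t_copy = gpu.memcpy async [%t_in] %dev_in, %in : memref<2x3x40xf32>, memref<2x3x40xf32>
%t_launch = gpu.launch_func async [%t_copy, %t_zero] @gfx90a_kernels::@gpu_kernel
              blocks in (%c1, %c1, %c1) threads in (%c256, %c1, %c1)
              args(%dev_in : memref<2x3x40xf32>, %dev_out : memref<2x3x1xf32>)
%t_done = gpu.memcpy async [%t_launch] %out, %dev_out : memref<2x3x1xf32>, memref<2x3x1xf32>
gpu.wait [%t_done]
```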
-
In further notes, given that we want a target's compilation to mess with the relevant code, how's this: ...
where the ...
-
So, now that ...
-
I think we've come across several scenarios in which we want to generate multiple kernels to translate a higher-level op (e.g. init kernels for reductions, backward convs). However, the current API coupling is such that the client knows how many kernels we generate and how to call them -- which works for now.
I would like to ask: what if we always generate a three-module structure?
Currently, for a tosa-based graph of operators we generate:
A1) host module: handles the calls to the kernel for each op.
A2) kernel module: one kernel per op.
Instead, what if we move to:
B1) host module: handles calls to a host function representing each (fused) op.
B2) (fused) op host module: orchestrates the kernels corresponding to a single (fused) op.
B3) kernel module: can contain one or more kernel functions.
So primarily, the compilation will work with B2) and B3), and I would think that in most cases B2) would just contain a single gpu.launch (or equivalent) -- see the sketch below.
I'm sure some of you have thought about this before, and I would like to know what people think about this (or why it is not a good idea).
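A hedged sketch of what that B1/B2/B3 split could look like, reusing the reduce example from earlier in the thread; every function name, shape, and launch size here is illustrative, not a proposed interface.

```mlir
// Illustrative only: B1 host module, B2 per-(fused)-op host function,
// B3 kernel module. Device memory management is elided.
module attributes {gpu.container_module} {
  // B1: host module, one call per (fused) op.
  func.func @main(%in: memref<2x3x40xf32>, %out: memref<2x3x1xf32>) {
    call @test_reduce_host(%in, %out) : (memref<2x3x40xf32>, memref<2x3x1xf32>) -> ()
    return
  }
  // B2: host function orchestrating the kernels of one (fused) op;
  // usually a single launch, here two ordered ones.
  func.func private @test_reduce_host(%in: memref<2x3x40xf32>, %out: memref<2x3x1xf32>) {
    %c1 = arith.constant 1 : index
    %c256 = arith.constant 256 : index
    %t0 = gpu.launch_func async @test_reduce_kernels::@zero_init
            blocks in (%c1, %c1, %c1) threads in (%c256, %c1, %c1)
            args(%out : memref<2x3x1xf32>)
    %t1 = gpu.launch_func async [%t0] @test_reduce_kernels::@reduce
            blocks in (%c1, %c1, %c1) threads in (%c256, %c1, %c1)
            args(%in : memref<2x3x40xf32>, %out : memref<2x3x1xf32>)
    gpu.wait [%t1]
    return
  }
  // B3: kernel module with one or more kernel functions (bodies elided).
  gpu.module @test_reduce_kernels {
    gpu.func @zero_init(%arg0: memref<2x3x1xf32>) kernel {
      gpu.return
    }
    gpu.func @reduce(%arg0: memref<2x3x40xf32>, %arg1: memref<2x3x1xf32>) kernel {
      gpu.return
    }
  }
}
```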