Replies: 9 comments 17 replies
-
So the first complexity here is that MIOpen currently expects to be able to ask us for "the kernel count" given, IIRC, a problem description. The second is that we might, for example, have MIGraphX doing our execution: they do their own kernel launches, so what they want is "here's a binary, here are the calls you need to make". I think both of our clients also assume all our kernels share a function signature. If we end up with more complicated graph logic, then we need to expose that to our clients, including the rather strict (IIRC) requirement from MIOpen that kernels be provided in order. And, when we're generating test code, we generate the host harness in ... It's all probably doable, but it's a reworking of the infrastructure. The general thing I'd think of is that we'd want to generate the graph of kernel invocations and either have a way to turn that into host code (for our testing) or to feed it back to clients (for their execution).
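To make the current contract concrete, here is a hedged sketch (all names invented, not actual generated code) of what clients effectively assume today: a flat, ordered list of kernels that all take the same argument list, which the client launches back to back after asking for the kernel count.

```mlir
// Sketch of the assumed client-facing contract only: the client asks
// "how many kernels?", gets N, and launches them in order, passing the
// same arguments to each.
gpu.module @generated_kernels {
  gpu.func @op_kernel_0(%a: memref<128xf32>, %b: memref<128xf32>) kernel {
    gpu.return
  }
  gpu.func @op_kernel_1(%a: memref<128xf32>, %b: memref<128xf32>) kernel {
    gpu.return
  }
}
```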
-
For the discussion on 20/01/2023:
An example of a 1:multi-kernel scenario:

```mlir
#map0 = affine_map<(d0, d1, d2) -> (d0, d1, d2)>
#map1 = affine_map<(d0, d1, d2) -> (d0, d1)>
module {
  func.func private @test_reduce__part_0(%arg0: memref<2x3x1xf32> {func.write_access}) {
    %cst = arith.constant 0.000000e+00 : f32
    linalg.fill ins(%cst : f32) outs(%arg0 : memref<2x3x1xf32>)
    return
  }
  func.func private @test_reduce__part_1(%arg0: memref<2x3x40xf32> {func.read_access}, %arg1: memref<2x3x1xf32> {func.read_access, func.write_access}) {
    %0 = memref.collapse_shape %arg1 [[0], [1, 2]] : memref<2x3x1xf32> into memref<2x3xf32>
    linalg.generic {indexing_maps = [#map0, #map1], iterator_types = ["parallel", "parallel", "reduction"]} ins(%arg0 : memref<2x3x40xf32>) outs(%0 : memref<2x3xf32>) {
    ^bb0(%arg2: f32, %arg3: f32):
      %2 = arith.addf %arg2, %arg3 : f32
      linalg.yield %2 : f32
    }
    return
  }
  func.func @test_reduce(%arg0: memref<2x3x40xf32>, %arg1: memref<2x3x1xf32>) attributes {arch = ""} {
    %token0 = async.launch @test_reduce__part_0 (%arg1) : (memref<2x3x1xf32>) -> ()
    %token1 = async.launch @test_reduce__part_1 [%token0] (%arg0, %arg1) : (memref<2x3x40xf32>, memref<2x3x1xf32>) -> ()
    async.await %token1 : !async.token
    return
  }
  module @__xmodule_gfx90a attributes {xmodel.arch = "gfx90a", xmodel.module} {
    func.func private @test_reduce__part_0(%arg0: memref<2x3x1xf32> {func.write_access}) attributes {kernel, original_func = @test_reduce__part_0, grid_size = 1, block_size = 256} {
      rock.zero_init_kernel %arg0 {arch = "", blockSize = 256 : i32, elemsPerThread = 1 : index, gridSize = 1 : i32} : memref<2x3x1xf32>
      return
    }
    func.func private @test_reduce__part_1(%arg0: memref<2x3x40xf32> {func.read_access}, %arg1: memref<2x3x1xf32> {func.read_access, func.write_access}) attributes {kernel, original_func = @test_reduce__part_1, grid_size = 1, block_size = 256} {
      rock.reduce sum(%arg0, %arg1) {axis = 2 : index, blockSize = 256 : i32, gridSize = 1 : i32} : (memref<2x3x40xf32>, memref<2x3x1xf32>)
      return
    }
  }
}
```
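For testing, something would still have to turn those async.launch ops into host code. A hedged guess at what that could look like using the upstream gpu dialect's async launch form (the constants and the way the kernel module is referenced are assumptions, not what the compiler currently emits):

```mlir
// Assumed lowering sketch only: two ordered launches with a token dependency.
%c1 = arith.constant 1 : index
%c256 = arith.constant 256 : index
%t0 = gpu.launch_func async @__xmodule_gfx90a::@test_reduce__part_0
        blocks in (%c1, %c1, %c1) threads in (%c256, %c1, %c1)
        args(%arg1 : memref<2x3x1xf32>)
%t1 = gpu.launch_func async [%t0] @__xmodule_gfx90a::@test_reduce__part_1
        blocks in (%c1, %c1, %c1) threads in (%c256, %c1, %c1)
        args(%arg0 : memref<2x3x40xf32>, %arg1 : memref<2x3x1xf32>)
gpu.wait [%t1]
```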
-
Ok, so, a very rough proposal for what we should do. Instead of the current structure, something like this (edited to show a token argument and memory ops):

```mlir
%token, [%result if in a tensor context] = async.offloadable %token1(%token2, %in1, %in2) [-> return type]
  cpu(%token0, %arg0, %arg1) {
    %token = async.launch %token0, @cpu_kernel_func
    // or
    %token = async.execute %token0, { tosa.conv2d }
    // or
    async.yield // if any
  }
  gpu(%token0, %arg0, %arg1) {
    %token_alloc_in, %alloc_in = alloc_device_memory()
    %token_alloc_out, %alloc_out = alloc_device_memory()
    %token_copy = async.device_copy %token_alloc_in, %token0, %arg0 to %alloc_in
    %token_launch = async.bundled_launch %token_alloc_out, %token_copy @gpu_kernel_func
    // or
    %token_launch = async.bundled_launch {targets = {"amdgcn-amd-amdhsa:gfx90a" = {kernel_func = @kernel_func_gfx90a}, "amdgcn-amd-amdhsa:gfx908" = {kernel_func = @gpu_kernel_func_gfx908}}}
    // these'll lower to a bunch of compilations, one per GPU target, and eventually become
    %token_launch = async.bundled_launch %token_alloc_out, %token_copy {targets = {"amdgcn-amd-amdhsa:gfx90a" => {blockSize = B, gridSize = G, binary = <long string here>}, "amdgcn-amd-amdhsa:gfx908:..."}} // or `gpu.bundled_launch`
    %token_copy_back = async.device_copy %token_launch %alloc_out, %arg1
    async.yield %token_copy_back
  } ... (more targets)
  // and maybe even
  original(%token0, %arg0, %arg1) {
    async.await %token0
    tosa.conv2d
    async.yield %fake_token
  }
```
-
I don't want ... Also, your proposed module setup is something I don't like: I want to keep the common structure for GPU code and then have a bundled launch.
-
**Update from the upstream merge:** Just noticed new ops in the Async dialect: llvm/llvm-project@145d2a5#diff-6eff94f8f04c403c11293b0077c3e734581bffdaf730621ddb0a6afb61cfad9b
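For comparison with the proposals above, here is a minimal reminder of how the pre-existing upstream async ops compose (this is the standard execute/yield/await pattern, not the new ops from the linked commit):

```mlir
// Existing upstream async dialect usage, shown for reference only.
%token, %result = async.execute -> !async.value<f32> {
  %cst = arith.constant 1.0 : f32
  async.yield %cst : f32
}
// Block until the async region has produced its value.
%value = async.await %result : !async.value<f32>
async.await %token : !async.token
```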
-
Ok, so, to pull things out from the discussion with Manupa, who pointed out some of the complexities of the GPU case, how about this:

```mlir
%tokens, ... = async.offloadable cpu {
  async.await %in_ready
  ...
  async.yield %finished_computation
} "amdgcn-amd-amdhsa:gfx90a" {
  %dev_in_ready, %dev_in = alloc()
  %dev_out_ready, %dev_out = alloc()
  %memcpy_ready = memcpy [%dev_in_ready, %in_ready] %in, %dev_in
  %launch_done = [gpu.launch/async.launch/...] {block_size = N, grid_size = G, ...} [%memcpy_ready, %dev_out_ready] @gpu_kernel_gfx90a(%dev_in, %dev_out) // or @gfx90a::@gpu_kernel
  %done = memcpy [%launch_done] %dev_out, %out
  async.yield %done
} "amdgcn-amd-amdhsa:gfx1030" {
  // same as above
}
```

and then, if you needed to add an extra kernel, you'd rewrite the "amdgcn-amd-amdhsa:gfx90a" region to:

```mlir
"amdgcn-amd-amdhsa:gfx90a" {
  %dev_in_ready, %dev_in = alloc()
  %dev_out_ready, %dev_out = alloc()
  %zero_init_ready = launch [%dev_out_ready] @gfx90a::@memset_kernel(%dev_out)
  %memcpy_ready = memcpy [%dev_in_ready, %in_ready] %in, %dev_in
  %launch_done = [gpu.launch/async.launch/...] {block_size = N, grid_size = G, ...} [%memcpy_ready, %zero_init_ready] @gpu_kernel_gfx90a(%dev_in, %dev_out) // or @gfx90a::@gpu_kernel
  %done = memcpy [%launch_done] %dev_out, %out
  async.yield %done
}
```
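One possible reading of that gfx90a region in terms of today's upstream gpu dialect async ops, as a hedged sketch: the shapes, launch sizes, and the @gfx90a_kernels module holding @memset_kernel and @gpu_kernel are all assumptions, and the enclosing module would need the gpu.container_module attribute.

```mlir
// Sketch only: allocate device buffers, zero-init the output, copy the
// input over, run the main kernel, copy the result back.
%c1 = arith.constant 1 : index
%c256 = arith.constant 256 : index
%dev_in, %t_in = gpu.alloc async () : memref<2x3x40xf32>
%dev_out, %t_out = gpu.alloc async () : memref<2x3x1xf32>
%t_zero = gpu.launch_func async [%t_out] @gfx90a_kernels::@memset_kernel
            blocks in (%c1, %c1, %c1) threads in (%c256, %c1, %c1)
            args(%dev_out : memref<2x3x1xf32>)
%t_copy = gpu.memcpy async [%t_in] %dev_in, %in : memref<2x3x40xf32>, memref<2x3x40xf32>
%t_launch = gpu.launch_func async [%t_copy, %t_zero] @gfx90a_kernels::@gpu_kernel
              blocks in (%c1, %c1, %c1) threads in (%c256, %c1, %c1)
              args(%dev_in : memref<2x3x40xf32>, %dev_out : memref<2x3x1xf32>)
%t_done = gpu.memcpy async [%t_launch] %out, %dev_out : memref<2x3x1xf32>, memref<2x3x1xf32>
gpu.wait [%t_done]
```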
-
In further notes, given that we want a target's compilation to mess with the relevant code, how's this: ...
where the ...
-
So, now that ...
-
I think we've come across several scenarios in which we want to generate multiple kernels to translate a higher-level op (e.g. init kernels for reductions, backward convs). However, the current API coupling is such that the client knows how many kernels we generate and how to call them -- which works for now.
I would like to ask: what if we always generate a three-module structure?
Currently, for a tosa-based graph of operators we generate:
A1) host module: handles the calls to the kernel for each op.
A2) kernel module: one kernel per op.
Instead, what if we move to:
B1) host module: handles calls to a host function representing each (fused) op.
B2) (fused) op host module: orchestrates the kernels corresponding to a single (fused) op.
B3) kernel module: can contain one or more kernel functions.
So primarily, the compilation will work with B2) and B3), and I would think that in most cases B2) would just contain a single gpu.launch (or equivalent) -- see the sketch below.
I'm sure some of you have thought about this before, and I would like to know what people think about this (or why it is not a good idea).
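A hedged sketch of what that B1/B2/B3 split could look like, reusing the reduce example from earlier in the thread; every function name, shape, and launch size here is illustrative, not a proposed interface.

```mlir
// Illustrative only: B1 host module, B2 per-(fused)-op host function,
// B3 kernel module. Device memory management is elided.
module attributes {gpu.container_module} {
  // B1: host module, one call per (fused) op.
  func.func @main(%in: memref<2x3x40xf32>, %out: memref<2x3x1xf32>) {
    call @test_reduce_host(%in, %out) : (memref<2x3x40xf32>, memref<2x3x1xf32>) -> ()
    return
  }
  // B2: host function orchestrating the kernels of one (fused) op;
  // usually a single launch, here two ordered ones.
  func.func private @test_reduce_host(%in: memref<2x3x40xf32>, %out: memref<2x3x1xf32>) {
    %c1 = arith.constant 1 : index
    %c256 = arith.constant 256 : index
    %t0 = gpu.launch_func async @test_reduce_kernels::@zero_init
            blocks in (%c1, %c1, %c1) threads in (%c256, %c1, %c1)
            args(%out : memref<2x3x1xf32>)
    %t1 = gpu.launch_func async [%t0] @test_reduce_kernels::@reduce
            blocks in (%c1, %c1, %c1) threads in (%c256, %c1, %c1)
            args(%in : memref<2x3x40xf32>, %out : memref<2x3x1xf32>)
    gpu.wait [%t1]
    return
  }
  // B3: kernel module with one or more kernel functions (bodies elided).
  gpu.module @test_reduce_kernels {
    gpu.func @zero_init(%arg0: memref<2x3x1xf32>) kernel {
      gpu.return
    }
    gpu.func @reduce(%arg0: memref<2x3x40xf32>, %arg1: memref<2x3x1xf32>) kernel {
      gpu.return
    }
  }
}
```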