one-shot-bufferize pass generates memref.alloc()s in GPU kernel code and breaks the pipeline (#360)
I think the core problem is that there's a … This …

… any other options that I forgot to mention? @kurapov-peter
Right, the fill should be handled specially to avoid unnecessary allocations. In most cases, those should actually be placed onto registers and fall back to SLM when we don't have enough. Hoisting allocations for all the groups can provide functional correctness, yet there's little value in it as it'd produce some dead-slow kernels (even though there might be cases where that's necessary but I'd rather see them first).
For simple cases such as MLP, the latter should suffice, as you mention. For generic cases, we'll need additional handling similar to, for example, what IREE does with multi-buffering and other optimizations. SLM allocations/deallocations should adhere to semantic restrictions and land into the …
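For a sense of what the preferred placement could look like: the per-work-item temporaries in the reproducers below are tiny (an 8x16xf16 accumulator), so a private, stack-like allocation that later vectorization can promote to registers is one plausible shape. This is only an illustrative sketch under that assumption, not the pipeline's actual output; the function name and alignment are made up.

```mlir
func.func @private_acc_sketch() {
  %cst = arith.constant 0.000000e+00 : f16
  // Per-work-item accumulator as a private alloca instead of a memref.alloc;
  // a buffer this small can typically be promoted to registers after
  // vectorization, with SLM only as the fallback when it does not fit.
  %acc = memref.alloca() {alignment = 64 : i64} : memref<8x16xf16>
  linalg.fill ins(%cst : f16) outs(%acc : memref<8x16xf16>)
  return
}
```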
Using upstream passes for the fusion (…):

Reproducer after applying the passes above:

```mlir
#map = affine_map<(d0, d1) -> (d0, d1)>
#map1 = affine_map<(d0, d1, d2) -> (d0, d2)>
#map2 = affine_map<(d0, d1, d2) -> (d1, d2)>
#map3 = affine_map<(d0, d1, d2) -> (d0, d1)>
module @fragment_name attributes {"#dlti.sys_spec" = #dlti.target_system_spec<"CPU" : #dlti.target_device_spec<#dlti.dl_entry<"tile_size", 32 : i32>, #dlti.dl_entry<"num_threads", 4 : i32>, #dlti.dl_entry<"L1_cache_size_in_bytes", 49152 : i32>, #dlti.dl_entry<"L2_cache_size_in_bytes", 2097152 : i32>, #dlti.dl_entry<"L3_cache_size_in_bytes", 1966080 : i32>, #dlti.dl_entry<"max_vector_width", 512 : i32>>>} {
func.func @entry(%arg0: memref<128x1024xf16>, %arg1: memref<1024x1024xf16>, %arg2: memref<128x1024xf16>, %arg3: memref<128x1024xf16>) attributes {compiletime_const_args_index = [1 : i32, 2 : i32]} {
%cst = arith.constant 0.000000e+00 : f16
%0 = bufferization.to_tensor %arg0 restrict : memref<128x1024xf16>
%1 = bufferization.to_tensor %arg1 restrict : memref<1024x1024xf16>
%2 = bufferization.to_tensor %arg2 restrict : memref<128x1024xf16>
%3 = tensor.empty() : tensor<128x1024xf16>
%4 = linalg.generic {indexing_maps = [#map], iterator_types = ["parallel", "parallel"]} outs(%3 : tensor<128x1024xf16>) {
^bb0(%out: f16):
linalg.yield %cst : f16
} -> tensor<128x1024xf16>
%5 = linalg.generic {indexing_maps = [#map1, #map2, #map3], iterator_types = ["parallel", "parallel", "reduction"]} ins(%0, %1 : tensor<128x1024xf16>, tensor<1024x1024xf16>) outs(%4 : tensor<128x1024xf16>) {
^bb0(%in: f16, %in_0: f16, %out: f16):
%8 = arith.mulf %in, %in_0 : f16
%9 = arith.addf %out, %8 : f16
linalg.yield %9 : f16
} -> tensor<128x1024xf16>
%6 = tensor.empty() : tensor<128x1024xf16>
%7 = linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel", "parallel"]} ins(%5, %2 : tensor<128x1024xf16>, tensor<128x1024xf16>) outs(%6 : tensor<128x1024xf16>) {
^bb0(%in: f16, %in_0: f16, %out: f16):
%8 = arith.addf %in, %in_0 : f16
%9 = arith.maximumf %8, %cst : f16
linalg.yield %9 : f16
} -> tensor<128x1024xf16>
bufferization.materialize_in_destination %7 in restrict writable %arg3 : (tensor<128x1024xf16>, memref<128x1024xf16>) -> ()
return
}
}
```
MLIR after specializing generic ops:

```mlir
#map = affine_map<(d0, d1) -> (d0, d1)>
module @fragment_name attributes {"#dlti.sys_spec" = #dlti.target_system_spec<"CPU" : #dlti.target_device_spec<#dlti.dl_entry<"tile_size", 32 : i32>, #dlti.dl_entry<"num_threads", 4 : i32>, #dlti.dl_entry<"L1_cache_size_in_bytes", 49152 : i32>, #dlti.dl_entry<"L2_cache_size_in_bytes", 2097152 : i32>, #dlti.dl_entry<"L3_cache_size_in_bytes", 1966080 : i32>, #dlti.dl_entry<"max_vector_width", 512 : i32>>>} {
func.func @entry(%arg0: memref<128x1024xf16>, %arg1: memref<1024x1024xf16>, %arg2: memref<128x1024xf16>, %arg3: memref<128x1024xf16>) attributes {compiletime_const_args_index = [1 : i32, 2 : i32]} {
%cst = arith.constant 0.000000e+00 : f16
%0 = bufferization.to_tensor %arg0 restrict : memref<128x1024xf16>
%1 = bufferization.to_tensor %arg1 restrict : memref<1024x1024xf16>
%2 = bufferization.to_tensor %arg2 restrict : memref<128x1024xf16>
%3 = tensor.empty() : tensor<128x1024xf16>
%4 = linalg.generic {indexing_maps = [#map], iterator_types = ["parallel", "parallel"]} outs(%3 : tensor<128x1024xf16>) {
^bb0(%out: f16):
linalg.yield %cst : f16
} -> tensor<128x1024xf16>
%5 = linalg.matmul_transpose_b ins(%0, %1 : tensor<128x1024xf16>, tensor<1024x1024xf16>) outs(%4 : tensor<128x1024xf16>) -> tensor<128x1024xf16>
%6 = tensor.empty() : tensor<128x1024xf16>
%7 = linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel", "parallel"]} ins(%5, %2 : tensor<128x1024xf16>, tensor<128x1024xf16>) outs(%6 : tensor<128x1024xf16>) {
^bb0(%in: f16, %in_0: f16, %out: f16):
%8 = arith.addf %in, %in_0 : f16
%9 = arith.maximumf %8, %cst : f16
linalg.yield %9 : f16
} -> tensor<128x1024xf16>
bufferization.materialize_in_destination %7 in restrict writable %arg3 : (tensor<128x1024xf16>, memref<128x1024xf16>) -> ()
return
}
}
```

The first of the remaining generics is a … The only way to generalize the second generic is to replace it with … @kurapov-peter, I'm wondering what our next steps should be. Are we okay with having some linalg ops as generics during our pipeline (I thought you said we should rather have named ops)? How should we handle …?

UPD: discussed offline and decided to gradually apply …
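For illustration only: the first remaining generic just yields a constant into an empty tensor, so it has a direct named-op form, and the second (add + ReLU) generic could be decomposed as sketched below, assuming the upstream elementwise named ops linalg.add and linalg.max are available in the toolchain. The function and value names are made up; this is not necessarily the decomposition that was decided on.

```mlir
func.func @specialize_sketch(%matmul: tensor<128x1024xf16>,
                             %bias: tensor<128x1024xf16>) -> tensor<128x1024xf16> {
  %cst = arith.constant 0.000000e+00 : f16
  %empty = tensor.empty() : tensor<128x1024xf16>
  // First generic: broadcasts a constant into an empty tensor, i.e. a fill.
  %zeros = linalg.fill ins(%cst : f16) outs(%empty : tensor<128x1024xf16>) -> tensor<128x1024xf16>
  // Second generic: elementwise add followed by an elementwise max against
  // the zero tensor (ReLU).
  %sum = linalg.add ins(%matmul, %bias : tensor<128x1024xf16>, tensor<128x1024xf16>)
                    outs(%empty : tensor<128x1024xf16>) -> tensor<128x1024xf16>
  %relu = linalg.max ins(%sum, %zeros : tensor<128x1024xf16>, tensor<128x1024xf16>)
                     outs(%empty : tensor<128x1024xf16>) -> tensor<128x1024xf16>
  return %relu : tensor<128x1024xf16>
}
```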
Unfortunately this won't solve the problem completely. Besides the case with …, it appears that in the case of nested looping the …

Simple matmul module with nested tiling:

```mlir
func.func @entry(%arg0: memref<64x128xf16>, %arg1: memref<128x128xf16>, %arg2: memref<64x128xf16>, %arg3: memref<i8>) {
%cst = arith.constant 0.000000e+00 : f16
%alloc = memref.alloc() {alignment = 64 : i64} : memref<64x128xf16>
scf.forall (%arg4, %arg5) = (0, 0) to (64, 128) step (64, 128) {
%subview = memref.subview %arg0[%arg4, 0] [64, 128] [1, 1] : memref<64x128xf16> to memref<64x128xf16, strided<[128, 1], offset: ?>>
%subview_0 = memref.subview %arg1[%arg5, 0] [128, 128] [1, 1] : memref<128x128xf16> to memref<128x128xf16, strided<[128, 1], offset: ?>>
%subview_1 = memref.subview %alloc[%arg4, %arg5] [64, 128] [1, 1] : memref<64x128xf16> to memref<64x128xf16, strided<[128, 1], offset: ?>>
%subview_2 = memref.subview %arg0[%arg4, %arg5] [64, 128] [1, 1] : memref<64x128xf16> to memref<64x128xf16, strided<[128, 1], offset: ?>>
%subview_3 = memref.subview %arg2[%arg4, %arg5] [64, 128] [1, 1] : memref<64x128xf16> to memref<64x128xf16, strided<[128, 1], offset: ?>>
scf.forall (%arg6, %arg7) = (0, 0) to (64, 128) step (8, 16) {
%subview_5 = memref.subview %subview[%arg6, 0] [8, 128] [1, 1] : memref<64x128xf16, strided<[128, 1], offset: ?>> to memref<8x128xf16, strided<[128, 1], offset: ?>>
%subview_6 = memref.subview %subview_0[%arg7, 0] [16, 128] [1, 1] : memref<128x128xf16, strided<[128, 1], offset: ?>> to memref<16x128xf16, strided<[128, 1], offset: ?>>
%subview_7 = memref.subview %subview_1[%arg6, %arg7] [8, 16] [1, 1] : memref<64x128xf16, strided<[128, 1], offset: ?>> to memref<8x16xf16, strided<[128, 1], offset: ?>>
// allocating buffer for 'linalg.fill' result (also matmul result)
%alloc_8 = memref.alloc() {alignment = 64 : i64} : memref<8x16xf16>
linalg.fill ins(%cst : f16) outs(%alloc_8 : memref<8x16xf16>)
linalg.matmul_transpose_b ins(%subview_5, %subview_6 : memref<8x128xf16, strided<[128, 1], offset: ?>>, memref<16x128xf16, strided<[128, 1], offset: ?>>) outs(%alloc_8 : memref<8x16xf16>)
%subview_9 = memref.subview %subview_2[%arg6, %arg7] [8, 16] [1, 1] : memref<64x128xf16, strided<[128, 1], offset: ?>> to memref<8x16xf16, strided<[128, 1], offset: ?>>
%subview_10 = memref.subview %subview_3[%arg6, %arg7] [8, 16] [1, 1] : memref<64x128xf16, strided<[128, 1], offset: ?>> to memref<8x16xf16, strided<[128, 1], offset: ?>>
linalg.add ins(%alloc_8, %subview_9 : memref<8x16xf16>, memref<8x16xf16, strided<[128, 1], offset: ?>>) outs(%subview_10 : memref<8x16xf16, strided<[128, 1], offset: ?>>)
%subview_11 = memref.subview %subview_3[%arg6, %arg7] [8, 16] [1, 1] : memref<64x128xf16, strided<[128, 1], offset: ?>> to memref<8x16xf16, strided<[128, 1], offset: ?>>
memref.copy %subview_10, %subview_11 : memref<8x16xf16, strided<[128, 1], offset: ?>> to memref<8x16xf16, strided<[128, 1], offset: ?>>
}
%subview_4 = memref.subview %arg2[%arg4, %arg5] [64, 128] [1, 1] : memref<64x128xf16> to memref<64x128xf16, strided<[128, 1], offset: ?>>
memref.copy %subview_3, %subview_4 : memref<64x128xf16, strided<[128, 1], offset: ?>> to memref<64x128xf16, strided<[128, 1], offset: ?>>
}
memref.copy %arg2, %arg2 : memref<64x128xf16> to memref<64x128xf16>
return
}
```

@kurapov-peter, what do you think we should do about this problem right now? Will lowering …?
I think it's time to address the problem. Let's start with a pass that would put the allocas to SLM.
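A minimal sketch of what such a pass could produce for the small per-tile buffer in the dump above, assuming SLM is modeled with the upstream #gpu.address_space<workgroup> attribute (the actual pass may use a different memory-space encoding, and the function name here is made up):

```mlir
func.func @slm_acc_sketch() {
  %cst = arith.constant 0.000000e+00 : f16
  // Per-tile accumulator placed in shared local memory (workgroup address
  // space) instead of a default memref.alloc.
  %acc = memref.alloc() {alignment = 64 : i64}
      : memref<8x16xf16, #gpu.address_space<workgroup>>
  linalg.fill ins(%cst : f16) outs(%acc : memref<8x16xf16, #gpu.address_space<workgroup>>)
  return
}
```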
The input code is as follows, and insertGPUAllocs cannot deal with the following case properly. The dealloc
"gpu.dealloc"(%51) : (memref<16x16xf16>) -> ()
should be inserted inside the kernel code, but it is currently inserted outside.

Reproducer:
gc-opt --gc-gpu-pipeline file.mlir
File.mlir:
Long log: test.txt