-
Notifications
You must be signed in to change notification settings - Fork 30
End to end walk through
Welcome to the iree-amd-aie wiki!
Below is a demonstration of the end-to-end flow for compiling an ONNX model to AIE execution including custom dispatches.
Model ingestion involve legalization from a number of different source languages. Each of these languages can be cross-converted from one to another. Supporting a wide array of front ends is achieved by finding a fully representative path from the source language to the target. In the case of onnx
we target linalg
via torch
.
ONNX models are ingested using the onnx_importer, converting the ONNX binary file to a series of torch.operator
functions represented in the MLIR dialect. Below is a sample of this ingested form for a onnx.matmul
operation.
Run: onnx_importer.py file.onnx > onnx.mlir
Results
func.func @test_matmul_2d(%arg0: !torch.vtensor<[8,16],f32>, %arg1: !torch.vtensor<[16,8],f32>) -> !torch.vtensor<[8,8],f32> attributes {torch.onnx_meta.ir_version = 7 : si64, torch.onnx_meta.opset_version = 13 : si64} {
%0 = torch.operator "onnx.MatMul"(%arg0, %arg1) : (!torch.vtensor<[8,16],f32>, !torch.vtensor<[16,8],f32>) -> !torch.vtensor<[8,8],f32>
return %0 : !torch.vtensor<[8,8],f32>
}
Once ingested these onnx
operations can be translated into native torch.aten
operators. This allows onnx
to be lowered to alternative representations via the existing torch
conversion pipelines and avoids creating a custom onnx
pipeline.
torch-mlir-opt --convert-torch-onnx-to-torch onnx.mlir
Results
func.func @test_matmul_2d(%arg0: !torch.vtensor<[8,16],f32>, %arg1: !torch.vtensor<[16,8],f32>) -> !torch.vtensor<[8,8],f32> attributes {torch.onnx_meta.ir_version = 7 : si64, torch.onnx_meta.opset_version = 13 : si64} {
%0 = torch.aten.matmul %arg0, %arg1 : !torch.vtensor<[8,16],f32>, !torch.vtensor<[16,8],f32> -> !torch.vtensor<[8,8],f32>
return %0 : !torch.vtensor<[8,8],f32>
}
When targeting aie
we convert from the torch
dialect to our general computational dialect, linalg
. Unlike torch
, linalg
provides a semantic representation of what each operation does by focusing on composition instead of standalone operations.
torch-mlir-opt --convert-torch-to-linalg --torch-func-backend-type-conversion --cse --canonicalize --torch-finalizing-backend-type-conversion torch.mlir > linalg.mlir
Results
func.func @test_matmul_2d(%arg0: tensor<8x16xf32>, %arg1: tensor<16x8xf32>) -> tensor<8x8xf32> attributes {torch.onnx_meta.ir_version = 7 : si64, torch.onnx_meta.opset_version = 13 : si64, torch.onnx_meta.producer_name = "backend-test", torch.onnx_meta.producer_version = ""} {
%cst = arith.constant 0.000000e+00 : f32
%0 = tensor.empty() : tensor<8x8xf32>
%1 = linalg.fill ins(%cst : f32) outs(%0 : tensor<8x8xf32>) -> tensor<8x8xf32>
%2 = linalg.matmul ins(%arg0, %arg1 : tensor<8x16xf32>, tensor<16x8xf32>) outs(%1 : tensor<8x8xf32>) -> tensor<8x8xf32>
return %2 : tensor<8x8xf32>
}
SAMPLES_DIR=<iree-amd-aie source dir>/tests/samples
iree-compile --iree-hal-target-backends=amd-aie \
${SAMPLES_DIR}/simple_pack_pipeline_e2e.mlir \
--iree-amd-aie-peano-install-dir=<peano installation directory> \
--iree-amd-aie-mlir-aie-install-dir=<mlir-aie installation directory> \
--iree-amd-aie-install-dir=<iree installation directory>> \
--iree-amd-aie-vitis-install-dir=<vitis installation directory> \
--iree-hal-dump-executable-files-to=$PWD \
--iree-amd-aie-show-invoked-commands -o output.vmfb
In this example we show a a matrix multiplication between a tensor<8x16xi32>
and a tensor<16x8xi32>
resulting in a tensor<8x8xi32>
.
Results
module {
func.func @matmul_static_dispatch_0_matmul_8x8x16_i32() {
%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
%c2 = arith.constant 2 : index
%c16 = arith.constant 16 : index
%c4 = arith.constant 4 : index
%c0_i32 = arith.constant 0 : i32
%0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : memref<8x16xi32>
memref.assume_alignment %0, 64 : memref<8x16xi32>
%1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : memref<16x8xi32>
memref.assume_alignment %1, 64 : memref<16x8xi32>
%2 = hal.interface.binding.subspan set(0) binding(2) type(storage_buffer) alignment(64) offset(%c0) : memref<8x8xi32>
memref.assume_alignment %2, 64 : memref<8x8xi32>
scf.parallel (%arg0, %arg1) = (%c0, %c0) to (%c1, %c1) step (%c1, %c1) {
%3 = affine.apply affine_map<(d0) -> (d0 * 8)>(%arg0)
%4 = affine.apply affine_map<(d0) -> (d0 * 8)>(%arg1)
%subview = memref.subview %0[%3, 0] [8, 16] [1, 1] : memref<8x16xi32> to memref<8x16xi32, strided<[16, 1], offset: ?>>
%subview_0 = memref.subview %1[0, %4] [16, 8] [1, 1] : memref<16x8xi32> to memref<16x8xi32, strided<[8, 1], offset: ?>>
%subview_1 = memref.subview %2[%3, %4] [8, 8] [1, 1] : memref<8x8xi32> to memref<8x8xi32, strided<[8, 1], offset: ?>>
%alloc = memref.alloc() : memref<8x16xi32, 1>
memref.copy %subview, %alloc : memref<8x16xi32, strided<[16, 1], offset: ?>> to memref<8x16xi32, 1>
%alloc_2 = memref.alloc() : memref<16x8xi32, 1>
memref.copy %subview_0, %alloc_2 : memref<16x8xi32, strided<[8, 1], offset: ?>> to memref<16x8xi32, 1>
%alloc_3 = memref.alloc() : memref<8x8xi32, 1>
scf.parallel (%arg2, %arg3) = (%c0, %c0) to (%c2, %c2) step (%c1, %c1) {
%5 = affine.apply affine_map<(d0) -> (d0 * 4)>(%arg2)
%6 = affine.apply affine_map<(d0) -> (d0 * 4)>(%arg3)
%subview_4 = memref.subview %alloc[%5, 0] [4, 16] [1, 1] : memref<8x16xi32, 1> to memref<4x16xi32, strided<[16, 1], offset: ?>, 1>
%subview_5 = memref.subview %alloc_2[0, %6] [16, 4] [1, 1] : memref<16x8xi32, 1> to memref<16x4xi32, strided<[8, 1], offset: ?>, 1>
%subview_6 = memref.subview %alloc_3[%5, %6] [4, 4] [1, 1] : memref<8x8xi32, 1> to memref<4x4xi32, strided<[8, 1], offset: ?>, 1>
%alloc_7 = memref.alloc() : memref<4x4xi32, 2>
linalg.fill ins(%c0_i32 : i32) outs(%alloc_7 : memref<4x4xi32, 2>)
scf.for %arg4 = %c0 to %c16 step %c4 {
%subview_8 = memref.subview %subview_4[0, %arg4] [4, 4] [1, 1] : memref<4x16xi32, strided<[16, 1], offset: ?>, 1> to memref<4x4xi32, strided<[16, 1], offset: ?>, 1>
%subview_9 = memref.subview %subview_5[%arg4, 0] [4, 4] [1, 1] : memref<16x4xi32, strided<[8, 1], offset: ?>, 1> to memref<4x4xi32, strided<[8, 1], offset: ?>, 1>
%alloc_10 = memref.alloc() : memref<4x4xi32, 2>
memref.copy %subview_8, %alloc_10 : memref<4x4xi32, strided<[16, 1], offset: ?>, 1> to memref<4x4xi32, 2>
%alloc_11 = memref.alloc() : memref<4x4xi32, 2>
memref.copy %subview_9, %alloc_11 : memref<4x4xi32, strided<[8, 1], offset: ?>, 1> to memref<4x4xi32, 2>
linalg.matmul ins(%alloc_10, %alloc_11 : memref<4x4xi32, 2>, memref<4x4xi32, 2>) outs(%alloc_7 : memref<4x4xi32, 2>)
memref.dealloc %alloc_10 : memref<4x4xi32, 2>
memref.dealloc %alloc_11 : memref<4x4xi32, 2>
}
memref.copy %alloc_7, %subview_6 : memref<4x4xi32, 2> to memref<4x4xi32, strided<[8, 1], offset: ?>, 1>
memref.dealloc %alloc_7 : memref<4x4xi32, 2>
scf.yield
}
memref.copy %alloc_3, %subview_1 : memref<8x8xi32, 1> to memref<8x8xi32, strided<[8, 1], offset: ?>>
memref.dealloc %alloc : memref<8x16xi32, 1>
memref.dealloc %alloc_2 : memref<16x8xi32, 1>
memref.dealloc %alloc_3 : memref<8x8xi32, 1>
scf.yield
}
return
}
}
Through passes in IREE we represent the memory movement from DDR to L2 cache to L1 cache and associated loop nests and computation in the above IR. This is then handed of to mlir-air passes.
Results
func.func @matmul_static_dispatch_0_matmul_8x8x16_i32() {
%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
%async_token, %results = air.execute -> (memref<8x16xi32>) {
%1 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : memref<8x16xi32>
air.execute_terminator %1 : memref<8x16xi32>
}
%async_token_0 = air.execute [%async_token] {
memref.assume_alignment %results, 64 : memref<8x16xi32>
}
%async_token_1, %results_2 = air.execute -> (memref<16x8xi32>) {
%1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : memref<16x8xi32>
air.execute_terminator %1 : memref<16x8xi32>
}
%async_token_3 = air.execute [%async_token_1] {
memref.assume_alignment %results_2, 64 : memref<16x8xi32>
}
%async_token_4, %results_5 = air.execute -> (memref<8x8xi32>) {
%1 = hal.interface.binding.subspan set(0) binding(2) type(storage_buffer) alignment(64) offset(%c0) : memref<8x8xi32>
air.execute_terminator %1 : memref<8x8xi32>
}
%async_token_6 = air.execute [%async_token_4] {
memref.assume_alignment %results_5, 64 : memref<8x8xi32>
}
%0 = air.launch async [%async_token_0, %async_token_3, %async_token_6] (%arg0, %arg1) in (%arg2=%c1, %arg3=%c1) args(%arg4=%results, %arg5=%results_2, %arg6=%results_5) : memref<8x16xi32>, memref<16x8xi32>, memref<8x8xi32> attributes {id = 1 : i32} {
%c1_7 = arith.constant 1 : index
%c16 = arith.constant 16 : index
%c8 = arith.constant 8 : index
%c0_8 = arith.constant 0 : index
%async_token_9, %results_10 = air.execute -> (index) {
%5 = affine.apply affine_map<()[s0] -> (s0 * 8)>()[%arg0]
air.execute_terminator %5 : index
}
%async_token_11, %results_12 = air.execute -> (index) {
%5 = affine.apply affine_map<()[s0] -> (s0 * 8)>()[%arg1]
air.execute_terminator %5 : index
}
%async_token_13, %results_14 = air.execute -> (index) {
%5 = affine.apply affine_map<()[s0] -> (s0 * 8)>()[%arg0]
air.execute_terminator %5 : index
}
%async_token_15, %results_16 = air.execute -> (index) {
%5 = affine.apply affine_map<()[s0] -> (s0 * 8)>()[%arg0]
air.execute_terminator %5 : index
}
%1 = air.channel.put async [%async_token_15] @channel_5[] (%arg4[%results_16, %c0_8] [%c8, %c16] [%c16, %c1_7]) {id = 1 : i32} : (memref<8x16xi32>)
%2 = air.channel.put async [%async_token_11, %1] @channel_5[] (%arg5[%c0_8, %results_12] [%c16, %c8] [%c8, %c1_7]) {id = 2 : i32} : (memref<16x8xi32>)
%async_token_17, %results_18 = air.execute -> (index) {
%5 = affine.apply affine_map<()[s0] -> (s0 * 8)>()[%arg0]
air.execute_terminator %5 : index
}
%async_token_19, %results_20 = air.execute -> (index) {
%5 = affine.apply affine_map<()[s0] -> (s0 * 8)>()[%arg1]
air.execute_terminator %5 : index
}
%3 = air.channel.get async [%async_token_17, %async_token_19] @channel_7[] (%arg6[%results_18, %results_20] [%c8, %c8] [%c8, %c1_7]) {id = 3 : i32} : (memref<8x8xi32>)
%4 = air.segment @segment_0 async attributes {id = 2 : i32, x_loc = 0 : i64, x_size = 1 : i64, y_loc = 2 : i64, y_size = 4 : i64} {
%c4 = arith.constant 4 : index
%c2 = arith.constant 2 : index
%c12 = arith.constant 12 : index
%c8_21 = arith.constant 8 : index
%c1_22 = arith.constant 1 : index
%c16_23 = arith.constant 16 : index
%c0_24 = arith.constant 0 : index
%async_token_25, %results_26 = air.execute -> (memref<8x16xi32, 1>) {
%alloc = memref.alloc() : memref<8x16xi32, 1>
air.execute_terminator %alloc : memref<8x16xi32, 1>
}
%5 = air.wait_all async
%async_token_27, %results_28 = air.execute -> (memref<16x8xi32, 1>) {
%alloc = memref.alloc() : memref<16x8xi32, 1>
air.execute_terminator %alloc : memref<16x8xi32, 1>
}
%6 = air.channel.get async [%async_token_25] @channel_5[] (%results_26[] [] []) {id = 4 : i32} : (memref<8x16xi32, 1>)
%7 = air.channel.get async [%async_token_27, %6] @channel_5[] (%results_28[] [] []) {id = 5 : i32} : (memref<16x8xi32, 1>)
%async_token_29, %results_30 = air.execute -> (memref<8x8xi32, 1>) {
%alloc = memref.alloc() : memref<8x8xi32, 1>
air.execute_terminator %alloc : memref<8x8xi32, 1>
}
%8 = air.channel.put async [%5] @channel_0[] (%results_26[%c0_24, %c0_24] [%c4, %c4] [%c16_23, %c1_22]) {id = 6 : i32, unrolled_iteration = 0 : i32} : (memref<8x16xi32, 1>)
%9 = air.channel.put async [%8] @channel_0[] (%results_26[%c0_24, %c4] [%c4, %c4] [%c16_23, %c1_22]) {id = 6 : i32, unrolled_iteration = 1 : i32} : (memref<8x16xi32, 1>)
%10 = air.channel.put async [%9] @channel_0[] (%results_26[%c0_24, %c8_21] [%c4, %c4] [%c16_23, %c1_22]) {id = 6 : i32, unrolled_iteration = 2 : i32} : (memref<8x16xi32, 1>)
%11 = air.channel.put async [%10] @channel_0[] (%results_26[%c0_24, %c12] [%c4, %c4] [%c16_23, %c1_22]) {id = 6 : i32, unrolled_iteration = 3 : i32} : (memref<8x16xi32, 1>)
%12 = air.channel.put async [%5] @channel_1[] (%results_26[%c4, %c0_24] [%c4, %c4] [%c16_23, %c1_22]) {id = 7 : i32, unrolled_iteration = 0 : i32} : (memref<8x16xi32, 1>)
%13 = air.channel.put async [%12] @channel_1[] (%results_26[%c4, %c4] [%c4, %c4] [%c16_23, %c1_22]) {id = 7 : i32, unrolled_iteration = 1 : i32} : (memref<8x16xi32, 1>)
%14 = air.channel.put async [%13] @channel_1[] (%results_26[%c4, %c8_21] [%c4, %c4] [%c16_23, %c1_22]) {id = 7 : i32, unrolled_iteration = 2 : i32} : (memref<8x16xi32, 1>)
%15 = air.channel.put async [%14] @channel_1[] (%results_26[%c4, %c12] [%c4, %c4] [%c16_23, %c1_22]) {id = 7 : i32, unrolled_iteration = 3 : i32} : (memref<8x16xi32, 1>)
%16 = air.channel.put async [%7] @channel_2[] (%results_28[%c0_24, %c0_24] [%c4, %c4] [%c8_21, %c1_22]) {id = 8 : i32, unrolled_iteration = 0 : i32} : (memref<16x8xi32, 1>)
%17 = air.channel.put async [%16] @channel_2[] (%results_28[%c4, %c0_24] [%c4, %c4] [%c8_21, %c1_22]) {id = 8 : i32, unrolled_iteration = 1 : i32} : (memref<16x8xi32, 1>)
%18 = air.channel.put async [%17] @channel_2[] (%results_28[%c8_21, %c0_24] [%c4, %c4] [%c8_21, %c1_22]) {id = 8 : i32, unrolled_iteration = 2 : i32} : (memref<16x8xi32, 1>)
%19 = air.channel.put async [%18] @channel_2[] (%results_28[%c12, %c0_24] [%c4, %c4] [%c8_21, %c1_22]) {id = 8 : i32, unrolled_iteration = 3 : i32} : (memref<16x8xi32, 1>)
%20 = air.channel.put async [%7] @channel_3[] (%results_28[%c0_24, %c4] [%c4, %c4] [%c8_21, %c1_22]) {id = 9 : i32, unrolled_iteration = 0 : i32} : (memref<16x8xi32, 1>)
%21 = air.channel.put async [%20] @channel_3[] (%results_28[%c4, %c4] [%c4, %c4] [%c8_21, %c1_22]) {id = 9 : i32, unrolled_iteration = 1 : i32} : (memref<16x8xi32, 1>)
%22 = air.channel.put async [%21] @channel_3[] (%results_28[%c8_21, %c4] [%c4, %c4] [%c8_21, %c1_22]) {id = 9 : i32, unrolled_iteration = 2 : i32} : (memref<16x8xi32, 1>)
%23 = air.channel.put async [%22] @channel_3[] (%results_28[%c12, %c4] [%c4, %c4] [%c8_21, %c1_22]) {id = 9 : i32, unrolled_iteration = 3 : i32} : (memref<16x8xi32, 1>)
%24 = scf.parallel (%arg7, %arg8) = (%c0_24, %c0_24) to (%c2, %c2) step (%c1_22, %c1_22) init (%async_token_29) -> !air.async.token {
%async_token_34, %results_35 = air.execute -> (index) {
%28 = affine.apply affine_map<()[s0] -> (s0 * 4)>()[%arg7]
air.execute_terminator %28 : index
}
%async_token_36, %results_37 = air.execute -> (index) {
%28 = affine.apply affine_map<()[s0] -> (s0 * 4)>()[%arg8]
air.execute_terminator %28 : index
}
%27 = air.channel.get async [%async_token_29, %async_token_36, %async_token_34] @channel_6[%arg7, %arg8] (%results_30[%results_35, %results_37] [%c4, %c4] [%c8_21, %c1_22]) {id = 10 : i32} : (memref<8x8xi32, 1>)
scf.reduce(%27) : !air.async.token {
^bb0(%arg9: !air.async.token, %arg10: !air.async.token):
%28 = air.wait_all async [%arg9, %arg10]
scf.reduce.return %28 : !air.async.token
}
scf.yield
}
%25 = air.herd @herd_0 async [%7, %async_token_29] tile (%arg7, %arg8) in (%arg9=%c1_22, %arg10=%c4) attributes {id = 3 : i32, x_loc = 0 : i64, y_loc = 2 : i64} {
%c4_34 = arith.constant 4 : index
%c1_35 = arith.constant 1 : index
%c16_36 = arith.constant 16 : index
%c0_37 = arith.constant 0 : index
%c0_i32 = arith.constant 0 : i32
%c8_38 = arith.constant 8 : index
%c2_39 = arith.constant 2 : index
%27 = arith.remsi %arg8, %c2_39 : index
%28 = arith.divsi %arg8, %c2_39 : index
%async_token_40, %results_41 = air.execute -> (memref<4x4xi32, 2>) {
%alloc = memref.alloc() : memref<4x4xi32, 2>
air.execute_terminator %alloc : memref<4x4xi32, 2>
}
%async_token_42 = air.execute [%async_token_40] {
scf.for %arg11 = %c0_37 to %c4_34 step %c1_35 {
scf.for %arg12 = %c0_37 to %c4_34 step %c1_35 {
memref.store %c0_i32, %results_41[%arg11, %arg12] : memref<4x4xi32, 2>
}
}
}
%async_token_43, %results_44 = air.execute [%async_token_42] -> (memref<4x4xi32, 2>) {
%alloc = memref.alloc() : memref<4x4xi32, 2>
air.execute_terminator %alloc : memref<4x4xi32, 2>
}
%async_token_45, %results_46 = air.execute [%async_token_43] -> (memref<4x4xi32, 2>) {
%alloc = memref.alloc() : memref<4x4xi32, 2>
air.execute_terminator %alloc : memref<4x4xi32, 2>
}
%async_token_47, %results_48 = air.execute [%async_token_45] -> (memref<4x4xi32, 2>) {
%alloc = memref.alloc() : memref<4x4xi32, 2>
air.execute_terminator %alloc : memref<4x4xi32, 2>
}
%async_token_49, %results_50 = air.execute [%async_token_45] -> (memref<4x4xi32, 2>) {
%alloc = memref.alloc() : memref<4x4xi32, 2>
air.execute_terminator %alloc : memref<4x4xi32, 2>
}
%29:4 = scf.for %arg11 = %c0_37 to %c16_36 step %c8_38 iter_args(%arg12 = %async_token_47, %arg13 = %async_token_49, %arg14 = %async_token_49, %arg15 = %async_token_49) -> (!air.async.token, !air.async.token, !air.async.token, !air.async.token) {
%31 = affine.if affine_set<()[s0, s1] : (s0 == 0, s1 >= 0, -s1 + 1 >= 0)>()[%28, %27] -> !air.async.token {
%36 = air.channel.get async [%arg15, %arg12, %async_token_47] @channel_0[%28, %27] (%results_48[] [] []) {id = 11 : i32} : (memref<4x4xi32, 2>)
affine.yield %36 : !air.async.token
} else {
%36 = air.channel.get async [%arg15, %arg12, %async_token_47] @channel_1[%28, %27] (%results_48[] [] []) {id = 12 : i32} : (memref<4x4xi32, 2>)
affine.yield %36 : !air.async.token
}
%32 = affine.if affine_set<()[s0, s1] : (s0 >= 0, -s0 + 1 >= 0, s1 == 0)>()[%28, %27] -> !air.async.token {
%36 = air.channel.get async [%arg15, %arg12, %async_token_49] @channel_2[%28, %27] (%results_50[] [] []) {id = 13 : i32} : (memref<4x4xi32, 2>)
affine.yield %36 : !air.async.token
} else {
%36 = air.channel.get async [%arg15, %arg12, %async_token_49] @channel_3[%28, %27] (%results_50[] [] []) {id = 14 : i32} : (memref<4x4xi32, 2>)
affine.yield %36 : !air.async.token
}
%async_token_52 = air.execute [%arg14, %32, %31] {
scf.for %arg16 = %c0_37 to %c4_34 step %c1_35 {
scf.for %arg17 = %c0_37 to %c4_34 step %c1_35 {
scf.for %arg18 = %c0_37 to %c4_34 step %c1_35 {
%36 = memref.load %results_48[%arg16, %arg18] : memref<4x4xi32, 2>
%37 = memref.load %results_50[%arg18, %arg17] : memref<4x4xi32, 2>
%38 = memref.load %results_41[%arg16, %arg17] : memref<4x4xi32, 2>
%39 = arith.muli %36, %37 : i32
%40 = arith.addi %38, %39 : i32
memref.store %40, %results_41[%arg16, %arg17] : memref<4x4xi32, 2>
}
}
}
}
%async_token_53 = air.execute {
memref.dealloc %results_48 : memref<4x4xi32, 2>
}
%async_token_54 = air.execute {
memref.dealloc %results_50 : memref<4x4xi32, 2>
}
%33 = affine.if affine_set<()[s0, s1] : (s0 == 0, s1 >= 0, -s1 + 1 >= 0)>()[%28, %27] -> !air.async.token {
%36 = air.channel.get async [%32, %31, %arg13] @channel_0[%28, %27] (%results_46[] [] []) {id = 11 : i32} : (memref<4x4xi32, 2>)
affine.yield %36 : !air.async.token
} else {
%36 = air.channel.get async [%32, %31, %arg13] @channel_1[%28, %27] (%results_46[] [] []) {id = 12 : i32} : (memref<4x4xi32, 2>)
affine.yield %36 : !air.async.token
}
%34 = affine.if affine_set<()[s0, s1] : (s0 >= 0, -s0 + 1 >= 0, s1 == 0)>()[%28, %27] -> !air.async.token {
%36 = air.channel.get async [%32, %31, %arg13] @channel_2[%28, %27] (%results_44[] [] []) {id = 13 : i32} : (memref<4x4xi32, 2>)
affine.yield %36 : !air.async.token
} else {
%36 = air.channel.get async [%32, %31, %arg13] @channel_3[%28, %27] (%results_44[] [] []) {id = 14 : i32} : (memref<4x4xi32, 2>)
affine.yield %36 : !air.async.token
}
%async_token_55 = air.execute [%async_token_52, %34, %33] {
scf.for %arg16 = %c0_37 to %c4_34 step %c1_35 {
scf.for %arg17 = %c0_37 to %c4_34 step %c1_35 {
scf.for %arg18 = %c0_37 to %c4_34 step %c1_35 {
%36 = memref.load %results_46[%arg16, %arg18] : memref<4x4xi32, 2>
%37 = memref.load %results_44[%arg18, %arg17] : memref<4x4xi32, 2>
%38 = memref.load %results_41[%arg16, %arg17] : memref<4x4xi32, 2>
%39 = arith.muli %36, %37 : i32
%40 = arith.addi %38, %39 : i32
memref.store %40, %results_41[%arg16, %arg17] : memref<4x4xi32, 2>
}
}
}
}
%async_token_56 = air.execute {
memref.dealloc %results_46 : memref<4x4xi32, 2>
}
%async_token_57 = air.execute {
memref.dealloc %results_44 : memref<4x4xi32, 2>
}
%35 = air.wait_all async [%33, %34]
scf.yield %async_token_52, %async_token_55, %async_token_55, %35 : !air.async.token, !air.async.token, !air.async.token, !air.async.token
}
%30 = air.channel.put async [%29#1] @channel_6[%28, %27] (%results_41[] [] []) {id = 15 : i32} : (memref<4x4xi32, 2>)
%async_token_51 = air.execute [%30] {
memref.dealloc %results_41 : memref<4x4xi32, 2>
}
air.herd_terminator
}
%26 = air.channel.put async [%25] @channel_7[] (%results_30[] [] []) {id = 16 : i32} : (memref<8x8xi32, 1>)
%async_token_31 = air.execute [%5] {
memref.dealloc %results_26 : memref<8x16xi32, 1>
}
%async_token_32 = air.execute [%7] {
memref.dealloc %results_28 : memref<16x8xi32, 1>
}
%async_token_33 = air.execute [%26] {
memref.dealloc %results_30 : memref<8x8xi32, 1>
}
air.segment_terminator
}
air.launch_terminator
}
return
}
AIR passes do async analysis to determine the scheduling and logical placement/routing.
Results
#executable_target_elf = #hal.executable.target<"amd-aie", "elf", {target_arch = "chip-tbd"}>
#pipeline_layout = #hal.pipeline.layout<push_constants = 0, sets = [<0, bindings = [<0, storage_buffer, ReadOnly>, <1, storage_buffer, ReadOnly>, <2, storage_buffer>]>]>
#translation = #iree_codegen.translation_info<TransformDialectCodegen codegen_spec = @__transform_main>
#device_target_amd_aie = #hal.device.target<"amd-aie", {executable_targets = [#executable_target_elf], legacy_sync}>
module attributes {hal.device.targets = [#device_target_amd_aie]} {
hal.executable private @matmul_static_dispatch_0 {
hal.executable.variant public @elf target(#executable_target_elf) {
hal.executable.export public @matmul_static_dispatch_0_matmul_64x64x256_i32 ordinal(0) layout(#pipeline_layout) attributes {translation_info = #translation} {
^bb0(%arg0: !hal.device):
%c2 = arith.constant 2 : index
%c1 = arith.constant 1 : index
hal.return %c2, %c2, %c1 : index, index, index
}
builtin.module {
AIE.device(ipu) {
%tile_0_0 = AIE.tile(0, 0)
%tile_0_1 = AIE.tile(0, 1)
%tile_0_2 = AIE.tile(0, 2)
%tile_0_3 = AIE.tile(0, 3)
%tile_0_4 = AIE.tile(0, 4)
%tile_0_5 = AIE.tile(0, 5)
%lock_0_1 = AIE.lock(%tile_0_1, 5) {init = 4 : i32}
%lock_0_1_0 = AIE.lock(%tile_0_1, 4) {init = 0 : i32}
%lock_0_1_1 = AIE.lock(%tile_0_1, 3) {init = 2 : i32}
%lock_0_1_2 = AIE.lock(%tile_0_1, 2) {init = 0 : i32}
%lock_0_1_3 = AIE.lock(%tile_0_1, 1) {init = 2 : i32}
%lock_0_1_4 = AIE.lock(%tile_0_1, 0) {init = 0 : i32}
%lock_0_2 = AIE.lock(%tile_0_2, 5) {init = 2 : i32}
%lock_0_2_5 = AIE.lock(%tile_0_2, 4) {init = 0 : i32}
%lock_0_2_6 = AIE.lock(%tile_0_2, 3) {init = 2 : i32}
%lock_0_2_7 = AIE.lock(%tile_0_2, 2) {init = 0 : i32}
%lock_0_2_8 = AIE.lock(%tile_0_2, 1) {init = 1 : i32}
%lock_0_2_9 = AIE.lock(%tile_0_2, 0) {init = 0 : i32}
%lock_0_3 = AIE.lock(%tile_0_3, 5) {init = 2 : i32}
%lock_0_3_10 = AIE.lock(%tile_0_3, 4) {init = 0 : i32}
%lock_0_3_11 = AIE.lock(%tile_0_3, 3) {init = 2 : i32}
%lock_0_3_12 = AIE.lock(%tile_0_3, 2) {init = 0 : i32}
%lock_0_3_13 = AIE.lock(%tile_0_3, 1) {init = 1 : i32}
%lock_0_3_14 = AIE.lock(%tile_0_3, 0) {init = 0 : i32}
%lock_0_4 = AIE.lock(%tile_0_4, 5) {init = 2 : i32}
%lock_0_4_15 = AIE.lock(%tile_0_4, 4) {init = 0 : i32}
%lock_0_4_16 = AIE.lock(%tile_0_4, 3) {init = 2 : i32}
%lock_0_4_17 = AIE.lock(%tile_0_4, 2) {init = 0 : i32}
%lock_0_4_18 = AIE.lock(%tile_0_4, 1) {init = 1 : i32}
%lock_0_4_19 = AIE.lock(%tile_0_4, 0) {init = 0 : i32}
%lock_0_5 = AIE.lock(%tile_0_5, 5) {init = 2 : i32}
%lock_0_5_20 = AIE.lock(%tile_0_5, 4) {init = 0 : i32}
%lock_0_5_21 = AIE.lock(%tile_0_5, 3) {init = 2 : i32}
%lock_0_5_22 = AIE.lock(%tile_0_5, 2) {init = 0 : i32}
%lock_0_5_23 = AIE.lock(%tile_0_5, 1) {init = 1 : i32}
%lock_0_5_24 = AIE.lock(%tile_0_5, 0) {init = 0 : i32}
%buffer_0_1 = AIE.buffer(%tile_0_1) {sym_name = "buf22"} : memref<32x256xi32, 1>
%buffer_0_1_25 = AIE.buffer(%tile_0_1) {sym_name = "buf21"} : memref<256x32xi32, 1>
%buffer_0_1_26 = AIE.buffer(%tile_0_1) {sym_name = "buf20"} : memref<32x32xi32, 1>
%buffer_0_5 = AIE.buffer(%tile_0_5) {sym_name = "buf19"} : memref<16x16xi32, 2>
%buffer_0_5_27 = AIE.buffer(%tile_0_5) {sym_name = "buf18"} : memref<16x16xi32, 2>
%buffer_0_5_28 = AIE.buffer(%tile_0_5) {sym_name = "buf17"} : memref<16x16xi32, 2>
%buffer_0_5_29 = AIE.buffer(%tile_0_5) {sym_name = "buf16"} : memref<16x16xi32, 2>
%buffer_0_5_30 = AIE.buffer(%tile_0_5) {sym_name = "buf15"} : memref<16x16xi32, 2>
%buffer_0_4 = AIE.buffer(%tile_0_4) {sym_name = "buf14"} : memref<16x16xi32, 2>
%buffer_0_4_31 = AIE.buffer(%tile_0_4) {sym_name = "buf13"} : memref<16x16xi32, 2>
%buffer_0_4_32 = AIE.buffer(%tile_0_4) {sym_name = "buf12"} : memref<16x16xi32, 2>
%buffer_0_4_33 = AIE.buffer(%tile_0_4) {sym_name = "buf11"} : memref<16x16xi32, 2>
%buffer_0_4_34 = AIE.buffer(%tile_0_4) {sym_name = "buf10"} : memref<16x16xi32, 2>
%buffer_0_3 = AIE.buffer(%tile_0_3) {sym_name = "buf9"} : memref<16x16xi32, 2>
%buffer_0_3_35 = AIE.buffer(%tile_0_3) {sym_name = "buf8"} : memref<16x16xi32, 2>
%buffer_0_3_36 = AIE.buffer(%tile_0_3) {sym_name = "buf7"} : memref<16x16xi32, 2>
%buffer_0_3_37 = AIE.buffer(%tile_0_3) {sym_name = "buf6"} : memref<16x16xi32, 2>
%buffer_0_3_38 = AIE.buffer(%tile_0_3) {sym_name = "buf5"} : memref<16x16xi32, 2>
%buffer_0_2 = AIE.buffer(%tile_0_2) {sym_name = "buf4"} : memref<16x16xi32, 2>
%buffer_0_2_39 = AIE.buffer(%tile_0_2) {sym_name = "buf3"} : memref<16x16xi32, 2>
%buffer_0_2_40 = AIE.buffer(%tile_0_2) {sym_name = "buf2"} : memref<16x16xi32, 2>
%buffer_0_2_41 = AIE.buffer(%tile_0_2) {sym_name = "buf1"} : memref<16x16xi32, 2>
%buffer_0_2_42 = AIE.buffer(%tile_0_2) {sym_name = "buf0"} : memref<16x16xi32, 2>
%mem_0_5 = AIE.mem(%tile_0_5) {
%0 = AIE.dmaStart(S2MM, 0, ^bb1, ^bb7)
^bb1: // 2 preds: ^bb0, ^bb2
AIE.useLock(%lock_0_5_21, AcquireGreaterEqual, 1)
AIE.dmaBd(<%buffer_0_5_29 : memref<16x16xi32, 2>, 0, 256>, 0)
AIE.useLock(%lock_0_5_22, Release, 1)
AIE.nextBd ^bb2
^bb2: // pred: ^bb1
AIE.useLock(%lock_0_5_21, AcquireGreaterEqual, 1)
AIE.dmaBd(<%buffer_0_5_28 : memref<16x16xi32, 2>, 0, 256>, 0)
AIE.useLock(%lock_0_5_22, Release, 1)
AIE.nextBd ^bb1
^bb3: // pred: ^bb4
AIE.end
^bb4: // pred: ^bb7
%1 = AIE.dmaStart(S2MM, 1, ^bb5, ^bb3)
^bb5: // 2 preds: ^bb4, ^bb6
AIE.useLock(%lock_0_5, AcquireGreaterEqual, 1)
AIE.dmaBd(<%buffer_0_5_30 : memref<16x16xi32, 2>, 0, 256>, 0)
AIE.useLock(%lock_0_5_20, Release, 1)
AIE.nextBd ^bb6
^bb6: // pred: ^bb5
AIE.useLock(%lock_0_5, AcquireGreaterEqual, 1)
AIE.dmaBd(<%buffer_0_5_27 : memref<16x16xi32, 2>, 0, 256>, 0)
AIE.useLock(%lock_0_5_20, Release, 1)
AIE.nextBd ^bb5
^bb7: // pred: ^bb0
%2 = AIE.dmaStart(MM2S, 0, ^bb8, ^bb4)
^bb8: // 2 preds: ^bb7, ^bb8
AIE.useLock(%lock_0_5_24, AcquireGreaterEqual, 1)
AIE.dmaBd(<%buffer_0_5 : memref<16x16xi32, 2>, 0, 256>, 0)
AIE.useLock(%lock_0_5_23, Release, 1)
AIE.nextBd ^bb8
}
%core_0_5 = AIE.core(%tile_0_5) {
%c32 = arith.constant 32 : index
%c0_i32 = arith.constant 0 : i32
%c256 = arith.constant 256 : index
%c16 = arith.constant 16 : index
%c1 = arith.constant 1 : index
%c0 = arith.constant 0 : index
cf.br ^bb1
^bb1: // 2 preds: ^bb0, ^bb1
scf.for %arg0 = %c0 to %c16 step %c1 {
scf.for %arg1 = %c0 to %c16 step %c1 {
memref.store %c0_i32, %buffer_0_5[%arg0, %arg1] : memref<16x16xi32, 2>
}
}
scf.for %arg0 = %c0 to %c256 step %c32 {
AIE.useLock(%lock_0_5_22, AcquireGreaterEqual, 1)
AIE.useLock(%lock_0_5_20, AcquireGreaterEqual, 1)
scf.for %arg1 = %c0 to %c16 step %c1 {
scf.for %arg2 = %c0 to %c16 step %c1 {
scf.for %arg3 = %c0 to %c16 step %c1 {
%0 = memref.load %buffer_0_5_29[%arg1, %arg3] : memref<16x16xi32, 2>
%1 = memref.load %buffer_0_5_30[%arg3, %arg2] : memref<16x16xi32, 2>
%2 = memref.load %buffer_0_5[%arg1, %arg2] : memref<16x16xi32, 2>
%3 = arith.muli %0, %1 : i32
%4 = arith.addi %2, %3 : i32
memref.store %4, %buffer_0_5[%arg1, %arg2] : memref<16x16xi32, 2>
}
}
}
AIE.useLock(%lock_0_5_21, Release, 1)
AIE.useLock(%lock_0_5, Release, 1)
AIE.useLock(%lock_0_5_22, AcquireGreaterEqual, 1)
AIE.useLock(%lock_0_5_20, AcquireGreaterEqual, 1)
scf.for %arg1 = %c0 to %c16 step %c1 {
scf.for %arg2 = %c0 to %c16 step %c1 {
scf.for %arg3 = %c0 to %c16 step %c1 {
%0 = memref.load %buffer_0_5_28[%arg1, %arg3] : memref<16x16xi32, 2>
%1 = memref.load %buffer_0_5_27[%arg3, %arg2] : memref<16x16xi32, 2>
%2 = memref.load %buffer_0_5[%arg1, %arg2] : memref<16x16xi32, 2>
%3 = arith.muli %0, %1 : i32
%4 = arith.addi %2, %3 : i32
memref.store %4, %buffer_0_5[%arg1, %arg2] : memref<16x16xi32, 2>
}
}
}
AIE.useLock(%lock_0_5_21, Release, 1)
AIE.useLock(%lock_0_5, Release, 1)
}
AIE.useLock(%lock_0_5_23, AcquireGreaterEqual, 1)
AIE.useLock(%lock_0_5_24, Release, 1)
cf.br ^bb1
} {elf_file = "segment_0_core_0_5.elf"}
%mem_0_4 = AIE.mem(%tile_0_4) {
%0 = AIE.dmaStart(S2MM, 0, ^bb1, ^bb7)
^bb1: // 2 preds: ^bb0, ^bb2
AIE.useLock(%lock_0_4_16, AcquireGreaterEqual, 1)
AIE.dmaBd(<%buffer_0_4_33 : memref<16x16xi32, 2>, 0, 256>, 0)
AIE.useLock(%lock_0_4_17, Release, 1)
AIE.nextBd ^bb2
^bb2: // pred: ^bb1
AIE.useLock(%lock_0_4_16, AcquireGreaterEqual, 1)
AIE.dmaBd(<%buffer_0_4_32 : memref<16x16xi32, 2>, 0, 256>, 0)
AIE.useLock(%lock_0_4_17, Release, 1)
AIE.nextBd ^bb1
^bb3: // pred: ^bb4
AIE.end
^bb4: // pred: ^bb7
%1 = AIE.dmaStart(S2MM, 1, ^bb5, ^bb3)
^bb5: // 2 preds: ^bb4, ^bb6
AIE.useLock(%lock_0_4, AcquireGreaterEqual, 1)
AIE.dmaBd(<%buffer_0_4_34 : memref<16x16xi32, 2>, 0, 256>, 0)
AIE.useLock(%lock_0_4_15, Release, 1)
AIE.nextBd ^bb6
^bb6: // pred: ^bb5
AIE.useLock(%lock_0_4, AcquireGreaterEqual, 1)
AIE.dmaBd(<%buffer_0_4_31 : memref<16x16xi32, 2>, 0, 256>, 0)
AIE.useLock(%lock_0_4_15, Release, 1)
AIE.nextBd ^bb5
^bb7: // pred: ^bb0
%2 = AIE.dmaStart(MM2S, 0, ^bb8, ^bb4)
^bb8: // 2 preds: ^bb7, ^bb8
AIE.useLock(%lock_0_4_19, AcquireGreaterEqual, 1)
AIE.dmaBd(<%buffer_0_4 : memref<16x16xi32, 2>, 0, 256>, 0)
AIE.useLock(%lock_0_4_18, Release, 1)
AIE.nextBd ^bb8
}
%core_0_4 = AIE.core(%tile_0_4) {
%c32 = arith.constant 32 : index
%c0_i32 = arith.constant 0 : i32
%c256 = arith.constant 256 : index
%c16 = arith.constant 16 : index
%c1 = arith.constant 1 : index
%c0 = arith.constant 0 : index
cf.br ^bb1
^bb1: // 2 preds: ^bb0, ^bb1
scf.for %arg0 = %c0 to %c16 step %c1 {
scf.for %arg1 = %c0 to %c16 step %c1 {
memref.store %c0_i32, %buffer_0_4[%arg0, %arg1] : memref<16x16xi32, 2>
}
}
scf.for %arg0 = %c0 to %c256 step %c32 {
AIE.useLock(%lock_0_4_17, AcquireGreaterEqual, 1)
AIE.useLock(%lock_0_4_15, AcquireGreaterEqual, 1)
scf.for %arg1 = %c0 to %c16 step %c1 {
scf.for %arg2 = %c0 to %c16 step %c1 {
scf.for %arg3 = %c0 to %c16 step %c1 {
%0 = memref.load %buffer_0_4_33[%arg1, %arg3] : memref<16x16xi32, 2>
%1 = memref.load %buffer_0_4_34[%arg3, %arg2] : memref<16x16xi32, 2>
%2 = memref.load %buffer_0_4[%arg1, %arg2] : memref<16x16xi32, 2>
%3 = arith.muli %0, %1 : i32
%4 = arith.addi %2, %3 : i32
memref.store %4, %buffer_0_4[%arg1, %arg2] : memref<16x16xi32, 2>
}
}
}
AIE.useLock(%lock_0_4_16, Release, 1)
AIE.useLock(%lock_0_4, Release, 1)
AIE.useLock(%lock_0_4_17, AcquireGreaterEqual, 1)
AIE.useLock(%lock_0_4_15, AcquireGreaterEqual, 1)
scf.for %arg1 = %c0 to %c16 step %c1 {
scf.for %arg2 = %c0 to %c16 step %c1 {
scf.for %arg3 = %c0 to %c16 step %c1 {
%0 = memref.load %buffer_0_4_32[%arg1, %arg3] : memref<16x16xi32, 2>
%1 = memref.load %buffer_0_4_31[%arg3, %arg2] : memref<16x16xi32, 2>
%2 = memref.load %buffer_0_4[%arg1, %arg2] : memref<16x16xi32, 2>
%3 = arith.muli %0, %1 : i32
%4 = arith.addi %2, %3 : i32
memref.store %4, %buffer_0_4[%arg1, %arg2] : memref<16x16xi32, 2>
}
}
}
AIE.useLock(%lock_0_4_16, Release, 1)
AIE.useLock(%lock_0_4, Release, 1)
}
AIE.useLock(%lock_0_4_18, AcquireGreaterEqual, 1)
AIE.useLock(%lock_0_4_19, Release, 1)
cf.br ^bb1
} {elf_file = "segment_0_core_0_4.elf"}
%mem_0_3 = AIE.mem(%tile_0_3) {
%0 = AIE.dmaStart(S2MM, 0, ^bb1, ^bb7)
^bb1: // 2 preds: ^bb0, ^bb2
AIE.useLock(%lock_0_3_11, AcquireGreaterEqual, 1)
AIE.dmaBd(<%buffer_0_3_37 : memref<16x16xi32, 2>, 0, 256>, 0)
AIE.useLock(%lock_0_3_12, Release, 1)
AIE.nextBd ^bb2
^bb2: // pred: ^bb1
AIE.useLock(%lock_0_3_11, AcquireGreaterEqual, 1)
AIE.dmaBd(<%buffer_0_3_36 : memref<16x16xi32, 2>, 0, 256>, 0)
AIE.useLock(%lock_0_3_12, Release, 1)
AIE.nextBd ^bb1
^bb3: // pred: ^bb4
AIE.end
^bb4: // pred: ^bb7
%1 = AIE.dmaStart(S2MM, 1, ^bb5, ^bb3)
^bb5: // 2 preds: ^bb4, ^bb6
AIE.useLock(%lock_0_3, AcquireGreaterEqual, 1)
AIE.dmaBd(<%buffer_0_3_38 : memref<16x16xi32, 2>, 0, 256>, 0)
AIE.useLock(%lock_0_3_10, Release, 1)
AIE.nextBd ^bb6
^bb6: // pred: ^bb5
AIE.useLock(%lock_0_3, AcquireGreaterEqual, 1)
AIE.dmaBd(<%buffer_0_3_35 : memref<16x16xi32, 2>, 0, 256>, 0)
AIE.useLock(%lock_0_3_10, Release, 1)
AIE.nextBd ^bb5
^bb7: // pred: ^bb0
%2 = AIE.dmaStart(MM2S, 0, ^bb8, ^bb4)
^bb8: // 2 preds: ^bb7, ^bb8
AIE.useLock(%lock_0_3_14, AcquireGreaterEqual, 1)
AIE.dmaBd(<%buffer_0_3 : memref<16x16xi32, 2>, 0, 256>, 0)
AIE.useLock(%lock_0_3_13, Release, 1)
AIE.nextBd ^bb8
}
%core_0_3 = AIE.core(%tile_0_3) {
%c32 = arith.constant 32 : index
%c0_i32 = arith.constant 0 : i32
%c256 = arith.constant 256 : index
%c16 = arith.constant 16 : index
%c1 = arith.constant 1 : index
%c0 = arith.constant 0 : index
cf.br ^bb1
^bb1: // 2 preds: ^bb0, ^bb1
scf.for %arg0 = %c0 to %c16 step %c1 {
scf.for %arg1 = %c0 to %c16 step %c1 {
memref.store %c0_i32, %buffer_0_3[%arg0, %arg1] : memref<16x16xi32, 2>
}
}
scf.for %arg0 = %c0 to %c256 step %c32 {
AIE.useLock(%lock_0_3_12, AcquireGreaterEqual, 1)
AIE.useLock(%lock_0_3_10, AcquireGreaterEqual, 1)
scf.for %arg1 = %c0 to %c16 step %c1 {
scf.for %arg2 = %c0 to %c16 step %c1 {
scf.for %arg3 = %c0 to %c16 step %c1 {
%0 = memref.load %buffer_0_3_37[%arg1, %arg3] : memref<16x16xi32, 2>
%1 = memref.load %buffer_0_3_38[%arg3, %arg2] : memref<16x16xi32, 2>
%2 = memref.load %buffer_0_3[%arg1, %arg2] : memref<16x16xi32, 2>
%3 = arith.muli %0, %1 : i32
%4 = arith.addi %2, %3 : i32
memref.store %4, %buffer_0_3[%arg1, %arg2] : memref<16x16xi32, 2>
}
}
}
AIE.useLock(%lock_0_3_11, Release, 1)
AIE.useLock(%lock_0_3, Release, 1)
AIE.useLock(%lock_0_3_12, AcquireGreaterEqual, 1)
AIE.useLock(%lock_0_3_10, AcquireGreaterEqual, 1)
scf.for %arg1 = %c0 to %c16 step %c1 {
scf.for %arg2 = %c0 to %c16 step %c1 {
scf.for %arg3 = %c0 to %c16 step %c1 {
%0 = memref.load %buffer_0_3_36[%arg1, %arg3] : memref<16x16xi32, 2>
%1 = memref.load %buffer_0_3_35[%arg3, %arg2] : memref<16x16xi32, 2>
%2 = memref.load %buffer_0_3[%arg1, %arg2] : memref<16x16xi32, 2>
%3 = arith.muli %0, %1 : i32
%4 = arith.addi %2, %3 : i32
memref.store %4, %buffer_0_3[%arg1, %arg2] : memref<16x16xi32, 2>
}
}
}
AIE.useLock(%lock_0_3_11, Release, 1)
AIE.useLock(%lock_0_3, Release, 1)
}
AIE.useLock(%lock_0_3_13, AcquireGreaterEqual, 1)
AIE.useLock(%lock_0_3_14, Release, 1)
cf.br ^bb1
} {elf_file = "segment_0_core_0_3.elf"}
%mem_0_2 = AIE.mem(%tile_0_2) {
%0 = AIE.dmaStart(S2MM, 0, ^bb1, ^bb7)
^bb1: // 2 preds: ^bb0, ^bb2
AIE.useLock(%lock_0_2_6, AcquireGreaterEqual, 1)
AIE.dmaBd(<%buffer_0_2_41 : memref<16x16xi32, 2>, 0, 256>, 0)
AIE.useLock(%lock_0_2_7, Release, 1)
AIE.nextBd ^bb2
^bb2: // pred: ^bb1
AIE.useLock(%lock_0_2_6, AcquireGreaterEqual, 1)
AIE.dmaBd(<%buffer_0_2_40 : memref<16x16xi32, 2>, 0, 256>, 0)
AIE.useLock(%lock_0_2_7, Release, 1)
AIE.nextBd ^bb1
^bb3: // pred: ^bb4
AIE.end
^bb4: // pred: ^bb7
%1 = AIE.dmaStart(S2MM, 1, ^bb5, ^bb3)
^bb5: // 2 preds: ^bb4, ^bb6
AIE.useLock(%lock_0_2, AcquireGreaterEqual, 1)
AIE.dmaBd(<%buffer_0_2_42 : memref<16x16xi32, 2>, 0, 256>, 0)
AIE.useLock(%lock_0_2_5, Release, 1)
AIE.nextBd ^bb6
^bb6: // pred: ^bb5
AIE.useLock(%lock_0_2, AcquireGreaterEqual, 1)
AIE.dmaBd(<%buffer_0_2_39 : memref<16x16xi32, 2>, 0, 256>, 0)
AIE.useLock(%lock_0_2_5, Release, 1)
AIE.nextBd ^bb5
^bb7: // pred: ^bb0
%2 = AIE.dmaStart(MM2S, 0, ^bb8, ^bb4)
^bb8: // 2 preds: ^bb7, ^bb8
AIE.useLock(%lock_0_2_9, AcquireGreaterEqual, 1)
AIE.dmaBd(<%buffer_0_2 : memref<16x16xi32, 2>, 0, 256>, 0)
AIE.useLock(%lock_0_2_8, Release, 1)
AIE.nextBd ^bb8
}
%core_0_2 = AIE.core(%tile_0_2) {
%c32 = arith.constant 32 : index
%c0_i32 = arith.constant 0 : i32
%c256 = arith.constant 256 : index
%c16 = arith.constant 16 : index
%c1 = arith.constant 1 : index
%c0 = arith.constant 0 : index
cf.br ^bb1
^bb1: // 2 preds: ^bb0, ^bb1
scf.for %arg0 = %c0 to %c16 step %c1 {
scf.for %arg1 = %c0 to %c16 step %c1 {
memref.store %c0_i32, %buffer_0_2[%arg0, %arg1] : memref<16x16xi32, 2>
}
}
scf.for %arg0 = %c0 to %c256 step %c32 {
AIE.useLock(%lock_0_2_7, AcquireGreaterEqual, 1)
AIE.useLock(%lock_0_2_5, AcquireGreaterEqual, 1)
scf.for %arg1 = %c0 to %c16 step %c1 {
scf.for %arg2 = %c0 to %c16 step %c1 {
scf.for %arg3 = %c0 to %c16 step %c1 {
%0 = memref.load %buffer_0_2_41[%arg1, %arg3] : memref<16x16xi32, 2>
%1 = memref.load %buffer_0_2_42[%arg3, %arg2] : memref<16x16xi32, 2>
%2 = memref.load %buffer_0_2[%arg1, %arg2] : memref<16x16xi32, 2>
%3 = arith.muli %0, %1 : i32
%4 = arith.addi %2, %3 : i32
memref.store %4, %buffer_0_2[%arg1, %arg2] : memref<16x16xi32, 2>
}
}
}
AIE.useLock(%lock_0_2_6, Release, 1)
AIE.useLock(%lock_0_2, Release, 1)
AIE.useLock(%lock_0_2_7, AcquireGreaterEqual, 1)
AIE.useLock(%lock_0_2_5, AcquireGreaterEqual, 1)
scf.for %arg1 = %c0 to %c16 step %c1 {
scf.for %arg2 = %c0 to %c16 step %c1 {
scf.for %arg3 = %c0 to %c16 step %c1 {
%0 = memref.load %buffer_0_2_40[%arg1, %arg3] : memref<16x16xi32, 2>
%1 = memref.load %buffer_0_2_39[%arg3, %arg2] : memref<16x16xi32, 2>
%2 = memref.load %buffer_0_2[%arg1, %arg2] : memref<16x16xi32, 2>
%3 = arith.muli %0, %1 : i32
%4 = arith.addi %2, %3 : i32
memref.store %4, %buffer_0_2[%arg1, %arg2] : memref<16x16xi32, 2>
}
}
}
AIE.useLock(%lock_0_2_6, Release, 1)
AIE.useLock(%lock_0_2, Release, 1)
}
AIE.useLock(%lock_0_2_8, AcquireGreaterEqual, 1)
AIE.useLock(%lock_0_2_9, Release, 1)
cf.br ^bb1
} {elf_file = "segment_0_core_0_2.elf"}
AIE.flow(%tile_0_0, DMA : 0, %tile_0_1, DMA : 0)
AIE.flow(%tile_0_0, DMA : 1, %tile_0_1, DMA : 1)
AIE.flow(%tile_0_1, DMA : 0, %tile_0_0, DMA : 0)
AIE.flow(%tile_0_1, DMA : 1, %tile_0_2, DMA : 0)
AIE.flow(%tile_0_1, DMA : 1, %tile_0_3, DMA : 0)
AIE.flow(%tile_0_1, DMA : 2, %tile_0_4, DMA : 0)
AIE.flow(%tile_0_1, DMA : 2, %tile_0_5, DMA : 0)
AIE.flow(%tile_0_1, DMA : 3, %tile_0_2, DMA : 1)
AIE.flow(%tile_0_1, DMA : 3, %tile_0_4, DMA : 1)
AIE.flow(%tile_0_1, DMA : 4, %tile_0_3, DMA : 1)
AIE.flow(%tile_0_1, DMA : 4, %tile_0_5, DMA : 1)
AIE.flow(%tile_0_2, DMA : 0, %tile_0_1, DMA : 2)
AIE.flow(%tile_0_4, DMA : 0, %tile_0_1, DMA : 3)
AIE.flow(%tile_0_3, DMA : 0, %tile_0_1, DMA : 4)
AIE.flow(%tile_0_5, DMA : 0, %tile_0_1, DMA : 5)
%memTileDMA_0_1 = AIE.memTileDMA(%tile_0_1) {
%0 = AIE.dmaStart(S2MM, 0, ^bb1, ^bb21)
^bb1: // 2 preds: ^bb0, ^bb1
AIE.useLock(%lock_0_1_3, AcquireGreaterEqual, 2)
AIE.dmaBd(<%buffer_0_1 : memref<32x256xi32, 1>, 0, 8192>, 0)
AIE.useLock(%lock_0_1_4, Release, 2)
AIE.nextBd ^bb1
^bb2: // pred: ^bb3
AIE.end
^bb3: // pred: ^bb5
%1 = AIE.dmaStart(S2MM, 1, ^bb4, ^bb2)
^bb4: // 2 preds: ^bb3, ^bb4
AIE.useLock(%lock_0_1_1, AcquireGreaterEqual, 2)
AIE.dmaBd(<%buffer_0_1_25 : memref<256x32xi32, 1>, 0, 8192>, 0)
AIE.useLock(%lock_0_1_2, Release, 2)
AIE.nextBd ^bb4
^bb5: // pred: ^bb7
%2 = AIE.dmaStart(S2MM, 2, ^bb6, ^bb3)
^bb6: // 2 preds: ^bb5, ^bb6
AIE.useLock(%lock_0_1, AcquireGreaterEqual, 1)
AIE.dmaBd(<%buffer_0_1_26 : memref<32x32xi32, 1>, 0, 256>, 0, [<16, 32>, <16, 1>])
AIE.useLock(%lock_0_1_0, Release, 1)
AIE.nextBd ^bb6
^bb7: // pred: ^bb9
%3 = AIE.dmaStart(S2MM, 3, ^bb8, ^bb5)
^bb8: // 2 preds: ^bb7, ^bb8
AIE.useLock(%lock_0_1, AcquireGreaterEqual, 1)
AIE.dmaBd(<%buffer_0_1_26 : memref<32x32xi32, 1>, 2048, 256>, 0, [<16, 32>, <16, 1>])
AIE.useLock(%lock_0_1_0, Release, 1)
AIE.nextBd ^bb8
^bb9: // pred: ^bb11
%4 = AIE.dmaStart(S2MM, 4, ^bb10, ^bb7)
^bb10: // 2 preds: ^bb9, ^bb10
AIE.useLock(%lock_0_1, AcquireGreaterEqual, 1)
AIE.dmaBd(<%buffer_0_1_26 : memref<32x32xi32, 1>, 64, 256>, 0, [<16, 32>, <16, 1>])
AIE.useLock(%lock_0_1_0, Release, 1)
AIE.nextBd ^bb10
^bb11: // pred: ^bb13
%5 = AIE.dmaStart(S2MM, 5, ^bb12, ^bb9)
^bb12: // 2 preds: ^bb11, ^bb12
AIE.useLock(%lock_0_1, AcquireGreaterEqual, 1)
AIE.dmaBd(<%buffer_0_1_26 : memref<32x32xi32, 1>, 2112, 256>, 0, [<16, 32>, <16, 1>])
AIE.useLock(%lock_0_1_0, Release, 1)
AIE.nextBd ^bb12
^bb13: // pred: ^bb15
%6 = AIE.dmaStart(MM2S, 0, ^bb14, ^bb11)
^bb14: // 2 preds: ^bb13, ^bb14
AIE.useLock(%lock_0_1_0, AcquireGreaterEqual, 4)
AIE.dmaBd(<%buffer_0_1_26 : memref<32x32xi32, 1>, 0, 1024>, 0)
AIE.useLock(%lock_0_1, Release, 4)
AIE.nextBd ^bb14
^bb15: // pred: ^bb17
%7 = AIE.dmaStart(MM2S, 1, ^bb16, ^bb13)
^bb16: // 2 preds: ^bb15, ^bb16
AIE.useLock(%lock_0_1_4, AcquireGreaterEqual, 1)
AIE.dmaBd(<%buffer_0_1 : memref<32x256xi32, 1>, 0, 4096>, 0, [<16, 16>, <16, 256>, <16, 1>])
AIE.useLock(%lock_0_1_3, Release, 1)
AIE.nextBd ^bb16
^bb17: // pred: ^bb19
%8 = AIE.dmaStart(MM2S, 2, ^bb18, ^bb15)
^bb18: // 2 preds: ^bb17, ^bb18
AIE.useLock(%lock_0_1_4, AcquireGreaterEqual, 1)
AIE.dmaBd(<%buffer_0_1 : memref<32x256xi32, 1>, 16384, 4096>, 0, [<16, 16>, <16, 256>, <16, 1>])
AIE.useLock(%lock_0_1_3, Release, 1)
AIE.nextBd ^bb18
^bb19: // pred: ^bb21
%9 = AIE.dmaStart(MM2S, 3, ^bb20, ^bb17)
^bb20: // 2 preds: ^bb19, ^bb20
AIE.useLock(%lock_0_1_2, AcquireGreaterEqual, 1)
AIE.dmaBd(<%buffer_0_1_25 : memref<256x32xi32, 1>, 0, 4096>, 0, [<16, 512>, <16, 32>, <16, 1>])
AIE.useLock(%lock_0_1_1, Release, 1)
AIE.nextBd ^bb20
^bb21: // pred: ^bb0
%10 = AIE.dmaStart(MM2S, 4, ^bb22, ^bb19)
^bb22: // 2 preds: ^bb21, ^bb22
AIE.useLock(%lock_0_1_2, AcquireGreaterEqual, 1)
AIE.dmaBd(<%buffer_0_1_25 : memref<256x32xi32, 1>, 64, 4096>, 0, [<16, 512>, <16, 32>, <16, 1>])
AIE.useLock(%lock_0_1_1, Release, 1)
AIE.nextBd ^bb22
}
AIE.shimDMAAllocation @airMemcpyId16(S2MM, 0, 0)
memref.global "public" @airMemcpyId16 : memref<32x32xi32, 1>
AIE.shimDMAAllocation @airMemcpyId4(MM2S, 0, 0)
memref.global "public" @airMemcpyId4 : memref<32x256xi32, 1>
AIE.shimDMAAllocation @airMemcpyId5(MM2S, 1, 0)
memref.global "public" @airMemcpyId5 : memref<256x32xi32, 1>
func.func @matmul_static_dispatch_0_matmul_64x64x256_i32(%arg0: memref<64x256xi32>, %arg1: memref<256x64xi32>, %arg2: memref<64x64xi32>) {
%c0_i32 = arith.constant 0 : i32
%c1_i32 = arith.constant 1 : i32
%c32_i32 = arith.constant 32 : i32
%c256_i32 = arith.constant 256 : i32
%c2_i32 = arith.constant 2 : i32
%c64_i32 = arith.constant 64 : i32
%c2048_i32 = arith.constant 2048 : i32
memref.assume_alignment %arg0, 64 : memref<64x256xi32>
memref.assume_alignment %arg1, 64 : memref<256x64xi32>
memref.assume_alignment %arg2, 64 : memref<64x64xi32>
AIEX.ipu.dma_memcpy_nd(%c0_i32, %c0_i32, %arg0[%c0_i32, %c0_i32, %c0_i32, %c0_i32] [%c1_i32, %c1_i32, %c32_i32, %c256_i32] [%c0_i32, %c0_i32, %c256_i32]) {id = 1 : i32, metadata = @airMemcpyId4} : (i32, i32, memref<64x256xi32>, [i32, i32, i32, i32], [i32, i32, i32, i32], [i32, i32, i32])
AIEX.ipu.dma_memcpy_nd(%c0_i32, %c0_i32, %arg0[%c0_i32, %c0_i32, %c0_i32, %c0_i32] [%c1_i32, %c1_i32, %c32_i32, %c256_i32] [%c0_i32, %c0_i32, %c256_i32]) {id = 2 : i32, metadata = @airMemcpyId4} : (i32, i32, memref<64x256xi32>, [i32, i32, i32, i32], [i32, i32, i32, i32], [i32, i32, i32])
AIEX.ipu.dma_memcpy_nd(%c0_i32, %c0_i32, %arg0[%c0_i32, %c0_i32, %c32_i32, %c0_i32] [%c1_i32, %c1_i32, %c32_i32, %c256_i32] [%c0_i32, %c0_i32, %c256_i32]) {id = 3 : i32, metadata = @airMemcpyId4} : (i32, i32, memref<64x256xi32>, [i32, i32, i32, i32], [i32, i32, i32, i32], [i32, i32, i32])
AIEX.ipu.dma_memcpy_nd(%c0_i32, %c0_i32, %arg0[%c0_i32, %c0_i32, %c32_i32, %c0_i32] [%c1_i32, %c1_i32, %c32_i32, %c256_i32] [%c0_i32, %c0_i32, %c256_i32]) {id = 4 : i32, metadata = @airMemcpyId4} : (i32, i32, memref<64x256xi32>, [i32, i32, i32, i32], [i32, i32, i32, i32], [i32, i32, i32])
AIEX.ipu.dma_memcpy_nd(%c0_i32, %c0_i32, %arg1[%c0_i32, %c0_i32, %c0_i32, %c0_i32] [%c2_i32, %c2_i32, %c256_i32, %c32_i32] [%c0_i32, %c32_i32, %c64_i32]) {id = 5 : i32, metadata = @airMemcpyId5} : (i32, i32, memref<256x64xi32>, [i32, i32, i32, i32], [i32, i32, i32, i32], [i32, i32, i32])
AIEX.ipu.dma_memcpy_nd(%c0_i32, %c0_i32, %arg2[%c0_i32, %c0_i32, %c0_i32, %c0_i32] [%c2_i32, %c2_i32, %c32_i32, %c32_i32] [%c2048_i32, %c32_i32, %c64_i32]) {id = 6 : i32, metadata = @airMemcpyId16} : (i32, i32, memref<64x64xi32>, [i32, i32, i32, i32], [i32, i32, i32, i32], [i32, i32, i32])
AIEX.ipu.sync {channel = 0 : i32, column = 0 : i32, column_num = 1 : i32, direction = 0 : i32, row = 0 : i32, row_num = 1 : i32}
return
}
} {sym_name = "segment_0"}
}
}
}
func.func @matmul_static(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
%c65536 = arith.constant 65536 : index
%c16384 = arith.constant 16384 : index
%c0 = arith.constant 0 : index
%c268435488_i32 = arith.constant 268435488 : i32
%c1_i32 = arith.constant 1 : i32
%c64 = arith.constant 64 : index
%c256 = arith.constant 256 : index
hal.buffer_view.assert<%arg0 : !hal.buffer_view> message("input 0") shape([%c64, %c256]) type(%c268435488_i32) encoding(%c1_i32)
%0 = stream.tensor.import %arg0 : !hal.buffer_view -> tensor<64x256xi32> in !stream.resource<external>{%c65536}
hal.buffer_view.assert<%arg1 : !hal.buffer_view> message("input 1") shape([%c256, %c64]) type(%c268435488_i32) encoding(%c1_i32)
%1 = stream.tensor.import %arg1 : !hal.buffer_view -> tensor<256x64xi32> in !stream.resource<external>{%c65536}
%result, %result_timepoint = stream.resource.alloca uninitialized : !stream.resource<external>{%c16384} => !stream.timepoint
%2 = stream.cmd.execute await(%result_timepoint) => with(%0 as %arg2: !stream.resource<external>{%c65536}, %1 as %arg3: !stream.resource<external>{%c65536}, %result as %arg4: !stream.resource<external>{%c16384}) {
stream.cmd.dispatch @matmul_static_dispatch_0::@elf::@matmul_static_dispatch_0_matmul_64x64x256_i32 {
ro %arg2[%c0 for %c65536] : !stream.resource<external>{%c65536},
ro %arg3[%c0 for %c65536] : !stream.resource<external>{%c65536},
wo %arg4[%c0 for %c16384] : !stream.resource<external>{%c16384}
} attributes {hal.interface.bindings = [#hal.interface.binding<0, 0>, #hal.interface.binding<0, 1>, #hal.interface.binding<0, 2>]}
} => !stream.timepoint
%3 = stream.timepoint.await %2 => %result : !stream.resource<external>{%c16384}
%4 = stream.tensor.export %3 : tensor<64x64xi32> in !stream.resource<external>{%c16384} -> !hal.buffer_view
return %4 : !hal.buffer_view
}
}
The below gist is a simple example showing how externally written kernels can be plugged into IREE. Further examples and details can be found here:
https://github.com/openxla/iree/tree/main/samples/custom_dispatch vulkan, cpu, cuda examples
Results
#spirv_target = #hal.executable.target<"vulkan", "vulkan-spirv-fb", {spirv.target_env = #spirv.target_env<#spirv.vce<v1.6, [Shader, Float64, Float16, Int64, Int16, Int8, StorageBuffer16BitAccess, StorageUniform16, StoragePushConstant16, StorageBuffer8BitAccess, UniformAndStorageBuffer8BitAccess, StoragePushConstant8, GroupNonUniform, GroupNonUniformVote, GroupNonUniformArithmetic, GroupNonUniformBallot, GroupNonUniformShuffle, GroupNonUniformShuffleRelative, GroupNonUniformClustered, GroupNonUniformQuad, VariablePointers, VariablePointersStorageBuffer, DotProduct, DotProductInputAll, DotProductInput4x8BitPacked, DotProductInput4x8Bit, CooperativeMatrixKHR], [SPV_KHR_16bit_storage, SPV_KHR_8bit_storage, SPV_KHR_integer_dot_product, SPV_KHR_storage_buffer_storage_class, SPV_KHR_variable_pointers, SPV_KHR_cooperative_matrix]>, api=Vulkan, AMD:DiscreteGPU, #spirv.resource_limits<max_compute_shared_memory_size = 65536, max_compute_workgroup_invocations = 1024, max_compute_workgroup_size = [1024, 1024, 1024], subgroup_size = 64, min_subgroup_size = 32, max_subgroup_size = 64, cooperative_matrix_properties_khr = [#spirv.coop_matrix_props_khr<m_size = 16, n_size = 16, k_size = 16, a_type = i8, b_type = i8, c_type = i32, result_type = i32, acc_sat = false, scope = <Subgroup>>, #spirv.coop_matrix_props_khr<m_size = 16, n_size = 16, k_size = 16, a_type = f16, b_type = f16, c_type = f16, result_type = f16, acc_sat = false, scope = <Subgroup>>, #spirv.coop_matrix_props_khr<m_size = 16, n_size = 16, k_size = 16, a_type = f16, b_type = f16, c_type = f32, result_type = f32, acc_sat = false, scope = <Subgroup>>]>>}>
module {
func.func @forward(%arg0: tensor<1x32000xf16>) -> tensor<1xi64> {
%c1 = arith.constant 1 : index
%dim = tensor.dim %arg0, %c1 : tensor<1x32000xf16>
%dim_i32 = arith.index_cast %dim : index to i32
%4 = hal.dispatch.extern "main"[%dim](%dim_i32, %arg0) : (i32, tensor<1x32000xf16>) -> tensor<1xi64>
count(%device: !hal.device, %workload: index) -> (index, index, index) {
%c1_0 = arith.constant 1 : index
hal.return %c1_0, %c1_0, %c1_0 : index, index, index
}
layout(#hal.pipeline.layout<push_constants = 1, sets = [
<0, bindings = [
<0, storage_buffer, ReadOnly>,
<1, storage_buffer>
]>
]>)
bindings([
#hal.interface.binding<0, 0>,
#hal.interface.binding<0, 1>
])
objects(#hal.executable.objects<{
// Per-target specification of the kernel to use. The compiler will automatically select which one
// to use based on the target being compiled for, or the runtime can pick when multi-targeting.
#spirv_target = [
#hal.executable.object<{
// Path to the .spv/.hsaco, or can just embed the binary blob directly.
path = "/home/quinn/one_workgroup_argmax_subgroup_f16.spv"
}>
]
}>)
return %4 : tensor<1xi64>
}
}
Micro kernels try to get best of both world by using automatic code generation to form the outer data access pattern but using a hand written kernel for the inner computation
Results
// -----// IR Dump After LLVMCPULowerToAccelUKernels (iree-llvmcpu-lower-to-accel-ukernels) //----- //
module {
func.func @forward_dispatch_7_matmul_8x2048x2048_f32() {
%c4 = arith.constant 4 : index
%c2048 = arith.constant 2048 : index
%c8 = arith.constant 8 : index
%c0 = arith.constant 0 : index
%c428630016 = arith.constant 428630016 : index
%0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<8x2048xf32>>
%1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c428630016) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<2048x2048xf32>>
%2 = hal.interface.binding.subspan set(0) binding(2) type(storage_buffer) alignment(64) offset(%c0) : !flow.dispatch.tensor<writeonly:tensor<8x2048xf32>>
%workgroup_id_x = hal.interface.workgroup.id[0] : index
%workgroup_count_x = hal.interface.workgroup.count[0] : index
%workgroup_id_y = hal.interface.workgroup.id[1] : index
%workgroup_count_y = hal.interface.workgroup.count[1] : index
%3 = affine.apply affine_map<()[s0] -> (s0 * 4)>()[%workgroup_id_y]
%4 = affine.apply affine_map<()[s0] -> (s0 * 4)>()[%workgroup_count_y]
scf.for %arg0 = %3 to %c8 step %4 {
%5 = affine.apply affine_map<()[s0] -> (s0 * 2048)>()[%workgroup_id_x]
%6 = affine.apply affine_map<()[s0] -> (s0 * 2048)>()[%workgroup_count_x]
scf.for %arg1 = %5 to %c2048 step %6 {
%7 = flow.dispatch.tensor.load %2, offsets = [%arg0, %arg1], sizes = [4, 2048], strides = [1, 1] : !flow.dispatch.tensor<writeonly:tensor<8x2048xf32>> -> tensor<4x2048xf32>
%8 = flow.dispatch.tensor.load %0, offsets = [%arg0, 0], sizes = [4, 2048], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<8x2048xf32>> -> tensor<4x2048xf32>
%9 = flow.dispatch.tensor.load %1, offsets = [0, %arg1], sizes = [2048, 2048], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<2048x2048xf32>> -> tensor<2048x2048xf32>
%10 = iree_codegen.ukernel.generic "accel_matmul_f32" ins(%8, %9 : tensor<4x2048xf32>, tensor<2048x2048xf32>) outs(%7 : tensor<4x2048xf32>) (%c4, %c2048, %c2048 : index, index, index) fn_def_attrs {hal.import.fields = ["processor_data", "processor_id"]} strided_outer_dims(0) -> tensor<4x2048xf32>
flow.dispatch.tensor.store %10, %2, offsets = [%arg0, %arg1], sizes = [4, 2048], strides = [1, 1] : tensor<4x2048xf32> -> !flow.dispatch.tensor<writeonly:tensor<8x2048xf32>>
}
}
return
}
}
Running command : /tmp/mravisha/llvm-aie/install/RelWithDebInfo/bin/opt -O2 --inline-threshold=10 /tmp/matmul_static_dispatch_0-53cc41.bc -o /tmp/matmul_static_dispatch_0-48121d.opt.bc
Running command : /tmp/mravisha/llvm-aie/install/RelWithDebInfo/bin/llc -O2 --march=aie2 --function-sections --filetype=obj /tmp/matmul_static_dispatch_0-48121d.opt.bc -o /tmp/matmul_static_dispatch_0-b18ca6.o
Running command : /tmp/mravisha/llvm-aie/install/RelWithDebInfo/bin/clang -O2 --target=aie2-none-elf /tmp/matmul_static_dispatch_0-b18ca6.o /proj/xcohdstaff6/abhvarma/mlir-aie/install/aie_runtime_lib/AIE2/me_basic.o /tmp/mravisha/llvm-aie/install/RelWithDebInfo/lib/aie2-none-unknown-elf/libc.a -Wl,--gc-sections -Wl,-T,/tmp/segment_0_core_0_2.elf-2af211.ld.script -o /tmp/elf-a040f9/segment_0_core_0_2.elf
Running command : /tmp/mravisha/llvm-aie/install/RelWithDebInfo/bin/clang -O2 --target=aie2-none-elf /tmp/matmul_static_dispatch_0-b18ca6.o /proj/xcohdstaff6/abhvarma/mlir-aie/install/aie_runtime_lib/AIE2/me_basic.o /tmp/mravisha/llvm-aie/install/RelWithDebInfo/lib/aie2-none-unknown-elf/libc.a -Wl,--gc-sections -Wl,-T,/tmp/segment_0_core_0_3.elf-374038.ld.script -o /tmp/elf-a040f9/segment_0_core_0_3.elf
Running command : /tmp/mravisha/llvm-aie/install/RelWithDebInfo/bin/clang -O2 --target=aie2-none-elf /tmp/matmul_static_dispatch_0-b18ca6.o /proj/xcohdstaff6/abhvarma/mlir-aie/install/aie_runtime_lib/AIE2/me_basic.o /tmp/mravisha/llvm-aie/install/RelWithDebInfo/lib/aie2-none-unknown-elf/libc.a -Wl,--gc-sections -Wl,-T,/tmp/segment_0_core_0_4.elf-1702a2.ld.script -o /tmp/elf-a040f9/segment_0_core_0_4.elf
Running command : /tmp/mravisha/llvm-aie/install/RelWithDebInfo/bin/clang -O2 --target=aie2-none-elf /tmp/matmul_static_dispatch_0-b18ca6.o /proj/xcohdstaff6/abhvarma/mlir-aie/install/aie_runtime_lib/AIE2/me_basic.o /tmp/mravisha/llvm-aie/install/RelWithDebInfo/lib/aie2-none-unknown-elf/libc.a -Wl,--gc-sections -Wl,-T,/tmp/segment_0_core_0_5.elf-f5381d.ld.script -o /tmp/elf-a040f9/segment_0_core_0_5.elf
Running command : /tmp/mravisha/llvm-aie/install/RelWithDebInfo/bin/clang++ -fPIC -c -std=c++17 -D__AIEARCH__=20 -D__AIESIM__ -D__CDO__ -D__PS_INIT_AIE__ -D__LOCK_FENCE_MODE__=2 -DAIE_OPTION_SCALAR_FLOAT_ON_VECTOR -DAIE2_FP32_EMULATION_ACCURACY_FAST -Wno-deprecated-declarations -I/tmp/elf-a040f9 -I/proj/xcohdstaff6/abhvarma/mlir-aie/install/runtime_lib/x86_64/xaiengine/cdo/include -I/proj/xbuilds/SWIP/2023.2_1013_2256/installs/lin64/Vitis/2023.2/aietools/include -o /tmp/elf-a040f9/gen_cdo.o /proj/xcohdstaff6/abhvarma/mlir-aie/install/data/generated-source/gen_cdo.cpp
Running command : /tmp/mravisha/llvm-aie/install/RelWithDebInfo/bin/clang++ -fPIC -c -std=c++17 -I/tmp/elf-a040f9 -I/proj/xcohdstaff6/abhvarma/mlir-aie/install/runtime_lib/x86_64/xaiengine/cdo/include -I/proj/xbuilds/SWIP/2023.2_1013_2256/installs/lin64/Vitis/2023.2/aietools/include -o /tmp/elf-a040f9/cdo_main.o /proj/xcohdstaff6/abhvarma/mlir-aie/install/data/generated-source/cdo_main.cpp
Running command : /tmp/mravisha/llvm-aie/install/RelWithDebInfo/bin/clang++ -L/proj/xcohdstaff6/abhvarma/mlir-aie/install/runtime_lib/x86_64/xaiengine/cdo -L/proj/xbuilds/SWIP/2023.2_1013_2256/installs/lin64/Vitis/2023.2/aietools/lib/lnx64.o -lxaienginecdo -lcdo_driver -o /tmp/elf-a040f9/cdo_main /tmp/elf-a040f9/gen_cdo.o /tmp/elf-a040f9/cdo_main.o
Running command : LD_LIBRARY_PATH=/proj/xcohdstaff6/abhvarma/mlir-aie/install/runtime_lib/x86_64/xaiengine/cdo:/proj/xbuilds/SWIP/2023.2_1013_2256/installs/lin64/Vitis/2023.2/aietools/lib/lnx64.o /tmp/elf-a040f9/cdo_main --work-dir-path /tmp/elf-a040f9/
Generating: /tmp/elf-a040f9/aie_cdo_error_handling.bin
Generating: /tmp/elf-a040f9/aie_cdo_elfs.bin
Generating: /tmp/elf-a040f9/aie_cdo_init.bin
Generating: /tmp/elf-a040f9/aie_cdo_enable.bin
Running command : /proj/xbuilds/SWIP/2023.2_1013_2256/installs/lin64/Vitis/2023.2/bin/bootgen -arch versal -image /tmp/elf-a040f9/design.bif -o /tmp/elf-a040f9/design.pdi -w
****** Bootgen v2023.2
**** Build date : Oct 11 2023-12:50:27
** Copyright 1986-2022 Xilinx, Inc. All Rights Reserved.
** Copyright 2022-2023 Advanced Micro Devices, Inc. All Rights Reserved.
[INFO] : Bootimage generated successfully
Running command : /proj/xbuilds/SWIP/2023.2_1013_2256/installs/lin64/Vitis/2023.2/bin/xclbinutil --input /proj/xcohdstaff6/abhvarma/mlir-aie/install/data/1x4.xclbin --add-kernel /tmp/elf-a040f9/kernels.json --add-replace-section AIE_PARTITION:JSON:/tmp/elf-a040f9/aie_partition.json --force --output /tmp/elf-a040f9/final.xclbin
XRT Build Version: 2.16.0 (Vitis)
Build Date: 2023-07-13 16:00:55
Hash ID: 157faa07876c55bb8aa8ec51b28608a6a0f6638e
Reading xclbin file into memory. File: /proj/xcohdstaff6/abhvarma/mlir-aie/install/data/1x4.xclbin
Section 'AIE_PARTITION'(32) was successfully removed
Section: 'AIE_PARTITION'(32) was successfully added.
Size : 16960 bytes
Format : JSON
File : '/tmp/elf-a040f9/aie_partition.json'
Section 'GROUP_TOPOLOGY'(26) was successfully removed
Section 'GROUP_CONNECTIVITY'(27) was successfully removed
Successfully wrote (82484 bytes) to the output file: /tmp/elf-a040f9/final.xclbin
Leaving xclbinutil.
./test.exe -x /tmp/elf-b8a25a/final.xclbin -k MLIR_AIE -i module_matmul_static_dispatch_0_elf.insts.txt -v 10
Sequence instr count: 58
Loading xclbin: /tmp/elf-b8a25a/final.xclbin
Kernel opcode: MLIR_AIE
Name: vadd
Name: DPU_1x4
Name: PP_FD_PRE
Name: PP_SSIM_PRE
Name: PP_EGC_PRE
Name: PP_EGC_POST
Name: PP_FD_POST
Name: PP_PIXELWISE_SELECT
Name: PP_BLENDING
Name: PP_ROW_FILTER
Name: MLIR_AIE
Registering xclbin: /tmp/elf-b8a25a/final.xclbin
xclbin uuid: 1
Getting hardware context.
Getting handle to kernel:MLIR_AIE
Writing data into buffer objects.
Running Kernel.
PASS!
./tools/iree-run-module --device=xrt --module=output.vmfb --input=8x16xi32=2 --input=16x32xi32=3
EXEC @matmul_static
result[0]: hal.buffer_view
8x32xi32=[96 96 96 96 96 96 96 96][96 96 96 96 96 96 96 96]....