Support i1 datatype #18713
base: main
Conversation
@@ -99,6 +99,11 @@ Value calculateStorageElementCountInBytes(Location loc,
  }
}

// make sure the last dimension is byte aligned.
style: proper punctuation (here and elsewhere) in comments: https://google.github.io/styleguide/cppguide.html#Punctuation,_Spelling_and_Grammar
Signed-off-by: Alan Li <[email protected]>
Alan and I had an offline sync, and he is revisiting the codegen-side changes. I'm not an expert on the host-side changes, so we need some input from Ben.
// align tensor type to multiple of 8 bits:
auto rankedTensorType = tensorType.asRankedTensorType();
auto elementSize = rankedTensorType.getElementType().getIntOrFloatBitWidth();
auto typeSize = tensorType.getNumElements() * elementSize;

if (typeSize * elementSize % 8 != 0) {
  SmallVector<int64_t> newShape(rankedTensorType.getShape());
  newShape.back() = llvm::alignTo(newShape.back(), 8 / elementSize);

  auto newTensorType = IREE::Flow::DispatchTensorType::get(
      tensorType.getAccess(), newShape,
      rankedTensorType.getElementType(), rankedTensorType.getEncoding());
  tensorType = newTensorType;
}
We need some input from @benvanik about how we land this properly. My understanding is that we want to align i1 shapes with bytes, e.g., 6xi1 becomes 8xi1 on both the stream allocation and dispatch sides. The current approach replaces the flow.dispatch.tensor type with 8xi1, while leaving the 6xi1 type in the stream.tensor.sizeof op. See the snippet below for more details. This looks off to me because:
- I think it does not work with dynamic shapes, because the arguments of DispatchTieShapeOp are not taken into account.
- It leaks the stream.tensor.sizeof lowering logic into the FlowToStream conversion. Is that okay?
Ben knows more details; please correct me if I'm wrong. I think we can still keep all of the logic in the FlowToStream conversion. We either need a type converter, or we can introduce a legalizePackedType method in ElementPackingUtils.[h|cpp] that shares the logic between the buildResultSizeOf method and the ConvertExecutableOp patterns. The legalizePackedType method would take the tensor type and dynamicDims and do something similar to calculateStorageElementCountInBytes.
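For illustration, here is a minimal sketch of what such a helper might look like (the legalizePackedType name, its signature, and the surrounding details are assumptions, not code from this patch; it only covers the static-shape case and would still need the dynamicDims plumbing):

// Hypothetical helper for ElementPackingUtils.[h|cpp] -- a sketch only.
// Rounds the innermost dimension of a sub-byte tensor up so the total bit
// width is a multiple of 8, mirroring calculateStorageElementCountInBytes.
#include "mlir/IR/BuiltinTypes.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/Support/MathExtras.h"

using namespace mlir;

static RankedTensorType legalizePackedType(RankedTensorType tensorType) {
  Type elementType = tensorType.getElementType();
  if (!elementType.isIntOrFloat())
    return tensorType;
  unsigned bitWidth = elementType.getIntOrFloatBitWidth();
  // Byte-aligned element types need no extra padding.
  if (bitWidth >= 8 || 8 % bitWidth != 0)
    return tensorType;
  llvm::SmallVector<int64_t> newShape = llvm::to_vector(tensorType.getShape());
  // A dynamic trailing dimension would need the dynamicDims values instead.
  if (newShape.empty() || ShapedType::isDynamic(newShape.back()))
    return tensorType;
  newShape.back() = llvm::alignTo(newShape.back(), 8 / bitWidth);
  return RankedTensorType::get(newShape, elementType,
                               tensorType.getEncoding());
}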
The other approach is updating the logic in EncodeTensors.cpp. The pass has logic to encode host tensors and device tensors, and we probably just need to update the alignTensorType logic. (I don't know for sure; please do some study.)
@benvanik do you have any suggestions about where the change should happen?
// -----// IR Dump Before ConvertToStreamPass (iree-stream-conversion) //----- //
#executable_target_embedded_elf_x86_64_ = #hal.executable.target<"llvm-cpu", "embedded-elf-x86_64", {cpu = "generic", cpu_features = "", data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128", native_vector_size = 16 : i64, target_triple = "x86_64-unknown-unknown-eabi-elf"}>
#map = affine_map<(d0) -> (d0)>
#device_target_local = #hal.device.target<"local", [#executable_target_embedded_elf_x86_64_]> : !hal.device
module attributes {stream.affinity.default = #hal.device.affinity<@__device_0>} {
util.global private @__device_0 = #device_target_local
flow.executable private @add_tensors_dispatch_0 {
flow.executable.export public @add_tensors_dispatch_0_elementwise_6_i1 workgroups() -> (index, index, index) {
%x, %y, %z = flow.dispatch.workgroup_count_from_slice
flow.return %x, %y, %z : index, index, index
}
builtin.module {
func.func @add_tensors_dispatch_0_elementwise_6_i1(%arg0: !flow.dispatch.tensor<readonly:tensor<6xi1>>, %arg1: !flow.dispatch.tensor<readonly:tensor<6xi1>>, %arg2: !flow.dispatch.tensor<writeonly:tensor<6xi1>>) {
%0 = flow.dispatch.tensor.load %arg0, offsets = [0], sizes = [6], strides = [1] : !flow.dispatch.tensor<readonly:tensor<6xi1>> -> tensor<6xi1>
%1 = flow.dispatch.tensor.load %arg1, offsets = [0], sizes = [6], strides = [1] : !flow.dispatch.tensor<readonly:tensor<6xi1>> -> tensor<6xi1>
%2 = tensor.empty() : tensor<6xi1>
%3 = linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel"]} ins(%0, %1 : tensor<6xi1>, tensor<6xi1>) outs(%2 : tensor<6xi1>) {
^bb0(%in: i1, %in_0: i1, %out: i1):
%4 = arith.addi %in, %in_0 : i1
linalg.yield %4 : i1
} -> tensor<6xi1>
flow.dispatch.tensor.store %3, %arg2, offsets = [0], sizes = [6], strides = [1] : tensor<6xi1> -> !flow.dispatch.tensor<writeonly:tensor<6xi1>>
return
}
}
}
util.func public @add_tensors(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub, iree.reflection = {iree.abi.declaration = "sync func @add_tensors(%input0: tensor<2x3xi1>, %input1: tensor<2x3xi1>) -> (%output0: tensor<2x3xi1>)"}} {
%0 = hal.tensor.import %arg0 "input0" : !hal.buffer_view -> tensor<2x3xi1>
%1 = hal.tensor.import %arg1 "input1" : !hal.buffer_view -> tensor<2x3xi1>
%2 = flow.tensor.reshape %0 : tensor<2x3xi1> -> tensor<6xi1>
%3 = flow.tensor.reshape %1 : tensor<2x3xi1> -> tensor<6xi1>
%4 = flow.dispatch @add_tensors_dispatch_0::@add_tensors_dispatch_0_elementwise_6_i1(%2, %3) : (tensor<6xi1>, tensor<6xi1>) -> tensor<6xi1>
%5 = flow.tensor.reshape %4 : tensor<6xi1> -> tensor<2x3xi1>
%6 = hal.tensor.export %5 "output0" : tensor<2x3xi1> -> !hal.buffer_view
util.return %6 : !hal.buffer_view
}
}
// -----// IR Dump Before VerifyLoweringToTensorsPass (iree-stream-verify-lowering-to-tensors) //----- //
#executable_target_embedded_elf_x86_64_ = #hal.executable.target<"llvm-cpu", "embedded-elf-x86_64", {cpu = "generic", cpu_features = "", data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128", native_vector_size = 16 : i64, target_triple = "x86_64-unknown-unknown-eabi-elf"}>
#map = affine_map<(d0) -> (d0)>
#device_target_local = #hal.device.target<"local", [#executable_target_embedded_elf_x86_64_]> : !hal.device
module attributes {stream.affinity.default = #hal.device.affinity<@__device_0>} {
util.global private @__device_0 = #device_target_local
stream.executable private @add_tensors_dispatch_0 {
stream.executable.export public @add_tensors_dispatch_0_elementwise_6_i1 workgroups() -> (index, index, index) {
%x, %y, %z = flow.dispatch.workgroup_count_from_slice
stream.return %x, %y, %z : index, index, index
}
builtin.module {
func.func @add_tensors_dispatch_0_elementwise_6_i1(%arg0: !stream.binding, %arg1: !stream.binding, %arg2: !stream.binding) {
%c0 = arith.constant 0 : index
%0 = stream.binding.subspan %arg0[%c0] : !stream.binding -> !flow.dispatch.tensor<readonly:tensor<8xi1>>
%1 = stream.binding.subspan %arg1[%c0] : !stream.binding -> !flow.dispatch.tensor<readonly:tensor<8xi1>>
%2 = stream.binding.subspan %arg2[%c0] : !stream.binding -> !flow.dispatch.tensor<writeonly:tensor<8xi1>>
%3 = flow.dispatch.tensor.load %0, offsets = [0], sizes = [6], strides = [1] : !flow.dispatch.tensor<readonly:tensor<8xi1>> -> tensor<6xi1>
%4 = flow.dispatch.tensor.load %1, offsets = [0], sizes = [6], strides = [1] : !flow.dispatch.tensor<readonly:tensor<8xi1>> -> tensor<6xi1>
%5 = tensor.empty() : tensor<6xi1>
%6 = linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel"]} ins(%3, %4 : tensor<6xi1>, tensor<6xi1>) outs(%5 : tensor<6xi1>) {
^bb0(%in: i1, %in_0: i1, %out: i1):
%7 = arith.addi %in, %in_0 : i1
linalg.yield %7 : i1
} -> tensor<6xi1>
flow.dispatch.tensor.store %6, %2, offsets = [0], sizes = [6], strides = [1] : tensor<6xi1> -> !flow.dispatch.tensor<writeonly:tensor<8xi1>>
return
}
}
}
util.func public @add_tensors(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub, iree.reflection = {iree.abi.declaration = "sync func @add_tensors(%input0: tensor<2x3xi1>, %input1: tensor<2x3xi1>) -> (%output0: tensor<2x3xi1>)"}} {
%element_type_i1 = hal.element_type<i1> : i32
%dense_row_major = hal.encoding_type<dense_row_major> : i32
%c2 = arith.constant 2 : index
%c3 = arith.constant 3 : index
hal.buffer_view.assert<%arg0 : !hal.buffer_view> message("input0") shape([%c2, %c3]) type(%element_type_i1) encoding(%dense_row_major)
%0 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<2x3xi1> : index
%1 = stream.tensor.import on(#hal.device.affinity<@__device_0>) %arg0 : !hal.buffer_view -> tensor<2x3xi1> in !stream.resource<external>{%0}
%2 = stream.async.transfer %1 : !stream.resource<external>{%0} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<*>{%0}
%element_type_i1_0 = hal.element_type<i1> : i32
%dense_row_major_1 = hal.encoding_type<dense_row_major> : i32
%c2_2 = arith.constant 2 : index
%c3_3 = arith.constant 3 : index
hal.buffer_view.assert<%arg1 : !hal.buffer_view> message("input1") shape([%c2_2, %c3_3]) type(%element_type_i1_0) encoding(%dense_row_major_1)
%3 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<2x3xi1> : index
%4 = stream.tensor.import on(#hal.device.affinity<@__device_0>) %arg1 : !hal.buffer_view -> tensor<2x3xi1> in !stream.resource<external>{%3}
%5 = stream.async.transfer %4 : !stream.resource<external>{%3} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<*>{%3}
%6 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<6xi1> : index
%7 = stream.tensor.clone on(#hal.device.affinity<@__device_0>) %2 : tensor<2x3xi1> in !stream.resource<*>{%0} -> tensor<6xi1> in !stream.resource<*>{%6}
%8 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<6xi1> : index
%9 = stream.tensor.clone on(#hal.device.affinity<@__device_0>) %5 : tensor<2x3xi1> in !stream.resource<*>{%3} -> tensor<6xi1> in !stream.resource<*>{%8}
%c0 = arith.constant 0 : index
%10 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<6xi1> : index
%11 = stream.async.dispatch on(#hal.device.affinity<@__device_0>) @add_tensors_dispatch_0::@add_tensors_dispatch_0_elementwise_6_i1(%7[%c0 to %6 for %6], %9[%c0 to %8 for %8]) : (!stream.resource<*>{%6}, !stream.resource<*>{%8}) -> !stream.resource<*>{%10}
%12 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<2x3xi1> : index
%13 = stream.tensor.clone on(#hal.device.affinity<@__device_0>) %11 : tensor<6xi1> in !stream.resource<*>{%10} -> tensor<2x3xi1> in !stream.resource<*>{%12}
%14 = stream.async.transfer %13 : !stream.resource<*>{%12} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<external>{%12}
%15 = stream.tensor.export on(#hal.device.affinity<@__device_0>) %14 : tensor<2x3xi1> in !stream.resource<external>{%12} -> !hal.buffer_view
util.return %15 : !hal.buffer_view
}
}
This patch enables i1 datatype support. Previously, i1 was treated as i8 in memory. This patch avoids padding i1 out to a full byte per element, and handles sub-byte types such as i1 and i2 where the size of the vector is not a multiple of 8 bits.
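As a standalone illustration of the packing math described above (not code from this patch; the packedByteSize helper is hypothetical), rounding a sub-byte tensor's total bit count up to whole bytes gives:

#include <cstdint>
#include <cstdio>

// Hypothetical standalone example: storage size in bytes for a packed
// sub-byte tensor, rounding the total bit count up to the next byte.
static uint64_t packedByteSize(uint64_t numElements, uint64_t elementBits) {
  uint64_t totalBits = numElements * elementBits;
  return (totalBits + 7) / 8;
}

int main() {
  // tensor<6xi1> and tensor<2x3xi1> both hold 6 bits -> 1 byte once padded
  // out to a byte boundary (8xi1), instead of 6 bytes when i1 is stored as i8.
  std::printf("6xi1  -> %llu byte(s)\n",
              (unsigned long long)packedByteSize(6, 1));
  // tensor<5xi2> holds 10 bits -> 2 bytes.
  std::printf("5xi2  -> %llu byte(s)\n",
              (unsigned long long)packedByteSize(5, 2));
  return 0;
}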