Support i1 datatype #18713

Draft · lialan wants to merge 4 commits into main

Conversation

lialan (Contributor) commented on Oct 7, 2024:

This patch enables i1 datatype support.

  • Previously, i1 was treated as i8 in memory; this patch avoids padding each i1 out to a full byte.
  • Fixes corner-case issues with i1 and i2 where the vector size is not a multiple of 8 bits.
  • Must land together with an upstream change that handles sub-byte-sized vector and memref types.
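
For intuition, the storage math this implies can be sketched as below. This is illustrative only, not code from the patch (the actual logic lives in calculateStorageElementCountInBytes), and packedStorageBytes is a hypothetical name:

#include <cstdint>
#include "llvm/ADT/ArrayRef.h"
#include "llvm/Support/MathExtras.h"

// Hypothetical sketch: bytes needed for a sub-byte tensor when the innermost
// dimension is padded up to a byte boundary.
int64_t packedStorageBytes(llvm::ArrayRef<int64_t> shape, unsigned bitWidth) {
  // Pad the last dimension so each row occupies a whole number of bytes.
  int64_t rowElems = llvm::alignTo(shape.back(), 8 / bitWidth);
  int64_t rows = 1;
  for (int64_t dim : shape.drop_back())
    rows *= dim;
  return rows * rowElems * bitWidth / 8;
}

For example, tensor<2x3xi1> pads each 3-bit row to 8 bits, giving 2 bytes total, versus the 6 bytes used when every i1 was stored as an i8.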

@@ -99,6 +99,11 @@ Value calculateStorageElementCountInBytes(Location loc,
}
}

// make sure the last dimension is byte aligned.
Collaborator commented:

style: proper punctuation (here and elsewhere) in comments: https://google.github.io/styleguide/cppguide.html#Punctuation,_Spelling_and_Grammar

lialan linked an issue on Oct 7, 2024 that may be closed by this pull request.
hanhanW (Contributor) left a comment:

Alan and I had an offline sync, and he is revisiting the codegen-side changes. I'm not an expert on the host-side changes, so we need some input from Ben.

Comment on lines +940 to +953
// Align the tensor type to a multiple of 8 bits.
auto rankedTensorType = tensorType.asRankedTensorType();
auto elementSize = rankedTensorType.getElementType().getIntOrFloatBitWidth();
auto typeSize = tensorType.getNumElements() * elementSize;

if (typeSize % 8 != 0) { // typeSize is already in bits; don't re-scale by elementSize.
  SmallVector<int64_t> newShape(rankedTensorType.getShape());
  newShape.back() = llvm::alignTo(newShape.back(), 8 / elementSize);

  auto newTensorType = IREE::Flow::DispatchTensorType::get(
      tensorType.getAccess(), newShape,
      rankedTensorType.getElementType(), rankedTensorType.getEncoding());
  tensorType = newTensorType;
}
hanhanW commented:

We need some input from @benvanik about how to land this properly. My understanding is that we want to align i1 shapes with bytes, e.g., 6xi1 becomes 8xi1 on both the stream allocation and dispatch sides. The current approach replaces the flow.dispatch.tensor type with 8xi1 while leaving the 6xi1 type in the stream.tensor.sizeof op; see the snippet below for details. This looks off to me because:

  1. I don't think it works with dynamic shapes, because the arguments of DispatchTieShapeOp are not taken into account.
  2. It leaks the stream.tensor.sizeof lowering logic into the FlowToStream conversion. Is that okay?

Ben knows more of the details, so please correct me if I'm wrong. I think we can still keep all the logic in the FlowToStream conversion: we either need a type converter, or we introduce a legalizePackedType method in ElementPackingUtils.[h|cpp] that shares logic between the buildResultSizeOf method and the ConvertExecutableOp patterns. legalizePackedType would take the tensor type and dynamicDims and do something similar to calculateStorageElementCountInBytes.
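
For concreteness, a rough sketch of what such a helper could look like, handling only static shapes for brevity (illustrative only; the real version would also consume dynamicDims, as calculateStorageElementCountInBytes does):

// Sketch only (uses mlir/IR/BuiltinTypes.h, llvm/ADT/SmallVector.h, and
// llvm/Support/MathExtras.h): align the innermost dimension of a sub-byte
// tensor type so each row starts on a byte boundary.
static mlir::RankedTensorType legalizePackedType(mlir::RankedTensorType type) {
  unsigned bitWidth = type.getElementTypeBitWidth();
  if (bitWidth >= 8 || type.getRank() == 0 ||
      mlir::ShapedType::isDynamic(type.getShape().back()))
    return type; // Dynamic sizes would need the dynamicDims values.
  llvm::SmallVector<int64_t> shape(type.getShape());
  shape.back() = llvm::alignTo(shape.back(), 8 / bitWidth);
  return mlir::RankedTensorType::get(shape, type.getElementType(),
                                     type.getEncoding());
}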

The other approach is to update the logic in EncodeTensors.cpp. That pass has logic for encoding both host tensors and device tensors, and we probably just need to update the alignTensorType logic. (I don't know for sure; please do some study.)
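
Under either approach, the expectation stated above (6xi1 becomes 8xi1 on both sides) means the host-side size computation should also see the padded type, i.e. something like the following hypothetical IR, shown only to illustrate the alignment; the actual dump below still has tensor<6xi1> here:

%6 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<8xi1> : index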

@benvanik do you have any suggestions about where the change should happen?

// -----// IR Dump Before ConvertToStreamPass (iree-stream-conversion) //----- //
#executable_target_embedded_elf_x86_64_ = #hal.executable.target<"llvm-cpu", "embedded-elf-x86_64", {cpu = "generic", cpu_features = "", data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128", native_vector_size = 16 : i64, target_triple = "x86_64-unknown-unknown-eabi-elf"}>
#map = affine_map<(d0) -> (d0)>
#device_target_local = #hal.device.target<"local", [#executable_target_embedded_elf_x86_64_]> : !hal.device
module attributes {stream.affinity.default = #hal.device.affinity<@__device_0>} {
  util.global private @__device_0 = #device_target_local
  flow.executable private @add_tensors_dispatch_0 {
    flow.executable.export public @add_tensors_dispatch_0_elementwise_6_i1 workgroups() -> (index, index, index) {
      %x, %y, %z = flow.dispatch.workgroup_count_from_slice 
      flow.return %x, %y, %z : index, index, index
    }
    builtin.module {
      func.func @add_tensors_dispatch_0_elementwise_6_i1(%arg0: !flow.dispatch.tensor<readonly:tensor<6xi1>>, %arg1: !flow.dispatch.tensor<readonly:tensor<6xi1>>, %arg2: !flow.dispatch.tensor<writeonly:tensor<6xi1>>) {
        %0 = flow.dispatch.tensor.load %arg0, offsets = [0], sizes = [6], strides = [1] : !flow.dispatch.tensor<readonly:tensor<6xi1>> -> tensor<6xi1>
        %1 = flow.dispatch.tensor.load %arg1, offsets = [0], sizes = [6], strides = [1] : !flow.dispatch.tensor<readonly:tensor<6xi1>> -> tensor<6xi1>
        %2 = tensor.empty() : tensor<6xi1>
        %3 = linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel"]} ins(%0, %1 : tensor<6xi1>, tensor<6xi1>) outs(%2 : tensor<6xi1>) {
        ^bb0(%in: i1, %in_0: i1, %out: i1):
          %4 = arith.addi %in, %in_0 : i1
          linalg.yield %4 : i1
        } -> tensor<6xi1>
        flow.dispatch.tensor.store %3, %arg2, offsets = [0], sizes = [6], strides = [1] : tensor<6xi1> -> !flow.dispatch.tensor<writeonly:tensor<6xi1>>
        return
      }
    }
  }
  util.func public @add_tensors(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub, iree.reflection = {iree.abi.declaration = "sync func @add_tensors(%input0: tensor<2x3xi1>, %input1: tensor<2x3xi1>) -> (%output0: tensor<2x3xi1>)"}} {
    %0 = hal.tensor.import %arg0 "input0" : !hal.buffer_view -> tensor<2x3xi1>
    %1 = hal.tensor.import %arg1 "input1" : !hal.buffer_view -> tensor<2x3xi1>
    %2 = flow.tensor.reshape %0 : tensor<2x3xi1> -> tensor<6xi1>
    %3 = flow.tensor.reshape %1 : tensor<2x3xi1> -> tensor<6xi1>
    %4 = flow.dispatch @add_tensors_dispatch_0::@add_tensors_dispatch_0_elementwise_6_i1(%2, %3) : (tensor<6xi1>, tensor<6xi1>) -> tensor<6xi1>
    %5 = flow.tensor.reshape %4 : tensor<6xi1> -> tensor<2x3xi1>
    %6 = hal.tensor.export %5 "output0" : tensor<2x3xi1> -> !hal.buffer_view
    util.return %6 : !hal.buffer_view
  }
}


// -----// IR Dump Before VerifyLoweringToTensorsPass (iree-stream-verify-lowering-to-tensors) //----- //
#executable_target_embedded_elf_x86_64_ = #hal.executable.target<"llvm-cpu", "embedded-elf-x86_64", {cpu = "generic", cpu_features = "", data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128", native_vector_size = 16 : i64, target_triple = "x86_64-unknown-unknown-eabi-elf"}>
#map = affine_map<(d0) -> (d0)>
#device_target_local = #hal.device.target<"local", [#executable_target_embedded_elf_x86_64_]> : !hal.device
module attributes {stream.affinity.default = #hal.device.affinity<@__device_0>} {
  util.global private @__device_0 = #device_target_local
  stream.executable private @add_tensors_dispatch_0 {
    stream.executable.export public @add_tensors_dispatch_0_elementwise_6_i1 workgroups() -> (index, index, index) {
      %x, %y, %z = flow.dispatch.workgroup_count_from_slice 
      stream.return %x, %y, %z : index, index, index
    }
    builtin.module {
      func.func @add_tensors_dispatch_0_elementwise_6_i1(%arg0: !stream.binding, %arg1: !stream.binding, %arg2: !stream.binding) {
        %c0 = arith.constant 0 : index
        %0 = stream.binding.subspan %arg0[%c0] : !stream.binding -> !flow.dispatch.tensor<readonly:tensor<8xi1>>
        %1 = stream.binding.subspan %arg1[%c0] : !stream.binding -> !flow.dispatch.tensor<readonly:tensor<8xi1>>
        %2 = stream.binding.subspan %arg2[%c0] : !stream.binding -> !flow.dispatch.tensor<writeonly:tensor<8xi1>>
        %3 = flow.dispatch.tensor.load %0, offsets = [0], sizes = [6], strides = [1] : !flow.dispatch.tensor<readonly:tensor<8xi1>> -> tensor<6xi1>
        %4 = flow.dispatch.tensor.load %1, offsets = [0], sizes = [6], strides = [1] : !flow.dispatch.tensor<readonly:tensor<8xi1>> -> tensor<6xi1>
        %5 = tensor.empty() : tensor<6xi1>
        %6 = linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel"]} ins(%3, %4 : tensor<6xi1>, tensor<6xi1>) outs(%5 : tensor<6xi1>) {
        ^bb0(%in: i1, %in_0: i1, %out: i1):
          %7 = arith.addi %in, %in_0 : i1
          linalg.yield %7 : i1
        } -> tensor<6xi1>
        flow.dispatch.tensor.store %6, %2, offsets = [0], sizes = [6], strides = [1] : tensor<6xi1> -> !flow.dispatch.tensor<writeonly:tensor<8xi1>>
        return
      }
    }
  }
  util.func public @add_tensors(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub, iree.reflection = {iree.abi.declaration = "sync func @add_tensors(%input0: tensor<2x3xi1>, %input1: tensor<2x3xi1>) -> (%output0: tensor<2x3xi1>)"}} {
    %element_type_i1 = hal.element_type<i1> : i32
    %dense_row_major = hal.encoding_type<dense_row_major> : i32
    %c2 = arith.constant 2 : index
    %c3 = arith.constant 3 : index
    hal.buffer_view.assert<%arg0 : !hal.buffer_view> message("input0") shape([%c2, %c3]) type(%element_type_i1) encoding(%dense_row_major)
    %0 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<2x3xi1> : index
    %1 = stream.tensor.import on(#hal.device.affinity<@__device_0>) %arg0 : !hal.buffer_view -> tensor<2x3xi1> in !stream.resource<external>{%0}
    %2 = stream.async.transfer %1 : !stream.resource<external>{%0} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<*>{%0}
    %element_type_i1_0 = hal.element_type<i1> : i32
    %dense_row_major_1 = hal.encoding_type<dense_row_major> : i32
    %c2_2 = arith.constant 2 : index
    %c3_3 = arith.constant 3 : index
    hal.buffer_view.assert<%arg1 : !hal.buffer_view> message("input1") shape([%c2_2, %c3_3]) type(%element_type_i1_0) encoding(%dense_row_major_1)
    %3 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<2x3xi1> : index
    %4 = stream.tensor.import on(#hal.device.affinity<@__device_0>) %arg1 : !hal.buffer_view -> tensor<2x3xi1> in !stream.resource<external>{%3}
    %5 = stream.async.transfer %4 : !stream.resource<external>{%3} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<*>{%3}
    %6 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<6xi1> : index
    %7 = stream.tensor.clone on(#hal.device.affinity<@__device_0>) %2 : tensor<2x3xi1> in !stream.resource<*>{%0} -> tensor<6xi1> in !stream.resource<*>{%6}
    %8 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<6xi1> : index
    %9 = stream.tensor.clone on(#hal.device.affinity<@__device_0>) %5 : tensor<2x3xi1> in !stream.resource<*>{%3} -> tensor<6xi1> in !stream.resource<*>{%8}
    %c0 = arith.constant 0 : index
    %10 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<6xi1> : index
    %11 = stream.async.dispatch on(#hal.device.affinity<@__device_0>) @add_tensors_dispatch_0::@add_tensors_dispatch_0_elementwise_6_i1(%7[%c0 to %6 for %6], %9[%c0 to %8 for %8]) : (!stream.resource<*>{%6}, !stream.resource<*>{%8}) -> !stream.resource<*>{%10}
    %12 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<2x3xi1> : index
    %13 = stream.tensor.clone on(#hal.device.affinity<@__device_0>) %11 : tensor<6xi1> in !stream.resource<*>{%10} -> tensor<2x3xi1> in !stream.resource<*>{%12}
    %14 = stream.async.transfer %13 : !stream.resource<*>{%12} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<external>{%12}
    %15 = stream.tensor.export on(#hal.device.affinity<@__device_0>) %14 : tensor<2x3xi1> in !stream.resource<external>{%12} -> !hal.buffer_view
    util.return %15 : !hal.buffer_view
  }
}

Successfully merging this pull request may close these issues:

  • Plumb i1 datatype through the compilation pipeline