Pass to transfer the strided access pattern from L3 to L2 #792

Closed
wants to merge 4 commits

Conversation

@yzhang93 (Contributor) commented Sep 20, 2024

This PR aims to improve the control code, and thus matmul performance, by moving some DMA instructions from the L3 side to the L2 side. The details of the idea are in #764 (comment).

This pass has not been plugged into the e2e passes yet. Some changes to AMDAIEDmaLoopSubsumption are needed to make it fully work. Ideally, I'd like to make it a pattern inside AMDAIEDmaComposition and run all the NPU DMA optimizations at once, instead of running them as a sequence of DmaComposition -> TransferStridedAccessPattern -> CanonicalizeDoublyStridedOp -> DmaSubsumption.

    return emitError(rewriter.getUnknownLoc())
           << "failed to get dim position for combination";
  }
  size_t dimForCombine = isCombinable.value();
Collaborator:
Right now, you're looking for a single dimension that can be combined with the innermost dimension? Ideally, this would work for multiple dimensions as well. For example:

[[0, 0, 0, 0] [3, 2, 32, 32] [64, 32, 128, 1]]

should become:

[[0, 0] [32, 192] [128, 1]]
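
For illustration only (this is not the pass implementation; the function name and plain std types are made up), one way to derive the fully combined pattern in this example is to sort the dimensions by stride and merge a dimension into the already-combined inner ones whenever its stride equals the inner stride times the inner size:

#include <algorithm>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Illustrative sketch: fold all combinable dimensions of a (sizes, strides)
// access pattern, assuming zero offsets. Dimensions are sorted by stride and
// merged from the innermost outwards whenever the outer stride equals
// innerStride * innerSize.
std::vector<std::pair<int64_t, int64_t>> foldDims(
    const std::vector<int64_t> &sizes, const std::vector<int64_t> &strides) {
  std::vector<std::pair<int64_t, int64_t>> dims;  // {stride, size}
  for (size_t i = 0; i < sizes.size(); ++i)
    dims.push_back({strides[i], sizes[i]});
  std::sort(dims.begin(), dims.end(), std::greater<>());
  std::vector<std::pair<int64_t, int64_t>> out;
  for (auto it = dims.rbegin(); it != dims.rend(); ++it) {
    if (!out.empty() && it->first == out.back().first * out.back().second) {
      out.back().second *= it->second;  // contiguous with the inner dims: merge
    } else {
      out.push_back(*it);
    }
  }
  std::reverse(out.begin(), out.end());
  // sizes [3, 2, 32, 32] / strides [64, 32, 128, 1] -> {{128, 32}, {1, 192}},
  // i.e. sizes [32, 192] with strides [128, 1], as in the example above.
  return out;
}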

Contributor Author (@yzhang93):

I agree this case should be considered. However, I'd leave such a change out of this revision until the logic for the new sizes/strides (see the other comments below) is confirmed correct.

Comment on lines +218 to +227
  auto getNewL2Strides = [&](SmallVector<int64_t> values) {
    SmallVector<OpFoldResult> res = {getAsIndexOpFoldResult(ctx, 1)};
    int64_t initial = values.back();
    // Leave out one dimension for insertion afterwards
    for (size_t i = values.size() - 2; i > 0; i--) {
      initial *= values[i];
      res.push_back(getAsIndexOpFoldResult(ctx, initial));
    }
    return llvm::to_vector(llvm::reverse(res));
  };
Collaborator:

Could you explain how to calculate the new strides? I don't understand how it's just initial *= values[i];?

Contributor Author (@yzhang93) commented Sep 23, 2024:

Take the following DMA ops as an example:

%45 = amdaie.npu.circular_dma_cpy_nd %8([0] [2048] [1], [] [] [])
%46 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[0, 0, 0, %41] [4, 2, 32, 32] [4096, 32, 128, 1])

The logic is to create the L2-side strides starting from the innermost dimension, and then reverse the vector to get the final order. The new L2-side strides always start with [1], and should have the same number of dimensions as the original L3-side source addressing. The next dimensions are calculated with initial *= l3OrigSizes[i].

The initial value is the number of innermost contiguous elements, which is l3OrigSizes[-1] * l3OrigStrides[-1] (the implementation omits l3OrigStrides[-1] because l3OrigStrides[-1] == 1). The combined elements are now contiguous on the L3 side, but need a strided addressing on the L2 side; that stride should be initial * l3OrigSizes[-2]. So after this iteration, the strides are [1, 32 * 32].

The same logic applies to the next iteration, so the strides become [1, 32 * 32, 32 * 32 * 2] = [1, 1024, 2048]. After reversal, it's [2048, 1024, 1]. Finally, the stride of the combined dimension, which is l3OrigStrides[dimForCombine], is inserted at its position (index 1 in this example), giving the final strides [2048, 32, 1024, 1].

Let me know if this is the correct logic to get the L2 side strides, or if there's a better way to calculate this.
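
To make the walkthrough above concrete, here is a minimal standalone sketch of the same arithmetic (plain std types; the name newL2Strides is hypothetical, and the actual pass builds OpFoldResults instead of raw integers):

#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch of the stride computation described above: build the new L2-side
// strides as if the original L3 access pattern were fully contiguous, then
// insert the original L3 stride of the combined dimension at its position.
std::vector<int64_t> newL2Strides(const std::vector<int64_t> &l3Sizes,
                                  const std::vector<int64_t> &l3Strides,
                                  size_t dimForCombine) {
  assert(l3Sizes.size() >= 2 && l3Strides.back() == 1);
  std::vector<int64_t> res = {1};
  int64_t initial = l3Sizes.back();  // innermost contiguous element count
  // Index 0 is skipped: one slot is left for the combined dimension's stride.
  for (size_t i = l3Sizes.size() - 2; i > 0; --i) {
    initial *= l3Sizes[i];
    res.push_back(initial);
  }
  std::reverse(res.begin(), res.end());
  res.insert(res.begin() + dimForCombine, l3Strides[dimForCombine]);
  return res;
}

// For l3Sizes = {4, 2, 32, 32}, l3Strides = {4096, 32, 128, 1} and
// dimForCombine = 1, this returns {2048, 32, 1024, 1}, matching the example.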

Collaborator:

The combined elements are now contiguous on the L3 side, but need a strided addressing on the L2 side; that stride should be initial * l3OrigSizes[-2].

Yeah, I think the core idea is good: i.e. that the strides should be created as if the original L3 was contiguous and then rearranged based on which dimension(s) are combined with the innermost one.

However, I do think this needs extensive tests to ensure correctness, as it will otherwise lead to hard-to-debug numerical errors in the future. So it would be good to create a standalone utility function that takes in a set of static offsets/sizes/strides and produces the new static L3 and L2 offsets/sizes/strides, so that it can be tested standalone (ctest, not lit) on a lot of different cases; see for example: https://github.com/nod-ai/iree-amd-aie/blob/main/compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/test/AMDAIEDmaUtilsTest.cpp
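
For instance, if the logic were exposed through a hypothetical utility like the newL2Strides sketch earlier in this thread (the signature below is made up for illustration, not the actual utility), a gtest case in the spirit of AMDAIEDmaUtilsTest.cpp could pin down the example discussed above:

#include <cstdint>
#include <vector>

#include "gtest/gtest.h"

// Hypothetical utility under test; the real one would presumably take and
// return static offsets/sizes/strides for both the L3 and L2 sides.
std::vector<int64_t> newL2Strides(const std::vector<int64_t> &l3Sizes,
                                  const std::vector<int64_t> &l3Strides,
                                  size_t dimForCombine);

TEST(TransferStridedAccessPatternTest, CombineSingleDim) {
  // L3 source addressing [4, 2, 32, 32] / [4096, 32, 128, 1] with the
  // dimension at index 1 moved to the L2 side.
  std::vector<int64_t> l2Strides =
      newL2Strides({4, 2, 32, 32}, {4096, 32, 128, 1}, /*dimForCombine=*/1);
  EXPECT_EQ(l2Strides, (std::vector<int64_t>{2048, 32, 1024, 1}));
}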

circularDma.getTargetMixedStrides();

// Change the source/target addressing of all users from a connection op.
for (Operation *user : connectionOp->getUsers()) {
Collaborator:

What if different NPU DMA users have different strides/sizes/offsets?

Contributor Author (@yzhang93):

I don't know if we would have such cases. I looked through the current IR and only found cases where the connection op has multiple NpuDmaCpyNdOp users (just with different offsets) and one NpuCircularDmaCpyNdOp user.

Collaborator:

We do see it with peeled matmul, and regardless, we should check for it and return/abort.
