Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RISCV] Account for factor in interleave memory op costs #111511

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

lukel97
Copy link
Contributor

@lukel97 lukel97 commented Oct 8, 2024

Currently we cost an interleaved memory op as if it were a load/store of the widened vector type.

However this doesn't take into account that we'll most likely need to perform at least Factor uops because we're writing/reading from Factor number of registers.

E.g. Today an i8 VF=2 Factor=8 interleave is costed as a single LMUL=1 op with +zvl128b, because the widened type is <16 x i8>.

This changes it to be calculated as <2 x i8> * Factor=8, i.e. 8 LMUL=1 ops.

Thankfully the FIXME about illegal vectors seems to have been fixed in #100436, and even then I think the LT.first should have been multiplied, not added.

Note we still have a quirk where the loop vectorizer will happily emit interleaved accesses for what could be strided accesses, because the costs are break-even in LoopVectorizationCostModel::setCostBasedWideningDecision:

void f(int8_t* a, int n) {
    for (int i = 0; i < n; i++) {
        a[i * 2] += 1;
    }
}

vsetvli	t1, zero, e8, m2, ta, ma
vlseg2e8.v	v24, (t0)
vadd.vi	v24, v24, 1
vsse8.v	v24, (a6), a5

I think we may need to either adjust the cost or add a hook to get the loop vectorizer to prefer gather/scatters that are strided.

Currently we cost an interleaved memory op as if it were a load/store of the widened vector type.

However this doesn't take into account that we'll most likely need to perform at least Factor uops because we're writing/reading from Factor number of registers.

E.g. Today an i8 VF=2 Factor=8 interleave is costed as a single LMUL=1 op with +zvl128b, because the widened type is <16 x i8>.

This changes it to be calculated as <2 x i8> * Factor=8, i.e. 8 LMUL=1 ops.

Thankfully the FIXME about illegal vectors seems to have been fixed in llvm#100436, and even then I think the LT.first should have been multiplied, not added.

Note we still have a quirk where the loop vectorizer will happily emit interleaved accesses for what could be strided accesses, because the costs are break-even in LoopVectorizationCostModel::setCostBasedWideningDecision:

	void f(int8_t* a, int n) {
	    for (int i = 0; i < n; i++) {
	        a[i * 2] += 1;
	    }
	}

	vsetvli	t1, zero, e8, m2, ta, ma
	vlseg2e8.v	v24, (t0)
	vadd.vi	v24, v24, 1
	vsse8.v	v24, (a6), a5

I think we may need to either adjust the cost or add a hook to get the loop vectorizer to s
@llvmbot
Copy link
Collaborator

llvmbot commented Oct 8, 2024

@llvm/pr-subscribers-backend-risc-v

Author: Luke Lau (lukel97)

Changes

Currently we cost an interleaved memory op as if it were a load/store of the widened vector type.

However this doesn't take into account that we'll most likely need to perform at least Factor uops because we're writing/reading from Factor number of registers.

E.g. Today an i8 VF=2 Factor=8 interleave is costed as a single LMUL=1 op with +zvl128b, because the widened type is <16 x i8>.

This changes it to be calculated as <2 x i8> * Factor=8, i.e. 8 LMUL=1 ops.

Thankfully the FIXME about illegal vectors seems to have been fixed in #100436, and even then I think the LT.first should have been multiplied, not added.

Note we still have a quirk where the loop vectorizer will happily emit interleaved accesses for what could be strided accesses, because the costs are break-even in LoopVectorizationCostModel::setCostBasedWideningDecision:

void f(int8_t* a, int n) {
    for (int i = 0; i &lt; n; i++) {
        a[i * 2] += 1;
    }
}

vsetvli	t1, zero, e8, m2, ta, ma
vlseg2e8.v	v24, (t0)
vadd.vi	v24, v24, 1
vsse8.v	v24, (a6), a5

I think we may need to either adjust the cost or add a hook to get the loop vectorizer to s


Full diff: https://github.com/llvm/llvm-project/pull/111511.diff

2 Files Affected:

  • (modified) llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp (+2-10)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/interleaved-cost.ll (+56-56)
diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
index a61461681f79ed..608298903cace0 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
@@ -691,19 +691,11 @@ InstructionCost RISCVTTIImpl::getInterleavedMemoryOpCost(
       auto *SubVecTy =
           VectorType::get(VTy->getElementType(),
                           VTy->getElementCount().divideCoefficientBy(Factor));
-
       if (VTy->getElementCount().isKnownMultipleOf(Factor) &&
           TLI->isLegalInterleavedAccessType(SubVecTy, Factor, Alignment,
                                             AddressSpace, DL)) {
-        // FIXME: We use the memory op cost of the *legalized* type here,
-        // because it's getMemoryOpCost returns a really expensive cost for
-        // types like <6 x i8>, which show up when doing interleaves of
-        // Factor=3 etc. Should the memory op cost of these be cheaper?
-        auto *LegalVTy = VectorType::get(VTy->getElementType(),
-                                         LT.second.getVectorElementCount());
-        InstructionCost LegalMemCost = getMemoryOpCost(
-            Opcode, LegalVTy, Alignment, AddressSpace, CostKind);
-        return LT.first + LegalMemCost;
+        return Factor * getMemoryOpCost(Opcode, SubVecTy, Alignment,
+                                        AddressSpace, CostKind);
       }
     }
   }
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/interleaved-cost.ll b/llvm/test/Transforms/LoopVectorize/RISCV/interleaved-cost.ll
index fa346b4eac02d4..34285b26c13875 100644
--- a/llvm/test/Transforms/LoopVectorize/RISCV/interleaved-cost.ll
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/interleaved-cost.ll
@@ -12,20 +12,20 @@ entry:
 ; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
 ; CHECK: Cost of 2 for VF 8: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
 ; CHECK: Cost of 2 for VF 8: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 16: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
-; CHECK: Cost of 3 for VF 16: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 32: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
-; CHECK: Cost of 5 for VF 32: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
+; CHECK: Cost of 2 for VF 16: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF 16: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
+; CHECK: Cost of 4 for VF 32: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 4 for VF 32: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
 ; CHECK: Cost of 2 for VF vscale x 1: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
 ; CHECK: Cost of 2 for VF vscale x 1: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
 ; CHECK: Cost of 2 for VF vscale x 2: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
 ; CHECK: Cost of 2 for VF vscale x 2: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
 ; CHECK: Cost of 2 for VF vscale x 4: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
 ; CHECK: Cost of 2 for VF vscale x 4: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF vscale x 8: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
-; CHECK: Cost of 3 for VF vscale x 8: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF vscale x 16: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
-; CHECK: Cost of 5 for VF vscale x 16: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
+; CHECK: Cost of 2 for VF vscale x 8: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF vscale x 8: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
+; CHECK: Cost of 4 for VF vscale x 16: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 4 for VF vscale x 16: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
 for.body:
   %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
   %p0 = getelementptr inbounds %i8.2, ptr %data, i64 %i, i32 0
@@ -49,16 +49,16 @@ define void @i8_factor_3(ptr %data, i64 %n) {
 entry:
   br label %for.body
 ; CHECK-LABEL: Checking a loop in 'i8_factor_3'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
-; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
-; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
+; CHECK: Cost of 3 for VF 2: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
+; CHECK: Cost of 3 for VF 2: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
+; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
+; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
 ; CHECK: Cost of 3 for VF 8: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
 ; CHECK: Cost of 3 for VF 8: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
-; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
-; CHECK: Cost of 9 for VF 32: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
-; CHECK: Cost of 9 for VF 32: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
+; CHECK: Cost of 3 for VF 16: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
+; CHECK: Cost of 3 for VF 16: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
+; CHECK: Cost of 6 for VF 32: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
+; CHECK: Cost of 6 for VF 32: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
 for.body:
   %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
   %p0 = getelementptr inbounds %i8.3, ptr %data, i64 %i, i32 0
@@ -86,16 +86,16 @@ define void @i8_factor_4(ptr %data, i64 %n) {
 entry:
   br label %for.body
 ; CHECK-LABEL: Checking a loop in 'i8_factor_4'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
-; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
-; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 8: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
-; CHECK: Cost of 3 for VF 8: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
-; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
-; CHECK: Cost of 9 for VF 32: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
-; CHECK: Cost of 9 for VF 32: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
+; CHECK: Cost of 4 for VF 2: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
+; CHECK: Cost of 4 for VF 2: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
+; CHECK: Cost of 4 for VF 4: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
+; CHECK: Cost of 4 for VF 4: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
+; CHECK: Cost of 4 for VF 8: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
+; CHECK: Cost of 4 for VF 8: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
+; CHECK: Cost of 4 for VF 16: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
+; CHECK: Cost of 4 for VF 16: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
+; CHECK: Cost of 8 for VF 32: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
+; CHECK: Cost of 8 for VF 32: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
 for.body:
   %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
   %p0 = getelementptr inbounds %i8.4, ptr %data, i64 %i, i32 0
@@ -127,14 +127,14 @@ define void @i8_factor_5(ptr %data, i64 %n) {
 entry:
   br label %for.body
 ; CHECK-LABEL: Checking a loop in 'i8_factor_5'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 5 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
-; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 5 at <badref>, ir<%p0>
+; CHECK: Cost of 5 for VF 2: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
+; CHECK: Cost of 5 for VF 2: INTERLEAVE-GROUP with factor 5 at <badref>, ir<%p0>
+; CHECK: Cost of 5 for VF 4: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
+; CHECK: Cost of 5 for VF 4: INTERLEAVE-GROUP with factor 5 at <badref>, ir<%p0>
 ; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
 ; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 5 at <badref>, ir<%p0>
-; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
-; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 5 at <badref>, ir<%p0>
+; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
+; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 5 at <badref>, ir<%p0>
 for.body:
   %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
   %p0 = getelementptr inbounds %i8.5, ptr %data, i64 %i, i32 0
@@ -170,14 +170,14 @@ define void @i8_factor_6(ptr %data, i64 %n) {
 entry:
   br label %for.body
 ; CHECK-LABEL: Checking a loop in 'i8_factor_6'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
-; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
-; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
-; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
-; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
+; CHECK: Cost of 6 for VF 2: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
+; CHECK: Cost of 6 for VF 2: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
+; CHECK: Cost of 6 for VF 4: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
+; CHECK: Cost of 6 for VF 4: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
+; CHECK: Cost of 6 for VF 8: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
+; CHECK: Cost of 6 for VF 8: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
+; CHECK: Cost of 6 for VF 16: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
+; CHECK: Cost of 6 for VF 16: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
 for.body:
   %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
   %p0 = getelementptr inbounds %i8.6, ptr %data, i64 %i, i32 0
@@ -217,14 +217,14 @@ define void @i8_factor_7(ptr %data, i64 %n) {
 entry:
   br label %for.body
 ; CHECK-LABEL: Checking a loop in 'i8_factor_7'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
-; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
-; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
-; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
-; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
+; CHECK: Cost of 7 for VF 2: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
+; CHECK: Cost of 7 for VF 2: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
+; CHECK: Cost of 7 for VF 4: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
+; CHECK: Cost of 7 for VF 4: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
+; CHECK: Cost of 7 for VF 8: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
+; CHECK: Cost of 7 for VF 8: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
+; CHECK: Cost of 7 for VF 16: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
+; CHECK: Cost of 7 for VF 16: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
 for.body:
   %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
   %p0 = getelementptr inbounds %i8.7, ptr %data, i64 %i, i32 0
@@ -268,14 +268,14 @@ define void @i8_factor_8(ptr %data, i64 %n) {
 entry:
   br label %for.body
 ; CHECK-LABEL: Checking a loop in 'i8_factor_8'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 8 at %l0, ir<%p0>
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 8 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 8 at %l0, ir<%p0>
-; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 8 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 8 at %l0, ir<%p0>
-; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 8 at <badref>, ir<%p0>
-; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 8 at %l0, ir<%p0>
-; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 8 at <badref>, ir<%p0>
+; CHECK: Cost of 8 for VF 2: INTERLEAVE-GROUP with factor 8 at %l0, ir<%p0>
+; CHECK: Cost of 8 for VF 2: INTERLEAVE-GROUP with factor 8 at <badref>, ir<%p0>
+; CHECK: Cost of 8 for VF 4: INTERLEAVE-GROUP with factor 8 at %l0, ir<%p0>
+; CHECK: Cost of 8 for VF 4: INTERLEAVE-GROUP with factor 8 at <badref>, ir<%p0>
+; CHECK: Cost of 8 for VF 8: INTERLEAVE-GROUP with factor 8 at %l0, ir<%p0>
+; CHECK: Cost of 8 for VF 8: INTERLEAVE-GROUP with factor 8 at <badref>, ir<%p0>
+; CHECK: Cost of 8 for VF 16: INTERLEAVE-GROUP with factor 8 at %l0, ir<%p0>
+; CHECK: Cost of 8 for VF 16: INTERLEAVE-GROUP with factor 8 at <badref>, ir<%p0>
 for.body:
   %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
   %p0 = getelementptr inbounds %i8.8, ptr %data, i64 %i, i32 0

@llvmbot
Copy link
Collaborator

llvmbot commented Oct 8, 2024

@llvm/pr-subscribers-llvm-transforms

Author: Luke Lau (lukel97)

Changes

Currently we cost an interleaved memory op as if it were a load/store of the widened vector type.

However this doesn't take into account that we'll most likely need to perform at least Factor uops because we're writing/reading from Factor number of registers.

E.g. Today an i8 VF=2 Factor=8 interleave is costed as a single LMUL=1 op with +zvl128b, because the widened type is <16 x i8>.

This changes it to be calculated as <2 x i8> * Factor=8, i.e. 8 LMUL=1 ops.

Thankfully the FIXME about illegal vectors seems to have been fixed in #100436, and even then I think the LT.first should have been multiplied, not added.

Note we still have a quirk where the loop vectorizer will happily emit interleaved accesses for what could be strided accesses, because the costs are break-even in LoopVectorizationCostModel::setCostBasedWideningDecision:

void f(int8_t* a, int n) {
    for (int i = 0; i &lt; n; i++) {
        a[i * 2] += 1;
    }
}

vsetvli	t1, zero, e8, m2, ta, ma
vlseg2e8.v	v24, (t0)
vadd.vi	v24, v24, 1
vsse8.v	v24, (a6), a5

I think we may need to either adjust the cost or add a hook to get the loop vectorizer to s


Full diff: https://github.com/llvm/llvm-project/pull/111511.diff

2 Files Affected:

  • (modified) llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp (+2-10)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/interleaved-cost.ll (+56-56)
diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
index a61461681f79ed..608298903cace0 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
@@ -691,19 +691,11 @@ InstructionCost RISCVTTIImpl::getInterleavedMemoryOpCost(
       auto *SubVecTy =
           VectorType::get(VTy->getElementType(),
                           VTy->getElementCount().divideCoefficientBy(Factor));
-
       if (VTy->getElementCount().isKnownMultipleOf(Factor) &&
           TLI->isLegalInterleavedAccessType(SubVecTy, Factor, Alignment,
                                             AddressSpace, DL)) {
-        // FIXME: We use the memory op cost of the *legalized* type here,
-        // because it's getMemoryOpCost returns a really expensive cost for
-        // types like <6 x i8>, which show up when doing interleaves of
-        // Factor=3 etc. Should the memory op cost of these be cheaper?
-        auto *LegalVTy = VectorType::get(VTy->getElementType(),
-                                         LT.second.getVectorElementCount());
-        InstructionCost LegalMemCost = getMemoryOpCost(
-            Opcode, LegalVTy, Alignment, AddressSpace, CostKind);
-        return LT.first + LegalMemCost;
+        return Factor * getMemoryOpCost(Opcode, SubVecTy, Alignment,
+                                        AddressSpace, CostKind);
       }
     }
   }
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/interleaved-cost.ll b/llvm/test/Transforms/LoopVectorize/RISCV/interleaved-cost.ll
index fa346b4eac02d4..34285b26c13875 100644
--- a/llvm/test/Transforms/LoopVectorize/RISCV/interleaved-cost.ll
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/interleaved-cost.ll
@@ -12,20 +12,20 @@ entry:
 ; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
 ; CHECK: Cost of 2 for VF 8: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
 ; CHECK: Cost of 2 for VF 8: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 16: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
-; CHECK: Cost of 3 for VF 16: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 32: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
-; CHECK: Cost of 5 for VF 32: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
+; CHECK: Cost of 2 for VF 16: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF 16: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
+; CHECK: Cost of 4 for VF 32: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 4 for VF 32: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
 ; CHECK: Cost of 2 for VF vscale x 1: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
 ; CHECK: Cost of 2 for VF vscale x 1: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
 ; CHECK: Cost of 2 for VF vscale x 2: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
 ; CHECK: Cost of 2 for VF vscale x 2: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
 ; CHECK: Cost of 2 for VF vscale x 4: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
 ; CHECK: Cost of 2 for VF vscale x 4: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF vscale x 8: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
-; CHECK: Cost of 3 for VF vscale x 8: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF vscale x 16: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
-; CHECK: Cost of 5 for VF vscale x 16: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
+; CHECK: Cost of 2 for VF vscale x 8: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF vscale x 8: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
+; CHECK: Cost of 4 for VF vscale x 16: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 4 for VF vscale x 16: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
 for.body:
   %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
   %p0 = getelementptr inbounds %i8.2, ptr %data, i64 %i, i32 0
@@ -49,16 +49,16 @@ define void @i8_factor_3(ptr %data, i64 %n) {
 entry:
   br label %for.body
 ; CHECK-LABEL: Checking a loop in 'i8_factor_3'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
-; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
-; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
+; CHECK: Cost of 3 for VF 2: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
+; CHECK: Cost of 3 for VF 2: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
+; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
+; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
 ; CHECK: Cost of 3 for VF 8: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
 ; CHECK: Cost of 3 for VF 8: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
-; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
-; CHECK: Cost of 9 for VF 32: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
-; CHECK: Cost of 9 for VF 32: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
+; CHECK: Cost of 3 for VF 16: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
+; CHECK: Cost of 3 for VF 16: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
+; CHECK: Cost of 6 for VF 32: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
+; CHECK: Cost of 6 for VF 32: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
 for.body:
   %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
   %p0 = getelementptr inbounds %i8.3, ptr %data, i64 %i, i32 0
@@ -86,16 +86,16 @@ define void @i8_factor_4(ptr %data, i64 %n) {
 entry:
   br label %for.body
 ; CHECK-LABEL: Checking a loop in 'i8_factor_4'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
-; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
-; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 8: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
-; CHECK: Cost of 3 for VF 8: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
-; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
-; CHECK: Cost of 9 for VF 32: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
-; CHECK: Cost of 9 for VF 32: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
+; CHECK: Cost of 4 for VF 2: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
+; CHECK: Cost of 4 for VF 2: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
+; CHECK: Cost of 4 for VF 4: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
+; CHECK: Cost of 4 for VF 4: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
+; CHECK: Cost of 4 for VF 8: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
+; CHECK: Cost of 4 for VF 8: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
+; CHECK: Cost of 4 for VF 16: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
+; CHECK: Cost of 4 for VF 16: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
+; CHECK: Cost of 8 for VF 32: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
+; CHECK: Cost of 8 for VF 32: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
 for.body:
   %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
   %p0 = getelementptr inbounds %i8.4, ptr %data, i64 %i, i32 0
@@ -127,14 +127,14 @@ define void @i8_factor_5(ptr %data, i64 %n) {
 entry:
   br label %for.body
 ; CHECK-LABEL: Checking a loop in 'i8_factor_5'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 5 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
-; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 5 at <badref>, ir<%p0>
+; CHECK: Cost of 5 for VF 2: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
+; CHECK: Cost of 5 for VF 2: INTERLEAVE-GROUP with factor 5 at <badref>, ir<%p0>
+; CHECK: Cost of 5 for VF 4: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
+; CHECK: Cost of 5 for VF 4: INTERLEAVE-GROUP with factor 5 at <badref>, ir<%p0>
 ; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
 ; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 5 at <badref>, ir<%p0>
-; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
-; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 5 at <badref>, ir<%p0>
+; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
+; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 5 at <badref>, ir<%p0>
 for.body:
   %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
   %p0 = getelementptr inbounds %i8.5, ptr %data, i64 %i, i32 0
@@ -170,14 +170,14 @@ define void @i8_factor_6(ptr %data, i64 %n) {
 entry:
   br label %for.body
 ; CHECK-LABEL: Checking a loop in 'i8_factor_6'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
-; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
-; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
-; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
-; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
+; CHECK: Cost of 6 for VF 2: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
+; CHECK: Cost of 6 for VF 2: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
+; CHECK: Cost of 6 for VF 4: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
+; CHECK: Cost of 6 for VF 4: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
+; CHECK: Cost of 6 for VF 8: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
+; CHECK: Cost of 6 for VF 8: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
+; CHECK: Cost of 6 for VF 16: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
+; CHECK: Cost of 6 for VF 16: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
 for.body:
   %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
   %p0 = getelementptr inbounds %i8.6, ptr %data, i64 %i, i32 0
@@ -217,14 +217,14 @@ define void @i8_factor_7(ptr %data, i64 %n) {
 entry:
   br label %for.body
 ; CHECK-LABEL: Checking a loop in 'i8_factor_7'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
-; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
-; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
-; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
-; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
+; CHECK: Cost of 7 for VF 2: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
+; CHECK: Cost of 7 for VF 2: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
+; CHECK: Cost of 7 for VF 4: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
+; CHECK: Cost of 7 for VF 4: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
+; CHECK: Cost of 7 for VF 8: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
+; CHECK: Cost of 7 for VF 8: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
+; CHECK: Cost of 7 for VF 16: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
+; CHECK: Cost of 7 for VF 16: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
 for.body:
   %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
   %p0 = getelementptr inbounds %i8.7, ptr %data, i64 %i, i32 0
@@ -268,14 +268,14 @@ define void @i8_factor_8(ptr %data, i64 %n) {
 entry:
   br label %for.body
 ; CHECK-LABEL: Checking a loop in 'i8_factor_8'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 8 at %l0, ir<%p0>
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 8 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 8 at %l0, ir<%p0>
-; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 8 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 8 at %l0, ir<%p0>
-; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 8 at <badref>, ir<%p0>
-; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 8 at %l0, ir<%p0>
-; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 8 at <badref>, ir<%p0>
+; CHECK: Cost of 8 for VF 2: INTERLEAVE-GROUP with factor 8 at %l0, ir<%p0>
+; CHECK: Cost of 8 for VF 2: INTERLEAVE-GROUP with factor 8 at <badref>, ir<%p0>
+; CHECK: Cost of 8 for VF 4: INTERLEAVE-GROUP with factor 8 at %l0, ir<%p0>
+; CHECK: Cost of 8 for VF 4: INTERLEAVE-GROUP with factor 8 at <badref>, ir<%p0>
+; CHECK: Cost of 8 for VF 8: INTERLEAVE-GROUP with factor 8 at %l0, ir<%p0>
+; CHECK: Cost of 8 for VF 8: INTERLEAVE-GROUP with factor 8 at <badref>, ir<%p0>
+; CHECK: Cost of 8 for VF 16: INTERLEAVE-GROUP with factor 8 at %l0, ir<%p0>
+; CHECK: Cost of 8 for VF 16: INTERLEAVE-GROUP with factor 8 at <badref>, ir<%p0>
 for.body:
   %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
   %p0 = getelementptr inbounds %i8.8, ptr %data, i64 %i, i32 0

@topperc
Copy link
Collaborator

topperc commented Oct 8, 2024

I think we may need to either adjust the cost or add a hook to get the loop vectorizer to s

Incomplete sentence

@lukel97
Copy link
Contributor Author

lukel97 commented Oct 8, 2024

I think we may need to either adjust the cost or add a hook to get the loop vectorizer to s

Incomplete sentence

My bad, fixed now

@preames
Copy link
Collaborator

preames commented Oct 8, 2024

At a conceptual level, this doesn't seem like the right costing. I suspect a better model is performing a wide load and then performing some kind of additional shuffle uop.

Here's some preliminary throughput data from the banana pi3:

perf stat ./vlseg-nf4.out 
  ~23.188822 cycles-per-inst
  ~24061.448450 cycles-per-iteration
  ~1037.631350 insts-per-iteration

perf stat ./vlseg-nf2.out 
  ~11.796910 cycles-per-inst
  ~12026.720450 cycles-per-iteration
  ~1019.480600 insts-per-iteration

These are fairly expensive operations. Both above are LMUL=m1.

Some reference points (focusing on the nf2 case):

perf stat ./vnsrl-m1.out
  ~3.935553 cycles-per-inst
  ~20508.467900 cycles-per-iteration
  ~5211.076900 insts-per-iteration

$ perf stat ./vle-m2.out 
  ~3.972999 cycles-per-inst
  ~4020.006750 cycles-per-iteration
  ~1011.831750 insts-per-iteration

./vlseg-nf2-emulated.out
  ~9039.639900 cycles-per-iteration
  ~5019.285100 insts-per-iteration 

From the look of this, it looks like the right cost model for the segment load instruction might be load cost + NF*shuffle cost.

Interestingly, the emulated version - which uses exactly that expansion - appears to be higher throughput. That was a real surprise to me, as I thought I'd heard that BP3 had fast segmented loads and stores. It's possible I've got an error in my tests, anyone else have data on this question?

@lukel97
Copy link
Contributor Author

lukel97 commented Oct 9, 2024

Interestingly, the emulated version - which uses exactly that expansion - appears to be higher throughput. That was a real surprise to me, as I thought I'd heard that BP3 had fast segmented loads and stores. It's possible I've got an error in my tests, anyone else have data on this question?

I tried this out: I think I can recreate your observation that the emulated vlseg2e8.v is faster, ~0.94B cycles vs ~1.26B cycles:

Sources

	.global start
start:
	la a0, foo
	li a1, 0
	li a2, 104857600
	la a3, dst1
	la a4, dst2
loop:
	vsetvli t0, zero, e16, m2, ta, ma
	vle16.v v8, (a0)
	vsetvli t0, zero, e8, m1, ta, ma
	vnsrl.wi v10, v8, 0
	vnsrl.wi v11, v8, 8
	addi a1, a1, 1
	blt a1, a2, loop
exit:
	li a7, 93
	ecall
 
	.data
foo:
	.zero 512
dst1:
	.zero 256
dst2:
	.zero 256	
	.global start
start:
	la a0, foo
	li a1, 0
	li a2, 104857600
	la a3, dst1
	la a4, dst2
	vsetvli t0, zero, e8, m1, ta, ma
loop:
	vlseg2e8.v v8, (a0)
	addi a1, a1, 1
	blt a1, a2, loop
exit:
	li a7, 93
	ecall
 
	.data
foo:
	.zero 512
dst1:
	.zero 256
dst2:
	.zero 256

However were you storing the deinterleaved results back somewhere? Once I added that in the native vlseg2e8.v overtook the emulated version slightly, ~1.58B cycles vs 1.68B

Sources with stores

	.global start
start:
	la a0, foo
	li a1, 0
	li a2, 104857600
	la a3, dst1
	la a4, dst2
loop:
	vsetvli t0, zero, e16, m2, ta, ma
	vle16.v v8, (a0)
	vsetvli t0, zero, e8, m1, ta, ma
	vnsrl.wi v10, v8, 0
	vnsrl.wi v11, v8, 8
	vse8.v v10, (a3)
	vse8.v v11, (a4)
	addi a1, a1, 1
	blt a1, a2, loop
exit:
	li a7, 93
	ecall
 
	.data
foo:
	.zero 512
dst1:
	.zero 256
dst2:
	.zero 256	
	.global start
start:
	la a0, foo
	li a1, 0
	li a2, 104857600
	la a3, dst1
	la a4, dst2
	vsetvli t0, zero, e8, m1, ta, ma
loop:
	vlseg2e8.v v8, (a0)
	vse8.v v8, (a3)
	vse8.v v9, (a4)
	addi a1, a1, 1
	blt a1, a2, loop
exit:
	li a7, 93
	ecall
 
	.data
foo:
	.zero 512
dst1:
	.zero 256
dst2:
	.zero 256

@lukel97
Copy link
Contributor Author

lukel97 commented Oct 9, 2024

I suspect a better model is performing a wide load and then performing some kind of additional shuffle uop.

I think you're right, I did some more benchmarking on the banana pi and I think the throughput is proportional to something like Wide load + Factor * LMUL. These cycle counts are for various segmented loads without any storing:

vlseg2e8 M1: 1.26B      2 + 2 * 1 = 4 ops   (0.315 cycles/op)
vlseg2e8 M2: 2.52B      4 + 2 * 2 = 8 ops   (0.315 cycles/op)
vlseg2e8 M4: 5.04B      8 + 2 * 4 = 16 ops  (0.315 cycles/op)
					        
vlseg3e8 M1: 2.10B      4 + 3 * 1 = 7 ops   (0.30 cycles/op)
vlseg3e8 M2: 4.20B      8 + 3 * 2 = 14 ops  (0.30 cycles/op)
					        
vlseg4e8 M1: 2.52B      4 + 4 * 1 = 8 ops   (0.315 cycles/op)
vlseg4e8 M2: 5.04B      8 + 4 * 2 = 16 ops  (0.315 cycles/op)

I'm not sure what the exact formula is for the NF=3 loads, but it seems close enough. I'll update this PR anyway.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants