
Optimize Ascii.Equals when widening #87141

Merged (4 commits) Jul 6, 2023
Conversation

@BrennanConroy (Member) commented Jun 5, 2023

While trying to replace some custom unsafe code in Kestrel in dotnet/aspnetcore#48368, we noticed that Ascii.Equals is slower than Kestrel's hand-rolled code. Upon investigation, we found three changes that could improve performance.

  1. When checking for non-ASCII characters in the inputs, we were ORing both sides together. This isn't needed because we already check the two inputs for bitwise equality, so the OR amounted to an unneeded vpor ymm0,ymm0,ymm1.
  2. When comparing a string to a byte[], the bytes are widened, which produces 2 vectors that are each half the size of the original vector. The way the code was written, we would fall back to Vector128 comparisons in the Vector256 case, and Vector64 in the Vector128 case. Refactoring the code to avoid falling back to smaller vector sizes halves the number of loop iterations needed for the same input.
  3. Changing the equality condition after widening produces faster code:
  • Slower: if (lower != rightValues0 || upper != rightValues1)
  • Faster: if (!Vector256<ushort>.AllBitsSet.Equals(Vector256.BitwiseAnd(Vector256.Equals(lower, rightValues0), Vector256.Equals(upper, rightValues1))))
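Point 1 can be sanity-checked at scalar level. This is a Python model of the logic (not the actual vectorized C#): once two chunks are known to be bitwise equal, ORing them together adds nothing to the non-ASCII check.

```python
def is_ascii(chunk: bytes) -> bool:
    # ASCII bytes have the high bit clear.
    return all(b < 0x80 for b in chunk)

left = b"HTTP/1.1 200 OK"
right = bytes(left)  # a bitwise-equal copy, as after the equality check passes

or_combined = bytes(l | r for l, r in zip(left, right))

# Since left == right, ORing the sides changes nothing: checking one
# side for non-ASCII is equivalent to checking the OR of both.
assert left == right
assert or_combined == left
assert is_ascii(left) == is_ascii(or_combined)
```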
Code-gen for the different if conditions

if (lower != rightValues0 || upper != rightValues1)

M01_L00:
       vmovups   ymm1,[rax]
       vptest    ymm1,ymm0
       jne       short M01_L03
       vmovaps   ymm2,ymm1
       vpmovzxbw ymm2,xmm2
       vextracti128 xmm1,ymm1,1
       vpmovzxbw ymm1,xmm1
       vmovups   ymm3,[rdx]
       vmovups   ymm4,[rdx+20]
       vpcmpeqw  ymm2,ymm2,ymm3
       vpmovmskb r10d,ymm2
       cmp       r10d,0FFFFFFFF
       setne     r10b
       movzx     r10d,r10b
       vpcmpeqw  ymm1,ymm1,ymm4
       vpmovmskb r11d,ymm1
       cmp       r11d,0FFFFFFFF
       setne     r11b
       movzx     r11d,r11b
       or        r10b,r11b
       jne       short M01_L03
       add       rdx,40
       add       rax,20
       cmp       rdx,r9
       jbe       short M01_L00
       test      r8b,0F
       jne       short M01_L01
       mov       eax,1
       vzeroupper
       ret

if (!Vector256<ushort>.AllBitsSet.Equals(Vector256.BitwiseAnd(Vector256.Equals(lower, rightValues0), Vector256.Equals(upper, rightValues1))))

M01_L00:
       vmovups   ymm1,[rax]
       vptest    ymm1,ymm0
       jne       short M01_L03
       vmovaps   ymm2,ymm1
       vpmovzxbw ymm2,xmm2
       vextracti128 xmm1,ymm1,1
       vpmovzxbw ymm1,xmm1
       vpcmpeqw  ymm2,ymm2,[rdx]
       vpcmpeqw  ymm1,ymm1,[rdx+20]
       vpand     ymm1,ymm2,ymm1
       vpcmpeqd  ymm2,ymm2,ymm2
       vpcmpeqw  ymm1,ymm1,ymm2
       vpmovmskb r10d,ymm1
       cmp       r10d,0FFFFFFFF
       jne       short M01_L03
       add       rdx,40
       add       rax,20
       cmp       rdx,r9
       jbe       short M01_L00
       test      r8b,0F
       jne       short M01_L01
       mov       eax,1
       vzeroupper
       ret

Much faster in the byte+char comparison case; the other two are likely improved by removing the OR.

|             Method |        Job |              Toolchain | Size |      Mean |     Error |    StdDev |    Median |       Min |       Max | Ratio | MannWhitney(1%) | RatioSD |
|------------------- |----------- |----------------------- |----- |----------:|----------:|----------:|----------:|----------:|----------:|------:|---------------- |--------:|
|       Equals_Bytes | Job-SGZWKI | 8.0.0-fork\corerun.exe |  128 |  4.546 ns | 0.0757 ns | 0.1365 ns |  4.472 ns |  4.410 ns |  4.815 ns |  0.93 |          Faster |    0.03 |
|       Equals_Bytes | Job-JTGSGN | 8.0.0-base\corerun.exe |  128 |  4.887 ns | 0.0654 ns | 0.1259 ns |  4.869 ns |  4.700 ns |  5.344 ns |  1.00 |            Base |    0.00 |
|       Equals_Chars | Job-SGZWKI | 8.0.0-fork\corerun.exe |  128 |  9.503 ns | 0.1326 ns | 0.2425 ns |  9.432 ns |  9.183 ns |  9.831 ns |  0.98 |            Same |    0.03 |
|       Equals_Chars | Job-JTGSGN | 8.0.0-base\corerun.exe |  128 |  9.689 ns | 0.0439 ns | 0.0803 ns |  9.679 ns |  9.519 ns |  9.865 ns |  1.00 |            Base |    0.00 |
| Equals_Bytes_Chars | Job-SGZWKI | 8.0.0-fork\corerun.exe |  128 |  7.583 ns | 0.0622 ns | 0.1137 ns |  7.555 ns |  7.384 ns |  7.938 ns |  0.63 |          Faster |    0.01 |
| Equals_Bytes_Chars | Job-JTGSGN | 8.0.0-base\corerun.exe |  128 | 12.058 ns | 0.1172 ns | 0.2259 ns | 12.078 ns | 11.645 ns | 12.646 ns |  1.00 |            Base |    0.00 |

We can likely get similar gains in EqualsIgnoreCase, and this can of course be expanded to the Vector128 paths. But I'm opening the PR as a draft now to get feedback on the overall approach before expanding it.

Side-note on weird codegen observed

When doing return Vector256<ushort>.AllBitsSet.Equals(Vector256.BitwiseAnd(Vector256.Equals(lower, rightValues0), Vector256.Equals(upper, rightValues1))) instead of the if condition, it looks like extra instructions are generated:

if condition:

M01_L00:
       vmovups   ymm1,[rax]
       vptest    ymm1,ymm0
       jne       short M01_L03
       vmovaps   ymm2,ymm1
       vpmovzxbw ymm2,xmm2
       vextracti128 xmm1,ymm1,1
       vpmovzxbw ymm1,xmm1
       vpcmpeqw  ymm2,ymm2,[rdx]
       vpcmpeqw  ymm1,ymm1,[rdx+20]
       vpand     ymm1,ymm2,ymm1
       vpcmpeqd  ymm2,ymm2,ymm2
       vpcmpeqw  ymm1,ymm1,ymm2
       vpmovmskb r10d,ymm1
       cmp       r10d,0FFFFFFFF
       jne       short M01_L03
       add       rdx,40
       add       rax,20
       cmp       rdx,r9
       jbe       short M01_L00
       test      r8b,0F
       jne       short M01_L01
       mov       eax,1
       vzeroupper
       ret

no if, return directly

M01_L00:
       vmovups   ymm1,[rax]
       vptest    ymm1,ymm0
       jne       short M01_L02
       vmovaps   ymm2,ymm1
       vpmovzxbw ymm2,xmm2
       vextracti128 xmm1,ymm1,1
       vpmovzxbw ymm1,xmm1
       vpcmpeqw  ymm2,ymm2,[rdx]
       vpcmpeqw  ymm1,ymm1,[rdx+20]
       vpand     ymm1,ymm2,ymm1
       vpcmpeqd  ymm2,ymm2,ymm2
       vpcmpeqw  ymm1,ymm1,ymm2
       vpmovmskb r10d,ymm1
       cmp       r10d,0FFFFFFFF
       sete      r10b          <-- extra
       movzx     r10d,r10b     <-- extra
       test      r10d,r10d     <-- extra
       je        short M01_L02
       add       rdx,40
       add       rax,20
       cmp       rdx,r9
       jbe       short M01_L00
       test      r8b,1F
       jne       short M01_L01
       mov       eax,1
       vzeroupper
       ret

@ghost assigned BrennanConroy Jun 5, 2023
@dotnet-issue-labeler bot added the needs-area-label label Jun 5, 2023
return false;
}

(Vector256<ushort> lower, Vector256<ushort> upper) = Vector256.Widen(leftNotWidened);
Member

This is me adding value:

Suggested change
(Vector256<ushort> lower, Vector256<ushort> upper) = Vector256.Widen(leftNotWidened);
var (lower, upper) = Vector256.Widen(leftNotWidened);

Member

This is me adding value:

Please stop adding value. :-P

Member

The BCL has a rule that var can only be used when the type is apparent.

It's a battle that I lost before ever joining the team 😅


Vector256<TRight> leftValues;
Vector256<TRight> rightValues;
ref TRight oneVectorAwayFromRightEnd = ref Unsafe.Add(ref currentRightSearchSpace, length - (uint)Vector256<TLeft>.Count);
Member

I'm not understanding why this is valid. We're subtracting from the "right" search space the number of "left" elements in a vector?

Member Author

It works because TLeft and TRight are either the same type, or we are in the widen case, where Vector<TLeft>.Count is twice Vector<TRight>.Count; the widen code advances by twice Vector<TRight>.Count, which equals one Vector<TLeft>.Count.

But it is written in a confusing way. The whole TLoader abstraction helps with code sharing but makes this part kind of yucky. Maybe if the compare method advanced the pointers it would be better?
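As a sanity check (not runtime code), the element-count arithmetic from the explanation above can be modeled in Python; the 32-byte Vector256 width is the only assumption:

```python
VECTOR256_BYTES = 32  # Vector256<T> is always 32 bytes wide

def count(elem_size: int) -> int:
    # Vector256<T>.Count == vector width / sizeof(T)
    return VECTOR256_BYTES // elem_size

# Widen case: TLeft = byte (sizeof 1), TRight = char (sizeof 2).
# One widening step consumes Vector256<byte>.Count bytes on the left and
# produces two Vector256<ushort> halves, advancing the right side by
# 2 * Vector256<ushort>.Count elements -- exactly Vector256<byte>.Count.
assert 2 * count(2) == count(1)

# Same-type case: TLeft == TRight, so the counts trivially match and the
# same end-of-buffer bound works for both sides.
assert count(2) == count(2)
```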

Member

Thanks. At a minimum a comment explaining would be helpful.

Comment on lines 427 to 428
if (!Vector256<ushort>.AllBitsSet.Equals(
Vector256.BitwiseAnd(Vector256.Equals(lower, rightValues0), Vector256.Equals(upper, rightValues1))))
Member

We should prefer the operators where possible:

Suggested change
if (!Vector256<ushort>.AllBitsSet.Equals(
Vector256.BitwiseAnd(Vector256.Equals(lower, rightValues0), Vector256.Equals(upper, rightValues1))))
if (Vector256<ushort>.AllBitsSet != (Vector256.Equals(lower, rightValues0) & Vector256.Equals(upper, rightValues1)))

bool Equals() in particular is the same as == for integral types, but it isn't directly recognized as intrinsic and is an instance method, so the JIT has to inline it and elide the reference taken for this.

Member

Might be nice to just make this return Vector256<ushort>.AllBitsSet == (Vector256.Equals(lower, rightValues0) & Vector256.Equals(upper, rightValues1)) instead as well

Member Author

Look at the "Side-note on weird codegen observed" section at the bottom of the PR description.

Member

I think it should be ==. If you noticed suboptimal codegen, we'd better look at it and fix it in the JIT instead of complicating the C# code just to squeeze everything out here and now.

Member

It should be as simple as:

return ((lower ^ rightValues0) | (upper ^ rightValues1)) == Vector256.Zero;
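The XOR/OR identity behind this suggestion can be checked at scalar level. A Python model, with integers standing in for vector lanes:

```python
import random

def equal_via_xor_or(lower: int, r0: int, upper: int, r1: int) -> bool:
    # (lower ^ r0) | (upper ^ r1) is zero iff both pairs are bitwise equal,
    # which is what the single == Vector256.Zero comparison exploits.
    return ((lower ^ r0) | (upper ^ r1)) == 0

rng = random.Random(42)
for _ in range(1000):
    a = rng.getrandbits(64)
    b = a if rng.random() < 0.5 else rng.getrandbits(64)
    c = rng.getrandbits(64)
    d = c if rng.random() < 0.5 else rng.getrandbits(64)
    assert equal_via_xor_or(a, b, c, d) == (a == b and c == d)
```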

Member

Yep, sometimes it can be fixed with return cond ? true : false;

Contributor

How about:

Vector256<ushort> equals = Vector256.Equals(lower, rightValues0) & Vector256.Equals(upper, rightValues1);
if (equals.AsByte().ExtractMostSignificantBits() != 0xffffffffu)
    return false;
return true;
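For illustration only, the ExtractMostSignificantBits check in this suggestion can be modeled on byte lanes in Python (a scalar sketch, not the SIMD implementation): the mask is all ones exactly when every comparison lane matched.

```python
def extract_most_significant_bits(lanes):
    # Model of Vector256<byte>.ExtractMostSignificantBits over 32 byte
    # lanes: one bit per lane, set when the lane's high bit is set.
    mask = 0
    for i, lane in enumerate(lanes):
        mask |= ((lane >> 7) & 1) << i
    return mask

all_equal = [0xFF] * 32           # vpcmpeqb output when every lane matched
one_mismatch = [0xFF] * 31 + [0]  # one lane failed the comparison

assert extract_most_significant_bits(all_equal) == 0xFFFFFFFF
assert extract_most_significant_bits(one_mismatch) != 0xFFFFFFFF
```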

Member

How about:

Not needed; ((lower ^ rightValues0) | (upper ^ rightValues1)) == Vector256.Zero is just the canonical way to do that. Also, ExtractMostSignificantBits is very expensive on ARM.

Contributor

@EgorBo Agreed, VPMOVMSKB has no equivalent on ARM64 (see #87141 (comment)), but this method is guarded by [CompExactlyDependsOn(typeof(Avx))].

Member

@EgorBo Agreed, VPMOVMSKB has no equivalent on ARM64 (see #87141 (comment)), but this method is guarded by [CompExactlyDependsOn(typeof(Avx))].

Still, == Vector.Zero is expected to be lowered to MoveMask with SSE2 or to vptest with SSE4.1/AVX, so there's no reason to do it by hand.

@danmoseley added the area-System.Text.Encoding label and removed the needs-area-label label Jun 5, 2023
@ghost commented Jun 5, 2023

Tagging subscribers to this area: @dotnet/area-system-text-encoding
See info in area-owners.md if you want to be subscribed.

Author: BrennanConroy
Assignees: BrennanConroy
Labels:

area-System.Text.Encoding

Milestone: -

@xtqqczze (Contributor) commented Jun 8, 2023

@adamsitnik (Member)

cc @gfoidl who provided previous optimizations in #85926

rightValues = Vector256.LoadUnsafe(ref currentRightSearchSpace);

if (leftValues != rightValues || !AllCharsInVectorAreAscii(leftValues | rightValues))
if (!TLoader.Compare256(ref currentLeftSearchSpace, ref currentRightSearchSpace))
Member

The TLoader now does the comparison, so the name should be adjusted to reflect that.

Comment on lines 411 to 412
(Vector128<ushort> lower, Vector128<ushort> upper) = Vector128.Widen(Vector128.LoadUnsafe(ref ptr));
return Vector256.Create(lower, upper);
Contributor

Suggested change
(Vector128<ushort> lower, Vector128<ushort> upper) = Vector128.Widen(Vector128.LoadUnsafe(ref ptr));
return Vector256.Create(lower, upper);
return Vector256.WidenLower(Vector128.LoadUnsafe(ref ptr).ToVector256Unsafe());

This results in better codegen when Avx2 is available.

@xtqqczze (Contributor) Jun 16, 2023

This suggests a set of missing System.Runtime.Intrinsics.Vector256 APIs:

public static System.Runtime.Intrinsics.Vector256<ushort> Widen (System.Runtime.Intrinsics.Vector128<byte> source);
public static System.Runtime.Intrinsics.Vector256<ushort> LoadWideningUnsafe (ref byte source);

@xtqqczze (Contributor) Jun 17, 2023

This results in better codegen when Avx2 is available.

Codegen on arm64 is pretty bad though, probably should wrap with Avx2.IsSupported.

@stephentoub (Member)

@BrennanConroy, when this lands, will that be enough to enable ASP.NET to switch away from its custom implementation?

If so, @adamsitnik, can you help ensure this lands as soon as possible?

@adamsitnik (Member) left a comment

Overall the changes LGTM, big thanks for your contribution @BrennanConroy.

I've left some comments, but they are all subjective and related only to naming. PTAL at them and either reject or apply them and mark the PR as ready for review. Then I am simply going to merge it.

@adamsitnik marked this pull request as ready for review July 6, 2023 07:56
@adamsitnik (Member) left a comment

@BrennanConroy is OOF, so I applied my suggestions and added a test for BoundedMemory.

Overall, for cases where the inputs are equal, the perf has improved and is even faster than the ASP.NET implementation that we want to remove in dotnet/aspnetcore#48368.

For cases where the inputs are not equal at the first character, the perf has regressed, but it's on par with the ASP.NET implementation. That's acceptable if we want to unblock dotnet/aspnetcore#48368.

Source code, results:

BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 11 (10.0.22621.1848)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=8.0.100-preview.4.23259.14
  [Host]     : .NET 8.0.0 (8.0.23.25905), X64 RyuJIT AVX2
  Job-SJHYUI : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-GSESEP : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2

LaunchCount=3 MemoryRandomization=True  
| Method      | Job  | Size | Equal |      Mean | Ratio |
|------------ |----- |-----:|------ |----------:|------:|
| SystemAscii | PR   |    6 | False |  1.653 ns |  1.00 |
| AspNet      | PR   |    6 | False |  3.335 ns |  2.03 |
| SystemAscii | main |    6 | False |  1.646 ns |  1.00 |
| SystemAscii | PR   |    6 | True  |  4.446 ns |  1.06 |
| AspNet      | PR   |    6 | True  |  4.180 ns |  0.99 |
| SystemAscii | main |    6 | True  |  4.202 ns |  1.00 |
| SystemAscii | PR   |   32 | False |  2.837 ns |  1.36 |
| AspNet      | PR   |   32 | False |  2.814 ns |  1.35 |
| SystemAscii | main |   32 | False |  2.080 ns |  1.00 |
| SystemAscii | PR   |   32 | True  |  2.888 ns |  0.86 |
| AspNet      | PR   |   32 | True  |  3.023 ns |  0.90 |
| SystemAscii | main |   32 | True  |  3.356 ns |  1.00 |
| SystemAscii | PR   |   64 | False |  2.821 ns |  1.36 |
| AspNet      | PR   |   64 | False |  2.796 ns |  1.35 |
| SystemAscii | main |   64 | False |  2.072 ns |  1.00 |
| SystemAscii | PR   |   64 | True  |  3.849 ns |  0.73 |
| AspNet      | PR   |   64 | True  |  4.446 ns |  0.84 |
| SystemAscii | main |   64 | True  |  5.277 ns |  1.00 |

The CI failure is unrelated (#73040).
