Add IndexOfAnyValues #78093

Merged
merged 9 commits into from
Nov 21, 2022

@MihaZupan MihaZupan commented Nov 9, 2022

Closes #68328
Depends on all IndexOfAny implementations for byte/char of 1-5 values being available, therefore depends on #78015 to make that a reality for Mono.

I also added vectorized paths for IndexOfAnyExcept, LastIndexOfAny, and LastIndexOfAnyExcept for 5 values to SpanHelpers and started using them from existing IndexOfAny(Span, Span) APIs for byte and char.
The changes in SpanHelpers are effectively a copy-paste from existing overloads.

The currently employed categories of specializations for IndexOfAnyValues are:

  • 1-5 values
  • Values within a single range
  • Values in the [0, 127] (ASCII) range
  • Any set of values for bytes (also vectorized)
  • Values in the [0, 255] (Latin1) range for chars (not vectorized, but using efficient checks)
  • Any set of chars via the ProbabilisticMap
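The category list above can be read as a dispatch on the shape of the value set. The following is a minimal sketch of such a classification for chars. The `Strategy` enum, the `StrategySketch.Classify` helper, and the order of the checks are assumptions made for illustration (the order is loosely motivated by the benchmark numbers in this description), not the logic of the actual `Create` method.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical names: this enum and helper are illustrative only.
enum Strategy { Empty, UpTo5Values, SingleRange, Ascii, Latin1, Probabilistic }

static class StrategySketch
{
    public static Strategy Classify(ReadOnlySpan<char> values)
    {
        if (values.IsEmpty) return Strategy.Empty;

        // Collect distinct values and track the smallest and largest one.
        var distinct = new HashSet<char>();
        char min = char.MaxValue, max = char.MinValue;
        foreach (char c in values)
        {
            distinct.Add(c);
            if (c < min) min = c;
            if (c > max) max = c;
        }

        // A single value maps to a plain IndexOf-style search.
        if (distinct.Count == 1) return Strategy.UpTo5Values;

        // The set is one contiguous range when it has no gaps between min and max.
        if (max - min + 1 == distinct.Count) return Strategy.SingleRange;

        if (distinct.Count <= 5) return Strategy.UpTo5Values;
        if (max < 128) return Strategy.Ascii;   // [0, 127]
        if (max < 256) return Strategy.Latin1;  // [0, 255]
        return Strategy.Probabilistic;          // any other set of chars
    }
}
```

For example, under these assumptions `"0123456789"` classifies as a single range, while a Latin1 string with one extra `\u1000` appended falls through to the probabilistic category.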

Efficiency assumptions for different search primitives are based on numbers collected at 0f3a88b on an Intel i9 10900X CPU:

IndexOf, IndexOfAny, IndexOfAnyInRange and IndexOfAnyAscii numbers

char
Method                  Length  Mean      Error      StdDev
IndexOfAny128           100000  6.218 us  0.0121 us  0.0174 us
IndexOf                 100000  4.929 us  0.0083 us  0.0119 us
IndexOfAny2Values       100000  5.764 us  0.0092 us  0.0138 us
IndexOfAny3Values       100000  6.050 us  0.0193 us  0.0283 us
IndexOfAny4Values       100000  6.868 us  0.0184 us  0.0270 us
IndexOfAny5Values       100000  7.805 us  0.0228 us  0.0335 us
IndexOfAnyInRange_Char  100000  5.194 us  0.0106 us  0.0159 us

byte
Method                  Length  Mean      Error      StdDev
IndexOfAny128           100000  4.138 us  0.0131 us  0.0192 us
IndexOfAny256           100000  7.578 us  0.0683 us  0.1023 us
IndexOf                 100000  2.468 us  0.0050 us  0.0070 us
IndexOfAny2Values       100000  2.835 us  0.0054 us  0.0078 us
IndexOfAny3Values       100000  3.033 us  0.0233 us  0.0342 us
IndexOfAny4Values       100000  3.464 us  0.0075 us  0.0110 us
IndexOfAny5Values       100000  3.910 us  0.0098 us  0.0147 us
IndexOfAnyInRange_Byte  100000  2.594 us  0.0077 us  0.0113 us

Benchmarks

Benchmark source

ASCII

This case is already vectorized for chars (#76740), but now the init cost is removed as well.
For bytes, it's the difference between an O(n * m) loop and a vectorized O(n).

Method Length Mean Error
IndexOfAnyValues_Char 1 2.412 ns 0.0161 ns
CurrentChar 1 5.981 ns 0.0283 ns
IndexOfAnyValues_Char 7 7.077 ns 0.0288 ns
CurrentChar 7 31.202 ns 0.0640 ns
IndexOfAnyValues_Char 8 2.605 ns 0.0038 ns
CurrentChar 8 29.698 ns 0.3410 ns
IndexOfAnyValues_Char 16 2.608 ns 0.0074 ns
CurrentChar 16 64.374 ns 2.3112 ns
IndexOfAnyValues_Char 32 3.229 ns 0.0051 ns
CurrentChar 32 50.769 ns 0.1097 ns
IndexOfAnyValues_Char 100000 5,829.302 ns 36.4721 ns
CurrentChar 100000 5,848.152 ns 8.7200 ns
IndexOfAnyValues_Byte 1 1.749 ns 0.0165 ns
CurrentByte 1 23.468 ns 0.1097 ns
IndexOfAnyValues_Byte 7 5.331 ns 0.0169 ns
CurrentByte 7 155.727 ns 0.9470 ns
IndexOfAnyValues_Byte 8 2.590 ns 0.0530 ns
CurrentByte 8 177.266 ns 1.8686 ns
IndexOfAnyValues_Byte 16 2.519 ns 0.0047 ns
CurrentByte 16 343.562 ns 0.2631 ns
IndexOfAnyValues_Byte 32 3.067 ns 0.0037 ns
CurrentByte 32 684.387 ns 2.0701 ns
IndexOfAnyValues_Byte 100000 4,713.635 ns 4.6342 ns
CurrentByte 100000 2,117,894.760 ns 1,282.3044 ns

1-5 values

1-5 values are pretty much the same as calling IndexOfAny("aeiou") if the value is constant at the call site.
If the value is not constant, IndexOfAnyValues is about 1 ns cheaper as we don't have to go through the length check switch.

Single range

The single range case is pretty much identical to calling IndexOfAnyInRange directly.

[0, 255] values

This test is using "abcÀÄÈÌÐÔØÜàäèì" as the value (Latin1 for chars).
For chars, this is the difference between ProbabilisticMap and BitVector256.
For bytes, this is the difference between a simple O(n * m) loop and a vectorized O(n).

Method Length Mean Error
IndexOfAnyValues_Char 1 2.016 ns 0.0077 ns
CurrentChar 1 5.174 ns 0.1958 ns
IndexOfAnyValues_Char 7 5.872 ns 0.0270 ns
CurrentChar 7 20.783 ns 0.0390 ns
IndexOfAnyValues_Char 8 6.491 ns 0.0057 ns
CurrentChar 8 31.064 ns 0.8008 ns
IndexOfAnyValues_Char 16 11.795 ns 0.0666 ns
CurrentChar 16 38.854 ns 0.1773 ns
IndexOfAnyValues_Char 32 23.249 ns 0.0497 ns
CurrentChar 32 56.224 ns 0.3486 ns
IndexOfAnyValues_Char 100000 65,447.107 ns 662.7650 ns
CurrentChar 100000 67,424.026 ns 88.6333 ns
IndexOfAnyValues_Byte 1 1.834 ns 0.0106 ns
CurrentByte 1 5.760 ns 0.0078 ns
IndexOfAnyValues_Byte 7 4.951 ns 0.0051 ns
CurrentByte 7 32.489 ns 0.1421 ns
IndexOfAnyValues_Byte 8 4.163 ns 0.0044 ns
CurrentByte 8 46.305 ns 0.2549 ns
IndexOfAnyValues_Byte 16 4.157 ns 0.0029 ns
CurrentByte 16 81.151 ns 0.2816 ns
IndexOfAnyValues_Byte 32 5.328 ns 0.0040 ns
CurrentByte 32 149.091 ns 0.2623 ns
IndexOfAnyValues_Byte 100000 7,589.072 ns 9.0928 ns
CurrentByte 100000 441,889.687 ns 2,916.3709 ns
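
To illustrate why a 256-bit lookup is cheap per character, here is a minimal sketch of a bitmap-based set for Latin1 chars. The `Latin1SetSketch` type and its members are hypothetical stand-ins written for this example. They are not the PR's `BitVector256`, whose actual layout may differ.

```csharp
using System;

// Hypothetical stand-in for a 256-bit set; not the PR's BitVector256.
sealed class Latin1SetSketch
{
    // 8 * 32 bits = 256 bits, one bit per possible Latin1 value.
    private readonly uint[] _bits = new uint[8];

    public Latin1SetSketch(ReadOnlySpan<char> values)
    {
        foreach (char c in values)
        {
            if (c > 255) throw new ArgumentException("Only [0, 255] values are supported.", nameof(values));
            _bits[c >> 5] |= 1u << (c & 31);
        }
    }

    // One bounds check, one shift and one mask per character, independent of the set size.
    public bool Contains(char c) => c < 256 && (_bits[c >> 5] & (1u << (c & 31))) != 0;

    public int IndexOfAny(ReadOnlySpan<char> span)
    {
        for (int i = 0; i < span.Length; i++)
        {
            if (Contains(span[i])) return i;
        }
        return -1;
    }
}
```

The set is built once and each lookup costs the same regardless of how many values it contains, which is the property the non-vectorized Latin1 path relies on.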

ProbabilisticMap for chars

Both will use a simple loop for short values and a ProbabilisticMap for longer ones.
IndexOfAnyValues can avoid the init overhead for constructing the ProbabilisticMap.

Short values list ("ažćčš" + "\u1000").

Method Length Mean Error
IndexOfAnyValues_Char 1 2.081 ns 0.0303 ns
CurrentChar 1 6.142 ns 0.0804 ns
IndexOfAnyValues_Char 7 7.091 ns 0.0792 ns
CurrentChar 7 26.791 ns 0.1502 ns
IndexOfAnyValues_Char 8 7.238 ns 0.0562 ns
CurrentChar 8 22.512 ns 0.7063 ns
IndexOfAnyValues_Char 16 16.136 ns 0.0569 ns
CurrentChar 16 31.549 ns 0.0510 ns
IndexOfAnyValues_Char 32 43.433 ns 2.7478 ns
CurrentChar 32 56.958 ns 0.2508 ns
IndexOfAnyValues_Char 100000 89,132.803 ns 80.5150 ns
CurrentChar 100000 90,523.407 ns 145.6607 ns

Long values list ("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" + "\u1000").

Method Length Mean Error
IndexOfAnyValues_Char 1 2.091 ns 0.0040 ns
CurrentChar 1 4.861 ns 0.0095 ns
IndexOfAnyValues_Char 7 6.986 ns 0.0286 ns
CurrentChar 7 20.385 ns 0.1865 ns
IndexOfAnyValues_Char 8 7.233 ns 0.0341 ns
CurrentChar 8 20.572 ns 0.0368 ns
IndexOfAnyValues_Char 16 12.478 ns 0.0383 ns
CurrentChar 16 67.647 ns 0.0471 ns
IndexOfAnyValues_Char 32 30.781 ns 0.0629 ns
CurrentChar 32 84.401 ns 0.8956 ns
IndexOfAnyValues_Char 100000 68,211.509 ns 47.4626 ns
CurrentChar 100000 67,217.969 ns 63.2336 ns
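
The init overhead mentioned above comes from building a filter before the search can start. The following is a simplified sketch of the general idea behind a probabilistic pre-filter, assuming a 256-bit filter keyed on the low and high byte of each char followed by an exact confirmation. The `ProbabilisticSetSketch` type is hypothetical and is not the runtime's `ProbabilisticMap`.

```csharp
using System;

// Hypothetical, simplified pre-filter; not the runtime's ProbabilisticMap.
sealed class ProbabilisticSetSketch
{
    private readonly uint[] _filter = new uint[8]; // 256-bit filter
    private readonly char[] _values;

    // This constructor is the one-time cost that a cached instance avoids paying per call.
    public ProbabilisticSetSketch(ReadOnlySpan<char> values)
    {
        _values = values.ToArray();
        foreach (char c in values)
        {
            Set((byte)(c & 0xFF)); // low byte
            Set((byte)(c >> 8));   // high byte
        }
    }

    private void Set(byte b) => _filter[b >> 5] |= 1u << (b & 31);
    private bool Test(byte b) => (_filter[b >> 5] & (1u << (b & 31))) != 0;

    public int IndexOfAny(ReadOnlySpan<char> span)
    {
        for (int i = 0; i < span.Length; i++)
        {
            char c = span[i];
            // The filter can report false positives, so a hit is confirmed with an exact scan.
            if (Test((byte)(c & 0xFF)) && Test((byte)(c >> 8)) && Array.IndexOf(_values, c) >= 0)
            {
                return i;
            }
        }
        return -1;
    }
}
```

Most non-matching characters are rejected by the two bit tests alone, so the exact scan over the values only runs for likely matches.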

HttpRuleParser.IsToken

This is an example of an implementation we will replace with IndexOfAnyValues.

Method Length Mean Error
IndexOfAnyValues_Char 1 2.683 ns 0.0108 ns
CurrentChar 1 1.655 ns 0.0046 ns
IndexOfAnyValues_Char 7 7.184 ns 0.0185 ns
CurrentChar 7 4.989 ns 0.1115 ns
IndexOfAnyValues_Char 8 2.402 ns 0.0025 ns
CurrentChar 8 4.787 ns 0.0155 ns
IndexOfAnyValues_Char 16 2.409 ns 0.0052 ns
CurrentChar 16 9.100 ns 0.0119 ns
IndexOfAnyValues_Char 32 3.306 ns 0.0078 ns
CurrentChar 32 19.486 ns 0.5098 ns
IndexOfAnyValues_Char 100000 6,004.741 ns 4.4187 ns
CurrentChar 100000 57,569.776 ns 49.4496 ns
IndexOfAnyValues_Byte 1 1.749 ns 0.0058 ns
CurrentByte 1 1.677 ns 0.0243 ns
IndexOfAnyValues_Byte 7 4.835 ns 0.0779 ns
CurrentByte 7 5.112 ns 0.0038 ns
IndexOfAnyValues_Byte 8 2.521 ns 0.0046 ns
CurrentByte 8 5.274 ns 0.0052 ns
IndexOfAnyValues_Byte 16 2.519 ns 0.0021 ns
CurrentByte 16 9.113 ns 0.0166 ns
IndexOfAnyValues_Byte 32 3.141 ns 0.0023 ns
CurrentByte 32 19.417 ns 1.0021 ns
IndexOfAnyValues_Byte 100000 4,762.442 ns 21.2032 ns
CurrentByte 100000 57,967.138 ns 97.4594 ns
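
As a rough illustration of what such a replacement could look like, here is a sketch of a token check built on a cached value set. It assumes a .NET 8 or later SDK, where to my knowledge this API shipped under the name `SearchValues<T>` rather than `IndexOfAnyValues<T>`. The `TokenSketch` helper, the token character set (taken from the RFC 7230 `tchar` rule), and the treatment of empty input are assumptions made for this example, not the actual `HttpRuleParser` code.

```csharp
using System;
using System.Buffers;

// Hypothetical helper; not the actual HttpRuleParser implementation.
static class TokenSketch
{
    // Built once and reused, so the per-call setup cost is avoided.
    private static readonly SearchValues<char> s_tokenChars = SearchValues.Create(
        "!#$%&'*+-.^_`|~0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ");

    // A token is non-empty and contains no character outside the allowed set.
    public static bool IsToken(ReadOnlySpan<char> input) =>
        !input.IsEmpty && input.IndexOfAnyExcept(s_tokenChars) < 0;
}
```

The whole character set is ASCII, so a search like this would be expected to land in the ASCII category described earlier.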

@dotnet-issue-labeler

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, to please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

@MihaZupan
Member Author

cc: @gfoidl

@ghost ghost assigned MihaZupan Nov 9, 2022
@ghost

ghost commented Nov 9, 2022

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

Issue Details

Implements #68328
Depends on all IndexOfAny implementations for byte/char of 1-5 values being available, therefore depends on #78015 to make that a reality for Mono.

I also added vectorized paths for IndexOfAnyExcept, LastIndexOfAny, and LastIndexOfAnyExcept for 5 values to SpanHelpers and started using them from existing IndexOfAny(Span, Span) APIs for byte and char.
The changes in SpanHelpers are effectively a copy-paste from existing overloads.

The currently employed categories of specializations for IndexOfAnyValues are:

  • 1-5 values
  • Values within a single range
  • Values in the [0, 127] (ASCII) range
    • Additional special handling for AsciiLetter, AsciiLetterOrDigit, AsciiHexDigit, AsciiHexDigitLower, AsciiHexDigitUpper.
    • I have to investigate if these actually make a difference. If they do, we could easily add others (e.g. for Base64) later on.
  • Any set of values for bytes (also vectorized)
  • Values in the [0, 255] (Latin1) range for chars (not vectorized, but using efficient checks)
  • Any set of chars via the ProbabilisticMap

ToDo: benchmarks

Author: MihaZupan
Assignees: -
Labels:

area-System.Memory

Milestone: 8.0.0

@MihaZupan MihaZupan added the blocked Issue/PR is blocked on something - see comments label Nov 9, 2022
Member

@gfoidl gfoidl left a comment

What I've seen so far looks good 👍🏻 (need to have another look in the next days)

@MihaZupan
Member Author

> need to have another look in the next days

Thanks for the reviews, they're much appreciated.

I'll get back to this PR in about a week. Hopefully, #78015 is resolved by then.

@stephentoub stephentoub added and removed the blocked (Issue/PR is blocked on something - see comments) label Nov 17, 2022
@MihaZupan MihaZupan marked this pull request as ready for review November 20, 2022 22:37
@MihaZupan
Member Author

MihaZupan commented Nov 20, 2022

I've updated the PR and moved it out of draft.
I added a few more tests and updated the initial post with all the benchmark numbers.

I removed the IAsciiSet special cases as I didn't see a big improvement with them for the amount of code (and possible runtime duplication) needed for a few edge cases.

@MihaZupan
Member Author

I've added a temporary workaround for the lack of Mono APIs while waiting on #78015: 1e7684f.

@MihaZupan MihaZupan removed the blocked Issue/PR is blocked on something - see comments label Nov 20, 2022
@gfoidl
Member

gfoidl commented Nov 21, 2022

Had another look, no comments added, so LGTM.
It's a nice layering of the methods.

}

return new IndexOfAnyCharValuesProbabilistic(values);
}
Member

I'm still a little concerned about the size impact this method is going to have. Any use of IndexOfAnyValues.Create is going to root all possible implementations (IndexOfEmptyValues, IndexOfAny1Value, IndexOfAny2Values, IndexOfAny3Values, IndexOfAny4Values, IndexOfAny5Values, IndexOfAnyLatin1CharValues, IndexOfAnyCharValuesProbabilistic, and IndexOfAnyValuesInRange), regardless of which is actually used. Basically we're adding a choke point.

I don't have a better answer, unless we want to limit what this API produces, such that we don't special-case APIs that already have a direct public entrypoint (IndexOf0/1/2/3, IndexOfRange).

Let's go with it for now, but keep an eye on it. It'd also be helpful in this regard for us to either in this PR or immediately after follow it up with using these APIs everywhere they're applicable and see what kind of impact it actually has on our size benchmarks, and then what we can do about it. For example, maybe there are ways to share most of the code associated with the vectorization of the individual algorithms, such that each doesn't bring in nearly as much as it does today (Adam and I had spoken about an approach where we'd parameterize the algorithms with a generic struct that provided the setup and comparisons as methods that the driver could then call appropriately as part of loops, unrolled call sites, etc.) For use within corelib, we might also want to make some of these helpers internal and allow those uses to bypass the public Create in order to directly instantiate the needed type. If things still end up being bad, we could consider replacing the general Create with more specialized ones focused on known characteristics of the data. Worst case, we could also employ some additional linker switches to remove various code paths if we want to trade off optimal speed for size.

All that said, I still really like the simplicity of the design we currently have, and that it affords us the ability to pick the best implementation given the supplied data.

Member

> I'm still a little concerned about the size impact this method is going to have

This PR adds 50kB to System.Private.CoreLib.dll with R2R. It is not the end of the world given how much we add to the product in each release.

> we'd parameterize the algorithms with a generic struct

Note that this only helps with IL size. It does not help with native code size. Also, the generic structs cost some startup time and memory at runtime.

We may want to look into whether the factory can be interpreted at compile time by the native aot compiler. cc @MichalStrehovsky

Member Author

@MihaZupan MihaZupan Nov 21, 2022

> Let's go with it for now, but keep an eye on it. It'd also be helpful in this regard for us to either in this PR or immediately after follow it up with using these APIs everywhere they're applicable and see what kind of impact it actually has on our size benchmarks

👍 already working on that.
FWIW, in quite a few places so far I've been able to delete substantial chunks of code with this API, so it'd be pretty cool if this was actually a net size improvement in the long run.

Over a bunch of places across runtime where I've used this API locally, I've never used it for the "(IndexOf0/1/2/3, IndexOfRange)" set you called out. Likewise, for cases with 4/5 values, I've never used it if I knew that it would just defer to the existing 4/5 value implementation that could be reached via IndexOfAny(const).
Ignoring cases where the underlying hardware doesn't support these algorithms, we're really just using Ascii in the vast majority of cases from within runtime. Regex can also make similar decisions and avoid using this API for cases that wouldn't benefit.

That said, it's still nice that this API gives you an optimal implementation even if you aren't as concerned about startup/size costs and just want to use the same API for everything without requiring you to be aware of internal implementation details.

> Adam and I had spoken about an approach where we'd parameterize the algorithms with a generic struct that provided the setup and comparisons as methods that the driver could then call appropriately as part of loops, unrolled call sites, etc.

I'm curious about what this would look like.

Member

@stephentoub stephentoub Nov 21, 2022

> This PR adds 50kB to System.Private.CoreLib.dll with R2R. It is not the end of the world given how much we add to the product in each release.

Yeah, my concern isn't the all-up size of corelib, rather the impact on a small trimmed app. But your lack of concern lowers my concern :)

> We may want to look into whether the factory can be interpreted at compile time by the native aot compiler

👍

Member

@stephentoub stephentoub Nov 21, 2022

> so it'd be pretty cool if this was actually a net size improvement.

It would be.

> I've never used it for the "(IndexOf0/1/2/3, IndexOfRange)" set you called out

Right. Which begs the question of whether it's actually worth including those, since whether we use them or not, at present all that code is going to be kept with the current trimming.

> Regex can also make similar decisions and avoid using this API for cases that wouldn't benefit.

Yes, my intention is that the source generator and compiler only use this for cases where there isn't currently a better direct API for it. In large part that's to help with readability, but it'll also help reduce startup overheads.

> I'm curious about what this would look like.

The rough strawman, which we never actually tried, was something along the lines of (very rough pseudo code)

internal interface IIndexOfComparer<T>
{
    bool IsMatch(T item);
    bool IsMatch(Vector128<T> items);
    bool IsMatch(Vector256<T> items);
}

then with a shared driver like:

internal static int IndexOfCore<T, TComparer>(ReadOnlySpan<T> span, TComparer comparer) where TComparer : IIndexOfComparer<T>
{
    if (!Vector128.IsHardwareAccelerated || span.Length < Vector128<T>.Count)
    {
        // TODO: Unroll for short spans
        for (int i = 0; i < span.Length; i++)
            if (comparer.IsMatch(span[i]))
                return i;
    }
    else if (!Vector256.IsHardwareAccelerated || span.Length < Vector256<T>.Count)
    {
        ...
        if (comparer.IsMatch(currentVector))
            return FindIndex(currentVector);
        ...
    }
    else
    {
        ... // same for Vector256
    }
    return -1;
}

and then use like:

public static int IndexOfAny(ReadOnlySpan<char> span, char value0, char value1) =>
    IndexOfCore(span, new IndexOfAny2<char>(value0, value1));

private readonly struct IndexOfAny2<T> : IIndexOfComparer<T>
{
    private readonly T _value1, _value2;
    private readonly Vector128<T> _vector128_1, _vector128_2;
    private readonly Vector256<T> _vector256_1, _vector256_2;

    public IndexOfAny2(T value1, T value2)
    {
        _value1 = value1;
        _value2 = value2;
        _vector128_1 = Vector128.Create(value1);
        _vector128_2 = Vector128.Create(value2);
        ...
    }

    // Still pseudocode: "==" on vectors stands for "any element is equal", not whole-vector equality.
    public bool IsMatch(T value) => value == _value1 || value == _value2;
    public bool IsMatch(Vector128<T> values) => values == _vector128_1 || values == _vector128_2;
    ...
}

etc. It'd end up looking similar in structure to what you currently have in this PR, actually, just at a lower level.
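
To make the strawman above concrete, here is a minimal runnable sketch of the same pattern, narrowed to bytes and `Vector128` only. All names (`IByteMatcher`, `AnyOf2`, `SearchSketch`) are hypothetical. One deliberate difference from the pseudocode is an assumption of this sketch: the vector method returns a per-element mask instead of a `bool`, so the shared driver can locate the first matching element itself.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

// Hypothetical comparer contract: one scalar check and one vector check.
interface IByteMatcher
{
    bool IsMatch(byte value);
    Vector128<byte> Match(Vector128<byte> values); // all-ones in each matching lane
}

readonly struct AnyOf2 : IByteMatcher
{
    private readonly byte _a, _b;
    private readonly Vector128<byte> _va, _vb;

    public AnyOf2(byte a, byte b)
    {
        _a = a;
        _b = b;
        _va = Vector128.Create(a);
        _vb = Vector128.Create(b);
    }

    public bool IsMatch(byte value) => value == _a || value == _b;

    public Vector128<byte> Match(Vector128<byte> values) =>
        Vector128.Equals(values, _va) | Vector128.Equals(values, _vb);
}

static class SearchSketch
{
    // Shared driver: the struct type argument lets the JIT specialize it per matcher.
    public static int IndexOfCore<TMatcher>(ReadOnlySpan<byte> span, TMatcher matcher)
        where TMatcher : struct, IByteMatcher
    {
        int i = 0;
        if (Vector128.IsHardwareAccelerated)
        {
            ref byte start = ref MemoryMarshal.GetReference(span);
            for (; i <= span.Length - Vector128<byte>.Count; i += Vector128<byte>.Count)
            {
                Vector128<byte> mask = matcher.Match(Vector128.LoadUnsafe(ref start, (nuint)i));
                uint bits = mask.ExtractMostSignificantBits();
                if (bits != 0) return i + BitOperations.TrailingZeroCount(bits);
            }
        }

        // Scalar tail, and the only path when vectors are not accelerated.
        for (; i < span.Length; i++)
        {
            if (matcher.IsMatch(span[i])) return i;
        }
        return -1;
    }
}
```

A call such as `SearchSketch.IndexOfCore(data, new AnyOf2((byte)'x', (byte)'z'))` reuses the single driver, and adding another search shape only requires another small matcher struct.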

Member

> We may want to look into whether the factory can be interpreted at compile time by the native aot compiler. cc @MichalStrehovsky

You mean to interpret the Create method this comment is on ahead of time depending on the callsite?

Member

IndexOfAnyValues.Create is expected to be often used in static constructors with constant input. For example, like this: https://github.com/dotnet/runtime/pull/78666/files#diff-2599fdb4dd17bc235b019eb03aed3a26260765050d1e48419bc3e44319ecb147R31-R32

@MihaZupan
Member Author

Failures are #78584, #69101

@MihaZupan MihaZupan merged commit 8171bd0 into dotnet:main Nov 21, 2022
@stephentoub
Member

Awesome.

Labels
area-System.Memory blog-candidate Completed PRs that are candidate topics for blog post coverage new-api-needs-documentation
Development

Successfully merging this pull request may close these issues.

Vectorize IndexOfAny on more than 5 chars
7 participants