-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vectorize IndexOfAny on more than 5 chars #68328
Comments
Tagging subscribers to this area: @dotnet/area-system-memory Issue DetailsEarlier in the .NET 7 release, we combined the best of string.IndexOfAny and MemoryExtensions.IndexOfAny to provide a single implementation used by both. As part of this, whereas <= 5 characters are vectorized, > 5 characters use a "probabilistic map": runtime/src/libraries/System.Private.CoreLib/src/System/MemoryExtensions.cs Lines 1140 to 1167 in d58efd1
essentially a Bloom filter to quickly screen out characters that are definitely not in the set so that the more expensive check need only be done if it's likely the character is in the set. While faster than comparing every input character against every character in the set, we still have to walk the input character by character. @MihaZupan's xoofx/markdig@da756f4 is an implementation of http://0x80.pl/articles/simd-byte-lookup.html#universal-algorithm, and highlights that it's possible to vectorize this operation instead. It would require startup overhead of determining whether all of the set characters are ASCII and generating a bitmap from them, but we're already doing such a loop in order to build the probabilistic map: runtime/src/libraries/System.Private.CoreLib/src/System/ProbabilisticMap.cs Lines 31 to 67 in d58efd1
We could simply augment that to also build up the ASCII bitmap, and if after initialization determine that every character was indeed ASCII, take the vectorized path of using that ASCII bitmap, otherwise fall back to using the probabilistic map as we do today.
|
Instead of or in addition to this, we might also consider a new overload: IndexOfAny(ReadOnlySpan<char>, UInt128) (or potentially a |
I’d like to take a look, but if someone can do this sooner, just let me know. It looks like the algorithm cannot be immediately transliterated into Vector128 because (at least) it relies on Ssse3.Shuffle and the generic shuffle on Vector128 doesn’t have the same behavior. Therefore we would rely on software fallback for Arm64 and potentially improve later. |
@stephentoub Is the constant In other words, in any case, if the length is > 5, is it guaranteed that the performance will be better than the probablistic map? I was thinking in general about these constants that we have everywhere. Especially that we somehow evaluate them dynamically in the startup phase of the JIT, etc., because certainly in many cases parameters like microarchitecture or the use of NativeAOT or ReadyToRun make a crucial difference (i.e. a (lightweight): Runtime Benchmarking). |
It's not just a constant. If you look at the implementation, you'll see there's work that has to be done specific to each additional character, each one getting its own vector that needs to be factored in. It's not like there's some number to just change and magically higher counts of chars are vectorized. |
The only problem I see with such an API is that the pre-computed bitmap used by the vectorized algorithm isn't the ASCII bitmap itself (described a bit under the universal algorithm description).
The 0th (highest) bit is assumed to be 0 (ASCII), so it can overlap with the 7th bit. I don't know if it's possible to quickly turn an input bitmap representing the set of characters into this modified one. As Dan pointed out, the above approach depends on the behavior of Ssse3's Shuffle. I'm not familiar with ARM instructions, but if there's no "4 bit => 4 bit lookup" equivalent, the optimal bitmap may be different. |
There could be a separate method that computes and returns the input to be used. |
I am moving it to 8 as I don't see us implementing it before we snap. Please feel free to prove me wrong by sending a PR ;) |
Re: dotnet/aspnetcore#33776 (comment) Prelude and motivating exampleSome context for the problem domain in ASP.NET Core: Source generator for IndexOfAny{Except}https://github.com/gfoidl-Tests/SourceGenerator-Vectors is quickly written prototype for a Roslyn source generator* to create vectorized code for * I'm sure some pieces may be written in a better way, as I don't have too much experience with writing generators. The aforementioned validation of e.g. Authority in ASP.NET Core could be written as private const string AlphaNumeric = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
private const string AuthoritySpecific = ":.-[]@";
private const string ValidAuthorityChars = AlphaNumeric + AuthoritySpecific;
// ...
public bool IsValidByGeneratedIndexOfAnyExcept() => IsAuthorityValid(this.Authority) < 0;
[GeneratedIndexOfAny(ValidAuthorityChars, FindAnyExcept = true)]
private static partial int IsAuthorityValid(ReadOnlySpan<char> host); so basically as two-liner. In a benchmark compared with the shipped
benchmark codeusing BenchmarkDotNet.Attributes;
Bench bench = new();
Console.WriteLine(bench.Authority);
Console.WriteLine(bench.IsValidByIndexOfAnyExcept());
Console.WriteLine(bench.IsValidByGeneratedIndexOfAnyExcept());
#if !DEBUG
BenchmarkDotNet.Running.BenchmarkRunner.Run<Bench>();
#endif
public partial class Bench
{
private const string AlphaNumeric = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
private const string AuthoritySpecific = ":.-[]@";
private const string ValidAuthorityChars = AlphaNumeric + AuthoritySpecific;
[Params("hostname:8080", "www.thelongestdomainnameintheworldandthensomeandthensomemoreandmore.com")]
public string Authority { get; set; } = "hostname:8080";
[Benchmark(Baseline = true)]
public bool IsValidByIndexOfAnyExcept() => this.Authority.AsSpan().IndexOfAnyExcept(ValidAuthorityChars) < 0;
[Benchmark]
public bool IsValidByGeneratedIndexOfAnyExcept() => IsAuthorityValid(this.Authority) < 0;
[GeneratedIndexOfAny(ValidAuthorityChars, FindAnyExcept = true)]
private static partial int IsAuthorityValid(ReadOnlySpan<char> host);
} For code like internal static partial class Demo1
{
[GeneratedIndexOfAny("1234567890")]
public static partial int FirstIndexOfNumber(ReadOnlySpan<char> value);
[GeneratedIndexOfAny("drvxyz")]
public static partial int FirstIndexOfSet0(ReadOnlySpan<char> value);
[GeneratedIndexOfAny("12🌄34")]
public static partial int FirstIndexOfNonAsciiSet(ReadOnlySpan<char> value);
}
namespace Demo.Internal
{
internal static partial class Demo2
{
[GeneratedIndexOfAny("abcd", FindAnyExcept = true)]
public static partial int FirstIndexOfNotSet0(ReadOnlySpan<char> value);
[GeneratedIndexOfAny("abcdef", FindAnyExcept = true)]
public static partial int FirstIndexOfNotSet1(ReadOnlySpan<char> value);
}
} the generator emits // ...
partial class Demo1
{
[GeneratedCode("Generator", "1.0.0.0")]
[EditorBrowsable(EditorBrowsableState.Never)]
public static partial int FirstIndexOfNumber(System.ReadOnlySpan<char> value)
{
ReadOnlySpan<bool> lookup = new bool[128] { false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, true, true, true, true, true, true, true, true, true, true, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false };
Vector128<sbyte> mask = Vector128.Create(0x08, 0x08, 0x08, 0x08, 0x08, 0x08, 0x08, 0x08, 0x08, 0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00).AsSByte();
return Vector128.IsHardwareAccelerated && value.Length >= Vector128<short>.Count
? Core.IndexOfMatchCharVectorized<Core.Negate>(value, mask)
: Core.IndexOfMatchCharScalar<Core.DontNegate>(value, lookup);
}
[GeneratedCode("Generator", "1.0.0.0")]
[EditorBrowsable(EditorBrowsableState.Never)]
public static partial int FirstIndexOfSet0(System.ReadOnlySpan<char> value)
{
ReadOnlySpan<bool> lookup = new bool[128] { false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, true, false, false, false, false, false, false, false, false, false, false, false, false, false, true, false, false, false, true, false, true, true, true, false, false, false, false, false };
Vector128<sbyte> mask = Vector128.Create(0x00, 0x00, 0x80, 0x00, 0x40, 0x00, 0x80, 0x00, 0x80, 0x80, 0x80, 0x00, 0x00, 0x00, 0x00, 0x00).AsSByte();
return Vector128.IsHardwareAccelerated && value.Length >= Vector128<short>.Count
? Core.IndexOfMatchCharVectorized<Core.Negate>(value, mask)
: Core.IndexOfMatchCharScalar<Core.DontNegate>(value, lookup);
}
[GeneratedCode("Generator", "1.0.0.0")]
[EditorBrowsable(EditorBrowsableState.Never)]
public static partial int FirstIndexOfNonAsciiSet(System.ReadOnlySpan<char> value)
{
// Given SetChars consist not of all ASCII, can't handle vectorized, so use fallback
return value.IndexOfAny("12🌄34");
}
}
namespace Demo.Internal
{
partial class Demo2
{
[GeneratedCode("Generator", "1.0.0.0")]
[EditorBrowsable(EditorBrowsableState.Never)]
public static partial int FirstIndexOfNotSet0(System.ReadOnlySpan<char> value)
{
// Given SetChars <= 5, so use specialized method from IndexOfAny{Except}
return value.IndexOfAnyExcept("abcd");
}
[GeneratedCode("Generator", "1.0.0.0")]
[EditorBrowsable(EditorBrowsableState.Never)]
public static partial int FirstIndexOfNotSet1(System.ReadOnlySpan<char> value)
{
ReadOnlySpan<bool> lookup = new bool[128] { false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, true, true, true, true, true, true, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false };
Vector128<sbyte> mask = Vector128.Create(0x00, 0x40, 0x40, 0x40, 0x40, 0x40, 0x40, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00).AsSByte();
return Vector128.IsHardwareAccelerated && value.Length >= Vector128<short>.Count
? Core.IndexOfMatchCharVectorized<Core.DontNegate>(value, mask)
: Core.IndexOfMatchCharScalar<Core.Negate>(value, lookup);
}
}
}
// ... + helper-type with the actual workhorse-methods emitted helper-method [MethodImpl(MethodImplOptions.AggressiveInlining)]
public static int IndexOfMatchCharScalar<TNegator>(ReadOnlySpan<char> value, ReadOnlySpan<bool> lookup)
where TNegator : struct, INegator
{
for (int i = 0; i < value.Length; ++i)
{
char c = value[i];
bool match = TNegator.NegateIfNeeded(lookup[c]);
if (match)
{
return i;
}
}
return -1;
}
public static int IndexOfMatchCharVectorized<TNegator>(ReadOnlySpan<char> value, Vector128<sbyte> bitMaskLookup)
where TNegator : struct, INegator
{
Debug.Assert(Vector128.IsHardwareAccelerated);
Debug.Assert(value.Length >= Vector128<short>.Count);
// To check if a bit in a bitmask from the Bitmask is set, in a sequential code
// we would do ((1 << bitIndex) & bitmask) != 0
// As there is no hardware instrinic for such a shift, we use a lookup that
// stores the shifted bitpositions.
// So (1 << bitIndex) becomes BitPosLook[bitIndex], which is simd-friendly.
//
// A bitmask from the bitMaskLookup is created only for values 0..7 (one byte),
// so to avoid a explicit check for values outside 0..7, i.e.
// high nibbles 8..F, we use a bitpos that always results in escaping.
Vector128<sbyte> bitPosLookup = Vector128.Create(
0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, // high-nibble 0..7
0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF // high-nibble 8..F
).AsSByte();
Vector128<sbyte> nibbleMaskSByte = Vector128.Create((sbyte)0xF);
Vector128<sbyte> zeroMaskSByte = Vector128<sbyte>.Zero;
nuint idx = 0;
uint mask;
ref short ptr = ref Unsafe.As<char, short>(ref MemoryMarshal.GetReference(value));
if (value.Length >= 2 * Vector128<short>.Count)
{
nuint end = (uint)(value.Length - 2 * Vector128<short>.Count);
do
{
Vector128<short> source0 = Vector128.LoadUnsafe(ref ptr, idx);
Vector128<short> source1 = Vector128.LoadUnsafe(ref ptr, idx + 8);
Vector128<sbyte> values = NarrowWithSaturation(source0, source1);
mask = CreateEscapingMask<TNegator>(values, bitMaskLookup, bitPosLookup, nibbleMaskSByte, zeroMaskSByte);
if (mask != 0)
{
goto Found;
}
idx += 2 * (uint)Vector128<short>.Count;
}
while (idx <= end);
}
// Here we know that 8 to 15 chars are remaining. Process the first 8 chars.
if (idx <= (uint)(value.Length - Vector128<short>.Count))
{
Vector128<short> source = Vector128.LoadUnsafe(ref ptr, idx);
Vector128<sbyte> values = NarrowWithSaturation(source, source);
mask = CreateEscapingMask<TNegator>(values, bitMaskLookup, bitPosLookup, nibbleMaskSByte, zeroMaskSByte);
if (mask != 0)
{
goto Found;
}
idx += (uint)Vector128<short>.Count;
}
// Here we know that < 8 chars are remaining. We shift the space around to process
// another full vector.
nuint remaining = (uint)value.Length - idx;
if ((nint)remaining > 0)
{
remaining -= (uint)Vector128<short>.Count;
Vector128<short> source = Vector128.LoadUnsafe(ref ptr, idx + remaining);
Vector128<sbyte> values = NarrowWithSaturation(source, source);
mask = CreateEscapingMask<TNegator>(values, bitMaskLookup, bitPosLookup, nibbleMaskSByte, zeroMaskSByte);
if (mask != 0)
{
idx += remaining;
goto Found;
}
}
goto NotFound;
Found:
idx += GetIndexOfFirstNeedToEscape(mask);
return (int)idx;
NotFound:
return -1;
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static uint GetIndexOfFirstNeedToEscape(uint mask)
{
// Found at least one byte that needs to be escaped, figure out the index of
// the first one found that needs to be escaped within the 16 bytes.
Debug.Assert(mask > 0 && mask <= 65_535);
uint tzc = uint.TrailingZeroCount(mask);
Debug.Assert(tzc < 16);
return tzc;
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static uint CreateEscapingMask<TNegator>(
Vector128<sbyte> values,
Vector128<sbyte> bitMaskLookup,
Vector128<sbyte> bitPosLookup,
Vector128<sbyte> nibbleMaskSByte,
Vector128<sbyte> nullMaskSByte)
where TNegator : struct, INegator
{
// To check if an input byte matches or not, we use a bitmask-lookup.
// Therefore we split the input byte into the low- and high-nibble, which will get
// the row-/column-index in the bit-mask.
// The bitmask-lookup looks like:
// high-nibble
// low-nibble 0 1 2 3 4 5 6 7 8 9 A B C D E F
// 0 1 1 0 0 0 0 1 0 1 1 1 1 1 1 1 1
// 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1
// 2 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1
// 3 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1
// 4 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1
// 5 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1
// 6 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1
// 7 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1
// 8 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1
// 9 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1
// A 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1
// B 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1
// C 1 1 0 1 0 1 0 0 1 1 1 1 1 1 1 1
// D 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1
// E 1 1 0 1 0 0 0 0 1 1 1 1 1 1 1 1
// F 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 1
//
// where 1 denotes a matach, while 0 means no match.
// For high-nibbles in the range 8..F every input needs to be escaped, so we
// can omit them in the bit-mask, thus only high-nibbles in the range 0..7 need
// to be considered, hence the entries in the bit-mask can be of type byte.
//
// In the bitmask-lookup for each row (= low-nibble) a bit-mask for the
// high-nibbles (= columns) is created.
Debug.Assert(Vector128.IsHardwareAccelerated);
// Perf: the shift needs to be done as Int32, as there exists a hw-instruction and no sw-emulation needs to be done.
// Cf. https://github.com/dotnet/runtime/issues/75770
// Due to the Int32-shift we need to mask out any remaining bits from the shifted nibble.
Vector128<sbyte> highNibbles = Vector128.ShiftRightLogical(values.AsInt32(), 4).AsSByte() & nibbleMaskSByte;
Vector128<sbyte> lowNibbles = values & nibbleMaskSByte;
Vector128<sbyte> bitMask = Shuffle(bitMaskLookup, lowNibbles);
Vector128<sbyte> bitPositions = Shuffle(bitPosLookup , highNibbles);
Vector128<sbyte> mask = bitPositions & bitMask;
Vector128<sbyte> comparison = Vector128.Equals(nullMaskSByte, mask);
comparison = TNegator.NegateIfNeeded(comparison);
return comparison.ExtractMostSignificantBits();
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static Vector128<sbyte> NarrowWithSaturation(Vector128<short> v0, Vector128<short> v1)
{
Debug.Assert(Vector128.IsHardwareAccelerated);
// TODO: https://github.com/dotnet/runtime/issues/75724
if (Sse2.IsSupported)
{
return Sse2.PackSignedSaturate(v0, v1);
}
else
{
// This is not the exact algorithm for saturation, but for the use-case
// here it's does what it should do. I.e. eliminate non-ASCII chars in the
// results.
Vector128<short> v0HighNibbles = Vector128.ShiftRightLogical(v0, 8);
Vector128<short> v1HighNibbles = Vector128.ShiftRightLogical(v1, 8);
Vector128<short> ascii0 = Vector128.Equals(Vector128<short>.Zero, v0HighNibbles);
Vector128<short> ascii1 = Vector128.Equals(Vector128<short>.Zero, v1HighNibbles);
v0 &= ascii0;
v1 &= ascii1;
return Vector128.Narrow(v0, v1);
}
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static Vector128<sbyte> Shuffle(Vector128<sbyte> vector, Vector128<sbyte> indices)
{
// Perf: Ssse3.Shuffle produces better code
return Sse3.IsSupported
? Ssse3.Shuffle(vector, indices)
: Vector128.Shuffle(vector, indices);
}
public interface INegator
{
static abstract bool NegateIfNeeded(bool equals);
static abstract Vector128<sbyte> NegateIfNeeded(Vector128<sbyte> equals);
}
public struct DontNegate : INegator
{
public static bool NegateIfNeeded(bool equals) => equals;
public static Vector128<sbyte> NegateIfNeeded(Vector128<sbyte> equals) => equals;
}
public struct Negate : INegator
{
public static bool NegateIfNeeded(bool equals) => !equals;
public static Vector128<sbyte> NegateIfNeeded(Vector128<sbyte> equals) => ~equals;
} The behavior is as follows:
Right now only Side note: for ASP.NET Core's validation there is an "extended" case which allows chars > 0x7F (non-ASCII) too. I don't know how hot this method is, and if it is worth to vectorize it, as it's only executed if opted into non-ASCII, which I believe is seldom (maybe wrong here). Reflection of the generator approachPreliminary word: due my work for dotnet/aspnetcore#44041 I'm a bit biased towards the generator at the moment, so it may not be the most objective reflection. BenefitsNo startup and runtime overheadThe generator spits out code that is ready to use. All the validation, analysis is already done.
Hm, here I see similar problems as with "Discoverability" below. At least it seems a bit unwieldy to me, especially if the result of that init-method is a data-structure which needs to be kept around (and alive). PerformanceThe current state of the generator emits As the generator has enough time for analysis, more sophisticated searches could be applied for special cases or contiguous sets, which would add more overhead to a runtime-based version. Though one with init-method could amortized that. DownsidesDiscoverabilityNot so intuitive and discoverable as just typing Use in Regex generatorAs alluded in previous comments the regex generator would be a principal beneficiary of faster IndexOfAnyXyz-variants. In a quick test (.NET 7 daily build from today, VS 17.3.4) it didn't work that one generator emits code that needs another generator (which I hoped it worked). If this is a show-stopper then the generator-based approach has to be abandoned. Only for contant setsI don't know if this is a real downside, as most if not all (?) sets to check are known at development-time. More maintenance workUpdating the implementation in Could be provided w/o generator tooAs cited above a init-method could be used to create a data-structe that contains the relevant pieces and that is handed to an inbox-implementation. SummaryThe generator maybe like taking a sledgehammer to crack a nut, but as it has more time to analyse the given "pattern" it can emit more ideal code. So (at the moment) a generator-based approach is what I'm leaning towards. Major pain-point is on how to align with the use in the regex-generator. Notes
PS: Now I'm pretty sure to have missed a valid point, as I didn't take notes while developing the generator...mea culpa. |
Thanks for all your thinking / work on this!
Yes, generators currently can't depend on other generators. I don't know if / when / how that will change, but it's definitely a known limitation.
If we were going to utilize a generator as part of a solution here, this is the path I'd want to take: a usable in-box implementation, and then a generator that makes it easier to produce the right inputs and call that helper but that doesn't require a generator to do so.
It is. For example, RegexOptions.Compiled should be able to use this as well.
A huge downside for me is I'd really, really, really like to avoid spitting use of unsafe code / VectorXx into user assemblies. It's a large amount of code, likely processing untrusted input, and I think we should strongly err on the side of having our generators avoid such output whenever possible. In some situations it might be unavoidable, but I don't believe this is one of those cases. |
Yep, thanks for your thoughts on this. So I'll try to create an inbox-solution (after a little break 😉). |
Idea for pure inbox-solution with an init-method: API(please don't mind the naming, my brain isn't creative enough for this today) public static class MemoryExtensions
{
// ...
public static IndexOfAnyInitData IndexOfAnyInitialize(ReadOnlySpan<char> values);
public readonly struct IndexOfAnyInitData
{
public bool IsAllAscii { get; }
public UInt128 Lookup { get; }
public Vector128<sbyte> Mask { get; }
public IndexOfAnyInitData(bool isAllAscii, UInt128 lookup, Vector128<sbyte> mask);
}
} Usage exampleprivate const string AlphaNumeric = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
private const string AuthoritySpecific = ":.-[]@";
private const string ValidAuthorityChars = AlphaNumeric + AuthoritySpecific;
private static readonly MemoryExtension.IndexOfAnyInitData s_authorityInitData = MemoryExtension.IndexOfAnyInitialize(ValidAuthorityChars);
public bool IsAuthoryValid(ReadOnlySpan<char> authoritiy) => authority.IndexOfAnyExcept(ValidAuthorityChars, s_authorityInitData) < 0; Perf-number are the same as above (it's the same worker-code). ImplementationWorkhorse-code is the same above emitted from the generator. Here for reference / completeness. using System.Diagnostics;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
namespace StrippedCoreLib;
public static class MemoryExtensions
{
public readonly struct IndexOfAnyInitData
{
public bool IsAllAscii { get; }
public UInt128 Lookup { get; }
public Vector128<sbyte> Mask { get; }
public IndexOfAnyInitData(bool isAllAscii, UInt128 lookup, Vector128<sbyte> mask)
{
this.IsAllAscii = isAllAscii;
this.Lookup = lookup;
this.Mask = mask;
}
}
public static IndexOfAnyInitData IndexOfAnyInitialize(ReadOnlySpan<char> values)
{
if (values.Length <= 5)
{
return default;
}
foreach (char c in values)
{
if (!char.IsAscii(c))
{
return default;
}
}
UInt128 lookup = 0;
sbyte[] mask = new sbyte[128 / 8];
foreach (char c in values)
{
SetBitInMask(ref lookup, mask, c);
}
Vector128<sbyte> vecMask = Vector128.LoadUnsafe(ref MemoryMarshal.GetArrayDataReference(mask));
return new IndexOfAnyInitData(true, lookup, vecMask);
static void SetBitInMask(ref UInt128 lookup, sbyte[] mask, int c)
{
Debug.Assert(c < 128);
lookup |= (1UL << c);
int highNibble = c >> 4;
int lowNibble = c & 0xF;
mask[lowNibble] |= (sbyte)(1 << highNibble);
}
}
public static int IndexOfAny(this ReadOnlySpan<char> span, ReadOnlySpan<char> values, IndexOfAnyInitData initData)
{
if (values.Length <= 5 || !initData.IsAllAscii)
{
return span.IndexOfAny(values);
}
return Vector128.IsHardwareAccelerated && span.Length >= Vector128<sbyte>.Count
? IndexOfMatchVectorized<Negate>(span, initData.Mask)
: IndexOfMatchScalar<DontNegate>(span, initData.Lookup);
}
public static int IndexOfAnyExcept(this ReadOnlySpan<char> span, ReadOnlySpan<char> values, IndexOfAnyInitData initData)
{
if (values.Length <= 5 || !initData.IsAllAscii)
{
return span.IndexOfAnyExcept(values);
}
return Vector128.IsHardwareAccelerated && span.Length >= Vector128<sbyte>.Count
? IndexOfMatchVectorized<DontNegate>(span, initData.Mask)
: IndexOfMatchScalar<Negate>(span, initData.Lookup);
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static int IndexOfMatchScalar<TNegator>(ReadOnlySpan<char> value, UInt128 lookup)
where TNegator : struct, INegator
{
for (int i = 0; i < value.Length; ++i)
{
char c = value[i];
bool match = ((1UL << c) & lookup) != 0;
match = TNegator.NegateIfNeeded(match);
if (match)
{
return i;
}
}
return -1;
}
private static int IndexOfMatchVectorized<TNegator>(ReadOnlySpan<char> value, Vector128<sbyte> bitMaskLookup)
where TNegator : struct, INegator
{
Debug.Assert(Vector128.IsHardwareAccelerated);
Debug.Assert(value.Length >= Vector128<short>.Count);
// To check if a bit in a bitmask from the Bitmask is set, in a sequential code
// we would do ((1 << bitIndex) & bitmask) != 0
// As there is no hardware instrinic for such a shift, we use a lookup that
// stores the shifted bitpositions.
// So (1 << bitIndex) becomes BitPosLook[bitIndex], which is simd-friendly.
//
// A bitmask from the bitMaskLookup is created only for values 0..7 (one byte),
// so to avoid a explicit check for values outside 0..7, i.e.
// high nibbles 8..F, we use a bitpos that always results in escaping.
Vector128<sbyte> bitPosLookup = Vector128.Create(
0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, // high-nibble 0..7
0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF // high-nibble 8..F
).AsSByte();
Vector128<sbyte> nibbleMaskSByte = Vector128.Create((sbyte)0xF);
Vector128<sbyte> zeroMaskSByte = Vector128<sbyte>.Zero;
nuint idx = 0;
uint mask;
ref short ptr = ref Unsafe.As<char, short>(ref MemoryMarshal.GetReference(value));
if (value.Length >= 2 * Vector128<short>.Count)
{
nuint end = (uint)(value.Length - 2 * Vector128<short>.Count);
do
{
Vector128<short> source0 = Vector128.LoadUnsafe(ref ptr, idx);
Vector128<short> source1 = Vector128.LoadUnsafe(ref ptr, idx + 8);
Vector128<sbyte> values = NarrowWithSaturation(source0, source1);
mask = CreateEscapingMask<TNegator>(values, bitMaskLookup, bitPosLookup, nibbleMaskSByte, zeroMaskSByte);
if (mask != 0)
{
goto Found;
}
idx += 2 * (uint)Vector128<short>.Count;
}
while (idx <= end);
}
// Here we know that 8 to 15 chars are remaining. Process the first 8 chars.
if (idx <= (uint)(value.Length - Vector128<short>.Count))
{
Vector128<short> source = Vector128.LoadUnsafe(ref ptr, idx);
Vector128<sbyte> values = NarrowWithSaturation(source, source);
mask = CreateEscapingMask<TNegator>(values, bitMaskLookup, bitPosLookup, nibbleMaskSByte, zeroMaskSByte);
if (mask != 0)
{
goto Found;
}
idx += (uint)Vector128<short>.Count;
}
// Here we know that < 8 chars are remaining. We shift the space around to process
// another full vector.
nuint remaining = (uint)value.Length - idx;
if ((nint)remaining > 0)
{
remaining -= (uint)Vector128<short>.Count;
Vector128<short> source = Vector128.LoadUnsafe(ref ptr, idx + remaining);
Vector128<sbyte> values = NarrowWithSaturation(source, source);
mask = CreateEscapingMask<TNegator>(values, bitMaskLookup, bitPosLookup, nibbleMaskSByte, zeroMaskSByte);
if (mask != 0)
{
idx += remaining;
goto Found;
}
}
goto NotFound;
Found:
idx += GetIndexOfFirstNeedToEscape(mask);
return (int)idx;
NotFound:
return -1;
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static uint GetIndexOfFirstNeedToEscape(uint mask)
{
// Found at least one byte that needs to be escaped, figure out the index of
// the first one found that needs to be escaped within the 16 bytes.
Debug.Assert(mask > 0 && mask <= 65_535);
uint tzc = uint.TrailingZeroCount(mask);
Debug.Assert(tzc < 16);
return tzc;
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static uint CreateEscapingMask<TNegator>(
Vector128<sbyte> values,
Vector128<sbyte> bitMaskLookup,
Vector128<sbyte> bitPosLookup,
Vector128<sbyte> nibbleMaskSByte,
Vector128<sbyte> nullMaskSByte)
where TNegator : struct, INegator
{
// To check if an input byte matches or not, we use a bitmask-lookup.
// Therefore we split the input byte into the low- and high-nibble, which will get
// the row-/column-index in the bit-mask.
// The bitmask-lookup looks like:
// high-nibble
// low-nibble 0 1 2 3 4 5 6 7 8 9 A B C D E F
// 0 1 1 0 0 0 0 1 0 1 1 1 1 1 1 1 1
// 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1
// 2 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1
// 3 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1
// 4 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1
// 5 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1
// 6 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1
// 7 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1
// 8 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1
// 9 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1
// A 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1
// B 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1
// C 1 1 0 1 0 1 0 0 1 1 1 1 1 1 1 1
// D 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1
// E 1 1 0 1 0 0 0 0 1 1 1 1 1 1 1 1
// F 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 1
//
// where 1 denotes a matach, while 0 means no match.
// For high-nibbles in the range 8..F every input needs to be escaped, so we
// can omit them in the bit-mask, thus only high-nibbles in the range 0..7 need
// to be considered, hence the entries in the bit-mask can be of type byte.
//
// In the bitmask-lookup for each row (= low-nibble) a bit-mask for the
// high-nibbles (= columns) is created.
Debug.Assert(Vector128.IsHardwareAccelerated);
// Perf: the shift needs to be done as Int32, as there exists a hw-instruction and no sw-emulation needs to be done.
// Cf. https://github.com/dotnet/runtime/issues/75770
// Due to the Int32-shift we need to mask out any remaining bits from the shifted nibble.
Vector128<sbyte> highNibbles = Vector128.ShiftRightLogical(values.AsInt32(), 4).AsSByte() & nibbleMaskSByte;
Vector128<sbyte> lowNibbles = values & nibbleMaskSByte;
Vector128<sbyte> bitMask = Shuffle(bitMaskLookup, lowNibbles);
Vector128<sbyte> bitPositions = Shuffle(bitPosLookup, highNibbles);
Vector128<sbyte> mask = bitPositions & bitMask;
Vector128<sbyte> comparison = Vector128.Equals(nullMaskSByte, mask);
comparison = TNegator.NegateIfNeeded(comparison);
return comparison.ExtractMostSignificantBits();
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static Vector128<sbyte> NarrowWithSaturation(Vector128<short> v0, Vector128<short> v1)
{
Debug.Assert(Vector128.IsHardwareAccelerated);
// TODO: https://github.com/dotnet/runtime/issues/75724
if (Sse2.IsSupported)
{
return Sse2.PackSignedSaturate(v0, v1);
}
else
{
// This is not the exact algorithm for saturation, but for the use-case
// here it's does what it should do. I.e. eliminate non-ASCII chars in the
// results.
Vector128<short> v0HighNibbles = Vector128.ShiftRightLogical(v0, 8);
Vector128<short> v1HighNibbles = Vector128.ShiftRightLogical(v1, 8);
Vector128<short> ascii0 = Vector128.Equals(Vector128<short>.Zero, v0HighNibbles);
Vector128<short> ascii1 = Vector128.Equals(Vector128<short>.Zero, v1HighNibbles);
v0 &= ascii0;
v1 &= ascii1;
return Vector128.Narrow(v0, v1);
}
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static Vector128<sbyte> Shuffle(Vector128<sbyte> vector, Vector128<sbyte> indices)
{
// Perf: Ssse3.Shuffle produces better code
return Sse3.IsSupported
? Ssse3 .Shuffle(vector, indices)
: Vector128.Shuffle(vector, indices);
}
public interface INegator
{
static abstract bool NegateIfNeeded(bool equals);
static abstract Vector128<sbyte> NegateIfNeeded(Vector128<sbyte> equals);
}
public struct DontNegate : INegator
{
public static bool NegateIfNeeded(bool equals) => equals;
public static Vector128<sbyte> NegateIfNeeded(Vector128<sbyte> equals) => equals;
}
public struct Negate : INegator
{
public static bool NegateIfNeeded(bool equals) => !equals;
public static Vector128<sbyte> NegateIfNeeded(Vector128<sbyte> equals) => ~equals;
}
} NotesIf an init-method is used to "prepare" IndexOfAny{Except} so all necessary data should be returned. Thus instead of plain
|
Thanks, @gfoidl!
Is that actually doing the "right thing" with UInt128? I'd have expected this to produce the ulong, which will overflow after 64 bits, and then the result of that is what'll be or'd in.
This feels like extra complication we could instead address just by naming, e.g. putting this on the new Ascii class, or having Ascii in the method name like IndexOfAnyAscii / IndexOfAnyExceptAscii. If you passed a non-ASCII value to the initialize method, it would throw. My expectation is the 99% use case for this method is a known set of input characters, and if you know any isn't ASCII, then don't use this method.
Why
Does this significantly help? If the only thing you had was a |
In the implementation (last comment) there is a difference between the scalar- and vectorized-path as perf-optimization. Scalar is basically bit-test The scalar-path could to the same nibble-splitting to get rid of the
I think to understand your motivation for this, but I see these drawbacks:
For this API I imagine that based on the shape of set-chars different algorithms could be applied. E.g. the set consists of a few numbers, or control-characters, etc. There may be more specific ways to handle that than the "generic" algorithm based on nibble-splitting and the lookup. Thus for this iteration of my thinking about this the API should look like: public static class MemoryExtensions
{
public abstract class IndexOfAnyInitData { }
public static IndexOfAnyInitData IndexOfAnyInitialize(ReadOnlySpan<char> values);
} So we gain the freedom to return for Downside is that An implementation could look like then: private sealed class BitMaskIndexOfAnyInitData : IndexOfAnyInitData
{
public UInt128 Lookup { get; init; }
public Vector128<sbyte> Mask { get; init; }
}
public static IndexOfAnyInitData IndexOfAnyInitialize(ReadOnlySpan<char> values)
{
foreach (char c in values)
{
if (!char.IsAscii(c))
{
throw new InvalidOperationException("Non ASCII here");
}
}
// Analyze the values here and determine the best algorithm to use.
// Return the init-data specific to that algorithm.
// E.g.
return new BitMaskIndexOfAnyInitData { Lookup = lookup, Mask = vecMask };
}
public static int IndexOfAny(this ReadOnlySpan<char> span, IndexOfAnyInitData initData)
{
// Switch on the initData to select the proper algorithm
if (initData is BitMaskIndexOfAnyInitData bitMaskIndexOfAnyInitData)
{
return Vector128.IsHardwareAccelerated && span.Length >= Vector128<sbyte>.Count
? BitMaskIndexOfMatchVectorized<Negate>(span, bitMaskIndexOfAnyInitData.Mask)
: BitMaskIndexOfMatchScalar<DontNegate>(span, bitMaskIndexOfAnyInitData.Lookup);
}
throw new NotSupportedException("At the moment only the bitmask-appraoch is supported");
}
public static int IndexOfAnyExcept(this ReadOnlySpan<char> span, IndexOfAnyInitData initData)
{
// Analogous to above method
}
The code only works for ASCII-sets, so
👍 Now I also think this method shouldn't fallback to the current Don't fallback for the case of <= 5 set-chars provides another choice for the consumer which API to use which can be based on custom benchmarks.
Oh, you are right. I though to have checked that (can't remember what I tested actually). UInt128 a = 0;
UInt128 b = UInt128.MaxValue;
UInt128 c = (1UL << 127);
UInt128 d = (UInt128.One << 127);
Print(a);
Print(b);
Print(c);
Print(d);
static void Print(UInt128 value) => Console.WriteLine($"0x{value:X32}");
|
Sure, but the UInt128 lookup = 0;
foreach (char c in "...") lookup |= (UInt128.One << c); I'm mainly just trying to understand / explore options at this point. Note that I care about overheads for small inputs and also vectorized larger inputs when the match is found earlier; I care a lot less about overheads when Vector128 isn't hardware accelerated.
It could be. Not that it's impossible to do, but it gets awkward. For example, I'd like to be able to use this in source generated regexes. If this were just relevant to finding the next place a regex could match, we'd have only one of them. But we use {Last}IndexOf{Any} in a bunch of places, e.g. if there's a set after a (lazy) loop, we'll use {Last}IndexOf{Any} to optimize the backtracking, so with this API we'd end up needing to store one of these per such construct. I also have a visceral negative reaction to a source generator not being able to compute everything it needs at compile time and instead defer work to startup time. (Even if it's so cheap it's unfounded, and even if solutions like NativeAOT's ability to execute static cctors at compile time would obviate even those concerns.)
Right. Compiled uses DynamicMethod: we can't add fields.
I'd prefer byte.
There are certainly benefits to keeping the data opaque such that the IndexOfAnyAscii and its associated initialization method can be evolved. I'd like to better understand the cost of converting the UInt128 simple lookup form into a |
Quick benchmark for any except of
public static UInt128 IndexOfAnyInitialize(ReadOnlySpan<char> values);
public static int IndexOfAnyExcept(this ReadOnlySpan<char> span, UInt128 initData); so the vectorized version needs to convert the bit-array into the bit-mask needed at invocation.
public static Vector128<byte> IndexOfAnyInitialize(ReadOnlySpan<char> values);
public static int IndexOfAnyExcept(this ReadOnlySpan<char> span, Vector128<byte> initData); This yields interesting results, which I did not expect to be that way (especially that nibble-splitting is so fast).
Explanation first set with authority Explanation second set with authority So concluding I'd go with following approach: public static class MemoryExtensions
{
public static Vector128<byte> IndexOfAnyInitialize(ReadOnlySpan<char> values);
public static int IndexOfAny(this ReadOnlySpan<char> span, Vector128<byte> initData);
public static int IndexOfAnyExcept(this ReadOnlySpan<char> span, Vector128<byte> initData);
} Questions to this: Should the the scheme for populating that Should the type for init-data be
In case we see some common patterns of sets that could use a better algorithms, then another overload with an opaque init-data could be added. With this approach all our points should be satisfied? |
Thanks!
Can you share the benchmark code?
I think so. Whether or not we do, any changes to it are breaking, as the value can and will be persisted in binaries. This is why I was hoping the input could be a simple bitmap that the API would then transform as it needs. I'm surprised that transformation takes 250ns, that there's no way to reinterpret a UInt128 into a
I'd prefer UInt128. The code doesn't require vectorization (it'll function fine if Vector128.IsHardwareAccelerated is false and we take a fallback path), and |
You said in #69682 (comment):
Judging from it the Compiled mode should not use it but the interpreted mode could, right? |
Assuming there's a performance improvement to be had and it doesn't too negatively impact first-use, "it" could apply to all of them, where "it" is using the algorithm. Source generator would use a source generated implementation to avoid all startup costs and optimize throughput, Compiled could potentially use an IL emitted implementation in order to optimize throughpuit but could also use a built-in IndexOfAny if that made more sense, and the interpreter could use the built-in IndexOfAny. |
If we are not going to use this API as an implementation detail for source generated regexes, we may consider dropping this API and double-down on efficient source-generated regex support with good usability instead. It would reduce concept count and it would be similar to how some other language environments work. You just specify a regex and let the compiler worry about the details. |
There will be usages outside of regex, too. E.g. in ASP.NET Core (dotnet/aspnetcore#44041). |
I think there's some confusion here. The API currently proposed will be used by regex and many other places. What I was saying won't be used by source gen regex is the hypothetical additional method for aho corasick, mentioned here as being a possible future extension, that @teo-tsirpanis is looking at. |
What are all
Nit: This overload would be ambiguous with the existing method. Consider |
At least char and byte. If you think it's better to have explicit overloads for char and byte, we could do that, too. Though I'd hope the trimmer would be able to figure out (if not now, eventually) that a method And for char at least the cases where all inputs are ASCII (in which case it can use the vectorized implementation) or they aren't (in which case it can use the probabilistic map, or other such lookup tables). I'm hopeful in the future we'll discover additional ways to optimize as well, and that can all be done under the covers without additional APIs. We could also special-case inputs that would be satisfied by other public API, e.g. detect that non-ASCII inputs are all in a single range and then delegate to IndexOfAnyInRange, but for such cases I think we should start simple and just say callers should themselves use those APIs rather than having the API try to detect and "fix" it. |
It wouldn't be ambiguous from the compiler's perspective: if you used |
I keep thinking about this API primarily as a low-level regex helper that saves us from stamping the complex unsafe vectorized code in every source-generated regex. The API can be used directly too, but I expect only a small number of people are going to be successful in figuring when and how to correctly use it to their advantage. Having the surface limited and specialized to just what regex and parsers need would be fine with me (ie
How would the |
Why do you believe that's the case? It'll certainly be used by regex, but most places currently doing an IndexOfAny with more than 5 characters can benefit from this (and others that aren't using IndexOfAny because they could do something faster based on precomputing a bitmap). Some examples from a quick search:
I'm surprised by that. Plenty of folks figured out how to write: private static readonly char[] s_chars = new char[] { ... };
...
int index = something.IndexOfAny(s_chars); This follows the exact same form, just changing the line: private static readonly char[] s_chars = new char[] { ... }; to instead be: private static readonly IndexOfAnyValues<char> s_chars = IndexOfAnyValues.Create(new char[] { ... }); and as an overload of IndexOfAny, you're looking in IntelliSense through the overloads, you see one that takes an IndexOfAnyValues, you see it has a Create method, and you substitute an IndexOfAnyValues for your s_chars. We can even have an analyzer/fixer for it.
I think there's real value having these simply as overloads of the existing IndexOfAny: it's just one more way you can provide inputs to the existing functionality. I'd be fine replacing the generic |
I expect that most people will just use the existing
I expect only a fraction of read-world places to have outlined cache like this. I expect that majority are just going to allocate the array of chars to search for inline since that it is the most straightforward way to do it. Our examples show you to allocate the chars to search for inline too: https://learn.microsoft.com/en-us/dotnet/api/system.string.indexofany?view=net-7.0. Yes, there are certainly going to be cases where this API will be used directly, especially in our own libraries. I see it as an API for advanced performance-oriented developers only. |
Our examples show an entire program in a single snippet focused on highlighting what output you get for what input. They're not real-world.
I see a lot of it in searching around GitHub. But my searches may be skewed; I asked @marklio for help in getting some data. Regardless, whether it's a majority or minority, there are lots of places that will be able to benefit from it (as outlined in #68328 (comment)). We have multiple IndexOfAny overloads taking inputs in a variety of forms; this adds just one more, and I think it's the right way to expose this. I think trying to bury it elsewhere actually makes it more confusing, as it means there's now a set of IndexOfAny methods elsewhere that aren't obviously connected and thus force people to think about whether they need to consider the other one; as an overload, it's just another way of providing the same inputs. |
namespace System;
public static partial class MemoryExtensions
{
public static int IndexOfAny<T>(this ReadOnlySpan<T> span, IndexOfAnyValues<T> values) where T : IEquatable<T>?;
public static int IndexOfAnyExcept<T>(this ReadOnlySpan<T> span, IndexOfAnyValues<T> values) where T : IEquatable<T>?;
public static int LastIndexOfAny<T>(this ReadOnlySpan<T> span, IndexOfAnyValues<T> values) where T : IEquatable<T>?;
public static int LastIndexOfAnyExcept<T>(this ReadOnlySpan<T> span, IndexOfAnyValues<T> values) where T : IEquatable<T>?;
} namespace System.Buffers;
public static class IndexOfAnyValues
{
public static IndexOfAnyValues<byte> Create(ReadOnlySpan<byte> values);
public static IndexOfAnyValues<char> Create(ReadOnlySpan<char> values);
}
public class IndexOfAnyValues<T> where T : IEquatable<T>?
{
} |
This was an interesting journey. Thanks for all the beatiful discussion and thoughts on this! |
EDITED BY @stephentoub on 10/26/2022 to add revised API shape:
API Usage
EDITED BY @MihaZupan on 10/24/2022 to add proposed API shape:
API Proposal
API Usage
Alternative Designs
Static members on
Ascii
instead of extension methods inMemoryExtensions
.Based on feedback from API review on 2022-10-25.
Original post:
Earlier in the .NET 7 release, we combined the best of string.IndexOfAny and MemoryExtensions.IndexOfAny to provide a single implementation used by both. As part of this, whereas <= 5 characters are vectorized, > 5 characters use a "probabilistic map":
runtime/src/libraries/System.Private.CoreLib/src/System/MemoryExtensions.cs
Lines 1140 to 1167 in d58efd1
essentially a Bloom filter to quickly screen out characters that are definitely not in the set so that the more expensive check need only be done if it's likely the character is in the set. While faster than comparing every input character against every character in the set, we still have to walk the input character by character.
@MihaZupan's xoofx/markdig@da756f4 is an implementation of http://0x80.pl/articles/simd-byte-lookup.html#universal-algorithm, and highlights that it's possible to vectorize this operation instead. It would require startup overhead of determining whether all of the set characters are ASCII and generating a bitmap from them, but we're already doing such a loop in order to build the probabilistic map:
runtime/src/libraries/System.Private.CoreLib/src/System/ProbabilisticMap.cs
Lines 31 to 67 in d58efd1
We could simply augment that to also build up the ASCII bitmap, and if after initialization determine that every character was indeed ASCII, take the vectorized path of using that ASCII bitmap, otherwise fall back to using the probabilistic map as we do today.
The text was updated successfully, but these errors were encountered: