Faster IndexOf for substrings #63285

Merged: 48 commits, Jan 25, 2022

Changes from 31 commits

Commits
34bf37a
Improve "lastChar == firstChar" case, also, use IndexOf directly if v…
EgorBo Jan 7, 2022
03ae4da
Merge branch 'main' of https://github.com/dotnet/runtime into vectori…
EgorBo Jan 7, 2022
5cfdb16
Try plain IndexOf first, to optimize cases where even first char of v…
EgorBo Jan 7, 2022
0638617
Merge branch 'main' of github.com:dotnet/runtime into vectorize-indexof
EgorBo Jan 7, 2022
e36fdc6
add 1-byte implementation
EgorBo Jan 7, 2022
85c2320
copyrights
EgorBo Jan 7, 2022
8918ab6
fix copy-paste mistake
EgorBo Jan 7, 2022
cb32d34
Initial LastIndexOf impl
EgorBo Jan 8, 2022
cda6b50
More efficient LastIndexOf
EgorBo Jan 8, 2022
8af9270
fix bug in Char version (we need to clear two lowest bits in the mas…
EgorBo Jan 8, 2022
87c26d0
use ResetLowestSetBit
EgorBo Jan 8, 2022
652b42d
Fix bug
EgorBo Jan 8, 2022
53cefad
Add two-byte LastIndexOf
EgorBo Jan 8, 2022
d465407
Fix build
EgorBo Jan 8, 2022
22921fd
Merge branch 'main' of https://github.com/dotnet/runtime into vectori…
EgorBo Jan 9, 2022
2c851bc
Minor optimizations
EgorBo Jan 9, 2022
9308d82
optimize cases with two-byte/two-char values
EgorBo Jan 9, 2022
dac974a
Remove gotos, fix build
EgorBo Jan 9, 2022
3554ad3
fix bug in LastIndexOf
EgorBo Jan 9, 2022
de87ec2
Make sure String.LastIndexOf is optimized
EgorBo Jan 9, 2022
cb7541f
Merge branch 'main' of https://github.com/dotnet/runtime into vectori…
EgorBo Jan 13, 2022
b0b04ad
Use xplat simd helpers - implicit ARM support
EgorBo Jan 13, 2022
141e236
fix arm
EgorBo Jan 14, 2022
e664ad3
Delete \
EgorBo Jan 15, 2022
bff8419
Use Vector128.IsHardwareAccelerated
EgorBo Jan 15, 2022
dcc9d81
Merge branch 'main' of https://github.com/dotnet/runtime into vectori…
EgorBo Jan 15, 2022
3def5e0
Fix build
EgorBo Jan 15, 2022
a52138b
Use IsAllZero
EgorBo Jan 15, 2022
f86e323
Address feedback
EgorBo Jan 15, 2022
f2372a0
Address feedback
EgorBo Jan 15, 2022
38ef9a9
micro-optimization, do-while is better here since mask is guaranteed …
EgorBo Jan 15, 2022
f5e6192
Merge branch 'main' of https://github.com/dotnet/runtime into vectori…
EgorBo Jan 17, 2022
4827ddc
Address feedback
EgorBo Jan 18, 2022
d601351
Use clever trick I borrowed from IndexOfAny for trailing elements
EgorBo Jan 18, 2022
3a005bc
give up on +1 bump for SequenceCompare
EgorBo Jan 18, 2022
9fefe81
Clean up
EgorBo Jan 18, 2022
c118dd2
Clean up
EgorBo Jan 18, 2022
7e8b100
fix build
EgorBo Jan 18, 2022
d805701
Merge branch 'main' of github.com:dotnet/runtime into vectorize-indexof
EgorBo Jan 21, 2022
3bb56f0
Merge branch 'main' of https://github.com/dotnet/runtime into vectori…
EgorBo Jan 21, 2022
e9df891
Add debug asserts
EgorBo Jan 22, 2022
cb60535
Clean up: give up on the unrolled trick - too little value from code …
EgorBo Jan 22, 2022
7c55951
Add a test
EgorBo Jan 22, 2022
3f0b4c3
Fix build
EgorBo Jan 22, 2022
4d860a1
Merge branch 'main' of https://github.com/dotnet/runtime into vectori…
EgorBo Jan 24, 2022
48f4fc7
Add byte-specific test
EgorBo Jan 24, 2022
7c3b834
Fix build
EgorBo Jan 25, 2022
c68a07a
Update IndexOfSequence.byte.cs
EgorBo Jan 25, 2022
29 changes: 29 additions & 0 deletions THIRD-PARTY-NOTICES.TXT
@@ -697,6 +697,35 @@ License for fastmod (https://github.com/lemire/fastmod) and ibm-fpgen (https://g
See the License for the specific language governing permissions and
limitations under the License.

License for sse4-strstr (https://github.com/WojciechMula/sse4-strstr)
--------------------------------------

Copyright (c) 2008-2016, Wojciech Muła
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:

1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

License notice for The C++ REST SDK
-----------------------------------

@@ -580,12 +580,25 @@ ref Unsafe.As<T, char>(ref MemoryMarshal.GetReference(span)),
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static int LastIndexOf<T>(this ReadOnlySpan<T> span, ReadOnlySpan<T> value) where T : IEquatable<T>
{
if (Unsafe.SizeOf<T>() == sizeof(byte) && RuntimeHelpers.IsBitwiseEquatable<T>())
return SpanHelpers.LastIndexOf(
ref Unsafe.As<T, byte>(ref MemoryMarshal.GetReference(span)),
span.Length,
ref Unsafe.As<T, byte>(ref MemoryMarshal.GetReference(value)),
value.Length);
if (RuntimeHelpers.IsBitwiseEquatable<T>())
{
if (Unsafe.SizeOf<T>() == sizeof(byte))
{
return SpanHelpers.LastIndexOf(
ref Unsafe.As<T, byte>(ref MemoryMarshal.GetReference(span)),
span.Length,
ref Unsafe.As<T, byte>(ref MemoryMarshal.GetReference(value)),
value.Length);
}
if (Unsafe.SizeOf<T>() == sizeof(char))
{
return SpanHelpers.LastIndexOf(
ref Unsafe.As<T, char>(ref MemoryMarshal.GetReference(span)),
span.Length,
ref Unsafe.As<T, char>(ref MemoryMarshal.GetReference(value)),
value.Length);
}
}

return SpanHelpers.LastIndexOf<T>(ref MemoryMarshal.GetReference(span), span.Length, ref MemoryMarshal.GetReference(value), value.Length);
}
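The dispatch above checks that T is bitwise-equatable and then forwards to a width-specialized searcher by reinterpreting the span as bytes or chars. A minimal C sketch of the same idea, with naive backward searchers standing in for the SpanHelpers overloads (all names here are mine, not the BCL's):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Naive byte-wise LastIndexOf; the real code calls a vectorized SpanHelpers path.
static int last_index_of_u8(const uint8_t *s, int n, const uint8_t *v, int m) {
    for (int i = n - m; i >= 0; i--)
        if (memcmp(s + i, v, (size_t)m) == 0) return i;
    return -1;
}

// Same search over 16-bit elements (the char-sized case in the diff).
static int last_index_of_u16(const uint16_t *s, int n, const uint16_t *v, int m) {
    for (int i = n - m; i >= 0; i--)
        if (memcmp(s + i, v, (size_t)m * sizeof(uint16_t)) == 0) return i;
    return -1;
}

// elem_size plays the role of Unsafe.SizeOf<T>() in the C# dispatch.
static int last_index_of_generic(const void *s, int n, const void *v, int m,
                                 size_t elem_size) {
    if (elem_size == sizeof(uint8_t))
        return last_index_of_u8((const uint8_t *)s, n, (const uint8_t *)v, m);
    if (elem_size == sizeof(uint16_t))
        return last_index_of_u16((const uint16_t *)s, n, (const uint16_t *)v, m);
    return -1; // other widths would fall back to the element-wise generic path
}
```

The point of the C# change is exactly this shape: one width check per call, then a reinterpreting cast, so char spans get the same specialized treatment byte spans already had.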
@@ -708,5 +708,25 @@ public static nuint RotateRight(nuint value, int offset)
return (nuint)RotateRight((uint)value, offset);
#endif
}

/// <summary>
/// Reset the lowest significant bit in the given value
/// </summary>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static uint ResetLowestSetBit(uint value)
{
// It's lowered to BLSR on x86
return value & (value - 1);
}

/// <summary>
/// Reset specific bit in the given value
/// </summary>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static uint ResetBit(uint value, int bitPos)
{
// TODO: Recognize BTR on x86 and LSL+BIC on ARM
return value & ~(uint)(1 << bitPos);
}
}
}
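The two helpers added above are small bit tricks used later to walk candidate masks. An illustrative C port (names are mine) shows what they compute:

```c
#include <stdint.h>

// Clears the lowest set bit. The C# equivalent is lowered to BLSR on x86 (BMI1).
static uint32_t reset_lowest_set_bit(uint32_t value) {
    return value & (value - 1);
}

// Clears the bit at bit_pos. (The C# TODO notes BTR on x86 and LSL+BIC on ARM.)
static uint32_t reset_bit(uint32_t value, int bit_pos) {
    return value & ~(uint32_t)(1u << bit_pos);
}
```

IndexOf walks a match mask forward with TrailingZeroCount plus ResetLowestSetBit; LastIndexOf walks it backward with LeadingZeroCount plus ResetBit, which is why both helpers are needed.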
253 changes: 247 additions & 6 deletions src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs
@@ -22,12 +22,26 @@ public static int IndexOf(ref byte searchSpace, int searchSpaceLength, ref byte
if (valueLength == 0)
return 0; // A zero-length sequence is always treated as "found" at the start of the search space.

int valueTailLength = valueLength - 1;

if (valueTailLength == 0)
{
// for single-byte values use plain IndexOf
return IndexOf(ref searchSpace, value, searchSpaceLength);
}

byte valueHead = value;
ref byte valueTail = ref Unsafe.Add(ref value, 1);
int valueTailLength = valueLength - 1;
int offset = 0;
nuint valueTailNLength = (nuint)(uint)valueTailLength;

if (Vector128.IsHardwareAccelerated && searchSpaceLength - valueTailLength >= Vector128<byte>.Count)
{
goto SEARCH_TWO_BYTES;
}

int remainingSearchSpaceLength = searchSpaceLength - valueTailLength;

int offset = 0;
while (remainingSearchSpaceLength > 0)
{
// Do a quick search for the first element of "value".
@@ -42,13 +56,119 @@ public static int IndexOf(ref byte searchSpace, int searchSpaceLength, ref byte
break; // The unsearched portion is now shorter than the sequence we're looking for. So it can't be there.

// Found the first element of "value". See if the tail matches.
if (SequenceEqual(ref Unsafe.Add(ref searchSpace, offset + 1), ref valueTail, (nuint)valueTailLength)) // The (nuint)-cast is necessary to pick the correct overload
if (SequenceEqual(ref Unsafe.Add(ref searchSpace, offset + 1), ref valueTail, valueTailNLength)) // The (nuint)-cast is necessary to pick the correct overload
return offset; // The tail matched. Return a successful find.

remainingSearchSpaceLength--;
offset++;
}
return -1;

// Based on http://0x80.pl/articles/simd-strfind.html#algorithm-1-generic-simd "Algorithm 1: Generic SIMD" by Wojciech Muła
// Some details about the implementation can also be found in https://github.com/dotnet/runtime/pull/63285
SEARCH_TWO_BYTES:
if (Avx2.IsSupported && searchSpaceLength - valueTailLength >= Vector256<byte>.Count)
[Review thread on this line]

stephentoub (Member), Jan 17, 2022:
Why Avx2.IsSupported rather than Vector256.IsHardwareAccelerated? If we need to use Avx2 here, that seems like a failure of the Vector types we should fix.

EgorBo (Member, Author), Jan 17, 2022:
@stephentoub I am personally not a fan of the IsHardwareAccelerated property at all, and the same applies to Vector<>.IsHardwareAccelerated: both return true on, e.g., a Core i7 Ivy Bridge with AVX1, yet most likely take slow paths for many integer APIs. In IndexOf I rely on AVX2.

stephentoub (Member):
I maintain that if this can't be successful just using Vector*, something is wrong. A key point of these APIs is to not have to use or understand the low-level intrinsics.

EgorBo (Member, Author):
(comment content not captured in this export)

Member:
Like I stated in one of the other comments, I'm fine with us updating Vector256<T>.IsHardwareAccelerated to only return true on AVX2+, where both floating-point and integer operations can be accelerated using single instructions.

Member:
Do we need a follow-up issue opened then, to remove the use of Avx2.IsSupported here? Presumably, once VectorNNN is fully employed in these files, there should be no using statements for System.Runtime.Intrinsics.Arm or System.Runtime.Intrinsics.X86, if I correctly understand the ambitions of the new Vector APIs.

stephentoub (Member):
Yes, if it's not going to be addressed in this PR, we need to address it subsequently soon.

Member:
@tannergooding, is this something you're able to follow up on?

Member:
I opened #64309
{
// Find the last unique (which is not equal to ch1) byte
// the algorithm is fine if both are equal, just a little bit less efficient
byte ch2Val = Unsafe.Add(ref value, valueTailLength);
int ch1ch2Distance = valueTailLength;
while (ch2Val == value && ch1ch2Distance > 1)
ch2Val = Unsafe.Add(ref value, --ch1ch2Distance);

Vector256<byte> ch1 = Vector256.Create(value);
Vector256<byte> ch2 = Vector256.Create(ch2Val);

do
{
Vector256<byte> cmpCh1 = Vector256.Equals(ch1, Vector256.LoadUnsafe(ref searchSpace, (nuint)offset));
Vector256<byte> cmpCh2 = Vector256.Equals(ch2, Vector256.LoadUnsafe(ref searchSpace, (nuint)(offset + ch1ch2Distance)));
Vector256<byte> cmpAnd = (cmpCh1 & cmpCh2).AsByte();

// Early out: cmpAnd is all zeros
if (cmpAnd != Vector256<byte>.Zero)
{
uint mask = cmpAnd.ExtractMostSignificantBits();
do
{
int bitPos = BitOperations.TrailingZeroCount(mask);
if (valueTailNLength == 1 || // we already matched two bytes
SequenceEqual(
ref Unsafe.Add(ref searchSpace, offset + bitPos + 1),
ref valueTail,
valueTailNLength))
{
return offset + bitPos;
}
// Clear the lowest set bit
mask = BitOperations.ResetLowestSetBit(mask);
} while (mask != 0);
}

offset += Vector256<byte>.Count;

if (offset + valueTailLength == searchSpaceLength)
return -1;

// Overlap with the current chunk if there is not enough room for the next one
if (offset + valueTailLength + Vector256<byte>.Count > searchSpaceLength)
offset = searchSpaceLength - valueTailLength - Vector256<byte>.Count;

} while (true);
}

if (Vector128.IsHardwareAccelerated)
{
// Find the last unique (which is not equal to ch1) byte
// the algorithm is fine if both are equal, just a little bit less efficient
byte ch2Val = Unsafe.Add(ref value, valueTailLength);
int ch1ch2Distance = valueTailLength;
while (ch2Val == value && ch1ch2Distance > 1)
ch2Val = Unsafe.Add(ref value, --ch1ch2Distance);

Vector128<byte> ch1 = Vector128.Create(value);
Vector128<byte> ch2 = Vector128.Create(ch2Val);

do
{
Vector128<byte> cmpCh1 = Vector128.Equals(ch1, Vector128.LoadUnsafe(ref searchSpace, (nuint)offset));
Vector128<byte> cmpCh2 = Vector128.Equals(ch2, Vector128.LoadUnsafe(ref searchSpace, (nuint)(offset + ch1ch2Distance)));
Vector128<byte> cmpAnd = (cmpCh1 & cmpCh2).AsByte();

// Early out: cmpAnd is all zeros
// it's especially important for ARM where ExtractMostSignificantBits is not cheap
if (cmpAnd != Vector128<byte>.Zero)
{
uint mask = cmpAnd.ExtractMostSignificantBits();
do
{
int bitPos = BitOperations.TrailingZeroCount(mask);
if (valueTailNLength == 1 || // we already matched two bytes
SequenceEqual(
ref Unsafe.Add(ref searchSpace, offset + bitPos + 1),
ref valueTail,
valueTailNLength))
{
return offset + bitPos;
}
// Clear the lowest set bit
mask = BitOperations.ResetLowestSetBit(mask);
} while (mask != 0);
}
offset += Vector128<byte>.Count;

if (offset + valueTailLength == searchSpaceLength)
return -1;

// Overlap with the current chunk if there is not enough room for the next one
if (offset + valueTailLength + Vector128<byte>.Count > searchSpaceLength)
offset = searchSpaceLength - valueTailLength - Vector128<byte>.Count;

} while (true);
}

Debug.Fail("Unreachable");
return -1;
}
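The vectorized blocks above follow Wojciech Muła's "Algorithm 1: Generic SIMD": compare the needle's first byte and a second, distinct byte at a fixed distance across a whole chunk, AND the two comparison masks, and verify only the surviving candidates. A scalar C model of the same control flow, including the overlap trick for the trailing chunk, follows; it is a sketch under the assumption of W = 8 lanes (the real code uses 16 or 32), and all names are mine:

```c
#include <stdint.h>
#include <string.h>

static int index_of_two_byte_filter(const uint8_t *hay, int hay_len,
                                    const uint8_t *needle, int needle_len) {
    if (needle_len == 0) return 0;
    int last_start = hay_len - needle_len; // last valid match position
    if (last_start < 0) return -1;

    // First byte, plus the last byte that differs from it. The filter still
    // works when every byte is equal; it is just less selective.
    uint8_t ch1 = needle[0];
    int dist = needle_len - 1;
    uint8_t ch2 = needle[dist];
    while (ch2 == ch1 && dist > 1)
        ch2 = needle[--dist];

    enum { W = 8 }; // window width; Vector128/Vector256 use 16/32 lanes
    if (last_start + 1 < W) { // too small for a full window: naive scan
        for (int i = 0; i <= last_start; i++)
            if (memcmp(hay + i, needle, (size_t)needle_len) == 0) return i;
        return -1;
    }

    int offset = 0;
    for (;;) {
        // The SIMD code builds this mask with two Equals, an AND, and
        // ExtractMostSignificantBits.
        uint32_t mask = 0;
        for (int lane = 0; lane < W; lane++)
            if (hay[offset + lane] == ch1 && hay[offset + lane + dist] == ch2)
                mask |= 1u << lane;

        while (mask != 0) {
            int bit = __builtin_ctz(mask); // lowest candidate first (GCC/Clang)
            if (memcmp(hay + offset + bit, needle, (size_t)needle_len) == 0)
                return offset + bit;
            mask &= mask - 1;              // ResetLowestSetBit
        }

        offset += W;
        if (offset == last_start + 1)
            return -1;                     // every position has been filtered
        if (offset > last_start + 1 - W)
            offset = last_start + 1 - W;   // overlap with the previous chunk
    }
}
```

Clamping the final offset back so the last window overlaps the previous one is what lets the loop use full-width loads with no scalar tail.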

// Adapted from IndexOf(...)
@@ -416,11 +536,24 @@ public static int LastIndexOf(ref byte searchSpace, int searchSpaceLength, ref b
if (valueLength == 0)
return searchSpaceLength; // A zero-length sequence is always treated as "found" at the end of the search space.

byte valueHead = value;
ref byte valueTail = ref Unsafe.Add(ref value, 1);
int valueTailLength = valueLength - 1;

if (valueTailLength == 0)
{
// for single-byte values use plain LastIndexOf
return LastIndexOf(ref searchSpace, value, searchSpaceLength);
}

byte valueHead = value;
ref byte valueTail = ref Unsafe.Add(ref value, 1);
int offset = 0;
nuint valueTailNLength = (nuint)(uint)valueTailLength;

if (Vector128.IsHardwareAccelerated && searchSpaceLength - valueTailLength >= Vector128<byte>.Count)
{
goto SEARCH_TWO_BYTES;
}

while (true)
{
Debug.Assert(0 <= offset && offset <= searchSpaceLength); // Ensures no deceptive underflows in the computation of "remainingSearchSpaceLength".
@@ -434,12 +567,120 @@ public static int LastIndexOf(ref byte searchSpace, int searchSpaceLength, ref b
break;

// Found the first element of "value". See if the tail matches.
if (SequenceEqual(ref Unsafe.Add(ref searchSpace, relativeIndex + 1), ref valueTail, (nuint)(uint)valueTailLength)) // The (nunit)-cast is necessary to pick the correct overload
if (SequenceEqual(ref Unsafe.Add(ref searchSpace, relativeIndex + 1), ref valueTail, valueTailNLength)) // The (nunit)-cast is necessary to pick the correct overload
return relativeIndex; // The tail matched. Return a successful find.

offset += remainingSearchSpaceLength - relativeIndex;
}
return -1;

// Based on http://0x80.pl/articles/simd-strfind.html#algorithm-1-generic-simd "Algorithm 1: Generic SIMD" by Wojciech Muła
// Some details about the implementation can also be found in https://github.com/dotnet/runtime/pull/63285
SEARCH_TWO_BYTES:
if (Avx2.IsSupported && searchSpaceLength - valueTailLength >= Vector256<byte>.Count)
{
offset = searchSpaceLength - valueTailLength - Vector256<byte>.Count;

// Find the last unique (which is not equal to ch1) byte
// the algorithm is fine if both are equal, just a little bit less efficient
byte ch2Val = Unsafe.Add(ref value, valueTailLength);
int ch1ch2Distance = valueTailLength;
while (ch2Val == value && ch1ch2Distance > 1)
ch2Val = Unsafe.Add(ref value, --ch1ch2Distance);

Vector256<byte> ch1 = Vector256.Create(value);
Vector256<byte> ch2 = Vector256.Create(ch2Val);

do
{
Vector256<byte> cmpCh1 = Vector256.Equals(ch1, Vector256.LoadUnsafe(ref searchSpace, (nuint)offset));
Vector256<byte> cmpCh2 = Vector256.Equals(ch2, Vector256.LoadUnsafe(ref searchSpace, (nuint)(offset + ch1ch2Distance)));
Vector256<byte> cmpAnd = (cmpCh1 & cmpCh2).AsByte();

// Early out: cmpAnd is all zeros
if (cmpAnd != Vector256<byte>.Zero)
{
uint mask = cmpAnd.ExtractMostSignificantBits();
do
{
// unlike IndexOf, here we use LZCNT to process matches starting from the end
int bitPos = 31 - BitOperations.LeadingZeroCount(mask);
if (valueTailNLength == 1 || // we already matched two bytes
SequenceEqual(
ref Unsafe.Add(ref searchSpace, offset + bitPos + 1),
ref valueTail,
valueTailNLength))
{
return bitPos + offset;
}
// Clear the highest set bit.
mask = BitOperations.ResetBit(mask, bitPos);
} while (mask != 0);
}

offset -= Vector256<byte>.Count;
if (offset == -Vector256<byte>.Count)
return -1;
// Overlap with the current chunk if there is not enough room for the next one
if (offset < 0)
offset = 0;

} while (true);
}
if (Vector128.IsHardwareAccelerated)
{
offset = searchSpaceLength - valueTailLength - Vector128<byte>.Count;

// Find the last unique (which is not equal to ch1) byte
// the algorithm is fine if both are equal, just a little bit less efficient
byte ch2Val = Unsafe.Add(ref value, valueTailLength);
int ch1ch2Distance = valueTailLength;
while (ch2Val == value && ch1ch2Distance > 1)
ch2Val = Unsafe.Add(ref value, --ch1ch2Distance);

Vector128<byte> ch1 = Vector128.Create(value);
Vector128<byte> ch2 = Vector128.Create(ch2Val);

do
{
Vector128<byte> cmpCh1 = Vector128.Equals(ch1, Vector128.LoadUnsafe(ref searchSpace, (nuint)offset));
Vector128<byte> cmpCh2 = Vector128.Equals(ch2, Vector128.LoadUnsafe(ref searchSpace, (nuint)(offset + ch1ch2Distance)));
Vector128<byte> cmpAnd = (cmpCh1 & cmpCh2).AsByte();

// Early out: cmpAnd is all zeros
// it's especially important for ARM where ExtractMostSignificantBits is not cheap
if (cmpAnd != Vector128<byte>.Zero)
{
uint mask = cmpAnd.ExtractMostSignificantBits();
do
{
// unlike IndexOf, here we use LZCNT to process matches starting from the end
int bitPos = 31 - BitOperations.LeadingZeroCount(mask);
if (valueTailNLength == 1 || // we already matched two bytes
SequenceEqual(
ref Unsafe.Add(ref searchSpace, offset + bitPos + 1),
ref valueTail,
valueTailNLength))
{
return bitPos + offset;
}
// Clear the highest set bit.
mask = BitOperations.ResetBit(mask, bitPos);
} while (mask != 0);
}

offset -= Vector128<byte>.Count;
if (offset == -Vector128<byte>.Count)
return -1;
// Overlap with the current chunk if there is not enough room for the next one
if (offset < 0)
offset = 0;

} while (true);
}

Debug.Fail("Unreachable");
return -1;
}
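LastIndexOf runs the same two-byte filter right-to-left: it starts at the rightmost full window, takes candidates highest-bit-first via LeadingZeroCount, clears them with ResetBit, and overlaps the leftmost window with offset 0 when the space is not a multiple of the window width. A scalar C sketch (my naming, W = 8 standing in for the SIMD lane count):

```c
#include <stdint.h>
#include <string.h>

static int last_index_of_two_byte_filter(const uint8_t *hay, int hay_len,
                                         const uint8_t *needle, int needle_len) {
    if (needle_len == 0) return hay_len; // empty needle is "found" at the end
    int last_start = hay_len - needle_len;
    if (last_start < 0) return -1;

    uint8_t ch1 = needle[0];
    int dist = needle_len - 1;
    uint8_t ch2 = needle[dist];
    while (ch2 == ch1 && dist > 1)
        ch2 = needle[--dist];

    enum { W = 8 };
    if (last_start + 1 < W) { // naive backward scan for tiny inputs
        for (int i = last_start; i >= 0; i--)
            if (memcmp(hay + i, needle, (size_t)needle_len) == 0) return i;
        return -1;
    }

    int offset = last_start + 1 - W; // rightmost full window
    for (;;) {
        uint32_t mask = 0;
        for (int lane = 0; lane < W; lane++)
            if (hay[offset + lane] == ch1 && hay[offset + lane + dist] == ch2)
                mask |= 1u << lane;

        while (mask != 0) {
            int bit = 31 - __builtin_clz(mask); // highest candidate first
            if (memcmp(hay + offset + bit, needle, (size_t)needle_len) == 0)
                return offset + bit;
            mask &= ~(1u << bit);               // ResetBit
        }

        offset -= W;
        if (offset == -W)
            return -1;   // the previous window already started at 0
        if (offset < 0)
            offset = 0;  // overlap with the previous chunk
    }
}
```

The `offset == -W` test mirrors the `offset == -Vector128<byte>.Count` check in the diff: it distinguishes "the window at 0 was just processed" from "clamp to 0 and process one overlapping final window".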

[MethodImpl(MethodImplOptions.AggressiveOptimization)]