Use wider look-up table while searching chars on Arm64 #88183
Conversation
@kunalspathak @EgorBo, this patch combines two shuffle instructions into one on Arm64. This should improve the code quality, but currently it does not (hence the draft). Assembly sequences:

Without the patch

With the patch
The reason for the suboptimal sequence is the following operation using the Vector128 API.
Intermediate values computed during the AND/OR operations are lowered to loads/stores on the stack. Thus, we got

instead of
I'm marking this PR as a draft because, as of now, it may cause a regression. If we can avoid the unnecessary operations, we can take this further. Any thoughts?

Test details to reproduce the issue. Test C# example:

```csharp
private static Vector128<byte> IndexOfAnyLookup(Vector128<byte> source, Vector128<byte> bitmapLookup0, Vector128<byte> bitmapLookup1)
{
    Vector128<byte> lowNibbles = source & Vector128.Create((byte)0xF);
    Vector128<byte> highNibbles = Vector128.ShiftRightLogical(source.AsInt32(), 4).AsByte() & Vector128.Create((byte)0xF);
    Vector128<byte> row0 = ShuffleUnsafe(bitmapLookup0, lowNibbles);
    Vector128<byte> row1 = ShuffleUnsafe(bitmapLookup1, lowNibbles);
    Vector128<byte> mask = Vector128.GreaterThan(highNibbles.AsSByte(), Vector128.Create((sbyte)0x7)).AsByte();
    Vector128<byte> bitsets = Vector128.ConditionalSelect(mask, row1, row0);
    Vector128<byte> bitmask = ShuffleUnsafe(Vector128.Create(0x8040201008040201).AsByte(), highNibbles);
    Vector128<byte> result = Vector128.Equals(bitsets & bitmask, bitmask);
    return result;
}

private static Vector128<byte> IndexOfAnyLookupCombined(Vector128<byte> source, Vector128<byte> bitmapLookup0, Vector128<byte> bitmapLookup1)
{
    Vector128<byte> lowNibbles = source & Vector128.Create((byte)0xF);
    Vector128<byte> highNibbles = Vector128.ShiftRightLogical(source.AsInt32(), 4).AsByte() & Vector128.Create((byte)0xF);
    Vector128<byte> index = highNibbles & Vector128.Create(0x08080808).AsByte() | lowNibbles;
    Vector128<byte> bitsets = AdvSimd.Arm64.VectorTableLookup((bitmapLookup0, bitmapLookup1), index);
    Vector128<byte> bitmask = ShuffleUnsafe(Vector128.Create(0x8040201008040201).AsByte(), highNibbles);
    Vector128<byte> result = Vector128.Equals(bitsets & bitmask, bitmask);
    return result;
}
```

Full JIT dump of the method
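For readers following along, a scalar C sketch of the same nibble/bitmap idea may help. Everything here is illustrative: the helper names are invented, and the combined variant uses an explicit +16 offset to reach the second 16-byte half of the table (the role the two-register `tbl` plays in the vector code), which is an assumption about the intended index mapping rather than a quote from the PR.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Build the two 16-byte bitmaps: for a needle byte c, row (c & 0xF) gets
   bit (c >> 4) set. bitmap0 covers high nibbles 0..7, bitmap1 covers 8..15. */
static void build_bitmaps(const uint8_t *needle, size_t n,
                          uint8_t bitmap0[16], uint8_t bitmap1[16])
{
    memset(bitmap0, 0, 16);
    memset(bitmap1, 0, 16);
    for (size_t i = 0; i < n; i++) {
        uint8_t low = needle[i] & 0xF, high = needle[i] >> 4;
        if (high < 8)
            bitmap0[low] |= (uint8_t)(1u << high);
        else
            bitmap1[low] |= (uint8_t)(1u << (high - 8));
    }
}

/* Scalar equivalent of IndexOfAnyLookup for one byte: two row lookups plus
   a ConditionalSelect on (highNibble > 7). */
static int matches(uint8_t b, const uint8_t bitmap0[16], const uint8_t bitmap1[16])
{
    uint8_t low = b & 0xF, high = b >> 4;
    uint8_t row = (high >= 8) ? bitmap1[low] : bitmap0[low];
    /* ShuffleUnsafe(0x8040201008040201, highNibbles) picks bit 1 << (high & 7) */
    uint8_t bit = (uint8_t)(1u << (high & 7));
    return (row & bit) == bit;
}

/* Scalar model of the combined variant: one 32-byte table where +16 selects
   the second half, mirroring what a two-register tbl does in vector code. */
static int matches_combined(uint8_t b, const uint8_t table[32])
{
    uint8_t low = b & 0xF, high = b >> 4;
    uint8_t idx = (uint8_t)((high >= 8 ? 16 : 0) | low);
    uint8_t bit = (uint8_t)(1u << (high & 7));
    return (table[idx] & bit) == bit;
}
```

A 32-byte table for the combined variant is just `bitmapLookup0` followed by `bitmapLookup1`, so both variants must return the same answer for every input byte.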
CI failures seem related?
Yup, I'll check. 👍
Do you have preliminary performance numbers for this?
@SwapnilGaikwad Can you check if either #88090 or #88130 helps to get rid of these stack stores? |
Not really. I didn't collect them, as it emitted a suboptimal sequence. To gauge the benefits, we may extract the assembly and benchmark it through a C wrapper.
Sure, I'll give it a go.
The pattern below is very common because of the consecutive-register requirement. We don't have (at least currently) a way to come up with an allocation that satisfies the need for consecutive registers without this kind of reloading. ld2~ld4 might come to the rescue to some extent.

```asm
stp     q1, q2, [fp, #0x10]   // [V11 tmp1], [V11 tmp1+0x10]
ldp     q17, q18, [fp, #0x10] // [V11 tmp1], [V11 tmp1+0x10]
```
I think we should still get measurements for the code it generates today to see how much slowdown we get, and of course on top of #88090 or #88130 that Jakob mentioned.
After rebasing, the additional loads/stores from/to the stack are gone. Without the patch
Unfortunately, I couldn't get the benchmarking numbers. I'm using the dotnet performance repo for benchmarking, with the following benchmark and the setup steps from here.

```csharp
...
[Benchmark] public int LastIndexOfAnyBench() => ByteBuffer.LastIndexOfAny(SearchValues.Create(needleValues));
...
```

Do you have any suggestions to use the built libraries for performance runs? Using
IIRC, didn't it work in the past when we updated just the libraries portion and benchmarked them? Ideally the
Tagging subscribers to this area: @dotnet/area-system-buffers

Issue Details

Contributes towards the issue #84328
What error exactly did you get? In general, the project with micro benchmarks targets multiple runtimes, including older ones that don't define this type: If you want to add benchmarks for a type that has been introduced in .NET 8, you need to add it here: Then the project should compile. By using
Thanks @adamsitnik, now I managed to compile the benchmark and get performance numbers.
The performance numbers don't seem to show any improvements. It seems that the additional instructions to prepare the updated indices take away the gains from the combined table lookups. The perf score also indicates the same (base: 22.80, patch: 23.30).
What do you recommend for this PR?
Can you show the benchmark you are using?
Here. I'll be more than happy to try it if you have suggestions to stress the
Can you please test it with the following? Notably, the

```csharp
public class SearchValuesAnyBenchmark
{
    private static readonly SearchValues<byte> s_searchValues =
        SearchValues.Create("aeiou"u8.ToArray().Concat(new[] { (byte)0xFF }).ToArray());

    private byte[] _byteBuffer = null!;

    [Params(8, 16, 32, 1024)]
    public int Length;

    [GlobalSetup]
    public void Setup()
    {
        _byteBuffer = new byte[Length];
        _byteBuffer.AsSpan().Fill((byte)'\n');
        _byteBuffer[Length / 2] = (byte)'a';
    }

    [Benchmark] public int IndexOfAny() => _byteBuffer.AsSpan().IndexOfAny(s_searchValues);
}
```
Thanks @MihaZupan for the benchmark. Please find the following results from executing the benchmark.
@SwapnilGaikwad - for benchmark measurements, we typically prefer to run the
My bad. Please find the following numbers with both the runtime and libraries compiled in Release mode (previously, the libraries were in Release mode and the runtime was in Checked mode).
Thanks @SwapnilGaikwad for gathering the numbers. Do you mind pasting the before vs. after code once again for the
Whatever happens with this PR, perhaps this benchmark ought to be added to dotnet/performance.
Hi @kunalspathak, the benchmark didn't inline the IndexOfAny.

Benchmark
The changed part is in

Before Patch

After Patch
Looking at the disassembly, I don't see how ld2~ld4 would help improve the performance here. @tannergooding - do you have any other opinion about the algorithm?
The perf results don't show an improvement, but rather a regression (main is the baseline job; the MannWhitney column shows Slower).
I believe we should close the PR. @SwapnilGaikwad, thank you for your contribution!