Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use wider look-up table while searching chars on Arm64 #88183

Conversation

SwapnilGaikwad
Copy link
Contributor

@SwapnilGaikwad SwapnilGaikwad commented Jun 29, 2023

Contributes towards the issue #84328

@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Jun 29, 2023
@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Jun 29, 2023
@SwapnilGaikwad
Copy link
Contributor Author

@kunalspathak @EgorBo, this patch combines two shuffle instructions into one on Arm64. This should improve the code quality but currently it does not (thus the draft).

Assembly sequences-

Without the patch
   ; Assembly listing for method System.Text.Tests.Demo:IndexOfAnyLookup(System.Runtime.Intrinsics.Vector128`1[ubyte],System.Runtime.Intrinsics.Vector128`1[ubyte],System.Runtime.Intrinsics.Vector128`1[ubyte]):System.Runtime.Intrinsics.Vector128`1[ubyte]
; Emitting BLENDED_CODE for generic ARM64 - Unix
; optimized code
; fp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 0 single block inlinees; 3 inlinees without PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  4,  4   )  simd16  ->   d0         HFA(simd16)  single-def
;  V01 arg1         [V01,T01] (  3,  3   )  simd16  ->   d1         HFA(simd16)  single-def
;  V02 arg2         [V02,T02] (  3,  3   )  simd16  ->   d2         HFA(simd16)  single-def
;  V03 loc0         [V03,T03] (  3,  3   )  simd16  ->  d17         HFA(simd16)
;  V04 loc1         [V04,T04] (  3,  3   )  simd16  ->   d0         HFA(simd16)
;  V05 loc2         [V05,T07] (  2,  2   )  simd16  ->  d16         HFA(simd16)
;  V06 loc3         [V06,T08] (  2,  2   )  simd16  ->  d17         HFA(simd16)
;  V07 loc4         [V07,T05] (  3,  3   )  simd16  ->  d18         HFA(simd16)
;# V08 OutArgs      [V08    ] (  1,  1   )  struct ( 0) [sp+00H]   do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V09 tmp1         [V09    ] (  0,  0   )  simd16  ->  zero-ref    HFA(simd16)  "Inline return value spill temp"
;* V10 tmp2         [V10    ] (  0,  0   )  simd16  ->  zero-ref    HFA(simd16)  "Inline return value spill temp"
;* V11 tmp3         [V11    ] (  0,  0   )  simd16  ->  zero-ref    HFA(simd16)  "Inline return value spill temp"
;  V12 cse0         [V12,T06] (  3,  3   )  simd16  ->  d16         "CSE - aggressive"
;
; Lcl frame size = 0

G_M45512_IG01:
            stp     fp, lr, [sp, #-0x10]!
            mov     fp, sp
						;; size=8 bbWeight=1 PerfScore 1.50
G_M45512_IG02:
            ldr     q16, [@RWD00]
            and     v17.16b, v0.16b, v16.16b
            ushr    v0.4s, v0.4s, #4
            and     v0.16b, v0.16b, v16.16b
            tbl     v16.16b, {v1.16b}, v17.16b
            tbl     v17.16b, {v2.16b}, v17.16b
            ldr     q18, [@RWD16]
            tbl     v18.16b, {v18.16b}, v0.16b
            ldr     q19, [@RWD32]
            cmgt    v0.16b, v0.16b, v19.16b
            bsl     v0.16b, v17.16b, v16.16b
            and     v0.16b, v0.16b, v18.16b
            cmeq    v0.16b, v0.16b, v18.16b
						;; size=52 bbWeight=1 PerfScore 12.50
G_M45512_IG03:
            ldp     fp, lr, [sp], #0x10
            ret     lr
						;; size=8 bbWeight=1 PerfScore 2.00
RWD00  	dq	0F0F0F0F0F0F0F0Fh, 0F0F0F0F0F0F0F0Fh
RWD16  	dq	8040201008040201h, 8040201008040201h
RWD32  	dq	0707070707070707h, 0707070707070707


; Total bytes of code 68, prolog size 8, PerfScore 22.80, instruction count 17, allocated bytes for code 68 (MethodHash=c1384e37) for method System.Text.Tests.Demo:IndexOfAnyLookup(System.Runtime.Intrinsics.Vector128`1[ubyte],System.Runtime.Intrinsics.Vector128`1[ubyte],System.Runtime.Intrinsics.Vector128`1[ubyte]):System.Runtime.Intrinsics.Vector128`1[ubyte]

With Patch
; Assembly listing for method System.Text.Tests.Demo:IndexOfAnyLookupCombined(System.Runtime.Intrinsics.Vector128`1[ubyte],System.Runtime.Intrinsics.Vector128`1[ubyte],System.Runtime.Intrinsics.Vector128`1[ubyte]):System.Runtime.Intrinsics.Vector128`1[ubyte]
; Emitting BLENDED_CODE for generic ARM64 - Unix
; optimized code
; fp based frame
; partially interruptible
; No PGO data
; 1 inlinees with PGO data; 0 single block inlinees; 1 inlinees without PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T01] (  4,  4   )  simd16  ->   d0         HFA(simd16)  single-def
;  V01 arg1         [V01,T02] (  3,  3   )  simd16  ->   d1         HFA(simd16)  single-def
;  V02 arg2         [V02,T03] (  3,  3   )  simd16  ->   d2         HFA(simd16)  single-def
;  V03 loc0         [V03,T07] (  2,  2   )  simd16  ->  d17         HFA(simd16)
;  V04 loc1         [V04,T04] (  3,  3   )  simd16  ->   d0         HFA(simd16)
;  V05 loc2         [V05,T08] (  2,  2   )  simd16  ->  d16         HFA(simd16)
;  V06 loc3         [V06,T05] (  3,  3   )  simd16  ->   d0         HFA(simd16)
;  V07 loc4         [V07,T09] (  2,  2   )  simd16  ->  d16         HFA(simd16)
;* V08 loc5         [V08    ] (  0,  0   )  simd16  ->  zero-ref    HFA(simd16)
;* V09 loc6         [V09    ] (  0,  0   )  simd16  ->  zero-ref    HFA(simd16)
;# V10 OutArgs      [V10    ] (  1,  1   )  struct ( 0) [sp+00H]   do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;  V11 tmp1         [V11,T00] (  5, 10   )  struct (32) [fp+10H]   HFA(simd16)  do-not-enreg[SF] ld-addr-op "NewObj constructor temp"
;* V12 tmp2         [V12    ] (  0,  0   )  simd16  ->  zero-ref    HFA(simd16)  "Inline return value spill temp"
;  V13 cse0         [V13,T06] (  3,  3   )  simd16  ->  d16         "CSE - aggressive"
;
; Lcl frame size = 32

G_M8077_IG01:
            stp     fp, lr, [sp, #-0x30]!
            mov     fp, sp
						;; size=8 bbWeight=1 PerfScore 1.50
G_M8077_IG02:
            ldr     q16, [@RWD00]
            and     v17.16b, v0.16b, v16.16b
            ushr    v0.4s, v0.4s, #4
            and     v0.16b, v0.16b, v16.16b
            ldr     q16, [@RWD16]
            and     v16.16b, v0.16b, v16.16b
            orr     v16.16b, v16.16b, v17.16b
            stp     xzr, xzr, [fp, #0x10]
            stp     xzr, xzr, [fp, #0x20]
            stp     q1, q2, [fp, #0x10]	// [V11 tmp1], [V11 tmp1+0x10]
            ldp     q17, q18, [fp, #0x10]	// [V11 tmp1], [V11 tmp1+0x10]
            tbl     v16.16b, {v17.16b, v18.16b}, v16.16b
            ldr     q17, [@RWD32]
            tbl     v0.16b, {v17.16b}, v0.16b
            and     v16.16b, v16.16b, v0.16b
            cmeq    v0.16b, v16.16b, v0.16b
						;; size=64 bbWeight=1 PerfScore 17.00
G_M8077_IG03:
            ldp     fp, lr, [sp], #0x30
            ret     lr
						;; size=8 bbWeight=1 PerfScore 2.00
RWD00  	dq	0F0F0F0F0F0F0F0Fh, 0F0F0F0F0F0F0F0Fh
RWD16  	dq	0808080808080808h, 0808080808080808h
RWD32  	dq	8040201008040201h, 8040201008040201h


; Total bytes of code 80, prolog size 8, PerfScore 28.50, instruction count 20, allocated bytes for code 80 (MethodHash=2430e072) for method System.Text.Tests.Demo:IndexOfAnyLookupCombined(System.Runtime.Intrinsics.Vector128`1[ubyte],System.Runtime.Intrinsics.Vector128`1[ubyte],System.Runtime.Intrinsics.Vector128`1[ubyte]):System.Runtime.Intrinsics.Vector128`1[ubyte]
; ============================================================

The reason for suboptimal sequence is the following operation using the Vector128 API.

Vector128<byte> index = highNibbles & Vector128.Create(0x08080808).AsByte() | lowNibbles

Intermediate values while computing the AND/OR operations are lowered to load/store to stack. Thus, we got

...
stp     xzr, xzr, [fp, #0x10]
stp     xzr, xzr, [fp, #0x20]
stp     q1, q2, [fp, #0x10]	// [V11 tmp1], [V11 tmp1+0x10]
ldp     q17, q18, [fp, #0x10]	// [V11 tmp1], [V11 tmp1+0x10]
tbl     v16.16b, {v17.16b, v18.16b}, v16.16b
...

instead of

...
tbl     v16.16b, {v1.16b, v2.16b}, v16.16b
...

I'm marking this PR as a draft because it may cause regression as of now. If we can avoid the unnecessary operations then we can take this further. Any thoughts?

Test details to reproduce the issue.

Test C# example
private static Vector128<byte> IndexOfAnyLookup(Vector128<byte> source, Vector128<byte> bitmapLookup0, Vector128<byte> bitmapLookup1)
{
    Vector128<byte> lowNibbles = source & Vector128.Create((byte)0xF);
    Vector128<byte> highNibbles = Vector128.ShiftRightLogical(source.AsInt32(), 4).AsByte() & Vector128.Create((byte)0xF);

    Vector128<byte> row0 = ShuffleUnsafe(bitmapLookup0, lowNibbles);
    Vector128<byte> row1 = ShuffleUnsafe(bitmapLookup1, lowNibbles);
    Vector128<byte> mask = Vector128.GreaterThan(highNibbles.AsSByte(), Vector128.Create((sbyte)0x7)).AsByte();
    Vector128<byte> bitsets = Vector128.ConditionalSelect(mask, row1, row0);

    Vector128<byte> bitmask = ShuffleUnsafe(Vector128.Create(0x8040201008040201).AsByte(), highNibbles);
    Vector128<byte> result = Vector128.Equals(bitsets & bitmask, bitmask);

    return result;
}

private static Vector128<byte> IndexOfAnyLookupCombined(Vector128<byte> source, Vector128<byte> bitmapLookup0, Vector128<byte> bitmapLookup1)
{
    Vector128<byte> lowNibbles = source & Vector128.Create((byte)0xF);
    Vector128<byte> highNibbles = Vector128.ShiftRightLogical(source.AsInt32(), 4).AsByte() & Vector128.Create((byte)0xF);

    Vector128<byte> index = highNibbles & Vector128.Create(0x08080808).AsByte() | lowNibbles;
    Vector128<byte> bitsets = AdvSimd.Arm64.VectorTableLookup((bitmapLookup0, bitmapLookup1), index);

    Vector128<byte> bitmask = ShuffleUnsafe(Vector128.Create(0x8040201008040201).AsByte(), highNibbles);
    Vector128<byte> result = Vector128.Equals(bitsets & bitmask, bitmask);

    return result;
}

Full JIT Dump of the method

@EgorBo
Copy link
Member

EgorBo commented Jun 29, 2023

CI failures seem related?

@SwapnilGaikwad
Copy link
Contributor Author

CI failures seem related?

Yup, I'll check. 👍

@kunalspathak
Copy link
Member

Do you have preliminary performance numbers for this?

@jakobbotsch
Copy link
Member

@SwapnilGaikwad Can you check if either #88090 or #88130 helps to get rid of these stack stores?

@SwapnilGaikwad
Copy link
Contributor Author

Do you have preliminary performance numbers for this?

Not really. I didn't as it emitted a suboptimal sequence. To gauge the benefits we may extract the assembly and benchmark it through a C wrapper.

Can you check if either #88090 or #88130 helps to get rid of these stack stores?

Sure, I'll give it go.

@kunalspathak
Copy link
Member

kunalspathak commented Jun 29, 2023

Intermediate values while computing the AND/OR operations are lowered to load/store to stack. Thus, we got
...
stp xzr, xzr, [fp, #0x10]
stp xzr, xzr, [fp, #0x20]
stp q1, q2, [fp, #0x10] // [V11 tmp1], [V11 tmp1+0x10]
ldp q17, q18, [fp, #0x10] // [V11 tmp1], [V11 tmp1+0x10]
tbl v16.16b, {v17.16b, v18.16b}, v16.16b
...
instead of

...
tbl v16.16b, {v1.16b, v2.16b}, v16.16b
...

Below pattern is very common to see because of the consecutive register requirements. We don't have (at least currently) a way to come up with allocation such that we will satisfy the need of consecutive register without having to do this kind of reloading. ld2~ld4, might come to the rescue to some extent.

stp q1, q2, [fp, #0x10] // [V11 tmp1], [V11 tmp1+0x10]
ldp q17, q18, [fp, #0x10] // [V11 tmp1], [V11 tmp1+0x10]

Not really. I didn't as it emitted a suboptimal sequence. To gauge the benefits we may extract the assembly and benchmark it through a C wrapper.

I think we should still get the measurements for code it generates today to see how much slowness we get, and of course on top of #88090 or #88130 that Jakob mentioned.

@SwapnilGaikwad
Copy link
Contributor Author

After rebasing the additiona load/stores from/to stack are gone.
The updated assembly looks as expected.

Without the patch
; Assembly listing for method System.Text.Tests.Demo:IndexOfAnyLookupCombined(System.Runtime.Intrinsics.Vector128`1[ubyte],System.Runtime.Intrinsics.Vector128`1[ubyte],System.Runtime.Intrinsics.Vector128`1[ubyte]):System.Runtime.Intrinsics.Vector128`1[ubyte] (FullOpts)
; Emitting BLENDED_CODE for generic ARM64 - Unix
; FullOpts code
; optimized code
; fp based frame
; partially interruptible
; No PGO data
; 1 inlinees with PGO data; 0 single block inlinees; 1 inlinees without PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  4,  4   )  simd16  ->   d0         HFA(simd16)  single-def
;  V01 arg1         [V01,T01] (  3,  3   )  simd16  ->   d1         HFA(simd16)  single-def
;  V02 arg2         [V02,T02] (  3,  3   )  simd16  ->   d2         HFA(simd16)  single-def
;  V03 loc0         [V03,T06] (  2,  2   )  simd16  ->  d17         HFA(simd16)
;  V04 loc1         [V04,T03] (  3,  3   )  simd16  ->   d0         HFA(simd16)
;  V05 loc2         [V05,T07] (  2,  2   )  simd16  ->  d16         HFA(simd16)
;  V06 loc3         [V06,T04] (  3,  3   )  simd16  ->   d0         HFA(simd16)
;  V07 loc4         [V07,T08] (  2,  2   )  simd16  ->  d16         HFA(simd16)
;* V08 loc5         [V08    ] (  0,  0   )  simd16  ->  zero-ref    HFA(simd16)
;* V09 loc6         [V09    ] (  0,  0   )  simd16  ->  zero-ref    HFA(simd16)
;# V10 OutArgs      [V10    ] (  1,  1   )  struct ( 0) [sp+00H]   do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V11 tmp1         [V11    ] (  0,  0   )  struct (32) zero-ref    HFA(simd16)  do-not-enreg[SF] ld-addr-op "NewObj constructor temp"
;* V12 tmp2         [V12    ] (  0,  0   )  simd16  ->  zero-ref    HFA(simd16)  "Inline return value spill temp"
;* V13 tmp3         [V13    ] (  0,  0   )  simd16  ->  zero-ref    "V11.[000..016)"
;* V14 tmp4         [V14    ] (  0,  0   )  simd16  ->  zero-ref    "V11.[016..032)"
;  V15 cse0         [V15,T05] (  3,  3   )  simd16  ->  d16         "CSE - aggressive"
;
; Lcl frame size = 0

G_M8077_IG01:
            stp     fp, lr, [sp, #-0x10]!
            mov     fp, sp
						;; size=8 bbWeight=1 PerfScore 1.50
G_M8077_IG02:
            ldr     q16, [@RWD00]
            and     v17.16b, v0.16b, v16.16b
            ushr    v0.4s, v0.4s, #4
            and     v0.16b, v0.16b, v16.16b
            ldr     q16, [@RWD16]
            and     v16.4s, v0.4s, v16.4s
            shl     v16.4s, v16.4s, #1
            orr     v16.16b, v16.16b, v17.16b
            tbl     v16.16b, {v1.16b, v2.16b}, v16.16b
            ldr     q17, [@RWD32]
            tbl     v0.16b, {v17.16b}, v0.16b
            and     v16.16b, v16.16b, v0.16b
            cmeq    v0.16b, v16.16b, v0.16b
						;; size=52 bbWeight=1 PerfScore 13.00
G_M8077_IG03:
            ldp     fp, lr, [sp], #0x10
            ret     lr
						;; size=8 bbWeight=1 PerfScore 2.00
RWD00  	dq	0F0F0F0F0F0F0F0Fh, 0F0F0F0F0F0F0F0Fh
RWD16  	dq	0808080808080808h, 0808080808080808h
RWD32  	dq	8040201008040201h, 8040201008040201h


; Total bytes of code 68, prolog size 8, PerfScore 23.30, instruction count 17, allocated bytes for code 68 (MethodHash=2430e072) for method System.Text.Tests.Demo:IndexOfAnyLookupCombined(System.Runtime.Intrinsics.Vector128`1[ubyte],System.Runtime.Intrinsics.Vector128`1[ubyte],System.Runtime.Intrinsics.Vector128`1[ubyte]):System.Runtime.Intrinsics.Vector128`1[ubyte] (FullOpts)

Unfortunately, I couldn't get the benchmarking numbers. I'm using the dotnet performance repo for benchmarking.
How can I force a benchmark to use the new version of system libraries? Seems the --corerun flag does not pick the libraries from the specified dirs.

I'm using the following benchmark with the setup steps from here.

...
[Benchmark] public int LastIndexOfAnyBench() => ByteBuffer.LastIndexOfAny(SearchValues.Create(needleValues));
...

Do you have any suggestions to use the built libraries for performance runs? Using corerun from Core_Root was sufficient for individual test programs to use the modified libraries.

@kunalspathak
Copy link
Member

How can I force a benchmark to use the new version of system libraries?

IIRC, didn't it work in the past when we updated just the libraries portion and benchmarked them. Ideally the corerun should work, but probably @adamsitnik might know a better way.

@jakobbotsch
Copy link
Member

After rebasing the additiona load/stores from/to stack are gone.

Good to hear and see that. Interesting coincidence, #88090, #88130 and #88238 are all PRs that solves the exact problem causing the missing promotion here in 3 different ways.

@ghost
Copy link

ghost commented Jul 3, 2023

Tagging subscribers to this area: @dotnet/area-system-buffers
See info in area-owners.md if you want to be subscribed.

Issue Details

Contributes towards the issue #84328

Author: SwapnilGaikwad
Assignees: -
Labels:

area-System.Buffers, community-contribution, needs-area-label

Milestone: -

@adamsitnik adamsitnik added arch-arm64 tenet-performance Performance related issue and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Jul 3, 2023
@adamsitnik
Copy link
Member

Unfortunately, I couldn't get the benchmarking numbers.

What error exactly did you get?

In general the project with micro benchmarks targets multiple runtimes, including older ones that don't define this type:

https://github.com/dotnet/performance/blob/af8bc7a227215626a51f427a839c55f90eec8876/src/benchmarks/micro/MicroBenchmarks.csproj#L7-L8

If you want to add benchmarks for type that has been introduced in .NET 8 you need to add it here:

https://github.com/dotnet/performance/blob/af8bc7a227215626a51f427a839c55f90eec8876/src/benchmarks/micro/MicroBenchmarks.csproj#L199-L201

Then the project should compile. By using --corerun you should be able to force it to use your local copy of dotnet/runtime.

@SwapnilGaikwad
Copy link
Contributor Author

Thanks @adamsitnik , now I managed to compile the benchmark and get performance number.

Do you have preliminary performance numbers for this?

Performance numbers doesn't seem to show any improvements. It seems that the additional instructions to prepare updated indices is taking away the gains from combined table lookups. Perfscore also indicate the same (Base: 22.80, patch: 23.30).

|         Method |        Job |                                                                    Toolchain | Length |     Mean |     Error |    StdDev |   Median |      Min |      Max | Ratio | MannWhitney(2%) | Allocated | Alloc Ratio |
|--------------- |----------- |----------------------------------------------------------------------------- |------- |---------:|----------:|----------:|---------:|---------:|---------:|------:|---------------- |----------:|------------:|
| LastIndexOfAny | Job-UWUGHD |    /main/artifacts/tests/coreclr/linux.arm64.Checked/Tests/Core_Root/corerun |   1024 | 5.534 us | 0.0157 us | 0.0147 us | 5.532 us | 5.512 us | 5.555 us |  1.00 |            Base |     112 B |        1.00 |
| LastIndexOfAny | Job-DOIGSP | /runtime/artifacts/tests/coreclr/linux.arm64.Checked/Tests/Core_Root/corerun |   1024 | 5.510 us | 0.0184 us | 0.0172 us | 5.506 us | 5.488 us | 5.540 us |  1.00 |            Same |     112 B |        1.00 |

What do you recommend for this PR?

@MihaZupan
Copy link
Member

Can you show the benchmark you are using?

@SwapnilGaikwad
Copy link
Contributor Author

Can you show the benchmark you are using?

Here. I'll be more than happy to try if you have suggestions to stress the IndexOfAnyLookup(Vector128<byte>, Vector128<byte>, Vector128<byte>) for byte arguments. I feel it primarily targets the IndexOfAnyLookup(Vector128<short>, Vector128<short>, Vector128<byte>)

@MihaZupan
Copy link
Member

MihaZupan commented Jul 3, 2023

Can you please test it with the following?

Notably, the SearchValues is being cached in a static readonly instead of being recreated as part of the benchmarked method.

public class SearchValuesAnyBenchmark
{
    private static readonly SearchValues<byte> s_searchValues =
        SearchValues.Create("aeiou"u8.ToArray().Concat(new[] { (byte)0xFF }).ToArray());

    private byte[] _byteBuffer = null!;

    [Params(8, 16, 32, 1024)]
    public int Length;

    [GlobalSetup]
    public void Setup()
    {
        _byteBuffer = new byte[Length];
        _byteBuffer.AsSpan().Fill((byte)'\n');
        _byteBuffer[Length / 2] = (byte)'a';
    }

    [Benchmark] public int IndexOfAny() => _byteBuffer.AsSpan().IndexOfAny(s_searchValues);
}

@SwapnilGaikwad
Copy link
Contributor Author

Can you please test it with the following?

Thanks @MihaZupan for the benchmark. Please find the following results on executive the benchmark.

|     Method |        Job |                                                                    Toolchain | Length |      Mean |    Error |   StdDev |    Median |       Min |       Max | Ratio | MannWhitney(2%) | Allocated | Alloc Ratio |
|----------- |----------- |----------------------------------------------------------------------------- |------- |----------:|---------:|---------:|----------:|----------:|----------:|------:|---------------- |----------:|------------:|
| IndexOfAny | Job-QYJCDR |    /main/artifacts/tests/coreclr/linux.arm64.Checked/Tests/Core_Root/corerun |      8 |  15.93 ns | 0.058 ns | 0.054 ns |  15.94 ns |  15.85 ns |  16.01 ns |  1.00 |            Base |         - |          NA |
| IndexOfAny | Job-CNHREH | /runtime/artifacts/tests/coreclr/linux.arm64.Checked/Tests/Core_Root/corerun |      8 |  16.63 ns | 0.039 ns | 0.036 ns |  16.63 ns |  16.56 ns |  16.68 ns |  1.04 |          Slower |         - |          NA |
|            |            |                                                                              |        |           |          |          |           |           |           |       |                 |           |             |
| IndexOfAny | Job-VNBLMH |    /main/artifacts/tests/coreclr/linux.arm64.Checked/Tests/Core_Root/corerun |     16 |  22.41 ns | 0.097 ns | 0.091 ns |  22.43 ns |  22.26 ns |  22.55 ns |  1.00 |            Base |         - |          NA |
| IndexOfAny | Job-UHWGQK | /runtime/artifacts/tests/coreclr/linux.arm64.Checked/Tests/Core_Root/corerun |     16 |  22.87 ns | 0.180 ns | 0.168 ns |  22.83 ns |  22.69 ns |  23.13 ns |  1.02 |            Same |         - |          NA |
|            |            |                                                                              |        |           |          |          |           |           |           |       |                 |           |             |
| IndexOfAny | Job-VNBLMH |    /main/artifacts/tests/coreclr/linux.arm64.Checked/Tests/Core_Root/corerun |     32 |  32.08 ns | 0.387 ns | 0.362 ns |  32.04 ns |  31.45 ns |  32.53 ns |  1.00 |            Base |         - |          NA |
| IndexOfAny | Job-UHWGQK | /runtime/artifacts/tests/coreclr/linux.arm64.Checked/Tests/Core_Root/corerun |     32 |  31.63 ns | 0.170 ns | 0.159 ns |  31.62 ns |  31.39 ns |  31.90 ns |  0.99 |            Same |         - |          NA |
|            |            |                                                                              |        |           |          |          |           |           |           |       |                 |           |             |
| IndexOfAny | Job-VNBLMH |    /main/artifacts/tests/coreclr/linux.arm64.Checked/Tests/Core_Root/corerun |   1024 | 179.85 ns | 0.381 ns | 0.356 ns | 179.91 ns | 179.20 ns | 180.44 ns |  1.00 |            Base |         - |          NA |
| IndexOfAny | Job-UHWGQK | /runtime/artifacts/tests/coreclr/linux.arm64.Checked/Tests/Core_Root/corerun |   1024 | 180.06 ns | 0.310 ns | 0.259 ns | 180.03 ns | 179.46 ns | 180.43 ns |  1.00 |            Same |         - |          NA |

@kunalspathak
Copy link
Member

@SwapnilGaikwad - for benchmark measurements, we typically prefer to run the Release version of binaries.

@SwapnilGaikwad
Copy link
Contributor Author

@SwapnilGaikwad - for benchmark measurements, we typically prefer to run the Release version of binaries.

My bad. Please find the following numbers with both runtime and libraries compiled in Release mode (previously, libraries were in Release mode and runtime was in Checked mode).

|     Method |        Job |                                                                    Toolchain | Length |     Mean |    Error |   StdDev |   Median |      Min |      Max | Ratio | MannWhitney(2%) | Allocated | Alloc Ratio |
|----------- |----------- |----------------------------------------------------------------------------- |------- |---------:|---------:|---------:|---------:|---------:|---------:|------:|---------------- |----------:|------------:|
| IndexOfAny | Job-INLKUP |    /main/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun |      8 | 18.64 ns | 0.001 ns | 0.001 ns | 18.64 ns | 18.64 ns | 18.64 ns |  1.00 |            Base |         - |          NA |
| IndexOfAny | Job-EINVMY | /runtime/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun |      8 | 19.17 ns | 0.005 ns | 0.005 ns | 19.17 ns | 19.17 ns | 19.18 ns |  1.03 |          Slower |         - |          NA |
|            |            |                                                                              |        |          |          |          |          |          |          |       |                 |           |             |
| IndexOfAny | Job-INLKUP |    /main/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun |     16 | 18.43 ns | 0.002 ns | 0.002 ns | 18.43 ns | 18.43 ns | 18.44 ns |  1.00 |            Base |         - |          NA |
| IndexOfAny | Job-EINVMY | /runtime/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun |     16 | 19.15 ns | 0.002 ns | 0.002 ns | 19.15 ns | 19.15 ns | 19.15 ns |  1.04 |          Slower |         - |          NA |
|            |            |                                                                              |        |          |          |          |          |          |          |       |                 |           |             |
| IndexOfAny | Job-RLZORX |    /main/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun |     32 | 18.95 ns | 0.003 ns | 0.003 ns | 18.95 ns | 18.94 ns | 18.95 ns |  1.00 |            Base |         - |          NA |
| IndexOfAny | Job-VQLLAV | /runtime/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun |     32 | 19.43 ns | 0.006 ns | 0.006 ns | 19.43 ns | 19.43 ns | 19.44 ns |  1.03 |          Slower |         - |          NA |
|            |            |                                                                              |        |          |          |          |          |          |          |       |                 |           |             |
| IndexOfAny | Job-RLZORX |    /main/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun |   1024 | 80.42 ns | 0.108 ns | 0.096 ns | 80.41 ns | 80.28 ns | 80.61 ns |  1.00 |            Base |         - |          NA |
| IndexOfAny | Job-VQLLAV | /runtime/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun |   1024 | 83.00 ns | 0.100 ns | 0.094 ns | 82.98 ns | 82.89 ns | 83.19 ns |  1.03 |          Slower |         - |          NA |

@SwapnilGaikwad SwapnilGaikwad marked this pull request as ready for review July 4, 2023 09:22
@kunalspathak
Copy link
Member

Thanks @SwapnilGaikwad for gathering the numbers. Do you mind pasting the before vs. after code once again for the IndexOfAny benchmark that @MihaZupan suggested?

@danmoseley
Copy link
Member

Whatever happens with this PR perhaps this benchmark ought to be added to dotnet/performance

@SwapnilGaikwad
Copy link
Contributor Author

Thanks @SwapnilGaikwad for gathering the numbers. Do you mind pasting the before vs. after code once again for the IndexOfAny benchmark that @MihaZupan suggested?

Hi @kunalspathak,

The benchmark didn't inline the System.Buffers.IndexOfAnyAsciiSearcher:IndexOfAnyVectorized method. Thus the assembly shown below didn't change with/without patch.

IndexOfAny Benchmark
; Assembly listing for method StringSearcher.IndexOfAnyBenchmarks:IndexOfAny():int:this (FullOpts)
; Emitting BLENDED_CODE for generic ARM64 - Unix
; FullOpts code
; optimized code
; fp based frame
; partially interruptible
; No PGO data
; 10 inlinees with PGO data; 2 single block inlinees; 1 inlinees without PGO data
; Final local variable assignments
;
;  V00 this         [V00,T03] (  3,  3   )     ref  ->   x0         this class-hnd single-def
;# V01 OutArgs      [V01    ] (  1,  1   )  struct ( 0) [sp+00H]   do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V02 tmp1         [V02    ] (  0,  0   )  struct (16) zero-ref    "spilled call-like call argument"
;  V03 tmp2         [V03,T02] (  4,  6.00)     ref  ->   x0         class-hnd single-def "Inlining Arg"
;* V04 tmp3         [V04    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "NewObj constructor temp"
;  V05 tmp4         [V05,T00] (  6,  7.14)     ref  ->  x21         class-hnd single-def "Inlining Arg"
;* V06 tmp5         [V06    ] (  0,  0   )  struct (16) zero-ref    multireg-arg "spilled call-like call argument"
;* V07 tmp6         [V07    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inlining Arg"
;* V08 tmp7         [V08    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "NewObj constructor temp"
;  V09 tmp8         [V09,T01] (  3,  6   )     int  ->  x20         "Inlining Arg"
;  V10 tmp9         [V10,T04] (  2,  4   )   byref  ->  x19         single-def "Inlining Arg"
;* V11 tmp10        [V11    ] (  0,  0   )    bool  ->  zero-ref    "Inlining Arg"
;* V12 tmp11        [V12    ] (  0,  0   )    bool  ->  zero-ref    "Inlining Arg"
;  V13 tmp12        [V13,T09] (  4,  1.50)     int  ->   x0         "guarded devirt return temp"
;  V14 tmp13        [V14,T15] (  3,  0.86)     ref  ->   x0         class-hnd exact single-def "guarded devirt this exact temp"
;* V15 tmp14        [V15    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inlining Arg"
;* V16 tmp15        [V16    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inlining Arg"
;  V17 tmp16        [V17,T08] (  3,  1.72)   byref  ->   x1         single-def "Inlining Arg"
;  V18 tmp17        [V18,T05] (  4,  2.58)     int  ->  [fp+14H]   spill-single-def "Inlining Arg"
;* V19 tmp18        [V19    ] (  0,  0   )    bool  ->  zero-ref    "Inline return value spill temp"
;* V20 tmp19        [V20    ] (  0,  0   )   byref  ->  zero-ref    single-def "field V02._reference (fldOffset=0x0)" P-INDEP
;* V21 tmp20        [V21    ] (  0,  0   )     int  ->  zero-ref    "field V02._length (fldOffset=0x8)" P-INDEP
;  V22 tmp21        [V22,T10] (  3,  1.50)   byref  ->  x19         "field V04._reference (fldOffset=0x0)" P-INDEP
;  V23 tmp22        [V23,T13] (  3,  1.50)     int  ->  x20         "field V04._length (fldOffset=0x8)" P-INDEP
;  V24 tmp23        [V24,T11] (  3,  1.50)   byref  ->   x1         single-def "field V06._reference (fldOffset=0x0)" P-INDEP
;  V25 tmp24        [V25,T14] (  3,  1.50)     int  ->   x2         "field V06._length (fldOffset=0x8)" P-INDEP
;* V26 tmp25        [V26    ] (  0,  0   )   byref  ->  zero-ref    single-def "field V07._reference (fldOffset=0x0)" P-INDEP
;* V27 tmp26        [V27    ] (  0,  0   )     int  ->  zero-ref    "field V07._length (fldOffset=0x8)" P-INDEP
;  V28 tmp27        [V28,T06] (  2,  2   )   byref  ->   x1         single-def "field V08._reference (fldOffset=0x0)" P-INDEP
;  V29 tmp28        [V29,T07] (  2,  2   )     int  ->   x2         "field V08._length (fldOffset=0x8)" P-INDEP
;* V30 tmp29        [V30    ] (  0,  0   )   byref  ->  zero-ref    single-def "field V15._reference (fldOffset=0x0)" P-INDEP
;* V31 tmp30        [V31    ] (  0,  0   )     int  ->  zero-ref    "field V15._length (fldOffset=0x8)" P-INDEP
;* V32 tmp31        [V32    ] (  0,  0   )   byref  ->  zero-ref    single-def "field V16._reference (fldOffset=0x0)" P-INDEP
;* V33 tmp32        [V33    ] (  0,  0   )     int  ->  zero-ref    "field V16._length (fldOffset=0x8)" P-INDEP
;  V34 cse0         [V34,T12] (  3,  1.50)     ref  ->   x1         "CSE - moderate"
;
; Lcl frame size = 8
G_M55538_IG01:  ;; offset=0000H
            stp     fp, lr, [sp, #-0x30]!
            stp     x19, x20, [sp, #0x18]
            str     x21, [sp, #0x28]
            mov     fp, sp
                                                ;; size=16 bbWeight=1 PerfScore 3.50
G_M55538_IG02:  ;; offset=0010H
            ldr     x0, [x0, #0x08]
            cbz     x0, G_M55538_IG12
                                                ;; size=8 bbWeight=1 PerfScore 4.00
G_M55538_IG03:  ;; offset=0018H
            add     x19, x0, #16
            ldr     w20, [x0, #0x08]
                                                ;; size=8 bbWeight=0.50 PerfScore 1.75
G_M55538_IG04:  ;; offset=0020H
            movz    x0, #0x5A0A
            movk    x0, #0x61EB LSL #16
            movk    x0, #0xFFFF LSL #32
            ldr     w0, [x0]
            tbz     w0, #0, G_M55538_IG13
                                                ;; size=20 bbWeight=1 PerfScore 5.50
G_M55538_IG05:  ;; offset=0034H
            movz    x0, #0x69C0      // data for StringSearcher.IndexOfAnyBenchmarks:s_searchValues
            movk    x0, #0xE900 LSL #16
            movk    x0, #0xFF81 LSL #32
            ldr     x21, [x0]
            tbz     w20, #31, G_M55538_IG07
                                                ;; size=20 bbWeight=1 PerfScore 5.50

G_M55538_IG06:  ;; offset=0048H
            movz    x1, #0xC008
            movk    x1, #0x560B LSL #16
            movk    x1, #0xFFFF LSL #32
            mov     x0, x1
            movz    x2, #0x7D80      // code for System.Diagnostics.Debug:Fail(System.String,System.String)
            movk    x2, #0x6170 LSL #16
            movk    x2, #0xFFFF LSL #32
            ldr     x2, [x2]
            blr     x2
                                                ;; size=36 bbWeight=0.50 PerfScore 3.75
G_M55538_IG07:  ;; offset=006CH
            mov     x1, x19
            sxtw    w2, w20
            cbz     x21, G_M55538_IG14
            ldr     x0, [x21]
            movz    x3, #0x8658
            movk    x3, #0x626C LSL #16
            movk    x3, #0xFFFF LSL #32
            cmp     x0, x3
            bne     G_M55538_IG10
                                                ;; size=36 bbWeight=1 PerfScore 8.00
G_M55538_IG08:  ;; offset=0090H
            mov     x0, x21
            str     w2, [fp, #0x14]     // [V18 tmp17]
            cmp     w2, #8
            blt     G_M55538_IG11
            add     x2, x0, #8
            mov     x0, x1
            ldr     w1, [fp, #0x14]     // [V18 tmp17]
            movz    x3, #0xD8F0      // code for System.Buffers.IndexOfAnyAsciiSearcher:IndexOfAnyVectorized[System.Buffers.IndexOfAnyAsciiSearcher+DontNegate](byref,int,byref):int
            movk    x3, #0x6269 LSL #16
            movk    x3, #0xFFFF LSL #32
            ldr     x3, [x3]
            blr     x3
                                                ;; size=48 bbWeight=0.43 PerfScore 4.94
G_M55538_IG09:  ;; offset=00C0H
            ldr     x21, [sp, #0x28]
            ldp     x19, x20, [sp, #0x18]
            ldp     fp, lr, [sp], #0x30
            ret     lr
                                                ;; size=16 bbWeight=1 PerfScore 5.00
G_M55538_IG10:  ;; offset=00D0H
            mov     x0, x21
            ldr     x3, [x21]
            ldr     x3, [x3, #0x48]
            ldr     x3, [x3, #0x30]
            blr     x3
            b       G_M55538_IG09
                                                ;; size=24 bbWeight=0.07 PerfScore 0.81
G_M55538_IG11:  ;; offset=00E8H
            movz    x3, #0xD8D8      // code for System.Buffers.AsciiByteSearchValues:IndexOfAnyScalar[System.Buffers.IndexOfAnyAsciiSearcher+DontNegate](byref,int):int:this
            movk    x3, #0x6269 LSL #16
            movk    x3, #0xFFFF LSL #32
            ldr     x3, [x3]
            blr     x3
            b       G_M55538_IG09
                                                ;; size=24 bbWeight=0.00 PerfScore 0.00
G_M55538_IG12:  ;; offset=0100H
            mov     x19, xzr
            mov     w20, wzr
            b       G_M55538_IG04
                                                ;; size=12 bbWeight=0.00 PerfScore 0.00
G_M55538_IG13:  ;; offset=010CH
            movz    x0, #0x5880
            movk    x0, #0x61EB LSL #16
            movk    x0, #0xFFFF LSL #32
            mov     w1, #346
            bl      CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE
            b       G_M55538_IG05
                                                ;; size=24 bbWeight=0 PerfScore 0.00

G_M55538_IG14:  ;; offset=0124H
            mov     w0, #6
            movz    x1, #0xD188      // code for System.ThrowHelper:ThrowArgumentNullException(int)
            movk    x1, #0x6170 LSL #16
            movk    x1, #0xFFFF LSL #32
            ldr     x1, [x1]
            blr     x1
            brk_unix #0
                                                ;; size=28 bbWeight=0 PerfScore 0.00
; Total bytes of code 320, prolog size 16, PerfScore 74.75, instruction count 80, allocated bytes for code 320 (MethodHash=0b26270d) for method StringSearcher.IndexOfAnyBenchmarks:IndexOfAny():int:this (FullOpts)
; ============================================================

The changed part is in System.Buffers.IndexOfAnyAsciiSearcher:IndexOfAnyLookup method and it is not data dependent. Setting up the 0 to 31 indices made both the assemblies of the same size although with different instructions.

Before Patch
; Assembly listing for method System.Buffers.IndexOfAnyAsciiSearcher:IndexOfAnyLookup[System.Buffers.IndexOfAnyAsciiSearcher+DontNegate](System.Runtime.Intrinsics.Vector128`1[ubyte],System.Runtime.Intrinsics.Vector128`1[ubyte],System.Runtime.Intrinsics.Vector128`1[ubyte]):System.Runtime.Intrinsics.Vector128`1[ubyte] (FullOpts)
; Emitting BLENDED_CODE for generic ARM64 - Unix
; FullOpts code
; optimized code
; fp based frame
; partially interruptible
; No PGO data
; 4 inlinees with PGO data; 0 single block inlinees; 0 inlinees without PGO data
G_M000_IG01:                ;; offset=0000H
            stp     fp, lr, [sp, #-0x10]!
            mov     fp, sp

G_M000_IG02:                ;; offset=0008H
            ldr     q16, [@RWD00]
            and     v17.16b, v0.16b, v16.16b
            ushr    v0.4s, v0.4s, #4
            and     v0.16b, v0.16b, v16.16b
            tbl     v16.16b, {v1.16b}, v17.16b
            tbl     v17.16b, {v2.16b}, v17.16b
            ldr     q18, [@RWD16]
            tbl     v18.16b, {v18.16b}, v0.16b
            ldr     q19, [@RWD32]
            cmgt    v0.16b, v0.16b, v19.16b
            bsl     v0.16b, v17.16b, v16.16b
            and     v0.16b, v0.16b, v18.16b
            cmeq    v0.16b, v0.16b, v18.16b

G_M000_IG03:                ;; offset=003CH
            ldp     fp, lr, [sp], #0x10
            ret     lr

RWD00  	dq	0F0F0F0F0F0F0F0Fh, 0F0F0F0F0F0F0F0Fh
RWD16  	dq	8040201008040201h, 8040201008040201h
RWD32  	dq	0707070707070707h, 0707070707070707h
; Total bytes of code 68
After Patch
; Assembly listing for method System.Buffers.IndexOfAnyAsciiSearcher:IndexOfAnyLookup[System.Buffers.IndexOfAnyAsciiSearcher+DontNegate](System.Runtime.Intrinsics.Vector128`1[ubyte],System.Runtime.Intrinsics.Vector128`1[ubyte],System.Runtime.Intrinsics.Vector128`1[ubyte]):System.Runtime.Intrinsics.Vector128`1[ubyte] (FullOpts)
; Emitting BLENDED_CODE for generic ARM64 - Unix
; FullOpts code
; optimized code
; fp based frame
; partially interruptible
; No PGO data
; 3 inlinees with PGO data; 0 single block inlinees; 0 inlinees without PGO data
G_M000_IG01:                ;; offset=0000H
            stp     fp, lr, [sp, #-0x10]!
            mov     fp, sp

G_M000_IG02:                ;; offset=0008H
            ldr     q16, [@RWD00]
            and     v17.16b, v0.16b, v16.16b
            ushr    v0.4s, v0.4s, #4
            and     v0.16b, v0.16b, v16.16b
            ldr     q16, [@RWD16]
            and     v16.4s, v0.4s, v16.4s
            shl     v16.4s, v16.4s, #1
            orr     v16.16b, v16.16b, v17.16b
            tbl     v16.16b, {v1.16b, v2.16b}, v16.16b
            ldr     q17, [@RWD32]
            tbl     v0.16b, {v17.16b}, v0.16b
            and     v16.16b, v16.16b, v0.16b
            cmeq    v0.16b, v16.16b, v0.16b

G_M000_IG03:                ;; offset=003CH
            ldp     fp, lr, [sp], #0x10
            ret     lr

RWD00  	dq	0F0F0F0F0F0F0F0Fh, 0F0F0F0F0F0F0F0Fh
RWD16  	dq	0808080808080808h, 0808080808080808h
RWD32  	dq	8040201008040201h, 8040201008040201h
; Total bytes of code 68

@kunalspathak
Copy link
Member

Looking at the disassembly, I don't see how ld2~ld4 would help improve the performance here. @tannergooding - do you have any other opinion about the algorithm?

@adamsitnik
Copy link
Member

The perf results don't show an improvement, rather a regression (main is the baseline job, MannWhitney column showing Slower)

Method Toolchain Length Mean Ratio MannWhitney(2%)
IndexOfAny /main/ 8 18.64 ns 1.00 Base
IndexOfAny /runtime/ 8 19.17 ns 1.03 Slower
IndexOfAny /main/ 16 18.43 ns 1.00 Base
IndexOfAny /runtime/ 16 19.15 ns 1.04 Slower
IndexOfAny /main/ 32 18.95 ns 1.00 Base
IndexOfAny /runtime/ 32 19.43 ns 1.03 Slower
IndexOfAny /main/ 1024 80.42 ns 1.00 Base
IndexOfAny /runtime/ 1024 83.00 ns 1.03 Slower

Looking at the disassembly, I don't see how ld2~ld4 would help improve the performance here.

I believe we should close the PR.

@SwapnilGaikwad thank you for your contribution!

@adamsitnik adamsitnik closed this Oct 25, 2023
@SwapnilGaikwad SwapnilGaikwad deleted the github-asciiSearcherMergeShuffles branch October 25, 2023 11:37
@ghost ghost locked as resolved and limited conversation to collaborators Nov 24, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
arch-arm64 area-System.Buffers community-contribution Indicates that the PR has been added by a community member tenet-performance Performance related issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants