[ARM64] Use SHRN in SpanHelpers.CountValueType #89632

neon-sunset · 2023-07-28T12:24:12Z

This change avoids using movemask emulation on Arm64 and uses SHRN instead.
Somewhat related to #76047

Numbers:

BenchmarkDotNet v0.13.6, macOS Sonoma 14.0 (23A5301g) [Darwin 23.0.0]
Apple M1 Pro, 1 CPU, 8 logical and 8 physical cores
.NET SDK 8.0.100-rc.1.23375.11
  [Host]     : .NET 8.0.0 (8.0.23.37503), Arm64 RyuJIT AdvSIMD
  DefaultJob : .NET 8.0.0 (8.0.23.37503), Arm64 RyuJIT AdvSIMD

Method	Length	Mean	Error	StdDev	Ratio
CountLines	100	7.525 ns	0.0599 ns	0.0531 ns	1.00
CountLinesNew	100	5.762 ns	0.0071 ns	0.0063 ns	0.77

CountLines	1000	65.140 ns	0.1234 ns	0.1030 ns	1.00
CountLinesNew	1000	43.526 ns	0.1965 ns	0.1838 ns	0.67

CountLines	67694	4,145.631 ns	7.5764 ns	7.0869 ns	1.00
CountLinesNew	67694	2,577.341 ns	4.8096 ns	4.2636 ns	0.62

Benchmark:

class Counting
{
    private byte[] ThirdPartyNotices = null!;

    [Params(100, 1000, int.MaxValue)]
    public int Length;

    [GlobalSetup]
    public void Setup()
    {
        var notices = new HttpClient()
            .GetByteArrayAsync("https://raw.githubusercontent.com/dotnet/runtime/main/THIRD-PARTY-NOTICES.TXT")
            .GetAwaiter()
            .GetResult();

        ThirdPartyNotices = Length is int.MaxValue ? notices : notices[..Length];
    }
    
    [Benchmark(Baseline = true)]
    public int CountLines()
    {
        return ThirdPartyNotices.AsSpan().Count((byte)'\n');
    }

    [Benchmark]
    public int CountLinesNew()
    {
        var span = ThirdPartyNotices.AsSpan();
        return SpanUtils.CountValueType(ref span[0], (byte)'\n', span.Length);
    }
}

I need to fix the setup for benchmarking against two proper coreroots on my machine but hopefully this is sufficient.

ghost · 2023-07-28T12:24:24Z

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

Issue Details

This change avoids using movemask emulation on Arm64 and uses SHRN instead.
Somewhat related to #76047

Numbers:

BenchmarkDotNet v0.13.6, macOS Sonoma 14.0 (23A5301g) [Darwin 23.0.0]
Apple M1 Pro, 1 CPU, 8 logical and 8 physical cores
.NET SDK 8.0.100-rc.1.23375.11
  [Host]     : .NET 8.0.0 (8.0.23.37503), Arm64 RyuJIT AdvSIMD
  DefaultJob : .NET 8.0.0 (8.0.23.37503), Arm64 RyuJIT AdvSIMD

Method	Length	Mean	Error	StdDev	Ratio	Allocated	Alloc Ratio
CountLines	100	7.715 ns	0.0146 ns	0.0137 ns	1.00	-	NA
CountLinesNew	100	5.741 ns	0.0132 ns	0.0124 ns	0.74	-	NA

CountLines	1000	65.859 ns	0.1259 ns	0.1178 ns	1.00	-	NA
CountLinesNew	1000	44.800 ns	0.0676 ns	0.0528 ns	0.68	-	NA

CountLines	67694	4,137.981 ns	8.3394 ns	7.3926 ns	1.00	-	NA
CountLinesNew	67694	3,517.751 ns	3.2983 ns	2.7542 ns	0.85	-	NA

Benchmark:

class Counting
{
    private U8String ThirdPartyNotices;

    [Params(100, 1000, int.MaxValue)]
    public int Length;

    [GlobalSetup]
    public void Setup()
    {
        var notices = new HttpClient()
            .GetU8StringAsync("https://raw.githubusercontent.com/dotnet/runtime/main/THIRD-PARTY-NOTICES.TXT")
            .GetAwaiter()
            .GetResult();
        ThirdPartyNotices = Length is int.MaxValue ? notices : notices[..Length];
    }
    
    // Both variants just end up calling old/new MemoryExtensions.Count(span, (byte)'\n')
    [Benchmark]
    public int CountLines() => ThirdPartyNotices.Lines.Count;

    [Benchmark]
    public int CountLinesNew() => ThirdPartyNotices.Lines.CountNew();
}

I need to fix the setup for benchmarking against two proper coreroots on my machine but hopefully this is sufficient.

Author:	neon-sunset
Assignees:	-
Labels:	`area-System.Memory`, `community-contribution`
Milestone:	-

neon-sunset · 2023-07-28T12:25:45Z

src/libraries/System.Private.CoreLib/src/System/SpanHelpers.T.cs

+                                1 => 2,
+                                2 => 3,
+                                4 => 4,
+                                _ => 5


I could not come up with a better way to write this.

count += (int)((uint)BitOperations.PopCount(matches) / (4 * sizeof(T)));

Thanks, it seems JIT can't completely fold this:

[MethodImpl(MethodImplOptions.AggressiveOptimization | MethodImplOptions.NoInlining)] static int CountAdvSimd<T>(ReadOnlySpan<T> span, T value) { var eqmask = Vector128.Equals(Vector128.Create(span), Vector128.Create(value)); var matches = AdvSimd.ShiftRightLogicalNarrowingLower(eqmask.AsUInt16(), 4).AsUInt64().ToScalar(); // or (int)((uint)BitOperations.PopCount(matches) / (4 * sizeof(T))); return BitOperations.PopCount(matches) >> (Unsafe.SizeOf<T>() switch { 1 => 2, 2 => 3, 4 => 4, _ => 5 }); }

Diff where T is ushort: https://www.diffchecker.com/hzMEBfhM/

I don't see any codegen differences locally

The version by @MihaZupan is faster 🤯
diff:

G_M000_IG04: ;; offset=0028H ldr q17, [x0] cmeq v17.16b, v17.16b, v16.16b shrn v17.8b, v17.8h, #4 umov x5, v17.d[0] mov v17.d[0], x5 cnt v17.8b, v17.8b addv b17, v17.8b umov w5, v17.b[0] - add w3, w3, w5, LSR #2 + mov w5, w5 + lsr x5, x5, #2 + add w3, w5, w3 add x0, x0, #16 cmp x0, x2 bls G_M000_IG04

I think I know why. According to https://dougallj.github.io/applecpu/firestorm-int.html, ADD (register, lsr, 32-bit) has reciprocal TP of 0.333 per cycle with 2 cycle latency. However, both ADD (register, 32-bit) and LSR (immediate, 32-bit) are 0.111 of reciprocal TP with 1 cycle lat. This looks like a stall on a backwards jump because if I forcibly use V256 (V128x2) version it is marginally faster too despite codegen.

According to napkin math, slower version handles around 5.8~6 bytes per cycle while the faster one is ~8. If the above assumption is correct, it would explain the difference.

All in all this was very counterintuitive, TIL.

tannergooding · 2023-07-28T14:17:08Z

We really want to try and balance the code here. Maintaining explicitly separate code paths for different platforms is error prone and adds a lot of complexity to the managed code-base.

In this particular case, the main reason its faster is because scalar popcnt is currently pessimized due to there not being a native scalar popcnt instruction. We should probably ensure that is optimized to be fmov; cnt; fmov instead which should be faster on Arm64.

But we could also add the minimal pattern recognition to the JIT instead so that the common pattern of popcnt(vector.esmb) is optimized as well to do shrn; popcnt; addv (or an equivalent) instead.

neon-sunset · 2023-07-28T15:21:56Z

@tannergooding I see. Given prevalence of doing eq + movemask + tzcnt/popcnt everywhere, would it make sense to resurrect #83010 (or submit a new one) since VectorMask no longer seems to be the path forward (unless I misunderstood)? Not for convenience (as outlined in the issue) but for optimal codegen on each platform, perhaps on some Vector{T/XXX}Extensions class? This would allow to consolidate current code and reclaim some performance on ARM64 at the same time.

To me it seems easier to do than to teach JIT to recognize all shapes .ExtractMostSignificantBits() + SomeOp can take.

tannergooding · 2023-07-28T16:26:34Z

We're ultimately going to want/need both.

I'll be opening a proposal covering some IndexOfFirst/LastMatch and similar APIs to help clarify intent for new code for .NET 9 (just wrapping up some stuff for .NET 8 before I do). However there are going to be many users that are porting xarch centric algorithms that will always use ExtractMostSignificantBits and so we'll need to ensure its well handled for Arm64 regardless.

neon-sunset · 2023-07-28T16:57:12Z

Thank you. As for this PR, what should I do with it? Apply suggestion from @MihaZupan, keep it as is or change to draft and just let it sit until there are changes in runtime?

tannergooding · 2023-07-28T17:16:16Z

I think in this case we want to leave it as is and track doing the JIT pattern matching instead.

I wouldn't, however, be opposed to an internal GetMatchCount() API which abstracts this temporary difference to managed code until the relevant pattern matching can be handled. That at least keeps these main call-sites simple and centralizes the pattern for other devs to copy until we determine if a general-purpose API is appropriate.

EgorBo · 2023-07-28T18:01:51Z

It's already tracked, the trick used here is from the arm blog, we have #76047 to track it in jit. It's something we definitely want to fix eventually

neon-sunset · 2023-08-01T00:49:08Z

I wouldn't, however, be opposed to an internal GetMatchCount() API which abstracts this temporary difference to managed code until the relevant pattern matching can be handled. That at least keeps these main call-sites simple and centralizes the pattern for other devs to copy until we determine if a general-purpose API is appropriate.

I'll submit a separate PR for this which introduces internal GetMatchCount/GetFirstMatch and updates relevant callsites, and then close this one if applicable.

Use SHRN when counting on Arm64

de01203

dotnet-issue-labeler bot added the area-System.Memory label Jul 28, 2023

ghost added the community-contribution Indicates that the PR has been added by a community member label Jul 28, 2023

neon-sunset commented Jul 28, 2023

View reviewed changes

build-analysis bot mentioned this pull request Jul 28, 2023

Tracking issue for CI build timeouts #76454

Closed

PR feedback

fda7dd7

neon-sunset marked this pull request as draft August 1, 2023 00:50

Merge branch 'dotnet:main' into main

018355c

build-analysis bot mentioned this pull request Aug 5, 2023

Timeout in System.Net.Quic.Functional.Tests #86019

Closed

Merge branch 'dotnet:main' into main

be8ad28

neon-sunset closed this Aug 14, 2023

ghost locked as resolved and limited conversation to collaborators Sep 13, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ARM64] Use SHRN in SpanHelpers.CountValueType #89632

[ARM64] Use SHRN in SpanHelpers.CountValueType #89632

neon-sunset commented Jul 28, 2023 •

edited

Loading

ghost commented Jul 28, 2023

neon-sunset Jul 28, 2023 •

edited

Loading

MihaZupan Jul 28, 2023

neon-sunset Jul 28, 2023

EgorBo Jul 28, 2023

neon-sunset Jul 28, 2023

tannergooding commented Jul 28, 2023

neon-sunset commented Jul 28, 2023 •

edited

Loading

tannergooding commented Jul 28, 2023 •

edited

Loading

neon-sunset commented Jul 28, 2023

tannergooding commented Jul 28, 2023

EgorBo commented Jul 28, 2023 •

edited

Loading

neon-sunset commented Aug 1, 2023 •

edited

Loading

+=> 2,
+=> 3,
+=> 4,
+                                              _ => 5

[ARM64] Use SHRN in SpanHelpers.CountValueType #89632

[ARM64] Use SHRN in SpanHelpers.CountValueType #89632

Conversation

neon-sunset commented Jul 28, 2023 • edited Loading

ghost commented Jul 28, 2023

neon-sunset Jul 28, 2023 • edited Loading

Choose a reason for hiding this comment

MihaZupan Jul 28, 2023

Choose a reason for hiding this comment

neon-sunset Jul 28, 2023

Choose a reason for hiding this comment

EgorBo Jul 28, 2023

Choose a reason for hiding this comment

neon-sunset Jul 28, 2023

Choose a reason for hiding this comment

tannergooding commented Jul 28, 2023

neon-sunset commented Jul 28, 2023 • edited Loading

tannergooding commented Jul 28, 2023 • edited Loading

neon-sunset commented Jul 28, 2023

tannergooding commented Jul 28, 2023

EgorBo commented Jul 28, 2023 • edited Loading

neon-sunset commented Aug 1, 2023 • edited Loading

neon-sunset commented Jul 28, 2023 •

edited

Loading

neon-sunset Jul 28, 2023 •

edited

Loading

neon-sunset commented Jul 28, 2023 •

edited

Loading

tannergooding commented Jul 28, 2023 •

edited

Loading

EgorBo commented Jul 28, 2023 •

edited

Loading

neon-sunset commented Aug 1, 2023 •

edited

Loading