Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ARM64] Use SHRN in SpanHelpers.CountValueType #89632

Closed
wants to merge 4 commits into from

Conversation

neon-sunset
Copy link
Contributor

@neon-sunset neon-sunset commented Jul 28, 2023

This change avoids using movemask emulation on Arm64 and uses SHRN instead.
Somewhat related to #76047

Numbers:

BenchmarkDotNet v0.13.6, macOS Sonoma 14.0 (23A5301g) [Darwin 23.0.0]
Apple M1 Pro, 1 CPU, 8 logical and 8 physical cores
.NET SDK 8.0.100-rc.1.23375.11
  [Host]     : .NET 8.0.0 (8.0.23.37503), Arm64 RyuJIT AdvSIMD
  DefaultJob : .NET 8.0.0 (8.0.23.37503), Arm64 RyuJIT AdvSIMD
Method Length Mean Error StdDev Ratio
CountLines 100 7.525 ns 0.0599 ns 0.0531 ns 1.00
CountLinesNew 100 5.762 ns 0.0071 ns 0.0063 ns 0.77
CountLines 1000 65.140 ns 0.1234 ns 0.1030 ns 1.00
CountLinesNew 1000 43.526 ns 0.1965 ns 0.1838 ns 0.67
CountLines 67694 4,145.631 ns 7.5764 ns 7.0869 ns 1.00
CountLinesNew 67694 2,577.341 ns 4.8096 ns 4.2636 ns 0.62

Benchmark:

class Counting
{
    private byte[] ThirdPartyNotices = null!;

    [Params(100, 1000, int.MaxValue)]
    public int Length;

    [GlobalSetup]
    public void Setup()
    {
        var notices = new HttpClient()
            .GetByteArrayAsync("https://raw.githubusercontent.com/dotnet/runtime/main/THIRD-PARTY-NOTICES.TXT")
            .GetAwaiter()
            .GetResult();

        ThirdPartyNotices = Length is int.MaxValue ? notices : notices[..Length];
    }
    
    [Benchmark(Baseline = true)]
    public int CountLines()
    {
        return ThirdPartyNotices.AsSpan().Count((byte)'\n');
    }

    [Benchmark]
    public int CountLinesNew()
    {
        var span = ThirdPartyNotices.AsSpan();
        return SpanUtils.CountValueType(ref span[0], (byte)'\n', span.Length);
    }
}

I need to fix the setup for benchmarking against two proper coreroots on my machine but hopefully this is sufficient.

@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Jul 28, 2023
@ghost
Copy link

ghost commented Jul 28, 2023

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

Issue Details

This change avoids using movemask emulation on Arm64 and uses SHRN instead.
Somewhat related to #76047

Numbers:

BenchmarkDotNet v0.13.6, macOS Sonoma 14.0 (23A5301g) [Darwin 23.0.0]
Apple M1 Pro, 1 CPU, 8 logical and 8 physical cores
.NET SDK 8.0.100-rc.1.23375.11
  [Host]     : .NET 8.0.0 (8.0.23.37503), Arm64 RyuJIT AdvSIMD
  DefaultJob : .NET 8.0.0 (8.0.23.37503), Arm64 RyuJIT AdvSIMD
Method Length Mean Error StdDev Ratio Allocated Alloc Ratio
CountLines 100 7.715 ns 0.0146 ns 0.0137 ns 1.00 - NA
CountLinesNew 100 5.741 ns 0.0132 ns 0.0124 ns 0.74 - NA
CountLines 1000 65.859 ns 0.1259 ns 0.1178 ns 1.00 - NA
CountLinesNew 1000 44.800 ns 0.0676 ns 0.0528 ns 0.68 - NA
CountLines 67694 4,137.981 ns 8.3394 ns 7.3926 ns 1.00 - NA
CountLinesNew 67694 3,517.751 ns 3.2983 ns 2.7542 ns 0.85 - NA

Benchmark:

class Counting
{
    private U8String ThirdPartyNotices;

    [Params(100, 1000, int.MaxValue)]
    public int Length;

    [GlobalSetup]
    public void Setup()
    {
        var notices = new HttpClient()
            .GetU8StringAsync("https://raw.githubusercontent.com/dotnet/runtime/main/THIRD-PARTY-NOTICES.TXT")
            .GetAwaiter()
            .GetResult();
        ThirdPartyNotices = Length is int.MaxValue ? notices : notices[..Length];
    }
    
    // Both variants just end up calling old/new MemoryExtensions.Count(span, (byte)'\n')
    [Benchmark]
    public int CountLines() => ThirdPartyNotices.Lines.Count;

    [Benchmark]
    public int CountLinesNew() => ThirdPartyNotices.Lines.CountNew();
}

I need to fix the setup for benchmarking against two proper coreroots on my machine but hopefully this is sufficient.

Author: neon-sunset
Assignees: -
Labels:

area-System.Memory, community-contribution

Milestone: -

1 => 2,
2 => 3,
4 => 4,
_ => 5
Copy link
Contributor Author

@neon-sunset neon-sunset Jul 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could not come up with a better way to write this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

count += (int)((uint)BitOperations.PopCount(matches) / (4 * sizeof(T)));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, it seems JIT can't completely fold this:

[MethodImpl(MethodImplOptions.AggressiveOptimization | MethodImplOptions.NoInlining)]
static int CountAdvSimd<T>(ReadOnlySpan<T> span, T value)
{
    var eqmask = Vector128.Equals(Vector128.Create(span), Vector128.Create(value));
    var matches = AdvSimd.ShiftRightLogicalNarrowingLower(eqmask.AsUInt16(), 4).AsUInt64().ToScalar();
    // or (int)((uint)BitOperations.PopCount(matches) / (4 * sizeof(T)));
    return BitOperations.PopCount(matches) >> (Unsafe.SizeOf<T>() switch
    {
        1 => 2,
        2 => 3,
        4 => 4,
        _ => 5
    });
}

Diff where T is ushort: https://www.diffchecker.com/hzMEBfhM/

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any codegen differences locally

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The version by @MihaZupan is faster 🤯
diff:

G_M000_IG04:                ;; offset=0028H
            ldr     q17, [x0]
            cmeq    v17.16b, v17.16b, v16.16b
            shrn    v17.8b, v17.8h, #4
            umov    x5, v17.d[0]
            mov     v17.d[0], x5
            cnt     v17.8b, v17.8b
            addv    b17, v17.8b
            umov    w5, v17.b[0]
-           add     w3, w3, w5,  LSR #2
+           mov     w5, w5
+           lsr     x5, x5, #2
+           add     w3, w5, w3
            add     x0, x0, #16
            cmp     x0, x2
            bls     G_M000_IG04

I think I know why. According to https://dougallj.github.io/applecpu/firestorm-int.html, ADD (register, lsr, 32-bit) has reciprocal TP of 0.333 per cycle with 2 cycle latency. However, both ADD (register, 32-bit) and LSR (immediate, 32-bit) are 0.111 of reciprocal TP with 1 cycle lat. This looks like a stall on a backwards jump because if I forcibly use V256 (V128x2) version it is marginally faster too despite codegen.

According to napkin math, slower version handles around 5.8~6 bytes per cycle while the faster one is ~8. If the above assumption is correct, it would explain the difference.

All in all this was very counterintuitive, TIL.

@tannergooding
Copy link
Member

We really want to try and balance the code here. Maintaining explicitly separate code paths for different platforms is error prone and adds a lot of complexity to the managed code-base.

In this particular case, the main reason its faster is because scalar popcnt is currently pessimized due to there not being a native scalar popcnt instruction. We should probably ensure that is optimized to be fmov; cnt; fmov instead which should be faster on Arm64.

But we could also add the minimal pattern recognition to the JIT instead so that the common pattern of popcnt(vector.esmb) is optimized as well to do shrn; popcnt; addv (or an equivalent) instead.

@neon-sunset
Copy link
Contributor Author

neon-sunset commented Jul 28, 2023

@tannergooding I see. Given prevalence of doing eq + movemask + tzcnt/popcnt everywhere, would it make sense to resurrect #83010 (or submit a new one) since VectorMask no longer seems to be the path forward (unless I misunderstood)? Not for convenience (as outlined in the issue) but for optimal codegen on each platform, perhaps on some Vector{T/XXX}Extensions class? This would allow to consolidate current code and reclaim some performance on ARM64 at the same time.

To me it seems easier to do than to teach JIT to recognize all shapes .ExtractMostSignificantBits() + SomeOp can take.

@tannergooding
Copy link
Member

tannergooding commented Jul 28, 2023

We're ultimately going to want/need both.

I'll be opening a proposal covering some IndexOfFirst/LastMatch and similar APIs to help clarify intent for new code for .NET 9 (just wrapping up some stuff for .NET 8 before I do). However there are going to be many users that are porting xarch centric algorithms that will always use ExtractMostSignificantBits and so we'll need to ensure its well handled for Arm64 regardless.

@neon-sunset
Copy link
Contributor Author

Thank you. As for this PR, what should I do with it? Apply suggestion from @MihaZupan, keep it as is or change to draft and just let it sit until there are changes in runtime?

@tannergooding
Copy link
Member

I think in this case we want to leave it as is and track doing the JIT pattern matching instead.

I wouldn't, however, be opposed to an internal GetMatchCount() API which abstracts this temporary difference to managed code until the relevant pattern matching can be handled. That at least keeps these main call-sites simple and centralizes the pattern for other devs to copy until we determine if a general-purpose API is appropriate.

@EgorBo
Copy link
Member

EgorBo commented Jul 28, 2023

It's already tracked, the trick used here is from the arm blog, we have #76047 to track it in jit. It's something we definitely want to fix eventually

@neon-sunset
Copy link
Contributor Author

neon-sunset commented Aug 1, 2023

I wouldn't, however, be opposed to an internal GetMatchCount() API which abstracts this temporary difference to managed code until the relevant pattern matching can be handled. That at least keeps these main call-sites simple and centralizes the pattern for other devs to copy until we determine if a general-purpose API is appropriate.

I'll submit a separate PR for this which introduces internal GetMatchCount/GetFirstMatch and updates relevant callsites, and then close this one if applicable.

@neon-sunset neon-sunset marked this pull request as draft August 1, 2023 00:50
@ghost ghost locked as resolved and limited conversation to collaborators Sep 13, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Memory community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants