[ARM64] Use SHRN in SpanHelpers.CountValueType #89632
Conversation
Tagging subscribers to this area: @dotnet/area-system-memory

Issue Details

This change avoids using movemask emulation on Arm64 and uses SHRN instead. Somewhat related to #76047.

Numbers:
Benchmark:

class Counting
{
    private U8String ThirdPartyNotices;

    [Params(100, 1000, int.MaxValue)]
    public int Length;

    [GlobalSetup]
    public void Setup()
    {
        var notices = new HttpClient()
            .GetU8StringAsync("https://raw.githubusercontent.com/dotnet/runtime/main/THIRD-PARTY-NOTICES.TXT")
            .GetAwaiter()
            .GetResult();

        ThirdPartyNotices = Length is int.MaxValue ? notices : notices[..Length];
    }

    // Both variants just end up calling old/new MemoryExtensions.Count(span, (byte)'\n')
    [Benchmark]
    public int CountLines() => ThirdPartyNotices.Lines.Count;

    [Benchmark]
    public int CountLinesNew() => ThirdPartyNotices.Lines.CountNew();
}

I need to fix the setup for benchmarking against two proper coreroots on my machine but hopefully this is sufficient.
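For context, a minimal sketch of the two code shapes being compared (the helper names and the byte specialization are illustrative, not the actual SpanHelpers code):

using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

static class MovemaskVsShrn // illustrative only
{
    // Old shape: ExtractMostSignificantBits has no single-instruction
    // Arm64 equivalent, so the x86 movemask semantics get emulated.
    static int CountEmulated(Vector128<byte> chunk, Vector128<byte> target)
    {
        uint mask = Vector128.Equals(chunk, target).ExtractMostSignificantBits();
        return BitOperations.PopCount(mask); // 1 bit per matching byte
    }

    // New shape: SHRN (shift right narrow) by 4 packs the comparison
    // result into a 64-bit scalar with 4 bits per matching byte.
    static int CountShrn(Vector128<byte> chunk, Vector128<byte> target)
    {
        Vector128<byte> eq = Vector128.Equals(chunk, target);
        ulong mask = AdvSimd.ShiftRightLogicalNarrowingLower(eq.AsUInt16(), 4).AsUInt64().ToScalar();
        return BitOperations.PopCount(mask) >> 2; // 4 bits per matching byte
    }
}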
1 => 2,
2 => 3,
4 => 4,
_ => 5
I could not come up with a better way to write this.
count += (int)((uint)BitOperations.PopCount(matches) / (4 * sizeof(T)));
Thanks, it seems the JIT can't completely fold this:

[MethodImpl(MethodImplOptions.AggressiveOptimization | MethodImplOptions.NoInlining)]
static int CountAdvSimd<T>(ReadOnlySpan<T> span, T value)
{
    var eqmask = Vector128.Equals(Vector128.Create(span), Vector128.Create(value));
    var matches = AdvSimd.ShiftRightLogicalNarrowingLower(eqmask.AsUInt16(), 4).AsUInt64().ToScalar();
    // or (int)((uint)BitOperations.PopCount(matches) / (4 * sizeof(T)));
    return BitOperations.PopCount(matches) >> (Unsafe.SizeOf<T>() switch
    {
        1 => 2,
        2 => 3,
        4 => 4,
        _ => 5
    });
}
Diff where T is ushort: https://www.diffchecker.com/hzMEBfhM/
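Spelling out the arithmetic both variants rely on (a sketch for T = byte; the switch arms above are just log2(4 * sizeof(T)): 2 for byte, 3 for ushort, 4 for uint, 5 for ulong):

using System.Diagnostics;
using System.Numerics;

// Each ushort lane of the equality mask is 0x0000, 0x00FF, 0xFF00 or 0xFFFF;
// shifting right by 4 and narrowing to 8 bits yields 0x00, 0x0F, 0xF0 or 0xFF,
// i.e. exactly 4 set bits per matching byte, so PopCount / 4 counts matches.
foreach (ushort lane in new ushort[] { 0x0000, 0x00FF, 0xFF00, 0xFFFF })
{
    byte narrowed = (byte)(lane >> 4);
    int matchingBytes = BitOperations.PopCount(lane) / 8; // matched bytes in this lane
    Debug.Assert(BitOperations.PopCount(narrowed) == 4 * matchingBytes);
}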
I don't see any codegen differences locally
The version by @MihaZupan is faster 🤯
diff:
G_M000_IG04: ;; offset=0028H
ldr q17, [x0]
cmeq v17.16b, v17.16b, v16.16b
shrn v17.8b, v17.8h, #4
umov x5, v17.d[0]
mov v17.d[0], x5
cnt v17.8b, v17.8b
addv b17, v17.8b
umov w5, v17.b[0]
- add w3, w3, w5, LSR #2
+ mov w5, w5
+ lsr x5, x5, #2
+ add w3, w5, w3
add x0, x0, #16
cmp x0, x2
bls G_M000_IG04
I think I know why. According to https://dougallj.github.io/applecpu/firestorm-int.html, ADD (register, lsr, 32-bit) has a reciprocal throughput of 0.333 with 2-cycle latency, while both ADD (register, 32-bit) and LSR (immediate, 32-bit) have a reciprocal throughput of 0.111 with 1-cycle latency. This looks like a stall on a backwards jump, because if I forcibly use the V256 (V128x2) version it is marginally faster too despite the codegen.
By napkin math, the slower version handles around 5.8~6 bytes per cycle while the faster one handles ~8. If the above assumption is correct, that would explain the difference.
All in all this was very counterintuitive, TIL.
We really want to try and balance the code here. Maintaining explicitly separate code paths for different platforms is error prone and adds a lot of complexity to the managed code-base. In this particular case, the main reason it's faster is because scalar … But we could also add the minimal pattern recognition to the JIT instead, so that the common pattern of …
@tannergooding I see. Given the prevalence of doing eq + movemask + tzcnt/popcnt everywhere, would it make sense to resurrect #83010 (or submit a new one) since … To me it seems easier to do than to teach the JIT to recognize all the shapes .ExtractMostSignificantBits() + SomeOp can take.
We're ultimately going to want/need both. I'll be opening a proposal covering some …
Thank you. As for this PR, what should I do with it? Apply the suggestion from @MihaZupan, keep it as is, or change it to draft and just let it sit until there are changes in the runtime?
I think in this case we want to leave it as is and track doing the JIT pattern matching instead. I wouldn't, however, be opposed to an internal …
It's already tracked; the trick used here is from the Arm blog, and we have #76047 to track it in the JIT. It's something we definitely want to fix eventually.
I'll submit a separate PR for this which introduces internal …
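One possible shape for such an internal helper (purely hypothetical; the type name, API, and fallback behavior below are illustrative, not what any follow-up PR actually adds):

using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

internal static class VectorMask // hypothetical name, for illustration only
{
    // A 64-bit match mask: 4 bits per matching byte via SHRN on Arm64,
    // 1 bit per matching byte via ExtractMostSignificantBits elsewhere.
    internal static ulong EqualsMask(Vector128<byte> left, Vector128<byte> right)
    {
        Vector128<byte> eq = Vector128.Equals(left, right);
        return AdvSimd.IsSupported
            ? AdvSimd.ShiftRightLogicalNarrowingLower(eq.AsUInt16(), 4).AsUInt64().ToScalar()
            : eq.ExtractMostSignificantBits();
    }

    // Index of the first match (assumes mask != 0): each byte occupies
    // 4 mask bits on Arm64, so divide the trailing zero count by 4 there.
    internal static int IndexOfFirstMatch(ulong mask)
        => AdvSimd.IsSupported
            ? BitOperations.TrailingZeroCount(mask) >> 2
            : BitOperations.TrailingZeroCount(mask);
}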