-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Span-perf for copy via loop much slower than array-version #9977
Comments
/cc @AndyAyersMS please have a look on this |
Seems quite odd that |
Here is the whole dasm of this method: ; Assembly listing for method ConsoleApp1.Benchmarks:Copy(struct,struct)
; Emitting BLENDED_CODE for X64 CPU with AVX
; optimized code
; rbp based frame
; fully interruptible
; Final local variable assignments
;
; V00 arg0 [V00 ] ( 15, 48 ) struct (16) [rbp-0x28] do-not-enreg[XSF] addr-exposed ld-addr-op
; V01 arg1 [V01,T03] ( 9, 18 ) struct (16) [rbp-0x38] do-not-enreg[SF] ld-addr-op
; V02 loc0 [V02,T00] ( 22, 85 ) int -> rbx
;# V03 OutArgs [V03 ] ( 1, 1 ) lclBlk ( 0) [rsp+0x00]
; V04 cse0 [V04,T02] ( 8, 23 ) int -> r14
; V05 cse1 [V05,T04] ( 7, 19 ) byref -> r15
; V06 cse2 [V06,T01] ( 8, 32 ) long -> rdi
;
; Lcl frame size = 40
G_M16768_IG01:
55 push rbp
4157 push r15
4156 push r14
53 push rbx
4883EC28 sub rsp, 40
488D6C2440 lea rbp, [rsp+40H]
48897DD8 mov bword ptr [rbp-28H], rdi
488975E0 mov qword ptr [rbp-20H], rsi
488955C8 mov bword ptr [rbp-38H], rdx
48894DD0 mov qword ptr [rbp-30H], rcx
G_M16768_IG02:
33DB xor ebx, ebx
488D7DD8 lea rdi, bword ptr [rbp-28H]
E8C68F5CFF call System.ReadOnlySpan`1[Int32][System.Int32]:get_Length():int:this
85C0 test eax, eax
7E2F jle SHORT G_M16768_IG04
448B75D0 mov r14d, dword ptr [rbp-30H]
4C8B7DC8 mov r15, bword ptr [rbp-38H]
G_M16768_IG03:
413BDE cmp ebx, r14d
732D jae SHORT G_M16768_IG05
4863FB movsxd rdi, ebx
3B5DE0 cmp ebx, dword ptr [rbp-20H]
7325 jae SHORT G_M16768_IG05
488B45D8 mov rax, bword ptr [rbp-28H]
8B04B8 mov eax, dword ptr [rax+4*rdi]
418904BF mov dword ptr [r15+4*rdi], eax
FFC3 inc ebx
488D7DD8 lea rdi, bword ptr [rbp-28H]
E8978F5CFF call System.ReadOnlySpan`1[Int32][System.Int32]:get_Length():int:this
3BC3 cmp eax, ebx
7FD9 jg SHORT G_M16768_IG03
G_M16768_IG04:
488D65E8 lea rsp, [rbp-18H]
5B pop rbx
415E pop r14
415F pop r15
5D pop rbp
C3 ret
G_M16768_IG05:
E853728178 call CORINFO_HELP_RNGCHKFAIL
CC int3
; Total bytes of code 110, prolog size 31 for method ConsoleApp1.Benchmarks:Copy(struct,struct)
; ============================================================ Edit: I've uploaded the whole jitdump to https://1drv.ms/t/s!Aq_kXFBmTtM2hFNDRqUcEsqNFGcA |
Since you have the ability to look at jit info, what reason does the jit give for rejecting the inline of |
Could it be that this is Unix specific? On Windows this works as expected. |
I can't verify the dasm on Windows. When I change |
Well, when I try this the Copy method generates G_M55887_IG01:
4883EC28 sub rsp, 40
G_M55887_IG02:
488B01 mov rax, bword ptr [rcx]
8B4908 mov ecx, dword ptr [rcx+8]
4533C0 xor r8d, r8d
85C9 test ecx, ecx
7E1C jle SHORT G_M55887_IG04
G_M55887_IG03:
443B4208 cmp r8d, dword ptr [rdx+8]
731B jae SHORT G_M55887_IG05
4C8B0A mov r9, bword ptr [rdx]
4D63D0 movsxd r10, r8d
468B1C90 mov r11d, dword ptr [rax+4*r10]
47891C91 mov dword ptr [r9+4*r10], r11d
41FFC0 inc r8d
443BC1 cmp r8d, ecx
7CE4 jl SHORT G_M55887_IG03
G_M55887_IG04:
4883C428 add rsp, 40
C3 ret
G_M55887_IG05:
E859F0365F call CORINFO_HELP_RNGCHKFAIL
CC int3 |
@gfoidl What sort of build are you using to get the disassembly? If you use |
Hmm, the array-variant assembly you have shown is a bit misleading, in that case loop cloning kicks in and you show only the loop version without range checks. I don't think loop cloning ever kicks in for span range checks so that version will always be slightly slower even without the weird |
@mikedn thx for your dasm in https://github.com/dotnet/coreclr/issues/17065#issuecomment-374705550 @AndyAyersMS I followed the steps described in Viewing JIT Dumps. Built from dotnet/coreclr@ff2abe5 Even if the dasm looks quite good for Windows, the perf-numbers aren't. |
Best practice would be to locally build both Release and Debug. Then use the release bits to overwrite the DLLs. Then overwrite again with just the Debug-built jit DLL so you can enable disasm and dumps and such. So you end up with all the copied bits from the same build, and everything copied is built Release except for the jit, which is Debug. |
Thanks for the info! I'll try this. |
Yep, I can reproduce your numbers on my machine with the assembly I posted above. Whatever happens on Linux that causes Best I can tell the reason why the span version is slower is a combination of lack of loop cloning and lack of hoisting of destination span field loads. This can be seen if you try the following version: for (int i = 0; i < src.Length && i < dst.Length; ++i)
dst[i] = src[i]; that generates G_M55888_IG02:
488B02 mov rax, bword ptr [rdx]
8B5208 mov edx, dword ptr [rdx+8]
4C8B01 mov r8, bword ptr [rcx]
8B4908 mov ecx, dword ptr [rcx+8]
4533C9 xor r9d, r9d
85C9 test ecx, ecx
7E1A jle SHORT G_M55888_IG05
EB13 jmp SHORT G_M55888_IG04
G_M55888_IG03:
4D63D1 movsxd r10, r9d
478B1C90 mov r11d, dword ptr [r8+4*r10]
46891C90 mov dword ptr [rax+4*r10], r11d
41FFC1 inc r9d
443BC9 cmp r9d, ecx
7D05 jge SHORT G_M55888_IG05
G_M55888_IG04:
443BCA cmp r9d, edx
7CE8 jl SHORT G_M55888_IG03
G_M55888_IG05:
C3 ret This does not have range checks and no in-loop loads of span fields. It does have an extra compare so it's still a bit slower than the array version but quite a bit faster than the original version. |
I take it we don't promote the destination span and so probably think the writes can alias the span fields? I have toyed with making the implicit byref promotion more aggressive.... |
Ah, right. It looks like we do promote but then the local variable associated with the second argument has zero references so I guess promotion was undone in that case. I'll dig a bit more. |
Right, we promote and then undo if we don't think it has sufficient benefit. Logic is in |
Yeah, it's due to the
Indeed.
Is this really the right way to handle this? I suppose this is just the cheap way, the right way is probably to rely on CSE to remove redundant loads from such arguments. But then that would require tracking memory stores more accurately, even if the memory for implicit by ref args is not aliased you still need to figure out when these arguments are modified inside the method. |
Side note: when I view the JIT dasm on Linux as described by Andy in https://github.com/dotnet/coreclr/issues/17065#issuecomment-374712572 |
@gfoidl that's good to hear. We should probably update that document since I suspect most people who want to look at jit disassembly and behavior are interested in what happens when code is optimized. So we're left with the fact that the span version has an extra load (from the promote-unpromote dance we do with I'm not a big fan of the cloner but in a case like this it actually seems to do something useful. I suppose if we can get its excesses under control, we can think about generalizing the recognizer to handle spans. As for more aggressive promotion (or less aggressive undoing) I would like to wait until we have plausible block weight estimates available (something I hope to get to fairly soon) and use that to spot cases where one of the refs is in a loop. |
Thanks for the insights 👍
I'll try to update it and send an PR. |
Which actually works pretty well already ... until you have a store through a byref that the JIT can't figure out that it doesn't alias the implicit byref argument: struct LargeStruct
{
public int a, b, c, d, e, f, g, h;
}
[MethodImpl(MethodImplOptions.NoInlining)]
static int Test(ref int mem, LargeStruct ss1, LargeStruct ss2)
{
int s = 0;
for (int i = 0; i < ss1.b; i++)
{
s += ss1.a * ss2.c / ss1.a;
//mem += ss.c;
}
return s;
} generates G_M55886_IG02:
33C9 xor ecx, ecx
4533C9 xor r9d, r9d
448B5204 mov r10d, dword ptr [rdx+4]
4585D2 test r10d, r10d
7E19 jle SHORT G_M55886_IG04
448B1A mov r11d, dword ptr [rdx]
418BC3 mov eax, r11d
410FAF4008 imul eax, dword ptr [r8+8]
99 cdq
41F7FB idiv edx:eax, r11d
G_M55886_IG03:
03C8 add ecx, eax
41FFC1 inc r9d
453BCA cmp r9d, r10d
7CF6 jl SHORT G_M55886_IG03 All the loads and invariant expression have been hoisted out of the loop. If the But of course, even if JIT alias analysis would be improved to handle such cases some things would still not work as well as they do when struct promotion is done (e.g. loop cloning runs before VN & CSE). |
If one follows the current described steps, one won't see the JIT dump for optimized code in the core lib. This PR adds the necessary step to see optimized code. Cf. https://github.com/dotnet/coreclr/issues/17065#issuecomment-374772011
For the record, this is still an issue in .NET 6. Here's the Benchmark output from my PC:
|
Description
In a simple copy-loop Span is a lot slower than an array-version.
I'd expect that Span and array have similar perf.
Note:
Span_CopyTo
is just for reference included.Benchmark
Results
Code
Disassembly
Note: The JitDisasm is produced of fresh coreclr-build on ubuntu.
dotnet --info
Also only the loops are shown.
Array-variant
Span-variant
category:cq
theme:loop-opt
skill-level:expert
cost:medium
The text was updated successfully, but these errors were encountered: