-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Span<T> performance issues on Linux versus Windows #12901
Comments
Have you tried latest 3.0 preview? Maybe it's fixed but not shipped yet. |
Thanks -- I'll take a look. @BrianHanechak can you share your measurements? |
With 3.0p5 I am seeing
I am going to trim this benchmark down as the runs take a long time. The core issue should repro even with smaller input sets. |
Looks like there are two main contributors to the perf impact.
For inlining, the inliner overestimates the size of the Net impact is that the pointer version of
Also, the method we end up calling is not as well optimized as its pointer-based counterpart. There seem to be two factors here as well: we spill the incoming span (passed in regs) to the stack and then reload it back to regs, and we have an extra move in the inner loop. The extra arg passing costs may dominate here as the loop itself only runs 4 iterations. ;; Average(byte*, int)
G_M63158_IG01:
push rbp
mov rbp, rsp
mov edi, edx
G_M63158_IG02:
xor eax, eax
xor edx, edx
test edi, edi
jle SHORT G_M63158_IG04
G_M63158_IG03:
movsxd rcx, edx
movzx rcx, byte ptr [rsi+rcx]
add eax, ecx
inc edx
cmp edx, edi
jl SHORT G_M63158_IG03
G_M63158_IG04:
cdq
idiv edx:eax, edi
movzx rax, al
G_M63158_IG05:
pop rbp
ret ;; Average(Span<byte>)
G_M63160_IG01:
push rbp
sub rsp, 16
lea rbp, [rsp+10H]
mov bword ptr [rbp-10H], rsi // spill back to stack
mov qword ptr [rbp-08H], rdx // ...
G_M63160_IG02:
xor eax, eax
xor edi, edi
mov esi, dword ptr [rbp-08H] // and reload
test esi, esi
jle SHORT G_M63160_IG04
mov rdx, bword ptr [rbp-10H] // ....
G_M63160_IG03:
movsxd rcx, edi
movzx rcx, byte ptr [rdx+rcx]
add ecx, eax
mov eax, ecx // extra move
inc edi
cmp edi, esi
jl SHORT G_M63160_IG03
G_M63160_IG04:
cdq
idiv edx:eax, esi
movzx rax, al
G_M63160_IG05:
lea rsp, [rbp]
pop rbp
ret As a workaround, adding
We could consider boosting inline profitabilty for cc @CarolEidt: an example where we're not using SysV ABI to its best advantage. I don't see anything we can do here for 3.0 so am going to move this to future. |
Interesting... I think even on Windows, it was not inlining the I did have a weird case where the Ptr version had bimodal results in the 4096x4096x16 version but not the 4096x4096x4 version. Here's the Windows results I had: BenchmarkDotNet=v0.11.5, OS=Windows 10.0.17134.829 (1803/April2018Update/Redstone4)
Intel Core i7-8700K CPU 3.70GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
Frequency=3609369 Hz, Resolution=277.0567 ns, Timer=TSC
.NET Core SDK=2.2.103
[Host] : .NET Core 2.2.1 (CoreCLR 4.6.27207.03, CoreFX 4.6.27207.03), 64bit RyuJIT
DefaultJob : .NET Core 2.2.1 (CoreCLR 4.6.27207.03, CoreFX 4.6.27207.03), 64bit RyuJIT
|
Windows ABI handles structs like Since the span you're working on is so short (4 elements) the cost of forming the span and passing it to the helper method are going to be prominent. Can you add your Linux results? |
Here's the windows x64 code for G_M51636_IG01:
0F1F440000 nop
G_M51636_IG02:
488B02 mov rax, bword ptr [rdx]
8B4A08 mov ecx, dword ptr [rdx+8]
33D2 xor edx, edx
4533C0 xor r8d, r8d
85C9 test ecx, ecx
7E13 jle SHORT G_M51636_IG04
G_M51636_IG03:
4D63C8 movsxd r9, r8d
460FB60C08 movzx r9, byte ptr [rax+r9]
4103D1 add edx, r9d
41FFC0 inc r8d
443BC1 cmp r8d, ecx
7CED jl SHORT G_M51636_IG03
G_M51636_IG04:
8BC2 mov eax, edx
99 cdq
F7F9 idiv edx:eax, ecx
0FB6C0 movzx rax, al
G_M51636_IG05:
C3 ret Contrast vs the Linux version, we see: simpler prolog and epilogue (the jit will do "frame pointer omission on Windows but not Linux); just a "reload" as the span is passed implicitly by-ref, and no move in the inner loop. |
Also worth noting that when public static int Main()
{
byte[] b = new byte[4];
b[3] = 100;
X x = new X();
byte r = x.Average(b); // mark as aggressive inlining
return (r == 10 ? 100 : -1);
} One might expect we'd realize we have a constant loop bound and unroll, but no: G_M64500_IG01:
4883EC28 sub rsp, 40
90 nop
G_M64500_IG02:
48B9187B75D217000000 mov rcx, 0x17D2757B18
BA04000000 mov edx, 4
E84792FEFF call CORINFO_HELP_NEWARR_1_VC
C6401364 mov byte ptr [rax+19], 100
4883C010 add rax, 16
B904000000 mov ecx, 4
33D2 xor edx, edx
4533C0 xor r8d, r8d
85C9 test ecx, ecx
7E13 jle SHORT G_M64500_IG04
G_M64500_IG03:
4D63C8 movsxd r9, r8d
460FB60C08 movzx r9, byte ptr [rax+r9]
4103D1 add edx, r9d
41FFC0 inc r8d
443BC1 cmp r8d, ecx
7CED jl SHORT G_M64500_IG03
G_M64500_IG04:
8BC2 mov eax, edx
99 cdq
F7F9 idiv edx:eax, ecx
0FB6C0 movzx rax, al
83F80A cmp eax, 10
740A je SHORT G_M64500_IG06
B8FFFFFFFF mov eax, -1
G_M64500_IG05:
4883C428 add rsp, 40
C3 ret
G_M64500_IG06:
B864000000 mov eax, 100
G_M64500_IG07:
4883C428 add rsp, 40
C3 ret First point of weakness seems to be that when we prune away a branch we don't try and revisit impacted PHIs or value numbers. So if we have a conditional def (which we always have for span construction initially) where one side gets pruned away, then even though the span fields are promoted they never get refined to single defs. So, no constant prop... cc @dotnet/jit-contrib |
Mmm, maybe I should try my SCCP experiment on this... |
Changing the public static int Main()
{
byte[] b = new byte[4];
b[3] = 100;
X x = new X();
Span<byte> s = new Span<byte>(b, 0, 4);
byte r = x.Average(s);
return (r == 10 ? 100 : -1);
} produces G_M36787_IG01:
4883EC28 sub rsp, 40
90 nop
G_M36787_IG02:
48B9187B75D217000000 mov rcx, 0x17D2757B18
BA04000000 mov edx, 4
E8478EFEFF call CORINFO_HELP_NEWARR_1_VC
C6401364 mov byte ptr [rax+19], 100
4883C010 add rax, 16
33D2 xor edx, edx
33C9 xor ecx, ecx
G_M36787_IG03:
4C63C1 movsxd r8, ecx
460FB60400 movzx r8, byte ptr [rax+r8]
4103D0 add edx, r8d
FFC1 inc ecx
83F904 cmp ecx, 4
7CEE jl SHORT G_M36787_IG03
G_M36787_IG04:
8BC2 mov eax, edx
C1F81F sar eax, 31
83E003 and eax, 3
03C2 add eax, edx
C1F802 sar eax, 2
0FB6C0 movzx rax, al
83F80A cmp eax, 10
740A je SHORT G_M36787_IG06
B8FFFFFFFF mov eax, -1
G_M36787_IG05:
4883C428 add rsp, 40
C3 ret
G_M36787_IG06:
B864000000 mov eax, 100
G_M36787_IG07:
4883C428 add rsp, 40
C3 ret |
--- D:\Projects\jit\span_base.asm 2019-06-21 08:32:51.000000000 +-0300
+++ D:\Projects\jit\span_diff.asm 2019-06-21 08:33:51.000000000 +-0300
@@ -35,30 +35,29 @@
G_M55895_IG02:
48B9187B17A4FE7F0000 mov rcx, 0x7FFEA4177B18
BA04000000 mov edx, 4
E807337E5F call CORINFO_HELP_NEWARR_1_VC
C6401364 mov byte ptr [rax+19], 100
4883C010 add rax, 16
- B904000000 mov ecx, 4
33D2 xor edx, edx
- 4533C0 xor r8d, r8d
- 85C9 test ecx, ecx
- 7E13 jle SHORT G_M55895_IG04
+ 33C9 xor ecx, ecx
G_M55895_IG03:
- 4D63C8 movsxd r9, r8d
- 460FB60C08 movzx r9, byte ptr [rax+r9]
- 4103D1 add edx, r9d
- 41FFC0 inc r8d
- 443BC1 cmp r8d, ecx
- 7CED jl SHORT G_M55895_IG03
+ 4C63C1 movsxd r8, ecx
+ 460FB60400 movzx r8, byte ptr [rax+r8]
+ 4103D0 add edx, r8d
+ FFC1 inc ecx
+ 83F904 cmp ecx, 4
+ 7CEE jl SHORT G_M55895_IG03
G_M55895_IG04:
8BC2 mov eax, edx
- 99 cdq
- F7F9 idiv edx:eax, ecx
+ C1F81F sar eax, 31
+ 83E003 and eax, 3
+ 03C2 add eax, edx
+ C1F802 sar eax, 2
0FB6C0 movzx rax, al
83F80A cmp eax, 10
740A je SHORT G_M55895_IG06
B8FFFFFFFF mov eax, -1
G_M55895_IG05:
@@ -69,7 +68,7 @@
B864000000 mov eax, 100
G_M55895_IG07:
4883C428 add rsp, 40
C3 ret
-; Total bytes of code 99, prolog size 5 for method Program:Test():int
+; Total bytes of code 96, prolog size 5 for method Program:Test():int |
cc @dotnet/jit-contrib -- a simple example where poor SysV ABI handling of structs causes a perf issue |
On Windows, I'm seeing that a function taking
Span<byte>
performs similarly to one takingbyte*
and a length parameter separately, but on Linux, theSpan<byte>
version is considerably slower.Here's the source of my test program:
https://gist.github.com/BrianHanechak/f7fd69f636d6d25adeed6a8bbae28d5d
Any ideas as to why the Linux version performs differently?
dotnet --version
reports2.2.300
on the Linux box and2.2.103
on the Windows box.category:cq
theme:inlining
skill-level:expert
cost:medium
The text was updated successfully, but these errors were encountered: