Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Span<T> performance issues on Linux versus Windows #12901

Open
BrianHanechak opened this issue Jun 17, 2019 · 12 comments
Open

Span<T> performance issues on Linux versus Windows #12901

BrianHanechak opened this issue Jun 17, 2019 · 12 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI os-linux Linux OS (any supported distro) tenet-performance Performance related issue
Milestone

Comments

@BrianHanechak
Copy link

On Windows, I'm seeing that a function taking Span<byte> performs similarly to one taking byte* and a length parameter separately, but on Linux, the Span<byte> version is considerably slower.

Here's the source of my test program:

https://gist.github.com/BrianHanechak/f7fd69f636d6d25adeed6a8bbae28d5d

Any ideas as to why the Linux version performs differently?

dotnet --version reports 2.2.300 on the Linux box and 2.2.103 on the Windows box.

category:cq
theme:inlining
skill-level:expert
cost:medium

@huoyaoyuan
Copy link
Member

Have you tried latest 3.0 preview? Maybe it's fixed but not shipped yet.

@AndyAyersMS
Copy link
Member

Thanks -- I'll take a look.

@BrianHanechak can you share your measurements?

@AndyAyersMS AndyAyersMS self-assigned this Jun 17, 2019
@AndyAyersMS
Copy link
Member

With 3.0p5 I am seeing

Method S C Mean Error StdDev
Span 4096 4 3.617 ms 0.0094 ms 0.0083 ms
Ptr 4096 4 1.757 ms 0.0060 ms 0.0056 ms
Span 4096 16 7.508 ms 0.0263 ms 0.0246 ms
Ptr 4096 16 4.857 ms 0.0216 ms 0.0202 ms

I am going to trim this benchmark down as the runs take a long time. The core issue should repro even with smaller input sets.

@AndyAyersMS
Copy link
Member

Looks like there are two main contributors to the perf impact.

  • Average is not inlined in the span case
  • Average method CQ is not as good for span as it is for pointers

For inlining, the inliner overestimates the size of the Span variant as it does not realize the calls made in Average are to intrinsics. It underestimates the benefit as the impact of the constant length 4 is not readily apparent (as it requires looking through span fields). Also, the size of the call site for the Span version is somewhat under-estimated as the inliner does not accurately model the SysV conventions.

Net impact is that the pointer version of Average looks like a very good inline candidate, and the Span version does not.

;; pointer version
Inline candidate has an arg that feeds a constant test.  Multiplier increased to 1.
Inline candidate has const arg that feeds a conditional.  Multiplier increased to 4.
Inline candidate callsite is in a loop.  Multiplier increased to 7.
calleeNativeSizeEstimate=382
callsiteNativeSizeEstimate=145
benefit multiplier=7
threshold=1015
Native estimate for function size is within threshold for inlining 38.2 <= 101.5 (multiplier = 7)

;; span version
Inline candidate has an arg that feeds a constant test.  Multiplier increased to 1.
Inline candidate callsite is in a loop.  Multiplier increased to 4.
calleeNativeSizeEstimate=776
callsiteNativeSizeEstimate=135
benefit multiplier=4
threshold=540
Native estimate for function size exceeds threshold for inlining 77.6 > 54 (multiplier = 4)

Also, the method we end up calling is not as well optimized as its pointer-based counterpart. There seem to be two factors here as well: we spill the incoming span (passed in regs) to the stack and then reload it back to regs, and we have an extra move in the inner loop. The extra arg passing costs may dominate here as the loop itself only runs 4 iterations.

;; Average(byte*, int)
G_M63158_IG01:
       push     rbp
       mov      rbp, rsp
       mov      edi, edx

G_M63158_IG02:
       xor      eax, eax
       xor      edx, edx
       test     edi, edi
       jle      SHORT G_M63158_IG04

G_M63158_IG03:
       movsxd   rcx, edx
       movzx    rcx, byte  ptr [rsi+rcx]
       add      eax, ecx
       inc      edx
       cmp      edx, edi
       jl       SHORT G_M63158_IG03

G_M63158_IG04:
       cdq      
       idiv     edx:eax, edi
       movzx    rax, al

G_M63158_IG05:
       pop      rbp
       ret      
;; Average(Span<byte>)
G_M63160_IG01:
       push     rbp
       sub      rsp, 16
       lea      rbp, [rsp+10H]
       mov      bword ptr [rbp-10H], rsi    // spill back to stack
       mov      qword ptr [rbp-08H], rdx    // ...

G_M63160_IG02:
       xor      eax, eax
       xor      edi, edi
       mov      esi, dword ptr [rbp-08H]   // and reload
       test     esi, esi
       jle      SHORT G_M63160_IG04
       mov      rdx, bword ptr [rbp-10H]   // ....

G_M63160_IG03:
       movsxd   rcx, edi
       movzx    rcx, byte  ptr [rdx+rcx]
       add      ecx, eax
       mov      eax, ecx                   // extra move
       inc      edi
       cmp      edi, esi
       jl       SHORT G_M63160_IG03

G_M63160_IG04:
       cdq      
       idiv     edx:eax, esi
       movzx    rax, al

G_M63160_IG05:
       lea      rsp, [rbp]
       pop      rbp
       ret      

As a workaround, adding AggressiveInlining to the Average method brings things back in line:

Method S C Mean Error StdDev
Span 256 4 3.339 us 0.0260 us 0.0243 us
Ptr 256 4 3.406 us 0.0062 us 0.0052 us

We could consider boosting inline profitabilty for Span and perhaps discount the cost of methods called on Span types.

cc @CarolEidt: an example where we're not using SysV ABI to its best advantage.

I don't see anything we can do here for 3.0 so am going to move this to future.

@BrianHanechak
Copy link
Author

Interesting... I think even on Windows, it was not inlining the Span<T> version, but even without inlining it was still performing very close to the pointer version. I didn't know how to get the disassembly when running on Linux.

I did have a weird case where the Ptr version had bimodal results in the 4096x4096x16 version but not the 4096x4096x4 version.

Here's the Windows results I had:

BenchmarkDotNet=v0.11.5, OS=Windows 10.0.17134.829 (1803/April2018Update/Redstone4)
Intel Core i7-8700K CPU 3.70GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
Frequency=3609369 Hz, Resolution=277.0567 ns, Timer=TSC
.NET Core SDK=2.2.103
  [Host]     : .NET Core 2.2.1 (CoreCLR 4.6.27207.03, CoreFX 4.6.27207.03), 64bit RyuJIT
  DefaultJob : .NET Core 2.2.1 (CoreCLR 4.6.27207.03, CoreFX 4.6.27207.03), 64bit RyuJIT

Method S C Mean Error StdDev Median
Span 4096 4 119.7 ms 0.1533 ms 0.1359 ms 119.7 ms
Ptr 4096 4 109.3 ms 0.7861 ms 0.7353 ms 109.3 ms
Span 4096 16 333.1 ms 1.0173 ms 0.8495 ms 333.1 ms
Ptr 4096 16 330.8 ms 13.5623 ms 39.9887 ms 314.5 ms

@AndyAyersMS
Copy link
Member

Windows ABI handles structs like Span differently, so the codegen for the non-inlined Average is going to look a little different and likely a bit more streamlined. Will post it when I get a chance.

Since the span you're working on is so short (4 elements) the cost of forming the span and passing it to the helper method are going to be prominent.

Can you add your Linux results?

@AndyAyersMS
Copy link
Member

Here's the windows x64 code for Average:

G_M51636_IG01:
       0F1F440000           nop

G_M51636_IG02:
       488B02               mov      rax, bword ptr [rdx]
       8B4A08               mov      ecx, dword ptr [rdx+8]
       33D2                 xor      edx, edx
       4533C0               xor      r8d, r8d
       85C9                 test     ecx, ecx
       7E13                 jle      SHORT G_M51636_IG04

G_M51636_IG03:
       4D63C8               movsxd   r9, r8d
       460FB60C08           movzx    r9, byte  ptr [rax+r9]
       4103D1               add      edx, r9d
       41FFC0               inc      r8d
       443BC1               cmp      r8d, ecx
       7CED                 jl       SHORT G_M51636_IG03

G_M51636_IG04:
       8BC2                 mov      eax, edx
       99                   cdq
       F7F9                 idiv     edx:eax, ecx
       0FB6C0               movzx    rax, al

G_M51636_IG05:
       C3                   ret

Contrast vs the Linux version, we see: simpler prolog and epilogue (the jit will do "frame pointer omission on Windows but not Linux); just a "reload" as the span is passed implicitly by-ref, and no move in the inner loop.

@AndyAyersMS
Copy link
Member

Also worth noting that when Average(Span<byte>) does get inlined into a method where we know everything about span creation, we don't do a great job of optimizing.

    public static int Main()
    {
        byte[] b = new byte[4];
        b[3] = 100;
        X x = new X();
        byte r = x.Average(b);  // mark as aggressive inlining
        return (r == 10 ? 100 : -1);
    }

One might expect we'd realize we have a constant loop bound and unroll, but no:

G_M64500_IG01:
       4883EC28             sub      rsp, 40
       90                   nop

G_M64500_IG02:
       48B9187B75D217000000 mov      rcx, 0x17D2757B18
       BA04000000           mov      edx, 4
       E84792FEFF           call     CORINFO_HELP_NEWARR_1_VC
       C6401364             mov      byte  ptr [rax+19], 100
       4883C010             add      rax, 16
       B904000000           mov      ecx, 4
       33D2                 xor      edx, edx
       4533C0               xor      r8d, r8d
       85C9                 test     ecx, ecx
       7E13                 jle      SHORT G_M64500_IG04

G_M64500_IG03:
       4D63C8               movsxd   r9, r8d
       460FB60C08           movzx    r9, byte  ptr [rax+r9]
       4103D1               add      edx, r9d
       41FFC0               inc      r8d
       443BC1               cmp      r8d, ecx
       7CED                 jl       SHORT G_M64500_IG03

G_M64500_IG04:
       8BC2                 mov      eax, edx
       99                   cdq
       F7F9                 idiv     edx:eax, ecx
       0FB6C0               movzx    rax, al
       83F80A               cmp      eax, 10
       740A                 je       SHORT G_M64500_IG06
       B8FFFFFFFF           mov      eax, -1

G_M64500_IG05:
       4883C428             add      rsp, 40
       C3                   ret

G_M64500_IG06:
       B864000000           mov      eax, 100

G_M64500_IG07:
       4883C428             add      rsp, 40
       C3                   ret

First point of weakness seems to be that when we prune away a branch we don't try and revisit impacted PHIs or value numbers. So if we have a conditional def (which we always have for span construction initially) where one side gets pruned away, then even though the span fields are promoted they never get refined to single defs. So, no constant prop...

cc @dotnet/jit-contrib

@mikedn
Copy link
Contributor

mikedn commented Jun 20, 2019

First point of weakness seems to be that when we prune away a branch we don't try and revisit impacted PHIs or value numbers. So if we have a conditional def (which we always have for span construction initially) where one side gets pruned away, then even though the span fields are promoted they never get refined to single defs. So, no constant prop...

Mmm, maybe I should try my SCCP experiment on this...

@AndyAyersMS
Copy link
Member

Changing the Span constructor shows what we'd do if we could const prop...

    public static int Main()
    {
        byte[] b = new byte[4];
        b[3] = 100;
        X x = new X();
        Span<byte> s = new Span<byte>(b, 0, 4);
        byte r = x.Average(s);
        return (r == 10 ? 100 : -1);
    }

produces

G_M36787_IG01:
       4883EC28             sub      rsp, 40
       90                   nop

G_M36787_IG02:
       48B9187B75D217000000 mov      rcx, 0x17D2757B18
       BA04000000           mov      edx, 4
       E8478EFEFF           call     CORINFO_HELP_NEWARR_1_VC
       C6401364             mov      byte  ptr [rax+19], 100
       4883C010             add      rax, 16
       33D2                 xor      edx, edx
       33C9                 xor      ecx, ecx

G_M36787_IG03:
       4C63C1               movsxd   r8, ecx
       460FB60400           movzx    r8, byte  ptr [rax+r8]
       4103D0               add      edx, r8d
       FFC1                 inc      ecx
       83F904               cmp      ecx, 4
       7CEE                 jl       SHORT G_M36787_IG03

G_M36787_IG04:
       8BC2                 mov      eax, edx
       C1F81F               sar      eax, 31
       83E003               and      eax, 3
       03C2                 add      eax, edx
       C1F802               sar      eax, 2
       0FB6C0               movzx    rax, al
       83F80A               cmp      eax, 10
       740A                 je       SHORT G_M36787_IG06
       B8FFFFFFFF           mov      eax, -1

G_M36787_IG05:
       4883C428             add      rsp, 40
       C3                   ret

G_M36787_IG06:
       B864000000           mov      eax, 100

G_M36787_IG07:
       4883C428             add      rsp, 40
       C3                   ret

@mikedn
Copy link
Contributor

mikedn commented Jun 21, 2019

Mmm, maybe I should try my SCCP experiment on this...

With mikedn/coreclr@7872aa7

--- D:\Projects\jit\span_base.asm	2019-06-21 08:32:51.000000000 +-0300
+++ D:\Projects\jit\span_diff.asm	2019-06-21 08:33:51.000000000 +-0300
@@ -35,30 +35,29 @@
 G_M55895_IG02:
        48B9187B17A4FE7F0000 mov      rcx, 0x7FFEA4177B18
        BA04000000           mov      edx, 4
        E807337E5F           call     CORINFO_HELP_NEWARR_1_VC
        C6401364             mov      byte  ptr [rax+19], 100
        4883C010             add      rax, 16
-       B904000000           mov      ecx, 4
        33D2                 xor      edx, edx
-       4533C0               xor      r8d, r8d
-       85C9                 test     ecx, ecx
-       7E13                 jle      SHORT G_M55895_IG04
+       33C9                 xor      ecx, ecx
 
 G_M55895_IG03:
-       4D63C8               movsxd   r9, r8d
-       460FB60C08           movzx    r9, byte  ptr [rax+r9]
-       4103D1               add      edx, r9d
-       41FFC0               inc      r8d
-       443BC1               cmp      r8d, ecx
-       7CED                 jl       SHORT G_M55895_IG03
+       4C63C1               movsxd   r8, ecx
+       460FB60400           movzx    r8, byte  ptr [rax+r8]
+       4103D0               add      edx, r8d
+       FFC1                 inc      ecx
+       83F904               cmp      ecx, 4
+       7CEE                 jl       SHORT G_M55895_IG03
 
 G_M55895_IG04:
        8BC2                 mov      eax, edx
-       99                   cdq      
-       F7F9                 idiv     edx:eax, ecx
+       C1F81F               sar      eax, 31
+       83E003               and      eax, 3
+       03C2                 add      eax, edx
+       C1F802               sar      eax, 2
        0FB6C0               movzx    rax, al
        83F80A               cmp      eax, 10
        740A                 je       SHORT G_M55895_IG06
        B8FFFFFFFF           mov      eax, -1
 
 G_M55895_IG05:
@@ -69,7 +68,7 @@
        B864000000           mov      eax, 100
 
 G_M55895_IG07:
        4883C428             add      rsp, 40
        C3                   ret      
 
-; Total bytes of code 99, prolog size 5 for method Program:Test():int
+; Total bytes of code 96, prolog size 5 for method Program:Test():int

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@AndyAyersMS
Copy link
Member

cc @dotnet/jit-contrib -- a simple example where poor SysV ABI handling of structs causes a perf issue

@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@BruceForstall BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Jan 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI os-linux Linux OS (any supported distro) tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests

6 participants