Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Span-perf for copy via loop much slower than array-version #9977

Open
gfoidl opened this issue Mar 20, 2018 · 22 comments
Open

Span-perf for copy via loop much slower than array-version #9977

gfoidl opened this issue Mar 20, 2018 · 22 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI optimization
Milestone

Comments

@gfoidl
Copy link
Member

gfoidl commented Mar 20, 2018

Description

In a simple copy-loop Span is a lot slower than an array-version.
I'd expect that Span and array have similar perf.

Note: Span_CopyTo is just for reference included.

Benchmark

Results

BenchmarkDotNet=v0.10.13, OS=Windows 10 Redstone 3 [1709, Fall Creators Update] (10.0.16299.309)
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical cores and 4 physical cores
Frequency=2742189 Hz, Resolution=364.6722 ns, Timer=TSC
.NET Core SDK=2.1.300-preview3-008384
  [Host]     : .NET Core 2.1.0-preview2-26313-01 (CoreCLR 4.6.26310.01, CoreFX 4.6.26313.01), 64bit RyuJIT
  DefaultJob : .NET Core 2.1.0-preview2-26313-01 (CoreCLR 4.6.26310.01, CoreFX 4.6.26313.01), 64bit RyuJIT

Method Mean Error StdDev Scaled
Array 4.435 us 0.0400 us 0.0374 us 1.00
Span 6.885 us 0.0307 us 0.0256 us 1.55
Span_CopyTo 1.132 us 0.0072 us 0.0067 us 0.26

Code

public class Benchmarks
{
    private readonly int[] _source;
    private readonly int[] _destination;

    public Benchmarks()
    {
        _source = Enumerable.Range(0, 10_000).ToArray();
        _destination = new int[_source.Length];
    }

    [Benchmark(Baseline = true)]
    public void Array()
    {
        Copy(_source, _destination);
    }

    [Benchmark]
    public void Span()
    {
        Copy(_source.AsSpan(), _destination.AsSpan());
    }

    [Benchmark]
    public void Span_CopyTo()
    {
        _source.AsSpan().CopyTo(_destination.AsSpan());
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void Copy(int[] src, int[] dst)
    {
        for (int i = 0; i < src.Length; ++i)
            dst[i] = src[i];
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void Copy(ReadOnlySpan<int> src, Span<int> dst)
    {
        for (int i = 0; i < src.Length; ++i)
            dst[i] = src[i];
    }
}

Disassembly

Note: The JitDisasm is produced of fresh coreclr-build on ubuntu.

dotnet --info
.NET Core SDK (reflecting any global.json):
 Version:   2.1.300-preview3-008384
 Commit:    4343118151

Runtime Environment:
 OS Name:     ubuntu
 OS Version:  16.04
 OS Platform: Linux
 RID:         ubuntu.16.04-x64
 Base Path:   /usr/share/dotnet/sdk/2.1.300-preview3-008384/

Host (useful for support):
  Version: 2.1.0-preview3-26319-04
  Commit:  939333dbc8

.NET Core SDKs installed:
  2.1.4 [/usr/share/dotnet/sdk]
  2.1.300-preview3-008384 [/usr/share/dotnet/sdk]

.NET Core runtimes installed:
  Microsoft.AspNetCore.All 2.1.0-preview2-30338 [/usr/share/dotnet/shared/Microsoft.AspNetCore.All]
  Microsoft.AspNetCore.App 2.1.0-preview2-30338 [/usr/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 2.0.5 [/usr/share/dotnet/shared/Microsoft.NETCore.App]
  Microsoft.NETCore.App 2.1.0-preview2-26313-01 [/usr/share/dotnet/shared/Microsoft.NETCore.App]
  Microsoft.NETCore.App 2.1.0-preview3-26319-04 [/usr/share/dotnet/shared/Microsoft.NETCore.App]

Also only the loops are shown.

Array-variant

G_M16768_IG04:
       4863C8               movsxd   rcx, eax
       448B448F10           mov      r8d, dword ptr [rdi+4*rcx+16]
       4489448E10           mov      dword ptr [rsi+4*rcx+16], r8d
       FFC0                 inc      eax
       3BD0                 cmp      edx, eax
       7FED                 jg       SHORT G_M16768_IG04

Span-variant

G_M16768_IG03:
       413BDE               cmp      ebx, r14d
       732D                 jae      SHORT G_M16768_IG05
       4863FB               movsxd   rdi, ebx
       3B5DE0               cmp      ebx, dword ptr [rbp-20H]
       7325                 jae      SHORT G_M16768_IG05
       488B45D8             mov      rax, bword ptr [rbp-28H]
       8B04B8               mov      eax, dword ptr [rax+4*rdi]
       418904BF             mov      dword ptr [r15+4*rdi], eax
       FFC3                 inc      ebx
       488D7DD8             lea      rdi, bword ptr [rbp-28H]
       E8278F5CFF           call     System.ReadOnlySpan`1[Int32][System.Int32]:get_Length():int:this
       3BC3                 cmp      eax, ebx
       7FD9                 jg       SHORT G_M16768_IG03

category:cq
theme:loop-opt
skill-level:expert
cost:medium

@gfoidl
Copy link
Member Author

gfoidl commented Mar 20, 2018

/cc @AndyAyersMS please have a look on this

@AndyAyersMS
Copy link
Member

Seems quite odd that get_Length is not inlined in the span variant. Can you show the first few lines of the jit-disasm that describe what kind of code the jit was creating?

@gfoidl
Copy link
Member Author

gfoidl commented Mar 20, 2018

Here is the whole dasm of this method:

; Assembly listing for method ConsoleApp1.Benchmarks:Copy(struct,struct)
; Emitting BLENDED_CODE for X64 CPU with AVX
; optimized code
; rbp based frame
; fully interruptible
; Final local variable assignments
;
;  V00 arg0         [V00    ] ( 15, 48   )  struct (16) [rbp-0x28]   do-not-enreg[XSF] addr-exposed ld-addr-op
;  V01 arg1         [V01,T03] (  9, 18   )  struct (16) [rbp-0x38]   do-not-enreg[SF] ld-addr-op
;  V02 loc0         [V02,T00] ( 22, 85   )     int  ->  rbx        
;# V03 OutArgs      [V03    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]  
;  V04 cse0         [V04,T02] (  8, 23   )     int  ->  r14        
;  V05 cse1         [V05,T04] (  7, 19   )   byref  ->  r15        
;  V06 cse2         [V06,T01] (  8, 32   )    long  ->  rdi        
;
; Lcl frame size = 40

G_M16768_IG01:
       55                   push     rbp
       4157                 push     r15
       4156                 push     r14
       53                   push     rbx
       4883EC28             sub      rsp, 40
       488D6C2440           lea      rbp, [rsp+40H]
       48897DD8             mov      bword ptr [rbp-28H], rdi
       488975E0             mov      qword ptr [rbp-20H], rsi
       488955C8             mov      bword ptr [rbp-38H], rdx
       48894DD0             mov      qword ptr [rbp-30H], rcx

G_M16768_IG02:
       33DB                 xor      ebx, ebx
       488D7DD8             lea      rdi, bword ptr [rbp-28H]
       E8C68F5CFF           call     System.ReadOnlySpan`1[Int32][System.Int32]:get_Length():int:this
       85C0                 test     eax, eax
       7E2F                 jle      SHORT G_M16768_IG04
       448B75D0             mov      r14d, dword ptr [rbp-30H]
       4C8B7DC8             mov      r15, bword ptr [rbp-38H]

G_M16768_IG03:
       413BDE               cmp      ebx, r14d
       732D                 jae      SHORT G_M16768_IG05
       4863FB               movsxd   rdi, ebx
       3B5DE0               cmp      ebx, dword ptr [rbp-20H]
       7325                 jae      SHORT G_M16768_IG05
       488B45D8             mov      rax, bword ptr [rbp-28H]
       8B04B8               mov      eax, dword ptr [rax+4*rdi]
       418904BF             mov      dword ptr [r15+4*rdi], eax
       FFC3                 inc      ebx
       488D7DD8             lea      rdi, bword ptr [rbp-28H]
       E8978F5CFF           call     System.ReadOnlySpan`1[Int32][System.Int32]:get_Length():int:this
       3BC3                 cmp      eax, ebx
       7FD9                 jg       SHORT G_M16768_IG03

G_M16768_IG04:
       488D65E8             lea      rsp, [rbp-18H]
       5B                   pop      rbx
       415E                 pop      r14
       415F                 pop      r15
       5D                   pop      rbp
       C3                   ret      

G_M16768_IG05:
       E853728178           call     CORINFO_HELP_RNGCHKFAIL
       CC                   int3     

; Total bytes of code 110, prolog size 31 for method ConsoleApp1.Benchmarks:Copy(struct,struct)
; ============================================================

Edit: I've uploaded the whole jitdump to https://1drv.ms/t/s!Aq_kXFBmTtM2hFNDRqUcEsqNFGcA

@AndyAyersMS
Copy link
Member

Since you have the ability to look at jit info, what reason does the jit give for rejecting the inline of get_Length (say via COMPlus_JitPrintInlinedMethods=1) ?

@mikedn
Copy link
Contributor

mikedn commented Mar 20, 2018

Could it be that this is Unix specific? On Windows this works as expected.

@gfoidl
Copy link
Member Author

gfoidl commented Mar 20, 2018

Inlines into 0600001F ConsoleApp1.Benchmarks:Copy(struct,struct)
  [0 IL=0029 TR=000009 060015C2] [FAILED: not inline candidate] System.ReadOnlySpan`1[Int32][System.Int32]:get_Length():int:this
Budget: initialTime=171, finalTime=171, initialBudget=1710, currentBudget=1710
Budget: initialSize=974, finalSize=974

I can't verify the dasm on Windows.
But the benchmark numbers are done on Windows 10.
On ubuntu I get similar results.

When I change ReadOnlySpan<int> to Span<int> it's analog.

@mikedn
Copy link
Contributor

mikedn commented Mar 20, 2018

Well, when I try this the Copy method generates

G_M55887_IG01:
       4883EC28             sub      rsp, 40

G_M55887_IG02:
       488B01               mov      rax, bword ptr [rcx]
       8B4908               mov      ecx, dword ptr [rcx+8]
       4533C0               xor      r8d, r8d
       85C9                 test     ecx, ecx
       7E1C                 jle      SHORT G_M55887_IG04

G_M55887_IG03:
       443B4208             cmp      r8d, dword ptr [rdx+8]
       731B                 jae      SHORT G_M55887_IG05
       4C8B0A               mov      r9, bword ptr [rdx]
       4D63D0               movsxd   r10, r8d
       468B1C90             mov      r11d, dword ptr [rax+4*r10]
       47891C91             mov      dword ptr [r9+4*r10], r11d
       41FFC0               inc      r8d
       443BC1               cmp      r8d, ecx
       7CE4                 jl       SHORT G_M55887_IG03

G_M55887_IG04:
       4883C428             add      rsp, 40
       C3                   ret

G_M55887_IG05:
       E859F0365F           call     CORINFO_HELP_RNGCHKFAIL
       CC                   int3

@AndyAyersMS
Copy link
Member

@gfoidl What sort of build are you using to get the disassembly? If you use Debug you can get misleading results. For instance the core library is built differently.

@mikedn
Copy link
Contributor

mikedn commented Mar 20, 2018

Hmm, the array-variant assembly you have shown is a bit misleading, in that case loop cloning kicks in and you show only the loop version without range checks. I don't think loop cloning ever kicks in for span range checks so that version will always be slightly slower even without the weird Length inlining failure

@gfoidl
Copy link
Member Author

gfoidl commented Mar 20, 2018

@mikedn thx for your dasm in https://github.com/dotnet/coreclr/issues/17065#issuecomment-374705550
When I view the disassembly in VS while debugging (Release + suppress JIT optimizations unchecked, ...) then I'll see equal code (Windows)

@AndyAyersMS I followed the steps described in Viewing JIT Dumps. Built from dotnet/coreclr@ff2abe5
If this is the wrong way, please point me to the adequate way.

Even if the dasm looks quite good for Windows, the perf-numbers aren't.

@AndyAyersMS
Copy link
Member

Best practice would be to locally build both Release and Debug. Then use the release bits to overwrite the DLLs. Then overwrite again with just the Debug-built jit DLL so you can enable disasm and dumps and such.

So you end up with all the copied bits from the same build, and everything copied is built Release except for the jit, which is Debug.

@gfoidl
Copy link
Member Author

gfoidl commented Mar 20, 2018

Thanks for the info! I'll try this.

@mikedn
Copy link
Contributor

mikedn commented Mar 20, 2018

Even if the dasm looks quite good for Windows, the perf-numbers aren't.

Yep, I can reproduce your numbers on my machine with the assembly I posted above. Whatever happens on Linux that causes Length not to be inlined is a separate issue.

Best I can tell the reason why the span version is slower is a combination of lack of loop cloning and lack of hoisting of destination span field loads. This can be seen if you try the following version:

for (int i = 0; i < src.Length && i < dst.Length; ++i)
    dst[i] = src[i];

that generates

G_M55888_IG02:
       488B02               mov      rax, bword ptr [rdx]
       8B5208               mov      edx, dword ptr [rdx+8]
       4C8B01               mov      r8, bword ptr [rcx]
       8B4908               mov      ecx, dword ptr [rcx+8]
       4533C9               xor      r9d, r9d
       85C9                 test     ecx, ecx
       7E1A                 jle      SHORT G_M55888_IG05
       EB13                 jmp      SHORT G_M55888_IG04
G_M55888_IG03:
       4D63D1               movsxd   r10, r9d
       478B1C90             mov      r11d, dword ptr [r8+4*r10]
       46891C90             mov      dword ptr [rax+4*r10], r11d
       41FFC1               inc      r9d
       443BC9               cmp      r9d, ecx
       7D05                 jge      SHORT G_M55888_IG05
G_M55888_IG04:
       443BCA               cmp      r9d, edx
       7CE8                 jl       SHORT G_M55888_IG03
G_M55888_IG05:
       C3                   ret

This does not have range checks and no in-loop loads of span fields. It does have an extra compare so it's still a bit slower than the array version but quite a bit faster than the original version.

@AndyAyersMS
Copy link
Member

I take it we don't promote the destination span and so probably think the writes can alias the span fields?

I have toyed with making the implicit byref promotion more aggressive....

@mikedn
Copy link
Contributor

mikedn commented Mar 20, 2018

I take it we don't promote the destination span and so probably think the writes can alias the span fields?

Ah, right. It looks like we do promote but then the local variable associated with the second argument has zero references so I guess promotion was undone in that case. I'll dig a bit more.

@AndyAyersMS
Copy link
Member

Right, we promote and then undo if we don't think it has sufficient benefit.

Logic is in fgRetypeImplicitByRefArgs. At the very least that undo decision should be made more obvious in the dumps.

@mikedn
Copy link
Contributor

mikedn commented Mar 20, 2018

Yeah, it's due to the varDsc->lvRefCnt <= varDsc->lvFieldCnt check in fgRetypeImplicitByRefArgs. Span has only 2 fields so if the ref count is 2 like in the case of dst (one ref to get the span pointer and one ref to get the span length for range check) then promotion is undone. The workaround I did above simply added more references so promotion succeeds.

At the very least that undo decision should be made more obvious in the dumps.

Indeed.

I have toyed with making the implicit byref promotion more aggressive....

Is this really the right way to handle this? I suppose this is just the cheap way, the right way is probably to rely on CSE to remove redundant loads from such arguments. But then that would require tracking memory stores more accurately, even if the memory for implicit by ref args is not aliased you still need to figure out when these arguments are modified inside the method.

@gfoidl
Copy link
Member Author

gfoidl commented Mar 20, 2018

Side note: when I view the JIT dasm on Linux as described by Andy in https://github.com/dotnet/coreclr/issues/17065#issuecomment-374712572 get_Length is inlined, and the dasm looks quite similar to the windows' one. Thanks again for pointing me in the rigth direction.

@AndyAyersMS
Copy link
Member

@gfoidl that's good to hear. We should probably update that document since I suspect most people who want to look at jit disassembly and behavior are interested in what happens when code is optimized.

So we're left with the fact that the span version has an extra load (from the promote-unpromote dance we do with dst) and an extra range check for dst (since we don't clone for spans because their indexed accesses probably don't match the patterns the cloner regcognizes). And that extra overhead explains the perf difference.

I'm not a big fan of the cloner but in a case like this it actually seems to do something useful. I suppose if we can get its excesses under control, we can think about generalizing the recognizer to handle spans.

As for more aggressive promotion (or less aggressive undoing) I would like to wait until we have plausible block weight estimates available (something I hope to get to fairly soon) and use that to spot cases where one of the refs is in a loop.

@gfoidl
Copy link
Member Author

gfoidl commented Mar 20, 2018

Thanks for the insights 👍

We should probably update that document

I'll try to update it and send an PR.

@mikedn
Copy link
Contributor

mikedn commented Mar 21, 2018

the right way is probably to rely on CSE to remove redundant loads from such arguments.

Which actually works pretty well already ... until you have a store through a byref that the JIT can't figure out that it doesn't alias the implicit byref argument:

struct LargeStruct
{
    public int a, b, c, d, e, f, g, h;
}
[MethodImpl(MethodImplOptions.NoInlining)]
static int Test(ref int mem, LargeStruct ss1, LargeStruct ss2)
{
    int s = 0;
    for (int i = 0; i < ss1.b; i++)
    {
        s += ss1.a * ss2.c / ss1.a;
        //mem += ss.c;
    }
    return s;
}

generates

G_M55886_IG02:
       33C9                 xor      ecx, ecx
       4533C9               xor      r9d, r9d
       448B5204             mov      r10d, dword ptr [rdx+4]
       4585D2               test     r10d, r10d
       7E19                 jle      SHORT G_M55886_IG04
       448B1A               mov      r11d, dword ptr [rdx]
       418BC3               mov      eax, r11d
       410FAF4008           imul     eax, dword ptr [r8+8]
       99                   cdq
       41F7FB               idiv     edx:eax, r11d

G_M55886_IG03:
       03C8                 add      ecx, eax
       41FFC1               inc      r9d
       453BCA               cmp      r9d, r10d
       7CF6                 jl       SHORT G_M55886_IG03

All the loads and invariant expression have been hoisted out of the loop. If the mem += ss.c line is uncommented it all goes down the drain even if mem can't point to one of the struct arguments.

But of course, even if JIT alias analysis would be improved to handle such cases some things would still not work as well as they do when struct promotion is done (e.g. loop cloning runs before VN & CSE).

AndyAyersMS referenced this issue in dotnet/coreclr Mar 21, 2018
If one follows the current described steps, one won't see the JIT dump for optimized code in the core lib.
This PR adds the necessary step to see optimized code.

Cf. https://github.com/dotnet/coreclr/issues/17065#issuecomment-374772011
@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@BruceForstall BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Jan 15, 2021
@colgreen
Copy link
Contributor

For the record, this is still an issue in .NET 6. Here's the Benchmark output from my PC:

Method Mean Error StdDev Ratio RatioSD
Array 2,398.1 ns 4.28 ns 4.00 ns 1.00 0.00
Span 4,895.2 ns 53.04 ns 49.62 ns 2.04 0.02
Span_CopyTo 648.6 ns 1.10 ns 0.97 ns 0.27 0.00

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.1645 (21H2)
AMD Ryzen 7 PRO 5750GE with Radeon Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK=6.0.202
[Host] : .NET 6.0.4 (6.0.422.16404), X64 RyuJIT
DefaultJob : .NET 6.0.4 (6.0.422.16404), X64 RyuJIT

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI optimization
Projects
None yet
Development

No branches or pull requests

7 participants