Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strange codegen with struct forwarding implementation to another struct #11000

Closed
Porges opened this issue Aug 29, 2018 · 4 comments
Closed

Strange codegen with struct forwarding implementation to another struct #11000

Porges opened this issue Aug 29, 2018 · 4 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Milestone

Comments

@Porges
Copy link

Porges commented Aug 29, 2018

Info

Setup: Runtime=.NET Core 2.1.3 (CoreCLR 4.6.26725.06, CoreFX 4.6.26725.05), 64bit RyuJIT
Code for repro is at this commit (I haven't reduced it to a minimal repro yet): Porges/Fastre@3a2f805 (the Accepts method is what I am looking at, and the structs are Vector128Impl and Vector256Impl.)

Description

I have two structs T1 and T2 that implement an interface. These are used via the default(T1).Func(...) method to get nice inlined code in a generic function.

The T2 struct implements all of its methods by forwarding to the T1 struct, so at the moment the behaviour is the same. I would expect the codegen to be the same in both cases, if everything in T2 inlines successfully (which it appears to). However, I'm seeing oddities.

The codegen for T1 looks fine:

M00_L06 |  
  | movzx   ecx,byte ptr [rax]
  | mov     ecx,ecx
  | movzx   edx,byte ptr [rax+1]
  | mov     edx,edx
  | movzx   r8d,byte ptr [rax+2]
  | mov     r8d,r8d
  | movzx   r9d,byte ptr [rax+3]
  | mov     r9d,r9d
  | movzx   r10d,byte ptr [rax+4]
  | mov     r10d,r10d
  | movzx   r11d,byte ptr [rax+5]
  | mov     r11d,r11d
  | movzx   ebx,byte ptr [rax+6]
  | mov     ebx,ebx
  | shl     rcx,4
  | add     rcx,rbp
  | shl     rdx,4
  | add     rdx,rbp
  | shl     r8,4
  | add     r8,rbp
  | shl     r9,4
  | add     r9,rbp
  | shl     r10,4
  | add     r10,rbp
  | shl     r11,4
  | add     r11,rbp
  | shl     rbx,4
  | add     rbx,rbp
  | vmovdqu xmm0,xmmword ptr [rcx]
  | vmovdqu xmm1,xmmword ptr [rdx]
  | vmovdqu xmm2,xmmword ptr [r8]
  | vmovdqu xmm3,xmmword ptr [r9]
  | vmovdqu xmm4,xmmword ptr [r10]
  | vmovdqu xmm5,xmmword ptr [r11]
  | vmovdqu xmm7,xmmword ptr [rbx]
  | vpshufb xmm0,xmm0,xmm6
  | vpshufb xmm1,xmm2,xmm1
  | vpshufb xmm2,xmm4,xmm3
  | vpshufb xmm3,xmm7,xmm5
  | vpshufb xmm0,xmm1,xmm0
  | vpshufb xmm1,xmm3,xmm2
  | vpshufb xmm6,xmm1,xmm0
  | add     rax,7
  | lea     rcx,[rax+6]
  | cmp     rcx,rdi
  | jb      M00_L05

However with T2 it looks like it's doing something very strange:

M00_L06 |
  | movzx   ecx,byte ptr [rax]
  | mov     ecx,ecx
  | movzx   edx,byte ptr [rax+1]
  | mov     edx,edx
  | movzx   r8d,byte ptr [rax+2]
  | mov     r8d,r8d
  | movzx   r9d,byte ptr [rax+3]
  | mov     r9d,r9d
  | movzx   r10d,byte ptr [rax+4]
  | mov     r10d,r10d
  | movzx   r11d,byte ptr [rax+5]
  | mov     r11d,r11d
  | movzx   ebx,byte ptr [rax+6]
  | mov     ebx,ebx
  | mov     byte ptr [rsp+240h],0
  | lea     r14,[rsp+240h]
  | mov     byte ptr [r14],0
  | shl     rcx,4
  | add     rcx,rbp
  | mov     byte ptr [rsp+238h],0
  | lea     r14,[rsp+238h]
  | mov     byte ptr [r14],0
  | shl     rdx,4
  | add     rdx,rbp
  | mov     byte ptr [rsp+230h],0
  | lea     r14,[rsp+230h]
  | mov     byte ptr [r14],0
  | shl     r8,4
  | add     r8,rbp
  | mov     byte ptr [rsp+228h],0
  | lea     r14,[rsp+228h]
  | mov     byte ptr [r14],0
  | shl     r9,4
  | add     r9,rbp
  | mov     byte ptr [rsp+220h],0
  | lea     r14,[rsp+220h]
  | mov     byte ptr [r14],0
  | shl     r10,4
  | add     r10,rbp
  | mov     byte ptr [rsp+218h],0
  | lea     r14,[rsp+218h]
  | mov     byte ptr [r14],0
  | shl     r11,4
  | add     r11,rbp
  | mov     byte ptr [rsp+210h],0
  | lea     r14,[rsp+210h]
  | mov     byte ptr [r14],0
  | shl     rbx,4
  | add     rbx,rbp
  | mov     byte ptr [rsp+208h],0
  | lea     r14,[rsp+208h]
  | mov     byte ptr [r14],0
  | vmovdqu xmm0,xmmword ptr [rcx]
  | mov     byte ptr [rsp+200h],0
  | lea     rcx,[rsp+200h]
  | mov     byte ptr [rcx],0
  | vmovdqu xmm1,xmmword ptr [rdx]
  | mov     byte ptr [rsp+1F8h],0
  | lea     rcx,[rsp+1F8h]
  | mov     byte ptr [rcx],0
  | vmovdqu xmm2,xmmword ptr [r8]
  | mov     byte ptr [rsp+1F0h],0
  | lea     rcx,[rsp+1F0h]
  | mov     byte ptr [rcx],0
  | vmovdqu xmm3,xmmword ptr [r9]
  | mov     byte ptr [rsp+1E8h],0
  | lea     rcx,[rsp+1E8h]
  | mov     byte ptr [rcx],0
  | vmovdqu xmm4,xmmword ptr [r10]
  | mov     byte ptr [rsp+1E0h],0
  | lea     rcx,[rsp+1E0h]
  | mov     byte ptr [rcx],0
  | vmovdqu xmm5,xmmword ptr [r11]
  | mov     byte ptr [rsp+1D8h],0
  | lea     rcx,[rsp+1D8h]
  | mov     byte ptr [rcx],0
  | vmovdqu xmm7,xmmword ptr [rbx]
  | vmovapd xmmword ptr [rsp+1C0h],xmm0
  | vmovapd xmmword ptr [rsp+1B0h],xmm6
  | mov     byte ptr [rsp+1D0h],0
  | lea     rcx,[rsp+1D0h]
  | mov     byte ptr [rcx],0
  | vmovapd xmm0,xmmword ptr [rsp+1C0h]
  | vmovapd xmm6,xmmword ptr [rsp+1B0h]
  | vpshufb xmm0,xmm0,xmm6
  | vmovapd xmmword ptr [rsp+190h],xmm2
  | vmovapd xmmword ptr [rsp+180h],xmm1
  | mov     byte ptr [rsp+1A8h],0
  | lea     rcx,[rsp+1A8h]
  | mov     byte ptr [rcx],0
  | vmovapd xmm1,xmmword ptr [rsp+190h]
  | vmovapd xmm2,xmmword ptr [rsp+180h]
  | vpshufb xmm1,xmm1,xmm2
  | vmovapd xmmword ptr [rsp+160h],xmm4
  | vmovapd xmmword ptr [rsp+150h],xmm3
  | mov     byte ptr [rsp+178h],0
  | lea     rcx,[rsp+178h]
  | mov     byte ptr [rcx],0
  | vmovapd xmm2,xmmword ptr [rsp+160h]
  | vmovapd xmm3,xmmword ptr [rsp+150h]
  | vpshufb xmm2,xmm2,xmm3
  | vmovapd xmmword ptr [rsp+130h],xmm7
  | vmovapd xmmword ptr [rsp+120h],xmm5
  | mov     byte ptr [rsp+148h],0
  | lea     rcx,[rsp+148h]
  | mov     byte ptr [rcx],0
  | vmovapd xmm3,xmmword ptr [rsp+130h]
  | vmovapd xmm4,xmmword ptr [rsp+120h]
  | vpshufb xmm3,xmm3,xmm4
  | vmovapd xmmword ptr [rsp+100h],xmm1
  | vmovapd xmmword ptr [rsp+0F0h],xmm0
  | mov     byte ptr [rsp+118h],0
  | lea     rcx,[rsp+118h]
  | mov     byte ptr [rcx],0
  | vmovapd xmm0,xmmword ptr [rsp+100h]
  | vmovapd xmm1,xmmword ptr [rsp+0F0h]
  | vpshufb xmm0,xmm0,xmm1
  | vmovapd xmmword ptr [rsp+0D0h],xmm3
  | vmovapd xmmword ptr [rsp+0C0h],xmm2
  | mov     byte ptr [rsp+0E8h],0
  | lea     rcx,[rsp+0E8h]
  | mov     byte ptr [rcx],0
  | vmovapd xmm1,xmmword ptr [rsp+0D0h]
  | vmovapd xmm2,xmmword ptr [rsp+0C0h]
  | vpshufb xmm1,xmm1,xmm2
  | vmovapd xmmword ptr [rsp+0A0h],xmm1
  | vmovapd xmmword ptr [rsp+90h],xmm0
  | mov     byte ptr [rsp+0B8h],0
  | lea     rcx,[rsp+0B8h]
  | mov     byte ptr [rcx],0
  | vmovapd xmm0,xmmword ptr [rsp+0A0h]
  | vmovapd xmm1,xmmword ptr [rsp+90h]
  | vpshufb xmm6,xmm0,xmm1
  | add     rax,7
  | lea     rcx,[rax+6]
  | cmp     rcx,rdi
  | jb      M00_L05

You can see more of the disassembly here: https://gist.github.com/Porges/334ca424055be72470a67bc6eefe43a8

category:cq
theme:structs
skill-level:expert
cost:large

@mikedn
Copy link
Contributor

mikedn commented Aug 29, 2018

As far as I can tell a simple repro looks something like this:

    struct moo
    {
        public void foo() => Console.WriteLine();
    }

    struct boo
    {
        public void foo() => default(moo).foo();
    }

    static int Test()
    {
        default(moo).foo(); // replace moo with boo => worse codegen
    }

"Normal" codegen:

G_M55887_IG01:
       4883EC28             sub      rsp, 40
       33C0                 xor      rax, rax
       4889442420           mov      qword ptr [rsp+20H], rax
G_M55887_IG02:
       488D442420           lea      rax, bword ptr [rsp+20H]
       C60000               mov      byte  ptr [rax], 0
       E810FEFFFF           call     System.Console:WriteLine()
       B82A000000           mov      eax, 42
G_M55887_IG03:
       4883C428             add      rsp, 40
       C3                   ret

Worse codegen:

G_M55888_IG01:
       4883EC38             sub      rsp, 56
       33C0                 xor      rax, rax
       4889442430           mov      qword ptr [rsp+30H], rax
       4889442428           mov      qword ptr [rsp+28H], rax
G_M55888_IG02:
       488D442430           lea      rax, bword ptr [rsp+30H]
       C60000               mov      byte  ptr [rax], 0
       C644242800           mov      byte  ptr [rsp+28H], 0
       488D442428           lea      rax, bword ptr [rsp+28H]
       C60000               mov      byte  ptr [rax], 0
       E8FEFDFFFF           call     System.Console:WriteLine()
       B82A000000           mov      eax, 42
G_M55888_IG03:
       4883C438             add      rsp, 56
       C3                   ret

Unfortunately this type of issues are rather common, the JIT's handling of structs was and still is rather limited.

@AndyAyersMS
Copy link
Member

cc @CarolEidt

Maybe another good case for the first class structs work.

@mikedn
Copy link
Contributor

mikedn commented Sep 1, 2018

Hmm, I'm not sure if first class structs work will help here. The IL generated for something like default(S).M() is:

IL_0000  12 00             ldloca.s     0x0
IL_0002  25                dup
IL_0003  fe 15 07 00 00 02 initobj      0x2000007
IL_0009  28 08 00 00 06    call         0x6000008

The address of the local gets spilled because of the dup and you end up with an address exposed variable and that's a bit of a dead end.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@jakobbotsch
Copy link
Member

This problem was most likely fixed by some combination of #64171, #72714 and #79341. Please feel free to reopen if you still see codegen deficiencies, it is a bit hard to check without a minimal repro.

@ghost ghost locked as resolved and limited conversation to collaborators Jan 11, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

No branches or pull requests

5 participants