
JIT: Some SIMD intrinsics operations don't reliably fold memory addressing logic #33002

Closed
GrabYourPitchforks opened this issue Feb 29, 2020 · 14 comments
Labels
area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), JitUntriaged (CLR JIT issues needing additional triage), optimization
Milestone
Future

Comments

@GrabYourPitchforks (Member) commented Feb 29, 2020

Found while looking at codegen for #32994. From SharpLab:

public unsafe static void WidenAndWrite_Foo(byte* pInput, char* pOutput, ulong offset)
{
    Vector128<byte> zero = Vector128<byte>.Zero;
    Vector128<byte> narrow = Sse2.LoadVector128(pInput + offset);

    Vector128<byte> wideLow = Sse2.UnpackLow(narrow, zero);
    Vector128<byte> wideHigh = Sse2.UnpackHigh(narrow, zero);

    Sse2.Store((byte*)(pOutput + offset), wideLow);
    Sse2.Store((byte*)(pOutput + offset + 0x08), wideHigh);
}

public unsafe static void WidenAndWrite_Bar(byte* pInput, char* pOutput, ulong offset)
{
    Vector128<byte> zero = Vector128<byte>.Zero;
    Vector128<byte> narrow = Sse2.LoadVector128(pInput + offset);

    Vector128<byte> wideLow = Sse2.UnpackLow(narrow, zero);
    Vector128<byte> wideHigh = Sse2.UnpackHigh(narrow, zero);

    Sse2.Store((byte*)(pOutput + offset), wideLow);
    Sse2.Store((byte*)pOutput + (offset << 1) + 0x10, wideHigh);
}
WidenAndWrite_Foo(Byte*, Char*, UInt64)
    L0000: vzeroupper
    L0003: vxorps xmm0, xmm0, xmm0
    L0007: vmovdqu xmm1, [rcx+r8]
    L000d: vpunpcklbw xmm2, xmm1, xmm0
    L0011: vpunpckhbw xmm0, xmm1, xmm0
    L0015: lea rax, [rdx+r8*2]
    L0019: vmovdqu [rax], xmm2
    L001d: vmovdqu [rax+0x10], xmm0
    L0022: ret

WidenAndWrite_Bar(Byte*, Char*, UInt64)
    L0000: vzeroupper
    L0003: vxorps xmm0, xmm0, xmm0
    L0007: vmovdqu xmm1, [rcx+r8]
    L000d: vpunpcklbw xmm2, xmm1, xmm0
    L0011: vpunpckhbw xmm0, xmm1, xmm0
    L0015: vmovdqu [rdx+r8*2], xmm2
    L001b: vmovdqu [rdx+r8*2+0x10], xmm0
    L0022: ret

Is there any reason why the first sample uses a lea to help compute the destination memory address, while the second sample forgoes the lea and folds the memory addressing logic directly into the vmovdqu instruction? As far as I can tell they take around the same amount of time to execute, but the first sample burns a register unnecessarily.
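
For what it's worth, the two expressions compute the same address; only the shape of the source expression differs. A quick sanity check of the arithmetic (this helper is not part of the issue, just an illustration):

public unsafe static bool SameAddress(char* pOutput, ulong offset)
{
    // char* arithmetic scales by sizeof(char) == 2, so offset + 0x08 chars
    // is the same as (offset << 1) + 0x10 bytes.
    byte* a = (byte*)(pOutput + offset + 0x08);
    byte* b = (byte*)pOutput + (offset << 1) + 0x10;
    return a == b; // always true
}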

Related: #10923

category:cq
theme:addressing-modes
skill-level:expert
cost:medium

@GrabYourPitchforks added the area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) and optimization labels Feb 29, 2020
@Dotnet-GitSync-Bot added the untriaged (New issue has not been triaged by the area owner) label Feb 29, 2020
@GrabYourPitchforks (Member, Author)

Actually, upon rerunning the benchmarks, it appears that there is a significant perf win for killing the lea instruction and folding the memory addressing calculation into the store instructions.

| Method   | Toolchain      | Size | Mean     | Error   | StdDev  | Ratio |
|--------- |--------------- |----- |--------- |-------- |-------- |------ |
| GetChars | latin1         | 4096 | 243.5 ns | 2.15 ns | 2.01 ns |  1.00 |
| GetChars | latin1_memaddr | 4096 | 174.3 ns | 3.09 ns | 2.58 ns |  0.72 |

This is a 30% perf win on top of the perf improvements already gained by #32994. To get this win I used the trick called out in the issue description here to ensure the memory address calculation logic gets folded into the vmovdqa instruction.

// Now run the main 1x 128-bit read + 2x 128-bit write loop.

nuint finalOffsetWhereCanIterateLoop = elementCount - SizeOfVector128;
while (currentOffset <= finalOffsetWhereCanIterateLoop)
{
    latin1Vector = Sse2.LoadVector128(pLatin1Buffer + currentOffset); // unaligned load

    Vector128<byte> low = Sse2.UnpackLow(latin1Vector, zeroVector);
    Sse2.StoreAligned((byte*)(pUtf16Buffer + currentOffset), low);

    Vector128<byte> high = Sse2.UnpackHigh(latin1Vector, zeroVector);
    Sse2.StoreAligned((byte*)pUtf16Buffer + (currentOffset << 1) + SizeOfVector128, high);

    currentOffset += SizeOfVector128;
}
; rcx := pInput (byte*)
; rdx := pOutput (char*)
; rax := currentOffset (nuint)
; xmm0 := Vector128<byte>.Zero
; r9 := last legal offset where the loop can iterate

;;; BEFORE ;;;
StartOfLoop:
vmovdqu xmm1,xmmword ptr [rcx+rax]
vpunpcklbw xmm2,xmm1,xmm0
lea     r10,[rdx+rax*2] ; <-- r10 used as temp value
vmovdqa xmmword ptr [r10],xmm2
vpunpckhbw xmm1,xmm1,xmm0
vmovdqa xmmword ptr [r10+10h],xmm1
add     rax,10h
cmp     rax,r9
jbe     StartOfLoop

;;; AFTER ;;;
StartOfLoop:
vmovdqu xmm1,xmmword ptr [rcx+rax]
vpunpcklbw xmm2,xmm1,xmm0
vmovdqa xmmword ptr [rdx+rax*2],xmm2
vpunpckhbw xmm1,xmm1,xmm0
vmovdqa xmmword ptr [rdx+rax*2+10h],xmm1
add     rax,10h
cmp     rax,r9
jbe     StartOfLoop
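
For anyone wanting to reproduce a comparison like this, here is a minimal BenchmarkDotNet-style sketch. It is not the benchmark used for the table above (that one compared two coreclr builds via the Toolchain column); the class name, method names, and buffer size are illustrative assumptions, and it simply pits the two address-expression shapes from the issue description against each other in a single build. Note that a newer JIT may already fold both forms, per the later comments in this thread.

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
using BenchmarkDotNet.Attributes;

public unsafe class WidenLoopAddressing
{
    private const int ByteCount = 4096;
    private readonly byte[] _input = new byte[ByteCount];
    private readonly char[] _output = new char[ByteCount];

    [Benchmark(Baseline = true)]
    public void LeaForm() // addresses written as in WidenAndWrite_Foo
    {
        fixed (byte* pIn = _input)
        fixed (char* pOut = _output)
        {
            Vector128<byte> zero = Vector128<byte>.Zero;
            for (nuint offset = 0; offset < (nuint)ByteCount; offset += 16)
            {
                Vector128<byte> narrow = Sse2.LoadVector128(pIn + offset);
                Sse2.Store((byte*)(pOut + offset), Sse2.UnpackLow(narrow, zero));
                Sse2.Store((byte*)(pOut + offset + 8), Sse2.UnpackHigh(narrow, zero));
            }
        }
    }

    [Benchmark]
    public void FoldedForm() // addresses written as in WidenAndWrite_Bar
    {
        fixed (byte* pIn = _input)
        fixed (char* pOut = _output)
        {
            Vector128<byte> zero = Vector128<byte>.Zero;
            for (nuint offset = 0; offset < (nuint)ByteCount; offset += 16)
            {
                Vector128<byte> narrow = Sse2.LoadVector128(pIn + offset);
                Sse2.Store((byte*)(pOut + offset), Sse2.UnpackLow(narrow, zero));
                Sse2.Store((byte*)pOut + (offset << 1) + 16, Sse2.UnpackHigh(narrow, zero));
            }
        }
    }
}

The class can be run through the usual BenchmarkDotNet harness, e.g. BenchmarkRunner.Run<WidenLoopAddressing>() from a Release build.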

@BruceForstall (Member)

@tannergooding @CarolEidt

@BruceForstall (Member)

@briansull Can you investigate this to help understand what is going on in the JIT and what we might do about it? @AndyAyersMS thinks a forward substitution system might be appropriate to fix/improve some of this.

@briansull (Contributor)

OK, I will take a look.

@briansull (Contributor)

The lea instruction is the result of a CSE optimization: the JIT sees the shared address expression as a good CSE opportunity. And it is not that bad a guess - the code size with the CSE is no bigger. Unfortunately, the extra lea seems to have a disproportionate performance impact due to micro-architectural reasons.
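
To make that concrete, here is a source-level sketch of what the CSE effectively does to WidenAndWrite_Foo from the issue description (not actual JIT IR; the cse0 temp name is invented):

public unsafe static void WidenAndWrite_Foo_AfterCse(byte* pInput, char* pOutput, ulong offset)
{
    Vector128<byte> zero = Vector128<byte>.Zero;
    Vector128<byte> narrow = Sse2.LoadVector128(pInput + offset);

    Vector128<byte> wideLow = Sse2.UnpackLow(narrow, zero);
    Vector128<byte> wideHigh = Sse2.UnpackHigh(narrow, zero);

    // Both stores share the subexpression (pOutput + offset), so the JIT hoists
    // it into a temporary; that temporary is what materializes as the lea.
    char* cse0 = pOutput + offset;             // lea rax, [rdx+r8*2]
    Sse2.Store((byte*)cse0, wideLow);          // vmovdqu [rax], xmm2
    Sse2.Store((byte*)cse0 + 0x10, wideHigh);  // vmovdqu [rax+0x10], xmm0
}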

@BruceForstall (Member)

@briansull Does CSE consider whether the only uses of the CSE could also be embedded as address modes, and weigh the CSE creation logic accordingly? Perhaps this is a phase ordering issue, where address mode creation should "undo" the CSE?

What do you suggest we do here?

@AndyAyersMS (Member)

Seems like user-written address modes are a weak area? JIT-generated ones might be blocked via "don't CSE" flags, but user-written ones might not have this protection.

If we're not breaking up these trees initially, perhaps we could add this flag somewhere.

@briansull (Contributor)

It is a tough problem to get this right in all cases.
If we ever add an "undo CSE" phase, we could reverse a particular CSE when we see a profitable case.

@BruceForstall removed the untriaged (New issue has not been triaged by the area owner) label Apr 3, 2020
@BruceForstall added this to the Future milestone Apr 3, 2020
@BruceForstall added the JitUntriaged (CLR JIT issues needing additional triage) label Oct 28, 2020
AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Jan 13, 2022
Extend ref counting done by local morph so that we can determine
single-def single-use locals.

Add a phase that runs just after local morph that will attempt to
forward single-def single-use local defs to uses when they are in
adjacent statements.

Fix or work around issues uncovered elsewhere:
* `gtFoldExprCompare` might fold "identical" volatile subtrees
* `fgGetStubAddrArg` cannot handle complex trees
* some simd/hw operations can lose struct handles
* some calls cannot handle struct local args

Addresses dotnet#6973 and related issues. Still sorting through exactly
which ones are fixed, so list below may need revising.

Fixes dotnet#48605.
Fixes dotnet#51599.
Fixes dotnet#55472.

Improves some but not all cases in dotnet#12280 and dotnet#62064.

Does not fix dotnet#33002, dotnet#47082, or dotnet#63116; these require handling multiple
uses or bypassing statements.
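
For readers unfamiliar with the term, here is a rough source-level illustration of the forward substitution described above (the real phase operates on JIT IR, and these method and variable names are invented for the example):

// Before: a single-def, single-use local holds the address, defined in one
// statement and consumed in the adjacent one, which can hide the address
// expression from address-mode formation.
public unsafe static void StoreViaTemp(char* pOutput, ulong offset, Vector128<byte> wideLow)
{
    byte* dest = (byte*)(pOutput + offset); // single def
    Sse2.Store(dest, wideLow);              // single use, adjacent statement
}

// After forward substitution: the def is forwarded into its single use, so the
// store can fold [base + index*scale] directly.
public unsafe static void StoreAfterForwardSub(char* pOutput, ulong offset, Vector128<byte> wideLow)
{
    Sse2.Store((byte*)(pOutput + offset), wideLow);
}
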
@ghost added and then removed the in-pr (There is an active PR which will close this issue when it is merged) label Jan 13, 2022
@hez2010 (Contributor) commented Feb 4, 2022

Has this issue been closed unexpectedly?

@stephentoub (Member) commented Feb 4, 2022

Has this issue been closed unexpectedly?

Probably. @AndyAyersMS, I'm reopening this under the assumption GitHub closed it by accident due to the wording in the PR description.

@stephentoub reopened this Feb 4, 2022
@AndyAyersMS (Member)

Odd that it still remembered that original text (or something).

@hez2010 (Contributor) commented Feb 5, 2022

Odd that it still remembered that original text (or something).

It was closed because the commit message contains a "fix #issue" reference.

@xtqqczze (Contributor)

Fixed in .NET 8? (sharplab)

@tannergooding (Member)

Most of the issues should be resolved (and have been for a long while now, actually). There are likely still a few cases here and there that aren't handled at the moment, but new issues can be opened for them.

@github-actions bot locked and limited conversation to collaborators Jun 28, 2024