
JIT: Some SIMD intrinsics operations don't reliably fold memory addressing logic #33002

Closed
GrabYourPitchforks opened this issue Feb 29, 2020 · 14 comments
Labels
area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), JitUntriaged (CLR JIT issues needing additional triage), optimization
Milestone
Future

Comments

@GrabYourPitchforks (Member) commented Feb 29, 2020

Found while looking at codegen for #32994. From SharpLab:

public unsafe static void WidenAndWrite_Foo(byte* pInput, char* pOutput, ulong offset)
{
    Vector128<byte> zero = Vector128<byte>.Zero;
    Vector128<byte> narrow = Sse2.LoadVector128(pInput + offset);

    Vector128<byte> wideLow = Sse2.UnpackLow(narrow, zero);
    Vector128<byte> wideHigh = Sse2.UnpackHigh(narrow, zero);

    Sse2.Store((byte*)(pOutput + offset), wideLow);
    Sse2.Store((byte*)(pOutput + offset + 0x08), wideHigh);
}

public unsafe static void WidenAndWrite_Bar(byte* pInput, char* pOutput, ulong offset)
{
    Vector128<byte> zero = Vector128<byte>.Zero;
    Vector128<byte> narrow = Sse2.LoadVector128(pInput + offset);

    Vector128<byte> wideLow = Sse2.UnpackLow(narrow, zero);
    Vector128<byte> wideHigh = Sse2.UnpackHigh(narrow, zero);

    Sse2.Store((byte*)(pOutput + offset), wideLow);
    Sse2.Store((byte*)pOutput + (offset << 1) + 0x10, wideHigh);
}
WidenAndWrite_Foo(Byte*, Char*, UInt64)
    L0000: vzeroupper
    L0003: vxorps xmm0, xmm0, xmm0
    L0007: vmovdqu xmm1, [rcx+r8]
    L000d: vpunpcklbw xmm2, xmm1, xmm0
    L0011: vpunpckhbw xmm0, xmm1, xmm0
    L0015: lea rax, [rdx+r8*2]
    L0019: vmovdqu [rax], xmm2
    L001d: vmovdqu [rax+0x10], xmm0
    L0022: ret

WidenAndWrite_Bar(Byte*, Char*, UInt64)
    L0000: vzeroupper
    L0003: vxorps xmm0, xmm0, xmm0
    L0007: vmovdqu xmm1, [rcx+r8]
    L000d: vpunpcklbw xmm2, xmm1, xmm0
    L0011: vpunpckhbw xmm0, xmm1, xmm0
    L0015: vmovdqu [rdx+r8*2], xmm2
    L001b: vmovdqu [rdx+r8*2+0x10], xmm0
    L0022: ret

Is there any reason why the first sample uses a lea to help compute the destination memory address, while the second sample forgoes the lea and folds the memory addressing logic directly into the vmovdqu instruction? As far as I can tell they take around the same amount of time to execute, but the first sample burns a register unnecessarily.
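
For what it's worth, the two expressions compute the same address; only the shape of the source expression differs. A quick sanity check of the arithmetic (this helper is not part of the issue, just an illustration):

public unsafe static bool SameAddress(char* pOutput, ulong offset)
{
    // char* arithmetic scales by sizeof(char) == 2, so offset + 0x08 chars
    // is the same as (offset << 1) + 0x10 bytes.
    byte* a = (byte*)(pOutput + offset + 0x08);
    byte* b = (byte*)pOutput + (offset << 1) + 0x10;
    return a == b; // always true
}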

Related: #10923

category:cq
theme:addressing-modes
skill-level:expert
cost:medium

@GrabYourPitchforks added the area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) and optimization labels Feb 29, 2020
@Dotnet-GitSync-Bot added the untriaged (New issue has not been triaged by the area owner) label Feb 29, 2020
@GrabYourPitchforks (Member, Author)

Actually, upon rerunning the benchmarks, it appears that there is a significant perf win for killing the lea instruction and folding the memory addressing calculation into the store instructions.

| Method   | Toolchain      | Size | Mean     | Error   | StdDev  | Ratio |
|--------- |--------------- |----- |--------- |-------- |-------- |------ |
| GetChars | latin1         | 4096 | 243.5 ns | 2.15 ns | 2.01 ns |  1.00 |
| GetChars | latin1_memaddr | 4096 | 174.3 ns | 3.09 ns | 2.58 ns |  0.72 |

This is a 30% perf win on top of the perf improvements already gained by #32994. To get this win I used the trick called out in the issue description here to ensure the memory address calculation logic gets folded into the vmovdqa instruction.

// Now run the main 1x 128-bit read + 2x 128-bit write loop.

nuint finalOffsetWhereCanIterateLoop = elementCount - SizeOfVector128;
while (currentOffset <= finalOffsetWhereCanIterateLoop)
{
    latin1Vector = Sse2.LoadVector128(pLatin1Buffer + currentOffset); // unaligned load

    Vector128<byte> low = Sse2.UnpackLow(latin1Vector, zeroVector);
    Sse2.StoreAligned((byte*)(pUtf16Buffer + currentOffset), low);

    Vector128<byte> high = Sse2.UnpackHigh(latin1Vector, zeroVector);
    Sse2.StoreAligned((byte*)pUtf16Buffer + (currentOffset << 1) + SizeOfVector128, high);

    currentOffset += SizeOfVector128;
}
; rcx := pInput (byte*)
; rdx := pOutput (char*)
; rax := currentOffset (nuint)
; xmm0 := Vector128<byte>.Zero
; r9 := last legal offset where the loop can iterate

;;; BEFORE ;;;
StartOfLoop:
vmovdqu xmm1,xmmword ptr [rcx+rax]
vpunpcklbw xmm2,xmm1,xmm0
lea     r10,[rdx+rax*2] ; <-- r10 used as temp value
vmovdqa xmmword ptr [r10],xmm2
vpunpckhbw xmm1,xmm1,xmm0
vmovdqa xmmword ptr [r10+10h],xmm1
add     rax,10h
cmp     rax,r9
jbe     StartOfLoop

;;; AFTER ;;;
StartOfLoop:
vmovdqu xmm1,xmmword ptr [rcx+rax]
vpunpcklbw xmm2,xmm1,xmm0
vmovdqa xmmword ptr [rdx+rax*2],xmm2
vpunpckhbw xmm1,xmm1,xmm0
vmovdqa xmmword ptr [rdx+rax*2+10h],xmm1
add     rax,10h
cmp     rax,r9
jbe     StartOfLoop
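
For anyone wanting to reproduce a comparison like this, here is a minimal BenchmarkDotNet-style sketch. It is not the benchmark used for the table above (that one compared two coreclr builds via the Toolchain column); the class name, method names, and buffer size are illustrative assumptions, and it simply pits the two address-expression shapes from the issue description against each other in a single build. Note that a newer JIT may already fold both forms, per the later comments in this thread.

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
using BenchmarkDotNet.Attributes;

public unsafe class WidenLoopAddressing
{
    private const int ByteCount = 4096;
    private readonly byte[] _input = new byte[ByteCount];
    private readonly char[] _output = new char[ByteCount];

    [Benchmark(Baseline = true)]
    public void LeaForm() // addresses written as in WidenAndWrite_Foo
    {
        fixed (byte* pIn = _input)
        fixed (char* pOut = _output)
        {
            Vector128<byte> zero = Vector128<byte>.Zero;
            for (nuint offset = 0; offset < (nuint)ByteCount; offset += 16)
            {
                Vector128<byte> narrow = Sse2.LoadVector128(pIn + offset);
                Sse2.Store((byte*)(pOut + offset), Sse2.UnpackLow(narrow, zero));
                Sse2.Store((byte*)(pOut + offset + 8), Sse2.UnpackHigh(narrow, zero));
            }
        }
    }

    [Benchmark]
    public void FoldedForm() // addresses written as in WidenAndWrite_Bar
    {
        fixed (byte* pIn = _input)
        fixed (char* pOut = _output)
        {
            Vector128<byte> zero = Vector128<byte>.Zero;
            for (nuint offset = 0; offset < (nuint)ByteCount; offset += 16)
            {
                Vector128<byte> narrow = Sse2.LoadVector128(pIn + offset);
                Sse2.Store((byte*)(pOut + offset), Sse2.UnpackLow(narrow, zero));
                Sse2.Store((byte*)pOut + (offset << 1) + 16, Sse2.UnpackHigh(narrow, zero));
            }
        }
    }
}

The class can be run through the usual BenchmarkDotNet harness, e.g. BenchmarkRunner.Run<WidenLoopAddressing>() from a Release build.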

@BruceForstall (Member)

@tannergooding @CarolEidt

@BruceForstall (Member)

@briansull Can you investigate this to help understand what is going on in the JIT and what we might do about it? @AndyAyersMS thinks a forward substitution system might be appropriate to fix/improve some of this.

@briansull (Contributor)

OK, I will take a look.

@briansull (Contributor)

The lea instruction is the result of a CSE optimization: the JIT sees the shared address expression as a good CSE opportunity. And it is not that bad a guess - the code size with the CSE is no bigger. Unfortunately, the extra lea seems to have a disproportionate performance impact due to micro-architectural reasons.
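
To make that concrete, here is a source-level sketch of what the CSE effectively does to WidenAndWrite_Foo from the issue description (not actual JIT IR; the cse0 temp name is invented):

public unsafe static void WidenAndWrite_Foo_AfterCse(byte* pInput, char* pOutput, ulong offset)
{
    Vector128<byte> zero = Vector128<byte>.Zero;
    Vector128<byte> narrow = Sse2.LoadVector128(pInput + offset);

    Vector128<byte> wideLow = Sse2.UnpackLow(narrow, zero);
    Vector128<byte> wideHigh = Sse2.UnpackHigh(narrow, zero);

    // Both stores share the subexpression (pOutput + offset), so the JIT hoists
    // it into a temporary; that temporary is what materializes as the lea.
    char* cse0 = pOutput + offset;             // lea rax, [rdx+r8*2]
    Sse2.Store((byte*)cse0, wideLow);          // vmovdqu [rax], xmm2
    Sse2.Store((byte*)cse0 + 0x10, wideHigh);  // vmovdqu [rax+0x10], xmm0
}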

@BruceForstall (Member)

@briansull Does CSE consider whether the only uses of the CSE could also be embedded as address modes, and weigh the CSE creation logic accordingly? Perhaps this is a phase ordering issue, where address mode creation should "undo" the CSE?

What do you suggest we do here?

@AndyAyersMS (Member)

Seems like user-written address modes are a weak area? JIT-generated ones might be blocked via "don't CSE" flags, but user-written ones might not have this protection.

If we're not breaking up these trees initially, perhaps we could add this flag somewhere.

@briansull (Contributor)

It is a tough problem to get this right in all cases.
If we ever add an "undo CSE" phase, we could reverse a particular CSE when we see a profitable case.

@BruceForstall removed the untriaged (New issue has not been triaged by the area owner) label Apr 3, 2020
@BruceForstall added this to the Future milestone Apr 3, 2020
@BruceForstall added the JitUntriaged (CLR JIT issues needing additional triage) label Oct 28, 2020
AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Jan 13, 2022
Extend ref counting done by local morph so that we can determine
single-def single-use locals.

Add a phase that runs just after local morph that will attempt to
forward single-def single-use local defs to uses when they are in
adjacent statements.

Fix or work around issues uncovered elsewhere:
* `gtFoldExprCompare` might fold "identical" volatile subtrees
* `fgGetStubAddrArg` cannot handle complex trees
* some simd/hw operations can lose struct handles
* some calls cannot handle struct local args

Addresses dotnet#6973 and related issues. Still sorting through exactly
which ones are fixed, so list below may need revising.

Fixes dotnet#48605.
Fixes dotnet#51599.
Fixes dotnet#55472.

Improves some but not all cases in dotnet#12280 and dotnet#62064.

Does not fix dotnet#33002, dotnet#47082, or dotnet#63116; these require handling multiple
uses or bypassing statements.
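
For readers unfamiliar with the term, here is a rough source-level illustration of the forward substitution described above (the real phase operates on JIT IR, and these method and variable names are invented for the example):

// Before: a single-def, single-use local holds the address, defined in one
// statement and consumed in the adjacent one, which can hide the address
// expression from address-mode formation.
public unsafe static void StoreViaTemp(char* pOutput, ulong offset, Vector128<byte> wideLow)
{
    byte* dest = (byte*)(pOutput + offset); // single def
    Sse2.Store(dest, wideLow);              // single use, adjacent statement
}

// After forward substitution: the def is forwarded into its single use, so the
// store can fold [base + index*scale] directly.
public unsafe static void StoreAfterForwardSub(char* pOutput, ulong offset, Vector128<byte> wideLow)
{
    Sse2.Store((byte*)(pOutput + offset), wideLow);
}
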
@ghost added and then removed the in-pr (There is an active PR which will close this issue when it is merged) label Jan 13, 2022
@hez2010 (Contributor) commented Feb 4, 2022

Has this issue been closed unexpectedly?

@stephentoub (Member) commented Feb 4, 2022

Has this issue been closed unexpectedly?

Probably. @AndyAyersMS, I'm reopening this under the assumption GitHub closed it by accident due to the wording in the PR description.

@stephentoub reopened this Feb 4, 2022
@AndyAyersMS (Member)

Odd that it still remembered that original text (or something).

@hez2010 (Contributor) commented Feb 5, 2022

Odd that it still remembered that original text (or something).

It was closed because the commit message contains a "fix #issue" reference.

@xtqqczze (Contributor)

Fixed in .NET 8? (sharplab)

@tannergooding (Member)

Most of the issues should be resolved (and have been for a long while now, actually). There are likely still a few cases here and there that aren't handled at the moment, but new issues can be opened for them.

@github-actions bot locked and limited conversation to collaborators Jun 28, 2024