-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JIT: Some SIMD intrinsics operations don't reliably fold memory addressing logic #33002
Comments
Actually, upon rerunning the benchmarks, it appears that there is a significant perf win for killing the
This is a 30% perf win on top of the perf improvements already gained by #32994. To get this win I used the trick called out in the issue description here to ensure the memory address calculation logic gets folded into the // Now run the main 1x 128-bit read + 2x 128-bit write loop.
nuint finalOffsetWhereCanIterateLoop = elementCount - SizeOfVector128;
while (currentOffset <= finalOffsetWhereCanIterateLoop)
{
latin1Vector = Sse2.LoadVector128(pLatin1Buffer + currentOffset); // unaligned load
Vector128<byte> low = Sse2.UnpackLow(latin1Vector, zeroVector);
Sse2.StoreAligned((byte*)(pUtf16Buffer + currentOffset), low);
Vector128<byte> high = Sse2.UnpackHigh(latin1Vector, zeroVector);
Sse2.StoreAligned((byte*)pUtf16Buffer + (currentOffset << 1) + SizeOfVector128, high);
currentOffset += SizeOfVector128;
} ; rcx := pInput (byte*)
; rdx := pOutput (char*)
; rax := currentOffset (nuint)
; xmm0 := Vector128<byte>.Zero
; r9 := last legal offset where the loop can iterate
;;; BEFORE ;;;
StartOfLoop:
vmovdqu xmm1,xmmword ptr [rcx+rax]
vpunpcklbw xmm2,xmm1,xmm0
lea r10,[rdx+rax*2] ; <-- r10 used as temp value
vmovdqa xmmword ptr [r10],xmm2
vpunpckhbw xmm1,xmm1,xmm0
vmovdqa xmmword ptr [r10+10h],xmm1
add rax,10h
cmp rax,r9
jbe StartOfLoop
;;; AFTER ;;;
StartOfLoop:
vmovdqu xmm1,xmmword ptr [rcx+rax]
vpunpcklbw xmm2,xmm1,xmm0
vmovdqa xmmword ptr [rdx+rax*2],xmm2
vpunpckhbw xmm1,xmm1,xmm0
vmovdqa xmmword ptr [rdx+rax*2+10h],xmm1
add rax,10h
cmp rax,r9
jbe StartOfLoop |
@briansull Can you investigate this to help understand what is going on in the JIT and what we might do about it? @AndyAyersMS thinks a forward substitution system might be appropriate to fix/improve some of this. |
OK I will take a look |
The lea instruction is the result of CSE optimization. The JIT sees it as a good CSE opportunity. And it is not that bad guess - the code size with the CSE is not bigger. Unfortunately, the extra lea seems to have disproportinal performance impact due to some micro-architecture reasons. |
@briansull Does CSE consider whether the only uses of the CSE could also be embedded as address modes, and weigh the CSE creation logic accordingly? Perhaps this is a phase ordering issue, where address mode creation should "undo" the CSE? What do you suggest we do here? |
Seems like user-written address modes are a weak area? Jit-generated ones might be blocked via don't cse flags, but user written ones might not have this. If we're not breaking up these trees initially, perhaps we could add this flag in somewhere. |
It is a tough problem to get this right in all cases. |
Extend ref counting done by local morph so that we can determine single-def single-use locals. Add a phase that runs just after local morph that will attempt to forward single-def single-use local defs to uses when they are in adjacent statements. Fix or work around issues uncovered elsewhere: * `gtFoldExprCompare` might fold "identical" volatile subtrees * `fgGetStubAddrArg` cannot handle complex trees * some simd/hw operations can lose struct handles * some calls cannot handle struct local args Addresses dotnet#6973 and related issues. Still sorting through exactly which ones are fixed, so list below may need revising. Fixes dotnet#48605. Fixes dotnet#51599. Fixes dotnet#55472. Improves some but not all cases in dotnet#12280 and dotnet#62064. Does not fix dotnet#33002, dotnet#47082, or dotnet#63116; these require handling multiple uses or bypassing statements.
Is this issue been closed unexpectedly? |
Probably. @AndyAyersMS, I'm reopening this under the assumption GitHub closed it by accident due to the wording in the PR description. |
Odd that it still remembered that original text (or something). |
It was closed because the commit message contains "fix #issue". |
Fixed in .NET 8? (sharplab) |
Most of the issues should be resolved (and have been for a long while now actually). There's likely still a few here or there that aren't handled at the moment, but they can have new issues opened for them. |
Found while looking at codegen for #32994. From SharpLab:
Is there any reason why the first sample uses a
lea
to help with calculating the destination memory address, while the second sample foregoeslea
and folds the memory addressing logic directly into thevmovdqu
instruction? As far as I can tell they take around the same amount of time to execute, but the first sample burns a register unnecessarily.Related: #10923
category:cq
theme:addressing-modes
skill-level:expert
cost:medium
The text was updated successfully, but these errors were encountered: