-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rewrite how Matrix3x2 and Matrix4x4 are implemented #80091
Conversation
Tagging subscribers to this area: @dotnet/area-system-numerics Issue DetailsThis rewrites Matrix3x2 and Matrix4x4 so that it can better take advantage of SIMD acceleration and JIT features like promotion. In both cases, the entire implementation was moved to a private nested type Since we don't currently have "vectorcall" on Windows and since we can't compose the underlying types using Where there were trivially recognizable operations possible, such as using a vectorized operation rather than multiple scalar operations, those were done. Other methods, which would require more complex changes, were left "as is" and as a future exercise.
|
Perf_Matrix3x2Several methods are basically the same as before, but many are 8-10x faster
|
Perf_Matrix4x4Similarly to Matrix3x2, several methods are basically the same as before, but many are 8-10x faster
|
Codegen for many of these will improve even more once #80083 goes in, since it improves the loading/storing of Vector3 (TYP_SIMD12) |
An example diff for As you can see, previously we did 16 separate comparisons and at some 32 branches. Now we instead do 4 direct comparisons against the identity constant and no more than 4 branches. Before: ; Assembly listing for method System.Numerics.Matrix4x4:get_IsIdentity():bool:this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
; V00 this [V00,T00] ( 18, 10.50) byref -> rcx this single-def
;# V01 OutArgs [V01 ] ( 1, 1 ) lclBlk ( 0) [rsp+00H] "OutgoingArgSpace"
; V02 cse0 [V02,T01] ( 5, 3.50) float -> mm1 "CSE - aggressive"
;
; Lcl frame size = 0
G_M52050_IG01:
vzeroupper
;; size=3 bbWeight=1 PerfScore 1.00
G_M52050_IG02:
vmovss xmm0, dword ptr [rcx]
vmovss xmm1, dword ptr [reloc @RWD00]
vucomiss xmm0, xmm1
jp G_M52050_IG06
jne G_M52050_IG06
;; size=28 bbWeight=1 PerfScore 11.00
G_M52050_IG03:
vmovss xmm0, dword ptr [rcx+14H]
vucomiss xmm0, xmm1
jp G_M52050_IG06
jne G_M52050_IG06
vmovss xmm0, dword ptr [rcx+28H]
vucomiss xmm0, xmm1
jp G_M52050_IG06
jne G_M52050_IG06
vmovss xmm0, dword ptr [rcx+3CH]
vucomiss xmm0, xmm1
jp G_M52050_IG06
jne G_M52050_IG06
vmovss xmm0, dword ptr [rcx+04H]
vxorps xmm1, xmm1
vucomiss xmm0, xmm1
jp G_M52050_IG06
jne G_M52050_IG06
vmovss xmm0, dword ptr [rcx+08H]
vxorps xmm1, xmm1
vucomiss xmm0, xmm1
jp G_M52050_IG06
jne G_M52050_IG06
vmovss xmm0, dword ptr [rcx+0CH]
vxorps xmm1, xmm1
vucomiss xmm0, xmm1
jp G_M52050_IG06
jne G_M52050_IG06
vmovss xmm0, dword ptr [rcx+10H]
vxorps xmm1, xmm1
vucomiss xmm0, xmm1
jp G_M52050_IG06
jne G_M52050_IG06
vmovss xmm0, dword ptr [rcx+18H]
vxorps xmm1, xmm1
vucomiss xmm0, xmm1
jp G_M52050_IG06
jne G_M52050_IG06
vmovss xmm0, dword ptr [rcx+1CH]
vxorps xmm1, xmm1
vucomiss xmm0, xmm1
jp SHORT G_M52050_IG06
;; size=203 bbWeight=0.50 PerfScore 36.50
G_M52050_IG04:
jne SHORT G_M52050_IG06
vmovss xmm0, dword ptr [rcx+20H]
vxorps xmm1, xmm1
vucomiss xmm0, xmm1
jp SHORT G_M52050_IG06
jne SHORT G_M52050_IG06
vmovss xmm0, dword ptr [rcx+24H]
vxorps xmm1, xmm1
vucomiss xmm0, xmm1
jp SHORT G_M52050_IG06
jne SHORT G_M52050_IG06
vmovss xmm0, dword ptr [rcx+2CH]
vxorps xmm1, xmm1
vucomiss xmm0, xmm1
jp SHORT G_M52050_IG06
jne SHORT G_M52050_IG06
vmovss xmm0, dword ptr [rcx+30H]
vxorps xmm1, xmm1
vucomiss xmm0, xmm1
jp SHORT G_M52050_IG06
jne SHORT G_M52050_IG06
vmovss xmm0, dword ptr [rcx+34H]
vxorps xmm1, xmm1
vucomiss xmm0, xmm1
jp SHORT G_M52050_IG06
jne SHORT G_M52050_IG06
vmovss xmm0, dword ptr [rcx+38H]
vxorps xmm1, xmm1
xor eax, eax
vucomiss xmm0, xmm1
setnp al
jp SHORT G_M52050_IG05
sete al
;; size=110 bbWeight=0.50 PerfScore 26.13
G_M52050_IG05:
ret
;; size=1 bbWeight=0.50 PerfScore 0.50
G_M52050_IG06:
xor eax, eax
;; size=2 bbWeight=0.50 PerfScore 0.12
G_M52050_IG07:
ret
;; size=1 bbWeight=0.50 PerfScore 0.50
RWD00 dd 3F800000h ; 1
; Total bytes of code 348, prolog size 3, PerfScore 110.55, instruction count 83, allocated bytes for code 348 (MethodHash=dc9134ad) for method System.Numerics.Matrix4x4:get_IsIdentity():bool:this After: ; Assembly listing for method System.Numerics.Matrix4x4:get_IsIdentity():bool:this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 2 single block inlinees; 1 inlinees without PGO data
; Final local variable assignments
;
; V00 this [V00,T02] ( 3, 3 ) byref -> rcx this single-def
;# V01 OutArgs [V01 ] ( 1, 1 ) lclBlk ( 0) [rsp+00H] "OutgoingArgSpace"
; V02 tmp1 [V02,T03] ( 2, 4 ) struct (64) [rsp+48H] do-not-enreg[SF] "impAppendStmt"
;* V03 tmp2 [V03 ] ( 0, 0 ) struct (64) zero-ref do-not-enreg[S] "struct address for call/obj"
;* V04 tmp3 [V04,T00] ( 0, 0 ) struct (64) zero-ref do-not-enreg[SF] ld-addr-op "NewObj constructor temp"
;* V05 tmp4 [V05,T10] ( 0, 0 ) simd16 -> zero-ref ld-addr-op "NewObj constructor temp"
;* V06 tmp5 [V06,T11] ( 0, 0 ) simd16 -> zero-ref ld-addr-op "NewObj constructor temp"
;* V07 tmp6 [V07,T12] ( 0, 0 ) simd16 -> zero-ref ld-addr-op "NewObj constructor temp"
;* V08 tmp7 [V08,T13] ( 0, 0 ) simd16 -> zero-ref ld-addr-op "NewObj constructor temp"
; V09 tmp8 [V09,T04] ( 3, 2 ) bool -> rax "Inline return value spill temp"
; V10 tmp9 [V10,T01] ( 5, 7 ) struct (64) [rsp+08H] do-not-enreg[SF] "Inlining Arg"
;* V11 tmp10 [V11,T05] ( 0, 0 ) struct (64) zero-ref do-not-enreg[SF] "Inlining Arg"
; V12 cse0 [V12,T06] ( 2, 2 ) simd16 -> mm0 "CSE - aggressive"
; V13 cse1 [V13,T07] ( 2, 1.50) simd16 -> mm1 "CSE - aggressive"
; V14 cse2 [V14,T08] ( 2, 1.50) simd16 -> mm2 "CSE - aggressive"
; V15 cse3 [V15,T09] ( 2, 1.50) simd16 -> mm3 "CSE - aggressive"
;
; Lcl frame size = 136
G_M52050_IG01:
sub rsp, 136
vzeroupper
;; size=10 bbWeight=1 PerfScore 1.25
G_M52050_IG02:
vmovdqu ymm0, ymmword ptr[rcx]
vmovdqu ymmword ptr[rsp+48H], ymm0
vmovdqu ymm0, ymmword ptr[rcx+20H]
vmovdqu ymmword ptr[rsp+68H], ymm0
vmovupd xmm0, xmmword ptr [reloc @RWD00]
vmovupd xmm1, xmmword ptr [reloc @RWD16]
vmovupd xmm2, xmmword ptr [reloc @RWD32]
vmovupd xmm3, xmmword ptr [reloc @RWD48]
vmovdqu ymm4, ymmword ptr[rsp+48H]
vmovdqu ymmword ptr[rsp+08H], ymm4
vmovdqu ymm4, ymmword ptr[rsp+68H]
vmovdqu ymmword ptr[rsp+28H], ymm4
vcmpps xmm0, xmm0, xmmword ptr [rsp+08H], 0
vmovmskps xrax, xmm0
cmp eax, 15
jne SHORT G_M52050_IG04
;; size=93 bbWeight=1 PerfScore 40.25
G_M52050_IG03:
vcmpps xmm0, xmm1, xmmword ptr [rsp+18H], 0
vmovmskps xrax, xmm0
cmp eax, 15
jne SHORT G_M52050_IG04
vcmpps xmm0, xmm2, xmmword ptr [rsp+28H], 0
vmovmskps xrax, xmm0
cmp eax, 15
jne SHORT G_M52050_IG04
vcmpps xmm0, xmm3, xmmword ptr [rsp+38H], 0
vmovmskps xrax, xmm0
cmp eax, 15
sete al
movzx rax, al
jmp SHORT G_M52050_IG05
;; size=54 bbWeight=0.50 PerfScore 10.50
G_M52050_IG04:
xor eax, eax
;; size=2 bbWeight=0.50 PerfScore 0.12
G_M52050_IG05:
add rsp, 136
ret
;; size=8 bbWeight=1 PerfScore 1.25
RWD00 dq 000000003F800000h, 0000000000000000h
RWD16 dq 3F80000000000000h, 0000000000000000h
RWD32 dq 0000000000000000h, 000000003F800000h
RWD48 dq 0000000000000000h, 3F80000000000000h
; Total bytes of code 167, prolog size 10, PerfScore 70.08, instruction count 35, allocated bytes for code 167 (MethodHash=dc9134ad) for method System.Numerics.Matrix4x4:get_IsIdentity():bool:this |
Still need to look into a couple edge cases where perf minorly regressed. Disassembly looks to show it being from extra copies where we'd ideally have none. |
27fb5dc
to
b8b97a6
Compare
b8b97a6
to
51bb6c3
Compare
@radekdoulik would you be able to assist (or ping someone from the WASM team who can) with the failure here? Seeing the following in the logs, but don't see an existing issue for it.
It's unexpected that changing matrix would impact the System.Runtime tests and not others. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM at a high level, as most of this seems like a direct copy paste to the new class. Is there anything besides the operator +
change that requires a closer look?
Not particularly, this all pretty straightforward refactoring + simplification. |
You can ignore that, it is known issue.#79874 |
Improvements on Windows-x64: |
This rewrites Matrix3x2 and Matrix4x4 so that it can better take advantage of SIMD acceleration and JIT features like promotion.
In both cases, the entire implementation was moved to a private nested type
Impl
and relies on the fact that bitcasting between these same sized value types is a "no-op" to the JIT. These new types have3x Vector2
or4x Vector4
fields, respectively, and is in contrast to the6x float
and16x float
fields the main types have. This switch allows the JIT to perform field promotion and therefore also enregistration of the underlying bytes. We would have ideally done this more directly, but since the main types expose fields publicly, that would have been a breaking change.Since we don't currently have "vectorcall" on Windows and since we can't compose the underlying types using
Vector128<float>
without it being an ABI break for interop scenarios,Impl
takes the large matrix types viain
to help reduce copying. However, returns are still returned by value since the field promotion allows things to get properly optimized. When inlining occurs, the JIT is able to see through the value being passed asin
and avoid the value being "address taken".Where there were trivially recognizable operations possible, such as using a vectorized operation rather than multiple scalar operations, those were done. Other methods, which would require more complex changes, were left "as is" and as a future exercise.