JIT generates suboptimal instructions for 2D vector struct #11940
An additional note: this leaves RyuJIT quite a bit slower than the 32-bit legacy JIT, which, by virtue of not generating SIMD instructions, seems to be quite a bit faster. In fact the 32-bit legacy JIT produced 2D vector benchmark results on par with the 3D vector under RyuJIT (and, naturally, the 3D vector on the legacy JIT was slightly slower, though not substantially).
It seems to me that this has nothing to do with SIMD; it's all scalar floating point code.
This is a variation of #4323. In the Win x64 ABI, 8-byte structs are supposed to be returned in an integer register, and the JIT doesn't handle this case well: it forces the struct into memory. Also related to #11848 (the generated IR is basically the same as in https://github.com/dotnet/coreclr/issues/22079#issuecomment-456531059, but with float field types instead of int field types).
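
To make the shape of the problem concrete, here is a minimal sketch of the kind of struct involved (illustrative only; the actual code lives in the benchmark repository linked in the issue text). The struct is 8 bytes, so on Windows x64 its value is returned to the caller in RAX, which is the path discussed above.

    // Minimal sketch, assuming a Vector2-like struct similar to the one in the
    // benchmark (names here are illustrative, not copied from the repo).
    // The struct is 8 bytes (two floats), so under the Windows x64 ABI the
    // whole value is returned in an integer register (RAX); the complaint is
    // that the JIT then forces it through memory instead of keeping X and Y
    // in XMM registers.
    public struct Vector2
    {
        public float X;
        public float Y;

        public Vector2(float x, float y)
        {
            X = x;
            Y = y;
        }

        // Returning the struct by value is what exercises the 8-byte
        // struct-return path discussed above.
        public static Vector2 operator +(Vector2 a, Vector2 b)
        {
            return new Vector2(a.X + b.X, a.Y + b.Y);
        }
    }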
I might be totally off, as I’m fairly inexperienced with native assembly and SIMD, but my understanding was that things like ‘vaddss’ and the use of the ‘xmm’ registers fell under the umbrella of SIMD. In either case, the JIT is producing code that is roughly 6x slower for a smaller data type that, conceptually, is doing fewer operations. So whether it’s SIMD or not, it still seems like an issue with the code generation, especially when the legacy 32-bit JIT outperforms RyuJIT on this same benchmark.
Yes, certainly. Note that the issue is specific to 64-bit. 32-bit (x86) RyuJIT produces code that's similar to the legacy 32-bit JIT.
Ah, thanks for clarifying. I’ve renamed the issue accordingly.
The same fix mentioned in https://github.com/dotnet/coreclr/issues/22079#issuecomment-456554914 results in the expected code being generated:

    G_M55889_IG01:
    C5F877 vzeroupper
    6690 nop
    G_M55889_IG02:
    C5FA100533000000 vmovss xmm0, dword ptr [reloc @RWD00]
    C5FA100D2F000000 vmovss xmm1, dword ptr [reloc @RWD04]
    B8E8030000 mov eax, 0x3E8
    EB1A jmp SHORT G_M55889_IG04
    G_M55889_IG03:
    C5FA580524000000 vaddss xmm0, dword ptr [reloc @RWD08]
    C5F2580D20000000 vaddss xmm1, dword ptr [reloc @RWD12]
    8BC2 mov eax, edx
    G_M55889_IG04:
    8D50FF lea edx, [rax-1]
    85C0 test eax, eax
    7FE7 jg SHORT G_M55889_IG03
    G_M55889_IG05:
    C3 ret

Though I think I'd prefer not generating that. Or, generate:

    movd xmm0, eax
    shr rax, 32
    movd xmm1, eax
We may get to this for 3.0, but marking as Future.
This was fixed in 5.0 by improvements for enregistering small structs. |
I've been benchmarking various approaches to implementing basic math struct operators for projects like FNA and MonoGame, and among the results was the finding that 2D vectors were slower across the board than 3D and 4D ones. After some digging I isolated the tests and got the disassembly, which shows that a struct with 3 float fields generates far more optimized instructions than a struct with 2 floats. This leads to the 2D vector being roughly 6x slower than the 3D vector.
The code for the entire benchmark can be found at https://github.com/nickgravelyn/math-struct-benchmark/blob/vector2-focus/Program.cs and the generated disassembly can be viewed at https://github.com/nickgravelyn/math-struct-benchmark/tree/vector2-focus#disassembly
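
For reference, the hot loop being measured looks roughly like the sketch below (a paraphrase, not a copy of the linked Program.cs; the iteration count and constants are assumptions, and it uses the illustrative Vector2 sketch shown earlier in the thread).

    // Rough sketch of the measured loop, assuming the Vector2 sketch from
    // earlier in the thread; the authoritative version is in the linked repo.
    static Vector2 AddBenchmark(int iterations)
    {
        var value = new Vector2(1.5f, 2.5f);
        var delta = new Vector2(0.5f, 0.5f);
        for (int i = 0; i < iterations; i++)
        {
            // Each iteration calls operator +, which returns the 8-byte
            // struct by value and so hits the Win x64 struct-return path.
            value = value + delta;
        }
        return value;
    }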
category:cq
theme:structs
skill-level:expert
cost:medium