JIT generates suboptimal instructions for 2D vector struct #11940
An additional note: this leaves RyuJIT quite a bit slower than the 32-bit legacy JIT, which, by virtue of not generating SIMD instructions, seems to be quite a bit faster. In fact the 32-bit legacy JIT produced 2D vector benchmark results on par with the 3D vector under RyuJIT (and, naturally, the 3D vector on the legacy JIT was slightly slower, though not substantially).
It seems to me that this has nothing to do with SIMD; it's all scalar floating point code.
This is a variation of #4323. In the Win x64 ABI, 8-byte structs are supposed to be returned in an integer register, and the JIT doesn't handle this case well: it forces the struct into memory. Also related to #11848 (the generated IR is basically the same as in https://github.com/dotnet/coreclr/issues/22079#issuecomment-456531059, but with float field types instead of int field types).
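
To make the shape of the problem concrete, here is a minimal sketch of the kind of struct involved (illustrative only; the actual code lives in the benchmark repository linked in the issue text). The struct is 8 bytes, so on Windows x64 its value is returned to the caller in RAX, which is the path discussed above.

    // Minimal sketch, assuming a Vector2-like struct similar to the one in the
    // benchmark (names here are illustrative, not copied from the repo).
    // The struct is 8 bytes (two floats), so under the Windows x64 ABI the
    // whole value is returned in an integer register (RAX); the complaint is
    // that the JIT then forces it through memory instead of keeping X and Y
    // in XMM registers.
    public struct Vector2
    {
        public float X;
        public float Y;

        public Vector2(float x, float y)
        {
            X = x;
            Y = y;
        }

        // Returning the struct by value is what exercises the 8-byte
        // struct-return path discussed above.
        public static Vector2 operator +(Vector2 a, Vector2 b)
        {
            return new Vector2(a.X + b.X, a.Y + b.Y);
        }
    }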
I might be totally off, as I’m fairly inexperienced with native assembly and SIMD, but my understanding was that things like ‘vaddss’ and the use of the ‘xmm’ registers fell under the umbrella of SIMD. In either case, the JIT is producing code that is roughly 6x slower for a smaller data type that, conceptually, is doing fewer operations. So whether it’s SIMD or not, it still seems like an issue with the code generation, especially when the legacy 32-bit JIT outperforms RyuJIT on this same benchmark.
Yes, certainly. Note that the issue is specific to 64-bit. 32-bit (x86) RyuJIT produces code that's similar to the legacy 32-bit JIT.
Ah, thanks for clarifying. I’ve renamed the issue accordingly.
The same fix mentioned in https://github.com/dotnet/coreclr/issues/22079#issuecomment-456554914 results in the expected code being generated:

    G_M55889_IG01:
    C5F877 vzeroupper
    6690 nop
    G_M55889_IG02:
    C5FA100533000000 vmovss xmm0, dword ptr [reloc @RWD00]
    C5FA100D2F000000 vmovss xmm1, dword ptr [reloc @RWD04]
    B8E8030000 mov eax, 0x3E8
    EB1A jmp SHORT G_M55889_IG04
    G_M55889_IG03:
    C5FA580524000000 vaddss xmm0, dword ptr [reloc @RWD08]
    C5F2580D20000000 vaddss xmm1, dword ptr [reloc @RWD12]
    8BC2 mov eax, edx
    G_M55889_IG04:
    8D50FF lea edx, [rax-1]
    85C0 test eax, eax
    7FE7 jg SHORT G_M55889_IG03
    G_M55889_IG05:
    C3 ret

Though I think I'd prefer not generating that. Or, generate:

    movd xmm0, eax
    shr rax, 32
    movd xmm1, eax
We may get to this for 3.0, but marking as Future.
This was fixed in 5.0 by improvements for enregistering small structs. |
I've been benchmarking various approaches to implementing basic math struct operators for projects like FNA and MonoGame, and among the results was the finding that 2D vectors were slower across the board than 3D and 4D ones. After some digging I isolated the tests and got the disassembly, which shows that a struct with 3 float fields generates far more optimized instructions than a struct with 2 floats. This leads to the 2D vector being roughly 6x slower than the 3D vector.
The code for the entire benchmark can be found at https://github.com/nickgravelyn/math-struct-benchmark/blob/vector2-focus/Program.cs and the generated disassembly can be viewed at https://github.com/nickgravelyn/math-struct-benchmark/tree/vector2-focus#disassembly
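
For reference, the hot loop being measured looks roughly like the sketch below (a paraphrase, not a copy of the linked Program.cs; the iteration count and constants are assumptions, and it uses the illustrative Vector2 sketch shown earlier in the thread).

    // Rough sketch of the measured loop, assuming the Vector2 sketch from
    // earlier in the thread; the authoritative version is in the linked repo.
    static Vector2 AddBenchmark(int iterations)
    {
        var value = new Vector2(1.5f, 2.5f);
        var delta = new Vector2(0.5f, 0.5f);
        for (int i = 0; i < iterations; i++)
        {
            // Each iteration calls operator +, which returns the 8-byte
            // struct by value and so hits the Win x64 struct-return path.
            value = value + delta;
        }
        return value;
    }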
category:cq
theme:structs
skill-level:expert
cost:medium