return ans;
Vector4 q = Unsafe.As<Quaternion, Vector4>(ref value);
q = Vector4.Normalize(q);
return Unsafe.As<Vector4, Quaternion>(ref q);
Not sure if Unsafe.As<Vector4, Quaternion>(ref q) will prevent q from being kept in a register (e.g. because its address is taken). Should it be this instead?
public static Quaternion Normalize(Quaternion value)
{
Vector4 q = Unsafe.As<Quaternion, Vector4>(ref value);
Vector4 result = Vector4.Normalize(q);
return Unsafe.As<Vector4, Quaternion>(ref result);
}
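A minimal sketch of how the suggestion above could be checked against the existing implementation. It assumes Quaternion and Vector4 share the same sequential X, Y, Z, W float layout, which is what the Unsafe.As reinterpretation relies on. NormalizeViaVector4 and NearlyEqual are hypothetical helper names for illustration, not code from this PR:

```csharp
using System;
using System.Numerics;
using System.Runtime.CompilerServices;

public static class QuaternionVectorSketch
{
    // Hypothetical helper: normalize a Quaternion by viewing it as a Vector4.
    // Assumes both types are four sequential floats (X, Y, Z, W).
    public static Quaternion NormalizeViaVector4(Quaternion value)
    {
        // Reinterpret the by-value parameter as a Vector4 (copies 16 bytes).
        Vector4 q = Unsafe.As<Quaternion, Vector4>(ref value);
        // Use a separate local for the result so the input local is only read.
        Vector4 result = Vector4.Normalize(q);
        return Unsafe.As<Vector4, Quaternion>(ref result);
    }

    // Hypothetical helper: component-wise comparison with an absolute tolerance.
    public static bool NearlyEqual(Quaternion a, Quaternion b, float tolerance = 1e-5f)
    {
        return Math.Abs(a.X - b.X) <= tolerance
            && Math.Abs(a.Y - b.Y) <= tolerance
            && Math.Abs(a.Z - b.Z) <= tolerance
            && Math.Abs(a.W - b.W) <= tolerance;
    }

    public static void Main()
    {
        Quaternion start = new Quaternion(8.5f, 9.4f, 1.2f, 1f);
        Quaternion expected = Quaternion.Normalize(start);
        Quaternion actual = NormalizeViaVector4(start);
        Console.WriteLine(NearlyEqual(expected, actual)
            ? "match"
            : $"mismatch: expected {expected} actual {actual}");
    }
}
```

Running this in both Debug and Release is the point of the exercise, since the failures reported in this thread only showed up in optimized builds.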
61c01ba
to
5d63a4a
Compare
Getting same failures as Linux release locally on Windows; though only in release mode
|
Some pretty weird results either using
|
292f1fe
to
f86df6c
Compare
@@ -383,7 +383,7 @@ public void QuaternionCreateFromYawPitchRollTest2()

 Quaternion expected = yaw * pitch * roll;
 Quaternion actual = Quaternion.CreateFromYawPitchRoll(yawRad, pitchRad, rollRad);
-Assert.True(MathHelper.Equal(expected, actual), String.Format("Yaw:{0} Pitch:{1} Roll:{2}", yawAngle, pitchAngle, rollAngle));
+Assert.True(MathHelper.Equal(expected, actual), $"Quaternion.QuaternionCreateFromYawPitchRollTest2 Yaw:{yawAngle} Pitch:{pitchAngle} Roll:{rollAngle} did not return the expected value: expected {expected} actual2 {actual}");
Any reason this says actual2 {actual}? Is that just a typo?
Typo.
Fixed
ans.W = value.W;

return ans;
Vector4 q = -ToVector4(value);
I did a quick benchmark on this change:
[Benchmark]
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static void Conjugate()
{
Quaternion start = new Quaternion(8.5f, 9.4f, 1.2f, 1f);
Quaternion c1 = Quaternion.Conjugate(start);
Quaternion c2 = Quaternion.Conjugate(c1);
Quaternion c3 = Quaternion.Conjugate(c2);
Quaternion c4 = Quaternion.Conjugate(c3);
}
And the results show a degradation of this method on my machine:
BenchmarkDotNet=v0.10.10.20171127-develop, OS=Windows 10 Redstone 3 [1709, Fall Creators Update] (10.0.16299.19)
Processor=Intel Core i7-6700 CPU 3.40GHz (Skylake), ProcessorCount=8
Frequency=3328122 Hz, Resolution=300.4698 ns, Timer=TSC
.NET Core SDK=2.2.0-preview1-007522
[Host] : .NET Core 2.1.0-preview1-25907-02 (Framework 4.6.25901.06), 64bit RyuJIT
With your changes:
Method | Mean | Error | StdDev |
---|---|---|---|
Conjugate | 57.38 ns | 0.9437 ns | 0.8827 ns |
Without the changes:
Method | Mean | Error | StdDev |
---|---|---|---|
Conjugate | 5.740 ns | 0.0549 ns | 0.0397 ns |
I also see a degradation for Normalize. Same machine as above.
[Benchmark]
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static void Normalize()
{
Quaternion start = new Quaternion(8.5f, 9.4f, 1.2f, 1f);
Quaternion c1 = Quaternion.Normalize(start);
Quaternion c2 = Quaternion.Normalize(c1);
Quaternion c3 = Quaternion.Normalize(c2);
Quaternion c4 = Quaternion.Normalize(c3);
}
With your Normalize change:
Method | Mean | Error | StdDev |
---|---|---|---|
Normalize | 109.890 ns | 1.6731 ns | 1.5650 ns |
Without your Normalize change:
Method | Mean | Error | StdDev |
---|---|---|---|
Normalize | 60.797 ns | 0.3803 ns | 0.3557 ns |
It was better with the Unsafe casting; but produced the wrong results in release 😢
Method | Mean | Error | StdDev | Scaled | ScaledSD |
----------------- |----------:|----------:|----------:|-------:|---------:|
ConjugateUnsafe | 8.564 ns | 0.0385 ns | 0.0322 ns | 0.31 | 0.00 |
ConjugateCurrent | 27.613 ns | 0.1639 ns | 0.1533 ns | 1.00 | 0.00 |
ConjugateChange | 64.814 ns | 0.2741 ns | 0.2564 ns | 2.35 | 0.02 |
Will have to dig into why it's producing wrong results.
Are there issues if you make this a union, rather than using Unsafe.As?
I'm fairly sure (can't find the code atm) that the JIT won't consider a struct with overlapping fields for enregistration.
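For context, a rough sketch of what a "union" could look like in C#, using explicit layout so two fields overlap at offset 0. QuaternionUnion and UnionSketch are hypothetical names for illustration; the sketch wraps the existing types instead of changing Quaternion itself, and it says nothing about how the JIT treats such a struct:

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;

// Hypothetical wrapper type: both fields occupy the same 16 bytes.
[StructLayout(LayoutKind.Explicit)]
public struct QuaternionUnion
{
    [FieldOffset(0)] public Quaternion AsQuaternion;
    [FieldOffset(0)] public Vector4 AsVector4;
}

public static class UnionSketch
{
    // Normalize by writing through one view and reading through the other.
    public static Quaternion Normalize(Quaternion value)
    {
        QuaternionUnion u = new QuaternionUnion();
        u.AsQuaternion = value;
        u.AsVector4 = Vector4.Normalize(u.AsVector4);
        return u.AsQuaternion;
    }

    public static void Main()
    {
        Quaternion start = new Quaternion(8.5f, 9.4f, 1.2f, 1f);
        Console.WriteLine(Normalize(start));
        Console.WriteLine(Quaternion.Normalize(start));
    }
}
```

If the concern above holds, the overlapping fields would keep the struct on the stack, so this form trades the Unsafe dependency for a possible codegen cost.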
Will try to get a repro and file an issue in coreclr; then revert this back to the Unsafe version
Raised issue https://github.com/dotnet/coreclr/issues/15237
Not sure about the |
This is captured in the |
In |
BTW: Introducing overlapping fields also affects how the struct is passed in interop on Unix x64. |
This PR has been sitting here for 1 month. Any plans to push it forward @benaadams?
Stripped out the change not to use Unsafe as that made things slower; however then it hits the issue https://github.com/dotnet/coreclr/issues/15237 |
@benaadams the dependency seems to be resolved now. |
As this was an issue in the Jit and Quaternion is also OOB; do I #if the changes for netcoreapp2.1? |
I'd assume you'd have to, or else the tests won't pass on desktop, right? Just an FYI - we try to not use |
d414e86
to
6c3a6d7
Compare
Updated |
Have it wrong somehow?
|
System.Numerics.Vectors is inbox. System.Runtime.CompilerServices.Unsafe is out of box. Inbox cannot depend on out of box. |
<ItemGroup Condition="'$(TargetGroup)' == 'netcoreapp'">
<Compile Include="System\Numerics\Quaternion.netcoreapp.cs" />
<Compile Include="$(CommonPath)\CoreLib\Internal\Runtime\CompilerServices\Unsafe.cs">
<Link>Common\CoreLib\Internal\Runtime\CompilerServices\Unsafe.cs</Link>
You need to reference the internal Unsafe in CoreLib. Local copy is not going to work - it won't be recognized by the JIT.
Like this?
<ItemGroup Condition="'$(TargetGroup)' == 'netcoreapp'">
<Compile Include="System\Numerics\Quaternion.netcoreapp.cs" />
<ReferenceFromRuntime Include="System.Private.CoreLib" />
</ItemGroup>
<ItemGroup Condition="'$(IsPartialFacadeAssembly)' != 'true' AND '$(TargetGroup)' != 'netcoreapp'">
<Compile Include="System\Numerics\Quaternion.cs" />
</ItemGroup>
Yes, something like this.
NETFX System.Drawing.Common.Tests
test NETFX x86 Release Build |
<Link>System\MathF.netstandard.cs</Link>
</Compile>
</ItemGroup>
<!-- Optimize Quaternion as Vector4 for netcoreapp -->
<!-- Jit issue for other runtimes https://github.com/dotnet/coreclr/issues/15237 -->
<ItemGroup Condition="'$(IsPartialFacadeAssembly)' != 'true' AND $(TargetGroup.StartsWith('netcoreapp2'))">
The CoreLib reference can be used for live netcoreapp only.
Always gives me this when using netcoreapp
C:\GitHub\corefx\buildvertical.targets(168,5): error :
Could not find a configuration for ProjectReference
'C:\GitHub\corefx\\external\runtime\runtime.depproj' from configurations
netcoreapp-Windows_NT;
netcoreapp-Unix;
netcoreapp2.0-Windows_NT;
netcoreapp2.0-Unix;
uap;
uapaot;
mono
when building 'System.Numerics.Vectors' for configuration
netcoreapp
[C:\GitHub\corefx\src\System.Numerics.Vectors\src\System.Numerics.Vectors.csproj]
Should be intrinsic?
[Intrinsic]
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector4 operator /(Vector4 value1, float value2)
Unless moving some of Vector to coreclr broke the intrinsics; or it's a bad implementation of it?
Doesn't look like
public static Quaternion Normalize(Quaternion value)
{
Vector4 q = Unsafe.As<Quaternion, Vector4>(ref value);
float length = q.Length();
Vector4 v = q / new Vector4(length);
return Unsafe.As<Vector4, Quaternion>(ref v);
}
evaporates the division inlines
Inlines into 06000003 Program:Normalize_New():struct
[1 IL=0020 TR=000011 06000006] [profitable inline] QuaternionStruct:.ctor(float,float,float,float):this
[2 IL=0025 TR=000018 06000010] [aggressive inline attribute] QuaternionStruct:Normalize(struct):struct
[3 IL=0002 TR=000092 06000016] [aggressive inline attribute] Unsafe:As(byref):byref
[4 IL=0015 TR=000103 0600010A] [aggressive inline attribute] Vector4:Length():float:this
[5 IL=0036 TR=000134 06000016] [aggressive inline attribute] Unsafe:As(byref):byref
[6 IL=0030 TR=000024 06000010] [aggressive inline attribute] QuaternionStruct:Normalize(struct):struct
[7 IL=0002 TR=000196 06000016] [aggressive inline attribute] Unsafe:As(byref):byref
[8 IL=0015 TR=000207 0600010A] [aggressive inline attribute] Vector4:Length():float:this
[9 IL=0036 TR=000238 06000016] [aggressive inline attribute] Unsafe:As(byref):byref
[10 IL=0035 TR=000035 06000010] [aggressive inline attribute] QuaternionStruct:Normalize(struct):struct
[11 IL=0002 TR=000300 06000016] [aggressive inline attribute] Unsafe:As(byref):byref
[12 IL=0015 TR=000311 0600010A] [aggressive inline attribute] Vector4:Length():float:this
[13 IL=0036 TR=000342 06000016] [aggressive inline attribute] Unsafe:As(byref):byref
[14 IL=0040 TR=000046 06000010] [aggressive inline attribute] QuaternionStruct:Normalize(struct):struct
[15 IL=0002 TR=000404 06000016] [aggressive inline attribute] Unsafe:As(byref):byref
[16 IL=0015 TR=000415 0600010A] [aggressive inline attribute] Vector4:Length():float:this
[17 IL=0036 TR=000446 06000016] [aggressive inline attribute] Unsafe:As(byref):byref
Budget: initialTime=198, finalTime=1188, initialBudget=1980, currentBudget=3004
Budget: increased by 1024 because of force inlines
Budget: initialSize=1180, finalSize=1331
; Assembly listing for method Program:Normalize_New():struct
; Emitting BLENDED_CODE for X64 CPU with AVX
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
; V00 RetBuf [V00,T01] ( 4, 4 ) byref -> rcx
;* V01 loc0 [V01 ] ( 0, 0 ) struct (16) zero-ref
; V02 tmp1 [V02,T06] ( 2, 4 ) struct (16) [rsp+0xA8] do-not-enreg[SB]
; V03 tmp2 [V03,T07] ( 2, 4 ) struct (16) [rsp+0x98] do-not-enreg[SB]
; V04 tmp3 [V04,T08] ( 2, 4 ) struct (16) [rsp+0x88] do-not-enreg[SB]
; V05 tmp4 [V05 ] ( 2, 4 ) struct (16) [rsp+0x78] do-not-enreg[XSVB] addr-exposed ld-addr-op
; V06 tmp5 [V06,T02] ( 4, 4 ) simd16 -> mm0 ld-addr-op
;* V07 tmp6 [V07 ] ( 0, 0 ) float -> zero-ref
; V08 tmp7 [V08,T09] ( 2, 4 ) simd16 -> mm1
;* V09 tmp8 [V09 ] ( 0, 0 ) simd16 -> zero-ref
; V10 tmp9 [V10,T16] ( 2, 2 ) simd16 -> [rsp+0x60] do-not-enreg[SB] ld-addr-op
; V11 tmp10 [V11,T20] ( 2, 2 ) float -> mm1
; V12 tmp11 [V12,T21] ( 2, 2 ) float -> mm1
; V13 tmp12 [V13,T10] ( 2, 4 ) struct (16) [rsp+0x50] do-not-enreg[SVB] ld-addr-op
; V14 tmp13 [V14,T03] ( 4, 4 ) simd16 -> mm0 ld-addr-op
;* V15 tmp14 [V15 ] ( 0, 0 ) float -> zero-ref
; V16 tmp15 [V16,T11] ( 2, 4 ) simd16 -> mm1
;* V17 tmp16 [V17 ] ( 0, 0 ) simd16 -> zero-ref
; V18 tmp17 [V18,T17] ( 2, 2 ) simd16 -> [rsp+0x40] do-not-enreg[SB] ld-addr-op
; V19 tmp18 [V19,T22] ( 2, 2 ) float -> mm1
; V20 tmp19 [V20,T23] ( 2, 2 ) float -> mm1
; V21 tmp20 [V21,T12] ( 2, 4 ) struct (16) [rsp+0x30] do-not-enreg[SVB] ld-addr-op
; V22 tmp21 [V22,T04] ( 4, 4 ) simd16 -> mm0 ld-addr-op
;* V23 tmp22 [V23 ] ( 0, 0 ) float -> zero-ref
; V24 tmp23 [V24,T13] ( 2, 4 ) simd16 -> mm1
;* V25 tmp24 [V25 ] ( 0, 0 ) simd16 -> zero-ref
; V26 tmp25 [V26,T18] ( 2, 2 ) simd16 -> [rsp+0x20] do-not-enreg[SB] ld-addr-op
; V27 tmp26 [V27,T24] ( 2, 2 ) float -> mm1
; V28 tmp27 [V28,T25] ( 2, 2 ) float -> mm1
; V29 tmp28 [V29,T14] ( 2, 4 ) struct (16) [rsp+0x10] do-not-enreg[SVB] ld-addr-op
; V30 tmp29 [V30,T05] ( 4, 4 ) simd16 -> mm0 ld-addr-op
;* V31 tmp30 [V31 ] ( 0, 0 ) float -> zero-ref
; V32 tmp31 [V32,T15] ( 2, 4 ) simd16 -> mm1
;* V33 tmp32 [V33 ] ( 0, 0 ) simd16 -> zero-ref
; V34 tmp33 [V34,T19] ( 2, 2 ) simd16 -> [rsp+0x00] do-not-enreg[SB] ld-addr-op
; V35 tmp34 [V35,T26] ( 2, 2 ) float -> mm1
; V36 tmp35 [V36,T27] ( 2, 2 ) float -> mm1
; V37 tmp36 [V37,T28] ( 2, 2 ) float -> mm0 V01.X(offs=0x00) P-INDEP
; V38 tmp37 [V38,T29] ( 2, 2 ) float -> mm1 V01.Y(offs=0x04) P-INDEP
; V39 tmp38 [V39,T30] ( 2, 2 ) float -> mm2 V01.Z(offs=0x08) P-INDEP
; V40 tmp39 [V40,T31] ( 2, 2 ) float -> mm3 V01.W(offs=0x0c) P-INDEP
; V41 tmp40 [V41,T00] ( 5, 10 ) byref -> rax stack-byref
;# V42 OutArgs [V42 ] ( 1, 1 ) lclBlk ( 0) [rsp+0x00]
;
; Lcl frame size = 184
G_M39223_IG01:
4881ECB8000000 sub rsp, 184
C5F877 vzeroupper
G_M39223_IG02:
C4E17A10055D010000 vmovss xmm0, dword ptr [reloc @RWD00]
C4E17A100D58010000 vmovss xmm1, dword ptr [reloc @RWD04]
C4E17A101553010000 vmovss xmm2, dword ptr [reloc @RWD08]
C4E17A101D4E010000 vmovss xmm3, dword ptr [reloc @RWD12]
488D442478 lea rax, bword ptr [rsp+78H]
C4E17A1100 vmovss dword ptr [rax], xmm0
C4E17A114804 vmovss dword ptr [rax+4], xmm1
C4E17A115008 vmovss dword ptr [rax+8], xmm2
C4E17A11580C vmovss dword ptr [rax+12], xmm3
C4E17910442478 vmovupd xmm0, xmmword ptr [rsp+78H]
C4E17828C8 vmovaps xmm1, xmm0
C4E37140C8F1 vdpps xmm1, xmm0, 241
C4E17251C9 vsqrtss xmm1, xmm1
C4E27918C9 vbroadcastss xmm1, xmm1
C4E1785EC1 vdivps xmm0, xmm1
C4E17929442460 vmovapd xmmword ptr [rsp+60H], xmm0
C4E17A6F442460 vmovdqu xmm0, qword ptr [rsp+60H]
C4E17A7F8424A8000000 vmovdqu qword ptr [rsp+A8H], xmm0
C4E17A6F8424A8000000 vmovdqu xmm0, qword ptr [rsp+A8H]
C4E17A7F442450 vmovdqu qword ptr [rsp+50H], xmm0
C4E17910442450 vmovupd xmm0, xmmword ptr [rsp+50H]
C4E17828C8 vmovaps xmm1, xmm0
C4E37140C8F1 vdpps xmm1, xmm0, 241
C4E17251C9 vsqrtss xmm1, xmm1
C4E27918C9 vbroadcastss xmm1, xmm1
C4E1785EC1 vdivps xmm0, xmm1
C4E17929442440 vmovapd xmmword ptr [rsp+40H], xmm0
C4E17A6F442440 vmovdqu xmm0, qword ptr [rsp+40H]
C4E17A7F842498000000 vmovdqu qword ptr [rsp+98H], xmm0
C4E17A6F842498000000 vmovdqu xmm0, qword ptr [rsp+98H]
C4E17A7F442430 vmovdqu qword ptr [rsp+30H], xmm0
C4E17910442430 vmovupd xmm0, xmmword ptr [rsp+30H]
C4E17828C8 vmovaps xmm1, xmm0
C4E37140C8F1 vdpps xmm1, xmm0, 241
C4E17251C9 vsqrtss xmm1, xmm1
C4E27918C9 vbroadcastss xmm1, xmm1
C4E1785EC1 vdivps xmm0, xmm1
C4E17929442420 vmovapd xmmword ptr [rsp+20H], xmm0
C4E17A6F442420 vmovdqu xmm0, qword ptr [rsp+20H]
C4E17A7F842488000000 vmovdqu qword ptr [rsp+88H], xmm0
C4E17A6F842488000000 vmovdqu xmm0, qword ptr [rsp+88H]
C4E17A7F442410 vmovdqu qword ptr [rsp+10H], xmm0
C4E17910442410 vmovupd xmm0, xmmword ptr [rsp+10H]
C4E17828C8 vmovaps xmm1, xmm0
C4E37140C8F1 vdpps xmm1, xmm0, 241
C4E17251C9 vsqrtss xmm1, xmm1
C4E27918C9 vbroadcastss xmm1, xmm1
C4E1785EC1 vdivps xmm0, xmm1
C4E179290424 vmovapd xmmword ptr [rsp], xmm0
C4E17A6F0424 vmovdqu xmm0, qword ptr [rsp]
C4E17A7F01 vmovdqu qword ptr [rcx], xmm0
488BC1 mov rax, rcx
G_M39223_IG03:
4881C4B8000000 add rsp, 184
C3 ret
; Total bytes of code 357, prolog size 10 for method Program:Normalize_New():struct
And goes a little faster
Which suggests Vector4 operator should be changed to use |
Some of the asm is a bit redundant though?
C4E17A7F8424A8000000 vmovdqu qword ptr [rsp+A8H], xmm0
C4E17A6F8424A8000000 vmovdqu xmm0, qword ptr [rsp+A8H]
It doesn't seem to be necessary. The fundamental problem seems to be that the current implementation of
float invDiv = 1.0f / value2;
return new Vector4(value1.X * invDiv, value1.Y * invDiv, value1.Z * invDiv, value1.W * invDiv);
This should be
return value1 / new Vector4(value2);
that would give you a broadcast/shuffle + divps.
return value1 * (1.0f / value2);
that would give you divss + broadcast/shuffle + mulps. But this approach is kind of lame. It's slower on current hardware and it's also less precise. Changing
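To make the three variants discussed above easier to compare side by side, here is a small sketch that wraps each in its own method. The class and method names are hypothetical; the bodies are the expressions quoted in the comment, and the codegen notes in the code comments repeat what the comment states and are not independently measured here:

```csharp
using System;
using System.Numerics;

public static class Vector4DivideSketch
{
    // Per-component form quoted above: one scalar reciprocal, four scalar multiplies.
    public static Vector4 DivideScalarReciprocal(Vector4 value1, float value2)
    {
        float invDiv = 1.0f / value2;
        return new Vector4(value1.X * invDiv, value1.Y * invDiv, value1.Z * invDiv, value1.W * invDiv);
    }

    // Broadcast the scalar, then divide as vectors (broadcast/shuffle + divps per the thread).
    public static Vector4 DivideBroadcast(Vector4 value1, float value2)
    {
        return value1 / new Vector4(value2);
    }

    // Scalar reciprocal, then vector multiply (divss + broadcast/shuffle + mulps per the thread).
    public static Vector4 DivideReciprocalMultiply(Vector4 value1, float value2)
    {
        return value1 * (1.0f / value2);
    }

    public static void Main()
    {
        Vector4 v = new Vector4(8.5f, 9.4f, 1.2f, 1f);
        Console.WriteLine(DivideScalarReciprocal(v, 3f));
        Console.WriteLine(DivideBroadcast(v, 3f));
        Console.WriteLine(DivideReciprocalMultiply(v, 3f));
    }
}
```

The two reciprocal-based forms round twice (once for 1/value2, once for the multiply), which is the precision concern raised above; the broadcast form rounds once per component.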
So it seems. Could be an inlining artifact, sometimes it generates copies that aren't removed by subsequent phases. Or it's an unfortunate side effect of using |
Issue: https://github.com/dotnet/coreclr/issues/16385 Workaround: #27122 |
@benaadams, what is the status of this PR? Any updates? |
I've been kidnapped for 2 weeks |
@benaadams how is this week treating you? 😉 |
BTW: If the change is considered "risky" by area owners, we might need to wait for master branch being reopen for post-2.1 work. (2-3 weeks) |
Back on it |
Windows x86 Release Build failure https://github.com/dotnet/corefx/issues/28453 |
Still not good 😢
public static Quaternion Normalize_Current()
{
Quaternion start = new Quaternion(8.5f, 9.4f, 1.2f, 1f);
Quaternion c1 = Quaternion.Normalize(start);
Quaternion c2 = Quaternion.Normalize(c1);
Quaternion c3 = Quaternion.Normalize(c2);
return Quaternion.Normalize(c3);
}
public static QuaternionStruct Normalize_New()
{
QuaternionStruct start = new QuaternionStruct(8.5f, 9.4f, 1.2f, 1f);
QuaternionStruct c1 = QuaternionStruct.Normalize(start);
QuaternionStruct c2 = QuaternionStruct.Normalize(c1);
QuaternionStruct c3 = QuaternionStruct.Normalize(c2);
return QuaternionStruct.Normalize(c3);
}
public static Vector4 Normalize_Vector4()
{
Vector4 start = new Vector4(8.5f, 9.4f, 1.2f, 1f);
Vector4 c1 = Vector4.Normalize(start);
Vector4 c2 = Vector4.Normalize(c1);
Vector4 c3 = Vector4.Normalize(c2);
return Vector4.Normalize(c3);
}
; Assembly listing for method Program:Normalize_Vector4():struct
; ...
; Lcl frame size = 0
G_M3011_IG01:
C5F877 vzeroupper
G_M3011_IG02:
C4E17A1005CC000000 vmovss xmm0, dword ptr [reloc @RWD00]
C4E17A100DC7000000 vmovss xmm1, dword ptr [reloc @RWD04]
C4E17A1015C2000000 vmovss xmm2, dword ptr [reloc @RWD08]
C4E17A101DBD000000 vmovss xmm3, dword ptr [reloc @RWD12]
C4E15857E4 vxorps xmm4, xmm4
C4E15A10E3 vmovss xmm4, xmm4, xmm3
C4E15973FC04 vpslldq xmm4, 4
C4E15A10E2 vmovss xmm4, xmm4, xmm2
C4E15973FC04 vpslldq xmm4, 4
C4E15A10E1 vmovss xmm4, xmm4, xmm1
C4E15973FC04 vpslldq xmm4, 4
C4E15A10E0 vmovss xmm4, xmm4, xmm0
C4E17828C4 vmovaps xmm0, xmm4
C4E17828C8 vmovaps xmm1, xmm0
C4E37140C8F1 vdpps xmm1, xmm0, 241
C4E17251C9 vsqrtss xmm1, xmm1
C4E27918C9 vbroadcastss xmm1, xmm1
C4E1785EC1 vdivps xmm0, xmm1
C4E17828C8 vmovaps xmm1, xmm0
C4E37140C8F1 vdpps xmm1, xmm0, 241
C4E17251C9 vsqrtss xmm1, xmm1
C4E27918C9 vbroadcastss xmm1, xmm1
C4E1785EC1 vdivps xmm0, xmm1
C4E17828C8 vmovaps xmm1, xmm0
C4E37140C8F1 vdpps xmm1, xmm0, 241
C4E17251C9 vsqrtss xmm1, xmm1
C4E27918C9 vbroadcastss xmm1, xmm1
C4E1785EC1 vdivps xmm0, xmm1
C4E17828C8 vmovaps xmm1, xmm0
C4E37140C8F1 vdpps xmm1, xmm0, 241
C4E17251C9 vsqrtss xmm1, xmm1
C4E27918C9 vbroadcastss xmm1, xmm1
C4E1785EC1 vdivps xmm0, xmm1
C4E1791101 vmovupd xmmword ptr [rcx], xmm0
488BC1 mov rax, rcx
G_M3011_IG03:
C3 ret
; Total bytes of code 200, prolog size 3 for method Program:Normalize_Vector4():struct
; Assembly listing for method Program:Normalize_New():struct
;
; V00 RetBuf [V00,T05] ( 4, 4 ) byref -> rcx
;* V01 loc0 [V01 ] ( 0, 0 ) struct (16) zero-ref
; V02 tmp1 [V02,T06] ( 2, 4 ) struct (16) [rsp+0xA8] do-not-enreg[SB]
; V03 tmp2 [V03,T07] ( 2, 4 ) struct (16) [rsp+0x98] do-not-enreg[SB]
; V04 tmp3 [V04,T08] ( 2, 4 ) struct (16) [rsp+0x88] do-not-enreg[SB]
; V05 tmp4 [V05 ] ( 2, 4 ) struct (16) [rsp+0x78] do-not-enreg[XSVB] addr-exposed ld-addr-op
; V06 tmp5 [V06,T16] ( 2, 2 ) simd16 -> [rsp+0x60] do-not-enreg[SB] ld-addr-op
; V07 tmp6 [V07,T17] ( 2, 2 ) simd16 -> mm0
; V08 tmp7 [V08,T01] ( 4, 8 ) simd16 -> mm0 ld-addr-op
;* V09 tmp8 [V09 ] ( 0, 0 ) float -> zero-ref
; V10 tmp9 [V10,T24] ( 2, 2 ) float -> mm1
; V11 tmp10 [V11,T25] ( 2, 2 ) float -> mm1
;* V12 tmp11 [V12 ] ( 0, 0 ) simd16 -> zero-ref
; V13 tmp12 [V13,T09] ( 2, 4 ) simd16 -> mm1
; V14 tmp13 [V14,T10] ( 2, 4 ) struct (16) [rsp+0x50] do-not-enreg[SVB] ld-addr-op
; V15 tmp14 [V15,T18] ( 2, 2 ) simd16 -> [rsp+0x40] do-not-enreg[SB] ld-addr-op
; V16 tmp15 [V16,T19] ( 2, 2 ) simd16 -> mm0
; V17 tmp16 [V17,T02] ( 4, 8 ) simd16 -> mm0 ld-addr-op
;* V18 tmp17 [V18 ] ( 0, 0 ) float -> zero-ref
; V19 tmp18 [V19,T26] ( 2, 2 ) float -> mm1
; V20 tmp19 [V20,T27] ( 2, 2 ) float -> mm1
;* V21 tmp20 [V21 ] ( 0, 0 ) simd16 -> zero-ref
; V22 tmp21 [V22,T11] ( 2, 4 ) simd16 -> mm1
; V23 tmp22 [V23,T12] ( 2, 4 ) struct (16) [rsp+0x30] do-not-enreg[SVB] ld-addr-op
; V24 tmp23 [V24,T20] ( 2, 2 ) simd16 -> [rsp+0x20] do-not-enreg[SB] ld-addr-op
; V25 tmp24 [V25,T21] ( 2, 2 ) simd16 -> mm0
; V26 tmp25 [V26,T03] ( 4, 8 ) simd16 -> mm0 ld-addr-op
;* V27 tmp26 [V27 ] ( 0, 0 ) float -> zero-ref
; V28 tmp27 [V28,T28] ( 2, 2 ) float -> mm1
; V29 tmp28 [V29,T29] ( 2, 2 ) float -> mm1
;* V30 tmp29 [V30 ] ( 0, 0 ) simd16 -> zero-ref
; V31 tmp30 [V31,T13] ( 2, 4 ) simd16 -> mm1
; V32 tmp31 [V32,T14] ( 2, 4 ) struct (16) [rsp+0x10] do-not-enreg[SVB] ld-addr-op
; V33 tmp32 [V33,T22] ( 2, 2 ) simd16 -> [rsp+0x00] do-not-enreg[SB] ld-addr-op
; V34 tmp33 [V34,T23] ( 2, 2 ) simd16 -> mm0
; V35 tmp34 [V35,T04] ( 4, 8 ) simd16 -> mm0 ld-addr-op
;* V36 tmp35 [V36 ] ( 0, 0 ) float -> zero-ref
; V37 tmp36 [V37,T30] ( 2, 2 ) float -> mm1
; V38 tmp37 [V38,T31] ( 2, 2 ) float -> mm1
;* V39 tmp38 [V39 ] ( 0, 0 ) simd16 -> zero-ref
; V40 tmp39 [V40,T15] ( 2, 4 ) simd16 -> mm1
; V41 tmp40 [V41,T32] ( 2, 2 ) float -> mm0 V01.X(offs=0x00) P-INDEP
; V42 tmp41 [V42,T33] ( 2, 2 ) float -> mm1 V01.Y(offs=0x04) P-INDEP
; V43 tmp42 [V43,T34] ( 2, 2 ) float -> mm2 V01.Z(offs=0x08) P-INDEP
; V44 tmp43 [V44,T35] ( 2, 2 ) float -> mm3 V01.W(offs=0x0c) P-INDEP
; V45 tmp44 [V45,T00] ( 5, 10 ) byref -> rax stack-byref
;# V46 OutArgs [V46 ] ( 1, 1 ) lclBlk ( 0) [rsp+0x00]
;
; Lcl frame size = 184
G_M39231_IG01:
4881ECB8000000 sub rsp, 184
C5F877 vzeroupper
G_M39231_IG02:
C4E17A10055D010000 vmovss xmm0, dword ptr [reloc @RWD00]
C4E17A100D58010000 vmovss xmm1, dword ptr [reloc @RWD04]
C4E17A101553010000 vmovss xmm2, dword ptr [reloc @RWD08]
C4E17A101D4E010000 vmovss xmm3, dword ptr [reloc @RWD12]
488D442478 lea rax, bword ptr [rsp+78H]
C4E17A1100 vmovss dword ptr [rax], xmm0
C4E17A114804 vmovss dword ptr [rax+4], xmm1
C4E17A115008 vmovss dword ptr [rax+8], xmm2
C4E17A11580C vmovss dword ptr [rax+12], xmm3
C4E17910442478 vmovupd xmm0, xmmword ptr [rsp+78H]
C4E17828C8 vmovaps xmm1, xmm0
C4E37140C8F1 vdpps xmm1, xmm0, 241
C4E17251C9 vsqrtss xmm1, xmm1
C4E27918C9 vbroadcastss xmm1, xmm1
C4E1785EC1 vdivps xmm0, xmm1
C4E17929442460 vmovapd xmmword ptr [rsp+60H], xmm0
C4E17A6F442460 vmovdqu xmm0, qword ptr [rsp+60H]
C4E17A7F8424A8000000 vmovdqu qword ptr [rsp+A8H], xmm0
C4E17A6F8424A8000000 vmovdqu xmm0, qword ptr [rsp+A8H]
C4E17A7F442450 vmovdqu qword ptr [rsp+50H], xmm0
C4E17910442450 vmovupd xmm0, xmmword ptr [rsp+50H]
C4E17828C8 vmovaps xmm1, xmm0
C4E37140C8F1 vdpps xmm1, xmm0, 241
C4E17251C9 vsqrtss xmm1, xmm1
C4E27918C9 vbroadcastss xmm1, xmm1
C4E1785EC1 vdivps xmm0, xmm1
C4E17929442440 vmovapd xmmword ptr [rsp+40H], xmm0
C4E17A6F442440 vmovdqu xmm0, qword ptr [rsp+40H]
C4E17A7F842498000000 vmovdqu qword ptr [rsp+98H], xmm0
C4E17A6F842498000000 vmovdqu xmm0, qword ptr [rsp+98H]
C4E17A7F442430 vmovdqu qword ptr [rsp+30H], xmm0
C4E17910442430 vmovupd xmm0, xmmword ptr [rsp+30H]
C4E17828C8 vmovaps xmm1, xmm0
C4E37140C8F1 vdpps xmm1, xmm0, 241
C4E17251C9 vsqrtss xmm1, xmm1
C4E27918C9 vbroadcastss xmm1, xmm1
C4E1785EC1 vdivps xmm0, xmm1
C4E17929442420 vmovapd xmmword ptr [rsp+20H], xmm0
C4E17A6F442420 vmovdqu xmm0, qword ptr [rsp+20H]
C4E17A7F842488000000 vmovdqu qword ptr [rsp+88H], xmm0
C4E17A6F842488000000 vmovdqu xmm0, qword ptr [rsp+88H]
C4E17A7F442410 vmovdqu qword ptr [rsp+10H], xmm0
C4E17910442410 vmovupd xmm0, xmmword ptr [rsp+10H]
C4E17828C8 vmovaps xmm1, xmm0
C4E37140C8F1 vdpps xmm1, xmm0, 241
C4E17251C9 vsqrtss xmm1, xmm1
C4E27918C9 vbroadcastss xmm1, xmm1
C4E1785EC1 vdivps xmm0, xmm1
C4E179290424 vmovapd xmmword ptr [rsp], xmm0
C4E17A6F0424 vmovdqu xmm0, qword ptr [rsp]
C4E17A7F01 vmovdqu qword ptr [rcx], xmm0
488BC1 mov rax, rcx
G_M39231_IG03:
4881C4B8000000 add rsp, 184
C3 ret
; Total bytes of code 357, prolog size 10 for method Program:Normalize_New():struct
; Assembly listing for method Program:Normalize_Current():struct
;...
; V00 RetBuf [V00,T01] ( 4, 4 ) byref -> rsi
;* V01 loc0 [V01 ] ( 0, 0 ) struct (16) zero-ref
; V02 tmp1 [V02 ] ( 2, 4 ) struct (16) [rsp+0x50] do-not-enreg[XSB] addr-exposed
; V03 tmp2 [V03 ] ( 2, 4 ) struct (16) [rsp+0x40] do-not-enreg[XSB] addr-exposed
; V04 tmp3 [V04 ] ( 2, 4 ) struct (16) [rsp+0x30] do-not-enreg[XSB] addr-exposed
; V05 tmp4 [V05,T06] ( 2, 2 ) float -> mm0 V01.X(offs=0x00) P-INDEP
; V06 tmp5 [V06,T07] ( 2, 2 ) float -> mm1 V01.Y(offs=0x04) P-INDEP
; V07 tmp6 [V07,T08] ( 2, 2 ) float -> mm2 V01.Z(offs=0x08) P-INDEP
; V08 tmp7 [V08,T09] ( 2, 2 ) float -> mm3 V01.W(offs=0x0c) P-INDEP
; V09 tmp8 [V09 ] ( 12, 24 ) struct (16) [rsp+0x20] do-not-enreg[XSB] addr-exposed
; V10 tmp9 [V10,T00] ( 5, 10 ) byref -> rdx stack-byref
; V11 tmp10 [V11,T03] ( 2, 4 ) long -> rcx
; V12 tmp11 [V12,T04] ( 2, 4 ) long -> rcx
; V13 tmp12 [V13,T05] ( 2, 4 ) long -> rcx
; V14 tmp13 [V14,T02] ( 2, 4 ) byref -> rcx
; V15 OutArgs [V15 ] ( 1, 1 ) lclBlk (32) [rsp+0x00]
;
; Lcl frame size = 96
G_M20468_IG01:
56 push rsi
4883EC60 sub rsp, 96
C5F877 vzeroupper
488BF1 mov rsi, rcx
G_M20468_IG02:
C4E17A1005A4000000 vmovss xmm0, dword ptr [reloc @RWD00]
C4E17A100D9F000000 vmovss xmm1, dword ptr [reloc @RWD04]
C4E17A10159A000000 vmovss xmm2, dword ptr [reloc @RWD08]
C4E17A101D95000000 vmovss xmm3, dword ptr [reloc @RWD12]
488D4C2450 lea rcx, bword ptr [rsp+50H]
488D542420 lea rdx, bword ptr [rsp+20H]
C4E17A1102 vmovss dword ptr [rdx], xmm0
C4E17A114A04 vmovss dword ptr [rdx+4], xmm1
C4E17A115208 vmovss dword ptr [rdx+8], xmm2
C4E17A115A0C vmovss dword ptr [rdx+12], xmm3
488D542420 lea rdx, bword ptr [rsp+20H]
E8BEFBFFFF call Quaternion:Normalize(struct):struct
488D4C2440 lea rcx, bword ptr [rsp+40H]
C4E17A6F442450 vmovdqu xmm0, qword ptr [rsp+50H]
C4E17A7F442420 vmovdqu qword ptr [rsp+20H], xmm0
488D542420 lea rdx, bword ptr [rsp+20H]
E8A1FBFFFF call Quaternion:Normalize(struct):struct
488D4C2430 lea rcx, bword ptr [rsp+30H]
C4E17A6F442440 vmovdqu xmm0, qword ptr [rsp+40H]
C4E17A7F442420 vmovdqu qword ptr [rsp+20H], xmm0
488D542420 lea rdx, bword ptr [rsp+20H]
E884FBFFFF call Quaternion:Normalize(struct):struct
488BCE mov rcx, rsi
C4E17A6F442430 vmovdqu xmm0, qword ptr [rsp+30H]
C4E17A7F442420 vmovdqu qword ptr [rsp+20H], xmm0
488D542420 lea rdx, bword ptr [rsp+20H]
E869FBFFFF call Quaternion:Normalize(struct):struct
488BC6 mov rax, rsi
G_M20468_IG03:
4883C460 add rsp, 96
5E pop rsi
C3 ret
; Total bytes of code 184, prolog size 8 for method Program:Normalize_Current():struct |
Going to give up for now 😞 |
Difference between Vector4 and the Quaternion cast to Vector4 is mainly these blocks I think:
C4E17929442420 vmovapd xmmword ptr [rsp+20H], xmm0
C4E17A6F442420 vmovdqu xmm0, qword ptr [rsp+20H]
C4E17A7F842488000000 vmovdqu qword ptr [rsp+88H], xmm0
C4E17A6F842488000000 vmovdqu xmm0, qword ptr [rsp+88H]
C4E17A7F442410 vmovdqu qword ptr [rsp+10H], xmm0
C4E17910442410 vmovupd xmm0, xmmword ptr [rsp+10H] |
Raised issue https://github.com/dotnet/coreclr/issues/17207 |
@benaadams - did you intend to reopen this PR? I see you gave up for now and closed it. But then re-opened it the same day. I just want to verify if this PR should be opened or closed. |
Closed it; then opened an issue in coreclr; and reopened in hope :) I'd like Quaternion to be vectorized, but I also don't want to make it worse along the way...
Note that the |
Added a PR for the test changes in this PR: #28582. Perhaps something to revisit with CPU intrinsics rather than Vector4.
Contributes to https://github.com/dotnet/corefx/issues/7751
PTAL @mellinoe @eerhardt @tannergooding @CarolEidt