Consider SIMD acceleration for Matrix and Quaternion operations similar to Vector ones #16984
It should be possible to at least implement some operations, like:

```csharp
public static unsafe Matrix4x4 operator *(Matrix4x4 value1, float value2)
{
    Matrix4x4 result;
    Vector4* srcPtr = (Vector4*)&value1;
    Vector4* destPtr = (Vector4*)&result;
    destPtr[0] = srcPtr[0] * value2;
    destPtr[1] = srcPtr[1] * value2;
    destPtr[2] = srcPtr[2] * value2;
    destPtr[3] = srcPtr[3] * value2;
    return result;
}
```

For fun, it could also be implemented over `Vector<float>`:

```csharp
public static unsafe Matrix4x4 operator *(Matrix4x4 value1, float value2)
{
    Matrix4x4 result;
    float* srcPtr = (float*)&value1;
    float* destPtr = (float*)&result;
    int stride = Vector<float>.Count;
    Vector<float> scale = new Vector<float>(value2);
    Debug.Assert(16 % stride == 0);
    for (int i = 0; i < 16; i += stride)
    {
        Vector<float> src = Unsafe.Read<Vector<float>>(srcPtr);
        Unsafe.Write(destPtr, src * scale);
        srcPtr += stride;
        destPtr += stride;
    }
    return result;
}
```

Theoretically, the second version could complete the entire multiplication in one "stride" on CPUs supporting AVX-512 (upcoming hardware), which would be neat. Neither of these will be as fast as real JIT intrinsic recognition, but they might be faster than the naive implementation currently in use; we'd need to run some performance tests to see how they behave. That said, the most "useful" methods to speed up would be matrix multiplication. @CarolEidt for more input.
It seems like maybe Matrix4x4 should be reworked to simply contain Vector4 fields instead of individual scalar values. Maybe this has come up before, but the current layout is not great from a performance point of view.
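To illustrate the suggestion above, here is a minimal sketch of what such a layout could look like. The `Matrix4x4Rows` name and its members are hypothetical, not part of System.Numerics; the point is that with `Vector4` fields, element-wise operations fall out of the already-accelerated `Vector4` operators without any unsafe pointer tricks:

```csharp
using System.Numerics;

// Hypothetical layout: a 4x4 matrix stored as four Vector4 rows, so
// element-wise operations map directly onto packed Vector4 arithmetic.
public struct Matrix4x4Rows
{
    public Vector4 Row1, Row2, Row3, Row4;

    // Scalar multiply: four packed multiplications instead of 16 scalar ones.
    public static Matrix4x4Rows operator *(Matrix4x4Rows m, float s)
    {
        Matrix4x4Rows result;
        result.Row1 = m.Row1 * s;
        result.Row2 = m.Row2 * s;
        result.Row3 = m.Row3 * s;
        result.Row4 = m.Row4 * s;
        return result;
    }
}
```

Addition, subtraction, and negation would follow the same pattern; only the full matrix product needs more care.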
I'm going to close this because the work involved in this is over in the https://github.com/dotnet/coreclr repo, and the work can just be tracked by issue https://github.com/dotnet/coreclr/issues/4356.
Currently only Vector operations seem to benefit from SIMD acceleration; Matrix and Quaternion operations still use scalar arithmetic. For instance, adding two Vector4 instances emits a single packed-addition instruction, while transforming a Vector4 with a Matrix4x4 results in 16 scalar multiplications and 12 scalar additions, instead of 4 packed multiplications and 3 packed additions.

Since one of the main usage scenarios of System.Numerics.Vectors is graphically intensive 2D and 3D applications, there are many cases where such applications could greatly benefit from accelerating Matrix and Quaternion operations. The theoretical speedup of a factor of 4 simply cannot be ignored. Some operations can be emulated using Vector types instead (e.g. Matrix4x4 can be represented with 4 Vector4 instances, so the Matrix4x4 x Vector4 transformation can be emulated by inlining the necessary code by hand), but other operations are hard or impossible to emulate efficiently.
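The hand-inlined emulation mentioned above can be sketched as follows. The `TransformPacked` helper name is hypothetical; it uses the row-vector convention of System.Numerics, broadcasting each component of the vector and accumulating across the four matrix rows, which is the 4-multiply/3-add shape described above:

```csharp
using System.Numerics;

public static class TransformSketch
{
    // Hypothetical helper: transforms a Vector4 by a Matrix4x4 using
    // 4 packed multiplications and 3 packed additions, rather than
    // 16 scalar multiplications and 12 scalar additions.
    public static Vector4 TransformPacked(Vector4 v, Matrix4x4 m)
    {
        // Gather the matrix rows as Vector4 values.
        Vector4 row1 = new Vector4(m.M11, m.M12, m.M13, m.M14);
        Vector4 row2 = new Vector4(m.M21, m.M22, m.M23, m.M24);
        Vector4 row3 = new Vector4(m.M31, m.M32, m.M33, m.M34);
        Vector4 row4 = new Vector4(m.M41, m.M42, m.M43, m.M44);

        // Broadcast each component of v and accumulate: v * M (row-vector form).
        return new Vector4(v.X) * row1
             + new Vector4(v.Y) * row2
             + new Vector4(v.Z) * row3
             + new Vector4(v.W) * row4;
    }
}
```

This only works cleanly because the Vector4 operators are themselves accelerated; operations like matrix inversion have no such straightforward decomposition.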