
Consider SIMD acceleration for Matrix and Quaternion operations similar to Vector ones #16984

Closed
vanka78bg opened this issue Apr 14, 2016 · 4 comments
Labels
area-System.Numerics enhancement Product code improvement that does NOT require public API changes/additions

@vanka78bg

Currently only Vector operations seem to benefit from SIMD acceleration; Matrix and Quaternion operations still fall back to scalar code. For instance, adding two Vector4 instances emits a single packed-addition instruction, while transforming a Vector4 with a Matrix4x4 results in 16 scalar multiplications and 12 scalar additions, instead of 4 packed multiplications and 3 packed additions.

Since one of the main usage scenarios of System.Numerics.Vectors is graphically intensive 2D and 3D applications, there are many cases where such applications could greatly benefit from accelerated Matrix and Quaternion operations. The theoretical speedup by a factor of 4 simply cannot be ignored. Some operations can be emulated by using Vector instead (e.g. Matrix4x4 can be represented with 4 Vector4 instances, so we can emulate the Matrix4x4 x Vector4 transformation by inlining the necessary code by hand), but other operations are hard or impossible to emulate efficiently.
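The hand-inlined emulation mentioned above might look something like this sketch (the `Transform` helper name is an assumption, not an existing API; it follows the row-vector convention of `Vector4.Transform` and the row-major layout of `System.Numerics.Matrix4x4`):

```csharp
using System.Numerics;

static class TransformSketch
{
    // Hypothetical helper: treats the rows of a row-major Matrix4x4 as four
    // Vector4 values, so the transform can compile down to packed
    // multiply/add instructions instead of 16 scalar multiplies.
    public static Vector4 Transform(Vector4 v, Matrix4x4 m)
    {
        Vector4 row1 = new Vector4(m.M11, m.M12, m.M13, m.M14);
        Vector4 row2 = new Vector4(m.M21, m.M22, m.M23, m.M24);
        Vector4 row3 = new Vector4(m.M31, m.M32, m.M33, m.M34);
        Vector4 row4 = new Vector4(m.M41, m.M42, m.M43, m.M44);

        // 4 packed multiplies + 3 packed adds, matching Vector4.Transform's
        // row-vector convention (v * M).
        return v.X * row1 + v.Y * row2 + v.Z * row3 + v.W * row4;
    }
}
```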

@mellinoe
Contributor

mellinoe commented Apr 14, 2016

It should be possible to at least implement some operations, like Matrix4x4 * float with the existing API, and it would be interesting to see if there are some speed improvements from an implementation like this:

        public static unsafe Matrix4x4 operator *(Matrix4x4 value1, float value2)
        {
            Matrix4x4 result;
            Vector4* srcPtr = (Vector4*)&value1;
            Vector4* destPtr = (Vector4*)&result;

            destPtr[0] = srcPtr[0] * value2;
            destPtr[1] = srcPtr[1] * value2;
            destPtr[2] = srcPtr[2] * value2;
            destPtr[3] = srcPtr[3] * value2;

            return result;
        }

For fun, it could also be implemented over Vector&lt;T&gt;. This probably isn't as good an idea as the above, though, for various reasons. This also relies on the "Unsafe" class addition, which is still a work in progress.

        public static unsafe Matrix4x4 operator *(Matrix4x4 value1, float value2)
        {
            Matrix4x4 result;
            float* srcPtr = (float*)&value1;
            float* destPtr = (float*)&result;
            int stride = Vector<float>.Count;

            Debug.Assert(16 % stride == 0);
            for (int i = 0; i < 16; i += stride)
            {
                Vector<float> src = Unsafe.Read<Vector<float>>(srcPtr);
                Unsafe.Write(destPtr, src * value2); // scale each lane by value2
                srcPtr += stride;
                destPtr += stride;
            }

            return result;
        }

Theoretically, the above could complete the entire multiplication in one "stride" on CPUs supporting AVX512 (upcoming hardware), which would be neat.

The above won't be as fast as real JIT intrinsic recognition, but they might be faster than the naive implementation currently in use. We'd need to run some performance tests on these to see how they behave.

That said, it seems like the most "useful" methods to speed up would be Matrix multiplication (i.e. Matrix4x4 * Matrix4x4) and Vector4/Matrix4x4 transformation, since they will be the most commonly used. Multiplying a Matrix4x4 by a scalar would probably be easy to speed up like above, but it might not be worth it, as it isn't as commonly used. Matrix multiplication could be easily improved if we could speed up Matrix4x4.Transpose, which we have considered in the past, as multiplication can be implemented via transpose -> row multiplication. Row multiplication can be easily vectorized (even with the current implementation), but transposition will be harder to implement, and probably will require special JIT recognition involving shuffling, etc.
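The transpose -&gt; row-multiplication scheme described above can be sketched with the existing API (`Matrix4x4.Transpose` and `Vector4.Dot` already exist; the `Row` helper and `Multiply` name are assumptions for illustration):

```csharp
using System.Numerics;

static class MatMulSketch
{
    // Hypothetical helper that pulls out row i of a row-major Matrix4x4.
    static Vector4 Row(Matrix4x4 m, int i)
    {
        if (i == 0) return new Vector4(m.M11, m.M12, m.M13, m.M14);
        if (i == 1) return new Vector4(m.M21, m.M22, m.M23, m.M24);
        if (i == 2) return new Vector4(m.M31, m.M32, m.M33, m.M34);
        return new Vector4(m.M41, m.M42, m.M43, m.M44);
    }

    public static Matrix4x4 Multiply(Matrix4x4 a, Matrix4x4 b)
    {
        // Transpose B once, so each element of the product is a dot product
        // of a row of A with a row of B^T (i.e. a column of B). Vector4.Dot
        // is the packed inner loop the JIT could accelerate.
        Matrix4x4 t = Matrix4x4.Transpose(b);
        Vector4 a1 = Row(a, 0), a2 = Row(a, 1), a3 = Row(a, 2), a4 = Row(a, 3);
        Vector4 b1 = Row(t, 0), b2 = Row(t, 1), b3 = Row(t, 2), b4 = Row(t, 3);

        return new Matrix4x4(
            Vector4.Dot(a1, b1), Vector4.Dot(a1, b2), Vector4.Dot(a1, b3), Vector4.Dot(a1, b4),
            Vector4.Dot(a2, b1), Vector4.Dot(a2, b2), Vector4.Dot(a2, b3), Vector4.Dot(a2, b4),
            Vector4.Dot(a3, b1), Vector4.Dot(a3, b2), Vector4.Dot(a3, b3), Vector4.Dot(a3, b4),
            Vector4.Dot(a4, b1), Vector4.Dot(a4, b2), Vector4.Dot(a4, b3), Vector4.Dot(a4, b4));
    }
}
```

How fast this actually is hinges on whether the JIT turns `Vector4.Dot` into an efficient multiply/horizontal-add sequence, which is exactly the kind of intrinsic recognition discussed above.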

@CarolEidt for more input

@vanka78bg
Author

vanka78bg commented Apr 15, 2016

Thank you for the feedback! I understand my suggestion is not currently possible without proper JIT support. For this reason I have logged a similar suggestion in the coreclr repository. For anyone following this, you can find my other suggestion about RyuJIT here: #4356.

@Ziflin

Ziflin commented Aug 17, 2016

It seems like Matrix4x4 should perhaps be reworked to simply contain Vector4 fields instead of individual scalar values. Maybe this has come up before, but the current layout is not great from a performance point of view.
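A minimal sketch of such a Vector4-backed layout (the `Matrix4x4V` name and its members are hypothetical, not the real `Matrix4x4`): element-wise operators then fall out of the already-accelerated Vector4 operators for free.

```csharp
using System.Numerics;

// Hypothetical Vector4-backed matrix layout. Storing rows as Vector4
// fields means element-wise operations are just four Vector4 operations,
// each of which the JIT already maps to packed SIMD instructions.
public struct Matrix4x4V
{
    public Vector4 Row1, Row2, Row3, Row4;

    public static Matrix4x4V operator +(Matrix4x4V a, Matrix4x4V b) => new Matrix4x4V
    {
        Row1 = a.Row1 + b.Row1,
        Row2 = a.Row2 + b.Row2,
        Row3 = a.Row3 + b.Row3,
        Row4 = a.Row4 + b.Row4,
    };
}
```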

@mellinoe
Contributor

mellinoe commented Dec 7, 2016

I'm going to close this because the work involved in this is over in the https://github.com/dotnet/coreclr repo, and the work can just be tracked by issue https://github.com/dotnet/coreclr/issues/4356.

@mellinoe mellinoe closed this as completed Dec 7, 2016
@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the 2.0.0 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Jan 1, 2021
4 participants