
Consider SIMD acceleration for Matrix and Quaternion operations similar to Vector ones #16984

Closed
vanka78bg opened this issue Apr 14, 2016 · 4 comments
Labels
area-System.Numerics enhancement Product code improvement that does NOT require public API changes/additions

@vanka78bg

Currently only Vector operations seem to benefit from SIMD acceleration; Matrix and Quaternion operations still fall back to scalar code. For instance, adding two Vector4 instances emits a single packed-addition instruction, while transforming a Vector4 with a Matrix4x4 results in 16 scalar multiplications and 12 scalar additions, instead of 4 packed multiplications and 3 packed additions.

Since one of the main usage scenarios of System.Numerics.Vectors is graphically intensive 2D and 3D applications, there are many cases where such applications could greatly benefit from accelerated Matrix and Quaternion operations. The theoretical speedup by a factor of 4 simply cannot be ignored. Some operations can be emulated by using Vector instead (e.g. Matrix4x4 can be represented with 4 Vector4 instances, so we can emulate the Matrix4x4 x Vector4 transformation by inlining the necessary code by hand), but other operations are hard or impossible to emulate efficiently.
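The hand-inlined emulation mentioned above might look something like this sketch (the `Transform` helper name is an assumption, not an existing API; it follows the row-vector convention of `Vector4.Transform` and the row-major layout of `System.Numerics.Matrix4x4`):

```csharp
using System.Numerics;

static class TransformSketch
{
    // Hypothetical helper: treats the rows of a row-major Matrix4x4 as four
    // Vector4 values, so the transform can compile down to packed
    // multiply/add instructions instead of 16 scalar multiplies.
    public static Vector4 Transform(Vector4 v, Matrix4x4 m)
    {
        Vector4 row1 = new Vector4(m.M11, m.M12, m.M13, m.M14);
        Vector4 row2 = new Vector4(m.M21, m.M22, m.M23, m.M24);
        Vector4 row3 = new Vector4(m.M31, m.M32, m.M33, m.M34);
        Vector4 row4 = new Vector4(m.M41, m.M42, m.M43, m.M44);

        // 4 packed multiplies + 3 packed adds, matching Vector4.Transform's
        // row-vector convention (v * M).
        return v.X * row1 + v.Y * row2 + v.Z * row3 + v.W * row4;
    }
}
```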

@mellinoe
Contributor

mellinoe commented Apr 14, 2016

It should be possible to at least implement some operations, like Matrix4x4 * float with the existing API, and it would be interesting to see if there are some speed improvements from an implementation like this:

        public static unsafe Matrix4x4 operator *(Matrix4x4 value1, float value2)
        {
            Matrix4x4 result;
            Vector4* srcPtr = (Vector4*)&value1;
            Vector4* destPtr = (Vector4*)&result;

            destPtr[0] = srcPtr[0] * value2;
            destPtr[1] = srcPtr[1] * value2;
            destPtr[2] = srcPtr[2] * value2;
            destPtr[3] = srcPtr[3] * value2;

            return result;
        }

For fun, it could also be implemented over Vector&lt;T&gt;. This probably isn't as good an idea as the above, though, for various reasons. This also relies on the "Unsafe" class addition, which is still a work in progress.

        public static unsafe Matrix4x4 operator *(Matrix4x4 value1, float value2)
        {
            Matrix4x4 result;
            float* srcPtr = (float*)&value1;
            float* destPtr = (float*)&result;
            int stride = Vector<float>.Count;

            Debug.Assert(16 % stride == 0);
            for (int i = 0; i < 16; i += stride)
            {
                Vector<float> src = Unsafe.Read<Vector<float>>(srcPtr);
                Unsafe.Write(destPtr, src * value2); // scale each lane by value2
                srcPtr += stride;
                destPtr += stride;
            }

            return result;
        }

Theoretically, the above could complete the entire multiplication in one "stride" on CPUs supporting AVX512 (upcoming hardware), which would be neat.

The above won't be as fast as real JIT intrinsic recognition, but they might be faster than the naive implementation currently in use. We'd need to run some performance tests on these to see how they behave.

That said, it seems like the most "useful" methods to speed up would be Matrix multiplication (i.e. Matrix4x4 * Matrix4x4) and Vector4/Matrix4x4 transformation, since they will be the most commonly used. Multiplying a Matrix4x4 by a scalar would probably be easy to speed up like above, but it might not be worth it, as it isn't as commonly used. Matrix multiplication could be easily improved if we could speed up Matrix4x4.Transpose, which we have considered in the past, as multiplication can be implemented via transpose -> row multiplication. Row multiplication can be easily vectorized (even with the current implementation), but transposition will be harder to implement, and probably will require special JIT recognition involving shuffling, etc.
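The transpose -&gt; row-multiplication scheme described above can be sketched with the existing API (`Matrix4x4.Transpose` and `Vector4.Dot` already exist; the `Row` helper and `Multiply` name are assumptions for illustration):

```csharp
using System.Numerics;

static class MatMulSketch
{
    // Hypothetical helper that pulls out row i of a row-major Matrix4x4.
    static Vector4 Row(Matrix4x4 m, int i)
    {
        if (i == 0) return new Vector4(m.M11, m.M12, m.M13, m.M14);
        if (i == 1) return new Vector4(m.M21, m.M22, m.M23, m.M24);
        if (i == 2) return new Vector4(m.M31, m.M32, m.M33, m.M34);
        return new Vector4(m.M41, m.M42, m.M43, m.M44);
    }

    public static Matrix4x4 Multiply(Matrix4x4 a, Matrix4x4 b)
    {
        // Transpose B once, so each element of the product is a dot product
        // of a row of A with a row of B^T (i.e. a column of B). Vector4.Dot
        // is the packed inner loop the JIT could accelerate.
        Matrix4x4 t = Matrix4x4.Transpose(b);
        Vector4 a1 = Row(a, 0), a2 = Row(a, 1), a3 = Row(a, 2), a4 = Row(a, 3);
        Vector4 b1 = Row(t, 0), b2 = Row(t, 1), b3 = Row(t, 2), b4 = Row(t, 3);

        return new Matrix4x4(
            Vector4.Dot(a1, b1), Vector4.Dot(a1, b2), Vector4.Dot(a1, b3), Vector4.Dot(a1, b4),
            Vector4.Dot(a2, b1), Vector4.Dot(a2, b2), Vector4.Dot(a2, b3), Vector4.Dot(a2, b4),
            Vector4.Dot(a3, b1), Vector4.Dot(a3, b2), Vector4.Dot(a3, b3), Vector4.Dot(a3, b4),
            Vector4.Dot(a4, b1), Vector4.Dot(a4, b2), Vector4.Dot(a4, b3), Vector4.Dot(a4, b4));
    }
}
```

How fast this actually is hinges on whether the JIT turns `Vector4.Dot` into an efficient multiply/horizontal-add sequence, which is exactly the kind of intrinsic recognition discussed above.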

@CarolEidt for more input

@vanka78bg
Author

vanka78bg commented Apr 15, 2016

Thank you for the feedback! I understand my suggestion is not currently possible without proper JIT support. For this reason I have logged a similar suggestion in the coreclr repository. For anyone following this, you can find my other suggestion about RyuJIT here: #4356.

@Ziflin

Ziflin commented Aug 17, 2016

It seems like Matrix4x4 should perhaps be reworked to simply contain Vector4 fields instead of individual scalar values. Maybe this has come up before, but the current layout is not great from a performance point of view.
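A minimal sketch of such a Vector4-backed layout (the `Matrix4x4V` name and its members are hypothetical, not the real `Matrix4x4`): element-wise operators then fall out of the already-accelerated Vector4 operators for free.

```csharp
using System.Numerics;

// Hypothetical Vector4-backed matrix layout. Storing rows as Vector4
// fields means element-wise operations are just four Vector4 operations,
// each of which the JIT already maps to packed SIMD instructions.
public struct Matrix4x4V
{
    public Vector4 Row1, Row2, Row3, Row4;

    public static Matrix4x4V operator +(Matrix4x4V a, Matrix4x4V b) => new Matrix4x4V
    {
        Row1 = a.Row1 + b.Row1,
        Row2 = a.Row2 + b.Row2,
        Row3 = a.Row3 + b.Row3,
        Row4 = a.Row4 + b.Row4,
    };
}
```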

@mellinoe
Contributor

mellinoe commented Dec 7, 2016

I'm going to close this because the work involved in this is over in the https://github.com/dotnet/coreclr repo, and the work can just be tracked by issue https://github.com/dotnet/coreclr/issues/4356.

@mellinoe mellinoe closed this as completed Dec 7, 2016
@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the 2.0.0 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Jan 1, 2021
4 participants