Speed up matrix-vector operations #752

CatchemAL · 2017-08-04T19:45:03Z

A number of linear algebra operations in the library would benefit from being pinned and working directly with pointers. Creating this issue so it can be referenced in git commits. I will pick this up and send a PR once complete.

Vector-Matrix

This shows a comparison of Accord's vector.Dot(matrix, result) versus a proposed change pinning the arrays and using pointer arithmetic.

Size is the size of the matrix (so 32 means a 32 x 32 square matrix). Trials is the number of simulations run. The timings are shown next and then the speed multiplier is shown. So the bottom row shows it is 17 times faster to pin the array than Accord's current method.

Size	Trials	Accord (ms)	Proposed (ms)	Multiplier
8	4194304	750	368	x2
16	1048576	722	291	x2.5
32	262144	684	242	x2.8
64	65536	746	242	x3.1
128	16384	782	238	x3.3
256	4096	1049	227	x4.6
512	1024	1095	225	x4.9
1024	256	2936	256	x11.5
2048	64	3781	254	x14.9
4096	16	4373	255	x17.1

Matrix-Vector

This shows a comparison of Accord's matrix.Dot(vector, result) versus a proposed change pinning the arrays and using pointer arithmetic.

Size	Trials	Accord (ms)	Proposed (ms)	Multiplier
8	4194304	712	290	x2.5
16	1048576	707	225	x3.1
32	262144	714	206	x3.5
64	65536	702	199	x3.5
128	16384	696	185	x3.8
256	4096	688	182	x3.8
512	1024	717	179	x4
1024	256	697	205	x3.4
2048	64	697	199	x3.5
4096	16	698	200	x3.5

Thanks,
Alex

The text was updated successfully, but these errors were encountered:

CatchemAL · 2017-08-06T12:47:04Z

Hi @cesarsouza

I have updated the templates for three of the methods - they are named DotNew(...) while this is still work in progress - and I will work on a few more.

Now that I've been going through the Dot(...) code a bit more carefully, I've come across an issue I wasn't anticipating. The Dot(...) templates support a very large range of types for inputs and outputs and it is not always clear to me what the 'right' answer should be.

For instance, this test passes...

[Test]
public void MultiplyMatrixVectorWithFloats()
{
    const int N = 2;
    float[,] a = Matrix.Diagonal(N, N, 2.5f);
    float[] b = { 2, 2 };

    int[] result = new int[N];
    a.Dot(b, result);

    int[] expected = { 5, 5 };

    Assert.IsTrue(result.IsEqual(expected));  // *** Passes (5, 5) ***
}

...but this test fails (which is a very surprising inconsistency).

[Test]
public void MultiplyMatrixVectorWithDecimals()
{
    const int N = 2;
    decimal[,] a = Matrix.Diagonal(N, N, 2.5m);
    decimal[] b = { 2, 2 };

    int[] result = new int[N];
    a.Dot(b, result);

    int[] expected = { 5, 5 };

    Assert.IsTrue(result.IsEqual(expected)); // *** Fails (4, 4) ***
}

This is happening because in one method, the numbers are cast to int and then summed whereas in the other, they are summed and then cast to int. It sounds silly (because I have picked a contrived example), but I am actually not sure what the 'right' answer should really be! On the one hand, it sounds good to maintain high precision until the last possible moment. On the other hand, people often choose float over double because they are happy to trade precision for speed and casting every intermediate calc to a high-precision double seems to defeat the point. I have not come across SAXPY-type operations before that cast up to double in the intermediate calculations but all the float operations in Accord do.

For now, I have tried to leave all the optimisations 100% consistent with the framework but it is simply not possible to do this for vector.Dot(matrix, result) as I have changed the ordering to a more cache-friendly approach (see https://github.com/AlexJCross/framework/commit/fd24ecabdc9dcd51c490c52f951932a67831c3b0). I will need to think about that one some more before I send a PR. One approach would be to put some logic in the template depending on the result's type.

If I could offer my two cents, I think Accord could possibly reduce the number of Dot(...) overloads without degrading the experience. Many algorithms really benefit from fast matrix operations and it can be quite tricky to manage with so many permutations.

I'd be really interested to get your thoughts on this.

Regards,
Alex

cesarsouza · 2017-08-06T14:27:04Z

Hi Alex,

Thanks a lot for such an amazing work!

Please let me comment on a few of the issues you have raised:

On the other hand, people often choose float over double because they are happy to trade precision for speed

Actually, people choose float over double to trade precision for memory, not for speed. The speed of doubles and floats are the often the same, if not faster for double than floats (see this for some quick explanation). Anyways, Intel CPUs (I guess this is a general rule for x86, but I can't remember right now) always perform floating operations using 80-bits for either doubles and floats, so AFAIK, no conversion will take place until the values have to be actually registered into main memory, which should happen only at the end of the computation anyways.

I will need to think about that one some more before I send a PR. One approach would be to put some logic in the template depending on the result's type.

Regarding those other issues, I have to say that the current priority in the framework is to have operations that are as accurate and fast for doubles and floats. The implementations for the others (such as decimals) can be, for the moment, regarded more as convenience methods (that may sometimes not be as optimized, or precise). If you have managed to optimize the execution cases for doubles and floats but found some issues trying to make the methods as correct as possible for decimals, I would say, please submit a pull request anyways and at least document that the current behavior might not be as accurate for some of the other data types.

I also think that the issue of converting beforehand to decimals, ints, uint, etc, can be solved with a bit more careful programming in the T4 templates, as you have mentioned. However, this is something that could be totally resolved later (and registered as a separate issue here in the issue tracker).

Again, thanks a lot for the excellent effort into developing those optimizations!

Regards,
Cesar

CatchemAL · 2017-08-06T15:58:37Z

Hi @cesarsouza ,

Thanks very much for getting back to me on this.

Actually, people choose float over double to trade precision for memory, not for speed. The speed of doubles and floats are the often the same, if not faster for double than floats (see this for some quick explanation).

Wow. I never knew that 👍 . Ok I will continue as planned. When I'm ready to submit, I'll make it very clear if any of the methods are impacted by casting.

Regards,
Alex

CatchemAL · 2017-08-15T22:39:32Z

I am very close to completing this and will submit a PR soon. For now, I need to finish documenting and testing the changes. This comment is WIP - I will update it once finished so please ignore for now.

Here is a breakdown of the changes I have made.

A note on precision: No method has been implemented such that the precision has been degraded by early casting. It is possible to speed up a number of methods if we allow that to happen, but I will submit that code in a separate commit. However, it is the case that some of the numbers will change ever so slightly as a result of floating point arithmetic; the order in which numbers are being summed has changed in some instances and, in general, (a + b) + c does not equal (a + c) + b when floating point numbers are involved.

Cross-Product

Signature: double[] Cross(this double[] a, double[] b, double[] result)
Unit test: Accord.Tests.Math.MatrixTest.CrossProductTest()
Numerical impact: None
Comments: Addressing issue GH-755. Unit test confirms correctness of cross-product and ensures that result vector is assigned correctly.

Matrix-Matrix

Signature: double[][] Dot(this double[,] a, double[][] b, double[][] result)
Unit test: Accord.Tests.Math.MatrixTest.MultiplyTwoMatrices3()
Performance impact:

Size	Trials	Accord (ms)	Proposed (ms)	Multiplier
8	8388608	14062	6577	x2.1
16	1048576	12768	5324	x2.4
32	131072	11840	4677	x2.5
64	16384	11809	4988	x2.4
128	2048	11462	4762	x2.4
256	256	11455	4637	x2.5
512	32	11365	4632	x2.5
1024	4	12315	5024	x2.5

Signature: double[,] TransposeAndDot(this double[,] a, double[,] b, double[,] result)
Unit test 1: Accord.Tests.Math.MatrixTest.TransposeAndMultiplyTest()
Unit test 2: Accord.Tests.Math.MatrixTest.TransposeAndDotBatchTest()
Performance impact:

Size	Trials	Accord (ms)	Proposed (ms)	Multiplier
8	8388608	14733	7469	x2
16	1048576	13332	5736	x2.3
32	131072	12424	4742	x2.6
64	16384	13087	4861	x2.7
128	2048	13490	4544	x3
256	256	17989	4473	x4
512	32	18799	4698	x4
1024	4	51634	4714	x11

Signature: double[][] DotWithTransposed(this double[][] a, double[][] b, double[][] result)
Unit test 1: Accord.Tests.Math.MatrixTest.DotWithTransposeTest_Jagged()
Unit test 2: Accord.Tests.Math.MatrixTest.DotWithTransposeBatchTest_Jagged()
Performance impact:

Size	Trials	Accord (ms)	Proposed (ms)	Multiplier
8	8388608	9222	4388	x2.1
16	1048576	6203	3868	x1.6
32	131072	5795	3815	x1.5
64	16384	5175	4200	x1.2
128	2048	4850	4297	x1.1
256	256	4615	4109	x1.1
512	32	4803	4347	x1.1
1024	4	5050	4485	x1.1

Signature: double[][] DotWithTransposed(this double[][] a, double[,] b, double[][] result)
Unit test 1: Accord.Tests.Math.MatrixTest.DotWithTransposeTest_Jagged1()
Unit test 2: Accord.Tests.Math.MatrixTest.DotWithTransposeBatchTest_Jagged1()
Performance impact:

Size	Trials	Accord (ms)	Proposed (ms)	Multiplier
8	8388608	11248	4348	x2.6
16	1048576	7600	4134	x1.8
32	131072	6239	4006	x1.6
64	16384	5821	3954	x1.5
128	2048	5242	4077	x1.3
256	256	5066	4121	x1.2
512	32	4973	4117	x1.2
1024	4	5308	4310	x1.2

Signature: double[][] DotWithTransposed(this double[,] a, double[][] b, double[][] result)
Unit test 1: Accord.Tests.Math.MatrixTest.DotWithTransposeTest_Jagged2()
Unit test 2: Accord.Tests.Math.MatrixTest.DotWithTransposeBatchTest_Jagged2()
Performance impact:

Size	Trials	Accord (ms)	Proposed (ms)	Multiplier
8	8388608	15405	4317	x3.6
16	1048576	12709	3877	x3.3
32	131072	11675	4320	x2.7
64	16384	11839	4186	x2.8
128	2048	11389	4162	x2.7
256	256	11339	4293	x2.6
512	32	11458	4247	x2.7
1024	4	11700	4600	x2.5

Matrix-Vector

Signature: double[] Dot(this double[,] matrix, double[] columnVector, double[] result)
Unit test 1: Accord.Tests.Math.MatrixTest.MultiplyMatrixVectorTest()
Unit test 2: Accord.Tests.Math.MatrixTest.MultiplyMatrixVectorBatchTest()
Performance impact:

Size	Trials	Accord (ms)	Proposed (ms)	Multiplier
8	8388608	1437	821	x1.8
16	2097152	1393	630	x2.2
32	524288	1399	560	x2.5
64	131072	1436	564	x2.5
128	32768	1427	554	x2.6
256	8192	1418	543	x2.6
512	2048	1405	573	x2.5
1024	512	1396	575	x2.4

Vector-Matrix-Vector

Signature: double DotAndDot(this double[] rowVector, double[,] matrix, double[] columnVector)
Unit test: Accord.Tests.Math.MatrixTest.DotAndDotTest()
Performance impact:

Size	Trials	Accord (ms)	Proposed (ms)	Multiplier
8	8388608	1917	679	x2.8
16	2097152	1663	491	x3.4
32	524288	1570	461	x3.4
64	131072	1605	439	x3.7
128	32768	1774	436	x4.1
256	8192	2388	442	x5.4
512	2048	2806	467	x6
1024	512	8393	497	x16.9

Outer

Signature: double[,] Outer(this double[] a, double[] b, double[,] result)
Unit test 1: Accord.Tests.Math.MatrixTest.OuterProductTest()
Unit test 2: Accord.Tests.Math.MatrixTest.OuterProductTestDifferentOverloads()
Performance impact:

Size	Trials	Accord (ms)	Proposed (ms)	Multiplier
8	8388608	1227	621	x2
16	2097152	1414	606	x2.3
32	524288	1302	571	x2.3
64	131072	1237	510	x2.4
128	32768	1251	430	x2.9
256	8192	1188	386	x3.1
512	2048	1231	407	x3
1024	512	1179	388	x3

Kronecker (Vectors)

Signature: double[] Kronecker(this double[] a, double[] b, double[] result)
Unit test 1: Accord.Tests.Math.MatrixTest.KroneckerVectorTest()
Unit test 2: Accord.Tests.Math.MatrixTest.KroneckerVectorBatchTest()
Performance impact:

Size	Trials	Accord (ms)	Proposed (ms)	Multiplier
8	8388608	1179	649	x1.8
16	2097152	995	580	x1.7
32	524288	1082	593	x1.8
64	131072	995	600	x1.7
128	32768	1000	697	x1.4
256	8192	1047	737	x1.4
512	2048	1051	735	x1.4
1024	512	1006	753	x1.3

Kronecker (Matrices)

Signature1: double[][] Kronecker(this double[,] a, double[][] b, double[][] result)
Signature2: double[][] Kronecker(this double[][] a, double[,] b, double[][] result)
Signature3: double[][] Kronecker(this double[][] a, double[][] b, double[][] result)
Unit test 1: Accord.Tests.Math.MatrixTest.KroneckerTest()
Unit test 2: Accord.Tests.Math.MatrixTest.KroneckerBatchTest()
Comments: These methods threw NotImplementedException and have now been implemented and tested.
Performance impact: N/A

CatchemAL · 2017-08-16T06:49:19Z

Hi @cesarsouza

Can I please ask what the expected output of the following code should be?

var EightBySeven = Matrix.Random(8, 7);
var NineBySeven  = Matrix.Random(9, 7);
var EightByOne   = Matrix.Zeros(8, 1);

EightBySeven.DotWithTransposed(NineBySeven, EightByOne); // Dimension mismatch ?

Currently it works. Is this a bug or a feature? If it's a bug, then there are a few more bugs in the code. If it's a feature, I will go through my changes and make sure I haven't changed this behaviour. I believe this is the only thing left for me to do :)

Thanks very much,
Alex

cesarsouza · 2017-08-16T18:08:11Z

Hi @AlexJCross,

This might have been an undocumented feature, at least some time ago :-) But in reality, the method should have been throwing an exception in this case, so please feel free to change it so it really throws a DimensionMismatchException. If this change causes some unit tests in the .Statistics namespace to start failing, please submit your PR nonetheless and I will fix them later when I have more time (unless you would like to give it a try fixing them as well!)

Again, many thanks for the excellent work, this PR is going to look amazing!

Regards,
Cesar

GH-752: Speed up matrix-vector operations

CatchemAL mentioned this issue Aug 10, 2017

Bug in cross product calculation #755

Closed

CatchemAL mentioned this issue Aug 17, 2017

GH-752: Speed up matrix-vector operations #781

Merged

cesarsouza closed this as completed in #781 Aug 17, 2017

cesarsouza added a commit that referenced this issue Aug 17, 2017

Merge pull request #781 from AlexJCross/GH-752

94c5ef0

GH-752: Speed up matrix-vector operations

cesarsouza added pending-release and removed pending-release labels Aug 17, 2017

CatchemAL mentioned this issue Aug 23, 2017

Unsafe code exposed in the public API #805

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Speed up matrix-vector operations #752

Speed up matrix-vector operations #752

CatchemAL commented Aug 4, 2017

CatchemAL commented Aug 6, 2017

cesarsouza commented Aug 6, 2017 •

edited

Loading

CatchemAL commented Aug 6, 2017

CatchemAL commented Aug 15, 2017 •

edited

Loading

CatchemAL commented Aug 16, 2017

cesarsouza commented Aug 16, 2017

Speed up matrix-vector operations #752

Speed up matrix-vector operations #752

Comments

CatchemAL commented Aug 4, 2017

Vector-Matrix

Matrix-Vector

CatchemAL commented Aug 6, 2017

cesarsouza commented Aug 6, 2017 • edited Loading

CatchemAL commented Aug 6, 2017

CatchemAL commented Aug 15, 2017 • edited Loading

Cross-Product

Matrix-Matrix

Matrix-Vector

Vector-Matrix-Vector

Outer

Kronecker (Vectors)

Kronecker (Matrices)

CatchemAL commented Aug 16, 2017

cesarsouza commented Aug 16, 2017

cesarsouza commented Aug 6, 2017 •

edited

Loading

CatchemAL commented Aug 15, 2017 •

edited

Loading