Skip to content
This repository has been archived by the owner on Nov 19, 2020. It is now read-only.

Speed up matrix-vector operations #752

Closed
CatchemAL opened this issue Aug 4, 2017 · 6 comments · Fixed by #781
Closed

Speed up matrix-vector operations #752

CatchemAL opened this issue Aug 4, 2017 · 6 comments · Fixed by #781

Comments

@CatchemAL
Copy link
Collaborator

A number of linear algebra operations in the library would benefit from being pinned and working directly with pointers. Creating this issue so it can be referenced in git commits. I will pick this up and send a PR once complete.

Vector-Matrix

This shows a comparison of Accord's vector.Dot(matrix, result) versus a proposed change pinning the arrays and using pointer arithmetic.

Size is the size of the matrix (so 32 means a 32 x 32 square matrix). Trials is the number of simulations run. The timings are shown next and then the speed multiplier is shown. So the bottom row shows it is 17 times faster to pin the array than Accord's current method.

Size Trials Accord (ms) Proposed (ms) Multiplier
8 4194304 750 368 x2
16 1048576 722 291 x2.5
32 262144 684 242 x2.8
64 65536 746 242 x3.1
128 16384 782 238 x3.3
256 4096 1049 227 x4.6
512 1024 1095 225 x4.9
1024 256 2936 256 x11.5
2048 64 3781 254 x14.9
4096 16 4373 255 x17.1

Matrix-Vector

This shows a comparison of Accord's matrix.Dot(vector, result) versus a proposed change pinning the arrays and using pointer arithmetic.

Size Trials Accord (ms) Proposed (ms) Multiplier
8 4194304 712 290 x2.5
16 1048576 707 225 x3.1
32 262144 714 206 x3.5
64 65536 702 199 x3.5
128 16384 696 185 x3.8
256 4096 688 182 x3.8
512 1024 717 179 x4
1024 256 697 205 x3.4
2048 64 697 199 x3.5
4096 16 698 200 x3.5

Thanks,
Alex

@CatchemAL
Copy link
Collaborator Author

Hi @cesarsouza

I have updated the templates for three of the methods - they are named DotNew(...) while this is still work in progress - and I will work on a few more.

Now that I've been going through the Dot(...) code a bit more carefully, I've come across an issue I wasn't anticipating. The Dot(...) templates support a very large range of types for inputs and outputs and it is not always clear to me what the 'right' answer should be.

For instance, this test passes...

[Test]
public void MultiplyMatrixVectorWithFloats()
{
    const int N = 2;
    float[,] a = Matrix.Diagonal(N, N, 2.5f);
    float[] b = { 2, 2 };

    int[] result = new int[N];
    a.Dot(b, result);

    int[] expected = { 5, 5 };

    Assert.IsTrue(result.IsEqual(expected));  // *** Passes (5, 5) ***
}

...but this test fails (which is a very surprising inconsistency).

[Test]
public void MultiplyMatrixVectorWithDecimals()
{
    const int N = 2;
    decimal[,] a = Matrix.Diagonal(N, N, 2.5m);
    decimal[] b = { 2, 2 };

    int[] result = new int[N];
    a.Dot(b, result);

    int[] expected = { 5, 5 };

    Assert.IsTrue(result.IsEqual(expected)); // *** Fails (4, 4) ***
}

This is happening because in one method, the numbers are cast to int and then summed whereas in the other, they are summed and then cast to int. It sounds silly (because I have picked a contrived example), but I am actually not sure what the 'right' answer should really be! On the one hand, it sounds good to maintain high precision until the last possible moment. On the other hand, people often choose float over double because they are happy to trade precision for speed and casting every intermediate calc to a high-precision double seems to defeat the point. I have not come across SAXPY-type operations before that cast up to double in the intermediate calculations but all the float operations in Accord do.

For now, I have tried to leave all the optimisations 100% consistent with the framework but it is simply not possible to do this for vector.Dot(matrix, result) as I have changed the ordering to a more cache-friendly approach (see https://github.com/AlexJCross/framework/commit/fd24ecabdc9dcd51c490c52f951932a67831c3b0). I will need to think about that one some more before I send a PR. One approach would be to put some logic in the template depending on the result's type.

If I could offer my two cents, I think Accord could possibly reduce the number of Dot(...) overloads without degrading the experience. Many algorithms really benefit from fast matrix operations and it can be quite tricky to manage with so many permutations.

I'd be really interested to get your thoughts on this.

Regards,
Alex

@cesarsouza
Copy link
Member

cesarsouza commented Aug 6, 2017

Hi Alex,

Thanks a lot for such an amazing work!

Please let me comment on a few of the issues you have raised:

On the other hand, people often choose float over double because they are happy to trade precision for speed

Actually, people choose float over double to trade precision for memory, not for speed. The speed of doubles and floats are the often the same, if not faster for double than floats (see this for some quick explanation). Anyways, Intel CPUs (I guess this is a general rule for x86, but I can't remember right now) always perform floating operations using 80-bits for either doubles and floats, so AFAIK, no conversion will take place until the values have to be actually registered into main memory, which should happen only at the end of the computation anyways.

I will need to think about that one some more before I send a PR. One approach would be to put some logic in the template depending on the result's type.

Regarding those other issues, I have to say that the current priority in the framework is to have operations that are as accurate and fast for doubles and floats. The implementations for the others (such as decimals) can be, for the moment, regarded more as convenience methods (that may sometimes not be as optimized, or precise). If you have managed to optimize the execution cases for doubles and floats but found some issues trying to make the methods as correct as possible for decimals, I would say, please submit a pull request anyways and at least document that the current behavior might not be as accurate for some of the other data types.

I also think that the issue of converting beforehand to decimals, ints, uint, etc, can be solved with a bit more careful programming in the T4 templates, as you have mentioned. However, this is something that could be totally resolved later (and registered as a separate issue here in the issue tracker).

Again, thanks a lot for the excellent effort into developing those optimizations!

Regards,
Cesar

@CatchemAL
Copy link
Collaborator Author

Hi @cesarsouza ,

Thanks very much for getting back to me on this.

Actually, people choose float over double to trade precision for memory, not for speed. The speed of doubles and floats are the often the same, if not faster for double than floats (see this for some quick explanation).

Wow. I never knew that 👍 . Ok I will continue as planned. When I'm ready to submit, I'll make it very clear if any of the methods are impacted by casting.

Regards,
Alex

@CatchemAL
Copy link
Collaborator Author

CatchemAL commented Aug 15, 2017

I am very close to completing this and will submit a PR soon. For now, I need to finish documenting and testing the changes. This comment is WIP - I will update it once finished so please ignore for now.

Here is a breakdown of the changes I have made.

A note on precision: No method has been implemented such that the precision has been degraded by early casting. It is possible to speed up a number of methods if we allow that to happen, but I will submit that code in a separate commit. However, it is the case that some of the numbers will change ever so slightly as a result of floating point arithmetic; the order in which numbers are being summed has changed in some instances and, in general, (a + b) + c does not equal (a + c) + b when floating point numbers are involved.

Cross-Product

Signature: double[] Cross(this double[] a, double[] b, double[] result)
Unit test: Accord.Tests.Math.MatrixTest.CrossProductTest()
Numerical impact: None
Comments: Addressing issue GH-755. Unit test confirms correctness of cross-product and ensures that result vector is assigned correctly.

Matrix-Matrix

Signature: double[][] Dot(this double[,] a, double[][] b, double[][] result)
Unit test: Accord.Tests.Math.MatrixTest.MultiplyTwoMatrices3()
Performance impact:

Size Trials Accord (ms) Proposed (ms) Multiplier
8 8388608 14062 6577 x2.1
16 1048576 12768 5324 x2.4
32 131072 11840 4677 x2.5
64 16384 11809 4988 x2.4
128 2048 11462 4762 x2.4
256 256 11455 4637 x2.5
512 32 11365 4632 x2.5
1024 4 12315 5024 x2.5

Signature: double[,] TransposeAndDot(this double[,] a, double[,] b, double[,] result)
Unit test 1: Accord.Tests.Math.MatrixTest.TransposeAndMultiplyTest()
Unit test 2: Accord.Tests.Math.MatrixTest.TransposeAndDotBatchTest()
Performance impact:

Size Trials Accord (ms) Proposed (ms) Multiplier
8 8388608 14733 7469 x2
16 1048576 13332 5736 x2.3
32 131072 12424 4742 x2.6
64 16384 13087 4861 x2.7
128 2048 13490 4544 x3
256 256 17989 4473 x4
512 32 18799 4698 x4
1024 4 51634 4714 x11

Signature: double[][] DotWithTransposed(this double[][] a, double[][] b, double[][] result)
Unit test 1: Accord.Tests.Math.MatrixTest.DotWithTransposeTest_Jagged()
Unit test 2: Accord.Tests.Math.MatrixTest.DotWithTransposeBatchTest_Jagged()
Performance impact:

Size Trials Accord (ms) Proposed (ms) Multiplier
8 8388608 9222 4388 x2.1
16 1048576 6203 3868 x1.6
32 131072 5795 3815 x1.5
64 16384 5175 4200 x1.2
128 2048 4850 4297 x1.1
256 256 4615 4109 x1.1
512 32 4803 4347 x1.1
1024 4 5050 4485 x1.1

Signature: double[][] DotWithTransposed(this double[][] a, double[,] b, double[][] result)
Unit test 1: Accord.Tests.Math.MatrixTest.DotWithTransposeTest_Jagged1()
Unit test 2: Accord.Tests.Math.MatrixTest.DotWithTransposeBatchTest_Jagged1()
Performance impact:

Size Trials Accord (ms) Proposed (ms) Multiplier
8 8388608 11248 4348 x2.6
16 1048576 7600 4134 x1.8
32 131072 6239 4006 x1.6
64 16384 5821 3954 x1.5
128 2048 5242 4077 x1.3
256 256 5066 4121 x1.2
512 32 4973 4117 x1.2
1024 4 5308 4310 x1.2

Signature: double[][] DotWithTransposed(this double[,] a, double[][] b, double[][] result)
Unit test 1: Accord.Tests.Math.MatrixTest.DotWithTransposeTest_Jagged2()
Unit test 2: Accord.Tests.Math.MatrixTest.DotWithTransposeBatchTest_Jagged2()
Performance impact:

Size Trials Accord (ms) Proposed (ms) Multiplier
8 8388608 15405 4317 x3.6
16 1048576 12709 3877 x3.3
32 131072 11675 4320 x2.7
64 16384 11839 4186 x2.8
128 2048 11389 4162 x2.7
256 256 11339 4293 x2.6
512 32 11458 4247 x2.7
1024 4 11700 4600 x2.5

Matrix-Vector

Signature: double[] Dot(this double[,] matrix, double[] columnVector, double[] result)
Unit test 1: Accord.Tests.Math.MatrixTest.MultiplyMatrixVectorTest()
Unit test 2: Accord.Tests.Math.MatrixTest.MultiplyMatrixVectorBatchTest()
Performance impact:

Size Trials Accord (ms) Proposed (ms) Multiplier
8 8388608 1437 821 x1.8
16 2097152 1393 630 x2.2
32 524288 1399 560 x2.5
64 131072 1436 564 x2.5
128 32768 1427 554 x2.6
256 8192 1418 543 x2.6
512 2048 1405 573 x2.5
1024 512 1396 575 x2.4

Vector-Matrix-Vector

Signature: double DotAndDot(this double[] rowVector, double[,] matrix, double[] columnVector)
Unit test: Accord.Tests.Math.MatrixTest.DotAndDotTest()
Performance impact:

Size Trials Accord (ms) Proposed (ms) Multiplier
8 8388608 1917 679 x2.8
16 2097152 1663 491 x3.4
32 524288 1570 461 x3.4
64 131072 1605 439 x3.7
128 32768 1774 436 x4.1
256 8192 2388 442 x5.4
512 2048 2806 467 x6
1024 512 8393 497 x16.9

Outer

Signature: double[,] Outer(this double[] a, double[] b, double[,] result)
Unit test 1: Accord.Tests.Math.MatrixTest.OuterProductTest()
Unit test 2: Accord.Tests.Math.MatrixTest.OuterProductTestDifferentOverloads()
Performance impact:

Size Trials Accord (ms) Proposed (ms) Multiplier
8 8388608 1227 621 x2
16 2097152 1414 606 x2.3
32 524288 1302 571 x2.3
64 131072 1237 510 x2.4
128 32768 1251 430 x2.9
256 8192 1188 386 x3.1
512 2048 1231 407 x3
1024 512 1179 388 x3

Kronecker (Vectors)

Signature: double[] Kronecker(this double[] a, double[] b, double[] result)
Unit test 1: Accord.Tests.Math.MatrixTest.KroneckerVectorTest()
Unit test 2: Accord.Tests.Math.MatrixTest.KroneckerVectorBatchTest()
Performance impact:

Size Trials Accord (ms) Proposed (ms) Multiplier
8 8388608 1179 649 x1.8
16 2097152 995 580 x1.7
32 524288 1082 593 x1.8
64 131072 995 600 x1.7
128 32768 1000 697 x1.4
256 8192 1047 737 x1.4
512 2048 1051 735 x1.4
1024 512 1006 753 x1.3

Kronecker (Matrices)

Signature1: double[][] Kronecker(this double[,] a, double[][] b, double[][] result)
Signature2: double[][] Kronecker(this double[][] a, double[,] b, double[][] result)
Signature3: double[][] Kronecker(this double[][] a, double[][] b, double[][] result)
Unit test 1: Accord.Tests.Math.MatrixTest.KroneckerTest()
Unit test 2: Accord.Tests.Math.MatrixTest.KroneckerBatchTest()
Comments: These methods threw NotImplementedException and have now been implemented and tested.
Performance impact: N/A

@CatchemAL
Copy link
Collaborator Author

Hi @cesarsouza

Can I please ask what the expected output of the following code should be?

var EightBySeven = Matrix.Random(8, 7);
var NineBySeven  = Matrix.Random(9, 7);
var EightByOne   = Matrix.Zeros(8, 1);

EightBySeven.DotWithTransposed(NineBySeven, EightByOne); // Dimension mismatch ?

Currently it works. Is this a bug or a feature? If it's a bug, then there are a few more bugs in the code. If it's a feature, I will go through my changes and make sure I haven't changed this behaviour. I believe this is the only thing left for me to do :)

Thanks very much,
Alex

@cesarsouza
Copy link
Member

Hi @AlexJCross,

This might have been an undocumented feature, at least some time ago :-) But in reality, the method should have been throwing an exception in this case, so please feel free to change it so it really throws a DimensionMismatchException. If this change causes some unit tests in the .Statistics namespace to start failing, please submit your PR nonetheless and I will fix them later when I have more time (unless you would like to give it a try fixing them as well!)

Again, many thanks for the excellent work, this PR is going to look amazing!

Regards,
Cesar

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants