sparse multiplication is slow #942
I am having trouble replicating this.
I have run make with the latest commit. Any ideas what I might be doing wrong?
You need to do -viral
SparseAccumulator overhead is most likely quite high. The code should be reimplemented along the lines of this CXSparse implementation: http://www.cise.ufl.edu/research/sparse/CXSparse/CXSparse/Source/cs_multiply.c
I am no expert in this area, but I posted Julia code for the naive matrix multiplication described in Gustavson 1978. Maybe it could serve as a useful starting point for someone more versed in making Julia code fast. As it stands, the method is a bit slower than Julia's current implementation; see the example at the bottom of the gist. The code you mention seems to be LGPL'd, so any derivative work would also have to be LGPL'd, correct? The same source code is also printed in his 2006 book, though. The approach appears very similar to the Gustavson method, though I didn't check this thoroughly.
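For readers following along, here is a minimal sketch of the Gustavson/CSparse column-wise scheme being discussed, written in present-day Julia syntax. The function name and workspace layout are illustrative; this is not the gist's actual code.

```julia
using SparseArrays

# C = A*B via Gustavson's method: a dense accumulator x for the current
# column of C, plus a column-stamp array `mark` to track which rows are live.
function spmul_gustavson(A::SparseMatrixCSC{Tv,Ti}, B::SparseMatrixCSC{Tv,Ti}) where {Tv,Ti}
    m, n = size(A, 1), size(B, 2)
    size(A, 2) == size(B, 1) || throw(DimensionMismatch())
    x = zeros(Tv, m)               # accumulates the current column of C
    mark = zeros(Int, m)           # mark[i] == j means row i is live in column j
    colptr = Vector{Ti}(undef, n + 1)
    rowval = Ti[]
    nzval = Tv[]
    colptr[1] = 1
    for j in 1:n                              # outer loop over columns of B
        for bp in nzrange(B, j)               # each nonzero B[k,j] ...
            k = rowvals(B)[bp]
            bkj = nonzeros(B)[bp]             # hoisted: loaded once per inner loop
            for ap in nzrange(A, k)           # ... scales column k of A
                i = rowvals(A)[ap]
                if mark[i] != j
                    mark[i] = j
                    push!(rowval, i)
                    x[i] = nonzeros(A)[ap] * bkj
                else
                    x[i] += nonzeros(A)[ap] * bkj
                end
            end
        end
        colptr[j+1] = length(rowval) + 1
        for p in colptr[j]:(colptr[j+1] - 1)  # gather this column's values
            push!(nzval, x[rowval[p]])
        end
    end
    # Note: row indices within each column come out unsorted here;
    # sorting them (e.g. via a double transpose, see later in this
    # thread) is extra work on top of the multiply itself.
    return SparseMatrixCSC(m, n, colptr, rowval, nzval)
end
```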
This is essentially what CSparse also does. I will try it out.
On lines 32 and 34, you are recomputing VA[jp] every time. The compiler is not yet smart enough to pull it out of the loop. If you do so, you get the same speed as the existing Julia code. I would have expected this to be faster, since it does less work and is simpler code, but there seems to be little difference.
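A sketch of the hoist being suggested (the gist itself is not reproduced here, so these variable names are made up):

```julia
# Before: the loop-invariant element v[jp] is reloaded from memory on
# every iteration of the inner loop.
function scatter_col!(x, rows, vals, v, jp)
    for p in eachindex(rows, vals)
        x[rows[p]] += v[jp] * vals[p]     # v[jp] reloaded each time
    end
end

# After: hoist the invariant load into a local variable.
function scatter_col_hoisted!(x, rows, vals, v, jp)
    vjp = v[jp]                           # loaded once, outside the loop
    for p in eachindex(rows, vals)
        x[rows[p]] += vjp * vals[p]
    end
end
```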
@doobwa Could you license this under the MIT license? With a bit of cleanup, this can replace the existing code, even though the speed is the same.
@JeffBezanson I believe this is the same algorithm that Matlab implements. We are off by a factor of 2 compared to a C implementation. Any ideas what we can do to match performance here, or should we call a library function from CXSparse to get good performance?
Also, I can get a 30% speedup here if I enable all the optimization passes in codegen.cpp. It helps the code above, but not the matrix multiply that is currently implemented.
@doobwa Your implementation reads like it is for row-wise storage. I would have expected the outer loop to go over columns of B, and the inner loops over columns of A. I have corrected that in my implementation above. Performance is the same for the test cases presented in this issue, since the access patterns and work do not change.
@ViralBShah I have updated the gist with your suggested changes and the MIT license. My comments in the original version were not correct (when I said row I meant column!). The order of those loops doesn't really matter, so I have kept them as they are in the original pseudocode. I also added a small bug fix regarding the rowval and nzval values returned (which also occurs in the current sparse matrix multiplication -- I can submit a pull request if you are interested). With your changes I am getting some marginal speedups over the current implementation.
Yes, please do issue a pull request. -viral
That's right, the changes only give marginal speedups. If you comment out all the other passes in codegen.cpp, you get more aggressive optimization and some noticeable speedups. We probably need more compiler work for this code to perform better, but we will have to wait for @JeffBezanson to chime in. -viral
In some cases (that are somewhat reasonable for my particular application) I am getting sizeable speedups compared to the old version, which is encouraging, though I don't have a C or Matlab implementation to compare against.
@doobwa Did you try non-square matrices?
We need the ability to grow the allocated region.
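A sketch of what "growing the allocated region" means here (names are illustrative): start from a guess for nnz(C), double the output buffers whenever they fill up, and trim them upon completion.

```julia
# Geometric growth gives amortized O(1) appends even when the initial
# nnz estimate is too small.
function ensure_capacity!(rowval::Vector{Int}, nzval::Vector{Float64}, needed::Int)
    if needed > length(rowval)
        newlen = max(needed, 2 * length(rowval))
        resize!(rowval, newlen)
        resize!(nzval, newlen)
    end
    return nothing
end
```

After the multiply finishes, a final `resize!` down to the actual nonzero count trims the excess, matching the "trimming upon completion" mentioned below.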
I added the ability to grow, and merged your patches for del manually. Even after doing so, I was getting errors with the code in the gist. With my version of the switched loops, it works fine. I am checking in my version of the column-wise iteration into master for now. Can you check it out?
and https://gist.github.com/2993335, with a few modifications to handle growing, trimming upon completion, and loop ordering, which seems to make the tests pass. This code gives a 30% speedup if all LLVM optimizations in codegen.cpp are turned on. @doobwa, thanks for the simpler implementation. Sorry about being unable to merge your pull request due to GitHub acting up.
The runtime is now down to 4.9 seconds, from 6 seconds earlier. Disabling emit_bounds_check and enabling all the LLVM passes gets it further down to 3.9 seconds. However, it is still about two times slower than the C implementation that is in Matlab.
This is now taking 6.3 seconds again. Seems like we have regressed a bit on performance.
We really need to build up a large benchmark suite and test performance with each new build on all of the tests.
That is exactly what I have been thinking. We need a new issue for continuous performance benchmarking. As has been discussed elsewhere, we actually do have some machines at MIT that can do this reliably. Even if not on every build, with a cron job, it should be easy to do this every day. |
Looking through the code, I realized that the work has increased here. There is now a double transpose at the end to ensure that all row indices are in sorted order, and we now use 64-bit indices by default for sparse matrices on x86_64.
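A sketch of the double-transpose trick, for readers unfamiliar with it: a sparse transpose emits row indices in sorted order within each column, so transposing twice yields a sorted copy of C, at the cost of two extra passes over the data. (In present-day Julia, `transpose` is lazy, so `copy` is needed to force each materialization; the helper name here is illustrative.)

```julia
using SparseArrays

# Returns a copy of C whose row indices are sorted within each column.
sorted_cols(C::SparseMatrixCSC) = copy(transpose(copy(transpose(C))))
```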
happens to make sparse mul faster; #942
@ViralBShah see if that helps |
Now again at 3.3 seconds. Still slower than Matlab, but I am hoping that the @inbounds macro will help close the gap.
The @inbounds macro takes this to 2.7 seconds, and I have some potential code improvements in mind to try out that may completely close this gap.
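A sketch of where @inbounds applies (illustrative function, not the actual kernel): annotating the hot inner loop removes bounds checks on the array accesses, which is only safe when the indices are known to be in range, as they are when they come from a well-formed CSC structure.

```julia
# Accumulate v * vals into x at positions rows, skipping bounds checks.
function accumulate_col!(x::Vector{Float64}, rows::Vector{Int},
                         vals::Vector{Float64}, v::Float64)
    @inbounds for p in eachindex(rows, vals)
        x[rows[p]] += v * vals[p]
    end
    return x
end
```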
Gives 20% improvement on the perf test in #942
When the issue of the pointer being reloaded at each iteration is addressed (as discussed in #3268), the performance of this will improve further.
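A sketch of the manual workaround for that reload issue (illustrative function): pull the field loads out of the struct into locals before the loop, so the underlying array pointers do not have to be reloaded on each iteration.

```julia
using SparseArrays

# Sum the stored entries of column j of A.
function colsum(A::SparseMatrixCSC{Float64,Int}, j::Int)
    nz = nonzeros(A)   # hoisted once, rather than touching the field inside the loop
    s = 0.0
    @inbounds for p in nzrange(A, j)
        s += nz[p]
    end
    return s
end
```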
@JeffBezanson This has slowed down again over the last couple of weeks and is now taking almost 4.0 seconds.
I also observed something strange: using |
For a minute I thought this was caused by my recent change. First, I changed the function to see the types, to split components up onto separate lines, and to make sure there weren't any order effects (i.e., profiler oddities, which I'm always paranoid about but which don't seem to materialize):
Now:
The transpose! function expects some work to have been done in setting up colptr, which can be expensive and without which the result is likely to be incorrect. This is not documented, and it should perhaps have a different name.
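A sketch of the kind of colptr setup being referred to (a standard counting preamble; the actual transpose! signature of the time is not reproduced here). For C = transpose(A), column j of C holds the entries of row j of A, so the caller must count entries per row and turn the counts into offsets:

```julia
# Build the colptr for transpose(A), where A has m rows and the given
# row indices. colptr[j+1] - colptr[j] ends up equal to the number of
# entries in row j of A.
function transpose_colptr(rowval_A::Vector{Int}, m::Int)
    colptr = zeros(Int, m + 1)
    for i in rowval_A
        colptr[i + 1] += 1          # count entries destined for column i of C
    end
    colptr[1] = 1
    for c in 2:(m + 1)
        colptr[c] += colptr[c - 1]  # cumulative sum turns counts into offsets
    end
    return colptr
end
```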
Jeff - the slowdown existed prior to my commits today. My commits improve performance, but they are basically memory-related and unrelated to the actual multiplication.
After fbf3ed8 the timing is now at 2.95 seconds. |
Some distillation of this code should go into our performance suite. |
Sparse matmul is already in the perf suite. Codespeed should help identify these regressions. I just happened to stumble across this as I was updating it.
Julia:
Matlab: