sparse multiplication is slow #942

Closed
ViralBShah opened this issue Jun 18, 2012 · 35 comments
Labels: performance (Must go faster)
Comments

@ViralBShah
Member

Julia:

julia> s = sparse(ones(1000,1000));
julia> @time s*s
elapsed time: 6.074394941329956 seconds

Matlab:

>> s = sparse(ones(1000,1000));
>> tic; s*s; toc;
Elapsed time is 2.198466 seconds.
@doobwa

doobwa commented Jun 24, 2012

I am having trouble replicating this.

julia> s=sparse(ones(1000,1000))
1000x1000 sparse matrix with 1000000 nonzeros:
[1   ,    1]  =  1.0
[2   ,    1]  =  1.0
[3   ,    1]  =  1.0
[4   ,    1]  =  1.0
[... I omitted the rest of the output... ]
julia> s*s;
no method *(SparseMatrixCSC{Float64,Int64},SparseMatrixCSC{Float64,Int64})
 in method_missing at base.jl:60

I have run make with the latest commit. Any ideas what I might be doing wrong?

@ViralBShah
Member Author

You need to do load("linalg_sparse.jl") as well.

-viral


@ViralBShah
Member Author

SparseAccumulator overhead is most likely quite high. The code should be reimplemented along the lines of this CXSparse implementation:

http://www.cise.ufl.edu/research/sparse/CXSparse/CXSparse/Source/cs_multiply.c

@doobwa

doobwa commented Jun 26, 2012

I am no expert in this area, but I posted Julia code for the naive matrix multiplication described in Gustavson (1978). Maybe it could serve as a useful starting point for someone more versed in making Julia code fast. As it stands, the method is a bit slower than Julia's current implementation; see the example at the bottom of the gist.

The code you mention seems to be LGPL'd, so any derivative work would also have to be LGPL'd, correct? The same source code is also printed in the author's 2006 book, though. The approach appears very similar to the Gustavson method, though I didn't check this thoroughly.

@ViralBShah
Member Author

This is essentially what CSparse also does. I will try it out.

@ViralBShah
Member Author

On lines 32 and 34, you are recomputing VA[jp] every time. The compiler is not yet smart enough to pull it out of the loop. If you hoist it by hand, you get the same speed as the existing Julia code. I would have expected this to be faster, since it does less work and is simpler code - but there seems to be little difference.
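Roughly, the suggested change looks like this (a sketch with hypothetical names standing in for the gist's variables, since the gist itself is not reproduced here):

# Sketch of hoisting the loop-invariant load: the B value does not change inside the
# inner loop, so read it into a local once instead of indexing on every iteration.
function accumulate_column!(x, nzvalA, rowvalA, colptrA, nzvalB, j, jp)
    vb = nzvalB[jp]                        # hoisted: one load per outer iteration
    for kp in colptrA[j]:(colptrA[j+1] - 1)
        x[rowvalA[kp]] += nzvalA[kp] * vb  # instead of nzvalA[kp] * nzvalB[jp]
    end
    return x
end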

@ViralBShah
Member Author

@doobwa Could you license this under the MIT license? With a bit of cleanup, this can replace the existing code, even though the speed is the same.

@ViralBShah
Member Author

@JeffBezanson I believe this is the same algorithm that Matlab implements. We are off by a factor of 2 compared to a C implementation. Any ideas what we can do to match performance here, or should we call a library function from CXSparse to get good performance?

function matmul{Tv,Ti}(A::SparseMatrixCSC{Tv,Ti}, B::SparseMatrixCSC{Tv,Ti})
  mA, nA = size(A)
  mB, nB = size(B)
  if nA != mB; error("mismatched dimensions"); end
  colptrA = A.colptr; rowvalA = A.rowval; nzvalA = A.nzval
  colptrB = B.colptr; rowvalB = B.rowval; nzvalB = B.nzval
  nnzC = length(nzvalA) + length(nzvalB)   # initial estimate of nnz(C); may be too small
  colptrC = Array(Ti, nB+1)
  rowvalC = Array(Ti, nnzC)
  nzvalC = Array(Tv, nnzC)
  ip = 1                # next free slot in rowvalC/nzvalC
  xb = zeros(Ti, mA)    # xb[k] == i marks that row k already appears in column i of C
  x  = zeros(Tv, mA)    # dense accumulator for the current column of C
  for i in 1:nB
      colptrC[i] = ip
      for jp in colptrB[i]:(colptrB[i+1] - 1)
          j = rowvalB[jp]
          nzB = nzvalB[jp]
          for kp in colptrA[j]:(colptrA[j+1] - 1)
              k = rowvalA[kp]
              nzA = nzvalA[kp]
              if xb[k] != i
                  rowvalC[ip] = k
                  ip += 1
                  xb[k] = i
                  x[k] = nzA * nzB
              else
                  x[k] += nzA * nzB
              end
          end
      end
      for vp in colptrC[i]:(ip - 1)
          nzvalC[vp] = x[rowvalC[vp]]
      end
  end
  colptrC[nB+1] = ip

  return SparseMatrixCSC(mA,nB,colptrC,rowvalC,nzvalC)
end
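
For reference, a quick way to exercise it against the built-in multiply (a sketch; note that the fixed nnzC estimate above can be too small whenever nnz(A*B) exceeds nnz(A) + nnz(B), which is exactly what the non-square example further down trips over):

A = sparse(rand(1000, 1000))
B = sparse(rand(1000, 1000))
@time C1 = A*B           # existing implementation
@time C2 = matmul(A, B)  # Gustavson-style implementation above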

@ViralBShah
Member Author

Also, I can get a 30% speedup here if I enable all the optimization passes in codegen.cpp. It helps the code above, but not the matrix multiply that is currently implemented.

@ViralBShah
Member Author

@doobwa Your implementation reads like it is for row-wise storage. I would have expected the outer loop to go over columns of B and the inner loops to go over columns of A. I have corrected that in my implementation above. Performance is the same for the test cases presented in this issue, since the access patterns and amount of work do not change.

@doobwa

doobwa commented Jun 26, 2012

@ViralBShah I have updated the gist with your suggested changes and the MIT license. My comments in the original version were not correct (when I said row I meant column!). The order of those loops doesn't really matter, so I have kept them as they are in the original pseudocode.

I also added a small bug fix regarding the rowval and nzval values returned (which incidentally also occurs in the current sparse matrix multiplication -- I can submit a pull request if you are interested).

With your changes I am getting some marginal speedups over the current implementation.

@ViralBShah
Member Author

Yes, please do issue a pull request.

-viral


@ViralBShah
Member Author

That's right, the changes only give marginal speedups. If you enable all the other passes in codegen.cpp, you get more aggressive optimization and some noticeable speedups. We probably need more compiler work for this code to perform better - but we have to wait for @JeffBezanson to chime in.

-viral


@pao
Member

pao commented Jun 27, 2012

Stupid lack of undo and tiny phone buttons. Sorry @doobwa, I didn't mean to delete that; I was just going to add a reference to #977 so GitHub would cross-reference.

@doobwa

doobwa commented Jun 27, 2012

In some cases (that are somewhat reasonable for my particular application) I am getting sizeable speedups compared to the old version, which is encouraging, though I don't have a C or Matlab implementation to compare against.

N = 100000
n = 10000  # number of nonzero elements
ix = int(floor(N*N * rand(n)) + 1)  # pick random elements throughout entire matrix
I = [mod(i,N) for i = ix]  # row
J = [int(i/N) for i = ix]  # col
V = int(ones(length(I)))  # all 1's

s = sparse(int32(vec(I)),int32(vec(J)),vec(V),N,N)

@time a=s*s;
@time b=matmul(s,s);


julia> @time a=s*s;
elapsed time: 0.03193998336791992 seconds

julia> @time b=matmul(s,s);
elapsed time: 0.0016660690307617188 seconds
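
For what it's worth, a comparable random test case can also be built directly with sprand (a sketch, assuming the sprand(m, n, density) form; it draws roughly density*m*n stored entries, so it is not byte-for-byte the same matrix as above):

N = 100000
d = 10000 / (N*N)    # aim for about 10000 stored entries
s = sprand(N, N, d)
@time a = s*s;
@time b = matmul(s, s);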

@ViralBShah
Member Author

@doobwa Did you try non-square matrices?

julia> matmul(sparse(rand(100,10)),sparse(rand(10,100)))
in matmul: arrayref: index out of range
 in matmul at matmul.jl:29

@ViralBShah
Member Author

We need the ability to grow the allocated region.
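
A sketch of the idea (it mirrors the resize! logic that shows up in the version of (*) posted later in this thread): before emitting a column that can add up to mA entries, grow the output buffers geometrically if they could overflow.

# Hypothetical helper: make sure rowvalC/nzvalC can hold one more (possibly dense)
# column of up to mA entries starting at slot ip; returns the new capacity.
function ensure_capacity!(rowvalC, nzvalC, ip, mA)
    nnzC = length(nzvalC)
    if ip + mA - 1 > nnzC
        newlen = nnzC + max(nnzC, mA)   # grow by at least a full column, at most 2x
        resize!(rowvalC, newlen)
        resize!(nzvalC, newlen)
    end
    return length(nzvalC)
end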

@ViralBShah
Member Author

I added the ability to grow, and merged your patches for del manually. Even so, I was getting errors with the code in the gist. With my version of the switched loops, it works fine. I am checking in my version with column-wise iteration to master for now. Can you check it out?

ViralBShah added a commit that referenced this issue Jun 29, 2012
and https://gist.github.com/2993335, with a few modifications to
handle growing, trimming upon completion, and loop ordering, which
seems to make the tests pass.

This code gives a 30% speedup if all LLVM optimizations in codegen.cpp
are turned on.

@doobwa, thanks for the simpler implementation. Sorry about being unable
to merge your pull request due to github acting up.
@ViralBShah
Member Author

The runtime is now down to 4.9 seconds from 6 seconds earlier. Disabling emit_bounds_check and enabling all LLVM passes gets it further down to 3.9 seconds. Adding FPM->add(createBBVectorizePass()) at codegen.cpp:2144 takes the runtime down to 3.6 seconds.

However, it is still about two times slower than the C implementation that is in Matlab.

ghost assigned JeffBezanson Jul 1, 2012
@ViralBShah
Member Author

This is now taking 6.3 seconds again. Seems like we have regressed a bit on performance.

@johnmyleswhite
Member

We really need to build up a large benchmark suite and test performance with each new build on all of the tests.

@ViralBShah
Member Author

That is exactly what I have been thinking. We need a new issue for continuous performance benchmarking. As has been discussed elsewhere, we actually do have some machines at MIT that can do this reliably. Even if not on every build, with a cron job, it should be easy to do this every day.

@ViralBShah
Member Author

Looking through the code, I realized that the amount of work has increased here. There is now a double transpose at the end to ensure that all row indices are in sorted order, and we now use 64-bit indices by default for sparse matrices on x86_64.
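
For context, the double transpose relies on the fact that transposing a SparseMatrixCSC emits the row indices of each column in sorted order, so transposing twice gives back the same matrix with its row indices sorted; a small sketch (using the old .' transpose syntax):

# The Gustavson-style multiply scatters entries, so rowval may be unsorted per column.
colptr = [1, 3, 4]
rowval = [2, 1, 1]            # column 1 is unsorted here
nzval  = [10.0, 20.0, 30.0]
C = SparseMatrixCSC(2, 2, colptr, rowval, nzval)
Csorted = (C.').'             # same matrix, row indices now sorted within each column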

JeffBezanson added a commit that referenced this issue Jul 3, 2013
happens to make sparse mul faster; #942
@JeffBezanson
Member

@ViralBShah see if that helps

@ViralBShah
Member Author

Now again at 3.3 seconds. Still slower than Matlab, but I am hoping that @inbounds may help close the gap some more.

@ViralBShah
Member Author

The @inbounds macro takes this to 2.7 seconds, and I have some potential code improvements in mind to try out that may completely close this gap.
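
For readers following along, @inbounds simply suppresses array bounds checks inside the annotated block (the version of (*) posted below wraps its main loops in @inbounds begin ... end); a trivial sketch of the pattern:

# Sketch: elide bounds checks in a hot loop; only safe when the indices are
# guaranteed to be in range, as they are for 1:length(x).
function sumvals(x::Vector{Float64})
    s = 0.0
    @inbounds for i in 1:length(x)
        s += x[i]    # no bounds check emitted for this access
    end
    return s
end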

ViralBShah added a commit that referenced this issue Jul 5, 2013
Gives 20% improvement on the perf test in #942
@lindahua
Contributor

lindahua commented Jul 5, 2013

When the issue of the array pointer being reloaded at each iteration is addressed (as discussed in #3268), the performance here will improve further.

@ViralBShah
Member Author

@JeffBezanson This has slowed down again over the last couple of weeks and is now taking almost 4.0 seconds.

@JeffBezanson
Member

I also observed something strange: using Ct.' was significantly faster for me than calling transpose!. I don't see how that could be, but I don't fully understand the code.

@timholy
Member

timholy commented Jul 18, 2013

For a minute I thought this was my recent transpose!, so I investigated. In the course of my investigations, I noticed something really strange: the floating-point operations in (*) are vastly slower than the integer operations.

First, I changed the function to see the types, to split components up onto separate lines, and to make sure there weren't any order effects (i.e., profiler oddities, which I'm always paranoid about but don't seem to materialize):

function (*){Tv,Ti}(A::SparseMatrixCSC{Tv,Ti}, B::SparseMatrixCSC{Tv,Ti})
    mA, nA = size(A)
    mB, nB = size(B)
    if nA != mB; error("mismatched dimensions"); end

    colptrA = A.colptr; rowvalA = A.rowval; nzvalA = A.nzval
    @show typeof(rowvalA)
    @show size(rowvalA)
    @show typeof(nzvalA)
    @show size(nzvalA)
    colptrB = B.colptr; rowvalB = B.rowval; nzvalB = B.nzval
    # TODO: Need better estimation of result space
    nnzC = min(mA*nB, length(nzvalA) + length(nzvalB))
    colptrC = Array(Ti, nB+1)
    rowvalC = Array(Ti, nnzC)
    nzvalC = Array(Tv, nnzC)

    @inbounds begin
        ip = 1
        xb = zeros(Ti, mA)
        x  = zeros(Tv, mA)
        for i in 1:nB
            if ip + mA - 1 > nnzC
                resize!(rowvalC, nnzC + max(nnzC,mA))
                resize!(nzvalC, nnzC + max(nnzC,mA))
                nnzC = length(nzvalC)
            end
            colptrC[i] = ip
            for jp in colptrB[i]:(colptrB[i+1] - 1)
                j = rowvalB[jp]                                     # line 94
                nzB = nzvalB[jp]                                 # line 95. Vastly slower than 94
                for kp in colptrA[j]:(colptrA[j+1] - 1)
                    tmp = nzvalA[kp]                            # line 97. Vastly slower than 100
                    nzC = tmp * nzB                             # line 98
#                     nzC = nzvalA[kp] * nzB
                    k = rowvalA[kp]                               # line 100
                    if xb[k] != i
                        rowvalC[ip] = k
                        ip += 1
                        xb[k] = i
                        x[k] = nzC                                   # line 105
                    else
                        x[k] += nzC                                 # line 107
                    end
                end
            end
            for vp in colptrC[i]:(ip - 1)
                nzvalC[vp] = x[rowvalC[vp]]
            end
        end
        colptrC[nB+1] = ip
    end

    splice!(rowvalC, colptrC[end]:length(rowvalC))
    splice!(nzvalC, colptrC[end]:length(nzvalC))

    # The Gustavson algorithm does not guarantee the product to have sorted row indices.
    return ((SparseMatrixCSC(mA, nB, colptrC, rowvalC, nzvalC).').')
end

Now:

julia> @sprofile s*s;
typeof(rowvalA) => Array{Int64,1}
size(rowvalA) => (1000000,)
typeof(nzvalA) => Array{Float64,1}
size(nzvalA) => (1000000,)

julia> sprofile_tree(true)
6479 julia; ???; line: 0
 6479 /lib/x86_64-linux-gnu/libc.so.6; __libc_start_main; line: 237
  6479 julia; ???; line: 0
   6479 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; julia_trampoline; line: 131
    6479 julia; ???; line: 0
     6479 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
      6479 ???; ???; line: 0
       6479 client.jl; _start; line: 369
        6479 client.jl; run_repl; line: 166
         6479 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
          6479 ???; ???; line: 0
           6479 client.jl; eval_user_input; line: 91
            6479 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
             6479 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
              6479 /home/tim/.julia/Profile/src/sprofile.jl; anonymous; line: 309
               6479 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
                1    linalg/sparse.jl; *; line: 93
                6    linalg/sparse.jl; *; line: 94
                341  linalg/sparse.jl; *; line: 95
                8    linalg/sparse.jl; *; line: 96
                1107 linalg/sparse.jl; *; line: 97
                1711 linalg/sparse.jl; *; line: 98
                124  linalg/sparse.jl; *; line: 100
                516  linalg/sparse.jl; *; line: 101
                3    linalg/sparse.jl; *; line: 102
                339  linalg/sparse.jl; *; line: 105
                2244 linalg/sparse.jl; *; line: 107
                2    linalg/sparse.jl; *; line: 111
                14   linalg/sparse.jl; *; line: 112
                63   linalg/sparse.jl; *; line: 122
                 10 sparse/csparse.jl; _rowcounts; line: 126
                 1  sparse/csparse.jl; transpose!; line: 145
                 2  sparse/csparse.jl; transpose!; line: 147
                 18 sparse/csparse.jl; transpose!; line: 148
                 32 sparse/csparse.jl; transpose!; line: 149

@ViralBShah
Member Author

transpose! expects some work to have been done in setting up colptr, which can be expensive, and without which the result is likely to be incorrect. This isn't documented, and it should perhaps have a different name.
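
A rough illustration of the kind of contract being described (a hypothetical, simplified standalone transpose, not the actual csparse.jl transpose!): the column pointers of the result must first be built from per-row counts of the input, and only then can rowval/nzval be filled in; skipping that setup is what produces incorrect results.

function transpose_sketch(A::SparseMatrixCSC{Float64,Int})
    m, n = size(A)
    # Step 1: count entries per row of A; these become the column sizes of A.'.
    counts = zeros(Int, m)
    for r in A.rowval
        counts[r] += 1
    end
    # Step 2: cumulative sums give the column pointers of the transpose.
    colptrT = Array(Int, m + 1)
    colptrT[1] = 1
    for i in 1:m
        colptrT[i+1] = colptrT[i] + counts[i]
    end
    # Step 3: only now is it safe to scatter the values into place.
    rowvalT = Array(Int, length(A.rowval))
    nzvalT  = Array(Float64, length(A.nzval))
    next = copy(colptrT)               # next free slot in each output column
    for j in 1:n
        for p in A.colptr[j]:(A.colptr[j+1] - 1)
            q = next[A.rowval[p]]
            rowvalT[q] = j
            nzvalT[q]  = A.nzval[p]
            next[A.rowval[p]] += 1
        end
    end
    return SparseMatrixCSC(n, m, colptrT, rowvalT, nzvalT)
end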

@ViralBShah
Member Author

Jeff - the slowdown existed prior to my commits today. My commits improve performance but those are basically memory related and unrelated to the actual multiplication.

@ViralBShah
Member Author

After fbf3ed8 the timing is now at 2.95 seconds.

@StefanKarpinski
Member

Some distillation of this code should go into our performance suite.

@ViralBShah
Member Author

sparse matmul is already in the perf suite. Codespeed should help identify these regressions. I just happened to stumble across this as I was updating it.
