sparse multiplication is slow #942

Closed
ViralBShah opened this issue Jun 18, 2012 · 35 comments
Labels: performance (Must go faster)
Comments

@ViralBShah
Member

Julia:

julia> s = sparse(ones(1000,1000));
julia> @time s*s
elapsed time: 6.074394941329956 seconds

Matlab:

>> s = sparse(ones(1000,1000));
>> tic; s*s; toc;
Elapsed time is 2.198466 seconds.
@doobwa

doobwa commented Jun 24, 2012

I am having trouble replicating this.

julia> s=sparse(ones(1000,1000))
1000x1000 sparse matrix with 1000000 nonzeros:
[1   ,    1]  =  1.0
[2   ,    1]  =  1.0
[3   ,    1]  =  1.0
[4   ,    1]  =  1.0
[... I omitted the rest of the output... ]
julia> s*s;
no method *(SparseMatrixCSC{Float64,Int64},SparseMatrixCSC{Float64,Int64})
 in method_missing at base.jl:60

I have run make with the latest commit. Any ideas what I might be doing wrong?

@ViralBShah
Member Author

You need to do load("linalg_sparse.jl") as well.

-viral


@ViralBShah
Member Author

SparseAccumulator overhead is most likely quite high. The code should be reimplemented along the lines of this CXSparse implementation:

http://www.cise.ufl.edu/research/sparse/CXSparse/CXSparse/Source/cs_multiply.c

@doobwa

doobwa commented Jun 26, 2012

I am no expert in this area, but I posted Julia code for the naive matrix multiplication described in Gustavson (1978). Maybe it could serve as a useful starting point for someone more versed in making Julia code fast. As it stands, the method is a bit slower than Julia's current implementation; see the example at the bottom of the gist.

The code you mention seems to be LGPL'd, so any derivative work would also have to be LGPL'd, correct? The same source code is also printed in the author's 2006 book, though. The approach appears very similar to the Gustavson method, though I didn't check this thoroughly.

@ViralBShah
Member Author

This is essentially what CSparse also does. I will try it out.

@ViralBShah
Member Author

On lines 32 and 34, you are recomputing VA[jp] every time. The compiler is not yet smart enough to pull it out of the loop. If you hoist it by hand, you get the same speed as the existing Julia code. I would have expected this to be faster, since it does less work and is simpler code - but there seems to be little difference.
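Roughly, the suggested change looks like this (a sketch with hypothetical names standing in for the gist's variables, since the gist itself is not reproduced here):

# Sketch of hoisting the loop-invariant load: the B value does not change inside the
# inner loop, so read it into a local once instead of indexing on every iteration.
function accumulate_column!(x, nzvalA, rowvalA, colptrA, nzvalB, j, jp)
    vb = nzvalB[jp]                        # hoisted: one load per outer iteration
    for kp in colptrA[j]:(colptrA[j+1] - 1)
        x[rowvalA[kp]] += nzvalA[kp] * vb  # instead of nzvalA[kp] * nzvalB[jp]
    end
    return x
end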

@ViralBShah
Member Author

@doobwa Could you license this under the MIT license? With a bit of cleanup, this can replace the existing code, even though the speed is the same.

@ViralBShah
Member Author

@JeffBezanson I believe this is the same algorithm that Matlab implements. We are off by a factor of 2 compared to a C implementation. Any ideas what we can do to match performance here, or should we call a library function from CXSparse to get good performance?

function matmul{Tv,Ti}(A::SparseMatrixCSC{Tv,Ti}, B::SparseMatrixCSC{Tv,Ti})
  mA, nA = size(A)
  mB, nB = size(B)
  if nA != mB; error("mismatched dimensions"); end
  colptrA = A.colptr; rowvalA = A.rowval; nzvalA = A.nzval
  colptrB = B.colptr; rowvalB = B.rowval; nzvalB = B.nzval
  nnzC = length(nzvalA) + length(nzvalB)   # initial estimate of nnz(C); may be too small
  colptrC = Array(Ti, nB+1)
  rowvalC = Array(Ti, nnzC)
  nzvalC = Array(Tv, nnzC)
  ip = 1                # next free slot in rowvalC/nzvalC
  xb = zeros(Ti, mA)    # xb[k] == i marks that row k already appears in column i of C
  x  = zeros(Tv, mA)    # dense accumulator for the current column of C
  for i in 1:nB
      colptrC[i] = ip
      for jp in colptrB[i]:(colptrB[i+1] - 1)
          j = rowvalB[jp]
          nzB = nzvalB[jp]
          for kp in colptrA[j]:(colptrA[j+1] - 1)
              k = rowvalA[kp]
              nzA = nzvalA[kp]
              if xb[k] != i
                  rowvalC[ip] = k
                  ip += 1
                  xb[k] = i
                  x[k] = nzA * nzB
              else
                  x[k] += nzA * nzB
              end
          end
      end
      for vp in colptrC[i]:(ip - 1)
          nzvalC[vp] = x[rowvalC[vp]]
      end
  end
  colptrC[nB+1] = ip

  return SparseMatrixCSC(mA,nB,colptrC,rowvalC,nzvalC)
end
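
For reference, a quick way to exercise it against the built-in multiply (a sketch; note that the fixed nnzC estimate above can be too small whenever nnz(A*B) exceeds nnz(A) + nnz(B), which is exactly what the non-square example further down trips over):

A = sparse(rand(1000, 1000))
B = sparse(rand(1000, 1000))
@time C1 = A*B           # existing implementation
@time C2 = matmul(A, B)  # Gustavson-style implementation above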

@ViralBShah
Member Author

Also, I can get a 30% speedup here if I enable all the optimization passes in codegen.cpp. It helps the code above, but not the matrix multiply that is currently implemented.

@ViralBShah
Member Author

@doobwa Your implementation reads like it is for row-wise storage. I would have expected the outer loop to go over columns of B and the inner loops to go over columns of A. I have corrected that in my implementation above. Performance is the same for the test cases presented in this issue, since the access patterns and amount of work do not change.

@doobwa

doobwa commented Jun 26, 2012

@ViralBShah I have updated the gist with your suggested changes and the MIT license. My comments in the original version were not correct (when I said row I meant column!). The order of those loops doesn't really matter, so I have kept them as they are in the original pseudocode.

I also added a small bug fix regarding the rowval and nzval values returned (which incidentally also occurs in the current sparse matrix multiplication -- I can submit a pull request if you are interested).

With your changes I am getting some marginal speedups over the current implementation.

@ViralBShah
Member Author

Yes, please do issue a pull request.

-viral


@ViralBShah
Member Author

That's right, the changes only give marginal speedups. If you enable all the other passes in codegen.cpp, you get more aggressive optimization and some noticeable speedups. We probably need more compiler work for this code to perform better - but we have to wait for @JeffBezanson to chime in.

-viral


@pao
Member

pao commented Jun 27, 2012

Stupid lack of undo and tiny phone buttons. Sorry @doobwa, I didn't mean to delete that; I was just going to add a reference to #977 so GitHub would cross-reference.

@doobwa

doobwa commented Jun 27, 2012

In some cases (that are somewhat reasonable for my particular application) I am getting sizeable speedups compared to the old version, which is encouraging, though I don't have a C or Matlab implementation to compare against.

N = 100000
n = 10000  # number of nonzero elements
ix = int(floor(N*N * rand(n)) + 1)  # pick random elements throughout entire matrix
I = [mod(i,N) for i = ix]  # row
J = [int(i/N) for i = ix]  # col
V = int(ones(length(I)))  # all 1's

s = sparse(int32(vec(I)),int32(vec(J)),vec(V),N,N)

@time a=s*s;
@time b=matmul(s,s);


julia> @time a=s*s;
elapsed time: 0.03193998336791992 seconds

julia> @time b=matmul(s,s);
elapsed time: 0.0016660690307617188 seconds
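
For what it's worth, a comparable random test case can also be built directly with sprand (a sketch, assuming the sprand(m, n, density) form; it draws roughly density*m*n stored entries, so it is not byte-for-byte the same matrix as above):

N = 100000
d = 10000 / (N*N)    # aim for about 10000 stored entries
s = sprand(N, N, d)
@time a = s*s;
@time b = matmul(s, s);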

@ViralBShah
Member Author

@doobwa Did you try non-square matrices?

julia> matmul(sparse(rand(100,10)),sparse(rand(10,100)))
in matmul: arrayref: index out of range
 in matmul at matmul.jl:29

@ViralBShah
Member Author

We need the ability to grow the allocated region.
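
A sketch of the idea (it mirrors the resize! logic that shows up in the version of (*) posted later in this thread): before emitting a column that can add up to mA entries, grow the output buffers geometrically if they could overflow.

# Hypothetical helper: make sure rowvalC/nzvalC can hold one more (possibly dense)
# column of up to mA entries starting at slot ip; returns the new capacity.
function ensure_capacity!(rowvalC, nzvalC, ip, mA)
    nnzC = length(nzvalC)
    if ip + mA - 1 > nnzC
        newlen = nnzC + max(nnzC, mA)   # grow by at least a full column, at most 2x
        resize!(rowvalC, newlen)
        resize!(nzvalC, newlen)
    end
    return length(nzvalC)
end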

@ViralBShah
Member Author

I added the ability to grow, and merged your patches for del manually. Even so, I was getting errors with the code in the gist. With my version of the switched loops, it works fine. I am checking in my version with column-wise iteration to master for now. Can you check it out?

ViralBShah added a commit that referenced this issue Jun 29, 2012
and https://gist.github.com/2993335, with a few modifications to
handle growing, trimming upon completion, and loop ordering, which
seems to make the tests pass.

This code gives a 30% speedup if all LLVM optimizations in codegen.cpp
are turned on.

@doobwa, thanks for the simpler implementation. Sorry about being unable
to merge your pull request due to github acting up.
@ViralBShah
Member Author

The runtime is now down to 4.9 seconds from 6 seconds earlier. Disabling emit_bounds_check and enabling all LLVM passes gets it further down to 3.9 seconds. Adding FPM->add(createBBVectorizePass()) at codegen.cpp:2144 takes the runtime down to 3.6 seconds.

However, it is still about two times slower than the C implementation that is in Matlab.

ghost assigned JeffBezanson Jul 1, 2012
@ViralBShah
Member Author

This is now taking 6.3 seconds again. Seems like we have regressed a bit on performance.

@johnmyleswhite
Member

We really need to build up a large benchmark suite and test performance with each new build on all of the tests.

@ViralBShah
Member Author

That is exactly what I have been thinking. We need a new issue for continuous performance benchmarking. As has been discussed elsewhere, we actually do have some machines at MIT that can do this reliably. Even if not on every build, with a cron job, it should be easy to do this every day.

@ViralBShah
Member Author

Looking through the code, I realized that the amount of work has increased here. There is now a double transpose at the end to ensure that all row indices are in sorted order, and we now use 64-bit indices by default for sparse matrices on x86_64.
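
For context, the double transpose relies on the fact that transposing a SparseMatrixCSC emits the row indices of each column in sorted order, so transposing twice gives back the same matrix with its row indices sorted; a small sketch (using the old .' transpose syntax):

# The Gustavson-style multiply scatters entries, so rowval may be unsorted per column.
colptr = [1, 3, 4]
rowval = [2, 1, 1]            # column 1 is unsorted here
nzval  = [10.0, 20.0, 30.0]
C = SparseMatrixCSC(2, 2, colptr, rowval, nzval)
Csorted = (C.').'             # same matrix, row indices now sorted within each column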

JeffBezanson added a commit that referenced this issue Jul 3, 2013
happens to make sparse mul faster; #942
@JeffBezanson
Member

@ViralBShah see if that helps

@ViralBShah
Member Author

Now again at 3.3 seconds. Still slower than Matlab, but I am hoping that @inbounds may help close the gap some more.

@ViralBShah
Member Author

The @inbounds macro takes this to 2.7 seconds, and I have some potential code improvements in mind to try out that may completely close this gap.
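
For readers following along, @inbounds simply suppresses array bounds checks inside the annotated block (the version of (*) posted below wraps its main loops in @inbounds begin ... end); a trivial sketch of the pattern:

# Sketch: elide bounds checks in a hot loop; only safe when the indices are
# guaranteed to be in range, as they are for 1:length(x).
function sumvals(x::Vector{Float64})
    s = 0.0
    @inbounds for i in 1:length(x)
        s += x[i]    # no bounds check emitted for this access
    end
    return s
end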

ViralBShah added a commit that referenced this issue Jul 5, 2013
Gives 20% improvement on the perf test in #942
@lindahua
Contributor

lindahua commented Jul 5, 2013

When the issue of the array pointer being reloaded at each iteration is addressed (as discussed in #3268), the performance here will improve further.

@ViralBShah
Member Author

@JeffBezanson This has slowed down again over the last couple of weeks and is now taking almost 4.0 seconds.

@JeffBezanson
Member

I also observed something strange: using Ct.' was significantly faster for me than calling transpose!. I don't see how that could be, but I don't fully understand the code.

@timholy
Member

timholy commented Jul 18, 2013

For a minute I thought this was my recent transpose!, so I investigated. In the course of my investigations, I noticed something really strange: the floating-point operations in (*) are vastly slower than the integer operations.

First, I changed the function to see the types, to split components up onto separate lines, and to make sure there weren't any order effects (i.e., profiler oddities, which I'm always paranoid about but don't seem to materialize):

function (*){Tv,Ti}(A::SparseMatrixCSC{Tv,Ti}, B::SparseMatrixCSC{Tv,Ti})
    mA, nA = size(A)
    mB, nB = size(B)
    if nA != mB; error("mismatched dimensions"); end

    colptrA = A.colptr; rowvalA = A.rowval; nzvalA = A.nzval
    @show typeof(rowvalA)
    @show size(rowvalA)
    @show typeof(nzvalA)
    @show size(nzvalA)
    colptrB = B.colptr; rowvalB = B.rowval; nzvalB = B.nzval
    # TODO: Need better estimation of result space
    nnzC = min(mA*nB, length(nzvalA) + length(nzvalB))
    colptrC = Array(Ti, nB+1)
    rowvalC = Array(Ti, nnzC)
    nzvalC = Array(Tv, nnzC)

    @inbounds begin
        ip = 1
        xb = zeros(Ti, mA)
        x  = zeros(Tv, mA)
        for i in 1:nB
            if ip + mA - 1 > nnzC
                resize!(rowvalC, nnzC + max(nnzC,mA))
                resize!(nzvalC, nnzC + max(nnzC,mA))
                nnzC = length(nzvalC)
            end
            colptrC[i] = ip
            for jp in colptrB[i]:(colptrB[i+1] - 1)
                j = rowvalB[jp]                                     # line 94
                nzB = nzvalB[jp]                                 # line 95. Vastly slower than 94
                for kp in colptrA[j]:(colptrA[j+1] - 1)
                    tmp = nzvalA[kp]                            # line 97. Vastly slower than 100
                    nzC = tmp * nzB                             # line 98
#                     nzC = nzvalA[kp] * nzB
                    k = rowvalA[kp]                               # line 100
                    if xb[k] != i
                        rowvalC[ip] = k
                        ip += 1
                        xb[k] = i
                        x[k] = nzC                                   # line 105
                    else
                        x[k] += nzC                                 # line 107
                    end
                end
            end
            for vp in colptrC[i]:(ip - 1)
                nzvalC[vp] = x[rowvalC[vp]]
            end
        end
        colptrC[nB+1] = ip
    end

    splice!(rowvalC, colptrC[end]:length(rowvalC))
    splice!(nzvalC, colptrC[end]:length(nzvalC))

    # The Gustavson algorithm does not guarantee the product to have sorted row indices.
    return ((SparseMatrixCSC(mA, nB, colptrC, rowvalC, nzvalC).').')
end

Now:

julia> @sprofile s*s;
typeof(rowvalA) => Array{Int64,1}
size(rowvalA) => (1000000,)
typeof(nzvalA) => Array{Float64,1}
size(nzvalA) => (1000000,)

julia> sprofile_tree(true)
6479 julia; ???; line: 0
 6479 /lib/x86_64-linux-gnu/libc.so.6; __libc_start_main; line: 237
  6479 julia; ???; line: 0
   6479 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; julia_trampoline; line: 131
    6479 julia; ???; line: 0
     6479 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
      6479 ???; ???; line: 0
       6479 client.jl; _start; line: 369
        6479 client.jl; run_repl; line: 166
         6479 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
          6479 ???; ???; line: 0
           6479 client.jl; eval_user_input; line: 91
            6479 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
             6479 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
              6479 /home/tim/.julia/Profile/src/sprofile.jl; anonymous; line: 309
               6479 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
                1    linalg/sparse.jl; *; line: 93
                6    linalg/sparse.jl; *; line: 94
                341  linalg/sparse.jl; *; line: 95
                8    linalg/sparse.jl; *; line: 96
                1107 linalg/sparse.jl; *; line: 97
                1711 linalg/sparse.jl; *; line: 98
                124  linalg/sparse.jl; *; line: 100
                516  linalg/sparse.jl; *; line: 101
                3    linalg/sparse.jl; *; line: 102
                339  linalg/sparse.jl; *; line: 105
                2244 linalg/sparse.jl; *; line: 107
                2    linalg/sparse.jl; *; line: 111
                14   linalg/sparse.jl; *; line: 112
                63   linalg/sparse.jl; *; line: 122
                 10 sparse/csparse.jl; _rowcounts; line: 126
                 1  sparse/csparse.jl; transpose!; line: 145
                 2  sparse/csparse.jl; transpose!; line: 147
                 18 sparse/csparse.jl; transpose!; line: 148
                 32 sparse/csparse.jl; transpose!; line: 149

@ViralBShah
Member Author

transpose! expects some work to have been done in setting up colptr, which can be expensive, and without which the result is likely to be incorrect. This isn't documented, and it should perhaps have a different name.
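
A rough illustration of the kind of contract being described (a hypothetical, simplified standalone transpose, not the actual csparse.jl transpose!): the column pointers of the result must first be built from per-row counts of the input, and only then can rowval/nzval be filled in; skipping that setup is what produces incorrect results.

function transpose_sketch(A::SparseMatrixCSC{Float64,Int})
    m, n = size(A)
    # Step 1: count entries per row of A; these become the column sizes of A.'.
    counts = zeros(Int, m)
    for r in A.rowval
        counts[r] += 1
    end
    # Step 2: cumulative sums give the column pointers of the transpose.
    colptrT = Array(Int, m + 1)
    colptrT[1] = 1
    for i in 1:m
        colptrT[i+1] = colptrT[i] + counts[i]
    end
    # Step 3: only now is it safe to scatter the values into place.
    rowvalT = Array(Int, length(A.rowval))
    nzvalT  = Array(Float64, length(A.nzval))
    next = copy(colptrT)               # next free slot in each output column
    for j in 1:n
        for p in A.colptr[j]:(A.colptr[j+1] - 1)
            q = next[A.rowval[p]]
            rowvalT[q] = j
            nzvalT[q]  = A.nzval[p]
            next[A.rowval[p]] += 1
        end
    end
    return SparseMatrixCSC(n, m, colptrT, rowvalT, nzvalT)
end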

@ViralBShah
Member Author

Jeff - the slowdown existed prior to my commits today. My commits improve performance but those are basically memory related and unrelated to the actual multiplication.

@ViralBShah
Member Author

After fbf3ed8 the timing is now at 2.95 seconds.

@StefanKarpinski
Member

Some distillation of this code should go into our performance suite.

@ViralBShah
Member Author

sparse matmul is already in the perf suite. Codespeed should help identify these regressions. I just happened to stumble across this as I was updating it.
