gk.jl example code slower than matlab #950

Closed
ViralBShah opened this issue Jun 19, 2012 · 12 comments
Labels
performance Must go faster

Comments

@ViralBShah
Member

The code in examples/gk/gk.jl is about twice as slow as the same code in matlab. The code is a mix of scalar loops and a few vectorized operations.

For a quick run, do gk(1000,[0.1]).

The profiler seems to suggest that much of the time is spent in the vectorized operations in the code. Julia is faster than matlab on small problem sizes, while slower on larger problem sizes, suggesting that the vectorized operations may be the reason for the poor performance.

@timholy
Member

timholy commented Jun 20, 2012

Thanks for trying out the profiler!

It looks like all the time is spent on line 79. Two comments:

  1. If it's the same code in Matlab & Julia, then part of the problem might be issue #190 (use copy_to in assign() and ref() where possible), which I'm finally back to working on in concert with #765 (performance on high-dimensional arrays). That could easily explain the two-fold gap. The arrays branch contains some code towards fixing this, but it's not finished yet.
  2. However, I'm really confused about that loop. It's O(n^2), and I believe the O(n) version
csum[1] = p[1]
for i = 2:n
    csum[i] = csum[i-1]+p[i]
end

will do the same thing much faster. Of course, this is what cumsum does, i.e., you could replace the whole loop with just

csum = cumsum(p)
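A quick sanity check (in modern Julia syntax; the snippets above use 2012-era syntax) that the O(n) running-sum loop agrees with cumsum:

```julia
# The O(n) running sum fills a preallocated buffer sequentially;
# cumsum allocates a fresh vector but computes the same values.
p = [0.2, 0.3, 0.1, 0.4]
n = length(p)
csum = similar(p)
csum[1] = p[1]
for i = 2:n
    csum[i] = csum[i-1] + p[i]
end
println(csum ≈ cumsum(p))  # the two approaches agree
```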

If you care, there also seem to be a number of other aspects to this that are definitely not performance-optimal. For example, "exp" is expensive, so you don't want to call it more often than necessary. So rather than

s = sum(p[1:n] .* exp((eps/2)*A[1:n,k]))
for i=1:n
    p[i]=(p[i]*exp((eps/2)*A[i,k])) / s
end

you probably want

for i=1:n
    p[i]=(p[i]*exp((eps/2)*A[i,k]))
end
p = p / sum(p)
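The two variants are numerically equivalent, which a short sketch can confirm (modern Julia syntax; n, p, A, k, and eps_ below are stand-in values, not the actual data from gk.jl):

```julia
# Version 1 precomputes the normalizing sum and divides per element
# (calling exp twice per element); version 2 scales in the loop and
# normalizes once at the end (calling exp once per element).
n = 4
p = [0.1, 0.2, 0.3, 0.4]
A = randn(n, n)
k = 1
eps_ = 0.05  # stand-in for the eps parameter in gk.jl

s = sum(p .* exp.((eps_/2) .* A[1:n, k]))
p1 = [p[i] * exp((eps_/2) * A[i, k]) / s for i = 1:n]

p2 = [p[i] * exp((eps_/2) * A[i, k]) for i = 1:n]
p2 = p2 / sum(p2)

println(p1 ≈ p2)  # same result, fewer exp calls in version 2
```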

@timholy
Member

timholy commented Jun 20, 2012

I should add that calling cumsum will allocate a new vector, while the explicit version I wrote out re-uses the same memory on each iteration. So for the ultimate in speed it's better to avoid cumsum in this case.

@ViralBShah
Member Author

So far, I have not tuned the julia code, because it is identical to the matlab code. It would be good for us to be at least as fast as anything out there, without all these improvements. Once we get there, we should update both the julia and matlab codes with the suggested improvements, and see where we stand.

The fact of the matter is that people will write code in a style they are comfortable with, and the compiler ought to do the best it can. Wouldn't it be cool if julia could make such suggestions to the user? :-)

@timholy
Member

timholy commented Jun 20, 2012

It would be good for us to be at least as fast as anything out there, without all these improvements.

Care to try manually to see if using memcpy fixes the performance gap? The arrays branch is still quite some ways away from merging, and it might be nice to know whether this really will address the problem. Basically, add a function

function myref{T}(p::Array{T}, r::Range1{Int})
    pcpy = Array(T, length(r))
    copy_to(pcpy, 1, p, first(r), length(r))
    return pcpy
end

and then in line 79 replace the sum(p[1:i]) with sum(myref(p, 1:i)).

The fact of the matter is that people will write code in a style they are comfortable with, and the compiler ought to do the best it can. Wouldn't it be cool if julia could make such suggestions to the user? :-)

As you surely know, Matlab's MLint is quite good for "low-level" optimizations, but choice of algorithm is beyond its capabilities. Still, it would be a useful thing to have. It may require an IDE, unless it could be done when functions are loaded into memory.

@ViralBShah
Member Author

I used myref for p[1:i] and p[1:n] a few lines below. For 1000x1000, it is slightly faster than matlab. However, for 10000x10000 matrices, it is still twice as slow as matlab.

-viral


@ViralBShah
Member Author

Is there any role for mmap here to improve performance on big arrays?

@timholy
Member

timholy commented Jun 20, 2012

Perhaps Matlab is avoiding the temporary altogether? I added the following functions:

function myref{T}(p::Array{T}, r::Range1{Int})
    pcpy = Array(T, length(r))
    copy_to(pcpy, 1, p, first(r), length(r))
    return pcpy
end

function mysum{T}(p::Array{T}, r::Range1{Int})
    s::T = 0   # note: big performance hit if you don't declare the type of s!
    for i = r
        s += p[i]
    end
    return s
end

and then wrote the loop this way:

for i=1:n
    csum[i] = sum(p[1:i])
    csum[i] = sum(myref(p, 1:i))
    csum[i] = mysum(p, 1:i)
end

The profiling results on the relevant lines are as follows:

  646000  4.367203   4.340326 line(93)
  646000  2.978364   2.951488 line(94)
  646000  1.194523   1.167646 line(95)

So avoiding the copy makes it even faster, nearly 4x faster than the original on gk(1000,[0.1]). That would seem to beat Matlab. Of course, you have to delete the slower versions of those lines in the loop! (I put all three of them in there because the # of iterations is variable, so this seemed the safest way to compare them directly.)

So this would seem to suggest that the biggest savings are to be had in avoiding the ref operation.
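A note for later readers: in modern Julia the same savings no longer require a helper function, because a view can stand in for the copying slice (a sketch of the technique, not the 2012 code):

```julia
# Slicing p[1:i] allocates a temporary copy before summing; a view
# reads the original memory in place, avoiding the allocation that
# dominated the profile above.
p = rand(1000)
s_copy = sum(p[1:500])        # allocates a 500-element temporary
s_view = sum(@view p[1:500])  # no temporary allocation
println(s_copy ≈ s_view)      # identical result either way
```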

Just one other thing to check, you are running Matlab single-threaded?

@timholy
Member

timholy commented Jun 20, 2012

No, I don't think mmap will help. mmap's strength is efficient handling of data sets too big to store in RAM, but 10k x 10k is, by modern standards, still not very memory intensive.

@ViralBShah
Member Author

I'll ensure I am using only one thread in matlab - that always gets me. In general, I do believe that matlab has optimized away a lot of temporary creations over the years.

-viral


@ViralBShah
Member Author

This code is no longer twice as slow as matlab. Matlab is 1.6 seconds, whereas julia is 2.1 seconds for me, on the example invocation described above.

@JeffBezanson
Member

This just got significantly faster for me, perhaps due to LLVM 3.2. Could you try it again?

@ViralBShah
Member Author

Yes, we are now consistently faster than Matlab.
