gk.jl example code slower than matlab #950

Closed
ViralBShah opened this issue Jun 19, 2012 · 12 comments
Labels
performance Must go faster

Comments

@ViralBShah
Member

The code in examples/gk/gk.jl is about twice as slow as the same code in matlab. The code is a mix of scalar loops and a few vectorized operations.

For a quick run, do gk(1000,[0.1]).

The profiler seems to suggest that much of the time is spent in the vectorized operations in the code. Julia is faster than matlab on small problem sizes, while slower on larger problem sizes, suggesting that the vectorized operations may be the reason for the poor performance.

@timholy
Member

timholy commented Jun 20, 2012

Thanks for trying out the profiler!

It looks like all the time is spent on line 79. Two comments:

  1. If it's the same code in Matlab & Julia, then part of the problem might be issue #190 (use copy_to in assign() and ref() where possible), which I'm finally back to working on in concert with #765 (performance on high-dimensional arrays). That could easily explain the two-fold gap. The arrays branch contains some code towards fixing this, but it's not finished yet.
  2. However, I'm really confused about that loop. It's O(n^2), and I believe the O(n) version
csum[1] = p[1]
for i = 2:n
    csum[i] = csum[i-1]+p[i]
end

will do the same thing much faster. Of course, this is what cumsum does, i.e., you could replace the whole loop with just

csum = cumsum(p)
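A quick sanity check (in modern Julia syntax; the snippets above use 2012-era syntax) that the O(n) running-sum loop agrees with cumsum:

```julia
# The O(n) running sum fills a preallocated buffer sequentially;
# cumsum allocates a fresh vector but computes the same values.
p = [0.2, 0.3, 0.1, 0.4]
n = length(p)
csum = similar(p)
csum[1] = p[1]
for i = 2:n
    csum[i] = csum[i-1] + p[i]
end
println(csum ≈ cumsum(p))  # the two approaches agree
```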

If you care, there also seem to be a number of other aspects to this that are definitely not performance-optimal. For example, "exp" is expensive, so you don't want to call it more often than necessary. So rather than

s = sum(p[1:n] .* exp((eps/2)*A[1:n,k]))
for i=1:n
    p[i]=(p[i]*exp((eps/2)*A[i,k])) / s
end

you probably want

for i=1:n
    p[i]=(p[i]*exp((eps/2)*A[i,k]))
end
p = p / sum(p)
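The two variants are numerically equivalent, which a short sketch can confirm (modern Julia syntax; n, p, A, k, and eps_ below are stand-in values, not the actual data from gk.jl):

```julia
# Version 1 precomputes the normalizing sum and divides per element
# (calling exp twice per element); version 2 scales in the loop and
# normalizes once at the end (calling exp once per element).
n = 4
p = [0.1, 0.2, 0.3, 0.4]
A = randn(n, n)
k = 1
eps_ = 0.05  # stand-in for the eps parameter in gk.jl

s = sum(p .* exp.((eps_/2) .* A[1:n, k]))
p1 = [p[i] * exp((eps_/2) * A[i, k]) / s for i = 1:n]

p2 = [p[i] * exp((eps_/2) * A[i, k]) for i = 1:n]
p2 = p2 / sum(p2)

println(p1 ≈ p2)  # same result, fewer exp calls in version 2
```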

@timholy
Member

timholy commented Jun 20, 2012

I should add that calling cumsum will allocate a new vector, while the explicit version I wrote out re-uses the same memory on each iteration. So for the ultimate in speed it's better to avoid cumsum in this case.

@ViralBShah
Member Author

So far, I have not tuned the julia code, because it is identical to the matlab code. It would be good for us to be at least as fast as anything out there, without all these improvements. Once we get there, we should update both the julia and matlab codes with the suggested improvements, and see where we stand.

The fact of the matter is that people will write code in a style they are comfortable with, and the compiler ought to do the best it can. Wouldn't it be cool if julia could make such suggestions to the user? :-)

@timholy
Member

timholy commented Jun 20, 2012

It would be good for us to be at least as fast as anything out there, without all these improvements.

Care to try manually to see if using memcpy fixes the performance gap? The arrays branch is still quite some ways away from merging, and it might be nice to know whether this really will address the problem. Basically, add a function

function myref{T}(p::Array{T}, r::Range1{Int})
    pcpy = Array(T, length(r))
    copy_to(pcpy, 1, p, first(r), length(r))
    return pcpy
end

and then in line 79 replace the sum(p[1:i]) with sum(myref(p, 1:i)).

The fact of the matter is that people will write code in a style they are comfortable with, and the compiler ought to do the best it can. Wouldn't it be cool if julia could make such suggestions to the user? :-)

As you surely know, Matlab's MLint is quite good for "low-level" optimizations, but choice of algorithm is beyond its capabilities. Still, it would be a useful thing to have. It may require an IDE, unless it could be done when functions are loaded into memory.

@ViralBShah
Member Author

I used myref for p[1:i] and p[1:n] a few lines below. For 1000x1000, it is slightly faster than matlab. However, for 10000x10000 matrices, it is still twice as slow as matlab.

-viral


@ViralBShah
Member Author

Is there any role for mmap here to improve performance on big arrays?

@timholy
Member

timholy commented Jun 20, 2012

Perhaps Matlab is avoiding the temporary altogether? I added the following functions:

function myref{T}(p::Array{T}, r::Range1{Int})
    pcpy = Array(T, length(r))
    copy_to(pcpy, 1, p, first(r), length(r))
    return pcpy
end

function mysum{T}(p::Array{T}, r::Range1{Int})
    s::T = 0   # note: big performance hit if you don't declare the type of s!
    for i = r
        s += p[i]
    end
    return s
end

and then wrote the loop this way:

for i=1:n
    csum[i] = sum(p[1:i])
    csum[i] = sum(myref(p, 1:i))
    csum[i] = mysum(p, 1:i)
end

The profiling results on the relevant lines are as follows:

  646000  4.367203   4.340326 line(93)
  646000  2.978364   2.951488 line(94)
  646000  1.194523   1.167646 line(95)

So avoiding the copy makes it even faster, nearly 4x faster than the original on gk(1000,[0.1]). That would seem to beat Matlab. Of course, you have to delete the slower versions of those lines in the loop! (I put all three of them in there because the # of iterations is variable, so this seemed the safest way to compare them directly.)

So this would seem to suggest that the biggest savings are to be had in avoiding the ref operation.
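A note for later readers: in modern Julia the same savings no longer require a helper function, because a view can stand in for the copying slice (a sketch of the technique, not the 2012 code):

```julia
# Slicing p[1:i] allocates a temporary copy before summing; a view
# reads the original memory in place, avoiding the allocation that
# dominated the profile above.
p = rand(1000)
s_copy = sum(p[1:500])        # allocates a 500-element temporary
s_view = sum(@view p[1:500])  # no temporary allocation
println(s_copy ≈ s_view)      # identical result either way
```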

Just one other thing to check, you are running Matlab single-threaded?

@timholy
Member

timholy commented Jun 20, 2012

No, I don't think mmap will help. mmap's strength is efficient handling of data sets too big to store in RAM, but 10k x 10k is, by modern standards, still not very memory intensive.

@ViralBShah
Member Author

I'll ensure I am using only one thread in matlab - that always gets me. In general, I do believe that matlab has optimized away a lot of temporary creations over the years.

-viral


@ViralBShah
Member Author

This code is no longer twice as slow as matlab. Matlab is 1.6 seconds, whereas julia is 2.1 seconds for me, on the example invocation described above.

@JeffBezanson
Member

This just got significantly faster for me, perhaps due to LLVM 3.2. Could you try it again?

@ViralBShah
Member Author

Yes, we are now consistently faster than Matlab.
