
RFC: robust division for Complex{Float64} #5112

Merged 2 commits into JuliaLang:master on Dec 15, 2013

Conversation

@ggggggggg (Contributor)

Robust complex division for Complex{Float64}, based on arXiv:1210.4539. Includes tests based on ten hard complex-division examples.
Closes #5072

@StefanKarpinski (Member)

Awesome. I'll let @JeffBezanson take a look and pull the trigger.

@ViralBShah (Member)

This looks really thorough!

@ViralBShah (Member)

How does the performance compare to the existing version?

@JeffBezanson (Member)

It seems like this would work for Float32 as well, at least. I realize the test cases are Float64, but that's ok.

@ggggggggg (Contributor, Author)

I'll work up some benchmarks later; in the paper they benchmark the robust algorithm at something like 1/3 slower than Smith's algorithm (which I believe is the current algorithm). As for Float32, I was under the impression that Julia does the calculation in Float64 and then converts back to Float32. If that is the case, then I think this algorithm offers no advantage over the previous (Smith's) algorithm, and will probably be slower.

In the paper, their reference "correct" answers come from running Smith's algorithm on Float64 input, but with an Intel extension that causes the actual calculation to be done in 80-bit (Float80) precision.
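
For contrast, here is a minimal sketch of Smith's scheme (an assumed paraphrase for illustration, not code from this PR): divide through by the larger of |c| and |d| so the inner ratio has magnitude at most one, avoiding the direct formation of c^2 + d^2.

function smith_cdiv(a::Float64, b::Float64, c::Float64, d::Float64)
    # computes (a + i*b)/(c + i*d) without squaring c and d directly (sketch only)
    if abs(d) <= abs(c)
        r = d/c                 # |r| <= 1, so c + d*r cannot spuriously overflow
        t = 1.0/(c + d*r)
        complex((a + b*r)*t, (b - a*r)*t)
    else
        r = c/d
        t = 1.0/(c*r + d)
        complex((a*r + b)*t, (b*r - a)*t)
    end
end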

@JeffBezanson (Member)

There is no magic to do Float32 operations as Float64, but you are right that we should do Float32 complex division by converting to Float64, using the old algorithm, and converting back.
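
A minimal sketch of that widen-divide-narrow approach (an assumption, not the merged code):

function f32_cdiv(a::Complex{Float32}, b::Complex{Float32})
    # widen to Float64, where Float32 magnitudes cannot overflow, then narrow back
    w = convert(Complex{Float64}, a) / convert(Complex{Float64}, b)
    convert(Complex{Float32}, w)
end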

@JeffBezanson (Member)

I'm not too worried about performance here; I think the extra accuracy is well worth it.

@JeffBezanson (Member)

OK, I think this is ready to merge. But I'd like to ask that the link to the Scilab file be removed, since you did not copy the code from there and I don't want to create confusion about whether we are bound by its license.

@ggggggggg (Contributor, Author)

Ok. That was fun. Now I need to find another easy issue.

@jiahao (Member) commented Dec 15, 2013

bump

JeffBezanson added a commit that referenced this pull request on Dec 15, 2013: RFC: robust division for Complex{Float64}
JeffBezanson merged commit d89ea95 into JuliaLang:master on Dec 15, 2013
@ViralBShah (Member)

Great to have this merged.

ggggggggg deleted the complex_division_robust branch on December 15, 2013, 18:06
@stevengj (Member)

Isn't the main point of the old (Smith) algorithm to avoid spurious overflow (in abs2(z))? If we implement Float32 division in terms of Float64 operations, then overflow should be impossible.

Speaking of spurious overflow, the definition inv(z::Complex) = conj(z)/abs2(z) seems wrong, e.g. it gives:

julia> inv(1e300+0im)
0.0 - 0.0im
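
Here abs2(1e300 + 0im) overflows to Inf, so the quotient collapses to zero. A hedged sketch of one possible remedy (my assumption, not a fix adopted in this thread): rescale z by its largest component before squaring.

function inv_scaled(z::Complex128)
    # scale so the components of w have magnitude at most 1; a power-of-two
    # scale factor would additionally avoid introducing rounding error
    s = max(abs(real(z)), abs(imag(z)))
    w = z/s
    conj(w)/(abs2(w)*s)
end

inv_scaled(1e300 + 0im)   # 1.0e-300 - 0.0im, rather than 0.0 - 0.0im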

#            a + i*b
#  p + i*q = ---------
#            c + i*d
function robust_cdiv{T<:Float64}(a::T,b::T,c::T,d::T)

What is the T<:Float64 for? Since Float64 is a concrete type, how is it possible for T to be anything but Float64?

@stevengj (Member)

(It would be nice to see our own benchmarks; I've learned the hard way that one should always be cautious about performance claims in papers.)

@ggggggggg (Contributor, Author)

I used T<:Float64 mostly because it was more concise than writing Float64 repeatedly throughout the function. I didn't think there would be any performance penalty.

I'll put benchmarks back on my todo.

@JeffBezanson (Member)

There is no performance penalty, but it would be cleaner to remove the Ts. I'm pretty sure all those declarations are redundant.

@StefanKarpinski (Member)

Aside: it can actually be useful to use a concrete type as an upper type bound if the parameter could be None:

foo{T<:Float64}(v::Vector{T}) = ...

This foo method now accepts arrays of type Array{Float64} or Array{None}, which is sometimes convenient.
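
To illustrate (assuming 0.2-era semantics, where None is the bottom type, so None <: Float64 holds; the body here is hypothetical):

foo{T<:Float64}(v::Vector{T}) = eltype(v)

foo([1.0, 2.0])      # matches: Vector{Float64}
foo(Array(None, 0))  # also matches: Vector{None}
foo([1, 2])          # no method: Int is not <: Float64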

@ggggggggg (Contributor, Author)

I put together a simple benchmark, and the speed difference seems to be greater than the paper suggested. The paper benchmarked the robust algorithm at about 4/3 the time of the Smith algorithm, if I'm remembering correctly; here I'm seeing closer to a factor of 5. No idea why so much memory is being allocated.

https://gist.github.com/ggggggggg/8046958
robust/new
elapsed time: 0.311824189 seconds (64940708 bytes allocated)
elapsed time: 0.106856784 seconds (64000048 bytes allocated)
elapsed time: 0.086953636 seconds (64000048 bytes allocated)
elapsed time: 0.087881237 seconds (64000048 bytes allocated)
elapsed time: 0.099615934 seconds (64000048 bytes allocated)
smith/old
elapsed time: 0.027790837 seconds (349352 bytes allocated)
elapsed time: 0.021653097 seconds (48 bytes allocated)
elapsed time: 0.021378803 seconds (48 bytes allocated)
elapsed time: 0.020773924 seconds (48 bytes allocated)
elapsed time: 0.020672097 seconds (48 bytes allocated)

@JeffBezanson (Member)

Are you running that on Julia 0.2? It should be much faster with the current master.

@stevengj (Member)

I did a simple benchmark with master:

foo(A, B) = Complex128[(A[i] * conj(B[i])) / abs2(B[i]) for i in 1:length(A)]
bar(A, B) = Complex128[A[i] / B[i] for i in 1:length(A)]
baz(A, B) = Complex128[abs(real(B[i]) + imag(B[i])) > 1e150 ? A[i] / B[i] : (A[i] * conj(B[i])) / abs2(B[i]) for i in 1:length(A)]
n = 10^7
x = rand(n) + rand(n)*im
y = rand(n) + rand(n)*im
@time foo(x,y);
@time bar(x,y);
@time baz(x,y);

and got

elapsed time: 0.198087233 seconds (160000096 bytes allocated)
elapsed time: 0.405705481 seconds (160000096 bytes allocated)
elapsed time: 0.201764476 seconds (160000096 bytes allocated)

I haven't had a chance to try the old (Smith) code. However, it looks like:

- The new code (robust div) is about 2x slower than naive division.
- Adding a simple check for large values, and using naive division in the common case where overflow does not occur (almost all the time in practice), is nearly as fast as naive division.

@ggggggggg (Contributor, Author)

I ran my benchmark on 0.2 on my work machine, so I built the latest master at Jeff's suggestion and reran it. I get much closer performance now.

robust/new: elapsed time: 0.068046929 seconds (48 bytes allocated)
smith/old: elapsed time: 0.05163035 seconds (48 bytes allocated)

Also, the simple check for potential overflow doesn't work; at the least, I think it misses cases with underflow. I ran stevengj's baz function through the hard division tests in test/complex.jl with [sb_accuracy(baz(h[1],h[2])[1],h[3]) for h in harddivs]. On half of the tests it gets the correct answer (53 bits); on the other half it fails (0 bits).

@stevengj (Member)

I'm not surprised that my test was too simple (e.g. I just noticed that the abs call is in the wrong place); it was just a quick test to see the overhead of a basic check.

My point is that it is worth using the naive algorithm if we can devise a simple test for when it is valid, and the robust algorithm otherwise, because in the overwhelming majority of real-world cases (no underflow/overflow) the naive algorithm is fine. Modifying your benchmark code to look at the naive algorithm and a "checked" version (the naive algorithm or the robust one depending on a simple test), I find that the results are even better than my benchmark from above:

robust/new -- elapsed time: 0.031266115 seconds (48 bytes allocated)
smith/old -- elapsed time: 0.020095014 seconds (48 bytes allocated)
naive -- elapsed time: 0.001157761 seconds (48 bytes allocated)
checked naive -- elapsed time: 0.002784719 seconds (48 bytes allocated)

The robust version is 50% slower than Smith. The naive version is 20x faster than Smith, and the naive version plus a simple check is 10x faster than Smith.

For a factor of 10 improvement, it is probably worth spending some time to figure out a reliable test for whether the naive algorithm is sufficient.

@stevengj (Member)

The following routine passes all of the harddiv tests:

function checked_div(a::Complex128, b::Complex128)
    # 1-norm of the denominator: a cheap proxy for its magnitude
    bnorm1 = abs(real(b)) + abs(imag(b))
    # use the robust algorithm only when abs2(b) could overflow or underflow
    bnorm1 > 1e153 || bnorm1 < 1e-153 ? robust_cdiv(a, b) : a * (conj(b) / abs2(b))
end

and is about 50% faster than the old Smith method in your benchmark. If I manually inline checked_div in the benchmark, it is about 6x faster than the old Smith method.

@JeffBezanson, I don't suppose much can be done about the function-call overhead? Is this another case for #1106?

PS. The whole bnorm1 = abs(real(b)) + abs(imag(b)); bnorm1 > 1e153 || bnorm1 < 1e-153 check could almost certainly be sped up considerably by bit-twiddling to examine the exponent bits directly.
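
A hedged sketch of that bit-twiddling idea (an assumption, not code from this thread): mask off the sign and mantissa, then compare the raw biased exponent field against precomputed bounds, with no floating-point operations at all. One would still need to decide how to combine the two components of b; this only shows the per-component exponent test.

function exponent_in_safe_range(x::Float64)
    # 1e-153 ~ 2^-509 and 1e153 ~ 2^508, so bound the biased exponent field
    e = reinterpret(Uint64, x) & 0x7ff0000000000000  # exponent bits only
    lo = uint64(1023 - 509) << 52                    # biased exponent near 1e-153
    hi = uint64(1023 + 508) << 52                    # biased exponent near 1e153
    lo <= e <= hi
end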

Benchmark routines:

function bench_check(n,d)
    for i = 1:length(n)
        checked_div(n[i],d[i])
    end
end

function bench_check_inline(n,d)
    for i = 1:length(n)
        di = d[i]
        bnorm1 = abs(real(di)) + abs(imag(di))
        bnorm1 > 1e153 || bnorm1 < 1e-153 ? robust_cdiv(n[i], di) : n[i] * (conj(di) / abs2(di))
    end
end

@ggggggggg (Contributor, Author)

I see; you're fast. Here is another routine that passes all the harddiv tests. I think it is slightly slower than yours, but I can only test on 0.2 right now, which we know is very slow with robust_cdiv:

robust/current: elapsed time: 0.098058714 seconds (64006848 bytes allocated)
smith/old: elapsed time: 0.019170227 seconds (48 bytes allocated)
naive_test: elapsed time: 0.01587455 seconds (88712 bytes allocated)

function naive_test_cdiv(a::Complex128, b::Complex128)
    c = naive_cdiv(a,b)
    cr,ci = reim(c)
    # fall back to the robust algorithm if the naive result looks suspicious:
    # a non-finite component (overflow), or a zero component coming from a
    # nonzero numerator part (possible underflow)
    if !isfinite(cr) || !isfinite(ci) || (cr == 0.0 && real(a) != 0.0) || (ci == 0.0 && imag(a) != 0.0)
        return robust_cdiv(a,b)
    end
    c
end

Updated my gist with this benchmark and the harddiv tests.
https://gist.github.com/ggggggggg/8046958

@stevengj (Member)

Here is another routine that is about 2.25x faster than Smith, just by inlining the complex operations in the naive division:

function checked_div(a::Complex128, b::Complex128)
    br = real(b); bi = imag(b)
    bnorm1 = abs(br) + abs(bi)
    if bnorm1 > 1e153 || bnorm1 < 1e-153
        # extreme denominator: br*br + bi*bi could overflow or underflow
        robust_cdiv(a, b)
    else
        # naive a*conj(b)/abs2(b), written out in real arithmetic
        binv = 1.0 / (br*br + bi*bi)
        ar = real(a); ai = imag(a)
        br *= binv; bi *= binv
        Complex128(ar * br + ai * bi, ai * br - ar * bi)
    end
end

@stevengj (Member)

I benchmarked your routine, along with an optimized version of your routine that inlines naive_cdiv similarly to my code, and the results (best times) are:

robust/new elapsed time: 0.032071361 seconds (48 bytes allocated)
smith/old elapsed time: 0.020198377 seconds (48 bytes allocated)
naive elapsed time: 0.001157725 seconds (48 bytes allocated)
checked_div elapsed time: 0.008129058 seconds (48 bytes allocated)
inlined checked_div elapsed time: 0.002815216 seconds (48 bytes allocated)
naive_test_cdiv elapsed time: 0.013251627 seconds (48 bytes allocated)
optimized naive_test_cdiv elapsed time: 0.011002862 seconds (48 bytes allocated)

@ggggggggg (Contributor, Author)

The inlined checked_div is impressively fast. Seems pretty close to the best of both worlds.

@stevengj (Member)

The problem is that it is manually inlined (directly into the benchmark function), so this is not possible as a way to implement / without improvements to Julia's inlining capabilities. checked_div (my 2nd version above) is probably about as good as we can do for now.
