Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

exp and log are much slower than they should be. #135

Open
oscardssmith opened this issue Jan 20, 2022 · 27 comments
Open

exp and log are much slower than they should be. #135

oscardssmith opened this issue Jan 20, 2022 · 27 comments

Comments

@oscardssmith
Copy link
Member

both are around 100x slower than float64. exp in particular should be able to be done relatively quickly since exp(hi+lo)=exp(hi)exp(lo), and since hi can easily be reduced to [1,2), exp(lo) only needs 2 terms.

I have fewer ideas for log, but there should be something better than the current behavior.

also, it's worth figuring on double64 versions of these functions because the double32 versions can be done really quickly using float64.

@JeffreySarnoff
Copy link
Member

Yes .. the Double32 type was an early artifact of hoping to redouble Double64s to compute Double128 accurately. At the time, Julia did not provide the mechanism for that sort involutve constructor to just work [or to work at all]. A next large reorganization of DoubleFloats just drops Double32, and just as you say .. recommends the use of Float64 has the doubled Float32 precision.

I had not worked with exp and log in quite a while. The low performance is unacceptable .. and was not always the case ... thank you for bringing this to light.

@JeffreySarnoff
Copy link
Member

There are at least a few approaches to implementing exp and log. One gives each their own algorithmic specialization and whatever specific computational advantages are to be had. One works hard to speed exp for its own sake and uses it within log to refine the result. At some point I replaced one of those approaches with the other -- and the accuracy suffered some. To undo the revisions exactly. reduplicating the earlier implementations, was an ask too murky for me at the time (my comfort with GitHub has increased slowly).

@oscardssmith
Copy link
Member Author

How much error are you willing to accept for an exp implementation?

@JeffreySarnoff
Copy link
Member

If you sample deeply through all of the arithmetic and each elementary function, and then do the same with other implementations of double-double, we get at least as many and often more accurate bits in our results.
exp is or may be used in constructing other functions that are defined in terms of exp and other things. So, there is less leeway with the arithmetic ops and exp and some others. I like that we do Float106 in a way that better protects the accuracy of the lower order bits [not always the lowest order bit].

@oscardssmith
Copy link
Member Author

How would you feel about 4 ulps (or so) in the low part of the output? I think that will be roughly the threshold that is achievable without significant usage of triple compensated arithmetic.

@JeffreySarnoff
Copy link
Member

I need to find the version of that number I had decided upon some time ago. I do recall there being some threshold of cooperation throughout the functional realizations. OTOH, rather than go full triple-double inside exp, using triple-doubles for the first two-sh coeffs may be the medium.

@oscardssmith
Copy link
Member Author

oscardssmith commented Jan 20, 2022

How does this look? It's still missing the overflow/underflow checks, but I'm seeing this as benching roughly 10x faster than the current implementation, and I think the error should be acceptable (I haven't fully tested it though)

using DoubleFloats, ProfileView, Plots
setprecision(BigFloat, 2048)

@inline function two_sum(a, b)
    hi = a + b
    a1 = hi - b
    lo = (a - a1) + (b - (hi - a1))
    return hi, lo
end

@inline function exthorner(x, p::Tuple)
    hi, lo = p[end], zero(x)
    for i in length(p)-1:-1:1
        pi = p[i]
        prod = hi*x
        err1 = fma(hi, x, -prod)
        hi, err2 = two_sum(pi,prod)
        lo = fma(lo, x, err1 + err2)
    end
    return hi, lo
end

const coefs = Tuple(log(big(2))^n/factorial(big(n)) for n in 1:10)
const coefs_hi =  Float64.(coefs)
const coefs_lo = Float64.(coefs .- coefs_hi)

function exp_kernel(x::Float64)
    hi, lo = exthorner(x, coefs_hi)
    lo2 = evalpoly(x, coefs_lo)
    #lo2 = Float64(evalpoly(x, coefs.-coefs_hi))
    hix = hi*x
    lo_all = fma(lo, x, fma(lo2, x, fma(hi, x, -hix)))
    return Double64(hix, lo_all)
end

function make_table(size, n=1)
    t_array = zeros(Double64, 16);
    for j in 1:size
        val = 2.0^(BigFloat(j-1)/(16*n))
        t_array[j] = val
    end
    return Tuple(t_array)
end
const T1 = make_table(16)
const T2 = make_table(16, 16)

function double_exp2(x::Float64)
    N = round(Int, 256*x)
    k = N>>8
    j1 = T1[(N&255)>>4 + 1]
    j2 = T2[N&15 + 1]
    r = exp_kernel(muladd((-1/256), N, x))
    return r * j1 * j2 * exp2(k)
end

function double_exp2(a::Double64)
    x = a.hi
    N = round(Int, 256*x)
    k = N>>8
    j1 = T1[(N&255)>>4 + 1]
    j2 = T2[N&15 + 1]
    r = fma((-1/256), N, x)
    poly = exp_kernel(r)
    e2k = exp2(k)
    poly_lo = exp_kernel(a.lo)
    lo_part = fma(poly, poly_lo, poly_lo) + poly
    ans = fma(j1*j2, lo_part, j1*j2)
    return e2k*ans
end

Edit: This still has a bug somewhere that is causing it to not give nearly as many sig figs as it should.

@JeffreySarnoff
Copy link
Member

It looks good, much more useful than was available an hour ago!
I need to write a harness of more focused test regions.

@oscardssmith
Copy link
Member Author

I just updated the code. I'm seeing about half the error of the current implementation with 10x speed improvement. Now I just need to make a PR and add back the overflow/underflow checks.

@JeffreySarnoff
Copy link
Member

JeffreySarnoff commented Jan 21, 2022 via email

This was referenced Jan 21, 2022
@JeffreySarnoff
Copy link
Member

let me know when you want to merge something

@oscardssmith
Copy link
Member Author

Can you merge https://github.com/oscardssmith/DoubleFloats.jl/tree/oscardssmith-faster-square? it's a smaller change, but should be ready to merge. this PR needs a small amount of extra work to avoid an accuracy regression (unless you are OK with a minor accuracy regression).

@JeffreySarnoff
Copy link
Member

JeffreySarnoff commented Jan 28, 2022

merged the faster-square
strongly prefer maintaining accuracy
is the regression confined to some lesser used region?
Please give me a few examples that regress accuracy.

@JeffreySarnoff
Copy link
Member

I just ran some rand * scale vectors of Double64 and Float128 through exp and log.
Double64 outperforms Float128 by 2x .. 3.5x, that is rather impressive, considering
all the person-years of development and all the more person-years of use and reporting that libquadmath has going for it.

@oscardssmith is there some corner of the domain that you found troublesome, or did merging your earlier improvement work quite well?

@oscardssmith
Copy link
Member Author

what do you mean?

@JeffreySarnoff
Copy link
Member

The title on this issue seems outdated. Before changing it, I am confirming with you that the performance of exp and log have improved over these past months. It appears they have.

@oscardssmith
Copy link
Member Author

how have they changed? the source likes identical

@JeffreySarnoff
Copy link
Member

Limited benchmarking does not report slow performance relative to Float128,
allowing some proportional time adjustment for the 7 extra sigbits does not appear to matter.

I used 1024 rands, multiplied them by 2, subtracted 1 to have both positive and negative values.
These signed floats were scaled, given them these extrema (34.8, -34.9).
A second test vector was the first one translated for log, the second vector's extrema: (0.17, 69.9).

julia> (@btime map(exp,xs) setup=(xs=float128));
  1.513 ms (1 allocation: 16.12 KiB)

julia> (@btime map(exp,xs) setup=(xs=double64));
  326.600 μs (1 allocation: 16.12 KiB)

julia> (@btime map(log,xs) setup=(xs=float128_nonneg));
  665.800 μs (1 allocation: 16.12 KiB)

julia> (@btime map(log,xs) setup=(xs=double64_nonneg));
  388.900 μs (1 allocation: 16.12 KiB)

@oscardssmith
Copy link
Member Author

yeah, it's not slow compared to float128, but that still is roughly 100x slower than Float64.

@photor
Copy link

photor commented Mar 17, 2024

yeah, it's not slow compared to float128, but that still is roughly 100x slower than Float64.

Any progress since then?

@oscardssmith
Copy link
Member Author

no. I haven't had much time to play with this. I still think that ~5-10x faster than the current algorithm should be achievable. but it is one where you need to be careful to get all the errors to line up properly. it would help to know how many ulps is considered acceptable. I would be inclined to go for relatively loose accuracy requirements since if the last couple bits matter, you are likely better off with multifloats float64x3.

@photor
Copy link

photor commented Mar 18, 2024

no. I haven't had much time to play with this. I still think that ~5-10x faster than the current algorithm should be achievable. but it is one where you need to be careful to get all the errors to line up properly. it would help to know how many ulps is considered acceptable. I would be inclined to go for relatively loose accuracy requirements since if the last couple bits matter, you are likely better off with multifloats float64x3.

Thanks, I was not aware of the package multifloats before. But its webpages say it doesn't support functions like exp and log? Another question is, can multifloats be fit in GPU? Doublefloats can have impressive performance with matrix ops on GPU, though I failed to make it work for exp.

@JeffreySarnoff
Copy link
Member

JeffreySarnoff commented Mar 18, 2024 via email

@JeffreySarnoff
Copy link
Member

I just spent time on this -- the current exp benchmarks 10x slower than Float64 -- no longer 100x slower.
@oscardssmith @photor

julia> using DoubleFloats

julia> x = log(2.0)
0.6931471805599453

julia> dx = log(Double64(2.0))
6.93147180559945309417232121458179079e-01

julia> @b exp(x)
35.361 ns (2.01 allocs: 32.264 bytes)

julia> @b exp(dx)
359.211 ns (2.11 allocs: 67.158 bytes)


julia> x = log(37.0)
3.6109179126442243

julia> dx = log(Double64(37.0))
3.61091791264422444436809567103144955

julia> @b exp(x)
34.032 ns (2.01 allocs: 32.256 bytes)

julia> @b exp(dx)
374.667 ns (2.11 allocs: 67.200 bytes)

Julia Version 1.10.2
Commit bd47eca2c8 (2024-03-01 10:14 UTC)

@oscardssmith
Copy link
Member Author

What computer are you on? 35ns for Float64 exp is way too slow. I get ~5ns on an 11th gen laptop on battery.

julia> @btime exp($(Ref(x)[]))
  4.885 ns (0 allocations: 0 bytes)
2.0

@JeffreySarnoff
Copy link
Member

Thanks for that. It seems that using Chairmarks for this messed up my results. Back to the mill.

@photor
Copy link

photor commented Mar 19, 2024

This not an easy fix, the time taken is a result of optimizing for accuracy over speed. There is a less accurate implementation (as I recall) that is faster. I could add that as exp_unsafe -- let me see about tracking it down and benchmarking that.

On Sun, Mar 17, 2024 at 10:58 PM photor @.> wrote: no. I haven't had much time to play with this. I still think that ~5-10x faster than the current algorithm should be achievable. but it is one where you need to be careful to get all the errors to line up properly. it would help to know how many ulps is considered acceptable. I would be inclined to go for relatively loose accuracy requirements since if the last couple bits matter, you are likely better off with multifloats float64x3. Thanks, I was not aware of the package multifloats before. But its webpages say it doesn't support functions like exp and log? Another question is, can multifloats be fit in GPU? Doublefloats can have impressive performance with matrix ops on GPU, though I failed to make it work for exp. — Reply to this email directly, view it on GitHub <#135 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAM2VRQEZ63ZL3F5NRIPFS3YYZJ7DAVCNFSM5ML6RYP2U5DIOJSWCZC7NNSXTN2JONZXKZKDN5WW2ZLOOQ5TEMBQGI3TQNRSGE2A . You are receiving this because you commented.Message ID: @.>

Any chance to make it work on GPU?

julia> using DoubleFloats, Adapt, CUDA

julia> a=randn(Double64,1000,1000);

julia> gpua=adapt(CuArray, a);

julia> gpub=similar(gpua);

julia> function gpu_exp(y,x)
       for i=1:length(y)
       @inbounds y[i]=exp(x[i])
       end
       return nothing
       end
gpu_exp (generic function with 1 method)

julia> @cuda gpu_exp(gpub,gpua);
ERROR: InvalidIRError: compiling MethodInstance for gpu_exp(::CuDeviceMatrix{Double64, 1}, ::CuDeviceMatrix{Double64, 1}) resulted in invalid LLVM IR
Reason: unsupported call through a literal pointer (call to .text)
Stacktrace:
 [1] modf
   @ .\math.jl:902
 [2] modf
   @ C:\Users\DELL\.julia\packages\DoubleFloats\upxJH\src\math\prearith\prearith.jl:170
 [3] calc_exp
   @ C:\Users\DELL\.julia\packages\DoubleFloats\upxJH\src\math\elementary\explog.jl:142
 [4] exp
   @ C:\Users\DELL\.julia\packages\DoubleFloats\upxJH\src\math\elementary\explog.jl:21
 [5] gpu_exp
   @ .\REPL[7]:3
Reason: unsupported call through a literal pointer (call to .text)
Stacktrace:
 [1] modf
   @ .\math.jl:902
 [2] modf
   @ C:\Users\DELL\.julia\packages\DoubleFloats\upxJH\src\math\prearith\prearith.jl:171
 [3] calc_exp
   @ C:\Users\DELL\.julia\packages\DoubleFloats\upxJH\src\math\elementary\explog.jl:142
 [4] exp
   @ C:\Users\DELL\.julia\packages\DoubleFloats\upxJH\src\math\elementary\explog.jl:21
 [5] gpu_exp
   @ .\REPL[7]:3
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl
Stacktrace:
 [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, args::LLVM.Module)
   @ GPUCompiler C:\Users\DELL\.julia\packages\GPUCompiler\YO8Uj\src\validation.jl:149
 [2] macro expansion
   @ C:\Users\DELL\.julia\packages\GPUCompiler\YO8Uj\src\driver.jl:415 [inlined]
 [3] macro expansion
   @ C:\Users\DELL\.julia\packages\TimerOutputs\RsWnF\src\TimerOutput.jl:253 [inlined]
 [4] macro expansion
   @ C:\Users\DELL\.julia\packages\GPUCompiler\YO8Uj\src\driver.jl:414 [inlined]
 [5] emit_llvm(job::GPUCompiler.CompilerJob; libraries::Bool, toplevel::Bool, optimize::Bool, cleanup::Bool, only_entry::Bool, validate::Bool)
   @ GPUCompiler C:\Users\DELL\.julia\packages\GPUCompiler\YO8Uj\src\utils.jl:89

julia>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants