fix perf in exp(::Matrix) (for smallish matrices) #29116
Conversation
stdlib/LinearAlgebra/src/dense.jl
Outdated
U = A * (A6 * (CC[14]*A6 .+ CC[12]*A4 .+ CC[10]*A2) .+
         CC[8]*A6 + CC[6]*A4 + CC[4]*A2 + CC[2]*Inn)
V = A6 * (CC[13]*A6 + CC[11]*A4 + CC[9]*A2) +
    CC[7]*A6 + CC[5]*A4 + CC[3]*A2 + CC[1]*Inn
CC[14]*A6 etcetera should be CC[14].*A6 in order to fuse. Probably cleaner to just do @.(CC[14]*A6 + CC[12]*A4 + …)
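The fusion point can be illustrated with a small hedged sketch (hypothetical matrices A2, A4 and coefficients c2, c4, not the actual Padé data): with undotted *, each scalar-times-matrix product allocates its own temporary before the sum, whereas dotting everything (or using @.) compiles the whole expression into a single broadcast loop with one result allocation.

```julia
using LinearAlgebra

# Hypothetical stand-ins for the coefficient combination in expm.
A2 = rand(4, 4); A4 = rand(4, 4)
c2, c4 = 1.0, 2.0

unfused = c2 * A2 + c4 * A4    # allocates a temporary per product, then the sum
fused   = @. c2 * A2 + c4 * A4 # one fused broadcast kernel, one allocation

unfused ≈ fused                # same result either way
```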
(I'm not sure why CC is an array rather than a bunch of scalars. Not that any of these micro-optimizations matter for large arrays.)
That makes the code three times slower, so I'd rather not do that.
Like, I put a MWE at the top so we can talk about actual performance instead of theoretical performance :P
That looks like #28126, particularly since when I tried it I saw a 2x gain. :-\
julia> f_dots2(A, A2, A4, A6, CC, Inn) = A * (A6 * (CC[14].*A6 .+ CC[12].*A4 .+ CC[10].*A2) .+ CC[8]*A6 .+ CC[6].*A4 .+ CC[4].*A2 .+ CC[2].*Inn)
f_dots2 (generic function with 1 method)
julia> @btime f_dots2($A, $A2, $A4, $A6, $CC, $Inn)
254.430 ns (5 allocations: 560 bytes)
2×2 Array{Float64,2}:
0.516647 0.311652
0.186972 0.116884
Ok, I also get that, but not when using the macro... Did I mess up?
Anyway, just pushed it with explicit dots.
Ah, the difference appears to be in the n-ary operations. The parser doesn't use special n-ary parsing for .+, but the @. macro does.
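The parsing difference can be checked directly (a sketch using Meta.parse on placeholder names a, b, c): plain + parses as a single n-ary call with all three operands, while chained .+ parses as nested two-argument calls, which is what @. avoids by dotting the n-ary + after parsing.

```julia
# `a + b + c` is one n-ary call: Expr(:call, :+, :a, :b, :c)
ex1 = Meta.parse("a + b + c")

# `a .+ b .+ c` is a binary call whose first operand is another `.+` call
ex2 = Meta.parse("a .+ b .+ c")

length(ex1.args)  # 4: the operator plus three operands (n-ary)
length(ex2.args)  # 3: the operator plus two operands (nested binary)
```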
stdlib/LinearAlgebra/src/dense.jl
Outdated
U = A * (A6 * (CC[14]*A6 .+ CC[12]*A4 .+ CC[10]*A2) .+
    CC[8]*A6 .+ CC[6]*A4 .+ CC[4]*A2 .+ CC[2]*Inn)
V = A6 * (CC[13]*A6 .+ CC[11]*A4 .+ CC[9]*A2) .+
    CC[7]*A6 .+ CC[5]*A4 .+ CC[3]*A2 .+ CC[1]*Inn
If we really wanted to, we could split out the + CC[2]*Inn operation into a separate loop that just updates the diagonal in place. In general, it looks like there are lots of opportunities for avoiding temporary array allocations in this routine if we care, both by re-using arrays and by using things like rmul!. It will only matter for tiny arrays like the one in #29113, however, since for larger arrays the Θ(n³) operations will dominate.
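A minimal sketch of that suggestion (the helper name add_scaled_identity! is hypothetical, not from the PR): rather than materializing c .* Inn and adding it across all n² entries, touch only the n diagonal entries of the accumulator in place. It gives the same result as adding c*I via LinearAlgebra's UniformScaling.

```julia
using LinearAlgebra

# Hypothetical helper: add c to each diagonal entry of U, in place,
# instead of broadcasting an addition with a full identity matrix.
function add_scaled_identity!(U::Matrix, c)
    for i in axes(U, 1)
        U[i, i] += c
    end
    return U
end

U = rand(3, 3); c = 0.5
V = U + c * I            # the allocating UniformScaling version
add_scaled_identity!(U, c)
U ≈ V                    # same result, no n×n temporary
```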
Force-pushed b48ebb2 to 603f4b6
Do we have this as a nanosoldier benchmark and didn't notice? If not, add it?
Benchmark label is on there, so anyone feel free to add it.
(cherry picked from commit 6acaa10)
Fixes JuliaLang/LinearAlgebra.jl#565
Before:
After:
Edit: The text below is no longer relevant:
Should the broadcasting be able to do better here? cc @mbauman Here is a MWE
Update since it is being discussed: using full fusion: