Use with ForwardDiff #93
The README also has an example with complex numbers. Mind posting all the code for a reproducible example? The allocations suggest there may be a type instability somewhere. EDIT: Currently, the best thing to do would be to have separate But it'd probably actually be easiest to take the By I've also been trying to get back to working on defining a reverse pass for LoopVectorization's loops, but I've been distracted for the past month.
OK, I think I see what you mean about keeping things as struct-of-SVecs, and a few definitions like this appear to be enough:

function Base.:+(x::Dual{Z,T,D}, sv::SVec{N,S}) where {Z,T,D,N,S}
y = x.value + sv
ps = ntuple(d -> partials(x,d) + zero(sv), Val(D))
Dual{Z,typeof(y),D}(y, Partials{D,typeof(y)}(ps))
end
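As a concrete illustration, here is a sketch of how such a definition gets exercised (assuming SIMDPirates is loaded and a promotion method like the one above is defined; the values are only for demonstration):

```julia
using ForwardDiff, SIMDPirates
using ForwardDiff: Dual, partials

# A 4-lane SIMD vector and a scalar dual number with two partials:
sv = SVec(ntuple(Val(4)) do i Core.VecElement(Float64(i)) end)
dx = Dual(0.0, (1.0, 0.0))

# With the method above, the result is a Dual whose value and
# partials are each an SVec -- the struct-of-SVecs layout:
y = dx + sv
partials(y, 1)  # an SVec carrying the first partial in every lane
```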
The example in which I was timing things is now here (not so minimal!). I'm starting to think that the problem is things like
Looking at those benchmarks, there are a lot of allocations when using
Is that because of this definition?

function Base.:+(x::Dual{Z,T,D}, sv::SVec{N,S}) where {Z,T,D,N,S}
y = x.value + sv
ps = ntuple(d -> partials(x,d) + zero(sv), Val(D))
Dual{Z,typeof(y),D}(y, Partials{D,typeof(y)}(ps))
end
Re: weirdly, after fiddling with seemingly unrelated things since then, this scope issue has gone away for code made by the macro, but persists in the `create109` version:

julia> @btime Tracker.gradient(sum∘unfill∘f_fwd_avx, $A, $B, $C); # newer macro's code
2.448 ms (89 allocations: 2.47 MiB)
julia> @btime Tracker.gradient(sum∘unfill∘create109, $A, $B, $C); # won't run without using ForwardDiff: partials
2.754 ms (180065 allocations: 18.03 MiB)
Yes, that's definitely a problem with

julia> @macroexpand @avx for i = 📏i
for j = 📏j
ℛℰ𝒮 = (A[i] + 𝜀A) * log((B[i, j] + 𝜀B) / (C[j] + 𝜀C))
𝛥B[i, j] = 𝛥B[i, j] + (ForwardDiff).partials(ℛℰ𝒮, 1) * 𝛥ℛℰ𝒮[i]
𝛥C[j] = 𝛥C[j] + (ForwardDiff).partials(ℛℰ𝒮, 2) * 𝛥ℛℰ𝒮[i]
𝛥A[i] = 𝛥A[i] + (ForwardDiff).partials(ℛℰ𝒮, 3) * 𝛥ℛℰ𝒮[i]
end
end
quote
var"##loopi#298" = LoopVectorization.maybestaticrange(📏i)
var"##i_loop_lower_bound#299" = LoopVectorization.maybestaticfirst(var"##loopi#298")
var"##i_loop_upper_bound#300" = LoopVectorization.maybestaticlast(var"##loopi#298")
var"##loopj#301" = LoopVectorization.maybestaticrange(📏j)
var"##j_loop_lower_bound#302" = LoopVectorization.maybestaticfirst(var"##loopj#301")
var"##j_loop_upper_bound#303" = LoopVectorization.maybestaticlast(var"##loopj#301")
var"##vptr##_A" = LoopVectorization.stridedpointer(A)
var"##vptr##_B" = LoopVectorization.stridedpointer(B)
var"##vptr##_C" = LoopVectorization.stridedpointer(C)
var"##vptr##_𝛥B" = LoopVectorization.stridedpointer(𝛥B)
var"##vptr##_𝛥ℛℰ𝒮" = LoopVectorization.stridedpointer(𝛥ℛℰ𝒮)
var"##vptr##_𝛥C" = LoopVectorization.stridedpointer(𝛥C)
var"##vptr##_𝛥A" = LoopVectorization.stridedpointer(𝛥A)
begin
$(Expr(:gc_preserve, :(LoopVectorization._avx_!(Val{(0, 0)}(), Tuple{:LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x01, 0x01), :LoopVectorization, :LOOPCONSTANTINSTRUCTION, LoopVectorization.OperationStruct(0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x02), :LoopVectorization, :+, LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x0000000000000102, LoopVectorization.compute, 0x00, 0x03), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x02, 0x04), :LoopVectorization, :LOOPCONSTANTINSTRUCTION, LoopVectorization.OperationStruct(0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x05), :LoopVectorization, :+, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x0000000000000405, LoopVectorization.compute, 0x00, 0x06), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x03, 0x07), :LoopVectorization, :LOOPCONSTANTINSTRUCTION, LoopVectorization.OperationStruct(0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x08), :LoopVectorization, :+, LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000000, 0x0000000000000708, LoopVectorization.compute, 0x00, 0x09), :LoopVectorization, :/, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x0000000000000609, LoopVectorization.compute, 0x00, 0x0a), :LoopVectorization, :log, 
LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x000000000000000a, LoopVectorization.compute, 0x00, 0x0b), :LoopVectorization, :*, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x000000000000030b, LoopVectorization.compute, 0x00, 0x0c), :LoopVectorization, :LOOPCONSTANTINSTRUCTION, LoopVectorization.OperationStruct(0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x0d), :Main, :partials, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x0000000000000c0d, LoopVectorization.compute, 0x00, 0x0e), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x04, 0x0f), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x05, 0x10), :LoopVectorization, :vfmadd_fast, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x00000000000e0f10, LoopVectorization.compute, 0x00, 0x10), :LoopVectorization, :setindex!, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x0000000000000011, LoopVectorization.memstore, 0x05, 0x11), :LoopVectorization, :LOOPCONSTANTINSTRUCTION, LoopVectorization.OperationStruct(0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x12), :Main, :partials, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x0000000000000c13, LoopVectorization.compute, 0x00, 0x13), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 
0x07, 0x14), :numericconstant, Symbol("##reductzero#321"), LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000001, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x15), :LoopVectorization, :vfmadd_fast, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000001, 0x0000000000000000, 0x0000000000140f16, LoopVectorization.compute, 0x00, 0x15), :LoopVectorization, :reduced_add, LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000001, 0x0000000000000000, 0x0000000000001715, LoopVectorization.compute, 0x00, 0x14), :LoopVectorization, :setindex!, LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000000, 0x0000000000000018, LoopVectorization.memstore, 0x07, 0x16), :LoopVectorization, :LOOPCONSTANTINSTRUCTION, LoopVectorization.OperationStruct(0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x17), :Main, :partials, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x0000000000000c1a, LoopVectorization.compute, 0x00, 0x18), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x09, 0x19), :numericconstant, Symbol("##reductzero#327"), LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000002, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x1a), :LoopVectorization, :vfmadd_fast, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000002, 0x0000000000000000, 0x00000000001b0f1d, LoopVectorization.compute, 0x00, 0x1a), :LoopVectorization, :reduced_add, LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000002, 0x0000000000000000, 0x0000000000001e1c, LoopVectorization.compute, 0x00, 0x19), :LoopVectorization, :setindex!, LoopVectorization.OperationStruct(0x0000000000000001, 
0x0000000000000000, 0x0000000000000000, 0x000000000000001f, LoopVectorization.memstore, 0x09, 0x1b)}, Tuple{LoopVectorization.ArrayRefStruct{:A,Symbol("##vptr##_A")}(0x0000000000000001, 0x0000000000000001, 0x0000000000000000), LoopVectorization.ArrayRefStruct{:B,Symbol("##vptr##_B")}(0x0000000000000101, 0x0000000000000102, 0x0000000000000000), LoopVectorization.ArrayRefStruct{:C,Symbol("##vptr##_C")}(0x0000000000000001, 0x0000000000000002, 0x0000000000000000), LoopVectorization.ArrayRefStruct{:𝛥ℛℰ𝒮,Symbol("##vptr##_𝛥ℛℰ𝒮")}(0x0000000000000001, 0x0000000000000001, 0x0000000000000000), LoopVectorization.ArrayRefStruct{:𝛥B,Symbol("##vptr##_𝛥B")}(0x0000000000000101, 0x0000000000000102, 0x0000000000000000), LoopVectorization.ArrayRefStruct{:𝛥B,Symbol("##vptr##_𝛥B")}(0x0000000000000101, 0x0000000000000102, 0x0000000000000000), LoopVectorization.ArrayRefStruct{:𝛥C,Symbol("##vptr##_𝛥C")}(0x0000000000000001, 0x0000000000000002, 0x0000000000000000), LoopVectorization.ArrayRefStruct{:𝛥C,Symbol("##vptr##_𝛥C")}(0x0000000000000001, 0x0000000000000002, 0x0000000000000000), LoopVectorization.ArrayRefStruct{:𝛥A,Symbol("##vptr##_𝛥A")}(0x0000000000000001, 0x0000000000000001, 0x0000000000000000), LoopVectorization.ArrayRefStruct{:𝛥A,Symbol("##vptr##_𝛥A")}(0x0000000000000001, 0x0000000000000001, 0x0000000000000000)}, Tuple{0, Tuple{}, Tuple{2, 5, 8}, Tuple{(19, 2), (26, 3)}, Tuple{}, Tuple{(22, LoopVectorization.IntOrFloat), (29, LoopVectorization.IntOrFloat)}, Tuple{(13, LoopVectorization.HardInt)}}, Tuple{:i, :j}, (var"##i_loop_lower_bound#299":var"##i_loop_upper_bound#300", var"##j_loop_lower_bound#302":var"##j_loop_upper_bound#303"), var"##vptr##_A", var"##vptr##_B", var"##vptr##_C", var"##vptr##_𝛥ℛℰ𝒮", var"##vptr##_𝛥B", var"##vptr##_𝛥B", var"##vptr##_𝛥C", var"##vptr##_𝛥C", var"##vptr##_𝛥A", var"##vptr##_𝛥A", 𝜀A, 𝜀B, 𝜀C, partials, partials, partials)), :A, :B, :C, :𝛥B, :𝛥ℛℰ𝒮, :𝛥C, :𝛥A))
end
end

Notice that it's passing `partials` as an argument. Also, I can't run your examples:

julia> f_fwd(A,B,C) = @tullio grad=Dual avx=false s[i] := A[i] * log(B[i,j] / C[j]);
ERROR: LoadError: can't understand LHS, expected A[i,j,k], got grad
Stacktrace:
[1] error(::String) at ./error.jl:33
[2] _tullio(::Expr, ::Expr, ::Expr; multi::Bool, mod::Module) at /home/chriselrod/.julia/packages/Tullio/aaSZU/src/Tullio.jl:83
[3] @tullio(::LineNumberNode, ::Module, ::Vararg{Any,N} where N) at /home/chriselrod/.julia/packages/Tullio/aaSZU/src/Tullio.jl:53
[4] eval(::Module, ::Any) at ./boot.jl:331
[5] eval_user_input(::Any, ::REPL.REPLBackend) at /home/chriselrod/Documents/languages/julia/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:132
[6] run_backend(::REPL.REPLBackend) at /home/chriselrod/.julia/packages/Revise/2K7IK/src/Revise.jl:1070
[7] top-level scope at none:1
in expression starting at REPL[37]:1
OK, great, I had not looked hard at the expanded version. The alternative access like this gives immediate errors:
I think the error you get might be from the master branch, which is last year's attempt; the right branch is coronavirusrewrite.
Ah, checking out
Not yet. On that note, what are Could you also come up with a more minimal example of what you're trying to do?
There are just three dual numbers as input (the same ones at every step), never an array of them. And ordinary numbers as output. I agree though that my example is not so minimal, sorry! I will have a go at boiling it down.
OK, here's an example small enough to just paste in here:

using LoopVectorization, ForwardDiff, BenchmarkTools
function dualinv!(B,A)
@assert size(A) == size(B)
T = eltype(B)
dx = ForwardDiff.Dual(zero(T), (one(T), zero(T)))
for i in eachindex(A)
res = log(A[i] + dx)
B[i] = ForwardDiff.partials(res,1)
end
B
end
dualinv!(rand(5), 1:5) == inv.(1:5)
function dualinv_avx!(B,A)
@assert size(A) == size(B)
T = eltype(B)
dx = ForwardDiff.Dual(zero(T), (one(T), zero(T)))
@avx for i in eachindex(A)
res = log(A[i] + dx)
B[i] = ForwardDiff.partials(res,1)
end
B
end
dualinv_avx!(rand(5), collect(1:5.0)) # UndefVarError: partials not defined
using ForwardDiff: partials
dualinv_avx!(rand(5), collect(1:5.0)) # MethodError: no method matching +(::VectorizationBase.SVec{4,Float64}, ::ForwardDiff.Dual{Nothing,Float64,2})
using Tullio # branch coronavirusrewrite, just for Dual Svec definitions
dualinv_avx!(rand(5), collect(1:5.0)) == inv.(1:5)
A = rand(10^6); B = similar(A);
@btime dualinv!($B, $A); # 7.335 ms (0 allocations: 0 bytes)
@btime dualinv_avx!($B, $A); # 12.363 ms (750003 allocations: 49.59 MiB)
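One way to hunt for the type instability behind allocations like these is `@code_warntype` on the loop body pulled out as a function. A sketch (`kernel` is just a hypothetical name; this checks the scalar path, and the SVec path can be inspected the same way once the Dual/SVec methods are in place):

```julia
using ForwardDiff
using InteractiveUtils: @code_warntype

# Extract the loop body into a function and look for non-concrete
# (red, ::Any) inferred types, which usually explain per-iteration
# allocations:
kernel(a, dx) = ForwardDiff.partials(log(a + dx), 1)

dx = ForwardDiff.Dual(0.0, (1.0, 0.0))
@code_warntype kernel(1.0, dx)
```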
If you find basic functions that are slow or missing, it would be good to fix that upstream. This is for two reasons:

In the case of `inv`:

julia> using Tullio, SIMDPirates, BenchmarkTools
julia> xi = SVec(ntuple(Val(8)) do i Core.VecElement(i) end)
SVec{8,Int64}<1, 2, 3, 4, 5, 6, 7, 8>
julia> @code_native debuginfo=:none SIMDPirates.vinv(xi)
.text
movq %rdi, %rax
vcvtqq2pd (%rsi), %zmm0
movabsq $.rodata.cst8, %rcx
vbroadcastsd (%rcx), %zmm1
vdivpd %zmm0, %zmm1, %zmm0
vmovapd %zmm0, (%rdi)
vzeroupper
retq
nopl (%rax)
julia> @code_native debuginfo=:none inv(xi)
.text
pushq %rbp
movq %rsp, %rbp
pushq %rbx
andq $-64, %rsp
subq $192, %rsp
vxorps %xmm0, %xmm0, %xmm0
vmovaps %xmm0, 32(%rsp)
movq $0, 48(%rsp)
movq %fs:0, %rbx
movq $4, 32(%rsp)
movq -15712(%rbx), %rax
movq %rax, 40(%rsp)
leaq 32(%rsp), %rax
movq %rax, -15712(%rbx)
vmovups (%rdi), %zmm0
vmovaps %zmm0, 64(%rsp)
movabsq $ntuple, %rax
leaq 64(%rsp), %rdi
movl $8, %esi
vzeroupper
callq *%rax
movq %rax, 48(%rsp)
movq %rax, 56(%rsp)
movabsq $jl_apply_generic, %rax
movabsq $139793777923704, %rdi # imm = 0x7F2446799678
leaq 56(%rsp), %rsi
movl $1, %edx
callq *%rax
movq 40(%rsp), %rcx
movq %rcx, -15712(%rbx)
leaq -8(%rbp), %rsp
popq %rbx
popq %rbp
retq
nopw %cs:(%rax,%rax)
nopl (%rax,%rax)
julia> @btime SIMDPirates.vinv($(Ref(xi))[])
3.740 ns (0 allocations: 0 bytes)
SVec{8,Float64}<1.0, 0.5, 0.3333333333333333, 0.25, 0.2, 0.16666666666666666, 0.14285714285714285, 0.125>
julia> @btime inv($(Ref(xi))[])
27.096 ns (2 allocations: 160 bytes)
SVec{8,Float64}<1.0, 0.5, 0.3333333333333333, 0.25, 0.2, 0.16666666666666666, 0.14285714285714285, 0.125>
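The upstream fix suggested here presumably amounts to routing the generic Base name to the SIMD kernel for `SVec` arguments, so that generic code hits the fast path. A sketch (note this is type piracy if done outside the packages owning these types, so such a method belongs in SIMDPirates/VectorizationBase themselves):

```julia
using SIMDPirates
using SIMDPirates: SVec

# Forward the generic name to the hand-written SIMD implementation,
# avoiding the slow generic fallback seen in the benchmark above:
Base.inv(sv::SVec) = SIMDPirates.vinv(sv)
```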
OK, it would be great to fix things upstream. But do you just mean things in Base, or would SLEEFPirates be willing to depend on ForwardDiff to have these dual types? Some of the same issues seem to arise for complex numbers, too. Here's a version of my example above, which similarly tries to use them at an intermediate step:

function cinv!(B,A)
dx = im/100_000
for i in eachindex(A)
res = log(A[i] + dx)
B[i] = 100_000 * imag(res)
end
B
end
cinv!(rand(5), collect(1:5.0)) ≈ inv.(1:5)
function cinv_avx!(B,A)
dx = im/100_000
@avx for i in eachindex(A)
res = log(A[i] + dx)
B[i] = 100_000 * imag(res)
end
B
end
cinv_avx!(rand(5), collect(1:5.0)) ≈ inv.(1:5) # MethodError: no method matching +(::VectorizationBase.SVec{4,Float64}, ::Complex{Float64})

And here's what happens with `SIMDPirates.vinv` and `Tullio.inv` at widths 4 and 8:

julia> xi = SVec(ntuple(Val(4)) do i Core.VecElement(i) end)
SVec{4,Int64}<1, 2, 3, 4>
julia> @btime SIMDPirates.vinv($(Ref(xi))[])
2.200 ns (0 allocations: 0 bytes)
SVec{4,Float64}<1.0, 0.5, 0.3333333333333333, 0.25>
julia> @btime Tullio.inv($(Ref(xi))[])
2.200 ns (0 allocations: 0 bytes)
SVec{4,Float64}<1.0, 0.5, 0.3333333333333333, 0.25>
julia> xi = SVec(ntuple(Val(8)) do i Core.VecElement(i) end)
SVec{8,Int64}<1, 2, 3, 4, 5, 6, 7, 8>
julia> @btime Tullio.inv($(Ref(xi))[])
28.119 ns (2 allocations: 160 bytes)
SVec{8,Float64}<1.0, 0.5, 0.3333333333333333, 0.25, 0.2, 0.16666666666666666, 0.14285714285714285, 0.125>
julia> @btime SIMDPirates.vinv($(Ref(xi))[])
Segmentation fault: 11
On the local versions (I plan to push by the end of the day):

julia> using LoopVectorization, ForwardDiff, BenchmarkTools
julia> using ForwardDiff: Dual, Partials
julia> using LoopVectorization: SVec
julia> ForwardDiff.can_dual(::Type{<:SVec}) = true
julia> @inline function Base.:+(x::Dual{Z,T,D}, sv::SVec{N,S}) where {Z,T<:Number,D,N,S}
y = x.value + sv
ps = ntuple(d -> x.partials.values[d] + zero(sv), Val(D)) # avoiding partials(x.d) didn't help
TS = SVec{N,promote_type(T,S)}
Dual{Z,TS,D}(y, Partials{D,TS}(ps))
end
julia> @inline function Base.:+(sv::SVec{N,S}, x::Dual{Z,T,D}) where {Z,T<:Number,D,N,S}
y = x.value + sv
ps = ntuple(d -> x.partials.values[d] + zero(sv), Val(D))
TS = SVec{N,promote_type(T,S)}
Dual{Z,TS,D}(y, Partials{D,TS}(ps))
end
julia> @inline function Base.:*(x::Dual{Z,SVec{N,T},D}, sv::SVec{N,S}) where {Z,T,D,N,S}
y = x.value * sv
ps = ntuple(d -> x.partials.values[d] * sv, Val(D))
TS = SVec{N,promote_type(T,S)}
Dual{Z,typeof(y),D}(y, Partials{D,typeof(y)}(ps))
end
julia> @inline function Base.:*(sv::SVec{N,S}, x::Dual{Z,SVec{N,T},D}) where {Z,T,D,N,S}
y = sv * x.value
ps = ntuple(d -> sv * x.partials.values[d], Val(D))
TS = SVec{N,promote_type(T,S)}
Dual{Z,TS,D}(y, Partials{D,TS}(ps))
end
julia> @inline function Base.:*(p::Partials{D,SVec{N,T}}, sv::SVec{N,S}) where {T,D,N,S}
TS = SVec{N,promote_type(T,S)}
Partials{D,TS}(ntuple(d -> p.values[d] * sv, Val(D)))
end
julia> @inline function Base.:*(sv::SVec{N,S}, p::Partials{D,SVec{N,T}}) where {T,D,N,S}
TS = SVec{N,promote_type(T,S)}
Partials{D,TS}(ntuple(d -> sv * p.values[d], Val(D)))
end
julia> function dualinv!(B,A)
@assert size(A) == size(B)
T = eltype(B)
dx = ForwardDiff.Dual(zero(T), (one(T), zero(T)))
for i in eachindex(A)
res = log(A[i] + dx)
B[i] = ForwardDiff.partials(res,1)
end
B
end
dualinv! (generic function with 1 method)
julia> dualinv!(rand(5), 1:5) == inv.(1:5)
true
julia> function dualinv_avx!(B,A)
@assert size(A) == size(B)
T = eltype(B)
dx = ForwardDiff.Dual(zero(T), (one(T), zero(T)))
@avx for i in eachindex(A)
res = log(A[i] + dx)
B[i] = ForwardDiff.partials(res,1)
end
B
end
dualinv_avx! (generic function with 1 method)
julia> dualinv_avx!(rand(5), collect(1:5.0))
5-element Array{Float64,1}:
1.0
0.5
0.3333333333333333
0.25
0.2
julia> dualinv_avx!(rand(5), 1:5.0)
5-element Array{Float64,1}:
1.0
0.5
0.3333333333333333
0.25
0.2
julia> A = rand(10^6); B = similar(A);
julia> @btime dualinv!($B, $A); # 7.335 ms (0 allocations: 0 bytes)
5.684 ms (0 allocations: 0 bytes)
julia> @btime dualinv_avx!($B, $A);
662.525 μs (0 allocations: 0 bytes)

Note that two of your methods define a `TS` that is never used.
Interesting. I see the same thing you do: julia> using Tullio, SIMDPirates, BenchmarkTools
julia> xi4 = SVec(ntuple(Val(4)) do i Core.VecElement(i) end)
SVec{4,Int64}<1, 2, 3, 4>
julia> inv(xi4)
SVec{4,Float64}<1.0, 0.5, 0.3333333333333333, 0.25>
julia> @which inv(xi4)
inv(sv::SVec{N,var"#s24"} where var"#s24"<:Integer) where N in Tullio at /home/chriselrod/.julia/packages/Tullio/kCDFo/src/forward.jl:97
julia> @btime inv($(Ref(xi4))[])
1.876 ns (0 allocations: 0 bytes)
SVec{4,Float64}<1.0, 0.5, 0.3333333333333333, 0.25>
julia> @btime SIMDPirates.vinv($(Ref(xi4))[])
1.879 ns (0 allocations: 0 bytes)
SVec{4,Float64}<1.0, 0.5, 0.3333333333333333, 0.25>
julia> versioninfo()
Julia Version 1.5.0-DEV.645
Commit 0e23ff2581* (2020-04-15 17:02 UTC)
Platform Info:
OS: Linux (x86_64-generic-linux)
CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-9.0.1 (ORCJIT, skylake)

Maybe there's a limit causing the compiler to give up somewhere between 4 and 8.
That should be fixed on Julia master. On older versions of Julia, it only occurs when an `SVec` is returned at global scope. So something like

using SIMDPirates
xi = SVec(ntuple(Val(8)) do i Core.VecElement(i) end)
@btime extract_data(SIMDPirates.vinv($(Ref(xi))[]))

might work.
Ah indeed, that crash was on Julia 1.3, but it's fixed on 1.5. And thanks for pointing out that unused `TS`.
It's insidious! It's one of the first things I look for when I run into surprising amounts of allocations.
Nice, I was expecting that. Some other methods like
Hmm. ForwardDiff is a heavy dependency.

julia> @time using LoopVectorization
0.396693 seconds (912.65 k allocations: 63.532 MiB)
julia> @time using ForwardDiff
2.400366 seconds (10.07 M allocations: 489.370 MiB, 8.63% gc time)

Maybe we can create a separate glue package. Otherwise, you could create a library, and I can add to the VectorizationBase, SIMDPirates, and SLEEFPirates READMEs that it is the official/blessed library for ForwardDiff support on `SVec`s. Maybe to LoopVectorization's documentation as well, given that there's an example (like the complex number one) demonstrating how to use it.
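One way to get such glue without making the SIMD packages pay ForwardDiff's load time would be an optional dependency via Requires.jl. A rough sketch (the module and file names are hypothetical, and the UUID should be double-checked against the General registry):

```julia
module SVecForwardDiffGlue  # hypothetical glue package name

using Requires

function __init__()
    # Define the Dual/SVec methods only when the user also loads
    # ForwardDiff; the UUID below is assumed to be ForwardDiff's
    # registered UUID and should be verified.
    @require ForwardDiff="f6369f11-7733-5829-9624-2563aa707210" begin
        include("dual_svec_methods.jl")  # hypothetical file holding the methods
    end
end

end
```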
Looks like JuliaLang/julia#29393 is the issue about dangling type parameters. I'll fiddle a bit more first, but PaddedMatricesForwardDiff doesn't sound like a bad option to store these methods, since it already depends on all the relevant things.
Okay. You're welcome to make PRs there yourself.
I was trying to use this with ForwardDiff, but not an array of structs like #19, just using dual numbers within a loop and then extracting ordinary numbers. Something like this:
I got this to run with @avx on the loop, by filling in whatever methods were missing. However, the result is slower than without, and has many more allocations.
I wonder whether this is expected to work, and whether adding methods like this is the right thing to do? I can tidy up an example if there isn't an obvious fatal flaw here.