Support for complex numbers? #19
Comments
No, not yet. For now, I'd follow the dealing-with-structs approach from the README. Note that |
+1 for complex number support. It's also an opportunity to be faster than C compiled with gcc/clang, which do a bad job of vectorizing complex arithmetic. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79102, https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79336, https://bugs.llvm.org/show_bug.cgi?id=31800, https://bugs.llvm.org/show_bug.cgi?id=31677 |
I managed to get it working:
function test_avx!(xre::Vector{T}, xim::Vector{T}, βre::T, βim::T, Are::Matrix{T}, Aim::Matrix{T}, k::Int) where {T<:AbstractFloat}
    @avx for i ∈ 1:length(xre)
        xre[i] += βre*Are[i,k] + βim*Aim[i,k]
        xim[i] += βre*Aim[i,k] - βim*Are[i,k]
    end
end
function test_avx!(x::StructVector{Complex{T}}, β::Complex{T}, A::StructArray{Complex{T},2}, k::Int) where {T<:AbstractFloat}
    test_avx!(x.re, x.im, β.re, β.im, A.re, A.im, k)
end
and the performance is actually slightly faster than my earlier implementation which uses
Investigating the assembly I found that
Is this intentional? |
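For readers following the split real/imaginary kernel above, the arithmetic behind its two update lines can be written out explicitly. This is only a worked expansion of the code as posted, writing $\beta = \beta_{re} + i\,\beta_{im}$ and $A = A_{re} + i\,A_{im}$ for one matrix entry; the symbols mirror the variable names in `test_avx!` and nothing else is assumed.

```latex
\begin{aligned}
\overline{\beta}\,A
  &= (\beta_{re} - i\,\beta_{im})(A_{re} + i\,A_{im}) \\
  &= (\beta_{re} A_{re} + \beta_{im} A_{im})
     + i\,(\beta_{re} A_{im} - \beta_{im} A_{re})
\end{aligned}
```

So the kernel computes `x[i] += conj(β) * A[i,k]`. The `beta*conj(A[k,n])` form used elsewhere in the thread is the complex conjugate of this product: the real part is the same and the imaginary part has the opposite sign.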
Speaking of which, what I find especially interesting is that the native Julia (version 1.3) implementation
function test!(A, x, k, beta)
    @simd for n=1:size(A,2)
        @inbounds x[n] += beta*conj(A[k,n])
    end
end
produces quite different assembly depending on the input types. For
whereas it completely screws up with
|
The complex support that will be added will be worse than using StructArrays, so you won't do better than the
Did you try
Also, to be pedantic, the
vunpcklpd %xmm4, %xmm3, %xmm5 # xmm5 = xmm3[0],xmm4[0]
vmulpd %xmm0, %xmm5, %xmm5
vunpcklpd %xmm3, %xmm4, %xmm3 # xmm3 = xmm4[0],xmm3[0]
vmulpd %xmm3, %xmm1, %xmm3
vaddsubpd %xmm3, %xmm5, %xmm3
vaddpd (%rax,%rdx), %xmm3, %xmm3
vmovupd %xmm3, (%rax,%rdx)
The suffix on all of these instructions is
However, the
It is intentional. I normally introspect via print debugging with Revise. For now, you can at least do this to see what it decides (although not why):
using LoopVectorization
testq = :(for i ∈ 1:length(xre)
    xre[i] += βre*Are[i,k] + βim*Aim[i,k]
    xim[i] += βre*Aim[i,k] - βim*Are[i,k]
end)
lstest = LoopVectorization.LoopSet(testq);
LoopVectorization.choose_order(lstest)
# ([:i], :i, 1, -1)
The four returned values mean:
As for why it won't unroll, I have it tuned to be fairly conservative when there aren't any loop-carried dependencies.*
*This is a heuristic that could use tuning. Currently, if there are no loop-carried dependencies, it will check if there is only a single loop (as in your case) and return 1 if so.
min(4, round(Int, (compute_rt + load_rt + 1) / compute_rt))
Where
There are certainly much smarter and more principled ways this can be done. |
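To get a feel for the quoted expression, here is a minimal sketch that evaluates it for a couple of made-up inputs. The helper name `unroll_guess` is hypothetical, and since the definitions of `compute_rt` and `load_rt` are not shown in the thread, they are treated here simply as positive numbers; this is not LoopVectorization's actual code path.

```julia
# Hypothetical helper: mirrors only the expression quoted in the comment above.
# `compute_rt` and `load_rt` are assumed to be positive numbers.
unroll_guess(compute_rt, load_rt) =
    min(4, round(Int, (compute_rt + load_rt + 1) / compute_rt))

# Compute-heavy loop body: (4 + 2 + 1) / 4 = 1.75, which rounds to 2.
u_compute = unroll_guess(4.0, 2.0)

# Load-heavy loop body: (1 + 4 + 1) / 1 = 6, which is capped at 4.
u_load = unroll_guess(1.0, 4.0)
```

The ratio grows as loads become expensive relative to compute, and the `min` keeps the result at 4 or below.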
Thanks @lesshaste , I'll check out those bug reports. |
Good point. I tried out using only the structs:
function test_struct!(xre::Vector{T}, xim::Vector{T}, βre::T, βim::T, Are::Matrix{T}, Aim::Matrix{T}, k::Int) where {T<:AbstractFloat}
    for i ∈ 1:length(xre)
        xre[i] += βre*Are[i,k] + βim*Aim[i,k]
        xim[i] += βre*Aim[i,k] - βim*Are[i,k]
    end
end
function test_struct!(x::StructVector{Complex{T}}, β::Complex{T}, A::StructArray{Complex{T},2}, k::Int) where {T<:AbstractFloat}
    test_avx!(x.re, x.im, β.re, β.im, A.re, A.im, k)
end
and in fact LLVM does produce exactly the same low-level code so
True, it did vectorize, but quite suboptimally.
Thank you for this insight. I was just asking since I found that my earlier implementation which uses
vs
Would it be possible to provide a method to somehow override the heuristic for manual fine tuning? |
Disappointing, I really expected this to be good enough:
function test!(x, a, beta)
    @inbounds @simd ivdep for n = 1:size(a,2)
        x[n] += beta * conj(a[n])
    end
end
N = 1024;
T = ComplexF32
x = rand(T, N);
a = rand(T,N);
beta = rand(T);
@code_native debuginfo=:none test!(x, a, beta) # not vectorized
using StructArrays
xsoa = StructArray(x);
asoa = StructArray(a);
@code_native debuginfo=:none test!(xsoa, asoa, beta) # also not vectorized =(
You shouldn't have to do things manually.
Interesting. Sounds like I'll have to change the heuristic. Tiling shouldn't be overridden for code other people run, because the optimal values are platform dependent. |
Would have been nice.
It was actually a post on Julia Discourse and the OpenBLAS cdot kernel which had me try unrolling the loop. This may be a stupid question, but can the out-of-order execution work across such a loop?
Will it not be blocked, since in each iteration values are loaded into the
Would be perfect for testing. Final usage of the package is a different story. There you do not want to fiddle around with these settings. |
I added the ability to manually specify the unroll factor in v0.3.8, via specifying
Seems as though it is not enough. There is also still a dependency chain in the loop counter. The difference isn't nearly as extreme as when we have real loop-carried dependencies:
julia> using BenchmarkTools, LoopVectorization
julia> function selfdotu1(x)
s = zero(eltype(x))
@avx unroll=1 for i ∈ eachindex(x)
s += x[i]*x[i]
end
s
end
selfdotu1 (generic function with 1 method)
julia> function selfdotu2(x)
s = zero(eltype(x))
@avx unroll=2 for i ∈ eachindex(x)
s += x[i]*x[i]
end
s
end
selfdotu2 (generic function with 1 method)
julia> function selfdotu4(x)
s = zero(eltype(x))
@avx unroll=4 for i ∈ eachindex(x)
s += x[i]*x[i]
end
s
end
selfdotu4 (generic function with 1 method)
julia> function selfdotu8(x)
s = zero(eltype(x))
@avx unroll=8 for i ∈ eachindex(x)
s += x[i]*x[i]
end
s
end
selfdotu8 (generic function with 1 method)
julia> x = rand(1024);
julia> @btime selfdotu1($x)
138.725 ns (0 allocations: 0 bytes)
341.62521511329
julia> @btime selfdotu2($x)
81.118 ns (0 allocations: 0 bytes)
341.6252151132899
julia> @btime selfdotu4($x)
57.550 ns (0 allocations: 0 bytes)
341.6252151132899
julia> @btime selfdotu8($x)
45.087 ns (0 allocations: 0 bytes)
341.6252151132899
but even 10% is nothing to scoff at, so I changed the heuristic to something that makes your example unroll by 4. |
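The benchmark above shows why unrolling helps a reduction: with a single accumulator, every addition has to wait for the previous one. A hand-written sketch of the same idea in plain Julia, using four independent accumulators, might look like the following. The function name `selfdot_manual4` is made up for illustration, and this is not the code `@avx unroll=4` generates; it only shows the dependency-chain splitting.

```julia
# Illustrative only: manual 4-way unrolling of the self dot product.
# Each accumulator forms its own dependency chain, so the additions
# of different chains can overlap in the CPU pipeline.
function selfdot_manual4(x::AbstractVector{T}) where {T<:AbstractFloat}
    s1 = s2 = s3 = s4 = zero(T)
    i = firstindex(x)
    # Last index covered by complete groups of four elements.
    last4 = i + 4 * (length(x) ÷ 4) - 1
    @inbounds while i <= last4
        s1 += x[i]   * x[i]
        s2 += x[i+1] * x[i+1]
        s3 += x[i+2] * x[i+2]
        s4 += x[i+3] * x[i+3]
        i += 4
    end
    s = (s1 + s2) + (s3 + s4)
    # Remainder loop for lengths that are not a multiple of four.
    @inbounds while i <= lastindex(x)
        s += x[i] * x[i]
        i += 1
    end
    return s
end
```

Because the summation order changes, the result can differ from a straight left-to-right sum in the last few bits, which is also visible in the benchmark outputs above.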
You are fast. Thank you.
Overall if I compare the native julia implementation working on the complex arrays with the one working on structs boosted by |
Great to see this comparison and the unrolling. To clarify: is the factor of 12 achieved with something like
i.e.
function test_avx!(xre::Vector{T}, xim::Vector{T}, βre::T, βim::T, Are::Matrix{T}, Aim::Matrix{T}, k::Int) where {T<:AbstractFloat}
    @avx unroll=8 for i ∈ 1:length(xre)
        xre[i] += βre*Are[i,k] + βim*Aim[i,k]
        xim[i] += βre*Aim[i,k] - βim*Are[i,k]
    end
end
function test_avx!(x::StructVector{Complex{T}}, β::Complex{T}, A::StructArray{Complex{T},2}, k::Int) where {T<:AbstractFloat}
    test_avx!(x.re, x.im, β.re, β.im, A.re, A.im, k)
end
Does it matter if arrays are three-dimensional? |
It should work for multidimensional arrays. |
Hi all, note that there is another possibility of using complex numbers with LoopVectorization:
which could be made more convenient by using
on it. For example, the real part is then y[1, ...]. Felix |
You are right. This is also how I solved my problem in the end. Using
function test!(x::Vector, y::Vector, beta)
    for n=1:length(x)
        x[n] += conj(beta)*y[n]
    end
end
Initially, however, my question was aimed at a solution that does not require a hand-crafted kernel. |
Note that as of Julia 1.6, you can use You can see a few examples here:
I am very slowly rewriting LoopVectorization. |
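The code in the last few comments did not survive on this page, so the following is only a guess at the kind of approach being described. It assumes the elided call is Base's `reinterpret(reshape, T, A)` (available from Julia 1.6), which views a complex vector as a 2×N real array whose first row holds the real parts. The function name `axpy_conj!` is hypothetical, and plain `@simd` is used instead of LoopVectorization so the sketch has no package dependencies.

```julia
# Sketch under the assumption stated above; not taken from the thread.
# Computes x[n] += conj(β) * y[n] on real views of the complex data.
function axpy_conj!(x::Vector{Complex{T}}, y::Vector{Complex{T}},
                    β::Complex{T}) where {T<:AbstractFloat}
    length(x) == length(y) || throw(DimensionMismatch("x and y must have equal length"))
    xr = reinterpret(reshape, T, x)  # 2×N view: row 1 = real, row 2 = imaginary
    yr = reinterpret(reshape, T, y)
    βre, βim = real(β), imag(β)
    @inbounds @simd for n in axes(xr, 2)
        yre, yim = yr[1, n], yr[2, n]
        xr[1, n] += βre * yre + βim * yim
        xr[2, n] += βre * yim - βim * yre
    end
    return x
end

# Usage: compare against ordinary complex arithmetic.
x = ComplexF64[1 + 2im, 3 - 1im, -2 + 0.5im]
y = ComplexF64[0.5 - 1im, 2 + 2im, 1 - 3im]
β = 2.0 + 1.0im
expected = x .+ conj(β) .* y
axpy_conj!(x, y, β)
```

The views share memory with `x` and `y`, so the updates land directly in the original complex vector without copying.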
Does LoopVectorization support complex vectors? E.g.
gives me the following error