Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for complex numbers? #19

Open
hofmannmartin opened this issue Jan 13, 2020 · 16 comments
Open

Support for complex numbers? #19

hofmannmartin opened this issue Jan 13, 2020 · 16 comments
Labels
enhancement New feature or request

Comments

@hofmannmartin
Copy link

Does LoopVectorization support complex vectors? E.g.

using LoopVectorization

function test_avx!(x::Vector, y::Vector, beta)
  @avx for n=1:length(x)
    x[n] += conj(beta)*y[n]
  end
end

# set up
T = ComplexF32
N = 1024

x = zeros(T,N)
y = rand(T,N)
β = rand(T)
# end set up

test_avm!(x,y,β)

gives me the following error

KeyError: key Complex{Float32} not found

Stacktrace:
 [1] getindex at ./dict.jl:477 [inlined]
 [2] llvmtype(::Type) at /home/moeddel/.julia/packages/VectorizationBase/hBLD0/src/vectorizable.jl:21
 [3] #s24#123 at /home/moeddel/.julia/packages/SIMDPirates/aKvPC/src/memory.jl:94 [inlined]
 [4] #s24#123(::Any, ::Any, ::Any, ::Any, ::Any, ::Type, ::Any, ::Type, ::Any) at ./none:0
 [5] (::Core.GeneratedFunctionStub)(::Any, ::Vararg{Any,N} where N) at ./boot.jl:524
 [6] vload at /home/moeddel/.julia/packages/SIMDPirates/aKvPC/src/memory.jl:218 [inlined]
 [7] macro expansion at ./gcutils.jl:91 [inlined]
 [8] test_avm!(::Array{Complex{Float32},1}, ::Array{Complex{Float32},1}, ::Complex{Float32}) at ./In[26]:4
@chriselrod
Copy link
Member

No, not yet.
Sometime after #17 we could add complex number support.

For now, I'd follow the dealing-with-structs approach from the README.

Note that StructArrays will be faster than the eventual support for arrays of structs.

@lesshaste
Copy link

+1 for complex number support. It's also an opportunity to be faster than C compiled with gcc/clang which does a bad job of vectorizing complex arithmetic. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79102 , https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79336, https://bugs.llvm.org/show_bug.cgi?id=31800 , https://bugs.llvm.org/show_bug.cgi?id=31677

@hofmannmartin
Copy link
Author

I managed to get it working

function test_avx!(xre::Vector{T}, xim::Vector{T}, βre::T, βim::T, Are::Matrix{T}, Aim::Matrix{T}, k::Int) where {T<:AbstractFloat}
    @avx for i  1:length(xre)
        xre[i] += βre*Are[i,k] + βim*Aim[i,k]
        xim[i] += βre*Aim[i,k] - βim*Are[i,k]
    end
end

function test_avx!(x::StructVector{Complex{T}}, β::Complex{T}, A::StructArray{Complex{T},2}, k::Int) where {T<:AbstractFloat}
    test_avx!(x.re, x.im, β.re, β.im, A.re , A.im, k)
end

and the performance is actually slightly faster than my earlier implementation which uses VectorizationBase and SIMDPirates directly to handle complex arrays. This is not surprising since the complex multiplication requires shuffling of the Numbers in the registers in this case.

Investigating the assembly I found that @avx does not seem to unroll the loop.

L80:
	vmovups	(%rbx,%rsi,4), %ymm4
	vmovups	(%rax,%rsi,4), %ymm5
	vmulps	%ymm5, %ymm2, %ymm6
	vfmadd231ps	%ymm4, %ymm3, %ymm6
	vaddps	(%r15,%rsi,4), %ymm6, %ymm6
	vmulps	%ymm5, %ymm3, %ymm5
	vfmsub231ps	%ymm4, %ymm2, %ymm5
	vaddps	(%rdx,%rsi,4), %ymm5, %ymm4
	vmovups	%ymm6, (%r15,%rsi,4)
	vmovups	%ymm4, (%rdx,%rsi,4)
	addq	$8, %rsi
	cmpq	%rdi, %rsi
	jl	L80

Is this intentional?

@hofmannmartin
Copy link
Author

+1 for complex number support. It's also an opportunity to be faster than C compiled with gcc/clang which does a bad job of vectorizing complex arithmetic.

Speaking of which I find especially interesting is that the native julia (version 1.3) implementation

function test!(A, x, k, beta)
  @simd for n=1:size(A,2)
    @inbounds x[n] += beta*conj(A[k,n])
  end
end

produces quite different asambly depending on the input types. For ComplesF64 it does not vectorize at all

L96:
	vmovsd	-8(%rcx,%rdx), %xmm3    # xmm3 = mem[0],zero
	vmovsd	(%rcx,%rdx), %xmm4      # xmm4 = mem[0],zero
	vxorpd	%xmm2, %xmm4, %xmm4
	vunpcklpd	%xmm4, %xmm3, %xmm5 # xmm5 = xmm3[0],xmm4[0]
	vmulpd	%xmm0, %xmm5, %xmm5
	vunpcklpd	%xmm3, %xmm4, %xmm3 # xmm3 = xmm4[0],xmm3[0]
	vmulpd	%xmm3, %xmm1, %xmm3
	vaddsubpd	%xmm3, %xmm5, %xmm3
	vaddpd	(%rax,%rdx), %xmm3, %xmm3
	vmovupd	%xmm3, (%rax,%rdx)
	addq	$1, %rsi
	addq	$16, %rdx
	cmpq	%rdi, %rsi
	jb	L96

whereas it completely screws up with ComplexF32

L240:
	vpsllq	$3, %ymm5, %ymm9
	vmovq	%rax, %xmm2
	vpbroadcastq	%xmm2, %ymm2
	vpaddq	%ymm9, %ymm2, %ymm9
	vpsllq	$3, %ymm4, %ymm11
	vpaddq	%ymm11, %ymm2, %ymm2
	vpcmpeqd	%xmm3, %xmm3, %xmm3
	vgatherqps	%xmm3, (,%ymm2), %xmm1
	vpcmpeqd	%xmm3, %xmm3, %xmm3
	vgatherqps	%xmm3, (,%ymm9), %xmm11
	vpcmpeqd	%xmm3, %xmm3, %xmm3
	vmovdqu	96(%rsp), %ymm0
	vpaddq	%ymm4, %ymm0, %ymm14
	vpaddq	%ymm5, %ymm0, %ymm15
	vpaddq	%ymm13, %ymm9, %ymm9
	vpsllq	$3, %ymm15, %ymm15
	vmovq	%r10, %xmm10
	vpbroadcastq	%xmm10, %ymm10
	vpaddq	%ymm15, %ymm10, %ymm15
	vpsllq	$3, %ymm14, %ymm14
	vpaddq	%ymm14, %ymm10, %ymm10
	vpaddq	%ymm13, %ymm2, %ymm2
	vpcmpeqd	%xmm6, %xmm6, %xmm6
	vgatherqps	%xmm6, (,%ymm10), %xmm7
	vpcmpeqd	%xmm6, %xmm6, %xmm6
	vgatherqps	%xmm6, (,%ymm15), %xmm14
	vpaddq	%ymm13, %ymm10, %ymm6
	vgatherqps	%xmm3, (,%ymm2), %xmm0
	vpcmpeqd	%xmm2, %xmm2, %xmm2
	vgatherqps	%xmm2, (,%ymm6), %xmm3
	vpcmpeqd	%xmm2, %xmm2, %xmm2
	vgatherqps	%xmm2, (,%ymm9), %xmm6
	vinsertf128	$1, %xmm1, %ymm11, %ymm2
	vinsertf128	$1, %xmm7, %ymm14, %ymm7
	vpaddq	%ymm13, %ymm15, %ymm10
	vpcmpeqd	%xmm1, %xmm1, %xmm1
	vinsertf128	$1, %xmm0, %ymm6, %ymm0
	vgatherqps	%xmm1, (,%ymm10), %xmm6
	vinsertf128	$1, %xmm3, %ymm6, %ymm1
	vmovups	64(%rsp), %ymm10
	vmulps	%ymm10, %ymm7, %ymm3
	vmovups	32(%rsp), %ymm11
	vmulps	%ymm1, %ymm11, %ymm6
	vaddps	%ymm6, %ymm3, %ymm3
	vmulps	%ymm1, %ymm10, %ymm1
	vmulps	%ymm11, %ymm7, %ymm6
	vaddps	%ymm3, %ymm2, %ymm2
	vsubps	%ymm1, %ymm6, %ymm1
	vaddps	%ymm1, %ymm0, %ymm0
	vmovq	%xmm9, %rdx
	vunpckhps	%xmm0, %xmm2, %xmm1 # xmm1 = xmm2[2],xmm0[2],xmm2[3],xmm0[3]
	vunpcklps	%xmm0, %xmm2, %xmm3 # xmm3 = xmm2[0],xmm0[0],xmm2[1],xmm0[1]
	vinsertf128	$1, %xmm1, %ymm3, %ymm1
	vextractf128	$1, %ymm0, %xmm0
	vextractf128	$1, %ymm2, %xmm2
	vunpckhps	%xmm0, %xmm2, %xmm3 # xmm3 = xmm2[2],xmm0[2],xmm2[3],xmm0[3]
	vunpcklps	%xmm0, %xmm2, %xmm0 # xmm0 = xmm2[0],xmm0[0],xmm2[1],xmm0[1]
	vmovups	%ymm1, -4(%rdx)
	vinsertf128	$1, %xmm3, %ymm0, %ymm0
	vmovdqu	(%rsp), %ymm1
	vpaddq	%ymm1, %ymm4, %ymm4
	vpaddq	%ymm1, %ymm5, %ymm5
	addq	$-8, %rsi
	vmovups	%ymm0, 28(%rdx)
	jne	L240

@chriselrod
Copy link
Member

chriselrod commented Jan 14, 2020

The complex support that will be added will be worse than using StructArrays, so you won't do better than the test_avx! implementation (barring other improvements in the library).
Depending on architecture, it might end up producing code like the completely screwed up ComplexF32 example.

Did you try test!, using those StructArrays as arguments?
You noted that @avx being faster wasn't surprising because the StructArrays save you from any shuffling. LLVM will also take advantage of that to produce much nicer code. It might do just as well as avx.
(If it does better than @avx, file an issue, and I'll try and find out why / how to improve.)

Also, to be pedantic, the ComplesF64 code with @simd did actually vectorize:

	vunpcklpd	%xmm4, %xmm3, %xmm5 # xmm5 = xmm3[0],xmm4[0]
	vmulpd	%xmm0, %xmm5, %xmm5
	vunpcklpd	%xmm3, %xmm4, %xmm3 # xmm3 = xmm4[0],xmm3[0]
	vmulpd	%xmm3, %xmm1, %xmm3
	vaddsubpd	%xmm3, %xmm5, %xmm3
	vaddpd	(%rax,%rdx), %xmm3, %xmm3
	vmovupd	%xmm3, (%rax,%rdx)

The suffix on all of these instructions is pd. This means "packed double", ie, packed double precision vectors. If it were not vectorized at all, the suffix would be sd for "single double" (eg, it would use vmulsd instead of vmulpd).

However, the xmm registers are only 128 bits, meaning they are operations on packed doubles of 2 Float64. This is roughly twice as good as sd operations, but only half as good as it would have been had the registers been ymm.
LLVM probably decided all the shuffling it would take (like you see in the ComplexF32 code) would not have been worth it, and would've made the code slower.

Investigating the assembly I found that @avx does not seem to unroll the loop.
Is this intentional?

It is intentional.

I normally introspect via print debugging with revise.
Obviously not a good idea for anyone not familiar with the code, and I'd prefer most people to be on tagged releases rather than dev-ing the library.
So I'll add more functions for introspection, and document them when I write documentation.

For now, you can at least do this to see what it decides (although not why):

using LoopVectorization
testq = :(for i  1:length(xre)
        xre[i] += βre*Are[i,k] + βim*Aim[i,k]
        xim[i] += βre*Aim[i,k] - βim*Are[i,k]
    end)
lstest = LoopVectorization.LoopSet(testq);
LoopVectorization.choose_order(lstest)
# ([:i], :i, 1, -1)

The four returned values mean:

  1. The order it'll nest the loops, from outer-most to inner-most.
  2. The loop from which it'll load contiguous SIMD vectors.
  3. The unroll factor: 1, as you noted.
  4. The tiling factor. If -1, it won't.

As for why it won't unroll, I have it tuned to be fairly conservative when there aren't any loop-carried dependencies.*
When a loop iteration depends on the previous one, but things are more or less associative (eg, a dot product), you can unroll to have several independent dependency chains that can be evaluated in parallel.
However, if you don't have any such dependency chains, out-of-order execution shouldn't need extra help in speculatively evaluating multiple loop iterations in parallel.

*This is a heuristic that could use tuning. Currently, if there are no loop carried dependencies, it will check if there is only a single loop (as in your case) and return 1 if so.
Otherwise, it will use:

min(4, round(Int, (compute_rt + load_rt + 1) / compute_rt))

Where compute_rt and load_rt are the estimated summed reciprical throughputs of "compute" and "load" instructions. Optionally greater than 1, because there is a cost to traversing strides.

There are certainly much smarter and more principled ways this can be done.

@chriselrod
Copy link
Member

Thanks @lesshaste , I'll check out those bug reports.

@hofmannmartin
Copy link
Author

Did you try test!, using those StructArrays as arguments?
You noted that @avx being faster wasn't surprising because the StructArrays save you from any shuffling. LLVM will also take advantage of that to produce much nicer code. It might do just as well as avx.
(If it does better than @avx, file an issue, and I'll try and find out why / how to improve.)

Good point I tried out using only the structs

function test_struct!(xre::Vector{T}, xim::Vector{T}, βre::T, βim::T, Are::Matrix{T}, Aim::Matrix{T}, k::Int) where {T<:AbstractFloat}
    for i  1:length(xre)
        xre[i] += βre*Are[i,k] + βim*Aim[i,k]
        xim[i] += βre*Aim[i,k] - βim*Are[i,k]
    end
end

function test_struct!(x::StructVector{Complex{T}}, β::Complex{T}, A::StructArray{Complex{T},2}, k::Int) where {T<:AbstractFloat}
    test_avx!(x.re, x.im, β.re, β.im, A.re , A.im, k)
end

and in fact LLVM does produce exactly the same low level code so @avx is actually not needed here.

Also, to be pedantic, the ComplesF64 code with @simd did actually vectorize:

True it did vectorize, however quite suboptimal.

For now, you can at least do this to see what it decides (although not why):

using LoopVectorization
testq = :(for i  1:length(xre)
        xre[i] += βre*Are[i,k] + βim*Aim[i,k]
        xim[i] += βre*Aim[i,k] - βim*Are[i,k]
    end)
lstest = LoopVectorization.LoopSet(testq);
LoopVectorization.choose_order(lstest)
# ([:i], :i, 1, -1)

The four returned values mean:

1. The order it'll nest the loops, from outer-most to inner-most.

2. The loop from which it'll load contiguous SIMD vectors.

3. The unroll factor: 1, as you noted.

4. The tiling factor. If `-1`, it won't.

As for why it won't unroll, I have it tuned to be fairly conservative when there aren't any loop-carried dependencies.*
When a loop iteration depends on the previous one, but things are more or less associative (eg, a dot product), you can unroll to have several independent dependency chains that can be evaluated in parallel.
However, if you don't have any such dependency chains, out-of-order execution shouldn't need extra help in speculatively evaluating multiple loop iterations in parallel.

*This is a heuristic that could use tuning. Currently, if there are no loop carried dependencies, it will check if there is only a single loop (as in your case) and return 1 if so.
Otherwise, it will use:

min(4, round(Int, (compute_rt + load_rt + 1) / compute_rt))

Where compute_rt and load_rt are the estimated summed reciprical throughputs of "compute" and "load" instructions. Optionally greater than 1, because there is a cost to traversing strides.

There are certainly much smarter and more principled ways this can be done.

Thank you for this insight. I was just asking since I found that my earlier implementation which uses VectorizationBase and SIMDPirates on Complex Arrays did profit from loop unroling (over 20% perfromance increase).

L304:
	vmovups	(%rcx,%rsi), %ymm2
	vmovups	(%rdx,%rsi), %ymm3
	vfmadd231ps	%ymm0, %ymm2, %ymm3
	vpermilps	$177, %ymm2, %ymm2 # ymm2 = ymm2[1,0,3,2,5,4,7,6]
	vfmadd213ps	%ymm3, %ymm1, %ymm2
	vmovups	%ymm2, (%rdx,%rsi)
	addq	$32, %rsi
	addq	$-1, %rbp
	jne	L304

vs

L272:
	vmovups	-96(%rdi,%rsi), %ymm2
	vmovups	-64(%rdi,%rsi), %ymm3
	vmovups	-32(%rdi,%rsi), %ymm4
	vmovups	(%rdi,%rsi), %ymm5
	vmovups	-96(%rdx,%rsi), %ymm6
	vfmadd231ps	%ymm1, %ymm2, %ymm6
	vpermilps	$177, %ymm2, %ymm2 # ymm2 = ymm2[1,0,3,2,5,4,7,6]
	vfmadd213ps	%ymm6, %ymm0, %ymm2
	vmovups	%ymm2, -96(%rdx,%rsi)
	vmovups	-64(%rdx,%rsi), %ymm2
	vfmadd231ps	%ymm1, %ymm3, %ymm2
	vpermilps	$177, %ymm3, %ymm3 # ymm3 = ymm3[1,0,3,2,5,4,7,6]
	vfmadd213ps	%ymm2, %ymm0, %ymm3
	vmovups	%ymm3, -64(%rdx,%rsi)
	vmovups	-32(%rdx,%rsi), %ymm2
	vfmadd231ps	%ymm1, %ymm4, %ymm2
	vpermilps	$177, %ymm4, %ymm3 # ymm3 = ymm4[1,0,3,2,5,4,7,6]
	vfmadd213ps	%ymm2, %ymm0, %ymm3
	vmovups	%ymm3, -32(%rdx,%rsi)
	vmovups	(%rdx,%rsi), %ymm2
	vfmadd231ps	%ymm1, %ymm5, %ymm2
	vpermilps	$177, %ymm5, %ymm3 # ymm3 = ymm5[1,0,3,2,5,4,7,6]
	vfmadd213ps	%ymm2, %ymm0, %ymm3
	vmovups	%ymm3, (%rdx,%rsi)
	subq	$-128, %rsi
	addq	$-1, %rcx
	jne	L272

Would it be possible to provide a method to somehow override the heuristic for manual fine tuning?

@chriselrod
Copy link
Member

Disappointing, I really expected this to be good enough:

function test!(x, a, beta)
    @inbounds @simd ivdep for n = 1:size(a,2)
        x[n] += beta * conj(a[n])
    end
end
N = 1024;
T = ComplexF32
x = rand(T, N);
a = rand(T,N);
beta = rand(T);
@code_native debuginfo=:none test!(x, a, beta) # not vectorized
using StructArrays
xsoa = StructArray(x);
asoa = StructArray(a);
@code_native debuginfo=:none test!(xsoa, asoa, beta) # also not vectorized =(

You shouldn't have to do things manually.

did profit from loop unroling (over 20% perfromance increase).

Interesting. Sounds like I'll have to change the heuristic.
I'll add an option for overriding the heuristic.
That would make it a lot easier to try different things, and thus discover when the heuristic needs work. Ideally it's made as robust as possible to save everyone's time from having to fiddle with settings.
One of my motivations behind this library was to automate the fiddling I had been doing.

Tiling shouldn't be overridden for code other people run, because the optimal values are platform dependent.

@hofmannmartin
Copy link
Author

Disappointing, I really expected this to be good enough:

Would have been nice.

Interesting. Sounds like I'll have to change the heuristic.

It was actually a post on julia dicourse and the openblas cdot kernel which had me try unrolling the loop.

This may be a stupid question, but can the out-of-order execution work across such a loop?

L304:
	vmovups	(%rcx,%rsi), %ymm2
	vmovups	(%rdx,%rsi), %ymm3
	vfmadd231ps	%ymm0, %ymm2, %ymm3
	vpermilps	$177, %ymm2, %ymm2 # ymm2 = ymm2[1,0,3,2,5,4,7,6]
	vfmadd213ps	%ymm3, %ymm1, %ymm2
	vmovups	%ymm2, (%rdx,%rsi)
	addq	$32, %rsi
	addq	$-1, %rbp
	jne	L304

Will it not be blocked, since in each iteration values are loaded into the ymm2 and ymm3 register, so it can only proceed, once vmovups %ymm2, (%rdx,%rsi). respectively vfmadd213ps %ymm3, %ymm1, %ymm2 have been executed? Would at least explain the gain I saw.

I'll add an option for overriding the heuristic.
That would make it a lot easier to try different things, and thus discover when the heuristic needs work. Ideally it's made as robust as possible to save everyone's time from having to fiddle with settings.
One of my motivations behind this library was to automate the fiddling I had been doing.

Would be perfect for testing. Final usage of the package is a different story. There you do not want to fiddle around with these settings.

@chriselrod
Copy link
Member

chriselrod commented Jan 17, 2020

I added the ability to manually specify the unroll factor in v0.3.8, via specifying unroll = U or tile = (U,T) where U and T are integer literals (ie, they can't be symbols).

can the out-of-order execution work across such a loop?

Register renaming:

This technique is used to eliminate false data dependencies arising from the reuse of registers by successive instructions that do not have any real data dependencies between them. The elimination of these false data dependencies reveals more instruction-level parallelism in an instruction stream, which can be exploited by various and complementary techniques such as superscalar and out-of-order execution for better performance.

Seems as though it is not enough. There is also still a dependency chain in the loop counter.

The difference isn't close to as extreme as when we have real loop carried dependencies:

julia> using BenchmarkTools, LoopVectorization

julia> function selfdotu1(x)
           s = zero(eltype(x))
           @avx unroll=1 for i  eachindex(x)
               s += x[i]*x[i]
           end
           s
       end
selfdotu1 (generic function with 1 method)

julia> function selfdotu2(x)
           s = zero(eltype(x))
           @avx unroll=2 for i  eachindex(x)
               s += x[i]*x[i]
           end
           s
       end
selfdotu2 (generic function with 1 method)

julia> function selfdotu4(x)
           s = zero(eltype(x))
           @avx unroll=4 for i  eachindex(x)
               s += x[i]*x[i]
           end
           s
       end
selfdotu4 (generic function with 1 method)

julia> function selfdotu8(x)
           s = zero(eltype(x))
           @avx unroll=8 for i  eachindex(x)
               s += x[i]*x[i]
           end
           s
       end
selfdotu8 (generic function with 1 method)

julia> x = rand(1024);

julia> @btime selfdotu1($x)
  138.725 ns (0 allocations: 0 bytes)
341.62521511329

julia> @btime selfdotu2($x)
  81.118 ns (0 allocations: 0 bytes)
341.6252151132899

julia> @btime selfdotu4($x)
  57.550 ns (0 allocations: 0 bytes)
341.6252151132899

julia> @btime selfdotu8($x)
  45.087 ns (0 allocations: 0 bytes)
341.6252151132899

but even 10% is nothing to scoff at, so I changed the heuristic with something that makes your example unroll by 4.

@hofmannmartin
Copy link
Author

I added the ability to manually specify the unroll factor in v0.3.8, via specifying unroll = U or tile = (U,T) where U and T are integer literals (ie, they can't be symbols).

You are fast. Thank you.

but even 10% is nothing to scoff at, so I changed the heuristic with something that makes your example unroll by 4.

Overall if I compare the native julia implementation working on the complex arrays with the one working on structs boosted by @avx (which now unrolls) I get a speedup of 12. This is awesome.

@chriselrod chriselrod added the enhancement New feature or request label Apr 14, 2020
@AshtonSBradley
Copy link

AshtonSBradley commented Dec 21, 2021

I added the ability to manually specify the unroll factor in v0.3.8, via specifying unroll = U or tile = (U,T) where U and T are integer literals (ie, they can't be symbols).

You are fast. Thank you.

but even 10% is nothing to scoff at, so I changed the heuristic with something that makes your example unroll by 4.

Overall if I compare the native julia implementation working on the complex arrays with the one working on structs boosted by @avx (which now unrolls) I get a speedup of 12. This is awesome.

Great to see this comparison and the unrolling. To clarify: is the factor of 12 achieved with something like unroll=8 applied to test_avx!?

i.e.

function test_avx!(xre::Vector{T}, xim::Vector{T}, βre::T, βim::T, Are::Matrix{T}, Aim::Matrix{T}, k::Int) where {T<:AbstractFloat}
    @avx unroll=8 for i  1:length(xre)
        xre[i] += βre*Are[i,k] + βim*Aim[i,k]
        xim[i] += βre*Aim[i,k] - βim*Are[i,k]
    end
end

function test_avx!(x::StructVector{Complex{T}}, β::Complex{T}, A::StructArray{Complex{T},2}, k::Int) where {T<:AbstractFloat}
    test_avx!(x.re, x.im, β.re, β.im, A.re , A.im, k)
end

Does it matter if arrays are three-dimensional?

@chriselrod
Copy link
Member

unroll=8 should be unnecessary.

It should work for multidimensional arrays.

@felixhorger
Copy link

Hi All,

Note that there another possibility of using complex numbers with LoopVectorization:
Instead of having separate arrays containing real & imaginary parts, you can just reinterpret the complex array x with

y = reinterpret(real(typeof(x)), x)

which could be made more convenient by using

y = reshape(y, 2, size(x)...)

on it. For example, the real part is then y[1, ...].
I haven't checked the performance of this, but it runs fast enough for my use case so I wanted to point it out here :)

Felix

@hofmannmartin
Copy link
Author

You are right. This is also how I solved my problem in the end. Using reinterpret and hand crafting a kernel for my function

function test!(x::Vector, y::Vector, beta)
  for n=1:length(x)
    x[n] += conj(beta)*y[n]
  end
end

Initially however, my question was aiming towards a solution, which does not require a hand crafted kernel.

@chriselrod
Copy link
Member

which could be made more convenient by using

Note that as of Julia 1.6, you can use reinterpret(reshape, real(typeof(x)), x) instead, to combine it into one step.
Doing this should yield slightly better performance than doing it in two steps, as this way LoopVectorization/VectorizationBase will know exactly what stride(y, 2) equals at compile time, allowing it to generate a better sequence of shuffle instructions to do the AoS<->SoA conversion.

You can see a few examples here:
https://github.com/JuliaSIMD/LoopVectorization.jl/blob/master/test/shuffleloadstores.jl

my question was aiming towards a solution, which does not require a hand crafted kernel.

I am very slowly rewriting LoopVectorization.
The rewrite will be able to handle this automatically, i.e. you'll be able to use Complex numbers directly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants