mixed Float64 and Int64 #11
Okay, I added a few missing methods to Please feel free to file more issues as they come up. Once you update to LoopVectorization 0.3.1 (which lower bounds VectorizationBase and SIMDPirates):
julia> using LoopVectorization
[ Info: Precompiling LoopVectorization [bdcacae8-1622-11e9-2a5c-532679323890]
julia> function mygemm!(C, A, B)
           @inbounds for i ∈ 1:size(A,1), j ∈ 1:size(B,2)
               Cᵢⱼ = zero(eltype(C))
               @fastmath for k ∈ 1:size(A,2)
                   Cᵢⱼ += A[i,k] * B[k,j]
               end
               C[i,j] = Cᵢⱼ
           end
       end
mygemm! (generic function with 1 method)
julia> function mygemmavx!(C, A, B)
           @avx for i ∈ 1:size(A,1), j ∈ 1:size(B,2)
               Cᵢⱼ = zero(eltype(C))
               for k ∈ 1:size(A,2)
                   Cᵢⱼ += A[i,k] * B[k,j]
               end
               C[i,j] = Cᵢⱼ
           end
       end
mygemmavx! (generic function with 1 method)
julia> N = 20;
julia> A = rand(0:1000,N, N);
julia> B = rand(0:1000,N, N);
julia> C1 = Matrix{Float64}(undef, N, N);
julia> C2 = Matrix{Float64}(undef, N, N);
julia> using BenchmarkTools
julia> @btime mygemm!($C1, $A, $B)
5.039 μs (0 allocations: 0 bytes)
julia> @btime mygemmavx!($C2, $A, $B)
869.035 ns (0 allocations: 0 bytes)
julia> C1 ≈ C2
true
But note that there is a danger to making these things generic -- performance can mysteriously be bad, without the cause always being clear. AVX2 cannot efficiently convert from
julia> using LoopVectorization
julia> using LoopVectorization: Vec
julia> const W64 = LoopVectorization.VectorizationBase.pick_vector_width(Float64)
8
julia> xi = ntuple(Val(W64)) do i Core.VecElement(i) end
(VecElement{Int64}(1), VecElement{Int64}(2), VecElement{Int64}(3), VecElement{Int64}(4), VecElement{Int64}(5), VecElement{Int64}(6), VecElement{Int64}(7), VecElement{Int64}(8))
julia> LoopVectorization.SIMDPirates.vconvert(Vec{W64,Float64}, xi)
(VecElement{Float64}(1.0), VecElement{Float64}(2.0), VecElement{Float64}(3.0), VecElement{Float64}(4.0), VecElement{Float64}(5.0), VecElement{Float64}(6.0), VecElement{Float64}(7.0), VecElement{Float64}(8.0))
julia> @code_native debuginfo=:none LoopVectorization.SIMDPirates.vconvert(Vec{W64,Float64}, xi)
.text
vcvtqq2pd %zmm0, %zmm0
retq
nopw (%rax,%rax)
Vector convert from quadword integer (Int64) to packed doubles is AVX512-only. Try this code on your computer, and you'll likely get a (much slower) series of instructions instead. However, both instruction sets can efficiently convert from
julia> xi32 = ntuple(Val(W64)) do i Core.VecElement(Int32(i)) end;
julia> LoopVectorization.SIMDPirates.vconvert(Vec{W64,Float64}, xi32);
julia> @code_native debuginfo=:none LoopVectorization.SIMDPirates.vconvert(Vec{W64,Float64}, xi32)
.text
vcvtdq2pd %ymm0, %zmm0
retq
nopw (%rax,%rax)
Vector convert of double-words to packed doubles was added in SSE2 and AVX. Therefore, if you want cross-platform performance (instead of AVX512-only), you should mix 32-bit integers with
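To make the 32-bit suggestion concrete, here is a minimal sketch of narrowing Int64 inputs before passing them to a kernel such as mygemmavx!. The helper name narrow_to_int32 is hypothetical and not part of LoopVectorization; it uses only Base functions and assumes a non-empty integer array. Note that Int32 * Int32 stays Int32 under Julia's ordinary promotion rules, so this is only safe when the products themselves also fit in 32 bits (true for entries in 0:1000).

```julia
# Hypothetical helper (not a LoopVectorization API): return an Int32 copy
# when every element fits in Int32, otherwise return the input unchanged.
# Assumes A is non-empty, since extrema throws on empty collections.
function narrow_to_int32(A::AbstractArray{<:Integer})
    lo, hi = extrema(A)
    if typemin(Int32) <= lo && hi <= typemax(Int32)
        return Int32.(A)   # element-wise conversion, allocates a new array
    end
    return A               # values too large: leave the array as it is
end

A = rand(0:1000, 20, 20)       # Int64 elements on a 64-bit machine
A32 = narrow_to_int32(A)       # same values, stored as Int32
```

The return type depends on the data, so it is probably best to call this once outside the hot loop rather than inside it.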
I forgot to update the manifest, so tests failed because the manifest listed versions outside the compat bounds I updated: Let me know if I should tag another version with the updated manifest.
Thanks, the mixed-type function call works now. I have AVX2 only, which I guess is the reason that I don't get as big of a speedup when using When trying to execute the later code blocks which call
I started writing that part of the comment before making changes to I edited the previous blocks to reference
julia> using LoopVectorization
julia> using LoopVectorization: Vec
julia> const W64 = LoopVectorization.VectorizationBase.pick_vector_width(Float64)
4
julia> xi = ntuple(Val(W64)) do i Core.VecElement(i) end
(VecElement{Int64}(1), VecElement{Int64}(2), VecElement{Int64}(3), VecElement{Int64}(4))
julia> LoopVectorization.SIMDPirates.vconvert(Vec{W64,Float64}, xi)
(VecElement{Float64}(1.0), VecElement{Float64}(2.0), VecElement{Float64}(3.0), VecElement{Float64}(4.0))
julia> @code_native debuginfo=:none LoopVectorization.SIMDPirates.vconvert(Vec{W64,Float64}, xi)
.text
vextracti128 $1, %ymm0, %xmm1
vpextrq $1, %xmm1, %rax
vcvtsi2sd %rax, %xmm2, %xmm2
vmovq %xmm1, %rax
vcvtsi2sd %rax, %xmm3, %xmm1
vpextrq $1, %xmm0, %rax
vmovlhps %xmm2, %xmm1, %xmm1 # xmm1 = xmm1[0],xmm2[0]
vcvtsi2sd %rax, %xmm3, %xmm2
vmovq %xmm0, %rax
vcvtsi2sd %rax, %xmm3, %xmm0
vmovlhps %xmm2, %xmm0, %xmm0 # xmm0 = xmm0[0],xmm2[0]
vinsertf128 $1, %xmm1, %ymm0, %ymm0
retq
nop
12 instructions vs 1.
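If you want to check this on your own machine without reaching into package internals, a rough sketch using only Base and InteractiveUtils follows. The function names cvt64 and cvt32 are made up for illustration, and whether the compiler emits the same instructions as vconvert for plain tuples is an assumption, so treat the output as indicative only.

```julia
using InteractiveUtils  # provides @code_native outside the REPL

# Hypothetical probes: convert a small tuple of integers to Float64.
cvt64(x::NTuple{4,Int64}) = map(Float64, x)
cvt32(x::NTuple{4,Int32}) = map(Float64, x)

x64 = (1, 2, 3, 4)
x32 = map(Int32, x64)

# The generated assembly depends on the CPU and on the Julia/LLVM version.
@code_native debuginfo=:none cvt64(x64)
@code_native debuginfo=:none cvt32(x32)
```

Comparing the two listings should show whether the Int64 path expands into a sequence of scalar conversions on your hardware.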
The Thanks!
Minor modifications of the mygemm functions from your README seem to work great for inputs that are all Float64 or all Int64. If, however, I make A and B Int64 and C Float64, then the macro seems to fail.
The error produced is
My understanding from your comment on Discourse is that this should work.
FYI I'm using Julia 1.3.1 on macOS.