Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mixed Float64 and Int64 #11

Closed
chrisvwx opened this issue Jan 7, 2020 · 5 comments
Closed

mixed Float64 and Int64 #11

chrisvwx opened this issue Jan 7, 2020 · 5 comments

Comments

@chrisvwx
Copy link

chrisvwx commented Jan 7, 2020

Minor modifications of the mygemm functions from your README seem to work great for inputs that are all Float64 or all Int64

function mygemm!(C, A, B)
    @inbounds for i ∈ 1:size(A,1), j ∈ 1:size(B,2)
        Cᵢⱼ = zero(eltype(C))
        @fastmath for k ∈ 1:size(A,2)
            Cᵢⱼ += A[i,k] * B[k,j]
        end
        C[i,j] = Cᵢⱼ
    end
end
function mygemmavx!(C, A, B)
    @avx for i ∈ 1:size(A,1), j ∈ 1:size(B,2)
        Cᵢⱼ = zero(eltype(C))
        for k ∈ 1:size(A,2)
            Cᵢⱼ += A[i,k] * B[k,j]
        end
        C[i,j] = Cᵢⱼ
    end
end

If however I make A & B Int64 and C a Float 64, then the macro seems to fail

N = 20;
A = rand(0:1000,N, N);
B = rand(0:1000,N, N);
C1 = Matrix{Float64}(undef, N, N); 
C2 = Matrix{Float64}(undef, N, N); 
@btime mygemm!($C1, $A, $B)
@btime mygemmavx!($C2, $A, $B)

The error produced is

julia> @btime mygemmavx!($C2, $A, $B)
ERROR: MethodError: no method matching +(::VectorizationBase.SVec{4,Int64}, ::VectorizationBase.SVec{4,Float64})
...

My understanding from your comment on Discourse is that this should work.
FYI I'm using Julia 1.3.1 on MacOS.

@chriselrod
Copy link
Member

chriselrod commented Jan 7, 2020

Okay, I added a few missing methods to SIMDPirates.jl et al, but there are still probably many more.
I've added a few integer*integer = float matrix multiplication tests like your example.

Please feel free to file more issues as they come up. Once you update to LoopVectorization 0.3.1 (which lower bounds VectorizationBase and SIMDPirates):

julia> using LoopVectorization
[ Info: Precompiling LoopVectorization [bdcacae8-1622-11e9-2a5c-532679323890]

julia> function mygemm!(C, A, B)
           @inbounds for i  1:size(A,1), j  1:size(B,2)
               Cᵢⱼ = zero(eltype(C))
               @fastmath for k  1:size(A,2)
                   Cᵢⱼ += A[i,k] * B[k,j]
               end
               C[i,j] = Cᵢⱼ
           end
       end
mygemm! (generic function with 1 method)

julia> function mygemmavx!(C, A, B)
           @avx for i  1:size(A,1), j  1:size(B,2)
               Cᵢⱼ = zero(eltype(C))
               for k  1:size(A,2)
                   Cᵢⱼ += A[i,k] * B[k,j]
               end
               C[i,j] = Cᵢⱼ
           end
       end
mygemmavx! (generic function with 1 method)

julia> N = 20;

julia> A = rand(0:1000,N, N);

julia> B = rand(0:1000,N, N);

julia> C1 = Matrix{Float64}(undef, N, N);

julia> C2 = Matrix{Float64}(undef, N, N);

julia> using BenchmarkTools

julia> @btime mygemm!($C1, $A, $B)
  5.039 μs (0 allocations: 0 bytes)

julia> @btime mygemmavx!($C2, $A, $B)
  869.035 ns (0 allocations: 0 bytes)

julia> C1  C2
true

But note that there is a danger to making these things generic -- performance can mysteriously be bad, without the cause always being clear.

AX2 cannot efficiently convert from Int64 to Float64, but AVX512 can. Meaning that code will be fast on some computers, and slow on others. Correctness is more important than speed, but mysterious and silent performance differences between computers can be confusing and frustrating.

julia> using LoopVectorization

julia> using LoopVectorization: Vec

julia> const W64 = LoopVectorization.VectorizationBase.pick_vector_width(Float64)
8

julia> xi = ntuple(Val(W64)) do i Core.VecElement(i) end
(VecElement{Int64}(1), VecElement{Int64}(2), VecElement{Int64}(3), VecElement{Int64}(4), VecElement{Int64}(5), VecElement{Int64}(6), VecElement{Int64}(7), VecElement{Int64}(8))

julia> LoopVectorization.SIMDPirates.vconvert(Vec{W64,Float64}, xi)
(VecElement{Float64}(1.0), VecElement{Float64}(2.0), VecElement{Float64}(3.0), VecElement{Float64}(4.0), VecElement{Float64}(5.0), VecElement{Float64}(6.0), VecElement{Float64}(7.0), VecElement{Float64}(8.0))

julia> @code_native debuginfo=:none LoopVectorization.SIMDPirates.vconvert(Vec{W64,Float64}, xi)
	.text
	vcvtqq2pd	%zmm0, %zmm0
	retq
	nopw	(%rax,%rax)

vector convert from quadword integer (Int64) to packed doubles is avx512-only. Try this code on your computer, and you'll likely get a (much slower) series of instructions instead.

However, both instruction sets can efficiently convert from Int32 to Float64, so you may want to try Int32 arrays instead for performance.

julia> xi32 = ntuple(Val(W64)) do i Core.VecElement(Int32(i)) end;

julia> LoopVectorization.SIMDPirates.vconvert(Vec{W64,Float64}, xi32);

julia> @code_native debuginfo=:none LoopVectorization.SIMDPirates.vconvert(Vec{W64,Float64}, xi32)
	.text
	vcvtdq2pd	%ymm0, %zmm0
	retq
	nopw	(%rax,%rax)

vector convert of double-words to packed doubles was added in sse2 and avx.

Therefore if you want cross-platform performance (instead of avx512-only), you should mix 32-bit integers with Float64. Although the 64-bit integers were faster on AVX512, so you could also implement checks...

@chriselrod
Copy link
Member

I forgot to update the manifest, so tests failed because the manifest listed versions outside the compat bounds I updated:
374c84d

Let me know if I should tag another version with the updated manifest.

@chrisvwx
Copy link
Author

chrisvwx commented Jan 7, 2020

Thanks, the mixed-type function call works now. I have AVX2 only, which I guess is the reason that I don't get as big of a speedup when using @avx as you see when mixing Int64 and Float64. I did see larger speedup for mixing Int32 and Float64.

When trying to execute the later code blocks which call LoopVectorization.SIMDPirates.pirate_convert I got a ERROR: UndefVarError: pirate_convert not defined error. I used LoopVectorization#master

@chriselrod
Copy link
Member

I started writing that part of the comment before making changes to SIMDPirates. The previous version had a function called pirate_convert which I deleted and replaced with vconvert, now using Base.llvmcall to perform integer-floating point conversions.

I edited the previous blocks to reference vconvert instead.
Rerunning the first of those blocks, now on a computer with AVX2 only:

julia> using LoopVectorization

julia> using LoopVectorization: Vec

julia> const W64 = LoopVectorization.VectorizationBase.pick_vector_width(Float64)
4

julia> xi = ntuple(Val(W64)) do i Core.VecElement(i) end
(VecElement{Int64}(1), VecElement{Int64}(2), VecElement{Int64}(3), VecElement{Int64}(4))

julia> LoopVectorization.SIMDPirates.vconvert(Vec{W64,Float64}, xi)
(VecElement{Float64}(1.0), VecElement{Float64}(2.0), VecElement{Float64}(3.0), VecElement{Float64}(4.0))

julia> @code_native debuginfo=:none LoopVectorization.SIMDPirates.vconvert(Vec{W64,Float64}, xi)
        .text
        vextracti128    $1, %ymm0, %xmm1
        vpextrq $1, %xmm1, %rax
        vcvtsi2sd       %rax, %xmm2, %xmm2
        vmovq   %xmm1, %rax
        vcvtsi2sd       %rax, %xmm3, %xmm1
        vpextrq $1, %xmm0, %rax
        vmovlhps        %xmm2, %xmm1, %xmm1 # xmm1 = xmm1[0],xmm2[0]
        vcvtsi2sd       %rax, %xmm3, %xmm2
        vmovq   %xmm0, %rax
        vcvtsi2sd       %rax, %xmm3, %xmm0
        vmovlhps        %xmm2, %xmm0, %xmm0 # xmm0 = xmm0[0],xmm2[0]
        vinsertf128     $1, %xmm1, %ymm0, %ymm0
        retq
        nop

12 instructions vs 1.

@chrisvwx
Copy link
Author

chrisvwx commented Jan 7, 2020

The vconvert function works for me now, and I see the AVX2 instructions as you describe.

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants