mixed Float64 and Int64 #11

chrisvwx · 2020-01-07T00:53:01Z

Minor modifications of the mygemm functions from your README seem to work great for inputs that are all Float64 or all Int64

function mygemm!(C, A, B)
    @inbounds for i ∈ 1:size(A,1), j ∈ 1:size(B,2)
        Cᵢⱼ = zero(eltype(C))
        @fastmath for k ∈ 1:size(A,2)
            Cᵢⱼ += A[i,k] * B[k,j]
        end
        C[i,j] = Cᵢⱼ
    end
end
function mygemmavx!(C, A, B)
    @avx for i ∈ 1:size(A,1), j ∈ 1:size(B,2)
        Cᵢⱼ = zero(eltype(C))
        for k ∈ 1:size(A,2)
            Cᵢⱼ += A[i,k] * B[k,j]
        end
        C[i,j] = Cᵢⱼ
    end
end

If however I make A & B Int64 and C a Float 64, then the macro seems to fail

N = 20;
A = rand(0:1000,N, N);
B = rand(0:1000,N, N);
C1 = Matrix{Float64}(undef, N, N); 
C2 = Matrix{Float64}(undef, N, N); 
@btime mygemm!($C1, $A, $B)
@btime mygemmavx!($C2, $A, $B)

The error produced is

julia> @btime mygemmavx!($C2, $A, $B)
ERROR: MethodError: no method matching +(::VectorizationBase.SVec{4,Int64}, ::VectorizationBase.SVec{4,Float64})
...

My understanding from your comment on Discourse is that this should work.
FYI I'm using Julia 1.3.1 on MacOS.

The text was updated successfully, but these errors were encountered:

chriselrod · 2020-01-07T12:27:47Z

Okay, I added a few missing methods to SIMDPirates.jl et al, but there are still probably many more.
I've added a few integer*integer = float matrix multiplication tests like your example.

Please feel free to file more issues as they come up. Once you update to LoopVectorization 0.3.1 (which lower bounds VectorizationBase and SIMDPirates):

julia> using LoopVectorization
[ Info: Precompiling LoopVectorization [bdcacae8-1622-11e9-2a5c-532679323890]

julia> function mygemm!(C, A, B)
           @inbounds for i ∈ 1:size(A,1), j ∈ 1:size(B,2)
               Cᵢⱼ = zero(eltype(C))
               @fastmath for k ∈ 1:size(A,2)
                   Cᵢⱼ += A[i,k] * B[k,j]
               end
               C[i,j] = Cᵢⱼ
           end
       end
mygemm! (generic function with 1 method)

julia> function mygemmavx!(C, A, B)
           @avx for i ∈ 1:size(A,1), j ∈ 1:size(B,2)
               Cᵢⱼ = zero(eltype(C))
               for k ∈ 1:size(A,2)
                   Cᵢⱼ += A[i,k] * B[k,j]
               end
               C[i,j] = Cᵢⱼ
           end
       end
mygemmavx! (generic function with 1 method)

julia> N = 20;

julia> A = rand(0:1000,N, N);

julia> B = rand(0:1000,N, N);

julia> C1 = Matrix{Float64}(undef, N, N);

julia> C2 = Matrix{Float64}(undef, N, N);

julia> using BenchmarkTools

julia> @btime mygemm!($C1, $A, $B)
  5.039 μs (0 allocations: 0 bytes)

julia> @btime mygemmavx!($C2, $A, $B)
  869.035 ns (0 allocations: 0 bytes)

julia> C1 ≈ C2
true

But note that there is a danger to making these things generic -- performance can mysteriously be bad, without the cause always being clear.

AX2 cannot efficiently convert from Int64 to Float64, but AVX512 can. Meaning that code will be fast on some computers, and slow on others. Correctness is more important than speed, but mysterious and silent performance differences between computers can be confusing and frustrating.

julia> using LoopVectorization

julia> using LoopVectorization: Vec

julia> const W64 = LoopVectorization.VectorizationBase.pick_vector_width(Float64)
8

julia> xi = ntuple(Val(W64)) do i Core.VecElement(i) end
(VecElement{Int64}(1), VecElement{Int64}(2), VecElement{Int64}(3), VecElement{Int64}(4), VecElement{Int64}(5), VecElement{Int64}(6), VecElement{Int64}(7), VecElement{Int64}(8))

julia> LoopVectorization.SIMDPirates.vconvert(Vec{W64,Float64}, xi)
(VecElement{Float64}(1.0), VecElement{Float64}(2.0), VecElement{Float64}(3.0), VecElement{Float64}(4.0), VecElement{Float64}(5.0), VecElement{Float64}(6.0), VecElement{Float64}(7.0), VecElement{Float64}(8.0))

julia> @code_native debuginfo=:none LoopVectorization.SIMDPirates.vconvert(Vec{W64,Float64}, xi)
	.text
	vcvtqq2pd	%zmm0, %zmm0
	retq
	nopw	(%rax,%rax)

vector convert from quadword integer (Int64) to packed doubles is avx512-only. Try this code on your computer, and you'll likely get a (much slower) series of instructions instead.

However, both instruction sets can efficiently convert from Int32 to Float64, so you may want to try Int32 arrays instead for performance.

julia> xi32 = ntuple(Val(W64)) do i Core.VecElement(Int32(i)) end;

julia> LoopVectorization.SIMDPirates.vconvert(Vec{W64,Float64}, xi32);

julia> @code_native debuginfo=:none LoopVectorization.SIMDPirates.vconvert(Vec{W64,Float64}, xi32)
	.text
	vcvtdq2pd	%ymm0, %zmm0
	retq
	nopw	(%rax,%rax)

vector convert of double-words to packed doubles was added in sse2 and avx.

Therefore if you want cross-platform performance (instead of avx512-only), you should mix 32-bit integers with Float64. Although the 64-bit integers were faster on AVX512, so you could also implement checks...

chriselrod · 2020-01-07T12:48:13Z

I forgot to update the manifest, so tests failed because the manifest listed versions outside the compat bounds I updated:
374c84d

Let me know if I should tag another version with the updated manifest.

chrisvwx · 2020-01-07T14:53:53Z

Thanks, the mixed-type function call works now. I have AVX2 only, which I guess is the reason that I don't get as big of a speedup when using @avx as you see when mixing Int64 and Float64. I did see larger speedup for mixing Int32 and Float64.

When trying to execute the later code blocks which call LoopVectorization.SIMDPirates.pirate_convert I got a ERROR: UndefVarError: pirate_convert not defined error. I used LoopVectorization#master

chriselrod · 2020-01-07T15:14:11Z

I started writing that part of the comment before making changes to SIMDPirates. The previous version had a function called pirate_convert which I deleted and replaced with vconvert, now using Base.llvmcall to perform integer-floating point conversions.

I edited the previous blocks to reference vconvert instead.
Rerunning the first of those blocks, now on a computer with AVX2 only:

julia> using LoopVectorization

julia> using LoopVectorization: Vec

julia> const W64 = LoopVectorization.VectorizationBase.pick_vector_width(Float64)
4

julia> xi = ntuple(Val(W64)) do i Core.VecElement(i) end
(VecElement{Int64}(1), VecElement{Int64}(2), VecElement{Int64}(3), VecElement{Int64}(4))

julia> LoopVectorization.SIMDPirates.vconvert(Vec{W64,Float64}, xi)
(VecElement{Float64}(1.0), VecElement{Float64}(2.0), VecElement{Float64}(3.0), VecElement{Float64}(4.0))

julia> @code_native debuginfo=:none LoopVectorization.SIMDPirates.vconvert(Vec{W64,Float64}, xi)
        .text
        vextracti128    $1, %ymm0, %xmm1
        vpextrq $1, %xmm1, %rax
        vcvtsi2sd       %rax, %xmm2, %xmm2
        vmovq   %xmm1, %rax
        vcvtsi2sd       %rax, %xmm3, %xmm1
        vpextrq $1, %xmm0, %rax
        vmovlhps        %xmm2, %xmm1, %xmm1 # xmm1 = xmm1[0],xmm2[0]
        vcvtsi2sd       %rax, %xmm3, %xmm2
        vmovq   %xmm0, %rax
        vcvtsi2sd       %rax, %xmm3, %xmm0
        vmovlhps        %xmm2, %xmm0, %xmm0 # xmm0 = xmm0[0],xmm2[0]
        vinsertf128     $1, %xmm1, %ymm0, %ymm0
        retq
        nop

12 instructions vs 1.

chrisvwx · 2020-01-07T15:43:49Z

The vconvert function works for me now, and I see the AVX2 instructions as you describe.

Thanks!

chriselrod closed this as completed Jan 7, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mixed Float64 and Int64 #11

mixed Float64 and Int64 #11

chrisvwx commented Jan 7, 2020

chriselrod commented Jan 7, 2020 •

edited

Loading

chriselrod commented Jan 7, 2020

chrisvwx commented Jan 7, 2020

chriselrod commented Jan 7, 2020

chrisvwx commented Jan 7, 2020

mixed Float64 and Int64 #11

mixed Float64 and Int64 #11

Comments

chrisvwx commented Jan 7, 2020

chriselrod commented Jan 7, 2020 • edited Loading

chriselrod commented Jan 7, 2020

chrisvwx commented Jan 7, 2020

chriselrod commented Jan 7, 2020

chrisvwx commented Jan 7, 2020

chriselrod commented Jan 7, 2020 •

edited

Loading