
Vectorized Bitwise operations #91

Closed
MasonProtter opened this issue Apr 9, 2020 · 9 comments

@MasonProtter
Contributor

It'd be nice if this worked:

julia> ai = [false for _ in 1:64];

julia> bi = [true for _ in 1:64];

julia> @btime $ai .& $bi;
  211.228 ns (2 allocations: 128 bytes)

julia> using LoopVectorization

julia> @avx ai .& bi
ERROR: MethodError: no method matching &(::VectorizationBase.SVec{8,Bool}, ::VectorizationBase.SVec{8,Bool})
Closest candidates are:
  &(::Any, ::Any, ::Any, ::Any...) at operators.jl:529
  &(::VectorizationBase.Static{N}, ::Any) where N at /home/mason/.julia/packages/VectorizationBase/T6nfs/src/static.jl:111
  &(::Any, ::VectorizationBase.Static{N}) where N at /home/mason/.julia/packages/VectorizationBase/T6nfs/src/static.jl:112
Stacktrace:
 [1] macro expansion at /home/mason/.julia/packages/LoopVectorization/UOrSO/src/broadcast.jl:231 [inlined]
 [2] vmaterialize! at /home/mason/.julia/packages/LoopVectorization/UOrSO/src/broadcast.jl:231 [inlined]
 [3] vmaterialize(::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(&),Tuple{Array{Bool,1},Array{Bool,1}}}, ::Val{:Main}) at /home/mason/.julia/packages/LoopVectorization/UOrSO/src/broadcast.jl:278
 [4] top-level scope at REPL[7]:1
 [5] eval(::Module, ::Any) at ./boot.jl:331
 [6] eval_user_input(::Any, ::REPL.REPLBackend) at /home/mason/julia/usr/share/julia/stdlib/v1.4/REPL/src/REPL.jl:86
 [7] run_backend(::REPL.REPLBackend) at /home/mason/.julia/packages/Revise/Pcs5V/src/Revise.jl:1073
 [8] top-level scope at none:0
@chriselrod
Member

Okay, I'll add support for vectorized byte-wise boolean operations. Bit-wise boolean and byte-wise integer should already be supported (as always, please keep filing issues as you run into problems).

Normally LoopVectorization represents bitwise operations using the Mask type, but that currently only supports vectors up to length 64 (represented by a struct wrapping a UInt64).
I'm wondering if I should stick with SVec{8,Bool}s when you load them from a vector, or truncate it down into the mask type. The latter would let me avoid re-implementing all the other methods using masks.
My concern over the truncation is that it would then require zero-extension if you're storing the values. I don't think LLVM is going to eliminate that song and dance, because if some of the zero bits in the Bools weren't zero bits, eliminating the truncation -> (ops) -> zero-extension would change behavior.
Although I may be able to tell LLVM that the contents of those bits are undefined. I'll do some experimenting before giving up and deciding I need to add support for SVec{W,Bool} to all the methods currently accepting masks.
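The trade-off can be illustrated in plain Julia. Below is a minimal, hypothetical sketch of the bit-packing idea behind Mask; `pack`/`unpack` are illustrative helpers, not LoopVectorization API:

```julia
# Hypothetical sketch of the idea behind the Mask type: pack up to 64 Bools
# into one UInt64 so a boolean op on all lanes is a single integer op.
pack(bools::AbstractVector{Bool}) =
    foldl((u, i) -> u | (UInt64(bools[i]) << (i - 1)), eachindex(bools); init = zero(UInt64))

unpack(u::UInt64, n::Integer) = [Bool((u >> (i - 1)) & one(u)) for i in 1:n]

a = [isodd(i) for i in 1:8]
b = fill(true, 8)
c = unpack(pack(a) & pack(b), 8)  # one `&` covers all 8 lanes
@assert c == a .& b
```

The zero-extension cost discussed above shows up exactly at the `unpack` step: each bit has to be widened back into a full byte before it can be stored into a `Vector{Bool}`.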

Sometimes representing bools as bits (as a BitArray does) instead of bytes is much faster.

julia> @benchmark @. $ci = $ai & $bi # 64
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     27.019 ns (0.00% GC)
  median time:      33.704 ns (0.00% GC)
  mean time:        33.295 ns (0.00% GC)
  maximum time:     44.838 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     993

julia> @benchmark $(Ref(0))[] & $(Ref(-1))[] # 64 at a time
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.103 ns (0.00% GC)
  median time:      1.107 ns (0.00% GC)
  mean time:        1.113 ns (0.00% GC)
  maximum time:     10.978 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> bitstring(-1)
"1111111111111111111111111111111111111111111111111111111111111111"

julia> bitstring(0)
"0000000000000000000000000000000000000000000000000000000000000000"

julia> using SIMDPirates

julia> alltrue = vbroadcast(Val(8), typemax(UInt)) # should probably print using `repr` on elements
SVec{8,UInt64}<18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615>

julia> allfalse = vbroadcast(Val(8), zero(UInt))
SVec{8,UInt64}<0, 0, 0, 0, 0, 0, 0, 0>

julia> @benchmark $(Ref(alltrue))[] & $(Ref(allfalse))[] # 512 at a time
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.181 ns (0.00% GC)
  median time:      1.184 ns (0.00% GC)
  mean time:        1.194 ns (0.00% GC)
  maximum time:     12.668 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

But it depends on what you're doing with them. Accessing the 17th bit means (x >> (17-1)) & one(x).
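For instance, assuming the bits are packed least-significant-first as in the expression above:

```julia
x = UInt64(1) << 16                          # word with only the 17th bit set
@assert (x >> (17 - 1)) & one(x) == one(x)   # 17th bit is 1
@assert (x >> (18 - 1)) & one(x) == zero(x)  # 18th bit is 0
```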

@MasonProtter
Contributor Author

MasonProtter commented Apr 9, 2020

Yeah, I'm open to the idea that in a vectorized context, it's probably always preferable to use BitArray, so maybe we should just always convert to bitstorage? I guess that could impose some hefty overhead though.

@MasonProtter
Contributor Author

Looks like LoopVectorization doesn't know how to deal with BitArray either yet:

julia> a = trues(64); b = falses(64); typeof(a)
BitArray{1}

julia> @avx a .& b;
ERROR: conversion to pointer not defined for BitArray{1}
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] unsafe_convert(::Type{Ptr{Bool}}, ::BitArray{1}) at ./pointer.jl:67
 [3] pointer(::BitArray{1}) at ./abstractarray.jl:933
 [4] stridedpointer_for_broadcast at /home/mason/.julia/packages/VectorizationBase/T6nfs/src/vectorizable.jl:612 [inlined]
 [5] macro expansion at /home/mason/.julia/packages/LoopVectorization/UOrSO/src/broadcast.jl:231 [inlined]
 [6] vmaterialize! at /home/mason/.julia/packages/LoopVectorization/UOrSO/src/broadcast.jl:231 [inlined]
 [7] vmaterialize(::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(&),Tuple{BitArray{1},BitArray{1}}}, ::Val{:Main}) at /home/mason/.julia/packages/LoopVectorization/UOrSO/src/broadcast.jl:278
 [8] top-level scope at REPL[20]:1
 [9] eval(::Module, ::Any) at ./boot.jl:331
 [10] eval_user_input(::Any, ::REPL.REPLBackend) at /home/mason/julia/usr/share/julia/stdlib/v1.4/REPL/src/REPL.jl:86
 [11] run_backend(::REPL.REPLBackend) at /home/mason/.julia/packages/Revise/Pcs5V/src/Revise.jl:1073
 [12] top-level scope at none:0

@chriselrod
Member

> Looks like LoopVectorization doesn't know how to deal with BitArray either yet

You're right, thanks for testing.
I added support for loading from them in loops, but did not add tests for storing them.
If they did once work, they stopped working once LoopVectorization switched from calling vstore! to calling vnoaliasstore!.

I'll probably only support BitArray{1} and BitArray{2} at first, because their axes awkwardly aren't padded, making them uncomfortable to work with.
Maybe PaddedMatrices should add PaddedBitArrays. They need it more than any other array type.
They'd be Array{UInt8,N} under the hood (instead of Vector{UInt64}), padded to the nearest byte on each axis, so that each dim starts with the start of a new UInt8.
If they did once work, I only added support for storing into BitArray{1}; I'd still need to add a vstore! function for BitArray{2} that does all the shifts like the vload function does. (Regarding the method signature: a PackedStridedBitPointer{1} corresponds to a BitArray{2}; the type parameter of the various PackedStridedPointers is one less than the dimensionality of the associated array, because they store N-1 strides, treating the first stride as 1.)
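The proposed padded layout can be sketched in a few lines; `padded_bytes` and the PaddedBitArray layout are hypothetical, since no such type exists:

```julia
# Hypothetical sketch of the proposed PaddedBitArray layout: every axis is
# padded up to whole bytes, so each dim starts at a fresh UInt8 boundary.
padded_bytes(nbits::Integer) = cld(nbits, 8)  # bytes needed to hold `nbits` bits

# A 13×5 bit matrix would store 2 bytes per column (13 bits round up to 16),
# i.e. an Array{UInt8,2} of size (2, 5) under the hood.
@assert padded_bytes(13) == 2
@assert padded_bytes(64) == 8
```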

I also didn't add support to them for broadcasting.

I should take another look at stridedpointer_for_broadcast to see if it can be made to piggyback on stridedpointer better. The idea is to check for dims of size 1 and set the corresponding strides to 0, so that that dimension gets broadcast.
(It doesn't do that for the first dimension, meaning arrays where size(A, 1) == 1 still aren't supported.)
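The size-1 trick can be sketched like this; `broadcast_strides` is a hypothetical helper, not the actual stridedpointer_for_broadcast:

```julia
# Sketch of the broadcasting trick described above: any dim of size 1 gets
# stride 0, so every index along it maps back to the same element.
broadcast_strides(sz, st) = ntuple(d -> sz[d] == 1 ? 0 : st[d], length(sz))

A = ones(3, 1)                                # size (3, 1), strides (1, 3)
@assert broadcast_strides(size(A), strides(A)) == (1, 0)
```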

> Yeah, I'm open to the idea that in a vectorized context, it's probably always preferable to use BitArray, so maybe we should just always convert to bitstorage? I guess that could impose some hefty overhead though.

My hope was that LLVM could avoid converting to bitstorage when it's more efficient not to.
However, because

load bytes -> store bytes

and

load bytes -> trunc to bits -> extend to bytes -> store bytes

have observably different behavior (the round trip forcibly zeroes the upper 7 bits of each byte, and LLVM cannot assume they were already zero, even though a valid Bool guarantees it), LLVM will not do that optimization.

Meaning I think I should handle this with promotion / conversion methods, and try to otherwise keep things in their natural representation.
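The non-equivalence is easy to see at the byte level; this mirrors the IR chains above in plain Julia:

```julia
# Why LLVM can't fold trunc -> zext away: zero-extension forces the upper
# 7 bits of the byte to 0, which only round-trips if they were already 0.
dirty = 0b1000_0001          # a byte whose "padding" bits are not all zero
bit   = dirty & 0x01         # trunc to i1: keep only the low bit
back  = UInt8(bit)           # zext back to i8
@assert back == 0x01
@assert back != dirty        # observably different: the high bit is gone
```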

@chriselrod
Member

chriselrod commented Apr 9, 2020

I had hoped this would let LLVM make that optimization:

using VectorizationBase, SIMDPirates
@generated function vload(ptr::Ptr{Bool}, i::_MM{W,I}) where {W,I<:Integer}
    U = VectorizationBase.mask_type(W)
    utype = "i$(VectorizationBase.nextpow2(W))"
    itype = "i$(8sizeof(I))"
    ptyp = "i$(8sizeof(Int))"
    instrs = String[]
    push!(instrs, "%ptrbool = inttoptr $ptyp %0 to i8*")
    push!(instrs, "%ptroffset = getelementptr inbounds i8, i8* %ptrbool, $itype %1")
    push!(instrs, "%ptrvbool = bitcast i8* %ptroffset to <$W x i8>*")
    push!(instrs, "%vbool = load <$W x i8>, <$W x i8>* %ptrvbool")
    push!(instrs, "%vbooltrunc = trunc <$W x i8> %vbool to <$W x i1>")
    if 8sizeof(U) == W
        push!(instrs, "%mask = bitcast <$W x i1> %vbooltrunc to $utype")
    else
        push!(instrs, "%masktrunc = bitcast <$W x i1> %vbooltrunc to i$(W)")
        push!(instrs, "%mask = zext i$(W) %masktrunc to $utype")
    end
    push!(instrs, "ret $utype %mask")
    quote
        $(Expr(:meta,:inline))
        Mask{$W}(Base.llvmcall(
            $(join(instrs,"\n")),
            $U,
            Tuple{Ptr{Bool},$I},
            ptr, i.i
        ))
    end
end
@generated function vstore!(ptr::Ptr{Bool}, v::Mask{W,U}, i::I) where {W,U,I<:Integer}
    utype = "i$(8sizeof(U))"
    itype = "i$(8sizeof(I))"
    ptyp = "i$(8sizeof(Int))"
    instrs = String[]
    push!(instrs, "%ptrbool = inttoptr $ptyp %0 to i8*")
    push!(instrs, "%ptroffset = getelementptr inbounds i8, i8* %ptrbool, $itype %2")
    push!(instrs, "%ptr = bitcast i8* %ptroffset to <$(8W) x i1>*")
    if W == 8sizeof(U)
        push!(instrs, "%mask = bitcast $utype %1 to <$W x i1>")
    else
        push!(instrs, "%masktrunc = trunc $utype %1 to i$(W)")
        push!(instrs, "%mask = bitcast i$(W) %masktrunc to <$W x i1>")
    end
    push!(instrs, "%maskzext = zext <$W x i1> %mask to <$W x i8>")
    push!(instrs, "%vbits = bitcast <$W x i8> %maskzext to <$(8W) x i1>")
    undefmask = ""
    for w in 1:W
        for b in 1:7
            undefmask *= " i1 undef,"
        end
        undefmask *= w == W ? " i1 1" : " i1 1,"
    end
    push!(instrs, "%vundefbits = and <$(8W) x i1> %vbits, <$undefmask >")
    push!(instrs, "store <$(8W) x i1> %vundefbits, <$(8W) x i1>* %ptr")
    push!(instrs, "ret void")
    quote
        $(Expr(:meta,:inline))
        Base.llvmcall(
            $(join(instrs,"\n")),
            Cvoid,
            Tuple{Ptr{Bool}, $U, $I},
            ptr, v.u, i
        )
    end
end
function test!(ci, ai, bi)
    ma = vload(pointer(ai), _MM{4}(0))
    mb = vload(pointer(bi), _MM{4}(0))
    vstore!(pointer(ci), ma & mb, 0)
end
ai = [false for _ in 1:64];
bi = [true for _ in 1:64];
ci = similar(ai);
@code_llvm debuginfo=:none test!(ci, ai, bi)
@code_native debuginfo=:none test!(ci, ai, bi)

The intent was that the vstore! method above ands the stored vector with a mask that is undef in each of the 7 padding-bit lanes. I was hoping that would tell LLVM we don't care what gets stored there, giving it license to do whatever it wants with those bits.
But it didn't work.
It should generate similar code to

aii = fill(zero(UInt32), 2);
bii = fill(0x01010101, 2);
cii = similar(aii);
test2!(cii, aii, bii) = (@inbounds cii[1] = aii[1] & bii[1]; nothing)
@code_llvm debuginfo=:none test2!(cii, aii, bii)
@code_native debuginfo=:none test2!(cii, aii, bii)

@chriselrod
Member

I fixed this locally.

julia> @benchmark $ai .& $bi
BenchmarkTools.Trial: 
  memory estimate:  128 bytes
  allocs estimate:  2
  --------------
  minimum time:     124.222 ns (0.00% GC)
  median time:      132.327 ns (0.00% GC)
  mean time:        141.357 ns (4.84% GC)
  maximum time:     3.563 μs (95.34% GC)
  --------------
  samples:          10000
  evals/sample:     893

julia> @benchmark @avx $ai .& $bi
BenchmarkTools.Trial: 
  memory estimate:  128 bytes
  allocs estimate:  2
  --------------
  minimum time:     32.483 ns (0.00% GC)
  median time:      35.576 ns (0.00% GC)
  mean time:        43.981 ns (16.89% GC)
  maximum time:     3.383 μs (98.14% GC)
  --------------
  samples:          10000
  evals/sample:     994

julia> @benchmark $a .& $b
BenchmarkTools.Trial: 
  memory estimate:  128 bytes
  allocs estimate:  2
  --------------
  minimum time:     29.661 ns (0.00% GC)
  median time:      32.293 ns (0.00% GC)
  mean time:        40.743 ns (18.48% GC)
  maximum time:     3.325 μs (98.16% GC)
  --------------
  samples:          10000
  evals/sample:     995

julia> @benchmark @avx $a .& $b
BenchmarkTools.Trial: 
  memory estimate:  128 bytes
  allocs estimate:  2
  --------------
  minimum time:     32.307 ns (0.00% GC)
  median time:      34.880 ns (0.00% GC)
  mean time:        43.482 ns (17.59% GC)
  maximum time:     3.384 μs (98.12% GC)
  --------------
  samples:          10000
  evals/sample:     994

I'm still working on an unrelated optimization. I'll push once I've added it.

@chriselrod
Member

The tests failed on Mac, Julia 1.1, and no AVX2:
https://travis-ci.com/github/chriselrod/LoopVectorization.jl/jobs/318363219
The lack of AVX2 is probably the culprit: you need at least AVX2 for SIMD integer operations.

I'll look at it in the morning.

@chriselrod
Member

I had hoped compiling a generic/multiversioning system image and launching Julia-1.1 locally with -C=sandybridge would be a means of reproducing this locally, but I get

> ./julia -O3 --startup=no -C=sandybridge
ERROR: Your CPU does not support the CX16 instruction, which is required by this version of Julia!  This is often due to running inside of a virtualized environment.  Please read https://docs.julialang.org/en/stable/devdocs/sysimg/ for more.

Same error for ivybridge.
The next architecture is Haswell, which has AVX2, and therefore I don't expect it to be able to reproduce the problem.

@chriselrod
Member

This works:

  • with AVX2, and
  • without AVX2 on Julia 1.4 and greater.

I'll close this as a "wontfix" for no-AVX2 and Julia < v1.3, unless someone needs that combination and can help me debug it.
