Vectorized Bitwise operations #91
Okay, I'll add support for vectorized byte-wise boolean operations. Bit-wise boolean and byte-wise integer operations should already be supported (as always, please keep filing issues as you run into problems). Normally LoopVectorization represents bitwise operations as bit masks. Sometimes representing bools as bits instead of bytes (as a BitArray does) can be much faster:

julia> @benchmark @. $ci = $ai & $bi # 64
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 27.019 ns (0.00% GC)
median time: 33.704 ns (0.00% GC)
mean time: 33.295 ns (0.00% GC)
maximum time: 44.838 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 993
julia> @benchmark $(Ref(0))[] & $(Ref(-1))[] # 64 at a time
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.103 ns (0.00% GC)
median time: 1.107 ns (0.00% GC)
mean time: 1.113 ns (0.00% GC)
maximum time: 10.978 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
julia> bitstring(-1)
"1111111111111111111111111111111111111111111111111111111111111111"
julia> bitstring(0)
"0000000000000000000000000000000000000000000000000000000000000000"
julia> using SIMDPirates
julia> alltrue = vbroadcast(Val(8), typemax(UInt)) # should probably print using `repr` on elements
SVec{8,UInt64}<18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615>
julia> allfalse = vbroadcast(Val(8), zero(UInt))
SVec{8,UInt64}<0, 0, 0, 0, 0, 0, 0, 0>
julia> @benchmark $(Ref(alltrue))[] & $(Ref(allfalse))[] # 512 at a time
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.181 ns (0.00% GC)
median time: 1.184 ns (0.00% GC)
mean time: 1.194 ns (0.00% GC)
maximum time: 12.668 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000

But it depends on what you're doing with them. Accessing a single element, say the 17th bit, means extra shift-and-mask work on top of the load.
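For concreteness, the difference in sketch form (the `byteread`/`bitread` helpers are hypothetical, for illustration only):

```julia
# Byte storage: reading element 17 is a single load.
# Bit storage: it takes a load plus a shift and a mask.
bytes = zeros(Bool, 64); bytes[17] = true
chunk = UInt64(1) << 16               # same data packed into one word (bit 17 set)

byteread(v, i) = v[i]                                          # direct load
bitread(c, i) = ((c >> (i - 1)) & one(UInt64)) == one(UInt64)  # shift + mask + compare

@assert byteread(bytes, 17) && bitread(chunk, 17)
```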
Yeah, I'm open to the idea that in a vectorized context it's probably always preferable to use BitArray, so maybe we should just always convert to bit storage? I guess that could impose some hefty overhead, though.
Looks like LoopVectorization doesn't know how to deal with BitArray yet either:

julia> a = trues(64); b = falses(64); typeof(a)
BitArray{1}
julia> @avx a .& b;
ERROR: conversion to pointer not defined for BitArray{1}
Stacktrace:
[1] error(::String) at ./error.jl:33
[2] unsafe_convert(::Type{Ptr{Bool}}, ::BitArray{1}) at ./pointer.jl:67
[3] pointer(::BitArray{1}) at ./abstractarray.jl:933
[4] stridedpointer_for_broadcast at /home/mason/.julia/packages/VectorizationBase/T6nfs/src/vectorizable.jl:612 [inlined]
[5] macro expansion at /home/mason/.julia/packages/LoopVectorization/UOrSO/src/broadcast.jl:231 [inlined]
[6] vmaterialize! at /home/mason/.julia/packages/LoopVectorization/UOrSO/src/broadcast.jl:231 [inlined]
[7] vmaterialize(::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(&),Tuple{BitArray{1},BitArray{1}}}, ::Val{:Main}) at /home/mason/.julia/packages/LoopVectorization/UOrSO/src/broadcast.jl:278
[8] top-level scope at REPL[20]:1
[9] eval(::Module, ::Any) at ./boot.jl:331
[10] eval_user_input(::Any, ::REPL.REPLBackend) at /home/mason/julia/usr/share/julia/stdlib/v1.4/REPL/src/REPL.jl:86
[11] run_backend(::REPL.REPLBackend) at /home/mason/.julia/packages/Revise/Pcs5V/src/Revise.jl:1073
[12] top-level scope at none:0
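Until `@avx` understands BitArray, the operation can be done on the BitArray's underlying storage. This sketch pokes at the internal `chunks` field (a `Vector{UInt64}`), which is not a public API:

```julia
# A BitArray keeps its bits packed in the `chunks::Vector{UInt64}` field,
# so broadcasting over the chunks handles 64 elements per iteration.
# Relies on BitArray internals; not a public API.
a = trues(64); b = falses(64); c = falses(64)
c.chunks .= a.chunks .& b.chunks      # one UInt64 AND per 64 bits
@assert c == (a .& b)
```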
You're right, thanks for testing. I also didn't add support for broadcasting them; I should take another look at that.
My hope was that LLVM could avoid converting to bit storage when it's more efficient not to. Meaning I think I should handle this with promotion/conversion methods, and otherwise try to keep things in their natural representation.
I had hoped this would let LLVM make that optimization:

using VectorizationBase, SIMDPirates
@generated function vload(ptr::Ptr{Bool}, i::_MM{W,I}) where {W,I<:Integer}
U = VectorizationBase.mask_type(W)
utype = "i$(VectorizationBase.nextpow2(W))"
itype = "i$(8sizeof(I))"
ptyp = "i$(8sizeof(Int))"
instrs = String[]
push!(instrs, "%ptrbool = inttoptr $ptyp %0 to i8*")
push!(instrs, "%ptroffset = getelementptr inbounds i8, i8* %ptrbool, $itype %1")
push!(instrs, "%ptrvbool = bitcast i8* %ptroffset to <$W x i8>*")
push!(instrs, "%vbool = load <$W x i8>, <$W x i8>* %ptrvbool")
push!(instrs, "%vbooltrunc = trunc <$W x i8> %vbool to <$W x i1>")
if 8sizeof(U) == W
push!(instrs, "%mask = bitcast <$W x i1> %vbooltrunc to $utype")
else
push!(instrs, "%masktrunc = bitcast <$W x i1> %vbooltrunc to i$(W)")
push!(instrs, "%mask = zext i$(W) %masktrunc to $utype")
end
push!(instrs, "ret $utype %mask")
quote
$(Expr(:meta,:inline))
Mask{$W}(Base.llvmcall(
$(join(instrs,"\n")),
$U,
Tuple{Ptr{Bool},$I},
ptr, i.i
))
end
end
@generated function vstore!(ptr::Ptr{Bool}, v::Mask{W,U}, i::I) where {W,U,I<:Integer}
utype = "i$(8sizeof(U))"
itype = "i$(8sizeof(I))"
ptyp = "i$(8sizeof(Int))"
instrs = String[]
push!(instrs, "%ptrbool = inttoptr $ptyp %0 to i8*")
push!(instrs, "%ptroffset = getelementptr inbounds i8, i8* %ptrbool, $itype %2")
push!(instrs, "%ptr = bitcast i8* %ptroffset to <$(8W) x i1>*")
if W == 8sizeof(U)
push!(instrs, "%mask = bitcast $utype %1 to <$W x i1>")
else
push!(instrs, "%masktrunc = trunc $utype %1 to i$(W)")
push!(instrs, "%mask = bitcast i$(W) %masktrunc to <$W x i1>")
end
push!(instrs, "%maskzext = zext <$W x i1> %mask to <$W x i8>")
push!(instrs, "%vbits = bitcast <$W x i8> %maskzext to <$(8W) x i1>")
undefmask = ""
for w in 1:W
# keep the low bit of each byte (it holds the zext'd bool); the 7 high bits are don't-care
undefmask *= " i1 1,"
for b in 1:7
undefmask *= w == W && b == 7 ? " i1 undef" : " i1 undef,"
end
end
push!(instrs, "%vundefbits = and <$(8W) x i1> %vbits, <$undefmask >")
push!(instrs, "store <$(8W) x i1> %vundefbits, <$(8W) x i1>* %ptr")
push!(instrs, "ret void")
quote
$(Expr(:meta,:inline))
Base.llvmcall(
$(join(instrs,"\n")),
Cvoid,
Tuple{Ptr{Bool}, $U, $I},
ptr, v.u, i
)
end
end
function test!(ci, ai, bi)
ma = vload(pointer(ai), _MM{4}(0))
mb = vload(pointer(bi), _MM{4}(0))
vstore!(pointer(ci), ma & mb, 0)
end
ai = [false for _ in 1:64];
bi = [true for _ in 1:64];
ci = similar(ai);
@code_llvm debuginfo=:none test!(ci, ai, bi)
@code_native debuginfo=:none test!(ci, ai, bi)

The intent was that the `and` with `undef` lanes would tell LLVM the padding bits of each byte are don't-care. For comparison:

aii = fill(zero(UInt32), 2);
bii = fill(0x01010101, 2);
cii = similar(aii);
test2!(cii, aii, bii) = (@inbounds cii[1] = aii[1] & bii[1]; nothing)
@code_llvm debuginfo=:none test2!(cii, aii, bii)
@code_native debuginfo=:none test2!(cii, aii, bii)

I fixed this locally.

julia> @benchmark $ai .& $bi
BenchmarkTools.Trial:
memory estimate: 128 bytes
allocs estimate: 2
--------------
minimum time: 124.222 ns (0.00% GC)
median time: 132.327 ns (0.00% GC)
mean time: 141.357 ns (4.84% GC)
maximum time: 3.563 μs (95.34% GC)
--------------
samples: 10000
evals/sample: 893
julia> @benchmark @avx $ai .& $bi
BenchmarkTools.Trial:
memory estimate: 128 bytes
allocs estimate: 2
--------------
minimum time: 32.483 ns (0.00% GC)
median time: 35.576 ns (0.00% GC)
mean time: 43.981 ns (16.89% GC)
maximum time: 3.383 μs (98.14% GC)
--------------
samples: 10000
evals/sample: 994
julia> @benchmark $a .& $b
BenchmarkTools.Trial:
memory estimate: 128 bytes
allocs estimate: 2
--------------
minimum time: 29.661 ns (0.00% GC)
median time: 32.293 ns (0.00% GC)
mean time: 40.743 ns (18.48% GC)
maximum time: 3.325 μs (98.16% GC)
--------------
samples: 10000
evals/sample: 995
julia> @benchmark @avx $a .& $b
BenchmarkTools.Trial:
memory estimate: 128 bytes
allocs estimate: 2
--------------
minimum time: 32.307 ns (0.00% GC)
median time: 34.880 ns (0.00% GC)
mean time: 43.482 ns (17.59% GC)
maximum time: 3.384 μs (98.12% GC)
--------------
samples: 10000
evals/sample: 994

I'm still working on an unrelated optimization. I'll push once I've added it.
The tests failed on Mac, Julia 1.1, and no-AVX2. I'll look at it in the morning.
I had hoped compiling a generic/multiversioning system image and launching Julia 1.1 locally with a pinned CPU target would let me reproduce it, but:

> ./julia -O3 --startup=no -C=sandybridge
ERROR: Your CPU does not support the CX16 instruction, which is required by this version of Julia! This is often due to running inside of a virtualized environment. Please read https://docs.julialang.org/en/stable/devdocs/sysimg/ for more.

Same error for ivybridge.
This works with:
I'll close this as a "wontfix" for no-AVX2 and Julia < v1.3, unless someone needs that combination and can help me debug it. |
It'd be nice if this worked: