Refactor @avx with generated functions #17
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master      #17      +/-   ##
==========================================
+ Coverage   84.43%   86.41%   +1.98%
==========================================
  Files          11       11
  Lines        1330     1583     +253
==========================================
+ Hits         1123     1368     +245
- Misses        207      215       +8

Continue to review full report at Codecov.
Sorry for the delay here. I've been busy trying to make the library better able to handle different loops. I'm starting to look at this now.

julia> mydotavx(a,b)
quote
#= /home/c285497/.julia/dev/LoopVectorization/src/_avx.jl:74 =#
begin
var"##a#274" = var_nt[Symbol("##a#274")]
var"##b#275" = var_nt[Symbol("##b#275")]
var"##s#276" = var_nt[Symbol("##s#276")]
end
#= /home/c285497/.julia/dev/LoopVectorization/src/_avx.jl:75 =#
begin
begin
var"##T#283" = promote_type(eltype(a), eltype(b))
var"##W#282" = LoopVectorization.pick_vector_width_val(var"##T#283")
var"##loopi#284" = length(var"##a#274")
var"##vptr##_a" = LoopVectorization.stridedpointer(a)
var"##vptr##_b" = LoopVectorization.stridedpointer(b)
var"##mask##" = LoopVectorization.masktable(var"##W#282", LoopVectorization.valrem(var"##W#282", var"##loopi#284"))
end
begin
#= /home/c285497/.julia/dev/LoopVectorization/src/lowering.jl:651 =# GC.@preserve a b begin
i = LoopVectorization._MM(var"##W#282", 0)
var"####s#276_0" = LoopVectorization.vbroadcast(var"##W#282", zero(var"##T#283"))
var"####s#276_1" = LoopVectorization.vbroadcast(var"##W#282", zero(var"##T#283"))
var"####s#276_2" = LoopVectorization.vbroadcast(var"##W#282", zero(var"##T#283"))
var"####s#276_3" = LoopVectorization.vbroadcast(var"##W#282", zero(var"##T#283"))
while i < var"##loopi#284" - LoopVectorization.valmuladd(var"##W#282", 4, -1)
var"####tempload#285_0" = LoopVectorization.vload(var"##vptr##_a", (i,))
var"####tempload#285_1" = LoopVectorization.vload(var"##vptr##_a", (i + LoopVectorization.valmul(var"##W#282", 1),))
var"####tempload#285_2" = LoopVectorization.vload(var"##vptr##_a", (i + LoopVectorization.valmul(var"##W#282", 2),))
var"####tempload#285_3" = LoopVectorization.vload(var"##vptr##_a", (i + LoopVectorization.valmul(var"##W#282", 3),))
var"####tempload#286_0" = LoopVectorization.vload(var"##vptr##_b", (i,))
var"####tempload#286_1" = LoopVectorization.vload(var"##vptr##_b", (i + LoopVectorization.valmul(var"##W#282", 1),))
var"####tempload#286_2" = LoopVectorization.vload(var"##vptr##_b", (i + LoopVectorization.valmul(var"##W#282", 2),))
var"####tempload#286_3" = LoopVectorization.vload(var"##vptr##_b", (i + LoopVectorization.valmul(var"##W#282", 3),))
var"####s#276_0" = LoopVectorization.vmuladd(var"####tempload#285_0", var"####tempload#286_0", var"####s#276_0")
var"####s#276_1" = LoopVectorization.vmuladd(var"####tempload#285_1", var"####tempload#286_1", var"####s#276_1")
var"####s#276_2" = LoopVectorization.vmuladd(var"####tempload#285_2", var"####tempload#286_2", var"####s#276_2")
var"####s#276_3" = LoopVectorization.vmuladd(var"####tempload#285_3", var"####tempload#286_3", var"####s#276_3")
i += LoopVectorization.valmul(var"##W#282", 4)
end
if var"##loopi#284" != i
if i > var"##loopi#284" - LoopVectorization.valmuladd(var"##W#282", 1, 1)
var"####tempload#285_0" = LoopVectorization.vload(var"##vptr##_a", (i,), var"##mask##")
var"####tempload#286_0" = LoopVectorization.vload(var"##vptr##_b", (i,), var"##mask##")
var"####s#276_0" = LoopVectorization.vifelse(var"##mask##", LoopVectorization.vmuladd(var"####tempload#285_0", var"####tempload#286_0", var"####s#276_0"), var"####s#276_0")
i += LoopVectorization.valmul(var"##W#282", 1)
elseif i > var"##loopi#284" - LoopVectorization.valmuladd(var"##W#282", 2, 1)
var"####tempload#285_0" = LoopVectorization.vload(var"##vptr##_a", (i,))
var"####tempload#285_1" = LoopVectorization.vload(var"##vptr##_a", (i + LoopVectorization.valmul(var"##W#282", 1),), var"##mask##")
var"####tempload#286_0" = LoopVectorization.vload(var"##vptr##_b", (i,))
var"####tempload#286_1" = LoopVectorization.vload(var"##vptr##_b", (i + LoopVectorization.valmul(var"##W#282", 1),), var"##mask##")
var"####s#276_0" = LoopVectorization.vmuladd(var"####tempload#285_0", var"####tempload#286_0", var"####s#276_0")
var"####s#276_1" = LoopVectorization.vifelse(var"##mask##", LoopVectorization.vmuladd(var"####tempload#285_1", var"####tempload#286_1", var"####s#276_1"), var"####s#276_1")
i += LoopVectorization.valmul(var"##W#282", 2)
elseif i > var"##loopi#284" - LoopVectorization.valmuladd(var"##W#282", 3, 1)
var"####tempload#285_0" = LoopVectorization.vload(var"##vptr##_a", (i,))
var"####tempload#285_1" = LoopVectorization.vload(var"##vptr##_a", (i + LoopVectorization.valmul(var"##W#282", 1),))
var"####tempload#285_2" = LoopVectorization.vload(var"##vptr##_a", (i + LoopVectorization.valmul(var"##W#282", 2),), var"##mask##")
var"####tempload#286_0" = LoopVectorization.vload(var"##vptr##_b", (i,))
var"####tempload#286_1" = LoopVectorization.vload(var"##vptr##_b", (i + LoopVectorization.valmul(var"##W#282", 1),))
var"####tempload#286_2" = LoopVectorization.vload(var"##vptr##_b", (i + LoopVectorization.valmul(var"##W#282", 2),), var"##mask##")
var"####s#276_0" = LoopVectorization.vmuladd(var"####tempload#285_0", var"####tempload#286_0", var"####s#276_0")
var"####s#276_1" = LoopVectorization.vmuladd(var"####tempload#285_1", var"####tempload#286_1", var"####s#276_1")
var"####s#276_2" = LoopVectorization.vifelse(var"##mask##", LoopVectorization.vmuladd(var"####tempload#285_2", var"####tempload#286_2", var"####s#276_2"), var"####s#276_2")
i += LoopVectorization.valmul(var"##W#282", 3)
elseif i > var"##loopi#284" - LoopVectorization.valmuladd(var"##W#282", 4, 1)
var"####tempload#285_0" = LoopVectorization.vload(var"##vptr##_a", (i,))
var"####tempload#285_1" = LoopVectorization.vload(var"##vptr##_a", (i + LoopVectorization.valmul(var"##W#282", 1),))
var"####tempload#285_2" = LoopVectorization.vload(var"##vptr##_a", (i + LoopVectorization.valmul(var"##W#282", 2),))
var"####tempload#285_3" = LoopVectorization.vload(var"##vptr##_a", (i + LoopVectorization.valmul(var"##W#282", 3),), var"##mask##")
var"####tempload#286_0" = LoopVectorization.vload(var"##vptr##_b", (i,))
var"####tempload#286_1" = LoopVectorization.vload(var"##vptr##_b", (i + LoopVectorization.valmul(var"##W#282", 1),))
var"####tempload#286_2" = LoopVectorization.vload(var"##vptr##_b", (i + LoopVectorization.valmul(var"##W#282", 2),))
var"####tempload#286_3" = LoopVectorization.vload(var"##vptr##_b", (i + LoopVectorization.valmul(var"##W#282", 3),), var"##mask##")
var"####s#276_0" = LoopVectorization.vmuladd(var"####tempload#285_0", var"####tempload#286_0", var"####s#276_0")
var"####s#276_1" = LoopVectorization.vmuladd(var"####tempload#285_1", var"####tempload#286_1", var"####s#276_1")
var"####s#276_2" = LoopVectorization.vmuladd(var"####tempload#285_2", var"####tempload#286_2", var"####s#276_2")
var"####s#276_3" = LoopVectorization.vifelse(var"##mask##", LoopVectorization.vmuladd(var"####tempload#285_3", var"####tempload#286_3", var"####s#276_3"), var"####s#276_3")
i += LoopVectorization.valmul(var"##W#282", 4)
end
end
nothing
end
var"####s#276_0" = LoopVectorization.evadd(var"####s#276_0", var"####s#276_2")
var"####s#276_1" = LoopVectorization.evadd(var"####s#276_1", var"####s#276_3")
var"####s#276_0" = LoopVectorization.evadd(var"####s#276_0", var"####s#276_1")
var"##s#276" = LoopVectorization.reduced_add(var"##s#276", var"####s#276_0")
end
end
#= /home/c285497/.julia/dev/LoopVectorization/src/_avx.jl:76 =#
(var"##a#274", var"##b#275", var"##s#276")
end

It looks like the tuple keys got gensymed, while some of the references in the body (e.g. eltype(a) and stridedpointer(a)) still use the original names. I can confirm this by just returning the named tuple directly:

julia> mydotavx(a,b)
(##a#287 = [0.7702508858576866, 0.27125741138509984, 0.2512650413535191, 0.5204401098341251, 0.27205976231476003, 0.23474339992992976, 0.5570796460400627, 0.7205218785523702, 0.5984216579244239, 0.36296560923666066 … 0.07604538990377652, 0.6266985310926816, 0.34310891991732584, 0.8851344040903615, 0.7995187421164516, 0.0005825142632456259, 0.9282648483407934, 0.12082568282793815, 0.8090613866910383, 0.10680305880634222], ##b#288 = [0.38245589708592886, 0.8925247442184854, 0.24020525496226242, 0.30175188620269733, 0.7388350218559065, 0.7183704303506355, 0.17844015074792519, 0.997310375316625, 0.6475644703083443, 0.10300231950864136 … 0.17662478553283512, 0.5539501003250507, 0.15253593338506888, 0.7179117757050635, 0.27106266201048146, 0.5543679923558256, 0.2872232681542324, 0.9154144869489411, 0.9473090109761051, 0.5269456294707151], ##s#289 = 0.0) |
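To make the mismatch concrete, here is a toy renaming walk (a hypothetical helper, not the package's find_vars_and_gensym!) that does what we want: repeated references look up the same gensym that the key received, which is what the get(vars, x, x) change in the diff below provides.

using MacroTools: prewalk

# Toy renamer, for illustration only: gensym each new symbol and return the
# *same* gensym for later references instead of passing them through unchanged.
function rename(ex, skip)
    seen = Dict{Symbol,Symbol}()
    renamed = prewalk(ex) do x
        x isa Symbol && x ∉ skip ? get!(() -> gensym(x), seen, x) : x
    end
    renamed, seen
end

rename(:(s += a[i] * b[i]), (:i, :*))
# => rewritten expression plus Dict(:s => Symbol("##s#..."), :a => ..., :b => ...)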
Can I push to this remote PR? Try these changes:

$ git diff
diff --git a/src/_avx.jl b/src/_avx.jl
index 30b7017..314798e 100644
--- a/src/_avx.jl
+++ b/src/_avx.jl
@@ -34,7 +34,7 @@ function find_vars_and_gensym!(x::Symbol, vars::Dict{Symbol, Symbol}, ivars::Vec
push!(vars, x => gx)
gx
else
- x
+ get(vars, x, x)
end
end
@@ -56,8 +56,9 @@ macro _avx(ex)
tgvars = Tuple(values(D))
quote
- kwargs = nt($(QuoteNode(tgvars)), $(Expr(:tuple, tvars...)))
- $(Expr(:tuple, tvars...)) = _avx($(QuoteNode(type_ex)), kwargs)
+ kwargs = LoopVectorization.nt($(QuoteNode(tgvars)), $(Expr(:tuple, tvars...)))
+ $(Expr(:tuple, tvars...)) = LoopVectorization._avx($(QuoteNode(type_ex)), kwargs)
+ # LoopVectorization._avx($(QuoteNode(type_ex)), kwargs) # comment out the above line, uncomment this one, and get rid of the `@generated` on _avx to see the function body.
end |> esc
end

I don't see any type instabilities. However, why …

The next step is solving the heap allocations caused by creating tuples of heap-allocated objects. I have two proposals that should both work:

1. Create a function barrier. That is, have three layers: … This way the construction and deconstruction of tuples with heap objects takes place entirely inside the calling function, and the compiler will compile them away entirely (unless they escape the called function, in which case I guess that's what the user wanted). The non-inlined function's tuples are entirely of stack-allocated objects. (A rough sketch of this pattern follows below.)
2. Canonical ordering of arguments (i.e., …).
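A rough sketch of the function-barrier idea from proposal 1 (hypothetical names, and plain pointers instead of the package's stridedpointer, so this is only an illustration): the @inline outer layer unpacks the heap-allocated arrays into isbits values, and only those cross the @noinline boundary, so the wrapping tuples can be compiled away in the caller.

# Sketch only: the @inline entry point turns heap objects into isbits handles
# (raw pointers plus a length) before calling the non-inlined kernel.
@inline function barrier_entry(a::Vector{Float64}, b::Vector{Float64})
    n = min(length(a), length(b))
    GC.@preserve a b _kernel(pointer(a), pointer(b), n)
end

# The kernel only ever sees isbits arguments.
@noinline function _kernel(pa::Ptr{Float64}, pb::Ptr{Float64}, n::Int)
    s = 0.0
    for i in 1:n
        s += unsafe_load(pa, i) * unsafe_load(pb, i)
    end
    s
end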
Option 3 would be to just add $(Expr(:meta, :inline)) to the expression the generated function returns (see the complete diff below). With that change, @_avx matches @avx and doesn't allocate:

julia> function mydotavx(a, b)
s = 0.0
@avx for i ∈ eachindex(a,b)
s += a[i]*b[i]
end
s
end
mydotavx (generic function with 1 method)
julia> function mydot_avx(a, b)
s = 0.0
@_avx for i ∈ eachindex(a,b)
s += a[i]*b[i]
end
s
end
mydot_avx (generic function with 1 method)
julia> @benchmark mydotavx($a, $b)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 17.505 ns (0.00% GC)
median time: 17.518 ns (0.00% GC)
mean time: 17.840 ns (0.00% GC)
maximum time: 1.596 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 998
julia> @benchmark mydot_avx($a, $b)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 16.901 ns (0.00% GC)
median time: 17.511 ns (0.00% GC)
mean time: 17.655 ns (0.00% GC)
maximum time: 229.722 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 998

Complete diff (aside from merging master), if you want to apply it:

$ git diff
diff --git a/src/_avx.jl b/src/_avx.jl
index 30b7017..9476911 100644
--- a/src/_avx.jl
+++ b/src/_avx.jl
@@ -34,7 +34,7 @@ function find_vars_and_gensym!(x::Symbol, vars::Dict{Symbol, Symbol}, ivars::Vec
push!(vars, x => gx)
gx
else
- x
+ get(vars, x, x)
end
end
@@ -56,8 +56,9 @@ macro _avx(ex)
tgvars = Tuple(values(D))
quote
- kwargs = nt($(QuoteNode(tgvars)), $(Expr(:tuple, tvars...)))
- $(Expr(:tuple, tvars...)) = _avx($(QuoteNode(type_ex)), kwargs)
+ kwargs = LoopVectorization.nt($(QuoteNode(tgvars)), $(Expr(:tuple, tvars...)))
+ $(Expr(:tuple, tvars...)) = LoopVectorization._avx($(QuoteNode(type_ex)), kwargs)
+ # LoopVectorization._avx($(QuoteNode(type_ex)), kwargs) # comment out the above line, uncomment this one, and get rid of the `@generated` on _avx to see the function body.
end |> esc
end
@@ -70,6 +71,7 @@ end
end
quote
+ $(Expr(:meta,:inline))
$var_defs
$(lower(LoopSet(ex)))
$(Expr(:tuple, keys...))

I guess this would be good for merging (to have in addition to the current @avx). Was there anything else that should be added now?
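For reference, a small standalone example (hypothetical function, not the package code) of the Expr(:meta, :inline) trick the diff uses: putting it at the top of the expression a generated function returns asks the compiler to inline that body into the caller, which should let the tuple packing and unpacking around _avx be elided.

# Sketch: a generated function whose returned expression starts with
# Expr(:meta, :inline), so the generated body is inlined into the caller.
@generated function first_element(t::Tuple)
    quote
        $(Expr(:meta, :inline))
        t[1]
    end
end

first_element((1, 2.0, "three"))  # 1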
I went ahead and pushed my changes. Let me know if you're happy with this PR as is. If so, I'll go ahead and merge it.
Hi Chris, thanks for looking through and making those fixes. Always feel free to push to my PR branches.
The reason I gensym'd the variables was that without the gensyms, when I splat out the variables inside the generated function (or when the generated function returns), I'd get an UndefVarError. Here's an MWE showing what I mean:

julia> a = 1
1
julia> let
a = a
end
ERROR: UndefVarError: a not defined
Stacktrace:
[1] top-level scope at REPL[6]:2

The fact that this gensym forces recompilation every time the expression changes is a bit unfortunate. If you're okay merging it as is, I am too. Maybe let me write up some tests first though so we can make sure everything works correctly.
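For contrast, a small illustration (hand-writing a gensym-like name with the var"..." syntax; not the package code) of why a gensymed left-hand side avoids the self-shadowing in the MWE above: the right-hand a keeps referring to the outer binding.

julia> a = 1
1

julia> let
           var"##a#1" = a   # RHS `a` is still the outer binding; no self-shadowing
           var"##a#1" + 1
       end
2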
Hmm. What are the scoping rules in play there? An awful hack to find the reduction variables would be to create a LoopSet and read off its outer reductions:

julia> using LoopVectorization: LoopSet, name, operations
julia> dq = :(for i ∈ eachindex(a,b)
s += a[i]*b[i]
end)
:(for i = eachindex(a, b)
#= REPL[64]:2 =#
s += a[i] * b[i]
end)
julia> ls = LoopSet(dq);
julia> [name(operations(ls)[i]) for i ∈ ls.outer_reductions]
1-element Array{Symbol,1}:
:s

We don't want to be constructing a LoopSet more often than necessary, since doing so isn't free:

julia> using BenchmarkTools
julia> @benchmark LoopSet(dq)
BenchmarkTools.Trial:
memory estimate: 13.92 KiB
allocs estimate: 217
--------------
minimum time: 52.527 μs (0.00% GC)
median time: 59.864 μs (0.00% GC)
mean time: 64.836 μs (4.75% GC)
maximum time: 11.007 ms (98.07% GC)
--------------
samples: 10000
evals/sample: 1

However, it is substantially faster than:

julia> using MacroTools: prewalk
julia> using LoopVectorization: find_vars_and_gensym!, to_type
julia> function to_type_and_dict(ex)
D = Dict{Symbol,Symbol}()
ivars = Symbol[]
gex = prewalk(x -> find_vars_and_gensym!(x, D, ivars), ex)
to_type(gex), D
end
to_type_and_dict (generic function with 1 method)
julia> @benchmark to_type_and_dict(dq)
BenchmarkTools.Trial:
memory estimate: 145.35 KiB
allocs estimate: 2510
--------------
minimum time: 7.430 ms (0.00% GC)
median time: 9.247 ms (0.00% GC)
mean time: 9.411 ms (0.90% GC)
maximum time: 56.488 ms (79.40% GC)
--------------
samples: 531
evals/sample: 1

We'll also eventually have to work on a way to use the type information we get.
The macro is fine; it's the generated function that causes the scoping problem, since the generated function is a new scope.

Yeah, the way I'm doing things here is definitely not the direction we actually want to go. The real approach would be to have the macro parse the loop expression and then lower it to a series of … E.g. something like …, and then have …
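Purely to illustrate what "lowering the loop to a series of explicit calls" could look like, here is a hand-written, non-unrolled sketch that reuses the primitives visible in the generated code earlier in this thread; the call signatures are taken from that output rather than any documented API, and this is not the PR's actual lowering.

function mydot_lowered(a, b)
    T  = promote_type(eltype(a), eltype(b))
    W  = LoopVectorization.pick_vector_width_val(T)
    n  = length(a)
    pa = LoopVectorization.stridedpointer(a)
    pb = LoopVectorization.stridedpointer(b)
    acc = LoopVectorization.vbroadcast(W, zero(T))
    GC.@preserve a b begin
        i = LoopVectorization._MM(W, 0)
        # Main vector loop; a masked tail iteration (as in the generated code
        # above) would be needed for lengths that are not a multiple of W.
        while i < n - LoopVectorization.valmuladd(W, 1, -1)
            va  = LoopVectorization.vload(pa, (i,))
            vb  = LoopVectorization.vload(pb, (i,))
            acc = LoopVectorization.vmuladd(va, vb, acc)
            i  += LoopVectorization.valmul(W, 1)
        end
    end
    LoopVectorization.reduced_add(zero(T), acc)
end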
Input arguments are defined because they're inputs:

function foo(a)
a = a
end
foo(3)

I removed the gensyms. It probably won't cut down on recompilation a lot, but I figured it is worthwhile.

The goal would be to create a single LoopSet object with as much information about the loops as possible (and give it the ability to use the information we provide it). It'll need to extract the type info from … SIMDPirates will need a PR implementing …

For now, the first thing I'll work on is using some type information of the incoming arrays. E.g., if the array is an Adjoint or Transpose, that will help it make better loop order decisions.
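As a rough sketch of the kind of decision that type information enables (a hypothetical helper, not the package API): a generated function sees the argument types, so it can record that an Adjoint or Transpose wrapper means the underlying data is row-major and adjust the loop order accordingly.

using LinearAlgebra: Adjoint, Transpose

# Hypothetical helper: is the first index the contiguous (column-major) one?
iscolmajor(::Type{<:AbstractMatrix}) = true
iscolmajor(::Type{<:Union{Adjoint,Transpose}}) = false

iscolmajor(Matrix{Float64})                  # true
iscolmajor(typeof(transpose(rand(2, 2))))    # false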
See #14 for context and lead-up discussion.