Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor @avx with generated functions #17

Merged
merged 6 commits into from
Jan 16, 2020

Conversation

MasonProtter
Copy link
Contributor

@MasonProtter MasonProtter commented Jan 11, 2020

See #14 for context and lead-up discussion.

@codecov-io
Copy link

codecov-io commented Jan 11, 2020

Codecov Report

Merging #17 into master will increase coverage by 1.98%.
The diff coverage is 83.4%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #17      +/-   ##
==========================================
+ Coverage   84.43%   86.41%   +1.98%     
==========================================
  Files          11       11              
  Lines        1330     1583     +253     
==========================================
+ Hits         1123     1368     +245     
- Misses        207      215       +8
Impacted Files Coverage Δ
src/costs.jl 77.77% <ø> (-2.78%) ⬇️
src/LoopVectorization.jl 100% <ø> (ø) ⬆️
src/map.jl 100% <100%> (ø) ⬆️
src/_avx.jl 93.93% <100%> (+93.93%) ⬆️
src/determinestrategy.jl 95.86% <100%> (-1.12%) ⬇️
src/operations.jl 60.78% <57.14%> (-3.74%) ⬇️
src/graphs.jl 86.97% <84.18%> (-0.99%) ⬇️
src/lowering.jl 91.3% <88.53%> (+3.05%) ⬆️
src/broadcast.jl 88.59% <95.83%> (+0.3%) ⬆️
... and 4 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 27867b1...8b9cd1b. Read the comment docs.

@chriselrod
Copy link
Member

chriselrod commented Jan 16, 2020

Sorry for the delay here. I've been busy trying to make the library better able to handle different loops.

I'm starting to look at this now.
I haven't been able to get your code to run, which may be a clue regarding the type instability.
I took off the @generated and made a couple other changes so that I can see the generated function body:

julia> mydotavx(a,b)
quote
    #= /home/c285497/.julia/dev/LoopVectorization/src/_avx.jl:74 =#
    begin
        var"##a#274" = var_nt[Symbol("##a#274")]
        var"##b#275" = var_nt[Symbol("##b#275")]
        var"##s#276" = var_nt[Symbol("##s#276")]
    end
    #= /home/c285497/.julia/dev/LoopVectorization/src/_avx.jl:75 =#
    begin
        begin
            var"##T#283" = promote_type(eltype(a), eltype(b))
            var"##W#282" = LoopVectorization.pick_vector_width_val(var"##T#283")
            var"##loopi#284" = length(var"##a#274")
            var"##vptr##_a" = LoopVectorization.stridedpointer(a)
            var"##vptr##_b" = LoopVectorization.stridedpointer(b)
            var"##mask##" = LoopVectorization.masktable(var"##W#282", LoopVectorization.valrem(var"##W#282", var"##loopi#284"))
        end
        begin
            #= /home/c285497/.julia/dev/LoopVectorization/src/lowering.jl:651 =# GC.@preserve a b begin
                    i = LoopVectorization._MM(var"##W#282", 0)
                    var"####s#276_0" = LoopVectorization.vbroadcast(var"##W#282", zero(var"##T#283"))
                    var"####s#276_1" = LoopVectorization.vbroadcast(var"##W#282", zero(var"##T#283"))
                    var"####s#276_2" = LoopVectorization.vbroadcast(var"##W#282", zero(var"##T#283"))
                    var"####s#276_3" = LoopVectorization.vbroadcast(var"##W#282", zero(var"##T#283"))
                    while i < var"##loopi#284" - LoopVectorization.valmuladd(var"##W#282", 4, -1)
                        var"####tempload#285_0" = LoopVectorization.vload(var"##vptr##_a", (i,))
                        var"####tempload#285_1" = LoopVectorization.vload(var"##vptr##_a", (i + LoopVectorization.valmul(var"##W#282", 1),))
                        var"####tempload#285_2" = LoopVectorization.vload(var"##vptr##_a", (i + LoopVectorization.valmul(var"##W#282", 2),))
                        var"####tempload#285_3" = LoopVectorization.vload(var"##vptr##_a", (i + LoopVectorization.valmul(var"##W#282", 3),))
                        var"####tempload#286_0" = LoopVectorization.vload(var"##vptr##_b", (i,))
                        var"####tempload#286_1" = LoopVectorization.vload(var"##vptr##_b", (i + LoopVectorization.valmul(var"##W#282", 1),))
                        var"####tempload#286_2" = LoopVectorization.vload(var"##vptr##_b", (i + LoopVectorization.valmul(var"##W#282", 2),))
                        var"####tempload#286_3" = LoopVectorization.vload(var"##vptr##_b", (i + LoopVectorization.valmul(var"##W#282", 3),))
                        var"####s#276_0" = LoopVectorization.vmuladd(var"####tempload#285_0", var"####tempload#286_0", var"####s#276_0")
                        var"####s#276_1" = LoopVectorization.vmuladd(var"####tempload#285_1", var"####tempload#286_1", var"####s#276_1")
                        var"####s#276_2" = LoopVectorization.vmuladd(var"####tempload#285_2", var"####tempload#286_2", var"####s#276_2")
                        var"####s#276_3" = LoopVectorization.vmuladd(var"####tempload#285_3", var"####tempload#286_3", var"####s#276_3")
                        i += LoopVectorization.valmul(var"##W#282", 4)
                    end
                    if var"##loopi#284" != i
                        if i > var"##loopi#284" - LoopVectorization.valmuladd(var"##W#282", 1, 1)
                            var"####tempload#285_0" = LoopVectorization.vload(var"##vptr##_a", (i,), var"##mask##")
                            var"####tempload#286_0" = LoopVectorization.vload(var"##vptr##_b", (i,), var"##mask##")
                            var"####s#276_0" = LoopVectorization.vifelse(var"##mask##", LoopVectorization.vmuladd(var"####tempload#285_0", var"####tempload#286_0", var"####s#276_0"), var"####s#276_0")
                            i += LoopVectorization.valmul(var"##W#282", 1)
                        elseif i > var"##loopi#284" - LoopVectorization.valmuladd(var"##W#282", 2, 1)
                            var"####tempload#285_0" = LoopVectorization.vload(var"##vptr##_a", (i,))
                            var"####tempload#285_1" = LoopVectorization.vload(var"##vptr##_a", (i + LoopVectorization.valmul(var"##W#282", 1),), var"##mask##")
                            var"####tempload#286_0" = LoopVectorization.vload(var"##vptr##_b", (i,))
                            var"####tempload#286_1" = LoopVectorization.vload(var"##vptr##_b", (i + LoopVectorization.valmul(var"##W#282", 1),), var"##mask##")
                            var"####s#276_0" = LoopVectorization.vmuladd(var"####tempload#285_0", var"####tempload#286_0", var"####s#276_0")
                            var"####s#276_1" = LoopVectorization.vifelse(var"##mask##", LoopVectorization.vmuladd(var"####tempload#285_1", var"####tempload#286_1", var"####s#276_1"), var"####s#276_1")
                            i += LoopVectorization.valmul(var"##W#282", 2)
                        elseif i > var"##loopi#284" - LoopVectorization.valmuladd(var"##W#282", 3, 1)
                            var"####tempload#285_0" = LoopVectorization.vload(var"##vptr##_a", (i,))
                            var"####tempload#285_1" = LoopVectorization.vload(var"##vptr##_a", (i + LoopVectorization.valmul(var"##W#282", 1),))
                            var"####tempload#285_2" = LoopVectorization.vload(var"##vptr##_a", (i + LoopVectorization.valmul(var"##W#282", 2),), var"##mask##")
                            var"####tempload#286_0" = LoopVectorization.vload(var"##vptr##_b", (i,))
                            var"####tempload#286_1" = LoopVectorization.vload(var"##vptr##_b", (i + LoopVectorization.valmul(var"##W#282", 1),))
                            var"####tempload#286_2" = LoopVectorization.vload(var"##vptr##_b", (i + LoopVectorization.valmul(var"##W#282", 2),), var"##mask##")
                            var"####s#276_0" = LoopVectorization.vmuladd(var"####tempload#285_0", var"####tempload#286_0", var"####s#276_0")
                            var"####s#276_1" = LoopVectorization.vmuladd(var"####tempload#285_1", var"####tempload#286_1", var"####s#276_1")
                            var"####s#276_2" = LoopVectorization.vifelse(var"##mask##", LoopVectorization.vmuladd(var"####tempload#285_2", var"####tempload#286_2", var"####s#276_2"), var"####s#276_2")
                            i += LoopVectorization.valmul(var"##W#282", 3)
                        elseif i > var"##loopi#284" - LoopVectorization.valmuladd(var"##W#282", 4, 1)
                            var"####tempload#285_0" = LoopVectorization.vload(var"##vptr##_a", (i,))
                            var"####tempload#285_1" = LoopVectorization.vload(var"##vptr##_a", (i + LoopVectorization.valmul(var"##W#282", 1),))
                            var"####tempload#285_2" = LoopVectorization.vload(var"##vptr##_a", (i + LoopVectorization.valmul(var"##W#282", 2),))
                            var"####tempload#285_3" = LoopVectorization.vload(var"##vptr##_a", (i + LoopVectorization.valmul(var"##W#282", 3),), var"##mask##")
                            var"####tempload#286_0" = LoopVectorization.vload(var"##vptr##_b", (i,))
                            var"####tempload#286_1" = LoopVectorization.vload(var"##vptr##_b", (i + LoopVectorization.valmul(var"##W#282", 1),))
                            var"####tempload#286_2" = LoopVectorization.vload(var"##vptr##_b", (i + LoopVectorization.valmul(var"##W#282", 2),))
                            var"####tempload#286_3" = LoopVectorization.vload(var"##vptr##_b", (i + LoopVectorization.valmul(var"##W#282", 3),), var"##mask##")
                            var"####s#276_0" = LoopVectorization.vmuladd(var"####tempload#285_0", var"####tempload#286_0", var"####s#276_0")
                            var"####s#276_1" = LoopVectorization.vmuladd(var"####tempload#285_1", var"####tempload#286_1", var"####s#276_1")
                            var"####s#276_2" = LoopVectorization.vmuladd(var"####tempload#285_2", var"####tempload#286_2", var"####s#276_2")
                            var"####s#276_3" = LoopVectorization.vifelse(var"##mask##", LoopVectorization.vmuladd(var"####tempload#285_3", var"####tempload#286_3", var"####s#276_3"), var"####s#276_3")
                            i += LoopVectorization.valmul(var"##W#282", 4)
                        end
                    end
                    nothing
                end
            var"####s#276_0" = LoopVectorization.evadd(var"####s#276_0", var"####s#276_2")
            var"####s#276_1" = LoopVectorization.evadd(var"####s#276_1", var"####s#276_3")
            var"####s#276_0" = LoopVectorization.evadd(var"####s#276_0", var"####s#276_1")
            var"##s#276" = LoopVectorization.reduced_add(var"##s#276", var"####s#276_0")
        end
    end
    #= /home/c285497/.julia/dev/LoopVectorization/src/_avx.jl:76 =#
    (var"##a#274", var"##b#275", var"##s#276")
end

It looks like the tuple keys got gensymed, while the references to a and b within the function did not (resulting in an UndefVarError when I tried it).

I can confirm this by just returning the named tuple directly:

julia> mydotavx(a,b)
(##a#287 = [0.7702508858576866, 0.27125741138509984, 0.2512650413535191, 0.5204401098341251, 0.27205976231476003, 0.23474339992992976, 0.5570796460400627, 0.7205218785523702, 0.5984216579244239, 0.36296560923666066  …  0.07604538990377652, 0.6266985310926816, 0.34310891991732584, 0.8851344040903615, 0.7995187421164516, 0.0005825142632456259, 0.9282648483407934, 0.12082568282793815, 0.8090613866910383, 0.10680305880634222], ##b#288 = [0.38245589708592886, 0.8925247442184854, 0.24020525496226242, 0.30175188620269733, 0.7388350218559065, 0.7183704303506355, 0.17844015074792519, 0.997310375316625, 0.6475644703083443, 0.10300231950864136  …  0.17662478553283512, 0.5539501003250507, 0.15253593338506888, 0.7179117757050635, 0.27106266201048146, 0.5543679923558256, 0.2872232681542324, 0.9154144869489411, 0.9473090109761051, 0.5269456294707151], ##s#289 = 0.0)

@chriselrod
Copy link
Member

Can I push to this remote PR?

Try these changes:

$ git diff
diff --git a/src/_avx.jl b/src/_avx.jl
index 30b7017..314798e 100644
--- a/src/_avx.jl
+++ b/src/_avx.jl
@@ -34,7 +34,7 @@ function find_vars_and_gensym!(x::Symbol, vars::Dict{Symbol, Symbol}, ivars::Vec
         push!(vars, x => gx)
         gx
     else
-        x
+        get(vars, x, x)
     end
 end

@@ -56,8 +56,9 @@ macro _avx(ex)
     tgvars = Tuple(values(D))

     quote
-        kwargs = nt($(QuoteNode(tgvars)), $(Expr(:tuple, tvars...)))
-        $(Expr(:tuple, tvars...)) = _avx($(QuoteNode(type_ex)), kwargs)
+        kwargs = LoopVectorization.nt($(QuoteNode(tgvars)), $(Expr(:tuple, tvars...)))
+        $(Expr(:tuple, tvars...)) = LoopVectorization._avx($(QuoteNode(type_ex)), kwargs)
+        # LoopVectorization._avx($(QuoteNode(type_ex)), kwargs) # comment out the above line, uncomment this one, and get rid of the `@generated` on _avx to see the function body.
     end |> esc
 end

I don't see any type instabilities.

However, why gensym the variables? Is this just to force recompilation of the generated function?
That seems reasonable while developing. We may want to have the option to not force recompilation.

The next step is solving the heap allocations caused by creating tuples of heap-allocated objects.
Currently, it constructs two such tuples: one on the function call, and another on the return.

I have two proposals that should both work:

1

Create a function barrier. That is, have three layers:

  1. The call site, where the macro replaces a loop with the first function call, passing it a named tuple.
  2. A wrapper function that has $(Expr(:meta,:inline)) inserted into the top of the generated function body to force inlining. This function uses GC.@preserve, and the body contains the LoopSet preamble, which creates the stridedpointers. None of those objects are heap allocated. It wraps them in another named tuple, and punts to layer 3.
  3. The vectorized loops (where we'll let the compiler decide whether or not it wants to inline).

This way the construction and deconstruction of tuples with heap objects takes place entirely inside the calling function, and the compiler will compile them away entirely (unless they escape the called function, in which case I guess that's what the user wanted).

The non-inlined function's tuples are entirely of stack-allocated objects.

2

Canonical ordering of arguments (ie, sort, unless there's a better idea as to an order) + some analysis to find out which variables are outer reduction variables to be returned, and which will not be returned.
Using a regular function call without constructing tuples saves us from allocating there, while our returned tuple (which may be empty) will only contain scalars reduced by the loop.

@chriselrod
Copy link
Member

chriselrod commented Jan 16, 2020

Option 3 would be to just add
$(Expr(:meta,:inline)) to the generated function.
Doing that, I get

julia> function mydotavx(a, b)
           s = 0.0
           @avx for i  eachindex(a,b)
               s += a[i]*b[i]
           end
           s
       end
mydotavx (generic function with 1 method)

julia> function mydot_avx(a, b)
           s = 0.0
           @_avx for i  eachindex(a,b)
               s += a[i]*b[i]
           end
           s
       end
mydot_avx (generic function with 1 method)

julia> @benchmark mydotavx($a, $b)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     17.505 ns (0.00% GC)
  median time:      17.518 ns (0.00% GC)
  mean time:        17.840 ns (0.00% GC)
  maximum time:     1.596 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998

julia> @benchmark mydot_avx($a, $b)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     16.901 ns (0.00% GC)
  median time:      17.511 ns (0.00% GC)
  mean time:        17.655 ns (0.00% GC)
  maximum time:     229.722 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998

Complete diff (aside from merging master), if you want to apply it:

$ git diff
diff --git a/src/_avx.jl b/src/_avx.jl
index 30b7017..9476911 100644
--- a/src/_avx.jl
+++ b/src/_avx.jl
@@ -34,7 +34,7 @@ function find_vars_and_gensym!(x::Symbol, vars::Dict{Symbol, Symbol}, ivars::Vec
         push!(vars, x => gx)
         gx
     else
-        x
+        get(vars, x, x)
     end
 end

@@ -56,8 +56,9 @@ macro _avx(ex)
     tgvars = Tuple(values(D))

     quote
-        kwargs = nt($(QuoteNode(tgvars)), $(Expr(:tuple, tvars...)))
-        $(Expr(:tuple, tvars...)) = _avx($(QuoteNode(type_ex)), kwargs)
+        kwargs = LoopVectorization.nt($(QuoteNode(tgvars)), $(Expr(:tuple, tvars...)))
+        $(Expr(:tuple, tvars...)) = LoopVectorization._avx($(QuoteNode(type_ex)), kwargs)
+        # LoopVectorization._avx($(QuoteNode(type_ex)), kwargs) # comment out the above line, uncomment this one, and get rid of the `@generated` on _avx to see the function body.
     end |> esc
 end

@@ -70,6 +71,7 @@ end
     end

     quote
+        $(Expr(:meta,:inline))
         $var_defs
         $(lower(LoopSet(ex)))
         $(Expr(:tuple, keys...))

I guess this would be good for merging (to have in addition to the current @avx) once you make these changes.

Was there anything else that should be added now?
"1" or "2" could be a future PR. As could work to try and actually take advantage of the type information we do have.

@chriselrod
Copy link
Member

I went ahead and pushed my changes.

Let me know if you're happy with this PR as is. If so, I'll go ahead and merge it.
You can always create new PRs with more changes later.

@MasonProtter
Copy link
Contributor Author

MasonProtter commented Jan 16, 2020

Hi Chris, thanks for looking through and making those fixes. Always feel free to push to my PR branches.

However, why gensym the variables? Is this just to force recompilation of the generated function?
That seems reasonable while developing. We may want to have the option to not force recompilation.

The reason I gensym'd the variables was that without the gensyms, when I splat out the variables inside the generated function (or when the generated function returns), I'd get an UndefVarError because julia thinks I'm trying to shadow a varaible.

Here's a MWE showing what I mean:

julia> a = 1
1

julia> let
           a = a
       end
ERROR: UndefVarError: a not defined
Stacktrace:
 [1] top-level scope at REPL[6]:2

The fact that this gensym forces recompilation every time the expression changes is a bit unfortunate.

If you're okay merging it as is, I am too.

Maybe let me write up some tests first though so we can make sure @_avx is producing the same answers and number of allocations as @avx in a variaty of circumstances?

@chriselrod
Copy link
Member

Hmm. What are the scoping rules in play there?
It's not acting entirely like a let block, otherwise redefining the variables wouldn't be visible outside the macro, and mydotavx would return 0.0.

An awful hack to find the reduction variables would be to create a LoopSet within the _avx macro:

julia> using LoopVectorization: LoopSet, name, operations

julia> dq = :(for i  eachindex(a,b)
                      s += a[i]*b[i]
                  end)
:(for i = eachindex(a, b)
      #= REPL[64]:2 =#
      s += a[i] * b[i]
  end)

julia> ls = LoopSet(dq);

julia> [name(operations(ls)[i]) for i  ls.outer_reductions]
1-element Array{Symbol,1}:
 :s

We don't want to constructing LoopSets excessively:

julia> using BenchmarkTools

julia> @benchmark LoopSet(dq)
BenchmarkTools.Trial:
  memory estimate:  13.92 KiB
  allocs estimate:  217
  --------------
  minimum time:     52.527 μs (0.00% GC)
  median time:      59.864 μs (0.00% GC)
  mean time:        64.836 μs (4.75% GC)
  maximum time:     11.007 ms (98.07% GC)
  --------------
  samples:          10000
  evals/sample:     1

However, it is substantially faster than:

julia> using MacroTools: prewalk

julia> using LoopVectorization: find_vars_and_gensym!, to_type

julia> function to_type_and_dict(ex)
           D = Dict{Symbol,Symbol}()
           ivars = Symbol[]

           gex = prewalk(x -> find_vars_and_gensym!(x, D, ivars), ex)
           to_type(gex), D
       end
to_type_and_dict (generic function with 1 method)

julia> @benchmark to_type_and_dict(dq)
BenchmarkTools.Trial:
  memory estimate:  145.35 KiB
  allocs estimate:  2510
  --------------
  minimum time:     7.430 ms (0.00% GC)
  median time:      9.247 ms (0.00% GC)
  mean time:        9.411 ms (0.90% GC)
  maximum time:     56.488 ms (79.40% GC)
  --------------
  samples:          531
  evals/sample:     1

We'll also eventually have to work on a way to use the type information we get.
This probably means (inside the _avx generated function) replacing the the type -> expression -> LoopSet pattern with type & var input types -> LoopSet.
That can come later.

@MasonProtter
Copy link
Contributor Author

Hmm. What are the scoping rules in play there?
It's not acting entirely like a let block, otherwise redefining the variables wouldn't be visible outside the macro, and mydotavx would return 0.0.

The macro is fine, it's the generated function that causes the scoping problem, since the generated function is a new scope.


Yeah, the way I'm doing things here is definitely not the direction we actually want to go. The real approach would be to have the macro parse the loop expression and then lower it to a series of @generated function calls that specialize on the variable types to produce the actual kernels.

E.g. something like

julia> @macroexpand @avx  for i ∈ eachindex(a,b)
    s += a[i]*b[i]
end
quote
    s = reduce((l::LoopStep) -> l.s + l.a[i] * l.b[i], LoopOver((a, b), reduction_variables=(s,)))
end

and then have @generated reduce and @generated LoopOver methods that are capable of using the type information and whatnot to generate specialized vectorized kernels

@chriselrod
Copy link
Member

The macro is fine, it's the generated function that causes the scoping problem, since the generated function is a new scope.

Input arguments are defined because they're inputs.

function foo(a)
    a = a
end
foo(3)

I removed the gensyms. It probably won't cut down on recompilation a lot, but figured it is worthwhile.

The goal would be to create a single LoopSet object with as much information about the loops as possible (and give it the ability to use the information we provide it).
That is probably easiest with a single generated function.

It'll need to extract the type info from the var_types, which it isn't doing now.
Making use of it will be a bit harder.
It'll need to have a concept of vectors taking up multiple registers.

SIMDPirates will need a PR implementing StructVectors, following the same approach as StructArrays.
stridedpointer will need overloads that return objects so that loading from them will load a StructVector that is a collection of these.
We'll probably need other overloads, too, for arithmetic to work. This still won't be as fast as actually using StructArrays, but it will speed up cases where users don't do that.

For now, the first thing I'll work on is using some type information of the incoming arrays. Eg, if the array is an Adjoint or Transpose, that will help it make better loop order decisions.

@chriselrod chriselrod marked this pull request as ready for review January 16, 2020 20:56
@chriselrod chriselrod merged commit 88103ef into JuliaSIMD:master Jan 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants