Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor @avx to rewrite loops into a @generated function. #14

Closed
MasonProtter opened this issue Jan 8, 2020 · 6 comments
Closed

Refactor @avx to rewrite loops into a @generated function. #14

MasonProtter opened this issue Jan 8, 2020 · 6 comments

Comments

@MasonProtter
Copy link
Contributor

@chriselrod mentioned on Discourse that the fact that @avx is a macro imposes some codegen difficulties since macros don't see type information, and mentioned generated functions as an alternative. I agree that the best solution would be to gut all the complicated parts out of @avx and put them into @generated function avx and then make @avx just rewrite the input loops into appropriate data structures to be passed to the avx function.

This would cleanly separate two the distinct functionalities which are currently conflated: syntax transformations and complicated, heavily datatype dependant code generation. Furthermore, it'd make it much easier to expose an interface for other packages to write methods and make their types compatible with @avx.

This won't be news to Chris, but I figured it'd be good to have an issue here for the benefit of other readers.

I'd be happy to help with the generated functions and macro rewrite if you're looking for help, though it will take me a while to go through and digest the code. If you could point out places where the syntax transformations occur, that would accelerate the process.

@chriselrod
Copy link
Member

We will need to ensure that, whatever approach we take, the compiler can compile it all away so that it does not introduce any runtime overhead.

If you could point out places where the syntax transformations occur, that would accelerate the process.

There are 3 basic steps:

  1. Create a LoopSet object.
  2. Determine the strategy for converting the LoopSet back to an expression.
  3. Converting it back into an expression.

Once you have the LoopSet, calling lower performs steps 2 and 3.

All we need for a new "frontend" is 1.

When taking the generated function approach (like in broadcasting), we can subdivide it into two parts:
a) A macro that transforms the expression into a call to a generated function.
b) The generated function, which needs to be able to construct the LoopSet using only type information. The transform and call in a must provide all of this.

You can see an example of b in the broadcast code. The type signature of a broadcasted object (before creating axes, which the macro skips) is Broadcasted{S,Nothing,F,A}. S is the style, F function, and A is a tuple of all the arguments to the function.
vmaterialize! recursively calls add_broadcast! on the type parameters of A, until A is something other than a broadcast, like a scalar or an array.
Each called function generates code to extract objects from the broadcasted struct, and adds operations as necessary, such as loads from arrays or computation.

The functions they call to add operations to the LoopSet are in the graphs file.

Regarding a, a straightforward approach may be to walk the loop expression, transforming it into nested tuples, and wrap each symbol with a Val.

julia> dotq = :(for i  eachindex(a,b)
               s += a[i]*b[i]
           end)
:(for i = eachindex(a, b)
      #= REPL[46]:2 =#
      s += a[i] * b[i]
  end)

julia> dump(dotq)
Expr
  head: Symbol for
  args: Array{Any}((2,))
    1: Expr
      head: Symbol =
      args: Array{Any}((2,))
        1: Symbol i
        2: Expr
          head: Symbol call
          args: Array{Any}((3,))
            1: Symbol eachindex
            2: Symbol a
            3: Symbol b
    2: Expr
      head: Symbol block
      args: Array{Any}((2,))
        1: LineNumberNode
          line: Int64 2
          file: Symbol REPL[46]
        2: Expr
          head: Symbol +=
          args: Array{Any}((2,))
            1: Symbol s
            2: Expr
              head: Symbol call
              args: Array{Any}((3,))
                1: Symbol *
                2: Expr
                3: Expr

Perhaps something like:

struct For{T<:Tuple} end
struct Block{T<:Tuple} end
struct Equal{T<:Tuple} end
struct Call{T<:Tuple} end
struct ArrayRef{T<:Tuple} end
# implement macro that transforms `dotq` into:
looptype = For{Tuple{Equal{Tuple{:i,Call{Tuple{:eachindex,:a,:b}}()}}(),Block{Tuple{Equal{Tuple{:s,Call{Tuple{:muladd,ArrayRef{Tuple{:a,:i}}(),ArrayRef{Tuple{:b,:i}}(),:s}}()}}()}}()}}()
# implement a generated function that can transform the above into
lsdot = LoopVectorization.LoopSet(dotq);

The generated function would be called on looptype, and also receive all variables referenced (in this case, just a, b, and s -- although we may want to handle reduction variables like s specially).

We would define a canonical ordering of these variables (perhaps sort([:s,:a,:b])) so that the generated function can map each to the symbol in the looptype.

Would you like to give it a go, or wait until I get around to it?

A couple parting notes:

  1. I've been using contract_pass to fuse multiply adds before converting expressions into LoopSets, and to eliminate += and friends. There are two chief reasons for this: (1) because I haven't implemented fusion in the cost models of LoopVectorization yet, it currents considers a * b + c to take twice as long as muladd(a,b,c) and (2) while the compiler can automatically fuse the operations, it also adds a bunch of vmovapd instructions to gemm kernels when doing so that it does not do with muladd calls that forbid compiler reordering. It's been a few months since I last checked for that LLVM bug.
  2. No new code is allowed to use MacroTools, and MacroTools should be removed from any old code getting touched if possible. I'm getting serious about compile time performance. Some old code of mine took 40+ seconds to compile. MacroTools is one of the culprits. I'm not willing to sacrifice runtime performance, but I am willing to sacrifice a little bit of convenience. I'm open to any advice on reducing compile times. I think using lots of small type stable functions is key, but I'm no expert on that.

@MasonProtter
Copy link
Contributor Author

MasonProtter commented Jan 10, 2020

Okay, so I played around with this today and here's what I've got so far:

struct Ex{T, Tup} end

function to_type!(ex::Expr, vars::Vector{Symbol}, ivars::Vector{Symbol})
    if ex.head == :(=) && ex.args[1] isa Symbol
        push!(ivars, ex.args[1])
    end
    Ex{ex.head, Tuple{to_type!.(ex.args, Ref(vars), Ref(ivars))...}}
end

function to_type!(x, vars::Vector{Symbol}, ivars::Vector{Symbol})
    x
end

to_type!(::LineNumberNode, vars::Vector{Symbol}, ivars::Vector{Symbol}) = nothing

function to_type!(x::Symbol, vars::Vector{Symbol}, ivars::Vector{Symbol}) 
    x  vars && x  ivars && push!(vars, x)
    x
end

function to_expr(ex::Type{Ex{Head, Tup}}) where {Head, Tup}
    Expr(Head, (to_expr(x) for x in Tup.parameters)...)
end
to_expr(x) = x

nt(keys, vals) = NamedTuple{keys, typeof(vals)}(vals)

macro avx_parse(ex)
    vars    = Symbol[]
    ivars   = Symbol[]
    type_ex = to_type!(ex, vars, ivars)
    tvars   = Tuple(vars)
    qtivars = QuoteNode(Tuple(ivars))
    quote
        $(QuoteNode(type_ex)), nt($(QuoteNode(tvars)), (tuple($(vars...)))), $qtivars
    end |> esc
end
julia> a = rand(5); b = randn(5);

julia> ex_t, vars, ivars = @avx_parse begin
           s = 0.0
           for i  eachindex(a,b)
               s += a[i]*b[i]
           end
       end;

julia> ex_t
Ex{:block,Tuple{nothing,Ex{:(=),Tuple{:s,0.0}},nothing,Ex{:for,Tuple{Ex{:(=),Tuple{:i,Ex{:call,Tuple{:eachindex,:a,:b}}}},Ex{:block,Tuple{nothing,Ex{:+=,Tuple{:s,Ex{:call,Tuple{:*,Ex{:ref,Tuple{:a,:i}},Ex{:ref,Tuple{:b,:i}}}}}}}}}}}}

julia> vars
(eachindex = eachindex, a = [0.864891273566464, 0.016615952609847495, 0.24528618468148156, 0.9045614698725608, 0.24968365320302022], b = [0.10359624603037468, 0.7595643433938144, -0.8973765174693641, 0.20134830886960015, 0.49462379671835727], * = *)

julia> ivars
(:s, :i)

julia> to_expr(ex_t)
quote
    nothing
    s = 0.0
    nothing
    for i = eachindex(a, b)
        nothing
        s += a[i] * b[i]
    end
end

Currently, I'm just converting LineNumberNodes to nothing, but I could strip them out altogether or lift them to the type domain depending on your preferences. I didn't try stripping them out yet becuase my goto solution for that is MacroTools.striplines.


As a high level summary, the above code can take (I believe) a completely arbitrary expression, convert it into a type, and then output a named tuple matching variables from the expression, excluding the variables which are assigned within the expression (such as i and s), and putting those internal variables in their own tuple without values.

My plan is then that this macro would then pass it's three outputs to a generated function which would call to_expr on the Ex type to turn it back into an expression, and use the type of the named tuple (e.g. NamedTuple{(:eachindex, :a, :b, :*), Tuple{typeof(eachindex),Array{Float64,1},Array{Float64,1},typeof(*)}}) to make choices depending on the eltype of the input data.

The functions such as eachindex, *, etc. can be filtered out, but it could be nice to have them there just so we can throw an error as soon as we see an unsupported function being used in and @avx block.

What are your thoughts on this approach so far? Any questions or concerns?

@MasonProtter
Copy link
Contributor Author

MasonProtter commented Jan 11, 2020

Okay, so I played around a bit more and implemented a very stupid early version of @avx with a @generated function:

using MacroTools: prewalk
using LoopVectorization: LoopVectorization, LoopSet, lower

#----------------------------------------------------------------------------------------------------

struct Ex{T, Tup} end

function to_type(ex::Expr)
    Ex{ex.head, Tuple{to_type.(ex.args)...}}
end

to_type(x) = x
to_type(::LineNumberNode) = nothing

#----------------------------------------------------------------------------------------------------

to_expr(ex::Type{Ex{Head, Tup}}) where {Head, Tup} = Expr(Head, (to_expr(x) for x in Tup.parameters)...)
to_expr(x) = x

#----------------------------------------------------------------------------------------------------

function find_vars_and_gensym!(ex::Expr, vars::Dict{Symbol, Symbol}, ivars::Vector{Symbol})
    if ex.head == :(=) && ex.args[1] isa Symbol
        push!(ivars, ex.args[1])
    elseif ex.head == :call
        push!(ivars, ex.args[1])
    end
    ex
end

function find_vars_and_gensym!(x::Symbol, vars::Dict{Symbol, Symbol}, ivars::Vector{Symbol})
    if (x  keys(vars)) && (x  ivars)
        gx = gensym(x)
        push!(vars, x => gx)
        gx
    else
        x
    end    
end

find_vars_and_gensym!(x, vars::Dict{Symbol, Symbol}, ivars::Vector{Symbol}) = x

#----------------------------------------------------------------------------------------------------

nt(keys, vals) = NamedTuple{keys, typeof(vals)}(vals)

macro _avx(ex)
    D     = Dict{Symbol, Symbol}()
    ivars = Symbol[]
    
    gex = prewalk(x -> find_vars_and_gensym!(x, D, ivars), ex)

    type_ex = to_type(gex)

    tvars  = Tuple(keys(D))
    tgvars = Tuple(values(D))

    quote
        kwargs = nt($(QuoteNode(tgvars)), $(Expr(:tuple, tvars...)))
        $(Expr(:tuple, tvars...)) = _avx($(QuoteNode(type_ex)), kwargs)
    end |> esc
end

@generated function _avx(::Type{ex_t}, var_nt::NamedTuple{keys, var_types}) where {ex_t <: Ex, keys, var_types}
    ex = to_expr(ex_t)
    var_defs = Expr(:block)
    for k in keys 
        push!(var_defs.args, :($k = var_nt[$(QuoteNode(k))]))
    end
    quote
        $var_defs
        $(lower(LoopSet(ex)))
        $(Expr(:tuple, keys...))
    end
end

It technically does work, though something is causing a big performance bug:

function mydot(a, b)
    s = 0.0
    for i  eachindex(a,b)
        s += a[i]*b[i]
    end
    s
end

function mydotavx(a, b)
    s = 0.0
    @_avx for i  eachindex(a,b)
        s += a[i]*b[i]
    end
    s
end

a = rand(10); b = rand(10);
@btime mydot($a, $b)                # 8.521 ns (0 allocations: 0 bytes)
@btime mydotavx($a, $b)             # 2.321 μs (30 allocations: 1.03 KiB)
@show  mydotavx(a, b)  mydot(a, b) # true

I think the problem is that I'm causing arrays to be reallocated a bunch, since the performance cost is scaling linearly with the size of a and b.

The reason I need the above weird convoluted design is that unlike a macro, variables defined outside a @generated function won't be visible inside the generated function and vice versa, so I end up doing something like turning the expression

@_avx for i  eachindex(a,b)
    s += a[i]*b[i]
end

into

a, b, s = _avx(#= a type encoding the expression =#, (_a=a, _b=b, _s=s))

and the generated function _avx does the normal @avx stuff and then returns (a, b, s). This way, any overwriting of varaibles such as s += a[i]*b[i] gets communicated back to the enclosing scope.

Yikes. I'm really feeling like I'm shoving a square peg in a round hole. After being quite gung-ho about the generated function approach, I'm starting to have concerns.

@chriselrod
Copy link
Member

chriselrod commented Jan 11, 2020

Thanks for this. I'm preoccupied with unfortunate personal affairs until Sunday, so it will be a little while until I can give this a serious look (although I'm still trying to fix the other recent issue as well at the moment).

I assume the @code_warntype of the generated function looks fine?

I would only return the s, and not the arrays.
I'd get these from the LoopSet object directly. ls.outer_reductions gives their ids, so something like [ls.operations[i].var for i in ls.outer_reductions] should give you the set of all the bindings that changed.

30 is a lot of allocations though. More than I'd expect from just constructing a tuple with two arrays in it. What happens if you inline _avx?
And try the canonical ordering approach (eg, sorting the symbols) instead of kwargs?

@MasonProtter
Copy link
Contributor Author

Thanks for this. I'm preoccupied with unfortunate personal affairs until Sunday, so it will be a little while until I can give this a serious look

No problem, take your time.

I assume the @code_warntype of the generated function looks fine?

Actually, it doesn't. Strangely, s of all variables is inferred to Any.

julia> @code_warntype mydotavx(a, b)
Variables
  #self#::Core.Compiler.Const(mydotavx, false)
  a@_2::Array{Float64,1}
  b@_3::Array{Float64,1}
  kwargs::NamedTuple{(Symbol("##a#402"), Symbol("##b#403"), Symbol("##s#404")),Tuple{Array{Float64,1},Array{Float64,1},Float64}}
  @_5::Int64
  s::Any
  a@_7::Array{Float64,1}
  b@_8::Array{Float64,1}

Body::Any
1 ─       (b@_8 = b@_3)
│         (a@_7 = a@_2)
│         (s = Main.zero(Main.Float64))
│   %4  = Core.tuple(a@_7, b@_8, s::Core.Compiler.Const(0.0, false))::Core.Compiler.PartialStruct(Tuple{Array{Float64,1},Array{Float64,1},Float64}, Any[Array{Float64,1}, Array{Float64,1}, Core.Compiler.Const(0.0, false)])
│         (kwargs = Main.nt($(QuoteNode((Symbol("##a#402"), Symbol("##b#403"), Symbol("##s#404")))), %4))
│   %6  = Main._avx($(QuoteNode(Ex{:for,Tuple{Ex{:(=),Tuple{:i,Ex{:call,Tuple{:eachindex,Symbol("##a#402"),Symbol("##b#403")}}}},Ex{:block,Tuple{nothing,Ex{:+=,Tuple{Symbol("##s#404"),Ex{:call,Tuple{:*,Ex{:ref,Tuple{:a,:i}},Ex{:ref,Tuple{:b,:i}}}}}}}}}})), kwargs)::Tuple{Array{Float64,1},Array{Float64,1},Any}%7  = Base.indexed_iterate(%6, 1)::Core.Compiler.PartialStruct(Tuple{Array{Float64,1},Int64}, Any[Array{Float64,1}, Core.Compiler.Const(2, false)])
│         (a@_7 = Core.getfield(%7, 1))
│         (@_5 = Core.getfield(%7, 2))
│   %10 = Base.indexed_iterate(%6, 2, @_5::Core.Compiler.Const(2, false))::Core.Compiler.PartialStruct(Tuple{Array{Float64,1},Int64}, Any[Array{Float64,1}, Core.Compiler.Const(3, false)])
│         (b@_8 = Core.getfield(%10, 1))
│         (@_5 = Core.getfield(%10, 2))
│   %13 = Base.indexed_iterate(%6, 3, @_5::Core.Compiler.Const(3, false))::Core.Compiler.PartialStruct(Tuple{Any,Int64}, Any[Any, Core.Compiler.Const(4, false)])
│         (s = Core.getfield(%13, 1))
└──       return s

If I make it put in type assertions on all the variables before they're returned, the code_warntype for mydotavx is fine but performance doesn't improve. I think that's because there's still a type instability inside _avx. Bizarre.

I would only return the s, and not the arrays.
I'd get these from the LoopSet object directly. ls.outer_reductions gives their ids, so something like [ls.operations[i].var for i in ls.outer_reductions] should give you the set of all the bindings that changed.

Hm, so that will require some analysis of the loop prior to the generated function then so that we know what variables need to be splatted into. I guess determining which bindings change doesn't require type info so I guess it'd be okay to hoist that analysis to the macro.

I think the discussion here is starting to go beyond an issue. I'll open a draft PR.

@chriselrod
Copy link
Member

I rewrote everything, but I don't have the impression that it actually helped compile times at all.
I'd probably have to go through and benchmark/@code_warntype pieces.
The goal is to both minimize the type unstable regions, and move as quickly as possible to concretely typed, dynamically sized, arrays.

@avx is now the generated function, while @_avx rewrites the raw for loops.
This is to reflect the general preference for using @avx over using @_avx.

I have added support for specialization on adjoint/transpose, as well as using size information.
The API for size info is overloading maybestaticsize and/or maybestaticlength.

using StaticArrays, LoopVectorization
LoopVectorization.maybestaticsize(::MArray{<:Tuple{M,Vararg}}, ::Val{1}) where {M} = LoopVectorization.Static{M}()
LoopVectorization.maybestaticsize(::MArray{<:Tuple{M,N,Vararg}}, ::Val{2}) where {M,N} = LoopVectorization.Static{N}()

If you defined loopbounds as 1:size(A,2), it will substitute the size(A,2) with maybestaticsize(A,Val(2)) (which has the fallback definition size(A,2)).

The static size parameters must be placed on loop bounds. If you defined variables that the macro can't see, it naturally won't be able to pick them up.
It otherwise also cannot infer loop bounds from the types of the arrays, because while loop bounds will normally match the size of the array axis, that isn't true in general.

I will close this, so that future issues on this theme can focus more on specific features / was to take advantage of this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants