Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: support for CartesianIndices #66

Merged
merged 9 commits into from
Mar 21, 2020
Merged

WIP: support for CartesianIndices #66

merged 9 commits into from
Mar 21, 2020

Conversation

timholy
Copy link
Contributor

@timholy timholy commented Mar 6, 2020

In the test below, avxgeneric! is actually how I want to write this function so that it supports arbitrary dimensions. It fails because of a mismatch between expectations from the number of explicit loops and the number of implicit loops when you expand out the CartesianIndices. I started trying to fix the issues but at a certain point it started feeling like whack-a-mole, and I wondered if you have any suggestions for how to make this better. (I am happy to do the work as a way to get to know this package better, though if you'd prefer to take this over that's fine with me.)

@timholy timholy mentioned this pull request Mar 6, 2020
@codecov-io
Copy link

codecov-io commented Mar 6, 2020

Codecov Report

Merging #66 into master will increase coverage by 0.10%.
The diff coverage is 95.65%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #66      +/-   ##
==========================================
+ Coverage   93.32%   93.42%   +0.10%     
==========================================
  Files          27       27              
  Lines        2501     2586      +85     
==========================================
+ Hits         2334     2416      +82     
- Misses        167      170       +3     
Impacted Files Coverage Δ
src/LoopVectorization.jl 100.00% <ø> (ø)
src/add_loads.jl 93.10% <ø> (-0.23%) ⬇️
src/determinestrategy.jl 94.79% <ø> (ø)
src/graphs.jl 86.59% <ø> (ø)
src/lower_load.jl 100.00% <ø> (ø)
src/lower_memory_common.jl 94.02% <ø> (ø)
src/memory_ops_common.jl 90.76% <0.00%> (ø)
src/operations.jl 86.02% <ø> (ø)
src/reconstruct_loopset.jl 94.39% <96.24%> (+1.98%) ⬆️
src/condense_loopset.jl 93.82% <100.00%> (-0.06%) ⬇️
... and 4 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 89395b7...51b37e2. Read the comment docs.

function Loop(ls::LoopSet, l::Int, k::Int, ::Type{<:CartesianIndices{N}}) where N
start = gensym(:loopstart); stop = gensym(:loopstop)
axisexpr = Expr(:ref, Expr(:., Expr(:ref, :lb, l), QuoteNode(:indices)), k)
pushpreamble!(ls, Expr(:(=), start, Expr(:macrocall, Symbol("@inbounds"), LineNumberNode(@__LINE__, Symbol(@__FILE__)), Expr(:(.), axisexpr, QuoteNode(:start)))))
Copy link
Member

@chriselrod chriselrod Mar 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, the package assumes arrays have been adjusted to 0-indexing (because it makes calculating the pointer offsets a little easier).
I.e., note the staticm1 (static minus 1) here: https://github.com/chriselrod/LoopVectorization.jl/blob/master/src/graphs.jl#L359

I think it would be simpler to instead leave things [whatever they originally were] indexed, and only subtract 1 while lowering here.

An alternative to handling it at lowering is to instead make stridedpointers use the offsets of the parent array types.

@chriselrod
Copy link
Member

What needs to happen is that the ArrayReference objects getting constructed need to hold all the loop index symbols corresponding to the loops.
Currently, each array reference lists indices into the original set of loops, indicating which loop index symbol to use.
If loops were originally written as

for i in I, j in J, k in K, l in L
    # do something with A[i,k], B[i,j], and C[i,l]
end

the add_loops! function would have added 4 loops, and then the ArrayReferenceMeta would get called with ind values of 1 and 3 (A), 1 and 2 (B), and 1 and 4 (C).
This is how it draws the connection between loops and indexing the loops.

Now lets say that J isa CartesianIndices{3}. The add_loops! function would have added the following loops (with different mangling):

  1. for i in I
  2. for j3 in J3
  3. for j2 in J2
  4. for j1 in J1
  5. for k in K
  6. for l in L
    (the J3, J2, J1 order is because you defined the add_loops method to iterate N:-1:1.)

Meaning that if index_type == LoopIndex and ind == 2, it should add symbol inds 2-4 instead of just 2.
If ind > 2, we also need to add an offset of 3-1 = 2.

One fix would be to have the add_loops! function create a cumulative_offset vector (prepended by 0), which we can use in the following way inside the ArrayReferenceMeta constructor:

index_vec = Symbol[]
while index_types != zero(UInt64)
    ind = indices % UInt8
    if index_types == LoopIndex
        for inda in 1+cumulative_offset[ind]:cumulative_offset[ind+1]
            pushfirst!(index_vec, ls.loopsymbols[inda])
        end
    else
        symind= if...
        end
        pushfirt!(index_vec, symind)
    end
    index_types >>>= 8
    ...
end

The loop 1+cumulative_offset[ind]:cumulative_offset[ind+1] iterates in increasing order because the cartesian indices are ordered in decreasing order (J3, J2, J1). These have to be placed in the correct order, and they're reversed by pushfirst!, meaning we need another reversal somewhere else. It may be cleaner to reverse this loop rather than the one adding the cartesian loops.

This reverse order is because of the first in/last out approach to writing information to and then reading it from bits

julia> @macroexpand @avx for i  I, j  J, k  K, l  L
           s += A[i,k] * B[i,j] * C[i,l]
       end
quote
    var"##loopi#274" = LoopVectorization.maybestaticrange(I)
    var"##i_loop_lower_bound#275" = LoopVectorization.staticm1(first(var"##loopi#274"))
    var"##i_loop_upper_bound#276" = last(var"##loopi#274")
    var"##loopj#277" = LoopVectorization.maybestaticrange(J)
    var"##j_loop_lower_bound#278" = LoopVectorization.staticm1(first(var"##loopj#277"))
    var"##j_loop_upper_bound#279" = last(var"##loopj#277")
    var"##loopk#280" = LoopVectorization.maybestaticrange(K)
    var"##k_loop_lower_bound#281" = LoopVectorization.staticm1(first(var"##loopk#280"))
    var"##k_loop_upper_bound#282" = last(var"##loopk#280")
    var"##loopl#283" = LoopVectorization.maybestaticrange(L)
    var"##l_loop_lower_bound#284" = LoopVectorization.staticm1(first(var"##loopl#283"))
    var"##l_loop_upper_bound#285" = last(var"##loopl#283")
    var"##vptr##_A" = LoopVectorization.stridedpointer(A)
    var"##vptr##_B" = LoopVectorization.stridedpointer(B)
    var"##vptr##_C" = LoopVectorization.stridedpointer(C)
    local var"##s_0"
    begin
        $(Expr(:gc_preserve, :(var"##s_0" = LoopVectorization._avx_!(Val{(0, 0)}(), Tuple{:LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000013, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x01, 0x01), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x02, 0x02), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000014, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x03, 0x03), :LoopVectorization, :vmul, LoopVectorization.OperationStruct(0x0000000000000124, 0x0000000000000000, 0x0000000000000000, 0x0000000000000203, LoopVectorization.compute, 0x00, 0x04), :LoopVectorization, Symbol("##254"), LoopVectorization.OperationStruct(0x0000000000000000, 0x0000000000000000, 0x0000000000001324, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x05), :LoopVectorization, :vfmadd_fast, LoopVectorization.OperationStruct(0x0000000000001324, 0x0000000000001324, 0x0000000000000000, 0x0000000000010405, LoopVectorization.compute, 0x00, 0x05)}, Tuple{LoopVectorization.ArrayRefStruct(0x0000000000000101, 0x0000000000000103, 0x0000000000007074), LoopVectorization.ArrayRefStruct(0x0000000000000101, 0x0000000000000102, 0xfffffffffffff07d), LoopVectorization.ArrayRefStruct(0x0000000000000101, 0x0000000000000104, 0xffffffffffffffff)}, Tuple{0, Tuple{6}, Tuple{5}, Tuple{}, Tuple{}, Tuple{}, Tuple{}}, (var"##i_loop_lower_bound#275":var"##i_loop_upper_bound#276", var"##j_loop_lower_bound#278":var"##j_loop_upper_bound#279", var"##k_loop_lower_bound#281":var"##k_loop_upper_bound#282", var"##l_loop_lower_bound#284":var"##l_loop_upper_bound#285"), var"##vptr##_A", var"##vptr##_B", var"##vptr##_C", s)), :A, :B, :C))
    end
    s = LoopVectorization.reduced_add(var"##s_0", s)
end

To save you from scrolling far enough to see the relevant part, it is:

Tuple{LoopVectorization.ArrayRefStruct(0x0000000000000101, 0x0000000000000103, 0x0000000000007074), LoopVectorization.ArrayRefStruct(0x0000000000000101, 0x0000000000000102, 0xfffffffffffff07d), LoopVectorization.ArrayRefStruct(0x0000000000000101, 0x0000000000000104, 0xffffffffffffffff)}

The third field looks like junk, but I'm not using it yet (it's intended to carry offsets relative to a base array, to make it aware of the CSE opportunities in a loop with indices D[i,j] and D[i,j+1]; LLVM will already CSE them if it gets unrolled, but making LoopVectorization aware can improve the cost modeling and encourage it to unroll )
The second field of the ArrayRefStructs gives indices. For the three Arrays, we have indices 0103 (loops 1 and 3), 0102 (loops 1 and 2), and 0104 (loops 1 and 4).
The ArrayReferenceMeta constructor reads these in a first in/last out order (it'll read the 03 before it reads the 01.

In our example where the second loop J isa CartesianIndices{3}, cumulative_offset would be [0, 1, 4, 5, 6].
For array C (0x...0104), the above proposed code for the ArrayReferenceMeta constructor would iterate from 1+[0, 1, 4, 5, 6][4]:[0, 1, 4, 5, 6][4+1] = 6:6 (adding l) and then iterate from 1+[0, 1, 4, 5, 6][1]:[0, 1, 4, 5, 6][1+1] = 1:1, pushfirst!ing i in front of the vector of indices.
For B (0102), we'd iterate from 1+[0, 1, 4, 5, 6][2]:[0, 1, 4, 5, 6][2+1] = 2:4, pushfirst!ing j3, j2, and then j1 (so that they're in j1,j2,j3) order, and finally it'll push i (iterating 1:1) to the very front.

The cost modelling assumes that all arrays are column major, so if an array happens to be row major (i.e., an adjoint or a tranpose), it'll tranpose it to make it column major and reverse the order of indices.
So the order of the indices is meaningful, especially the very first one (because the first is assumed to be contiguous and therefore efficient to load contiguous elements from; it adds a dummy index later if there are no contiguous indices, e.g. in some SubArrays).

Does this make it clear how it works?
I'm open to abstractions that make this code simpler, more general, or easier to modify.

Another change that we may want to make is in the register_loop! function from the file graphs.jl, as well as loop_boundaries from condense_loopset.jl (that function should probably also be renamed if we're transitioning it from passing loop boundaries to passing iterators). As you can see in the above macroexpand, it tries to summarize the loop with a range of the boundaries:

var"##j_loop_lower_bound#278":var"##j_loop_upper_bound#279"

This looks very wrong to me, but by the power of multiple dispatch...

julia> first(CartesianIndices(A)):last(CartesianIndices(A)) == CartesianIndices(A)
true

It'll give us the correct answer. But it'll probably be preferable to pass the original CartesianIndices

julia> CartesianIndices(A) |> typeof
CartesianIndices{2,Tuple{Base.OneTo{Int64},Base.OneTo{Int64}}}

julia> first(CartesianIndices(A)):last(CartesianIndices(A)) |> typeof
CartesianIndices{2,Tuple{UnitRange{Int64},UnitRange{Int64}}}

We could also fix it by defining a maybestaticfirst and maybestaticlast, where the first can return something that'll produce OneTos again.
As you've noted before, that entire part of the api (and the register_loop!) function could use some rework.

I'm very appreciative of this work and really looking forward to this landing. It'll make a lot of my code in PaddedMatrices (which is in the middle of an overhaul) much simpler and more generic. Functions as simple as sum on views or padded arrays need Cartesian indexing.
If this sounds like too much, I'm also happy to work on it and will have more time tonight and over the weekend. Explaining how things work probably takes at least as long as doing the work, but can be worth it if you're interested in making more contributions down the line, or if I get feedback that helps improve the design of the library.

@timholy
Copy link
Contributor Author

timholy commented Mar 6, 2020

Wow, this looks amazing! I haven't looked it over carefully, but I will this weekend, and I'm very happy to dive in. If it starts becoming not worth your time feel free to just short-circuit me and do it yourself, but in the meantime I'm happy to try to push this over the finish line.

@chriselrod
Copy link
Member

Great, I'm happy to have collaborators!
Any outside contributions are also nice to point out if/when I ever have to justify open sourcing a project. ;)

Maybe I could work on some "developer docs" to describe how it works in more detail to would-be contributors.

@timholy
Copy link
Contributor Author

timholy commented Mar 7, 2020

Any outside contributions are also nice to point out if/when I ever have to justify open sourcing a project. ;)

Agreed. For me this comes up in grant applications, where having other contributors is convincing evidence that people actually care about the package. It also makes it more fun.

Maybe I could work on some "developer docs" to describe how it works in more detail to would-be contributors.

I had thought about the same thing. I was thinking of typing up notes as I learn and sort of blurting them out there; you can edit or suggest edits until they actually become true 😄.

@chriselrod
Copy link
Member

Any outside contributions are also nice to point out if/when I ever have to justify open sourcing a project. ;)

Agreed. For me this comes up in grant applications, where having other contributors is convincing evidence that people actually care about the package. It also makes it more fun.

Certainly.
Seeing that others care about a package can also help validate all the free time I've poured into these projects.

Maybe I could work on some "developer docs" to describe how it works in more detail to would-be contributors.

I had thought about the same thing. I was thinking of typing up notes as I learn and sort of blurting them out there; you can edit or suggest edits until they actually become true smile.

You're more than welcome too. I've also created some initial dev docs; you're more than welcome to add any notes there as well.

I'm more than happy to edit notes / answer questions / expand on something, etc.

But if something comes up and you're no longer looking to finish, no worries, just let me know and I'll wrap it up.
I'm eager to have some of my projects start depending on this PR.

@timholy
Copy link
Contributor Author

timholy commented Mar 8, 2020

Progress. I haven't yet addressed the issue of offset-by-1, basically because there's a prior sticking point: when I run avxgeneric! I get up to

ERROR: MethodError: no method matching vload(::LoopVectorization.LoopValue, ::Tuple{Int64,VectorizationBase._MM{8,Int64}})
Closest candidates are:
  vload(::LoopVectorization.LoopValue, ::Tuple{VectorizationBase._MM{W,I} where I<:Number}) where W at /home/tim/.julia/dev/LoopVectorization/src/vectorizationbase_extensions.jl:4
  vload(::LoopVectorization.LoopValue, ::Tuple{VectorizationBase._MM{W,I} where I<:Number}, ::VectorizationBase.Mask) where W at /home/tim/.julia/dev/LoopVectorization/src/vectorizationbase_extensions.jl:6
  vload(::LoopVectorization.LoopValue, ::Integer) at /home/tim/.julia/dev/LoopVectorization/src/vectorizationbase_extensions.jl:7
  ...
Stacktrace:
 [1] macro expansion at /home/tim/.julia/dev/LoopVectorization/src/reconstruct_loopset.jl:295 [inlined]
 [2] _avx_! at /home/tim/.julia/dev/LoopVectorization/src/reconstruct_loopset.jl:295 [inlined]
 [3] macro expansion at ./gcutils.jl:91 [inlined]
 [4] avxgeneric!(::OffsetArray{Float32,2,Array{Float32,2}}, ::Array{Float32,2}, ::OffsetArray{Float32,2,Array{Float32,2}}, ::CartesianIndices{2,Tuple{OffsetArrays.IdOffsetRange{Int64,Base.OneTo{Int64}},OffsetArrays.IdOffsetRange{Int64,Base.OneTo{Int64}}}}, ::Float32) at ./none:3
 [5] avxgeneric!(::OffsetArray{Float32,2,Array{Float32,2}}, ::Array{Float32,2}, ::OffsetArray{Float32,2,Array{Float32,2}}) at ./none:2
 [6] top-level scope at none:0

As far as I can tell this comes from here, where mo is something like :((var"##n#446", var"##n#443")). Any thoughts? You'll see that in vectorizationbase_extensions.jl I pasted in a template based on vload pointer method, but in this case we don't seem to have any strides available. Any suggestions? My hunch is perhaps lower_nest!, but I confess to not understanding what's going on sufficiently well yet to know if this is even close.

@chriselrod
Copy link
Member

LoopValues were used whenever a loop iterable was used for arithmetic, or something other than indexing an array. LoopVectorization initially didn't have any support for that, so the simplest way to add support was to pretend you're loading a value.

This also allowed for correcting the offsets, e.g. if someone wrote for I in R and used I within that loop, loading I could allow offsetting from the value it should have held.

Because you're doing arithmetic on the loop iterables here (I + J)

tmp += A[I+J]*kern[J]

It introduces the LoopValues and loads from them:

julia> @macroexpand @_avx for I in R
                  tmp = z
                  for J in Rk
                      tmp += A[I+J]*kern[J]
                  end
                  out[I] = tmp
              end
quote
    $(Expr(:meta, :inline))
    begin
        var"##loopI#286" = LoopVectorization.maybestaticrange(R)
        var"##I_loop_lower_bound#287" = LoopVectorization.staticm1(first(var"##loopI#286"))
        var"##I_loop_upper_bound#288" = last(var"##loopI#286")
        var"##loopJ#289" = LoopVectorization.maybestaticrange(Rk)
        var"##J_loop_lower_bound#290" = LoopVectorization.staticm1(first(var"##loopJ#289"))
        var"##J_loop_upper_bound#291" = last(var"##loopJ#289")
        var"##vptr##_A" = LoopVectorization.stridedpointer(A)
        var"##LOOPSYMBOL##I" = LoopVectorization.LoopValue()
        var"##vptr##_##LOOPSYMBOL##I" = LoopVectorization.stridedpointer(var"##LOOPSYMBOL##I")
        var"##LOOPSYMBOL##J" = LoopVectorization.LoopValue()
        var"##vptr##_##LOOPSYMBOL##J" = LoopVectorization.stridedpointer(var"##LOOPSYMBOL##J")
        var"##vptr##_kern" = LoopVectorization.stridedpointer(kern)
        var"##vptr##_out" = LoopVectorization.stridedpointer(out)
        var"##T#285" = promote_type(eltype(A), eltype(kern), eltype(out))
        var"##W#284" = LoopVectorization.pick_vector_width_val(eltype(A), eltype(kern), eltype(out))
        var"##mask##" = LoopVectorization.masktable(var"##W#284", LoopVectorization.valrem(var"##W#284", var"##J_loop_upper_bound#291" - var"##J_loop_lower_bound#290"))
    end
    begin
        $(Expr(:gc_preserve, :(begin
            var"##outer##I##outer##" = LoopVectorization.unwrap(var"##I_loop_lower_bound#287")
            while var"##outer##I##outer##" < var"##I_loop_upper_bound#288" - 3
                J = LoopVectorization._MM(var"##W#284", var"##J_loop_lower_bound#290")
                I = var"##outer##I##outer##"
                var"######LOOPSYMBOL##I#293_0_" = LoopVectorization.vload(var"##vptr##_##LOOPSYMBOL##I", (I,))
                var"####reduction#297_0_0" = LoopVectorization.vzero(var"##W#284", var"##T#285")
                var"####reduction#297_0_1" = LoopVectorization.vzero(var"##W#284", var"##T#285")
                var"####reduction#297_0_2" = LoopVectorization.vzero(var"##W#284", var"##T#285")
                var"####reduction#297_0_3" = LoopVectorization.vzero(var"##W#284", var"##T#285")
                I += 1
                var"######LOOPSYMBOL##I#293_1_" = LoopVectorization.vload(var"##vptr##_##LOOPSYMBOL##I", (I,))
                var"####reduction#297_1_0" = LoopVectorization.vzero(var"##W#284", var"##T#285")
                var"####reduction#297_1_1" = LoopVectorization.vzero(var"##W#284", var"##T#285")
                var"####reduction#297_1_2" = LoopVectorization.vzero(var"##W#284", var"##T#285")
                var"####reduction#297_1_3" = LoopVectorization.vzero(var"##W#284", var"##T#285")
                I += 1
                var"######LOOPSYMBOL##I#293_2_" = LoopVectorization.vload(var"##vptr##_##LOOPSYMBOL##I", (I,))
                var"####reduction#297_2_0" = LoopVectorization.vzero(var"##W#284", var"##T#285")
                var"####reduction#297_2_1" = LoopVectorization.vzero(var"##W#284", var"##T#285")
                var"####reduction#297_2_2" = LoopVectorization.vzero(var"##W#284", var"##T#285")
                var"####reduction#297_2_3" = LoopVectorization.vzero(var"##W#284", var"##T#285")
                I += 1
                var"######LOOPSYMBOL##I#293_3_" = LoopVectorization.vload(var"##vptr##_##LOOPSYMBOL##I", (I,))
                var"####reduction#297_3_0" = LoopVectorization.vzero(var"##W#284", var"##T#285")
                var"####reduction#297_3_1" = LoopVectorization.vzero(var"##W#284", var"##T#285")
                var"####reduction#297_3_2" = LoopVectorization.vzero(var"##W#284", var"##T#285")
                var"####reduction#297_3_3" = LoopVectorization.vzero(var"##W#284", var"##T#285")
                begin
                    while J < var"##J_loop_upper_bound#291" - LoopVectorization.valmuladd(var"##W#284", 4, -1)
                        var"######LOOPSYMBOL##J#294_0" = LoopVectorization.vload(var"##vptr##_##LOOPSYMBOL##J", (J,))
                        var"######LOOPSYMBOL##J#294_1" = LoopVectorization.vload(var"##vptr##_##LOOPSYMBOL##J", (J + LoopVectorization.valmul(var"##W#284", 1),))
                        var"######LOOPSYMBOL##J#294_2" = LoopVectorization.vload(var"##vptr##_##LOOPSYMBOL##J", (J + LoopVectorization.valmul(var"##W#284", 2),))
                        var"######LOOPSYMBOL##J#294_3" = LoopVectorization.vload(var"##vptr##_##LOOPSYMBOL##J", (J + LoopVectorization.valmul(var"##W#284", 3),))
                        var"####tempload#296_0" = LoopVectorization.vload(var"##vptr##_kern", (J,))
                        var"####tempload#296_1" = LoopVectorization.vload(var"##vptr##_kern", (J + LoopVectorization.valmul(var"##W#284", 1),))
                        var"####tempload#296_2" = LoopVectorization.vload(var"##vptr##_kern", (J + LoopVectorization.valmul(var"##W#284", 2),))
                        var"####tempload#296_3" = LoopVectorization.vload(var"##vptr##_kern", (J + LoopVectorization.valmul(var"##W#284", 3),))
                        I = var"##outer##I##outer##"
                        var"####indexpr#292_0_0" = var"######LOOPSYMBOL##I#293_0_" + var"######LOOPSYMBOL##J#294_0"
                        var"####indexpr#292_0_1" = var"######LOOPSYMBOL##I#293_0_" + var"######LOOPSYMBOL##J#294_1"
                        var"####indexpr#292_0_2" = var"######LOOPSYMBOL##I#293_0_" + var"######LOOPSYMBOL##J#294_2"
                        var"####indexpr#292_0_3" = var"######LOOPSYMBOL##I#293_0_" + var"######LOOPSYMBOL##J#294_3"
                        var"####tempload#295_0_0" = LoopVectorization.vload(var"##vptr##_A", (var"####indexpr#292_0_0" - 1,))
                        var"####tempload#295_0_1" = LoopVectorization.vload(var"##vptr##_A", (var"####indexpr#292_0_1" - 1,))
                        var"####tempload#295_0_2" = LoopVectorization.vload(var"##vptr##_A", (var"####indexpr#292_0_2" - 1,))
                        var"####tempload#295_0_3" = LoopVectorization.vload(var"##vptr##_A", (var"####indexpr#292_0_3" - 1,))
                        var"####reduction#297_0_0" = LoopVectorization.vfmadd(var"####tempload#295_0_0", var"####tempload#296_0", var"####reduction#297_0_0")
                        var"####reduction#297_0_1" = LoopVectorization.vfmadd(var"####tempload#295_0_1", var"####tempload#296_1", var"####reduction#297_0_1")
                        var"####reduction#297_0_2" = LoopVectorization.vfmadd(var"####tempload#295_0_2", var"####tempload#296_2", var"####reduction#297_0_2")
                        var"####reduction#297_0_3" = LoopVectorization.vfmadd(var"####tempload#295_0_3", var"####tempload#296_3", var"####reduction#297_0_3")
                        I += 1

I think it may be best to render them obsolete, by making the AbstractStridedPointer objects always respect the parent array's offsets so that offsetting is no longer needed, and using a different dummy operation -- one that doesn't incorrectly have the cost of getindex -- to allow for using it. The new operation should also use the ArrayReference type to hold the index it corresponds to, so that the existing changes in reconstruct_loopset.jl will expand it to multiple.

The new dummy should lower simply to the loop indexes themselves.
When R is a CartesianIndex so that I now corresponds to multiple loops, it'll have to produce something that represents those multiple loops, obeys arithmetic how we'd expect, and can still be used for indexing.
We can't use a CartesianIndex, because a vectorized loop will be indexed by _MM{W} rather than an integer. The simplest approach is probably defining a new CartesianIndex-like type that can. I can update the stridedpointers to use it when I update them to respect parental indexing.

The idea would be this new object would at least have +, vload, and vstore! methods defined.

Thoughts? If you agree that it'll simplify things, I'll go ahead and make all these updates to VectorizationBase and SIMDPirates, and issue new releases.


lower_nest lowers the nth loop, where n it the outermost loop. loopq_old is "nested" within.
fillorder! fills the ops array used in it, determining the placement of each operation with respect to the loop nest.

@timholy
Copy link
Contributor Author

timholy commented Mar 9, 2020

Yes, that sounds like an excellent plan. IIUC only the first index would be _MM{W}; would it make sense to define the new type as

struct CartesianVIndex{N,W,K} <: Base.AbstractCartesianIndex{N}
    vi::VectorizationBase._MM{W}
    tail::CartesianIndex{K}
end

It's a pity we can't use N-1 for K (JuliaLang/julia#8322), but one can enforce it with an inner constructor. It seems there might be an advantage in that you could piggy-back on the infrastructure for CartesianIndex/CartesianIndices for the tail indices.

@chriselrod
Copy link
Member

chriselrod commented Mar 9, 2020

Unfortunately, this seems more complicated than I originally thought.

Yes, that sounds like an excellent plan. IIUC only the first index would be _MM{W}; would it make sense to define the new type as

struct CartesianVIndex{N,W,K} <: Base.AbstractCartesianIndex{N}
    vi::VectorizationBase._MM{W}
    tail::CartesianIndex{K}
end

It's a pity we can't use N-1 for K (JuliaLang/julia#8322), but one can enforce it with an inner constructor. It seems there might be an advantage in that you could piggy-back on the infrastructure for CartesianIndex/CartesianIndices for the tail indices.

We can't rely on the _MM being the first index, because any of the loops could be chosen as the vectorized loop.
We could use:

struct CartesianVIndex{N,W,K} <: Base.AbstractCartesianIndex{N}
    i::CartesianIndex{N}
end

where K gives the index of the _MM, or

struct CartesianVIndex{N,T<:Tuple} <: Base.AbstractCartesianIndex{N}
    i::T
end

We would need something closer to this last definition if we wanted to always use it for lowering (currently, it always uses tuples for lowering; I think switching to always using CartesianVIndex instead sounds like a good idea [it would be easier to define methods on than heterogenous tuples, since I don't know how to query N of a heterogenous tuple without an @generated function]).

We would need this flexibility, because it is possible for multiple indexes to be _MM (eg, if someone was iterating over the diagonals A[i,i]), and it is also possible for indexes to be SVec{W,<:Integer}, if for example they're indexed with i::_MM * 3. The _MM always represents contiguous elements, but if multiplying an index by X, we now have a stride of X between indexed elements.

For convenience, I'd avoid nesting CartesianVIndex objects, i.e. make it so CartesianVIndex(k,3,I::CartesianVIndex,j) === CartesianVIndex(k,3,I.i...,j).

We'd also have to be mindful of how we update the ArrayReferenceMeta constructor, discussed in one of my first comments in this PR. Basically, if we go this route, if I is a CartesianIndex{N}

A[I]

can be expanded to

A[I_1,I_2,I_3]

while

A[I + R]

will not be expanded, still using just the single index.
This will hold in the suggestion I gave in that post, because I only suggested expanding the indexes when index_types == LoopIndex, and not when index_types == ComputedIndex.

If we go this route, another change we should make is adding a mutable contiguousindex or fastindex field to the ArrayReference[Meta] objects, and dropping the transposing and index-reversing used now (as well as the Symbol("##DISCONTIGUOUSSUBARRAY##"), instead setting the contiguousindex to 0 or a negative number). The cost of getindex and setindex! would then have to check against this.

  • We'd also need to make sure the costs correctly see through A[I + J]. That is, if 1 is the contiguous index of A, and I and J are CartesianVIndex{2}s, it would have to recognize that vectorizing the first index of I and J will lead to faster loads and stores than vectorizing the second.

The problem faced by the the un-transposing and reversing index order trick is that it would also have to reverse the order of CartesianVIndex objects not directly used for indexing.
For direct indexes, reversing the order, e.g. A[i,j,k] to A[k,j,i] is straightforward.
But when the multiple indices are hiding behind a CartesianVIndex object, as in the A[I+J] case, it would have to transpose the I and J objects as well. I think this sounds more complicated than just specifying which index is the fast one.


Another alternative we can consider is that instead of creating I and J in I + J as CartesianVIndexes, we expand the 1 dummy op with N ops, and similarly expand A[I + J] into A[I_1 + J_1, I_2 + J_2, ..., I_N + J_N].
This has the advantage of involving touching less code outside of loopset reconstruction by trying to put things into a form determine_strategy and lowering already understand, but it may have the disadvantage of resulting in more special-cas behavior.

This will still require updating determinestrategy to see that I_1 and J_1 lead to fast loads/stores, but that the others do not. I should have already implemented that, but had not.
Updating that would require updating definition of unitstride(op::Operation, s::Symbol), which the function returnining true/false if the corresponding operations are contiguous/discontiguous.
I realize now that the current definition is broken. This should return false:

julia> q = :(for i  1:100
           s += a[i,i]
       end);

julia> LoopVectorization.operations(lsq)
3-element Array{LoopVectorization.Operation,1}:
 s = 0
 var"##tempload#266" = a[i, i]
 s = LoopVectorization.vadd(var"##tempload#266", s)

julia> LoopVectorization.unitstride(LoopVectorization.operations(lsq)[2], :i)
true

It incorrectly returns true because it was not checking that a non-fast index also dependended on the loop variable.
On the other hand, the following should return true:

julia> q = :(for i  1:100
           s += a[i + 3]
       end)
:(for i = 1:100
      #= REPL[22]:2 =#
      s += a[i + 3]
  end)

julia> lsq = LoopVectorization.LoopSet(q);

julia> LoopVectorization.operations(lsq)
6-element Array{LoopVectorization.Operation,1}:
 s = 0
 var"####LOOPSYMBOL##i#270" = var"##LOOPSYMBOL##i"[i]
 var"##loopconstnumber#271" = 0
 var"##indexpr#269" = var"####LOOPSYMBOL##i#270" + var"##loopconstnumber#271"
 var"##tempload#272" = a[var"##DISCONTIGUOUSSUBARRAY##", var"##indexpr#269"]
 s = LoopVectorization.vadd(var"##tempload#272", s)

julia> LoopVectorization.unitstride(LoopVectorization.operations(lsq)[5], :i)
false

but returns false, both because a var"##DISCONTIGUOUSSUBARRAY##" was added conservatively (and incorrectly), and it would anyway because var"##indexpr#269" !== :i

The new definition should should look for s, the vectorized loop, in all indexes other than the fast/contiguous index. Maybe something roughly like:

function unitstride(op, s, contigind = contiguousindex(op))
    inds = getindices(op)
    parentindoffset = ??
    for i in eachindex(inds)
        if op.ref.loopedindex[i] # then this index is a loop symbol
            contigind -= 1
            iszero(contigind) && continue # currently contiguous index always == 1
            inds[i] == s && return false
        else
            parentindoffset += 1
            parenttosearch = parents(op)[parentindoffset]
            unitstride(parenttosearch, s, contigind) || return false          
            contigind -= numindices(parenttosearch)  
        end
    end
    true
end

If the vectorized index s shows up anywhere other than along a contiguous index, the function should return false. Additional modifications needed:

  1. parentindoffset would have to be defined, e.g. 0 for memloads and 1 for setindex!, 2 for conditionalstore!.
  2. It would have to have the correct behavior when accessesmemory(op) is false, e.g. for I + J, where I itself doesn't access memory. Perhaps have children of the propsed dummyop inherit relevant info from their parent's ArrayReferenceMetas.

contigind is supposed to pass the correct index to skip to a potential CartesianVIndex object. If an array is indexed with these, length(getindices(arrayref)) will be less than the dimensionality of the original array. contiguousindex(op) is supposed to refer to the original array.
Say that for some weird reason we had a 8 dimensional array, where the 7th dimension was contiguous, and that we're indexing it with A[I+J,k,L+M] where I and J are CartesianVIndex{4}, and L and M are CartesianVIndex{3}. numindices would have to return 4, 1, and 3 on the IpJ, k, and LpM operations.
While iterating in this function, we would call

unitstride(IpJ, s, 7)
unitstride(k, s, 3)
unitstride(LpM, s, 2)

The values of 7 and 3 are too large for the respective 4 and 1 dimensional indices, indicating that none of them are allowed to be vectorized (iszero(contigind) && continue cannot trigger), while 2 is passed as the contiguous index to LpM.
[EDIT: I think the above code shows one of the advantages of the "canonicalization" approach; by transforming everything to either be column-major (first index fast) or no-index-fast, we can simplify the above code and reduce the number of cases we consider in general.]

Thoughts as to the approaches we should take?
I think we should replace LoopValues with a new operation type (what I've been calling a "dummy", because it's a placeholder/ not performing an operation).
But from there, do we have it lower to CartesianVIndex{N} when the loop variable is a CartesianIndex, or do you think we should replace it with N operations?

For now, I suspect the "expand to N operations" will be simpler.

@chriselrod
Copy link
Member

I've released VectorizationBase v0.8, and SIMDPirates 0.7 should be tagged in a few minutes.
These have switched stridedpointers to 1-based indexing:

julia> A = rand(100,100);

julia> using VectorizationBase, SIMDPirates
[ Info: Precompiling SIMDPirates [21efa798-c60a-11e8-04d3-e1a92915a26a]

julia> vload(stridedpointer(A), (1,1))
0.742845271680725

julia> A[1:8,1:8]
8×8 Array{Float64,2}:
 0.742845  0.757178  0.300299  0.603263  0.93428    0.831621   0.0556697  0.304235
 0.321009  0.992542  0.691654  0.894312  0.322318   0.815803   0.173785   0.472578
 0.839145  0.236637  0.863609  0.855714  0.830834   0.852366   0.817774   0.271009
 0.497822  0.406762  0.448195  0.827784  0.391112   0.197293   0.985954   0.483548
 0.406454  0.186679  0.630388  0.510679  0.0327818  0.106299   0.920731   0.360811
 0.597366  0.422358  0.620349  0.642949  0.318277   0.0382444  0.807226   0.516668
 0.70947   0.703065  0.599923  0.238758  0.742942   0.171481   0.307564   0.234002
 0.309503  0.868046  0.444636  0.785198  0.802417   0.892564   0.378186   0.981091

julia> vload(stridedpointer(A), (_MM{8}(1),1))
SVec{8,Float64}<0.742845271680725, 0.32100884916103745, 0.8391447789719655, 0.4978220946937666, 0.4064540466312667, 0.5973661806262187, 0.709469624346265, 0.30950340386114306>

julia> vload(stridedpointer(A), (1,_MM{8}(1)))
SVec{8,Float64}<0.742845271680725, 0.7571776100487808, 0.30029898328718097, 0.6032634638402294, 0.9342799038628531, 0.8316213365756264, 0.055669703980538054, 0.30423537816346524>

julia> vload(stridedpointer(A), (_MM{8}(1),_MM{8}(1)))
SVec{8,Float64}<0.742845271680725, 0.9925421783730857, 0.8636089381258407, 0.8277841006116959, 0.0327818040145591, 0.03824440231241022, 0.30756379402750467, 0.9810911051207567>

@timholy
Copy link
Contributor Author

timholy commented Mar 9, 2020

We can't rely on the _MM being the first index

Yeah, I realized that as I was going to bed last night.

struct CartesianVIndex{N,T<:Tuple} <: Base.AbstractCartesianIndex{N}
   i::T
end

Especially if we need multiple _MM that seems like a fine design. Keep in mind that CartesianIndex at this point is basically a tool for providing a few extra methods (e.g., I+J) and deliberately blocking iteration/broadcasting (you don't need to wrap them in Ref or a tuple to make them act like a scalar).

it would be easier to define methods on than heterogenous tuples, since I don't know how to query N of a heterogenous tuple without an @generated function

Perhaps I can help here. I'm not sure I understand what you mean by "query N" but

julia> nitems(::Tuple{Vararg{Any,N}}) where N = N
nitems (generic function with 1 method)

julia> nitems((1, "hello", 3.2))
3

is viable. There's a lot you can do on tuples without generated functions. For example, there's a possibility of replacing some of @generated functions with tuple-manipulations (e.g., stridedpointer on a LowDimArray seems a likely candidate), but I'm not sure it's important because things like _avx_! probably have to be @generated and the package won't be substantially Revise-able unless we got rid of all of them.

I think we should replace LoopValues with a new operation type (what I've been calling a "dummy", because it's a placeholder/ not performing an operation).

That seems sensible. Mildly relatedly, in your s += a[i+3] example above I don't understand why

var"##loopconstnumber#271" = 0

rather than 3.

For now, I suspect the "expand to N operations" will be simpler.

That's my impression too. I'm planning on fixing JuliaLang/julia#9080 that way.

These have switched stridedpointers to 1-based indexing

Hope you didn't sacrifice anything substantial by doing this! (I checked the commits and they don't look bad, but...) It does seem easier to use the native indexing, thanks for doing this!

@chriselrod
Copy link
Member

Perhaps I can help here. I'm not sure I understand what you mean by "query N" but

That is what I meant, thanks. ::Tuple{Vararg{Any,N}} solves it.

That's my impression too. I'm planning on fixing JuliaLang/julia#9080 that way.

Regarding that issue, out of curiosity, how/where will you do the expansion? Seems like this would require expression manipulation?

I'm not sure it's important because things like avx! probably have to be @generated and the package won't be substantially Revise-able unless we got rid of all of them.

I suspect that it will still have some compile-time benefits.
Would there be an easy way to force _avx_! to recompile, so that it'll be updated?

Hope you didn't sacrifice anything substantial by doing this! (I checked the commits and they don't look bad, but...) It does seem easier to use the native indexing, thanks for doing this!

No problem. Those commits were small; I suspect much smaller than the amount of code we can now remove from this library.

Mildly relatedly, in your s += a[i+3] example above I don't understand why

I should fix the show method to at least not be terribly misleading:

function Base.show(io::IO, op::Operation)
    if isconstant(op)
        if op.instruction === LOOPCONSTANT
            print(io, Expr(:(=), op.variable, 0))
        else
            print(io, Expr(:(=), op.variable, op.instruction.instr))
        end

It always prints 0, regardless of what the actual value is.

@timholy
Copy link
Contributor Author

timholy commented Mar 9, 2020

Would there be an easy way to force avx! to recompile, so that it'll be updated?

Right now no, but that could be added. Now that I think about it, it might work to call revise(LoopVectorization) twice. I've tried once and that doesn't work, but that might be due to order effects. But it would probably be better to support revise(LoopVectorization._avx_!) so that you don't have to recompile everything.

how/where will you do the expansion

Probably right before this line. At that point the code has been lowered and type-inferred, but not inlined or subjected to a number of other optimization passes. When you replace it there are basically two steps:

  • generate appropriately type-inferred code; I'm not yet sure whether it will be better to generate it using the standard surface AST and call lowering and inference, or if it will be easier to directly emit lowered & inferred code (for this case I'm going to guess the latter will actually be easier)
  • renumber SSAValues that occur later in the code (easy)

@chriselrod
Copy link
Member

chriselrod commented Mar 11, 2020

I updated master to:

  1. Use (parent array)-based indexing.
  2. Fix unitstride
  3. Make loopvalue an operation type.

I'll run benchmarks later, if it doesn't look like I'm done fiddling.

@timholy
Copy link
Contributor Author

timholy commented Mar 11, 2020

Awesome! Good timing too; I wanted to see what that compiler pass would look like, but now that it's out there I can let that sit until a few more people have looked at it. So I might get back to this soon.

@timholy
Copy link
Contributor Author

timholy commented Mar 14, 2020

Progress! It actually runs now. The only issue is returning the correct answer 😄. I still need to take a close look at the generated code.

@chriselrod
Copy link
Member

chriselrod commented Mar 14, 2020

I think it looks good and I like your approach.

But errors are a lot easier than silently wrong answers! Looking at the generated expression:

var"####op#855_0_0" = var"##n#845" + var"##n#851"
var"####op#855_0_1" = var"##n#845" + var"##n#851"
var"####op#855_0_2" = var"##n#845" + var"##n#851"
var"####op#855_0_3" = var"##n#845" + var"##n#851"
var"####op#856_0_0" = LoopVectorization.vload(var"##862", (var"####op#855_0_0",))
var"####op#856_0_1" = LoopVectorization.vload(var"##862", (var"####op#855_0_1",))
var"####op#856_0_2" = LoopVectorization.vload(var"##862", (var"####op#855_0_2",))
var"####op#856_0_3" = LoopVectorization.vload(var"##862", (var"####op#855_0_3",))

var"##862" is A, var"##n#845" is the row loop for CartesianIndices(out), and var"##n#851" is the row loop for CartesianIndices(kernel).

So what's missing is adding extra loopvalue operations for each Cartesian index beyond the first, and also adding them as indices to the associated arrays.

It is correctly loading from kernel, where the loop indices are used directly instead of for computation:

var"####op#857_0" = LoopVectorization.vload(var"##863", (var"##n#851", var"##n#848"))
var"####op#857_1" = LoopVectorization.vload(var"##863", (var"##n#851", var"##n#848" + 1))
var"####op#857_2" = LoopVectorization.vload(var"##863", (var"##n#851", var"##n#848" + 2))
var"####op#857_3" = LoopVectorization.vload(var"##863", (var"##n#851", var"##n#848" + 3))

@timholy
Copy link
Contributor Author

timholy commented Mar 17, 2020

More progress, but this is starting to feel a bit messy so I thought I'd ask for help rather than go down a crazy rabbit hole. What I realized was that, just like I needed to add more loop symbols and then keep track of the offsets relative to the original definition, so too I need to add additional operations (to handle I+J for CartesianIndexes) and keep track of their operation-offsets. Unfortunately this one doesn't seem to be going as well.

Here's a replay of my latest test session. The key part of this is to pay attention to is that it shows all the operations after creating them (and setting their parents).

julia> using LoopVectorization, OffsetArrays
[ Info: Precompiling LoopVectorization [bdcacae8-1622-11e9-2a5c-532679323890]

julia> function avxgeneric!(out, A, kern, R=CartesianIndices(out), z=zero(eltype(out)))
          Rk = CartesianIndices(kern)
          @avx for I in R
              tmp = z
              for J in Rk
                  tmp += A[I+J]*kern[J]
              end
              out[I] = tmp
          end
          out
       end
avxgeneric! (generic function with 3 methods)

julia> T = Float32
Float32

julia> A = rand(T, 100, 100);

julia> kern = OffsetArray(rand(T, 3, 3), -1:1, -1:1);

julia> out1 = OffsetArray(similar(A, size(A).-2), 1, 1);   # stay away from the edges of A

julia> avxgeneric!(out1, A, kern);
op = var"##op#447#1#" = z
op = var"##op#447#2#" = z
op = var"##op#448#1#" = var"##op#448#1#"
op = var"##op#448#2#" = var"##op#448#2#"
op = var"##op#449#1#" = var"##op#449#1#"
op = var"##op#449#2#" = var"##op#449#2#"
op = var"##op#450#1#" = var"##op#449#1#" + var"##op#448#1#"
op = var"##op#450#2#" = var"##op#449#2#" + var"##op#448#2#"
op = var"##op#451#1#" = A[var"##op#450"]
op = var"##op#451#2#" = A[var"##op#450"]
op = var"##op#452#1#" = kern[var"J#1#", var"J#2#"]
op = var"##op#452#2#" = kern[var"J#1#", var"J#2#"]
op = var"##op#453#1#" = var"##reductzero#434"
op = var"##op#453#2#" = var"##reductzero#434"
op = var"##op#453#1#" = LoopVectorization.vfmadd(var"##op#453#1#", var"##op#452#1#", var"##op#451#1#")
op = var"##op#453#2#" = LoopVectorization.vfmadd(var"##op#453#2#", var"##op#452#2#", var"##op#451#2#")
op = var"##op#447#1#" = LoopVectorization.reduced_add(var"##op#447#1#", var"##op#453#1#")
op = var"##op#447#2#" = LoopVectorization.reduced_add(var"##op#447#2#", var"##op#453#2#")
op = out[var"I#1#", var"I#2#"] = var"##op#447#1#"
op = out[var"I#1#", var"I#2#"] = var"##op#447#2#"
op = var"##op#467#1#" = z
op = var"##op#467#2#" = z
op = var"##op#468#1#" = var"##op#468#1#"
op = var"##op#468#2#" = var"##op#468#2#"
op = var"##op#469#1#" = var"##op#469#1#"
op = var"##op#469#2#" = var"##op#469#2#"
op = var"##op#470#1#" = var"##op#469#1#" + var"##op#468#1#"
op = var"##op#470#2#" = var"##op#469#2#" + var"##op#468#2#"
op = var"##op#471#1#" = A[var"##op#470"]
op = var"##op#471#2#" = A[var"##op#470"]
op = var"##op#472#1#" = kern[var"J#1#", var"J#2#"]
op = var"##op#472#2#" = kern[var"J#1#", var"J#2#"]
op = var"##op#473#1#" = var"##reductzero#434"
op = var"##op#473#2#" = var"##reductzero#434"
op = var"##op#473#1#" = LoopVectorization.vfmadd(var"##op#473#1#", var"##op#472#1#", var"##op#471#1#")
op = var"##op#473#2#" = LoopVectorization.vfmadd(var"##op#473#2#", var"##op#472#2#", var"##op#471#2#")
op = var"##op#467#1#" = LoopVectorization.reduced_add(var"##op#467#1#", var"##op#473#1#")
op = var"##op#467#2#" = LoopVectorization.reduced_add(var"##op#467#2#", var"##op#473#2#")
op = out[var"I#1#", var"I#2#"] = var"##op#467#1#"
op = out[var"I#1#", var"I#2#"] = var"##op#467#2#"
op = var"##op#487#1#" = z
op = var"##op#487#2#" = z
op = var"##op#488#1#" = var"##op#488#1#"
op = var"##op#488#2#" = var"##op#488#2#"
op = var"##op#489#1#" = var"##op#489#1#"
op = var"##op#489#2#" = var"##op#489#2#"
op = var"##op#490#1#" = var"##op#489#1#" + var"##op#488#1#"
op = var"##op#490#2#" = var"##op#489#2#" + var"##op#488#2#"
op = var"##op#491#1#" = A[var"##op#490"]
op = var"##op#491#2#" = A[var"##op#490"]
op = var"##op#492#1#" = kern[var"J#1#", var"J#2#"]
op = var"##op#492#2#" = kern[var"J#1#", var"J#2#"]
op = var"##op#493#1#" = var"##reductzero#434"
op = var"##op#493#2#" = var"##reductzero#434"
op = var"##op#493#1#" = LoopVectorization.vfmadd(var"##op#493#1#", var"##op#492#1#", var"##op#491#1#")
op = var"##op#493#2#" = LoopVectorization.vfmadd(var"##op#493#2#", var"##op#492#2#", var"##op#491#2#")
op = var"##op#487#1#" = LoopVectorization.reduced_add(var"##op#487#1#", var"##op#493#1#")
op = var"##op#487#2#" = LoopVectorization.reduced_add(var"##op#487#2#", var"##op#493#2#")
op = out[var"I#1#", var"I#2#"] = var"##op#487#1#"
op = out[var"I#1#", var"I#2#"] = var"##op#487#2#"
ERROR: MethodError: no method matching extract_external_functions!(::LoopVectorization.LoopSet, ::Nothing)
Closest candidates are:
  extract_external_functions!(::LoopVectorization.LoopSet, ::Int64) at /home/tim/.julia/dev/LoopVectorization/src/reconstruct_loopset.jl:257
Stacktrace:
 [1] avx_loopset(::Array{LoopVectorization.Instruction,1}, ::Array{LoopVectorization.OperationStruct,1}, ::Array{LoopVectorization.ArrayRefStruct,1}, ::Core.SimpleVector, ::Core.SimpleVector, ::Core.SimpleVector, ::NTuple{4,DataType}) at /home/tim/.julia/dev/LoopVectorization/src/reconstruct_loopset.jl:289
 [2] _avx_loopset(::Core.SimpleVector, ::Core.SimpleVector, ::Core.SimpleVector, ::Core.SimpleVector, ::Core.SimpleVector, ::NTuple{4,DataType}) at /home/tim/.julia/dev/LoopVectorization/src/reconstruct_loopset.jl:308
 [3] #s68#88(::Any, ::Any, ::Any, ::Any, ::Any, ::Any, ::Any, ::Type, ::Type, ::Type, ::Type, ::Any, ::Any, ::Any) at /home/tim/.julia/dev/LoopVectorization/src/reconstruct_loopset.jl:315
 [4] (::Core.GeneratedFunctionStub)(::Any, ::Vararg{Any,N} where N) at ./boot.jl:524
 [5] macro expansion at ./gcutils.jl:91 [inlined]
 [6] avxgeneric!(::OffsetArray{Float32,2,Array{Float32,2}}, ::Array{Float32,2}, ::OffsetArray{Float32,2,Array{Float32,2}}, ::CartesianIndices{2,Tuple{OffsetArrays.IdOffsetRange{Int64,Base.OneTo{Int64}},OffsetArrays.IdOffsetRange{Int64,Base.OneTo{Int64}}}}, ::Float32) at ./REPL[2]:3
 [7] avxgeneric!(::OffsetArray{Float32,2,Array{Float32,2}}, ::Array{Float32,2}, ::OffsetArray{Float32,2,Array{Float32,2}}) at ./REPL[2]:2
 [8] top-level scope at REPL[7]:1

There are a couple of things that seem undesirable here:

  • there's unnecessary duplication of scalars like z and even array accesses (the latter is especially Not Good)
  • things like A[var"##op#450"] still need to be translated into A[var"##op#450#1#", var"##op#450#2#"] (presumably I can fix this fairly easily)

Any suggestions about how to proceed?

@chriselrod
Copy link
Member

chriselrod commented Mar 17, 2020

What I realized was that, just like I needed to add more loop symbols and then keep track of the offsets relative to the original definition, so too I need to add additional operations (to handle I+J for CartesianIndexes) and keep track of their operation-offsets.

I would make sure they're unnecessary for later steps. That is, you should be able to discard them as soon as add_ops! returns.

The new add_op! should check the operation type to determine behavior when nops > 1.

  1. if optype(os) == loopvalue, then I think the current implementation looks good. You can mark them as expanded, figuratively or literally.
  2. if optype(os) == compute. Check parents. If any parents are expanded, then these must be expanded as well; I'd have to look more closely, but I think the code would be fine in that case. If the parents are not expanded, then you'd add just a single operation depending on those parents. If they are expanded, you'd add multiple operations, each depending on the corresponding expanded parents.
  3. if optype(os) == constant -- not expanded. Add a single operation with loop-dependencies equal to the union. So instead of two zs, where one is dependent on I_1 and the other on I_2, there should be a single z dependent on both.
  4. if optype(os) == memload || optype(os) == memstore -- not expanded. Add a single operation where the loopdependencies are the union, and add all of the expanded loop indices if !loopedindex[n], and add the appropriate ops as parents as well, and insert the appropriate falses to loopedindex. You could also add the indices if loopedindex[n], in which chase you could drop the indexlookup (and insert the appropriate trues into loopedindex).

@chriselrod
Copy link
Member

I made a few changes so that tests pass locally.
Let me know what you think.

…even when there is only a single loop (because that single loop could be over a CartesianIndex).
@chriselrod chriselrod merged commit 4123024 into JuliaSIMD:master Mar 21, 2020
@timholy timholy deleted the teh/cartesianindices branch March 23, 2020 09:47
@timholy
Copy link
Contributor Author

timholy commented Mar 23, 2020

Love it---thanks so much for finishing this off, and sorry I couldn't help more!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants