
WIP: support mixed Int/CartesianIndex indexing #84

Merged: 5 commits merged into JuliaSIMD:master on Apr 8, 2020
Conversation

@timholy (Contributor) commented Apr 5, 2020

ImageFiltering contains optimizations for separable filters: if the filter size along each dimension is n[i], then per-pixel a fully-separable filter has cost O(n[1] + ... + n[d]), whereas a non-separable filter has cost O(n[1] * ... * n[d]). Consequently this can be a very large savings. This PR aims to support such operations by extending support for the exponential-filter demo in https://julialang.org/blog/2016/02/iteration/.
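To make the separability claim concrete, here is a minimal sketch of applying a separable 2-d filter as two 1-d passes, so the per-pixel cost is n[1] + n[2] rather than n[1] * n[2]. This is a hand-rolled illustration with replicate boundary handling, not the ImageFiltering implementation:

# Sketch only: separable 2-d filtering as two 1-d passes.
function filter_separable(x::AbstractMatrix, kcol::AbstractVector, krow::AbstractVector)
    T = float(eltype(x))
    tmp = similar(x, T)
    out = similar(x, T)
    m, n = size(x)
    hc, hr = length(kcol) ÷ 2, length(krow) ÷ 2
    for j in 1:n, i in 1:m              # pass 1: filter along dimension 1, O(n[1]) per pixel
        acc = zero(T)
        for (a, w) in enumerate(kcol)
            acc += w * x[clamp(i + a - 1 - hc, 1, m), j]   # replicate boundary
        end
        tmp[i, j] = acc
    end
    for j in 1:n, i in 1:m              # pass 2: filter along dimension 2, O(n[2]) per pixel
        acc = zero(T)
        for (b, w) in enumerate(krow)
            acc += w * tmp[i, clamp(j + b - 1 - hr, 1, n)]
        end
        out[i, j] = acc
    end
    return out
end

# e.g. filter_separable(rand(100, 100), [0.25, 0.5, 0.25], [0.25, 0.5, 0.25])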

This doesn't yet work, because I haven't yet compensated for what the commented-out @assert is supposed to catch. I thought I'd submit this now to discuss the right architecture changes. My presumption is that the easiest approach is to return Δidxs from calcnops and thus make nopsv a Vector{Vector{Int}}. But if you have a better suggestion I am all ears.

I'll leave a couple of comments in the changes.

Review threads (outdated, resolved): src/reconstruct_loopset.jl, src/split_loops.jl
@chriselrod (Member) commented Apr 6, 2020

There is something else missing for the exponential filter demo:

function expfiltdim(x, dim::Integer, α)
    s = similar(x)
    Rpre = CartesianIndices(size(x)[1:dim-1])
    Rpost = CartesianIndices(size(x)[dim+1:end])
    _expfilt!(s, x, α, Rpre, size(x, dim), Rpost)
end

function _expfilt!(s, x, α, Rpre, n, Rpost)
    for Ipost in Rpost
        # Initialize the first value along the filtered dimension
        for Ipre in Rpre
            s[Ipre, 1, Ipost] = x[Ipre, 1, Ipost]
        end
        # Handle all other entries
        for i = 2:n
            for Ipre in Rpre
                s[Ipre, i, Ipost] = α*x[Ipre, i, Ipost] + (1-α)*s[Ipre, i-1, Ipost]
            end
        end
    end
    s
end

There is a loop-carried dependency:

s[Ipre, i, Ipost] = α*x[Ipre, i, Ipost] + (1-α)*s[Ipre, i-1, Ipost]

That is, s[Ipre, i, Ipost] depends on the value computed in the previous iteration, s[Ipre, i-1, Ipost].

LoopVectorization doesn't yet support loop-carried dependencies, but they would be very nice to support, since they show up in lots of code and don't always prevent vectorization.

The simplest way to get them to work would be to prevent the loop with a loop-carried dependency (the i loop) from being vectorized and from being unrolled/tiled.
Disallowing vectorization reflects a fundamental limitation of SIMD: you can't vectorize cumsum(::AbstractVector{<:Number}).
Disallowing unrolling is due to an implementation detail of LoopVectorization: it interleaves operations while lowering them.

Onα = 1 - α

s_1 = s[Ipre, i-1, Ipost]
s_2 = s[Ipre, i  , Ipost]
x_1 = x[Ipre, i, Ipost]
x_2 = x[Ipre, i+1, Ipost]
αx_1 = α * x_1
αx_2 = α * x_2
Onαs_1 = Onα * s_1
Onαs_2 = Onα * s_2
s[Ipre, i, Ipost] = αx_1 + Onαs_1
s[Ipre, i+1, Ipost] = αx_2 + Onαs_2

The idea behind this interleaving is to better support super-scalar parallelism. The operations next to each other wouldn't depend on each other, so no complicated reordering is needed by the CPU's scheduler to execute them in parallel.

The problem is that, in this case, the above unrolling is wrong: s_2 should equal αx_1 + Onαs_1, not s[Ipre, i, Ipost].
I think limiting the unrolling of these loops to 1 is an okay solution. One of the advantages of unrolling is getting that increased parallelism, and we would get a little less of it even if this loop were lowered correctly:

Onα = 1 - α

s_1 = s[Ipre, i-1, Ipost]
x_1 = x[Ipre, i, Ipost]
Onαs_1 = Onα*s_1
αx_1 = α*x_1
s_2 = αx_1 + Onαs_1
s[Ipre, i  , Ipost] = s_2
x_2 = x[Ipre, i+1, Ipost]
αx_2 = α * x_2
Onαs_2 = Onα * s_2
s[Ipre, i+1, Ipost] = αx_2 + Onαs_2

Onα * s_2 has to wait on s_2 = αx_1 + Onαs_1 to complete before starting, but αx_2 could still be computed in parallel with earlier operations.
The biggest advantage from unrolling this loop would be eliminating the load for s_2. With 2x unrolling, it needs just 3 loads, not 4. Because the loads are probably slower than the arithmetic (2x * + 2x fma), that's likely to still help performance.

EDIT:
If you want, I can work on the loop-carried dependencies, but you're welcome to tackle that yourself as well.
Currently the offsets field of the ArrayReference type is broken and not used.
My plan would be to get it working, and use it for lowering loads and stores with offset indices.

Then, this would also supply a means of checking for loop-carried dependencies: if (offset of a store) > (offset of a load), there's a dependency.
A concern is that the offset is currently an Int8; [-128, 127] is a small range for offsets. The idea was that this would let a UInt64 hold the offsets of 8-dimensional indexes for the condense_loopset -> reconstruct_loopset transformation, which I figured would be enough for most applications. Perhaps I could use a UInt128 instead. Then it could handle 8 indices, with offsets in the range [-32768, 32767].
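For concreteness, a sketch of that kind of bit-packing; the names are illustrative, not the actual condense_loopset/reconstruct_loopset code:

# Illustration only: pack eight signed 8-bit offsets into one UInt64 and read one back.
# Widening each field to Int16 (packed into a UInt128) would give the [-32768, 32767] range.
function pack_offsets(offs::NTuple{8,Int8})
    u = zero(UInt64)
    for i in 8:-1:1
        u = (u << 8) | (reinterpret(UInt8, offs[i]) % UInt64)
    end
    return u
end
unpack_offset(u::UInt64, i::Integer) = reinterpret(Int8, (u >> (8 * (i - 1))) % UInt8)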

@chriselrod (Member) left a comment

My presumption is that the easiest approach is to return Δidxs from calcnops and thus make nopsv a Vector{Vector{Int}}

Yes, I think that is the easiest approach.

An alternative that would be more complicated (at least in the short term) would be to define a NopsDescription struct to make a cleaner API for querying the operations. That could make things simpler long term.
It could house Δidxs and expandedv.

Out of curiosity, what are your thoughts on explicit type signatures for all these functions? My thought was that using nopsv::Vector{Int} makes it easier to catch errors (including the method error on lower from the loop_split code), and ensures I look at each function taking nopsv as an argument.
A secondary thought is that I'd rather avoid compiling more than a single method of any of these functions.

@timholy (Contributor, Author) commented Apr 6, 2020

Thanks! Interesting about the loop-carried dependency; that makes sense. Despite this example, it is not an issue for most of ImageFiltering, only the IIRGaussian method, I think. The most common use of separable filters is FIR filtering.

what are your thoughts on explicit type signatures for all these functions

Interesting question. In general, if it's the case that you only expect particular input types, then adding checks essentially becomes a set of asserts that can catch bugs early. It's the type annotations that restrict what you can do with a function (e.g., x::Vector{Float64} when the method would work for any AbstractArray) that are annoying. I can't remember seeing anything I'd disagree with.
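A toy illustration of that distinction (hypothetical code, not from either package):

# Annotation as an assert: only one concrete type is ever expected, so anything else
# fails immediately with a MethodError instead of a confusing error deeper in the pipeline.
count_total_ops(nopsv::Vector{Int}) = sum(nopsv)

# Annotation that needlessly restricts: the algorithm works for any numeric array.
sumsq_strict(x::Vector{Float64}) = sum(abs2, x)
sumsq(x::AbstractArray{<:Number}) = sum(abs2, x)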

I will go about making that change, and re-jiggering the test to avoid the loop-carried dependency. Maybe I'll put one in as a @test_broken just so you can reap the reward when it does become supported.

@timholy (Contributor, Author) left a comment

This PR almost works now, but I may have gotten to the stage where I need a bit of help. As you'll see, it's a bit of a Frankensteinian approach: in its current form, it's essentially deferring the old @assert all(isequal(nops), Δidxs) to the last possible point-of-use. Doing it this way was easier for me in my current state of understanding, and to my surprise it ended up circumventing the need to deal with heterogeneous nops. (By effectively delaying the @assert it never ends up firing.) I am not sure how, or whether it's possible, to write a test that would force me to deal with heterogeneous nops.

I don't necessarily propose that we keep the implementation this way, just that this was the first successful approach I came up with.

# Initialize the first value along the filtered dimension
for Ipre in Rpre
    s[Ipre, ifirst, Ipost] = x[Ipre, ifirst, Ipost]
end
@timholy (Contributor, Author) left a comment

At least for d=1 in the test loop at the bottom of this file, the test failure is due to this loop failing to run for all of Rpost; it only runs for first(Rpost). (And I do mean Rpost rather than Rpre.) I am guessing this is a loop-ordering thing? I looked at the expanded code but struggled to figure out which pieces come from what parts of this code; if you get a chance, would you mind looking at that? For the 3d test below, test failures occur for d=1 and d=3, whereas it passes for d=2. It would probably be best to re-enable the 5d (or at least the 4d) case before merging, since you need d=5 to test the case of indexing with x[::CI{2}, ::Int, ::CI{2}] (CI=CartesianIndex).

@chriselrod (Member) left a comment

I think heterogeneous nops should either be allowed only for memload and memstore operations, or perhaps also for compute operations iff Set(Δidxs) == Set([1, n]) (i.e., broadcasting the nops == 1 entries to match the others).

I'd lean toward the former for now, for simplicity: maintain the old behavior except for indexing operations.
That means you could make the @assert immediate, but conditional on !accesses_memory(op).
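Roughly, as a sketch (reusing the old assert and the accesses_memory predicate; not necessarily how it will be written in the PR):

# Sketch: heterogeneous Δidxs are tolerated only for memory operations (loads/stores);
# all other operations keep the old homogeneous-nops requirement.
function check_nops(op, nops::Int, Δidxs::Vector{Int})
    accesses_memory(op) && return nothing
    @assert all(isequal(nops), Δidxs) "heterogeneous nops only allowed for loads/stores"
    return nothing
end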

I'll take a deeper look at the code in the next few hours and leave more comments.

@chriselrod (Member) commented Apr 8, 2020

I found a couple of problems. I'll push when I've fixed them and leave a few comments.
Two of the highlights:

  1. subset_vptr! is broken with CartesianIndices, because it used the number of loops to find the axis. It also gets called when constructing the initial LoopSet, not in reconstruct_loopset, meaning we can't supply it the offset vector. I'll either drop it and trust LLVM to take care of that optimization, or adjust the function call to do the right thing anyway.
  2. I'm using const NOpsType = Int, and the calcnops function returns maximum(i->offsets[i+1]-offsets[i], idxs).

@chriselrod (Member) commented Apr 8, 2020

The tile=(1,1) is unnecessary. I just added that while experimenting.
But I think @avx is being overly aggressive in unrolling this. I'll run benchmarks later.

LoopVectorization also doesn't support multiple loops at the same level in a loop nest yet.

for m in 1:M
    for n in 1:N
    end
    for n in 1:N
    end
end

Would be good to add eventually.

@timholy (Contributor, Author) left a comment

But it's nice that I don't have to think about cache-friendliness when writing out those loops!

end
nothing
end
function subset_vptr!(ls::LoopSet, vptr::Symbol, indnum::Int, ind, previndices, loopindex)
@chriselrod (Member) commented Apr 8, 2020

For now, I decided it would be easier to update the old subsetview function. This function is used for constant slices, like indexing by ifirst. It gets called very early, to avoid storing or representing that there was an index there at all.
This probably caused the failures you observed, because it assumed the preceding indices corresponded to one axis each. That assumption was violated when d == 1 and d == 3, where the first index corresponded to 0 and 2 axes instead.

It now calls valdims on each of the preceding loops instead.
I currently defined valdims to return Val(1) for everything except CartesianIndices{N}, where it returns Val(N) instead.

With this change, it now slices the correct axis.
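A minimal sketch of what such a helper could look like (the definition in the PR may differ in detail):

# Sketch: number of array axes covered by one loop/index, returned as a Val
# so it is available at compile time.
valdims(::CartesianIndices{N}) where {N} = Val(N)
valdims(::Any) = Val(1)

# valdims(CartesianIndices((1:3, 1:4)))  # Val{2}()
# valdims(1:10)                          # Val{1}()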

@@ -327,7 +337,8 @@ function setup_call_noinline(ls::LoopSet, U = zero(Int8), T = zero(Int8))
         push!(q.args, ex)
     end
     insert!(call.args, 8, Expr(:call, Expr(:curly, :Val, vptrarrays)))
-    insert!(call.args, 9, Expr(:call, Expr(:curly, :Val, vptrsubsetdims)))
+    # insert!(call.args, 9, Expr(:call, Expr(:curly, :Val, vptrsubsetdims)))
+    insert!(call.args, 9, vptrsubsetdims)
@chriselrod (Member) left a comment

This function parsed Val{N}() expressions to extract the N and create an expression for Val{tuple_of_Ns}(). This broke as soon as the Val{N}() expressions were replaced with calls that return Vals.
So now, instead of parsing expressions, it makes them arguments to a function that will return the desired answer.

The purpose of that song and dance is so that __avx__! is not inlined, and calls _avx_!, which is inlined.
Within the __avx__! call, it'll subset the arrays, get pointers, etc., so that LLVM still has this aliasing / relative-pointer-offset information available. This reduced the performance disparity between setting inline=false and inline=true (the latter is the default).
This is not a good solution, but a hack added on later when I decided to add this.
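The general shape of that pattern, as a generic sketch (names and body are illustrative, not the actual __avx__!/_avx_! definitions):

# Sketch of the outer-@noinline / inner-@inline pattern: the outer function does per-call
# setup (array subsetting, pointer extraction) so the aliasing / offset information stays
# local to one non-inlined frame, while the hot loop is still inlined into that frame.
@noinline function outer_kernel!(dest::AbstractVector, src::AbstractVector)
    # (per-call setup such as subsetting / pointer extraction would go here)
    inner_kernel!(dest, src)
end
@inline function inner_kernel!(dest::AbstractVector, src::AbstractVector)
    @inbounds for i in eachindex(dest, src)
        dest[i] = 2 * src[i]
    end
    return dest
end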

@codecov-io commented Apr 8, 2020

Codecov Report

Merging #84 into master will decrease coverage by 0.47%.
The diff coverage is 78.00%.


@@            Coverage Diff             @@
##           master      #84      +/-   ##
==========================================
- Coverage   91.44%   90.97%   -0.48%     
==========================================
  Files          28       28              
  Lines        2841     2892      +51     
==========================================
+ Hits         2598     2631      +33     
- Misses        243      261      +18     
Impacted Files Coverage Δ
src/lower_store.jl 87.59% <ø> (ø)
src/lower_load.jl 85.50% <69.69%> (-14.50%) ⬇️
src/memory_ops_common.jl 86.74% <70.83%> (-4.03%) ⬇️
src/add_ifelse.jl 93.84% <85.71%> (-2.83%) ⬇️
src/condense_loopset.jl 93.28% <85.71%> (-0.25%) ⬇️
src/reconstruct_loopset.jl 91.41% <100.00%> (-0.08%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ca31ba1...a311e1a

s[Ipre, i, Ipost] = α*x[Ipre, i, Ipost] + (1-α)*x[Ipre, i-1, Ipost]
xi = x[Ipre, i, Ipost]
xim = ifelse(i == ifirst, xi, x[Ipre, i-1, Ipost])
s[Ipre, i, Ipost] = α*xi + (1-α)*xim
end
end
@chriselrod (Member) left a comment

Not as pretty, but another way to write the function.

I think this may risk crashing because of the read of x[Ipre, ifirst-1, Ipost], even though that value won't be used, so I'm not inclined to recommend this until I add explicit support for conditional loads, like the conditional-store support it has now.
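One hypothetical way to avoid the out-of-bounds read without conditional-load support would be to clamp the speculative index so the load stays in bounds (sketch only, and not what this PR does):

# Hypothetical: the i-1 load is clamped to ifirst, so when i == ifirst we load an
# in-bounds value that the ifelse then discards.
xi  = x[Ipre, i, Ipost]
xim = ifelse(i == ifirst, xi, x[Ipre, max(i - 1, ifirst), Ipost])
s[Ipre, i, Ipost] = α*xi + (1-α)*xim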

@chriselrod (Member) left a comment

This did cause crashes on Travis. I'll work on turning this into a conditional load.

Rpost = CartesianIndices(axes(x)[d+1:end])
# @show d
Rpre = CartesianIndices(axes(x)[1:d-1]);
Rpost = CartesianIndices(axes(x)[d+1:end]);
@chriselrod (Member) left a comment

The semicolons are because I don't want to be bombarded by output when running these interactively.

@@ -239,11 +239,12 @@ function calcnops(ls::LoopSet, os::OperationStruct)
     offsets = ls.loopsymbol_offsets
     idxs = loopindex(ls, os.loopdeps, 0x04) # FIXME DRY
     iszero(length(idxs)) && return 1
-    return map(i->offsets[i+1]-offsets[i], idxs)
+    return maximum(i->offsets[i+1]-offsets[i], idxs)
 end
@chriselrod (Member) left a comment

For now, only loads and stores are allowed to have a mix of nops as inputs.
nops for loads and stores are also always 1, so we don't need a vector.

@chriselrod (Member) commented Apr 8, 2020

Tests passed locally (I re-enabled 5d). I set Travis to allow failures on master, because master currently seems to be broken.

I made a few comments discussing some of the changes.

I'll work on teaching it that constant-offset indices reduce the cost of loads if unrolling that loop (i.e., unrolling the i loop lets you reuse x).

@timholy (Contributor, Author) commented Apr 8, 2020

Wow, amazing work! Sorry I didn't bring this closer to completion; I was hoping it was a simple matter.

Certainly I'd approve merging this whenever you are satisfied with it. Anything else I can do to help?

@chriselrod (Member) left a comment

Thanks, I appreciate your contributions and getting this started! I'd also be happy to write up how different things work. Some of that could also help me figure out what it should look like after components get rewritten/refactored with internal APIs.

I merged because tests passed. Performance needs work, but I can handle that later.

Performance work mostly means adding support for recognizing constant integer offsets, and then using these in determinestrategy.jl to encourage it to unroll the loops that allow eliminating loads (i.e., the i loop here).
You're also more than welcome to take a stab at it. That would also lay the groundwork for recognizing loop-carried dependencies.

I looked at the expanded code but struggled to figure out which pieces come from what parts of this code

Anything in particular I can answer?

@chriselrod chriselrod merged commit 8657d6b into JuliaSIMD:master Apr 8, 2020
@timholy timholy deleted the teh/multiindices branch April 11, 2020 13:18
@timholy (Contributor, Author) commented Apr 11, 2020

Anything in particular I can answer?

I need to go back and figure out how operations work in greater detail. https://chriselrod.github.io/LoopVectorization.jl/latest/devdocs/loopset_structure/ was really helpful in getting going, but now I see I should go back to that page and really internalize all the stuff there. So mostly I think this is me just not doing my homework carefully enough.

Some things that might help:

I can take a stab at some of these; it would be good for me.
