
WIP: support mixed Int/CartesianIndex indexing #84

Merged: 5 commits merged into JuliaSIMD:master on Apr 8, 2020
Conversation

@timholy (Contributor) commented Apr 5, 2020

ImageFiltering contains optimizations for separable filters: if the filter size along each dimension is n[i], then per-pixel a fully-separable filter has cost O(n[1] + ... + n[d]), whereas a non-separable filter has cost O(n[1] * ... * n[d]). Consequently this can be a very large savings. This PR aims to support such operations by extending support for the exponential-filter demo in https://julialang.org/blog/2016/02/iteration/.
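To make the separability claim concrete, here is a minimal sketch of applying a separable 2-d filter as two 1-d passes, so the per-pixel cost is n[1] + n[2] rather than n[1] * n[2]. This is a hand-rolled illustration with replicate boundary handling, not the ImageFiltering implementation:

# Sketch only: separable 2-d filtering as two 1-d passes.
function filter_separable(x::AbstractMatrix, kcol::AbstractVector, krow::AbstractVector)
    T = float(eltype(x))
    tmp = similar(x, T)
    out = similar(x, T)
    m, n = size(x)
    hc, hr = length(kcol) ÷ 2, length(krow) ÷ 2
    for j in 1:n, i in 1:m              # pass 1: filter along dimension 1, O(n[1]) per pixel
        acc = zero(T)
        for (a, w) in enumerate(kcol)
            acc += w * x[clamp(i + a - 1 - hc, 1, m), j]   # replicate boundary
        end
        tmp[i, j] = acc
    end
    for j in 1:n, i in 1:m              # pass 2: filter along dimension 2, O(n[2]) per pixel
        acc = zero(T)
        for (b, w) in enumerate(krow)
            acc += w * tmp[i, clamp(j + b - 1 - hr, 1, n)]
        end
        out[i, j] = acc
    end
    return out
end

# e.g. filter_separable(rand(100, 100), [0.25, 0.5, 0.25], [0.25, 0.5, 0.25])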

This doesn't yet work, because I haven't yet compensated for what the commented-out @assert is supposed to catch. I thought I'd submit this now to discuss the right architecture changes. My presumption is that the easiest approach is to return Δidxs from calcnops and thus make nopsv a Vector{Vector{Int}}. But if you have a better suggestion I am all ears.

I'll leave a couple of comments in the changes.

Review threads (outdated, resolved): src/reconstruct_loopset.jl, src/split_loops.jl
@chriselrod (Member) commented Apr 6, 2020

There is something else missing for the exponential filter demo:

function expfiltdim(x, dim::Integer, α)
    s = similar(x)
    Rpre = CartesianIndices(size(x)[1:dim-1])
    Rpost = CartesianIndices(size(x)[dim+1:end])
    _expfilt!(s, x, α, Rpre, size(x, dim), Rpost)
end

function _expfilt!(s, x, α, Rpre, n, Rpost)
    for Ipost in Rpost
        # Initialize the first value along the filtered dimension
        for Ipre in Rpre
            s[Ipre, 1, Ipost] = x[Ipre, 1, Ipost]
        end
        # Handle all other entries
        for i = 2:n
            for Ipre in Rpre
                s[Ipre, i, Ipost] = α*x[Ipre, i, Ipost] + (1-α)*s[Ipre, i-1, Ipost]
            end
        end
    end
    s
end

There is a loop-carried dependency:

s[Ipre, i, Ipost] = α*x[Ipre, i, Ipost] + (1-α)*s[Ipre, i-1, Ipost]

That is, s[Ipre, i, Ipost] depends on the value computed in the previous iteration, s[Ipre, i-1, Ipost].

LoopVectorization doesn't yet support loop-carried dependencies, but they would be very nice to support, since they show up in lots of code and don't always prevent vectorization.

The simplest way to get them to work would be to prevent the loop with a loop-carried dependency (the i loop) from being vectorized and from being unrolled/tiled.
Disallowing vectorization reflects a fundamental limitation of SIMD: you can't vectorize cumsum(::AbstractVector{<:Number}).
Disallowing unrolling is due to an implementation detail of LoopVectorization: it interleaves operations while lowering them.

Onα = 1 - α

s_1 = s[Ipre, i-1, Ipost]
s_2 = s[Ipre, i  , Ipost]
x_1 = x[Ipre, i, Ipost]
x_2 = x[Ipre, i+1, Ipost]
αx_1 = α * x_1
αx_2 = α * x_2
Onαs_1 = Onα * s_1
Onαs_2 = Onα * s_2
s[Ipre, i, Ipost] = αx_1 + Onαs_1
s[Ipre, i+1, Ipost] = αx_2 + Onαs_2

The idea behind this interleaving is to better support super-scalar parallelism. The operations next to each other wouldn't depend on each other, so no complicated reordering is needed by the CPU's scheduler to execute them in parallel.

The problem is that, in this case, the above unrolling is wrong: s_2 should equal αx_1 + Onαs_1, not s[Ipre, i, Ipost].
I think limiting the unrolling of these loops to 1 is an okay solution. One of the advantages of unrolling is getting that increased parallelism, and we would get a little less of it even if this loop were lowered correctly:

Onα = 1 - α

s_1 = s[Ipre, i-1, Ipost]
x_1 = x[Ipre, i, Ipost]
Onαs_1 = Onα*s_1
αx_1 = α*x_1
s_2 = αx_1 + Onαs_1
s[Ipre, i  , Ipost] = s_2
x_2 = x[Ipre, i+1, Ipost]
αx_2 = α * x_2
Onαs_2 = Onα * s_2
s[Ipre, i+1, Ipost] = αx_2 + Onαs_2

Onα * s_2 has to wait on s_2 = αx_1 + Onαs_1 to complete before starting, but αx_2 could still be computed in parallel with earlier operations.
The biggest advantage from unrolling this loop would be eliminating the load for s_2. With 2x unrolling, it needs just 3 loads, not 4. Because the loads are probably slower than the arithmetic (2x * + 2x fma), that's likely to still help performance.

EDIT:
If you want, I can work on the loop-carried dependencies, but you're welcome to tackle that yourself as well.
Currently the offsets field of the ArrayReference type is broken and not used.
My plan would be to get it working, and use it for lowering loads and stores with offset indices.

Then, this would also supply a means of checking for loop-carried dependencies: if (offset of a store) > (offset of a load), there's a dependency.
A concern is that the offset is currently an Int8; [-128, 127] is a small range for offsets. The idea was that this would let a UInt64 hold the offsets of 8-dimensional indexes for the condense_loopset -> reconstruct_loopset transformation, which I figured would be enough for most applications. Perhaps I could use a UInt128 instead. Then it could handle 8 indices, with offsets in the range [-32768, 32767].
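For concreteness, a sketch of that kind of bit-packing; the names are illustrative, not the actual condense_loopset/reconstruct_loopset code:

# Illustration only: pack eight signed 8-bit offsets into one UInt64 and read one back.
# Widening each field to Int16 (packed into a UInt128) would give the [-32768, 32767] range.
function pack_offsets(offs::NTuple{8,Int8})
    u = zero(UInt64)
    for i in 8:-1:1
        u = (u << 8) | (reinterpret(UInt8, offs[i]) % UInt64)
    end
    return u
end
unpack_offset(u::UInt64, i::Integer) = reinterpret(Int8, (u >> (8 * (i - 1))) % UInt8)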

@chriselrod (Member) left a comment

My presumption is that the easiest approach is to return Δidxs from calcnops and thus make nopsv a Vector{Vector{Int}}

Yes, I think that is the easiest approach.

An alternative that would be more complicated (at least in the short term) would be to define a NopsDescription struct to make a cleaner API for querying the operations. That could make things simpler long term.
It could house Δidxs and expandedv.

Out of curiosity, what are your thoughts on explicit type signatures for all these functions? My thought was that using nopsv::Vector{Int} makes it easier to catch errors (including the method error on lower from the loop_split code), and ensures I look at each function taking nopsv as an argument.
A secondary thought is that I'd rather avoid compiling more than a single method of any of these functions.

@timholy (Contributor, Author) commented Apr 6, 2020

Thanks! Interesting about the loop-carried dependency; that makes sense. Despite this example, it is not an issue for most of ImageFiltering, only the IIRGaussian method, I think. The most common use of separable filters is FIR filtering.

what are your thoughts on explicit type signatures for all these functions

Interesting question. In general, if it's the case that you only expect particular input types, then adding checks essentially becomes a set of asserts that can catch bugs early. It's the type annotations that restrict what you can do with a function (e.g., x::Vector{Float64} when the method would work for any AbstractArray) that are annoying. I can't remember seeing anything I'd disagree with.
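A toy illustration of that distinction (hypothetical code, not from either package):

# Annotation as an assert: only one concrete type is ever expected, so anything else
# fails immediately with a MethodError instead of a confusing error deeper in the pipeline.
count_total_ops(nopsv::Vector{Int}) = sum(nopsv)

# Annotation that needlessly restricts: the algorithm works for any numeric array.
sumsq_strict(x::Vector{Float64}) = sum(abs2, x)
sumsq(x::AbstractArray{<:Number}) = sum(abs2, x)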

I will go about making that change, and re-jiggering the test to avoid the loop-carried dependency. Maybe I'll put one in as a @test_broken just so you can reap the reward when it does become supported.

@timholy (Contributor, Author) left a comment

This PR almost works now, but I may have gotten to the stage where I need a bit of help. As you'll see, it's a bit of a Frankensteinian approach: in its current form, it's essentially deferring the old @assert all(isequal(nops), Δidxs) to the last possible point-of-use. Doing it this way was easier for me in my current state of understanding, and to my surprise it ended up circumventing the need to deal with heterogeneous nops. (By effectively delaying the @assert it never ends up firing.) I am not sure how, or whether it's possible, to write a test that would force me to deal with heterogeneous nops.

I don't necessarily propose that we keep the implementation this way, just that this was the first successful approach I came up with.

# Initialize the first value along the filtered dimension
for Ipre in Rpre
    s[Ipre, ifirst, Ipost] = x[Ipre, ifirst, Ipost]
end
@timholy (Contributor, Author) left a comment

At least for d=1 in the test loop at the bottom of this file, the test failure is due to this loop failing to run for all of Rpost; it only runs for first(Rpost). (And I do mean Rpost rather than Rpre.) I am guessing this is a loop-ordering thing? I looked at the expanded code but struggled to figure out which pieces come from what parts of this code; if you get a chance, would you mind looking at that? For the 3d test below, test failures occur for d=1 and d=3, whereas it passes for d=2. It would probably be best to re-enable the 5d (or at least the 4d) case before merging, since you need d=5 to test the case of indexing with x[::CI{2}, ::Int, ::CI{2}] (CI=CartesianIndex).

@chriselrod (Member) left a comment

I think heterogeneous nops should either be allowed only for memload and memstore operations, or perhaps also for compute operations iff Set(Δidxs) == Set([1, n]) (i.e., broadcasting the nops == 1 entries to match the others).

I'd lean toward the former for now, for simplicity: maintain the old behavior except for indexing operations.
That means you could make the @assert immediate, but conditional on !accesses_memory(op).
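Roughly, as a sketch (reusing the old assert and the accesses_memory predicate; not necessarily how it will be written in the PR):

# Sketch: heterogeneous Δidxs are tolerated only for memory operations (loads/stores);
# all other operations keep the old homogeneous-nops requirement.
function check_nops(op, nops::Int, Δidxs::Vector{Int})
    accesses_memory(op) && return nothing
    @assert all(isequal(nops), Δidxs) "heterogeneous nops only allowed for loads/stores"
    return nothing
end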

I'll take a deeper look at the code in the next few hours and leave more comments.

@chriselrod (Member) commented Apr 8, 2020

I found a couple of problems. I'll push when I've fixed them and leave a few comments.
Two of the highlights:

  1. subset_vptr! is broken with CartesianIndices, because it used the number of loops to find the axis. It also gets called when constructing the initial LoopSet, not in reconstruct_loopset, meaning we can't supply it the offset vector. I'll either drop it and trust LLVM to take care of that optimization, or adjust the function call to do the right thing anyway.
  2. I'm using const NOpsType = Int, and the calcnops function returns maximum(i->offsets[i+1]-offsets[i], idxs).

@chriselrod (Member) commented Apr 8, 2020

The tile=(1,1) is unnecessary. I just added that while experimenting.
But I think @avx is being overly aggressive in unrolling this. I'll run benchmarks later.

LoopVectorization also doesn't support multiple loops at the same level in a loop nest yet.

for m in 1:M
    for n in 1:N
    end
    for n in 1:N
    end
end

Would be good to add eventually.

@timholy (Contributor, Author) left a comment

But it's nice that I don't have to think about cache-friendliness when writing out those loops!

end
nothing
end
function subset_vptr!(ls::LoopSet, vptr::Symbol, indnum::Int, ind, previndices, loopindex)
@chriselrod (Member) commented Apr 8, 2020

For now, I decided it would be easier to update the old subsetview function. This function is used for constant slices, like indexing by ifirst. It gets called very early, to avoid storing or representing that there was an index there at all.
This probably caused the failures you observed, because it assumed the preceding indices corresponded to one axis each. That assumption was violated when d == 1 and d == 3, where the first index corresponded to 0 and 2 axes instead.

It now calls valdims on each of the preceding loops instead.
I currently defined valdims to return Val(1) for everything except CartesianIndices{N}, where it returns Val(N) instead.

With this change, it now slices the correct axis.
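A minimal sketch of what such a helper could look like (the definition in the PR may differ in detail):

# Sketch: number of array axes covered by one loop/index, returned as a Val
# so it is available at compile time.
valdims(::CartesianIndices{N}) where {N} = Val(N)
valdims(::Any) = Val(1)

# valdims(CartesianIndices((1:3, 1:4)))  # Val{2}()
# valdims(1:10)                          # Val{1}()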

@@ -327,7 +337,8 @@ function setup_call_noinline(ls::LoopSet, U = zero(Int8), T = zero(Int8))
         push!(q.args, ex)
     end
     insert!(call.args, 8, Expr(:call, Expr(:curly, :Val, vptrarrays)))
-    insert!(call.args, 9, Expr(:call, Expr(:curly, :Val, vptrsubsetdims)))
+    # insert!(call.args, 9, Expr(:call, Expr(:curly, :Val, vptrsubsetdims)))
+    insert!(call.args, 9, vptrsubsetdims)
@chriselrod (Member) left a comment

This function parsed Val{N}() expressions to extract the N and create an expression for Val{tuple_of_Ns}(). This broke as soon as the Val{N}() expressions were replaced with calls that return Vals.
So now, instead of parsing expressions, it makes them arguments to a function that will return the desired answer.

The purpose of that song and dance is so that __avx__! is not inlined, and calls _avx_!, which is inlined.
Within the __avx__! call, it'll subset the arrays, get pointers, etc., so that LLVM still has this aliasing / relative-pointer-offset information available. This reduced the performance disparity between setting inline=false and inline=true (the latter is the default).
This is not a good solution, but a hack added on later when I decided to add this.
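The general shape of that pattern, as a generic sketch (names and body are illustrative, not the actual __avx__!/_avx_! definitions):

# Sketch of the outer-@noinline / inner-@inline pattern: the outer function does per-call
# setup (array subsetting, pointer extraction) so the aliasing / offset information stays
# local to one non-inlined frame, while the hot loop is still inlined into that frame.
@noinline function outer_kernel!(dest::AbstractVector, src::AbstractVector)
    # (per-call setup such as subsetting / pointer extraction would go here)
    inner_kernel!(dest, src)
end
@inline function inner_kernel!(dest::AbstractVector, src::AbstractVector)
    @inbounds for i in eachindex(dest, src)
        dest[i] = 2 * src[i]
    end
    return dest
end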

@codecov-io commented Apr 8, 2020

Codecov Report

Merging #84 into master will decrease coverage by 0.47%.
The diff coverage is 78.00%.


@@            Coverage Diff             @@
##           master      #84      +/-   ##
==========================================
- Coverage   91.44%   90.97%   -0.48%     
==========================================
  Files          28       28              
  Lines        2841     2892      +51     
==========================================
+ Hits         2598     2631      +33     
- Misses        243      261      +18     
Impacted Files Coverage Δ
src/lower_store.jl 87.59% <ø> (ø)
src/lower_load.jl 85.50% <69.69%> (-14.50%) ⬇️
src/memory_ops_common.jl 86.74% <70.83%> (-4.03%) ⬇️
src/add_ifelse.jl 93.84% <85.71%> (-2.83%) ⬇️
src/condense_loopset.jl 93.28% <85.71%> (-0.25%) ⬇️
src/reconstruct_loopset.jl 91.41% <100.00%> (-0.08%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ca31ba1...a311e1a

s[Ipre, i, Ipost] = α*x[Ipre, i, Ipost] + (1-α)*x[Ipre, i-1, Ipost]
xi = x[Ipre, i, Ipost]
xim = ifelse(i == ifirst, xi, x[Ipre, i-1, Ipost])
s[Ipre, i, Ipost] = α*xi + (1-α)*xim
end
end
@chriselrod (Member) left a comment

Not as pretty, but another way to write the function.

I think this may risk crashing because of the read of x[Ipre, ifirst-1, Ipost], even though that value won't be used, so I'm not inclined to recommend this until I add explicit support for conditional loads, like the conditional-store support it has now.
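One hypothetical way to avoid the out-of-bounds read without conditional-load support would be to clamp the speculative index so the load stays in bounds (sketch only, and not what this PR does):

# Hypothetical: the i-1 load is clamped to ifirst, so when i == ifirst we load an
# in-bounds value that the ifelse then discards.
xi  = x[Ipre, i, Ipost]
xim = ifelse(i == ifirst, xi, x[Ipre, max(i - 1, ifirst), Ipost])
s[Ipre, i, Ipost] = α*xi + (1-α)*xim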

@chriselrod (Member) left a comment

This did cause crashes on Travis. I'll work on turning this into a conditional load.

Rpost = CartesianIndices(axes(x)[d+1:end])
# @show d
Rpre = CartesianIndices(axes(x)[1:d-1]);
Rpost = CartesianIndices(axes(x)[d+1:end]);
@chriselrod (Member) left a comment

The semicolons are because I don't want to be bombarded by output when running these interactively.

@@ -239,11 +239,12 @@ function calcnops(ls::LoopSet, os::OperationStruct)
     offsets = ls.loopsymbol_offsets
     idxs = loopindex(ls, os.loopdeps, 0x04) # FIXME DRY
     iszero(length(idxs)) && return 1
-    return map(i->offsets[i+1]-offsets[i], idxs)
+    return maximum(i->offsets[i+1]-offsets[i], idxs)
 end
@chriselrod (Member) left a comment

For now, only loads and stores are allowed to have a mix of nops as inputs.
nops for loads and stores are also always 1, so we don't need a vector.

@chriselrod (Member) commented Apr 8, 2020

Tests passed locally (I re-enabled 5d). I set Travis to allow failures on master, because master currently seems to be broken.

I made a few comments discussing some of the changes.

I'll work on teaching it that constant-offset indices reduce the cost of loads if unrolling that loop (i.e., unrolling the i loop lets you reuse x).

@timholy (Contributor, Author) commented Apr 8, 2020

Wow, amazing work! Sorry I didn't bring this closer to completion; I was hoping it was a simple matter.

Certainly I'd approve merging this whenever you are satisfied with it. Anything else I can do to help?

@chriselrod (Member) left a comment

Thanks, I appreciate your contributions and getting this started! I'd also be happy to write up how different things work. Some of that could also help me figure out what it should look like after components get rewritten/refactored with internal APIs.

I merged because tests passed. Performance needs work, but I can handle that later.

Performance work mostly means adding support for recognizing constant integer offsets, and then using these in determinestrategy.jl to encourage it to unroll the loops that allow eliminating loads (i.e., the i loop here).
You're also more than welcome to take a stab at it. That would also lay the groundwork for recognizing loop-carried dependencies.

I looked at the expanded code but struggled to figure out which pieces come from what parts of this code

Anything in particular I can answer?

@chriselrod chriselrod merged commit 8657d6b into JuliaSIMD:master Apr 8, 2020
@timholy timholy deleted the teh/multiindices branch April 11, 2020 13:18
@timholy (Contributor, Author) commented Apr 11, 2020

Anything in particular I can answer?

I need to go back and figure out how operations work in greater detail. https://chriselrod.github.io/LoopVectorization.jl/latest/devdocs/loopset_structure/ was really helpful in getting going, but now I see I should go back to that page and really internalize all the stuff there. So mostly I think this is me just not doing my homework carefully enough.

Some things that might help:

I can take a stab at some of these; it would be good for me.
