WIP: support mixed Int/CartesianIndex indexing #84
Conversation
There is something else missing for the exponential filter demo:

```julia
function expfiltdim(x, dim::Integer, α)
    s = similar(x)
    # indices before / after the filtered dimension
    Rpre  = CartesianIndices(size(x)[1:dim-1])
    Rpost = CartesianIndices(size(x)[dim+1:end])
    _expfilt!(s, x, α, Rpre, size(x, dim), Rpost)
end

function _expfilt!(s, x, α, Rpre, n, Rpost)
    for Ipost in Rpost
        # Initialize the first value along the filtered dimension
        for Ipre in Rpre
            s[Ipre, 1, Ipost] = x[Ipre, 1, Ipost]
        end
        # Handle all other entries
        for i = 2:n
            for Ipre in Rpre
                s[Ipre, i, Ipost] = α*x[Ipre, i, Ipost] + (1-α)*s[Ipre, i-1, Ipost]
            end
        end
    end
    s
end
```

There is a loop-carried dependency:

```julia
s[Ipre, i, Ipost] = α*x[Ipre, i, Ipost] + (1-α)*s[Ipre, i-1, Ipost]
```

That is, the value stored at `i` depends on the value stored at `i-1`. LoopVectorization doesn't yet support loop-carried dependencies, but it's something that would be very nice to support, since they show up in lots of code, and don't always prevent vectorization. The simplest way to get them to work would be to prevent the loop with the loop-carried dependency (the `i` loop here) from being unrolled. As it stands, unrolling that loop by 2 would interleave the operations like this:

```julia
Onα = 1 - α
s_1 = s[Ipre, i-1, Ipost]
s_2 = s[Ipre, i  , Ipost]
x_1 = x[Ipre, i, Ipost]
x_2 = x[Ipre, i+1, Ipost]
αx_1 = α * x_1
αx_2 = α * x_2
Onαs_1 = Onα * s_1
Onαs_2 = Onα * s_2
s[Ipre, i, Ipost] = αx_1 + Onαs_1
s[Ipre, i+1, Ipost] = αx_2 + Onαs_2
```

The idea behind this interleaving is to better support super-scalar parallelism. The operations next to each other wouldn't depend on each other, so no complicated reordering is needed by the CPU's scheduler to execute them in parallel. The problem is that in this case, the above unrolling is wrong: `s_2` must be the value just stored at `i`, not the stale value loaded before that store. The correct ordering would be:

```julia
Onα = 1 - α
s_1 = s[Ipre, i-1, Ipost]
x_1 = x[Ipre, i, Ipost]
Onαs_1 = Onα*s_1
αx_1 = α*x_1
s_2 = αx_1 + Onαs_1
s[Ipre, i  , Ipost] = s_2
x_2 = x[Ipre, i+1, Ipost]
αx_2 = α * x_2
Onαs_2 = Onα * s_2
s[Ipre, i+1, Ipost] = αx_2 + Onαs_2
```

EDIT: This would also supply a means of checking for loop-carried dependencies: if the (offset of a store) is greater than the (offset of a load) for the same array along the same loop, there's a dependency.
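To make that offset rule concrete, here is a minimal illustration (my own, not code from the PR) applied to the recurrence above:

```julia
# Illustration only: offsets of the `s` accesses along the `i` loop in
# s[Ipre, i, Ipost] = α*x[Ipre, i, Ipost] + (1-α)*s[Ipre, i-1, Ipost]
store_offset = 0                                   # s[..., i,   ...] is written
load_offset  = -1                                  # s[..., i-1, ...] is read
has_loop_carried_dep = store_offset > load_offset  # true => loop-carried dependency
```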
Yes, I think that is the easiest approach. An alternative that would be more complicated (at least in the short term) would be to define a …

Out of curiosity, what are your thoughts on explicit type signatures for all these functions? My thought was that using …
Thanks! Interesting about the loop-carried dependency, makes sense. Despite this example it is not an issue for most of ImageFiltering, only the …

Interesting question. In general, if it's the case that you only expect particular input types, then adding checks essentially becomes a set of asserts that can catch bugs early. It's the type annotations that restrict what you can do with a function (e.g., …).

I will go about making that change, and re-jiggering the test to avoid the loop-carried dependency. Maybe I'll put one in as a …
ImageFiltering contains optimizations for separable filters: if the size along each dimension is `n[i]`, then per-pixel a fully-separable filter has cost `O(n[1] + ... + n[d])`, whereas a non-separable filter has cost `O(n[1] * ... * n[d])`. Consequently this can be a very large savings. This PR aims to support such operations by extending support for the exponential-filter demo in https://julialang.org/blog/2016/02/iteration/.
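As an aside for readers, a minimal sketch (mine, not ImageFiltering's implementation) of why separability pays off: a 3×3 box blur written as two 3-tap 1-D passes touches roughly 3 + 3 inputs per pixel instead of 3 × 3.

```julia
# Hypothetical illustration of a separable 3×3 box blur as two 1-D passes.
function box3_separable(x::AbstractMatrix{<:Real})
    T   = float(eltype(x))
    tmp = zeros(T, size(x))
    out = zeros(T, size(x))
    # pass along dimension 1: 3 reads per pixel
    for j in axes(x, 2), i in firstindex(x, 1)+1:lastindex(x, 1)-1
        tmp[i, j] = (x[i-1, j] + x[i, j] + x[i+1, j]) / 3
    end
    # pass along dimension 2: 3 more reads per pixel (vs 9 for the direct 2-D stencil)
    for j in firstindex(x, 2)+1:lastindex(x, 2)-1, i in axes(x, 1)
        out[i, j] = (tmp[i, j-1] + tmp[i, j] + tmp[i, j+1]) / 3
    end
    return out
end
```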
This PR almost works now, but I may have gotten to the stage where I need a bit of help. As you'll see, it's a bit of a Frankensteinian approach: in its current form, it essentially defers the old `@assert all(isequal(nops), Δidxs)` to the last possible point of use. Doing it this way was easier for me in my current state of understanding, and to my surprise it ended up circumventing the need to deal with heterogeneous `nops`. (By effectively delaying the `@assert`, it never ends up firing.) I am not sure how, or whether it's possible, to write a test that would force me to deal with heterogeneous `nops`.

I don't necessarily propose that we keep the implementation this way, just that this was the first successful approach I came up with.
test/miscellaneous.jl
```julia
# Initialize the first value along the filtered dimension
for Ipre in Rpre
    s[Ipre, ifirst, Ipost] = x[Ipre, ifirst, Ipost]
end
```
At least for `d=1` in the test-loop at the bottom of this file, the test failure is due to this loop failing to run for all of `Rpost`; it only runs for `first(Rpost)`. (And I do mean `Rpost` rather than `Rpre`.) I am guessing this is a loop-ordering thing? I looked at the expanded code but struggled to figure out which pieces come from what parts of this code; if you get a chance, mind looking at that? For the 3d test below, test failures occur for `d=1` and `d=3`, whereas it passes on `d=2`. Would probably be best to re-enable the 5d (or at least 4d) case before merging, since you need `d=5` to test the case of indexing with `x[::CI{2}, ::Int, ::CI{2}]` (`CI = CartesianIndex`).
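For readers following along, a small illustration (mine, not from the test file) of the indexing pattern that only the `d = 5` case exercises:

```julia
# Base Julia already supports mixed Int/CartesianIndex indexing; the PR is about
# letting loops that use it work with LoopVectorization. With CI = CartesianIndex:
x = rand(2, 3, 4, 5, 6)
Ipre  = CartesianIndex(2, 3)            # ::CI{2}, covers dims 1-2
Ipost = CartesianIndex(4, 5)            # ::CI{2}, covers dims 4-5
@assert x[Ipre, 1, Ipost] == x[2, 3, 1, 4, 5]
```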
I think heterogeneous `nops` should either only be allowed for `memload` and `memstore` operations, or perhaps also allowed for `compute` iff `Set(Δidxs) == Set([1,n])` (i.e., broadcasting some `nops == 1`s to match the others).

I'd lean toward the former now, for simplicity. That means: maintain the old behavior, except for indexing operations. You could then make the `@assert` immediate, but conditional on `!accesses_memory(op)`.

I'll take a deeper look at the code in the next few hours and leave more comments.
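A minimal sketch of that last suggestion, assuming the `accesses_memory` helper mentioned above and that `Δidxs` holds the per-index counts from `calcnops` (the surrounding plumbing is not shown in this thread and is my assumption):

```julia
# Hedged sketch, not the PR's code: run the uniformity assert immediately,
# but skip it for operations that access memory (loads/stores), which are
# the only operations allowed to mix per-index counts.
function check_nops(op, Δidxs)
    accesses_memory(op) && return
    @assert all(isequal(first(Δidxs)), Δidxs)
    return
end
```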
I found a couple problems. I'll push when I've fixed them and leave a few comments. Two of the highlights:

- `subset_vptr!` is broken with CartesianIndices, because it used the number of loops to find the axis. It also gets called when constructing the initial `LoopSet`, not in `reconstruct_loopset`, meaning we can't supply it the offset vector. I'll either drop it and trust LLVM to take care of that optimization, or adjust the function call to do the right thing anyway.
- I'm using `const NOpsType = Int`, and the `calcnops` function returns `maximum(i->offsets[i+1]-offsets[i], idxs)`.
The `tile=(1,1)` is unnecessary; I just added that while experimenting. But I think `@avx` is being overly aggressive in unrolling this. I'll run benchmarks later.

LoopVectorization also doesn't support multiple loops at the same level in a loop nest yet, e.g.:

```julia
for m in 1:M
    for n in 1:N
    end
    for n in 1:N
    end
end
```

Would be good to add eventually.
But it's nice that I don't have to think about cache-friendliness when writing out those loops!
```julia
    end
    nothing
end
function subset_vptr!(ls::LoopSet, vptr::Symbol, indnum::Int, ind, previndices, loopindex)
```
For now, I decided it would be easier to update the old `subsetview` function. This function is used for constant slices, like by `ifirst`. It gets called very early, to avoid storing or representing that there was an index at all.

This probably caused the failures you observed, because it assumed the preceding indices corresponded to one axis each. That assumption was violated when `d == 1` and `d == 3`, where the first index corresponded to 0 and 2 axes, instead.

It now calls `valdims` on each of the preceding loops instead. I currently defined `valdims` to return `Val(1)` for everything except `CartesianIndices{N}`, where it returns `Val(N)` instead. With this change, it now slices the correct axis.
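A minimal sketch of the `valdims` idea as just described (the actual definition in the PR may live elsewhere and accept different argument types):

```julia
# One Val dimension per loop index, except CartesianIndices, which spans N axes.
valdims(::Any) = Val(1)
valdims(::CartesianIndices{N}) where {N} = Val(N)

# e.g. valdims(CartesianIndices((1:3, 1:4))) === Val(2)
```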
```diff
@@ -327,7 +337,8 @@ function setup_call_noinline(ls::LoopSet, U = zero(Int8), T = zero(Int8))
         push!(q.args, ex)
     end
     insert!(call.args, 8, Expr(:call, Expr(:curly, :Val, vptrarrays)))
-    insert!(call.args, 9, Expr(:call, Expr(:curly, :Val, vptrsubsetdims)))
+    # insert!(call.args, 9, Expr(:call, Expr(:curly, :Val, vptrsubsetdims)))
+    insert!(call.args, 9, vptrsubsetdims)
```
This function parsed `Val{N}()` expressions to extract the `N` and create an expression for `Val{tuple_of_Ns}()`. That broke as soon as the `Val{N}()` expressions were replaced with calls that return `Val`s. So now, instead of parsing expressions, it makes them arguments to a function that will return the desired answer.

The purpose of that song and dance is so that `__avx__!` is not inlined, and calls `_avx_!`, which is inlined. Within the `__avx__!` call, it'll subset the arrays, get pointers, etc., so that LLVM still has this aliasing / relative pointer-offset information available. This reduced the performance disparity between setting `inline=false` and `inline=true` (the latter is the default).

This is not a good solution, but a hack added on later when I decided to add this.
Codecov Report

```
@@            Coverage Diff             @@
##           master      #84      +/-   ##
==========================================
- Coverage   91.44%   90.97%   -0.48%
==========================================
  Files          28       28
  Lines        2841     2892      +51
==========================================
+ Hits         2598     2631      +33
- Misses        243      261      +18
```

Continue to review full report at Codecov.
test/miscellaneous.jl
```diff
-                s[Ipre, i, Ipost] = α*x[Ipre, i, Ipost] + (1-α)*x[Ipre, i-1, Ipost]
+                xi = x[Ipre, i, Ipost]
+                xim = ifelse(i == ifirst, xi, x[Ipre, i-1, Ipost])
+                s[Ipre, i, Ipost] = α*xi + (1-α)*xim
             end
         end
```
Not as pretty, but another way to write the function. I think this may risk crashing because of the read to `x[Ipre, ifirst-1, Ipost]`, even though that value won't be used, so I'm not inclined to recommend this until I add explicit support for conditional loads, like the conditional-store support it has now.
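For context (my note, not from the thread): Julia's `ifelse` evaluates both value arguments before selecting one, which is why the rewrite above still performs the `i-1` read on the first iteration; a short-circuiting branch would skip it, which is roughly what explicit conditional (masked) load support would provide. A self-contained toy demonstration:

```julia
# ifelse evaluates both arguments, so the i-1 access happens even when unused:
get_prev(v, i, ifirst)        = ifelse(i == ifirst, v[i], v[i-1])
# a branch only performs the access when it is actually needed:
get_prev_branch(v, i, ifirst) = i == ifirst ? v[i] : v[i-1]

v = [1.0, 2.0, 3.0]
get_prev_branch(v, 1, 1)   # fine: never touches v[0]
# get_prev(v, 1, 1)        # would throw BoundsError: v[0] is evaluated, then discarded
```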
This did cause crashes on Travis. I'll work on turning this into a conditional load.
```diff
-    Rpost = CartesianIndices(axes(x)[d+1:end])
+    # @show d
+    Rpre = CartesianIndices(axes(x)[1:d-1]);
+    Rpost = CartesianIndices(axes(x)[d+1:end]);
```
The semicolons are because I don't want to be bombarded by output when running these interactively.
```diff
@@ -239,11 +239,12 @@ function calcnops(ls::LoopSet, os::OperationStruct)
     offsets = ls.loopsymbol_offsets
     idxs = loopindex(ls, os.loopdeps, 0x04) # FIXME DRY
     iszero(length(idxs)) && return 1
-    return map(i->offsets[i+1]-offsets[i], idxs)
+    return maximum(i->offsets[i+1]-offsets[i], idxs)
 end
```
For now, only loads and stores are allowed to have a mix of `nops` as inputs. `nops` for loads and stores are also always `1`, so we don't need a vector.
Tests passed locally (I re-enabled 5d). I set Travis to allow failures on master, because master currently seems to be broken. I made a few comments discussing some of the changes. I'll work on teaching it that constant-offset indices reduce the cost of loads if unrolling that loop (i.e., unrolling the …
Wow, amazing work! Sorry I didn't bring this closer to completion; I was hoping it was a simple matter. Certainly I'd approve merging this whenever you are satisfied with it. Anything else I can do to help?
Thanks, I appreciate your contributions and getting this started! I'd also be happy to write up how different things work. Some of that could also help me figure out what it should look like after components get rewritten/refactored with internal APIs. I merged because tests passed. Performance needs work, but I can handle that later. Performance work mostly means adding support for recognizing constant integer offsets, and then using these in …

Anything in particular I can answer?
I need to go back and figure out how operations work in greater detail. https://chriselrod.github.io/LoopVectorization.jl/latest/devdocs/loopset_structure/ was really helpful in getting going, but now I see I should go back to that page and really internalize all the stuff there. So mostly I think this is me just not doing my homework carefully enough. Some things that might help: …

I can take a stab at some of these; it would be good for me.
ImageFiltering contains optimizations for separable filters: if the filter size along each dimension is `n[i]`, then per-pixel a fully-separable filter has cost `O(n[1] + ... + n[d])`, whereas a non-separable filter has cost `O(n[1] * ... * n[d])`. Consequently this can be a very large savings. This PR aims to support such operations by extending support for the exponential-filter demo in https://julialang.org/blog/2016/02/iteration/.

This doesn't yet work, because I haven't yet compensated for what the commented-out `@assert` is supposed to catch. I thought I'd submit this now to discuss the right architecture changes. My presumption is that the easiest approach is to return `Δidxs` from `calcnops` and thus make `nopsv` a `Vector{Vector{Int}}`. But if you have a better suggestion I am all ears.

I'll leave a couple of comments in the changes.