WIP: support for CartesianIndices #66
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master      #66      +/-   ##
==========================================
+ Coverage   93.32%   93.42%   +0.10%
==========================================
  Files          27       27
  Lines        2501     2586      +85
==========================================
+ Hits         2334     2416      +82
- Misses        167      170       +3
Continue to review full report at Codecov.
src/reconstruct_loopset.jl (Outdated)
function Loop(ls::LoopSet, l::Int, k::Int, ::Type{<:CartesianIndices{N}}) where N
    start = gensym(:loopstart); stop = gensym(:loopstop)
    axisexpr = Expr(:ref, Expr(:., Expr(:ref, :lb, l), QuoteNode(:indices)), k)
    pushpreamble!(ls, Expr(:(=), start, Expr(:macrocall, Symbol("@inbounds"), LineNumberNode(@__LINE__, Symbol(@__FILE__)), Expr(:(.), axisexpr, QuoteNode(:start)))))
Currently, the package assumes arrays have been adjusted to 0-indexing (because it makes calculating the pointer offsets a little easier).
I.e., note the staticm1 (static minus 1) here: https://github.com/chriselrod/LoopVectorization.jl/blob/master/src/graphs.jl#L359
I think it would be simpler to instead leave things [whatever they originally were] indexed, and only subtract 1 while lowering here.
An alternative to handling it at lowering is to instead make stridedpointers use the offsets of the parent array types.
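The 0-indexing trade-off described above can be seen in a plain-Julia sketch. This is illustrative only, not the package's actual lowering: with 1-based indices every access pays a `- 1` when computing the linear offset, while pre-shifting the loop bounds ("static minus 1") moves that subtraction out of the hot loop.

```julia
# Illustrative sketch (not LoopVectorization's code): why 0-based indices
# simplify pointer-offset arithmetic. With 1-based indices, the linear
# offset of A[i, j] needs a `- 1` on every index; shifting the loop
# variables down by one performs that subtraction once, up front.
A = reshape(collect(1:12), 3, 4)

offset_1based(i, j, stride) = (i - 1) + (j - 1) * stride   # subtraction per access
offset_0based(i, j, stride) = i + j * stride               # indices pre-shifted

for j in 1:4, i in 1:3
    # shifting the loop variables once reproduces the 1-based offsets
    @assert offset_1based(i, j, 3) == offset_0based(i - 1, j - 1, 3)
    @assert A[i, j] == A[1 + offset_1based(i, j, 3)]
end
```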
What needs to happen is that the

for i in I, j in J, k in K, l in L
    # do something with A[i,k], B[i,j], and C[i,l]
end

Now let's say that

Meaning that if index_type == LoopIndex and

One fix would be to have the

index_vec = Symbol[]
while index_types != zero(UInt64)
    ind = indices % UInt8
    if index_types == LoopIndex
        for inda in 1+cumulative_offset[ind]:cumulative_offset[ind+1]
            pushfirst!(index_vec, ls.loopsymbols[inda])
        end
    else
        symind = if ...
        end
        pushfirst!(index_vec, symind)
    end
    index_types >>>= 8
    ...
end

The loop

This reverse order is because of the first in/last out approach to writing information to and then reading it from bits.

julia> @macroexpand @avx for i ∈ I, j ∈ J, k ∈ K, l ∈ L
s += A[i,k] * B[i,j] * C[i,l]
end
quote
var"##loopi#274" = LoopVectorization.maybestaticrange(I)
var"##i_loop_lower_bound#275" = LoopVectorization.staticm1(first(var"##loopi#274"))
var"##i_loop_upper_bound#276" = last(var"##loopi#274")
var"##loopj#277" = LoopVectorization.maybestaticrange(J)
var"##j_loop_lower_bound#278" = LoopVectorization.staticm1(first(var"##loopj#277"))
var"##j_loop_upper_bound#279" = last(var"##loopj#277")
var"##loopk#280" = LoopVectorization.maybestaticrange(K)
var"##k_loop_lower_bound#281" = LoopVectorization.staticm1(first(var"##loopk#280"))
var"##k_loop_upper_bound#282" = last(var"##loopk#280")
var"##loopl#283" = LoopVectorization.maybestaticrange(L)
var"##l_loop_lower_bound#284" = LoopVectorization.staticm1(first(var"##loopl#283"))
var"##l_loop_upper_bound#285" = last(var"##loopl#283")
var"##vptr##_A" = LoopVectorization.stridedpointer(A)
var"##vptr##_B" = LoopVectorization.stridedpointer(B)
var"##vptr##_C" = LoopVectorization.stridedpointer(C)
local var"##s_0"
begin
$(Expr(:gc_preserve, :(var"##s_0" = LoopVectorization._avx_!(Val{(0, 0)}(), Tuple{:LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000013, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x01, 0x01), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x02, 0x02), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000014, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x03, 0x03), :LoopVectorization, :vmul, LoopVectorization.OperationStruct(0x0000000000000124, 0x0000000000000000, 0x0000000000000000, 0x0000000000000203, LoopVectorization.compute, 0x00, 0x04), :LoopVectorization, Symbol("##254"), LoopVectorization.OperationStruct(0x0000000000000000, 0x0000000000000000, 0x0000000000001324, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x05), :LoopVectorization, :vfmadd_fast, LoopVectorization.OperationStruct(0x0000000000001324, 0x0000000000001324, 0x0000000000000000, 0x0000000000010405, LoopVectorization.compute, 0x00, 0x05)}, Tuple{LoopVectorization.ArrayRefStruct(0x0000000000000101, 0x0000000000000103, 0x0000000000007074), LoopVectorization.ArrayRefStruct(0x0000000000000101, 0x0000000000000102, 0xfffffffffffff07d), LoopVectorization.ArrayRefStruct(0x0000000000000101, 0x0000000000000104, 0xffffffffffffffff)}, Tuple{0, Tuple{6}, Tuple{5}, Tuple{}, Tuple{}, Tuple{}, Tuple{}}, (var"##i_loop_lower_bound#275":var"##i_loop_upper_bound#276", var"##j_loop_lower_bound#278":var"##j_loop_upper_bound#279", var"##k_loop_lower_bound#281":var"##k_loop_upper_bound#282", var"##l_loop_lower_bound#284":var"##l_loop_upper_bound#285"), var"##vptr##_A", var"##vptr##_B", var"##vptr##_C", s)), :A, :B, :C))
end
s = LoopVectorization.reduced_add(var"##s_0", s)
end

To save you from scrolling far enough to see the relevant part, it is:

Tuple{LoopVectorization.ArrayRefStruct(0x0000000000000101, 0x0000000000000103, 0x0000000000007074), LoopVectorization.ArrayRefStruct(0x0000000000000101, 0x0000000000000102, 0xfffffffffffff07d), LoopVectorization.ArrayRefStruct(0x0000000000000101, 0x0000000000000104, 0xffffffffffffffff)}

The third field looks like junk, but I'm not using it yet (it's intended to carry offsets relative to a base array, to make it aware of the CSE opportunities in a loop with indices

In our example where the second loop

The cost modelling assumes that all arrays are column major, so if an array happens to be row major (i.e., an adjoint or a transpose), it'll transpose it to make it column major and reverse the order of indices.

Does this make it clear how it works?

Another change that we may want to make is in the
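The first-in/last-out packing used for these fields can be illustrated with a small round trip. This is a simplified sketch (the real packed fields also encode index kinds, not just raw ids): values pushed into the low byte of a UInt64 come back out in reverse order, which is why the decoding loop restores the original order with pushfirst!.

```julia
# Simplified sketch of byte-wise packing into a UInt64: the last value
# pushed occupies the low byte, so decoding reads bytes in reverse and
# uses pushfirst! to recover the original order.
function pack(ids)
    x = zero(UInt64)
    for id in ids
        x = (x << 8) | (id % UInt8)   # each push shifts earlier ids up
    end
    x
end

function unpack(x::UInt64)
    ids = UInt8[]
    while x != zero(UInt64)
        pushfirst!(ids, x % UInt8)    # low byte was pushed last
        x >>>= 8
    end
    ids
end

@assert unpack(pack([0x01, 0x02, 0x04])) == [0x01, 0x02, 0x04]
```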
This looks very wrong to me, but by the power of multiple dispatch...

julia> first(CartesianIndices(A)):last(CartesianIndices(A)) == CartesianIndices(A)
true

It'll give us the correct answer. But it'll probably be preferable to pass the original

julia> CartesianIndices(A) |> typeof
CartesianIndices{2,Tuple{Base.OneTo{Int64},Base.OneTo{Int64}}}
julia> first(CartesianIndices(A)):last(CartesianIndices(A)) |> typeof
CartesianIndices{2,Tuple{UnitRange{Int64},UnitRange{Int64}}}

We could also fix it by defining a

I'm very appreciative of this work and really looking forward to this landing. It'll make a lot of my code in
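The type widening shown above is easy to verify in plain Julia: reconstructing a CartesianIndices from its first and last elements preserves the index values, but the statically-known Base.OneTo lower bound is lost to UnitRange.

```julia
# Check that first(R):last(R) round-trips the values of a CartesianIndices
# but widens the axis types from Base.OneTo to UnitRange.
A = rand(4, 5)
R = CartesianIndices(A)
R2 = first(R):last(R)   # colon on two CartesianIndex values builds a CartesianIndices

@assert R2 == R                                       # same index values
@assert typeof(R2) != typeof(R)                       # but a different type
@assert R.indices isa Tuple{Vararg{Base.OneTo{Int}}}  # static lower bound of 1
@assert R2.indices isa Tuple{Vararg{UnitRange{Int}}}  # lower bound now dynamic
```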
Wow, this looks amazing! I haven't looked it over carefully, but I will this weekend, and I'm very happy to dive in. If it starts becoming not worth your time feel free to just short-circuit me and do it yourself, but in the meantime I'm happy to try to push this over the finish line.
Great, I'm happy to have collaborators! Maybe I could work on some "developer docs" to describe how it works in more detail to would-be contributors.
Agreed. For me this comes up in grant applications, where having other contributors is convincing evidence that people actually care about the package. It also makes it more fun.
I had thought about the same thing. I was thinking of typing up notes as I learn and sort of blurting them out there; you can edit or suggest edits until they actually become true 😄.
Certainly.
You're more than welcome to. I've also created some initial dev docs; you're more than welcome to add any notes there as well. I'm more than happy to edit notes / answer questions / expand on something, etc. But if something comes up and you're no longer looking to finish, no worries, just let me know and I'll wrap it up.
Progress. I haven't yet addressed the issue of offset-by-1, basically because there's a prior sticking point: when I run

ERROR: MethodError: no method matching vload(::LoopVectorization.LoopValue, ::Tuple{Int64,VectorizationBase._MM{8,Int64}})
Closest candidates are:
vload(::LoopVectorization.LoopValue, ::Tuple{VectorizationBase._MM{W,I} where I<:Number}) where W at /home/tim/.julia/dev/LoopVectorization/src/vectorizationbase_extensions.jl:4
vload(::LoopVectorization.LoopValue, ::Tuple{VectorizationBase._MM{W,I} where I<:Number}, ::VectorizationBase.Mask) where W at /home/tim/.julia/dev/LoopVectorization/src/vectorizationbase_extensions.jl:6
vload(::LoopVectorization.LoopValue, ::Integer) at /home/tim/.julia/dev/LoopVectorization/src/vectorizationbase_extensions.jl:7
...
Stacktrace:
[1] macro expansion at /home/tim/.julia/dev/LoopVectorization/src/reconstruct_loopset.jl:295 [inlined]
[2] _avx_! at /home/tim/.julia/dev/LoopVectorization/src/reconstruct_loopset.jl:295 [inlined]
[3] macro expansion at ./gcutils.jl:91 [inlined]
[4] avxgeneric!(::OffsetArray{Float32,2,Array{Float32,2}}, ::Array{Float32,2}, ::OffsetArray{Float32,2,Array{Float32,2}}, ::CartesianIndices{2,Tuple{OffsetArrays.IdOffsetRange{Int64,Base.OneTo{Int64}},OffsetArrays.IdOffsetRange{Int64,Base.OneTo{Int64}}}}, ::Float32) at ./none:3
[5] avxgeneric!(::OffsetArray{Float32,2,Array{Float32,2}}, ::Array{Float32,2}, ::OffsetArray{Float32,2,Array{Float32,2}}) at ./none:2
[6] top-level scope at none:0

As far as I can tell this comes from here, where
This also allowed for correcting the offsets, e.g. if someone wrote

Because you're doing arithmetic on the loop iterables here (

tmp += A[I+J]*kern[J]

It introduces the

julia> @macroexpand @_avx for I in R
tmp = z
for J in Rk
tmp += A[I+J]*kern[J]
end
out[I] = tmp
end
quote
$(Expr(:meta, :inline))
begin
var"##loopI#286" = LoopVectorization.maybestaticrange(R)
var"##I_loop_lower_bound#287" = LoopVectorization.staticm1(first(var"##loopI#286"))
var"##I_loop_upper_bound#288" = last(var"##loopI#286")
var"##loopJ#289" = LoopVectorization.maybestaticrange(Rk)
var"##J_loop_lower_bound#290" = LoopVectorization.staticm1(first(var"##loopJ#289"))
var"##J_loop_upper_bound#291" = last(var"##loopJ#289")
var"##vptr##_A" = LoopVectorization.stridedpointer(A)
var"##LOOPSYMBOL##I" = LoopVectorization.LoopValue()
var"##vptr##_##LOOPSYMBOL##I" = LoopVectorization.stridedpointer(var"##LOOPSYMBOL##I")
var"##LOOPSYMBOL##J" = LoopVectorization.LoopValue()
var"##vptr##_##LOOPSYMBOL##J" = LoopVectorization.stridedpointer(var"##LOOPSYMBOL##J")
var"##vptr##_kern" = LoopVectorization.stridedpointer(kern)
var"##vptr##_out" = LoopVectorization.stridedpointer(out)
var"##T#285" = promote_type(eltype(A), eltype(kern), eltype(out))
var"##W#284" = LoopVectorization.pick_vector_width_val(eltype(A), eltype(kern), eltype(out))
var"##mask##" = LoopVectorization.masktable(var"##W#284", LoopVectorization.valrem(var"##W#284", var"##J_loop_upper_bound#291" - var"##J_loop_lower_bound#290"))
end
begin
$(Expr(:gc_preserve, :(begin
var"##outer##I##outer##" = LoopVectorization.unwrap(var"##I_loop_lower_bound#287")
while var"##outer##I##outer##" < var"##I_loop_upper_bound#288" - 3
J = LoopVectorization._MM(var"##W#284", var"##J_loop_lower_bound#290")
I = var"##outer##I##outer##"
var"######LOOPSYMBOL##I#293_0_" = LoopVectorization.vload(var"##vptr##_##LOOPSYMBOL##I", (I,))
var"####reduction#297_0_0" = LoopVectorization.vzero(var"##W#284", var"##T#285")
var"####reduction#297_0_1" = LoopVectorization.vzero(var"##W#284", var"##T#285")
var"####reduction#297_0_2" = LoopVectorization.vzero(var"##W#284", var"##T#285")
var"####reduction#297_0_3" = LoopVectorization.vzero(var"##W#284", var"##T#285")
I += 1
var"######LOOPSYMBOL##I#293_1_" = LoopVectorization.vload(var"##vptr##_##LOOPSYMBOL##I", (I,))
var"####reduction#297_1_0" = LoopVectorization.vzero(var"##W#284", var"##T#285")
var"####reduction#297_1_1" = LoopVectorization.vzero(var"##W#284", var"##T#285")
var"####reduction#297_1_2" = LoopVectorization.vzero(var"##W#284", var"##T#285")
var"####reduction#297_1_3" = LoopVectorization.vzero(var"##W#284", var"##T#285")
I += 1
var"######LOOPSYMBOL##I#293_2_" = LoopVectorization.vload(var"##vptr##_##LOOPSYMBOL##I", (I,))
var"####reduction#297_2_0" = LoopVectorization.vzero(var"##W#284", var"##T#285")
var"####reduction#297_2_1" = LoopVectorization.vzero(var"##W#284", var"##T#285")
var"####reduction#297_2_2" = LoopVectorization.vzero(var"##W#284", var"##T#285")
var"####reduction#297_2_3" = LoopVectorization.vzero(var"##W#284", var"##T#285")
I += 1
var"######LOOPSYMBOL##I#293_3_" = LoopVectorization.vload(var"##vptr##_##LOOPSYMBOL##I", (I,))
var"####reduction#297_3_0" = LoopVectorization.vzero(var"##W#284", var"##T#285")
var"####reduction#297_3_1" = LoopVectorization.vzero(var"##W#284", var"##T#285")
var"####reduction#297_3_2" = LoopVectorization.vzero(var"##W#284", var"##T#285")
var"####reduction#297_3_3" = LoopVectorization.vzero(var"##W#284", var"##T#285")
begin
while J < var"##J_loop_upper_bound#291" - LoopVectorization.valmuladd(var"##W#284", 4, -1)
var"######LOOPSYMBOL##J#294_0" = LoopVectorization.vload(var"##vptr##_##LOOPSYMBOL##J", (J,))
var"######LOOPSYMBOL##J#294_1" = LoopVectorization.vload(var"##vptr##_##LOOPSYMBOL##J", (J + LoopVectorization.valmul(var"##W#284", 1),))
var"######LOOPSYMBOL##J#294_2" = LoopVectorization.vload(var"##vptr##_##LOOPSYMBOL##J", (J + LoopVectorization.valmul(var"##W#284", 2),))
var"######LOOPSYMBOL##J#294_3" = LoopVectorization.vload(var"##vptr##_##LOOPSYMBOL##J", (J + LoopVectorization.valmul(var"##W#284", 3),))
var"####tempload#296_0" = LoopVectorization.vload(var"##vptr##_kern", (J,))
var"####tempload#296_1" = LoopVectorization.vload(var"##vptr##_kern", (J + LoopVectorization.valmul(var"##W#284", 1),))
var"####tempload#296_2" = LoopVectorization.vload(var"##vptr##_kern", (J + LoopVectorization.valmul(var"##W#284", 2),))
var"####tempload#296_3" = LoopVectorization.vload(var"##vptr##_kern", (J + LoopVectorization.valmul(var"##W#284", 3),))
I = var"##outer##I##outer##"
var"####indexpr#292_0_0" = var"######LOOPSYMBOL##I#293_0_" + var"######LOOPSYMBOL##J#294_0"
var"####indexpr#292_0_1" = var"######LOOPSYMBOL##I#293_0_" + var"######LOOPSYMBOL##J#294_1"
var"####indexpr#292_0_2" = var"######LOOPSYMBOL##I#293_0_" + var"######LOOPSYMBOL##J#294_2"
var"####indexpr#292_0_3" = var"######LOOPSYMBOL##I#293_0_" + var"######LOOPSYMBOL##J#294_3"
var"####tempload#295_0_0" = LoopVectorization.vload(var"##vptr##_A", (var"####indexpr#292_0_0" - 1,))
var"####tempload#295_0_1" = LoopVectorization.vload(var"##vptr##_A", (var"####indexpr#292_0_1" - 1,))
var"####tempload#295_0_2" = LoopVectorization.vload(var"##vptr##_A", (var"####indexpr#292_0_2" - 1,))
var"####tempload#295_0_3" = LoopVectorization.vload(var"##vptr##_A", (var"####indexpr#292_0_3" - 1,))
var"####reduction#297_0_0" = LoopVectorization.vfmadd(var"####tempload#295_0_0", var"####tempload#296_0", var"####reduction#297_0_0")
var"####reduction#297_0_1" = LoopVectorization.vfmadd(var"####tempload#295_0_1", var"####tempload#296_1", var"####reduction#297_0_1")
var"####reduction#297_0_2" = LoopVectorization.vfmadd(var"####tempload#295_0_2", var"####tempload#296_2", var"####reduction#297_0_2")
var"####reduction#297_0_3" = LoopVectorization.vfmadd(var"####tempload#295_0_3", var"####tempload#296_3", var"####reduction#297_0_3")
                            I += 1

I think it may be best to render them obsolete, by making the

The new dummy should lower simply to the loop indexes themselves. The idea would be that this new object would at least have

Thoughts? If you agree that it'll simplify things, I'll go ahead and make all these updates to VectorizationBase and SIMDPirates, and issue new releases.
Yes, that sounds like an excellent plan. IIUC only the first index would be

struct CartesianVIndex{N,W,K} <: Base.AbstractCartesianIndex{N}
vi::VectorizationBase._MM{W}
tail::CartesianIndex{K}
end

It's a pity we can't use
Unfortunately, this seems more complicated than I originally thought.
We can't rely on the

struct CartesianVIndex{N,W,K} <: Base.AbstractCartesianIndex{N}
i::CartesianIndex{N}
end

where

struct CartesianVIndex{N,T<:Tuple} <: Base.AbstractCartesianIndex{N}
i::T
end

We would need something closer to this last definition if we wanted to always use it for lowering (currently, it always uses tuples for lowering; I think switching to always using

We would need this flexibility, because it is possible for multiple indexes to be

For convenience, I'd avoid nesting

We'd also have to be mindful of how we update the ArrayReferenceMeta constructor, discussed in one of my first comments in this PR. Basically, if we go this route,

A[I] can be expanded to A[I_1,I_2,I_3] while A[I + R] will not be expanded, still using just the single index.

If we go this route, another change we should make is adding a mutable
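The tuple-based variant discussed above can be sketched as follows. This is a hypothetical illustration, not the package's API: `VecLane{W}` stands in for VectorizationBase._MM{W}, and the point is that a tuple-typed field lets any mix of scalar and vectorized positions live in one index.

```julia
# Hypothetical sketch of a tuple-based CartesianVIndex. `VecLane` is a
# stand-in (assumption) for VectorizationBase._MM: the start of W
# consecutive lane indices along one dimension.
struct VecLane{W}
    i::Int
end

struct CartesianVIndex{N,T<:Tuple} <: Base.AbstractCartesianIndex{N}
    i::T
end
CartesianVIndex(i::Tuple{Vararg{Any,N}}) where {N} = CartesianVIndex{N,typeof(i)}(i)

# dimension 1 vectorized, dimensions 2 and 3 scalar:
ci = CartesianVIndex((VecLane{8}(1), 3, 5))
@assert ci isa CartesianVIndex{3}
```

The tuple field is what buys the flexibility: a `CartesianIndex{N}` field could only hold integers, while `T<:Tuple` admits `VecLane` entries in any position.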
The problem faced by the un-transposing and reversing index order trick is that it would also have to reverse the order of

Another alternative we can consider is that instead of creating

This will still require updating determinestrategy to see that

julia> q = :(for i ∈ 1:100
s += a[i,i]
end);
julia> lsq = LoopVectorization.LoopSet(q);

julia> LoopVectorization.operations(lsq)
3-element Array{LoopVectorization.Operation,1}:
s = 0
var"##tempload#266" = a[i, i]
s = LoopVectorization.vadd(var"##tempload#266", s)
julia> LoopVectorization.unitstride(LoopVectorization.operations(lsq)[2], :i)
true

It incorrectly returns

julia> q = :(for i ∈ 1:100
s += a[i + 3]
end)
:(for i = 1:100
#= REPL[22]:2 =#
s += a[i + 3]
end)
julia> lsq = LoopVectorization.LoopSet(q);
julia> LoopVectorization.operations(lsq)
6-element Array{LoopVectorization.Operation,1}:
s = 0
var"####LOOPSYMBOL##i#270" = var"##LOOPSYMBOL##i"[i]
var"##loopconstnumber#271" = 0
var"##indexpr#269" = var"####LOOPSYMBOL##i#270" + var"##loopconstnumber#271"
var"##tempload#272" = a[var"##DISCONTIGUOUSSUBARRAY##", var"##indexpr#269"]
s = LoopVectorization.vadd(var"##tempload#272", s)
julia> LoopVectorization.unitstride(LoopVectorization.operations(lsq)[5], :i)
false

but returns

The new definition should look for

function unitstride(op, s, contigind = contiguousindex(op))
inds = getindices(op)
parentindoffset = ??
for i in eachindex(inds)
if op.ref.loopedindex[i] # then this index is a loop symbol
contigind -= 1
iszero(contigind) && continue # currently contiguous index always == 1
inds[i] == s && return false
else
parentindoffset += 1
parenttosearch = parents(op)[parentindoffset]
unitstride(parenttosearch, s, contigind) || return false
contigind -= numindices(parenttosearch)
end
end
true
end

If the vectorized index
unitstride(IpJ, s, 7)
unitstride(k, s, 3)
unitstride(LpM, s, 2)

The values of

Thoughts as to the approaches we should take? For now, I suspect the "expand to N operations" will be simpler.
I've released VectorizationBase v0.8, and SIMDPirates 0.7 should be tagged in a few minutes.

julia> A = rand(100,100);
julia> using VectorizationBase, SIMDPirates
[ Info: Precompiling SIMDPirates [21efa798-c60a-11e8-04d3-e1a92915a26a]
julia> vload(stridedpointer(A), (1,1))
0.742845271680725
julia> A[1:8,1:8]
8×8 Array{Float64,2}:
0.742845 0.757178 0.300299 0.603263 0.93428 0.831621 0.0556697 0.304235
0.321009 0.992542 0.691654 0.894312 0.322318 0.815803 0.173785 0.472578
0.839145 0.236637 0.863609 0.855714 0.830834 0.852366 0.817774 0.271009
0.497822 0.406762 0.448195 0.827784 0.391112 0.197293 0.985954 0.483548
0.406454 0.186679 0.630388 0.510679 0.0327818 0.106299 0.920731 0.360811
0.597366 0.422358 0.620349 0.642949 0.318277 0.0382444 0.807226 0.516668
0.70947 0.703065 0.599923 0.238758 0.742942 0.171481 0.307564 0.234002
0.309503 0.868046 0.444636 0.785198 0.802417 0.892564 0.378186 0.981091
julia> vload(stridedpointer(A), (_MM{8}(1),1))
SVec{8,Float64}<0.742845271680725, 0.32100884916103745, 0.8391447789719655, 0.4978220946937666, 0.4064540466312667, 0.5973661806262187, 0.709469624346265, 0.30950340386114306>
julia> vload(stridedpointer(A), (1,_MM{8}(1)))
SVec{8,Float64}<0.742845271680725, 0.7571776100487808, 0.30029898328718097, 0.6032634638402294, 0.9342799038628531, 0.8316213365756264, 0.055669703980538054, 0.30423537816346524>
julia> vload(stridedpointer(A), (_MM{8}(1),_MM{8}(1)))
SVec{8,Float64}<0.742845271680725, 0.9925421783730857, 0.8636089381258407, 0.8277841006116959, 0.0327818040145591, 0.03824440231241022, 0.30756379402750467, 0.9810911051207567>
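In plain-indexing terms (an illustration only; vload and _MM come from VectorizationBase and aren't used here), a `_MM{8}` in an index slot means "8 consecutive values along that dimension", so the three loads above correspond to a column, a row, and the diagonal:

```julia
# Plain-indexing picture of the three _MM{8} loads shown above.
A = reshape(collect(1.0:64.0), 8, 8)

col  = A[1:8, 1]               # like vload(ptr, (_MM{8}(1), 1))
row  = A[1, 1:8]               # like vload(ptr, (1, _MM{8}(1)))
diag = [A[i, i] for i in 1:8]  # like vload(ptr, (_MM{8}(1), _MM{8}(1)))

@assert col == collect(1.0:8.0)          # down the first column
@assert row == collect(1.0:8.0:57.0)     # along the first row
@assert diag == collect(1.0:9.0:64.0)    # the main diagonal
```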
Yeah, I realized that as I was going to bed last night.
Especially if we need multiple
Perhaps I can help here. I'm not sure I understand what you mean by "query

julia> nitems(::Tuple{Vararg{Any,N}}) where N = N
nitems (generic function with 1 method)
julia> nitems((1, "hello", 3.2))
3

is viable. There's a lot you can do on tuples without generated functions. For example, there's a possibility of replacing some of
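The kind of tuple-based, `@generated`-free computation meant here can look like the following (a generic illustration, not code from the package): lengths come from dispatch on `Tuple{Vararg{Any,N}}`, and per-element work can be written as recursion over `Base.tail`, which the compiler unrolls for small tuples.

```julia
# Compile-time tuple length via dispatch (as in the comment above), plus a
# recursive tuple fold -- neither needs a @generated function.
nitems(::Tuple{Vararg{Any,N}}) where {N} = N

sumtup(t::Tuple{}) = 0
sumtup(t::Tuple) = first(t) + sumtup(Base.tail(t))  # recursion, unrolled at compile time

@assert nitems((1, "hello", 3.2)) == 3
@assert sumtup((1, 2, 3, 4)) == 10
```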
That seems sensible. Mildly relatedly, in your

var"##loopconstnumber#271" = 0

rather than 3.
That's my impression too. I'm planning on fixing JuliaLang/julia#9080 that way.
Hope you didn't sacrifice anything substantial by doing this! (I checked the commits and they don't look bad, but...) It does seem easier to use the native indexing, thanks for doing this!
That is what I meant, thanks.
Regarding that issue, out of curiosity, how/where will you do the expansion? Seems like this would require expression manipulation?
I suspect that it will still have some compile-time benefits.
No problem. Those commits were small; I suspect much smaller than the amount of code we can now remove from this library.
I should fix the show method to at least not be terribly misleading:

function Base.show(io::IO, op::Operation)
if isconstant(op)
if op.instruction === LOOPCONSTANT
print(io, Expr(:(=), op.variable, 0))
else
print(io, Expr(:(=), op.variable, op.instruction.instr))
end

It always prints 0, regardless of what the actual value is.
Right now no, but that could be added. Now that I think about it, it might work to call
Probably right before this line. At that point the code has been lowered and type-inferred, but not inlined or subjected to a number of other optimization passes. When you replace it there are basically two steps:
I updated master to:
I'll run benchmarks later, if it doesn't look like I'm done fiddling.
Awesome! Good timing too; I wanted to see what that compiler pass would look like, but now that it's out there I can let that sit until a few more people have looked at it. So I might get back to this soon. |
Progress! It actually runs now. The only issue is returning the correct answer 😄. I still need to take a close look at the generated code. |
I think it looks good and I like your approach. But errors are a lot easier than silently wrong answers! Looking at the generated expression:

var"####op#855_0_0" = var"##n#845" + var"##n#851"
var"####op#855_0_1" = var"##n#845" + var"##n#851"
var"####op#855_0_2" = var"##n#845" + var"##n#851"
var"####op#855_0_3" = var"##n#845" + var"##n#851"
var"####op#856_0_0" = LoopVectorization.vload(var"##862", (var"####op#855_0_0",))
var"####op#856_0_1" = LoopVectorization.vload(var"##862", (var"####op#855_0_1",))
var"####op#856_0_2" = LoopVectorization.vload(var"##862", (var"####op#855_0_2",))
var"####op#856_0_3" = LoopVectorization.vload(var"##862", (var"####op#855_0_3",))
So what's missing is adding extra

It is correctly loading from

var"####op#857_0" = LoopVectorization.vload(var"##863", (var"##n#851", var"##n#848"))
var"####op#857_1" = LoopVectorization.vload(var"##863", (var"##n#851", var"##n#848" + 1))
var"####op#857_2" = LoopVectorization.vload(var"##863", (var"##n#851", var"##n#848" + 2))
var"####op#857_3" = LoopVectorization.vload(var"##863", (var"##n#851", var"##n#848" + 3))
More progress, but this is starting to feel a bit messy, so I thought I'd ask for help rather than go down a crazy rabbit hole. What I realized was that, just like I needed to add more loop symbols and then keep track of the offsets relative to the original definition, so too I need to add additional operations (to handle

Here's a replay of my latest test session. The key part to pay attention to is that it shows all the operations after creating them (and setting their parents).

julia> using LoopVectorization, OffsetArrays
[ Info: Precompiling LoopVectorization [bdcacae8-1622-11e9-2a5c-532679323890]
julia> function avxgeneric!(out, A, kern, R=CartesianIndices(out), z=zero(eltype(out)))
Rk = CartesianIndices(kern)
@avx for I in R
tmp = z
for J in Rk
tmp += A[I+J]*kern[J]
end
out[I] = tmp
end
out
end
avxgeneric! (generic function with 3 methods)
julia> T = Float32
Float32
julia> A = rand(T, 100, 100);
julia> kern = OffsetArray(rand(T, 3, 3), -1:1, -1:1);
julia> out1 = OffsetArray(similar(A, size(A).-2), 1, 1); # stay away from the edges of A
julia> avxgeneric!(out1, A, kern);
op = var"##op#447#1#" = z
op = var"##op#447#2#" = z
op = var"##op#448#1#" = var"##op#448#1#"
op = var"##op#448#2#" = var"##op#448#2#"
op = var"##op#449#1#" = var"##op#449#1#"
op = var"##op#449#2#" = var"##op#449#2#"
op = var"##op#450#1#" = var"##op#449#1#" + var"##op#448#1#"
op = var"##op#450#2#" = var"##op#449#2#" + var"##op#448#2#"
op = var"##op#451#1#" = A[var"##op#450"]
op = var"##op#451#2#" = A[var"##op#450"]
op = var"##op#452#1#" = kern[var"J#1#", var"J#2#"]
op = var"##op#452#2#" = kern[var"J#1#", var"J#2#"]
op = var"##op#453#1#" = var"##reductzero#434"
op = var"##op#453#2#" = var"##reductzero#434"
op = var"##op#453#1#" = LoopVectorization.vfmadd(var"##op#453#1#", var"##op#452#1#", var"##op#451#1#")
op = var"##op#453#2#" = LoopVectorization.vfmadd(var"##op#453#2#", var"##op#452#2#", var"##op#451#2#")
op = var"##op#447#1#" = LoopVectorization.reduced_add(var"##op#447#1#", var"##op#453#1#")
op = var"##op#447#2#" = LoopVectorization.reduced_add(var"##op#447#2#", var"##op#453#2#")
op = out[var"I#1#", var"I#2#"] = var"##op#447#1#"
op = out[var"I#1#", var"I#2#"] = var"##op#447#2#"
op = var"##op#467#1#" = z
op = var"##op#467#2#" = z
op = var"##op#468#1#" = var"##op#468#1#"
op = var"##op#468#2#" = var"##op#468#2#"
op = var"##op#469#1#" = var"##op#469#1#"
op = var"##op#469#2#" = var"##op#469#2#"
op = var"##op#470#1#" = var"##op#469#1#" + var"##op#468#1#"
op = var"##op#470#2#" = var"##op#469#2#" + var"##op#468#2#"
op = var"##op#471#1#" = A[var"##op#470"]
op = var"##op#471#2#" = A[var"##op#470"]
op = var"##op#472#1#" = kern[var"J#1#", var"J#2#"]
op = var"##op#472#2#" = kern[var"J#1#", var"J#2#"]
op = var"##op#473#1#" = var"##reductzero#434"
op = var"##op#473#2#" = var"##reductzero#434"
op = var"##op#473#1#" = LoopVectorization.vfmadd(var"##op#473#1#", var"##op#472#1#", var"##op#471#1#")
op = var"##op#473#2#" = LoopVectorization.vfmadd(var"##op#473#2#", var"##op#472#2#", var"##op#471#2#")
op = var"##op#467#1#" = LoopVectorization.reduced_add(var"##op#467#1#", var"##op#473#1#")
op = var"##op#467#2#" = LoopVectorization.reduced_add(var"##op#467#2#", var"##op#473#2#")
op = out[var"I#1#", var"I#2#"] = var"##op#467#1#"
op = out[var"I#1#", var"I#2#"] = var"##op#467#2#"
op = var"##op#487#1#" = z
op = var"##op#487#2#" = z
op = var"##op#488#1#" = var"##op#488#1#"
op = var"##op#488#2#" = var"##op#488#2#"
op = var"##op#489#1#" = var"##op#489#1#"
op = var"##op#489#2#" = var"##op#489#2#"
op = var"##op#490#1#" = var"##op#489#1#" + var"##op#488#1#"
op = var"##op#490#2#" = var"##op#489#2#" + var"##op#488#2#"
op = var"##op#491#1#" = A[var"##op#490"]
op = var"##op#491#2#" = A[var"##op#490"]
op = var"##op#492#1#" = kern[var"J#1#", var"J#2#"]
op = var"##op#492#2#" = kern[var"J#1#", var"J#2#"]
op = var"##op#493#1#" = var"##reductzero#434"
op = var"##op#493#2#" = var"##reductzero#434"
op = var"##op#493#1#" = LoopVectorization.vfmadd(var"##op#493#1#", var"##op#492#1#", var"##op#491#1#")
op = var"##op#493#2#" = LoopVectorization.vfmadd(var"##op#493#2#", var"##op#492#2#", var"##op#491#2#")
op = var"##op#487#1#" = LoopVectorization.reduced_add(var"##op#487#1#", var"##op#493#1#")
op = var"##op#487#2#" = LoopVectorization.reduced_add(var"##op#487#2#", var"##op#493#2#")
op = out[var"I#1#", var"I#2#"] = var"##op#487#1#"
op = out[var"I#1#", var"I#2#"] = var"##op#487#2#"
ERROR: MethodError: no method matching extract_external_functions!(::LoopVectorization.LoopSet, ::Nothing)
Closest candidates are:
extract_external_functions!(::LoopVectorization.LoopSet, ::Int64) at /home/tim/.julia/dev/LoopVectorization/src/reconstruct_loopset.jl:257
Stacktrace:
[1] avx_loopset(::Array{LoopVectorization.Instruction,1}, ::Array{LoopVectorization.OperationStruct,1}, ::Array{LoopVectorization.ArrayRefStruct,1}, ::Core.SimpleVector, ::Core.SimpleVector, ::Core.SimpleVector, ::NTuple{4,DataType}) at /home/tim/.julia/dev/LoopVectorization/src/reconstruct_loopset.jl:289
[2] _avx_loopset(::Core.SimpleVector, ::Core.SimpleVector, ::Core.SimpleVector, ::Core.SimpleVector, ::Core.SimpleVector, ::NTuple{4,DataType}) at /home/tim/.julia/dev/LoopVectorization/src/reconstruct_loopset.jl:308
[3] #s68#88(::Any, ::Any, ::Any, ::Any, ::Any, ::Any, ::Any, ::Type, ::Type, ::Type, ::Type, ::Any, ::Any, ::Any) at /home/tim/.julia/dev/LoopVectorization/src/reconstruct_loopset.jl:315
[4] (::Core.GeneratedFunctionStub)(::Any, ::Vararg{Any,N} where N) at ./boot.jl:524
[5] macro expansion at ./gcutils.jl:91 [inlined]
[6] avxgeneric!(::OffsetArray{Float32,2,Array{Float32,2}}, ::Array{Float32,2}, ::OffsetArray{Float32,2,Array{Float32,2}}, ::CartesianIndices{2,Tuple{OffsetArrays.IdOffsetRange{Int64,Base.OneTo{Int64}},OffsetArrays.IdOffsetRange{Int64,Base.OneTo{Int64}}}}, ::Float32) at ./REPL[2]:3
[7] avxgeneric!(::OffsetArray{Float32,2,Array{Float32,2}}, ::Array{Float32,2}, ::OffsetArray{Float32,2,Array{Float32,2}}) at ./REPL[2]:2
[8] top-level scope at REPL[7]:1

There are a couple of things that seem undesirable here:
Any suggestions about how to proceed?
I would make sure they're unnecessary for later steps. That is, you should be able to discard them as soon as

The new
I made a few changes so that tests pass locally.
…even when there is only a single loop (because that single loop could be over a CartesianIndex).
Love it---thanks so much for finishing this off, and sorry I couldn't help more!
In the test below, avxgeneric! is actually how I want to write this function so that it supports arbitrary dimensions. It fails because of a mismatch between expectations from the number of explicit loops and the number of implicit loops when you expand out the CartesianIndices. I started trying to fix the issues, but at a certain point it started feeling like whack-a-mole, and I wondered if you have any suggestions for how to make this better. (I am happy to do the work as a way to get to know this package better, though if you'd prefer to take this over that's fine with me.)
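For reference, the kernel under discussion can be written in plain Julia with no @avx. This Base-only variant (a stand-in, not the test's exact avxgeneric!) folds the kernel offset into the index arithmetic instead of using an OffsetArray-centered kernel, but the CartesianIndices iteration pattern is the same.

```julia
# Base-only stand-in for the avxgeneric! pattern: iterate output positions
# and kernel offsets with CartesianIndices, for any number of dimensions.
function genericcorr!(out, A, kern)
    Rk = CartesianIndices(kern)
    for I in CartesianIndices(out)
        tmp = zero(eltype(out))
        for J in Rk
            # shift so the kernel window starts at I (with OffsetArrays one
            # would center kern and write A[I+J] directly)
            tmp += A[I + J - oneunit(J)] * kern[J]
        end
        out[I] = tmp
    end
    out
end

A = reshape(collect(1.0:100.0), 10, 10)
kern = ones(3, 3)
out = similar(A, 8, 8)
genericcorr!(out, A, kern)
@assert out[1, 1] == sum(A[1:3, 1:3])   # each output is a 3x3 window sum
```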