faster iteration over subarrays (views) #48720
Conversation
We just need
Does SIMDing in
Closes #43295. Also, this introduces `@inbounds` into a view that may wrap any `AbstractArray`, including one with a mis-implementation of `length`. I don't think this is safe. Can you check whether LLVM can omit the bounds check automatically? Thank you for the contribution!
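To see why an unconditional `@inbounds` over an arbitrary `AbstractArray` is risky, consider a hypothetical wrapper that mis-implements `size` (the type and names below are made up for illustration; they are not part of the PR):

```julia
# Hypothetical array type whose `size` over-reports its length by one.
struct LyingVector <: AbstractVector{Int}
    data::Vector{Int}
end
Base.size(v::LyingVector) = (length(v.data) + 1,)  # wrong on purpose
Base.getindex(v::LyingVector, i::Int) = v.data[i]

lv = LyingVector([1, 2, 3])
length(lv)  # reports 4, although only 3 elements exist

# A loop like `@inbounds for i in eachindex(lv); s += lv[i]; end` could
# read past the underlying data without any bounds check; with a plain
# loop, the access `lv[4]` at least throws a BoundsError.
```

A view wrapping such an array would inherit the wrong `length`, so an unconditional `@inbounds` in `iterate` could read out of bounds.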
Note that I'm using
@N5N3 Are you sure? On master the timings for
are the same as those without
@jakobnissen It's only unsafe for a view into some
If I drop the
Note that for
in
One problem I see is that the step value would not be known at compile time. More importantly, are there SIMD instructions that can handle non-contiguous data?
Good point regarding safety, and indeed it looks like LLVM can't omit the bounds check. It's true that the step would not be known at compile time (except for contiguous arrays, where the step size is one). But you would still have the benefit of `@inbounds`. And it means you can potentially use a single implementation for contiguous and non-contiguous views with no runtime cost.
For that one needs to change
Then there is much smaller performance gain for
For
However, one could achieve the same by adding
I'm sure. None of

```julia
struct SafeArray{T,N,P<:AbstractArray{T,N}} <: AbstractArray{T,N}
    parent::P
end
@inline Base.size(x::SafeArray) = Base.size(x.parent)
@inline Base.getindex(x::SafeArray, is...) = @inbounds x.parent[is...]
```

Then you will get

```julia
julia> @btime f($a)
  244.975 ns (0 allocations: 0 bytes)
157838443398623711

julia> @btime f($v)
  1.480 μs (0 allocations: 0 bytes)
157838443398623711

julia> @btime f(SafeArray($v))
  249.086 ns (0 allocations: 0 bytes)
157838443398623711
```

In fact, you can also try this:

```julia
function f_c(v)
    eachindex(v) == eachindex(parent(v)) || throw("") # inbounds hint for LLVM
    foldl(+, v; init = zero(eltype(v)))
end
```

At least on master I get

```julia
julia> @btime f($v)
  1.880 μs (0 allocations: 0 bytes)
157838443398623711

julia> @btime f_c($v)
  247.880 ns (0 allocations: 0 bytes)
157838443398623711
```
@matthias314 When I use the function

And the benchmarks:
BTW, I don't think this PR could be finished without the added
Thanks a lot for your suggestions! Some of them seem to require end users to rewrite their code (which was not what I had in mind), but they are all very instructive. I agree that nothing can be done without using
I agree 100%, this is a good contribution and will really help. If you restrict the type signature to only views of
force-pushed from 6062a58 to 70d1d95
Here is a new version. It's now safe (that is, with bounds checks), but still as fast as iterating over the parent array in many cases: For a

In each case,

The reason that

With @tkf's proposal #40397, the iteration over all views above would be as fast as over their parent arrays, but I don't know how to achieve this while keeping
Nice result.

```julia
@inline function iterate(v::FastSubArray)
    # The cached `stride1` might lose the const info.
    st = compute_stride1(parent(v), v.indices)
    i = v.offset1 + st * firstindex(v)
    l = v.offset1 + st * lastindex(v) + st
    return iterate(v, (i, st, l))
end
@inline function iterate(v::FastSubArray, (i, st, l))
    flipsign(i, st) >= flipsign(l, st) && return nothing
    return v.parent[i], (i + st, st, l)
end
```

Just wondering why LLVM prefers this version, though. The lowered IR shows that LLVM keeps the bounds-check block in the non-SIMD branch correctly, and it also generates a SIMD branch.

Edit: The problem with 1d OffsetArray could be fixed with the added

```julia
checkindex(::Type{Bool}, ind::OneTo{T}, I::T) where {T<:BitSigned} = unsigned(I-1) < last(ind) # make wrapped Array happy
```

With that we have

```julia
n = 2^10; m = 2^5
v = rand(Int, n)
ov = OffsetVector(v, -n:-1)
oa = OffsetMatrix(rand(Int, m, m), -m:-1, -m:-1)
f(v) = foldl(+, v; init = zero(eltype(v)))
@btime f($v);                 # 65.000 ns (0 allocations: 0 bytes)
@btime f($(view(v, 1:n)));    # 67.041 ns (0 allocations: 0 bytes)
@btime f($(view(ov, -n:-1))); # 67.959 ns (0 allocations: 0 bytes)
@btime f($(view(oa, 1:n)));   # 65.816 ns (0 allocations: 0 bytes)
```
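An aside on the `flipsign` termination test in the `iterate` method above: flipping both the index and the bound by the sign of the step turns a descending loop into an ascending comparison, so a single test works for either direction. A small standalone check (the helper name `done` is mine, not from the PR):

```julia
# Termination test from the `iterate` method above, factored out:
# for st > 0 it reduces to `i >= l`, for st < 0 to `i <= l`.
done(i, st, l) = flipsign(i, st) >= flipsign(l, st)

# Ascending: indices 1, 2, 3 with step 1 stop at bound l = 4.
@assert !done(3, 1, 4)
@assert done(4, 1, 4)

# Descending: indices 3, 2, 1 with step -1 stop at bound l = 0.
@assert !done(1, -1, 0)
@assert done(0, -1, 0)
```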
Great! Compared to the current version of the PR, I get the following for the examples I posted above:

(Here I'm using the modified
The enlarged performance gap on

```julia
julia> v = rand(Int8, 1 + 2^16);

julia> @btime f($v);
  496.373 ns (0 allocations: 0 bytes)

julia> @btime f($(view(v, 1:n)));
  543.548 ns (0 allocations: 0 bytes)

julia> @btime f($(view(v, 1:n+1)));
  504.688 ns (0 allocations: 0 bytes)
```
Just a naming nit: C-style arrays are not `SafeArray`s, since C is not a safe language; they are `UnsafeArray` (or `UnsafeView`, which exists currently).
Updated with the latest benchmarks:
@nanosoldier

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.
The benchmark looks good, but some of the bench suite seems broken after this change.

```julia
function perf_sub2ind(sz, irange, jrange, krange)
    linear = LinearIndices(sz)
    s = 0
    for k in krange, j in jrange, i in irange
        ind = linear[i, j, k]
        s += ind
    end
    s
end
perf_sub2ind((1000,1000,1000), 1:1000, 1:1000, 1:1000)
```

The problem here is

```julia
function perf_mapr_access(A, B, zz, n) #20517
    @inbounds for j in 1:n
        B[j] = mapreduce(k -> A[j,k]*A[k,j], +, 1:j; init=zz)
    end
    B
end
```

The problem here is

```julia
julia> a = reinterpret(Base.OneTo{Int}, [-10])[]
Base.OneTo(-10)

julia> Base.checkindex(Bool, a, 2)
true
```
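The failure mode shown by the `reinterpret` example comes from the unsigned-comparison trick in the proposed `checkindex` method: it is only valid when `last(ind) >= 0`. A minimal sketch of the trick and its assumption (the helper name `inrange_unsigned` is mine, for illustration only):

```julia
# For n >= 0, `1 <= I <= n` can be tested with a single unsigned comparison:
# `I - 1` wraps around to a huge unsigned value whenever I <= 0.
inrange_unsigned(I::Int, n::Int) = unsigned(I - 1) < unsigned(n)

@assert inrange_unsigned(1, 10)    # lower edge accepted
@assert inrange_unsigned(10, 10)   # upper edge accepted
@assert !inrange_unsigned(0, 10)   # wraps to typemax, correctly rejected
@assert !inrange_unsigned(11, 10)

# But for an ill-formed range with n < 0 (like the reinterpreted
# `Base.OneTo(-10)` above), `unsigned(n)` is huge and the check
# wrongly accepts out-of-range indices:
@assert inrange_unsigned(2, -10)   # should be false
```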
Although we have already moved forward quite a bit, I would like to ask a basic question: The whole point of this endeavor is to make

Instead of addressing this for each type separately (

Another approach would be to use @N5N3's idea of

one can now replace an iterator
I think users are free to use

As for

But why not have it in Base? As far as I understand the discussion in #40397, the concern with

Then there's no difference with

In my opinion,
force-pushed from 78210bf to 7fa7e09
@nanosoldier

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.
The difference would only be in the way it's implemented. I still think that the cleanest way would be to use

I believe that with

The main advantage of both approaches compared to the one we are working on here is that it frees us from coming up with rather tricky code for each type that we want to speed up.
On second thought, what we need here might not be the

```julia
julia> f(v) = foldl(+, v; init = zero(eltype(v)))
f (generic function with 1 method)

julia> n = 2^10; v = rand(Int, n); ov = OffsetArray(v, -1-n); vv = view(v, 1:n); sv = reshape(vv, :, 2^5); vov = view(ov, :); vov2 = view(ov, -n:-1);

julia> for h in (v, ov, vv, sv, vov, vov2)
           println(typeof(h))
           @btime f($h)
       end
Vector{Int64}
  61.601 ns (0 allocations: 0 bytes)
OffsetVector{Int64, Vector{Int64}}
  59.980 ns (0 allocations: 0 bytes)
SubArray{Int64, 1, Vector{Int64}, Tuple{UnitRange{Int64}}, true}
  374.879 ns (0 allocations: 0 bytes)
Base.ReshapedArray{Int64, 2, SubArray{Int64, 1, Vector{Int64}, Tuple{UnitRange{Int64}}, true}, Tuple{}}
  471.939 ns (0 allocations: 0 bytes)
SubArray{Int64, 1, OffsetVector{Int64, Vector{Int64}}, Tuple{Base.Slice{OffsetArrays.IdOffsetRange{Int64, Base.OneTo{Int64}}}}, true}
  553.476 ns (0 allocations: 0 bytes)
SubArray{Int64, 1, OffsetVector{Int64, Vector{Int64}}, Tuple{UnitRange{Int64}}, true}
  373.430 ns (0 allocations: 0 bytes)

julia> Base.checkindex(::Type{Bool}, ind::Base.OneTo{T}, I::T) where {T<:Base.BitInteger} = unsigned(I - one(I)) < unsigned(last(ind))

julia> Base.checkindex(::Type{Bool}, ind::Base.IdentityUnitRange, I::Real) = checkindex(Bool, ind.indices, I) # needed for OffsetArray

julia> for h in (v, ov, vv, sv, vov, vov2)
           println(typeof(h))
           @btime f($h)
       end
Vector{Int64}
  61.546 ns (0 allocations: 0 bytes)
OffsetVector{Int64, Vector{Int64}}
  60.020 ns (0 allocations: 0 bytes)
SubArray{Int64, 1, Vector{Int64}, Tuple{UnitRange{Int64}}, true}
  63.646 ns (0 allocations: 0 bytes)
Base.ReshapedArray{Int64, 2, SubArray{Int64, 1, Vector{Int64}, Tuple{UnitRange{Int64}}, true}, Tuple{}}
  63.710 ns (0 allocations: 0 bytes)
SubArray{Int64, 1, OffsetVector{Int64, Vector{Int64}}, Tuple{Base.Slice{OffsetArrays.IdOffsetRange{Int64, Base.OneTo{Int64}}}}, true}
  68.641 ns (0 allocations: 0 bytes)
SubArray{Int64, 1, OffsetVector{Int64, Vector{Int64}}, Tuple{UnitRange{Int64}}, true}
  63.265 ns (0 allocations: 0 bytes)
```

I haven't checked more

Of course
force-pushed from 7fa7e09 to 370d946
the `checkindex` hack is based on the assumption that `last(ind) >= 0`

remove `iterate` specialization
force-pushed from 370d946 to bb84177
@nanosoldier

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.
This change simplifies the bounds check in the loop, as LLVM will hoist the constant subtraction. SIMD blocks are generated in more cases.

Co-authored-by: N5N3 <[email protected]>
(cherry picked from commit 1543cdd)
Iterating over a `SubArray` (`view`) of an integer-valued array is often much slower than iterating over the array itself:

My understanding is that the index transformation necessary for accessing the view prevents the compiler from optimizing the code via SIMD instructions. This PR addresses this with a dedicated `iterate` method for `FastContiguousSubArray`. This is the (already existing) subtype of `SubArray` for which the indices in the parent array are known to be contiguous and increasing; see subarrays.jl for details. For example, `view(a, :,:,2:3,4)` is of this type, but `view(a, 1,:,2:3,4)` and `view(a, 3:-1:2,1,1,1)` are not (although they are of type `FastSubArray`, meaning that they allow fast linear indexing into the parent array). It's hard to imagine SIMD speed-ups for other subarrays.

In the following benchmarks (made with `@btime f($v)` etc.), the function `f` from above is applied to the arrays and views into them:

```julia
f(v)
f(view(v,:))
f(u)
f(view(u,:))
f(a)
f(view(a, 2:3,1,1,1))
f(view(a, :,2:3,1,1))
f(view(a, :,:,2:3,1))
f(view(a, :,:,:,2:3))
```

One can see that with the new method, iterating over a view is as fast as iterating over the parent array.

EDIT: Using `@fastmath` one can obtain analogous speed-ups for floating-point arrays:
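For reference, a minimal sketch of what the `@fastmath` remark amounts to (my own reconstruction, not code from the PR): `@fastmath` rewrites `+` to `Base.FastMath.add_fast`, which permits the reassociation needed to vectorize a floating-point reduction.

```julia
f(v) = foldl(+, v; init = zero(eltype(v)))
g(v) = @fastmath foldl(+, v; init = zero(eltype(v)))  # `+` becomes add_fast

v = rand(Float64, 2^10)
# Both compute the same sum up to rounding; only `g` may use SIMD,
# because reassociating float additions changes the rounding slightly.
isapprox(f(v), g(v))
```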