
faster iteration over subarrays (views) #48720

Merged
merged 4 commits into JuliaLang:master from matthias314:m3/iterate-subarray on Feb 24, 2023

Conversation

@matthias314 (Contributor) commented Feb 18, 2023

Iterating over a SubArray (view) of an integer-valued array is often much slower than iterating over the array itself:

f(v) = foldl(+, v; init = zero(eltype(v)))

a = rand(Int8, 2^16);
v = view(a, :);

julia> @btime f($a)
  570.860 ns (0 allocations: 0 bytes)
julia> @btime f($v)
  34.613 μs (0 allocations: 0 bytes)

a = rand(Int, 2^16);
v = view(a, :);

julia> @btime f($a)
  7.957 μs (0 allocations: 0 bytes)
julia> @btime f($v)
  34.741 μs (0 allocations: 0 bytes)

My understanding is that the index transformation necessary for accessing the view prevents the compiler from optimizing the code via SIMD instructions. This PR addresses this with a dedicated iterate method for FastContiguousSubArray. This is the (already existing) subtype of SubArray for which the indices into the parent array are known to be contiguous and increasing; see base/subarray.jl for details. For example, view(a, :,:,2:3,4) is of this type, but view(a, 1,:,2:3,4) and view(a, 3:-1:2,1,1,1) are not (although they are of type FastSubArray, meaning that they allow fast linear indexing into the parent array). It's hard to imagine SIMD speed-ups for other subarrays.
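For reference, the core of the idea looks roughly like this (a minimal sketch; the method actually added by the PR may differ in its details):

# Sketch: iterate a FastContiguousSubArray via a linear index into the parent.
# offset1 is a field of SubArray; for a contiguous view, the parent's linear
# index is simply offset1 + i. The @inbounds is what enables SIMD (and is also
# the safety concern discussed below).
function Base.iterate(v::Base.FastContiguousSubArray, i::Int = firstindex(v))
    i > lastindex(v) && return nothing
    return @inbounds(v.parent[v.offset1 + i]), i + 1
end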

In the following benchmarks (made with @btime f($v) etc.), the function f from above is applied to the arrays

v = rand(Int8, 2^16);
u = codeunits(String(rand('a':'z', 2^16)));
a = rand(Int8, 32,32,32,32)

and views into them.

function call          | master     | this PR
f(v)                   | 567.227 ns |
f(view(v, :))          | 34.615 μs  | 563.778 ns
f(u)                   | 576.681 ns |
f(view(u, :))          | 34.613 μs  | 581.473 ns
f(a)                   | 15.847 μs  |
f(view(a, 2:3,1,1,1))  | 3.545 ns   | 4.084 ns
f(view(a, :,2:3,1,1))  | 36.684 ns  | 10.300 ns
f(view(a, :,:,2:3,1))  | 1.091 μs   | 25.071 ns
f(view(a, :,:,:,2:3))  | 34.610 μs  | 572.445 ns

One can see that with the new method, iterating over a view is as fast as iterating over the parent array.

EDIT: Using @fastmath one can obtain analogous speed-ups for floating-point arrays:

g(v) = @fastmath foldl(+, v; init = zero(eltype(v)))

a = rand(Float64, 2^16);
v = view(a, :);

julia> @btime g($a)
  7.963 μs (0 allocations: 0 bytes)

julia> @btime g($v)
  69.192 μs (0 allocations: 0 bytes)   # Julia 1.9.0-beta2
  7.967 μs (0 allocations: 0 bytes)    # this PR
Julia Version 1.10.0-DEV.635
Commit 12d329b6e3 (2023-02-17 20:01 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 1 on 4 virtual cores

@N5N3 (Member) commented Feb 19, 2023

We just need @inbounds to make SIMD work in this case.
LLVM is clever enough to optimize the index transformation away, but it's hard for it to prove that the accesses are in bounds, since a SubArray stores its own indices.

@Seelengrab (Contributor)

Does SIMDing in foldl preserve the foldl order of operations? I'm not sure we want to use @fastmath by default, since it allows reassociation and can lead to very surprising differences in results.
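A quick illustration of that concern (my example, not from the benchmarks above): floating-point addition is not associative, so the reassociation permitted by @fastmath can change the result.

x = [1e16, 1.0, -1e16]
foldl(+, x)           # left-to-right: (1e16 + 1.0) + -1e16 == 0.0,
                      # because 1e16 + 1.0 rounds back to 1e16
(x[1] + x[3]) + x[2]  # reassociated order: 1.0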

@jakobnissen (Contributor) commented Feb 19, 2023

Closes #43295.
Could this be generalized to views with integer StepRange indices, so that it also works for @view a[1:2:5] (i.e. a FastSubArray that is not necessarily contiguous)?

Also, this introduces inbounds into a view that may wrap any AbstractArray, including one with a mis-implementation of length. I don't think this is safe. Can you check if LLVM can omit the boundscheck automatically?

Thank you for the contribution!

@dkarrasch added the performance, arrays, and iteration labels Feb 19, 2023
@matthias314 (Contributor, Author)

Note that I'm using foldl just as an example of an iteration.

We just need @inbounds to make SIMD work in this case.

@N5N3 Are you sure? On master the timings for

f(v) = @inbounds foldl(+, v; init = zero(eltype(v)))

are the same as those without @inbounds.

Also, this introduces inbounds into a view that may wrap any AbstractArray, including one with a mis-implementation of length. I don't think this is safe.

@jakobnissen It's only unsafe for a view into some T <: AbstractArray if getindex for T is defined with @propagate_inbounds. I would argue that one should only use @propagate_inbounds if one is sure that the implementation is correct.
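For illustration, a hypothetical wrapper type: with @propagate_inbounds, a caller's @inbounds elides the method's own @boundscheck block.

struct MyVec{T} <: AbstractVector{T}
    data::Vector{T}
end
Base.size(v::MyVec) = size(v.data)
Base.@propagate_inbounds function Base.getindex(v::MyVec, i::Int)
    @boundscheck checkbounds(v, i)  # elided when the caller uses @inbounds
    return v.data[i]
end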

Can you check if LLVM can omit the boundscheck automatically?

If I drop the @inbounds in iterate, I get the following timings:

n = 2^16

v = rand(Int8, n);
w = view(v, :);

julia> @btime f($v)
  568.296 ns (0 allocations: 0 bytes)
julia> @btime f($w)
  21.947 μs (0 allocations: 0 bytes)

s = String(rand('a':'z', n));
u = codeunits(s);
v = view(u, :);

julia> @btime f($u)
  562.832 ns (0 allocations: 0 bytes)
julia> @btime f($v)
  662.285 ns (0 allocations: 0 bytes)

Note that for CodeUnits they are very similar, but not for Vector. For Vector I can still see

movabs	rax, offset ijl_bounds_error_ints

in @code_native f(w).

Could this be generalized to views with integer StepRange indices, so that it also works for @view a[1:2:5] (i.e. a FastSubArray that is not necessarily contiguous)?

One problem I see is that the step value would not be known at compile time. More importantly, are there SIMD instructions that can handle non-contiguous data?

@jakobnissen (Contributor)

Good point regarding safety, and indeed it looks like LLVM can't omit the boundscheck.

It's true that the step would not be known at compile time (except for contiguous views, where the step is one). But you would still have the benefit of inbounds. And it means you can potentially use a single implementation for contiguous and non-contiguous views with no runtime cost.

@matthias314 (Contributor, Author)

you can potentially use a single implementation for contiguous and non-contiguous views with no runtime cost

For that, one needs to change iterate to something like

function Base.iterate(v::Base.FastSubArray,
        state = (v.offset1+v.stride1*firstindex(v), v.offset1+v.stride1*lastindex(v)))
    i, l = state
    @inbounds i-1 < l ? (v.parent[i], (i+v.stride1, l)) : nothing
end

Then the performance gain for FastContiguousSubArray is much smaller:

n = 2^16; v = rand(Int8, n); w = view(v, :);

julia> @btime f($v)
  565.796 ns (0 allocations: 0 bytes)
julia> @btime f($w)
  34.617 μs (0 allocations: 0 bytes)   # master
  21.928 μs (0 allocations: 0 bytes)   # with iterate as above

For FastSubArray, iteration becomes as fast as mapfoldl in this case:

r = 2:2:length(v); w = view(v, r);   # v as above

julia> @btime mapfoldl(i -> @inbounds($v[i]), +, $r; init = zero(eltype($v)))
  10.963 μs (0 allocations: 0 bytes)
julia> @btime f($w)
  190.195 μs (0 allocations: 0 bytes)   # master
  10.980 μs (0 allocations: 0 bytes)    # with iterate as above

However, one could achieve the same by adding @inbounds to the iterate method for AbstractArray, which is the one currently used for SubArray. (I'm omitting code and benchmark here.) That idea was discussed in #40397 and not implemented in the end. In light of the above performance difference, one might have wished for a different outcome.

@N5N3 (Member) commented Feb 20, 2023

are the same as those without @inbounds.

I'm sure. Neither foldl nor iterate propagates @inbounds, so @inbounds foldl won't omit the bounds check.
To make this work, you need:

struct SafeArray{T,N,P<:AbstractArray{T,N}} <: AbstractArray{T,N}
    parent::P
end
@inline Base.size(x::SafeArray) = Base.size(x.parent)
# forward every access to the parent with @inbounds, dropping all bounds checks
@inline Base.getindex(x::SafeArray, is...) = @inbounds x.parent[is...]

Then you will get

julia> @btime f($a)
  244.975 ns (0 allocations: 0 bytes)
157838443398623711

julia> @btime f($v)
  1.480 μs (0 allocations: 0 bytes)
157838443398623711

julia> @btime f(SafeArray($v))
  249.086 ns (0 allocations: 0 bytes)
157838443398623711

In fact, you can also try this

function f_c(v)
    eachindex(v) == eachindex(parent(v)) || throw("") # in-bounds hint for LLVM
    foldl(+, v; init = zero(eltype(v)))
end

At least on master I get

julia> @btime f($v)
  1.880 μs (0 allocations: 0 bytes)
157838443398623711

julia> @btime f_c($v)
  247.880 ns (0 allocations: 0 bytes)
157838443398623711

@jakobnissen (Contributor)

@matthias314 When I use the function stride(v, 1) instead of accessing v.stride1 as a field, it constant-folds for me, and I get maximal performance for contiguous arrays while also being as fast for non-contiguous arrays. Here is the function:

function Base.iterate(v::Base.FastSubArray,
    state = (v.offset1+stride(v,1)*firstindex(v), v.offset1+stride(v,1)*lastindex(v)))
    i, l = state
    @inbounds i-1 < l ? (v.parent[i], (i+stride(v,1), l)) : nothing
end

And the benchmarks:

f(v) = foldl(+, v; init = zero(eltype(v)))
a = rand(Int8, 2^16);
v = view(a, :);

julia> @btime f($a);
  485.785 ns (0 allocations: 0 bytes)

julia> @btime f($v);
  489.887 ns (0 allocations: 0 bytes)

@N5N3 (Member) commented Feb 20, 2023

stride1 of a FastSubArray is relative to the parent's linear indices, while stride(_, 1) refers to the memory layout. They can differ, and strides might be undefined for a FastSubArray whose parent has no strides defined.

BTW, I don't think this PR can be finished without the added @inbounds.
As #40397 was closed with "this is too unsafe", I'm not sure there is a way to make this PR "safer".
Perhaps we can limit the method signature to StridedArray, which should cover most of the SIMD-able cases, but the concern from #40397 (comment) remains unresolved.

@matthias314 (Contributor, Author) commented Feb 20, 2023

Thanks a lot for your suggestions! Some of them seem to require end users to rewrite their code (which is not what I had in mind), but they are all very instructive.

I agree that nothing can be done without using @inbounds, and that may not fly in general. However, I'm reluctant to give up completely. For elements of type Int8, we are talking about potential speed-up factors of 60 (FastContiguousSubArray) or 17 (FastSubArray, and I believe SubArray in general). That seems too much to let go of. At least for views into Array and maybe other standard types, one should do something. What do you think?

@jakobnissen (Contributor)

I agree 100%, this is a good contribution and will really help. If you restrict the type signature to views of Array with either an AbstractUnitRange or a StepRange, then most real-life cases will be covered.

@matthias314 (Contributor, Author) commented Feb 21, 2023

Here is a new version. It's now safe (that is, with bounds checks) but still as fast as iterating over the parent array in many cases: for a FastContiguousSubArray this holds if it is known to be 1-based; for any other FastSubArray a it holds if additionally stride(a,1) can be determined at compile time (@jakobnissen's idea). Here are some benchmarks:

n = 2^16; m = 2^8

v = rand(Int8, n)
u = codeunits(String(rand('a':'z', n)))
ov = OffsetVector(v, -n:-1)
oa = OffsetMatrix(rand(Int8, m, m), -m:-1, -m:-1)

f(v) = foldl(+, v; init = zero(eltype(v)))

In each case, f is applied to the given array.

array            | without view | view (master) | view (this PR)
view(v, :)       | 573.613 ns   | 34.615 μs     | 571.708 ns
@view v[2:2:end] | 10.962 μs    | 190.205 μs    | 10.980 μs
view(u, :)       | 577.000 ns   | 27.047 μs     | 567.232 ns
view(ov, :)      | 573.751 ns   | 40.696 μs     | 34.628 μs
view(oa, :, :)   | 578.016 ns   | 34.615 μs     | 588.478 ns

The reason that OffsetVector is so much slower than a higher-dimensional OffsetArray is that in the higher-dimensional case linear indexing is always 1-based, while for a vector it uses the given (offset) indices. Also note that without a separate method for FastContiguousSubArray, the runtime for a higher-dimensional OffsetArray oa would degrade, since stride(oa,1) cannot be determined ahead of time. (OffsetArray is not a subtype of StridedArray.)
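This difference can be seen directly (an illustration, assuming OffsetArrays is loaded):

using OffsetArrays
eachindex(IndexLinear(), OffsetVector(rand(3), -3:-1))          # offset indices -3:-1
eachindex(IndexLinear(), OffsetMatrix(rand(2,2), -2:-1, -2:-1)) # Base.OneTo(4)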

With @tkf's proposal #40397 the iteration over all views above would be as fast as over their parent arrays, but I don't know how to achieve this while keeping iterate safe.

@N5N3 (Member) commented Feb 22, 2023

Nice result.
I also made some trials locally, and the following version works well:

@inline function iterate(v::FastSubArray)
    # The cached `stride1` might lose the const info.
    st = compute_stride1(parent(v), v.indices)
    i = v.offset1 + st * firstindex(v)
    l = v.offset1 + st * lastindex(v) + st
    return iterate(v, (i, st, l))
end

@inline function iterate(v::FastSubArray, (i, st, l))
    flipsign(i, st) >= flipsign(l, st) && return nothing
    return v.parent[i], (i + st, st, l)
end

Just wondering why LLVM prefers this version, though. Compared with the AbstractArray fallback, it just forwards the boundscheck to the parent.

The lowered IR shows that LLVM correctly keeps the boundscheck block in the non-SIMD branch, and it also generates a SIMD branch.
If others can confirm the benefit, I think we can switch to this version and run BaseBenchmarks.

Edit: The problem with 1d OffsetArray could be fixed with the following checkindex hack:

checkindex(::Type{Bool}, ind::OneTo{T}, I::T) where {T<:BitSigned} = unsigned(I-1) < last(ind) # make wrapped Array happy
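The trick: for a signed index I, the two comparisons 1 <= I and I <= last(ind) collapse into one unsigned comparison, because any I < 1 wraps around to a huge unsigned value (this assumes last(ind) >= 0, which holds for any well-formed OneTo):

unsigned(0 - 1)       # typemax(UInt), so I == 0 fails the single check
unsigned(5 - 1) < 10  # true: valid indices 1:10 map to 0:9 under I - 1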

With that we have

n = 2^10; m = 2^5
v = rand(Int, n)
ov = OffsetVector(v, -n:-1)
oa = OffsetMatrix(rand(Int, m, m), -m:-1, -m:-1)
f(v) = foldl(+, v; init = zero(eltype(v)))
@btime f($v);                   #65.000 ns (0 allocations: 0 bytes)
@btime f($(view(v, 1:n)));      #67.041 ns (0 allocations: 0 bytes)
@btime f($(view(ov, -n:-1)));   #67.959 ns (0 allocations: 0 bytes)
@btime f($(view(oa, 1:n)));     #65.816 ns (0 allocations: 0 bytes)

@matthias314 (Contributor, Author)

Great! Compared to the current version of the PR, I get the following for the examples I posted above:

  • view(v, :), view(u, :), view(oa,:,:): minor (5%-10%) slowdown
  • @view v[2:2:end]: 50% slowdown
  • @view v[end:-1:1]: 30% speedup (35.761 μs compared to 51.903 μs; mapfoldl(i -> @inbounds(v[i]), +, length(v):-1:1; init = zero(eltype(v))): 21.906 μs)
  • view(ov,:): dramatic speedup, only 15% slower than iterating over ov

(Here I'm using the modified checkindex.)

@N5N3 (Member) commented Feb 22, 2023

The enlarged performance gap for Int8 is not that surprising to me, as the unvectorized tail can be much longer than for Int64 (127 vs 15 elements).
Since my version can't omit the boundscheck in the non-SIMD branch, the impact is not negligible.
On my PC, though, the gap is much smaller. (Perhaps your PC supports AVX-512?)

julia> v = rand(Int8, 1 + 2^16);

julia> @btime f($v);
  496.373 ns (0 allocations: 0 bytes)

julia> @btime f($(view(v, 1:n)));
  543.548 ns (0 allocations: 0 bytes)

julia> @btime f($(view(v, 1:n+1)));
  504.688 ns (0 allocations: 0 bytes)

@vtjnash (Member) commented Feb 22, 2023

Just a naming nit: C-style arrays are not SafeArray, since C is not a safe language; they are UnsafeArray (or UnsafeView, which already exists).

@N5N3 (Member) commented Feb 22, 2023

Update on the latest benchmark:

  1. ov = OffsetArray(v, -n-1);
  2. oa = OffsetArray(reshape(v, isqrt(n), :), 0, 0);
  3. each entry is the slowdown ratio relative to UnsafeArray

array           | n = 2^16; v = rand(Int8, n) | n = 2^10; v = rand(Int, n)
v               | 99.8%                       | 99.4%
view(v, 1:n)    | 110%                        | 102%
view(v, 2:2:n)  | 130%                        | 125%
view(v, n:-1:1) | 130%                        | 127%
view(ov, -n:-1) | 108%                        | 99%
view(oa, :, :)  | 109%                        | 101%

@N5N3 (Member) commented Feb 22, 2023

@nanosoldier runbenchmarks("array", vs=":master")

@nanosoldier (Collaborator)

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

@N5N3 (Member) commented Feb 23, 2023

The benchmark looks good, but parts of the benchmark suite seem broken after this change.

  1. ["array", "index", "sub2ind"]
function perf_sub2ind(sz, irange, jrange, krange)
    linear = LinearIndices(sz)
    s = 0
    for k in krange, j in jrange, i in irange
        ind = linear[i, j, k]
        s += ind
    end
    s
end
perf_sub2ind((1000,1000,1000), 1:1000, 1:1000, 1:1000)

The problem here is that perf_sub2ind gets vectorized after the checkindex change, but I'm not sure whether this is wanted.

  1. ["array", "index", ("mapr_access", "BaseBenchmarks.ArrayBenchmarks.ArrayLF{Float32, 2}")]
function perf_mapr_access(A, B, zz, n) #20517
    @inbounds for j in 1:n
        B[j] = mapreduce(k -> A[j,k]*A[k,j], +, 1:j; init=zz)
    end
    B
end

The problem here is that k -> A[j,k]*A[k,j] is not inlined by default.
We can fix this with unsigned(I-1) < unsigned(last(ind)), but that still increases the inline cost (from 3 to 4 for OneTo{Int}),
and the result might be incorrect for an ill-formed OneTo:

julia> a = reinterpret(Base.OneTo{Int}, [-10])[]
Base.OneTo(-10)

julia> Base.checkindex(Bool, a, 2)
true

@matthias314 (Contributor, Author)

Although we have already moved forward quite a bit, I would like to ask a basic question: The whole point of this endeavor is to make iterate safe. Otherwise we could just use @inbounds in iterate(a::AbstractArray) (as suggested in #40397) and get what seems to be optimal performance.

Instead of addressing this for each type separately (FastSubArray in our case), one could think of more general approaches. One would be to extend the @inbounds / @propagate_inbounds mechanism to iterate, see #40397 (comment). That seems to me a very clean solution, but one might have to change quite a few method definitions to realize it.

Another approach would be to use @N5N3's idea of UnsafeArray (originally named SafeArray; see above). For this type, getindex always uses @inbounds. Defining

unsafe_iter(x) = x
unsafe_iter(x::AbstractArray) = UnsafeArray(x)

one can now replace an iterator iter by unsafe_iter(iter) in any function one wants to speed up. For example, doing so for mapfoldl would already go a long way, since many other functions rely on it. One could also think of generators (Generator, zip, etc.) and of the lowering of for loops. I've tried out mapfoldl, Generator, and zip and got excellent performance for views, with just a few lines of code. Am I overlooking any problems with this approach, or has this already been discussed and rejected?
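For concreteness, a caller inside Base might then look like this (mysum is a hypothetical stand-in for the change to mapfoldl, with UnsafeArray and unsafe_iter as defined above):

# Hypothetical: speed up a reduction by iterating the unsafe wrapper.
# Safe as long as this function itself never produces an out-of-bounds index.
mysum(v::AbstractArray) = foldl(+, unsafe_iter(v); init = zero(eltype(v)))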

@N5N3 (Member) commented Feb 23, 2023

Am I overlooking any problems with this approach, or has this already been discussed and rejected?

I think users are free to use unsafe_iter in their code if they can ensure safety.

As for iterate with @inbounds / @propagate_inbounds, I don't think we'll have it in Base any time soon, as discussed in #40397.

@matthias314 (Contributor, Author)

I think users are free to use unsafe_iter in their code if they can ensure safety.

But why not have it in Base? As far as I understand the discussion in #40397, the concern with @inbounds inside iterate was that the function might be fed an invalid state. With unsafe_iter encapsulated inside a function from Base, this would not be possible -- assuming that the function is correctly implemented, of course. If one has doubts about that, then one wouldn't use unsafe_iter in this function, analogously to the use or non-use of @inbounds.

@N5N3 (Member) commented Feb 23, 2023

assuming that the function is correctly implemented.

Then there's no difference from @propagate_inbounds + @inbounds?

In my opinion, unsafe_iter is more of a performance hack than a stable API, so a package might be a better choice.
For personal usage, I tend to put such code in the startup file, as it's really small.

@N5N3 N5N3 force-pushed the m3/iterate-subarray branch 2 times, most recently from 78210bf to 7fa7e09 Compare February 23, 2023 05:34
@N5N3
Copy link
Member

N5N3 commented Feb 23, 2023

@nanosoldier runbenchmarks("array", vs=":master")

@nanosoldier (Collaborator)

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

@matthias314 (Contributor, Author) commented Feb 23, 2023

Then there's no difference from @propagate_inbounds + @inbounds?

The difference would only be in the way it's implemented. I still think that the cleanest way would be to use @propagate_inbounds for all iterators one wants to speed up. This would be completely in line with getindex and setindex!. However, I fear that this idea would be rejected because it requires a lot of code changes. (I didn't see an explicit reason against this in #40397.) EDIT: Maybe I'm overestimating the necessary changes here.

I believe that with unsafe_iter the changes would be much smaller: one defines UnsafeArray as well as unsafe_iter for each iterator one wants to speed up (including things like Generator and Zip), plus a call to unsafe_iter in each function one wants to speed up. I don't think there are many of them. You may argue that it's a hack, but it may have a better chance of being realized than the @inbounds approach. Or am I wrong here?

The main advantage of both approaches compared to the one we are working on here is that it frees us from coming up with rather tricky code for each type that we want to speed up.

@N5N3 (Member) commented Feb 24, 2023

On second thought, what we need here might not be the iterate specialization.
I tried making Array use the default iterate fallback, and it's still SIMD-able.
Theoretically, the default fallback is equivalent to for i in eachindex(...) in foldl, and LLVM should be able to omit the boundscheck code.
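For reference, the AbstractArray fallback in Base is essentially the following, so the bounds check lives in the checked getindex call:

# Base's fallback (paraphrased): walk eachindex(A) and use checked indexing.
function iterate(A::AbstractArray, state = (eachindex(A),))
    y = iterate(state...)
    y === nothing && return nothing
    A[y[1]], (state[1], Base.tail(y)...)
end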
I tried to play with the checkindex hack on 1.9-beta1:

julia> f(v) = foldl(+, v; init = zero(eltype(v)))
f (generic function with 1 method)

julia> n = 2^10; v = rand(Int, n); ov = OffsetArray(v, -1-n); vv = view(v, 1:n); sv = reshape(vv, :, 2^5); vov = view(ov, :); vov2 = view(ov, -n:-1);

julia> for h in (v, ov, vv, sv, vov, vov2)
       println(typeof(h))
       @btime f($h)
       end
Vector{Int64}
  61.601 ns (0 allocations: 0 bytes)
OffsetVector{Int64, Vector{Int64}}
  59.980 ns (0 allocations: 0 bytes)
SubArray{Int64, 1, Vector{Int64}, Tuple{UnitRange{Int64}}, true}
  374.879 ns (0 allocations: 0 bytes)
Base.ReshapedArray{Int64, 2, SubArray{Int64, 1, Vector{Int64}, Tuple{UnitRange{Int64}}, true}, Tuple{}}
  471.939 ns (0 allocations: 0 bytes)
SubArray{Int64, 1, OffsetVector{Int64, Vector{Int64}}, Tuple{Base.Slice{OffsetArrays.IdOffsetRange{Int64, Base.OneTo{Int64}}}}, true}
  553.476 ns (0 allocations: 0 bytes)
SubArray{Int64, 1, OffsetVector{Int64, Vector{Int64}}, Tuple{UnitRange{Int64}}, true}
  373.430 ns (0 allocations: 0 bytes)

julia> Base.checkindex(::Type{Bool}, ind::Base.OneTo{T}, I::T) where {T<:Base.BitInteger} = unsigned(I - one(I)) < unsigned(last(ind))

julia> Base.checkindex(::Type{Bool}, ind::Base.IdentityUnitRange, I::Real) = checkindex(Bool, ind.indices, I) # needed for OffsetArray

julia> for h in (v, ov, vv, sv, vov, vov2)
       println(typeof(h))
       @btime f($h)
       end
Vector{Int64}
  61.546 ns (0 allocations: 0 bytes)
OffsetVector{Int64, Vector{Int64}}
  60.020 ns (0 allocations: 0 bytes)
SubArray{Int64, 1, Vector{Int64}, Tuple{UnitRange{Int64}}, true}
  63.646 ns (0 allocations: 0 bytes)
Base.ReshapedArray{Int64, 2, SubArray{Int64, 1, Vector{Int64}, Tuple{UnitRange{Int64}}, true}, Tuple{}}
  63.710 ns (0 allocations: 0 bytes)
SubArray{Int64, 1, OffsetVector{Int64, Vector{Int64}}, Tuple{Base.Slice{OffsetArrays.IdOffsetRange{Int64, Base.OneTo{Int64}}}}, true}
  68.641 ns (0 allocations: 0 bytes)
SubArray{Int64, 1, OffsetVector{Int64, Vector{Int64}}, Tuple{UnitRange{Int64}}, true}
  63.265 ns (0 allocations: 0 bytes)

I haven't checked more AbstractArray types, but it looks like this works quite well.

Of course, view(v, 2:2:n) is another story, as length(::StepRange) is not inlined by default.
For now, users can use view(v, 2:static(2):n) via Static.jl to get it inlined.
In the future, this problem should be solved by #47844.

@N5N3 N5N3 force-pushed the m3/iterate-subarray branch from 7fa7e09 to 370d946 Compare February 24, 2023 03:21
the `checkindex` hack is based on the assumption that `last(ind) >= 0`

remove `iterate` specialization.
@N5N3 N5N3 force-pushed the m3/iterate-subarray branch from 370d946 to bb84177 Compare February 24, 2023 03:27
@N5N3 (Member) commented Feb 24, 2023

@nanosoldier runbenchmarks("array", vs=":master")

@nanosoldier (Collaborator)

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

@N5N3 N5N3 merged commit 1543cdd into JuliaLang:master Feb 24, 2023
@matthias314 matthias314 deleted the m3/iterate-subarray branch March 5, 2023 00:36
@NHDaly added the backport 1.9 label Aug 10, 2023
@NHDaly (Member) commented Aug 10, 2023

Marking this PR for backport to 1.9 to fix #50831: this PR was required for #48933 to be perf-neutral, but #48933 was previously backported to 1.9 without it.

KristofferC pushed a commit that referenced this pull request Aug 14, 2023
This change simplifies the boundscheck in the loop, as LLVM lifts the constant subtraction.
SIMD blocks are generated in more cases.

Co-authored-by: N5N3 <[email protected]>
(cherry picked from commit 1543cdd)
@IanButterworth removed the backport 1.9 label Aug 19, 2023