Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make for iter::CartesianIndices better vectorized for 1d/2d cases. #45338

Closed
wants to merge 3 commits into from

Conversation

N5N3
Copy link
Member

@N5N3 N5N3 commented May 17, 2022

Local benchmark shows that this makes non-@simd 1d/2d loop faster.

function sumcart_iter(A)
    s = zero(eltype(A))
    for I in CartesianIndices(A)
        @inbounds @fastmath s += A[I] # use @fastmath to enable simd
    end
    s
end

on master:

julia> A = view(rand(256*20),1:256*20);
julia> @btime sumcart_iter($A)
  4.657 μs (0 allocations: 0 bytes)
2539.6790676561955

julia> B = view(rand(256,20),1:256,1:20);
julia> @btime sumcart_iter($B)
  4.657 μs (0 allocations: 0 bytes)
2557.7341880811764

This PR

julia> @btime sumcart_iter($A)
  305.200 ns (0 allocations: 0 bytes)
2539.6790676562046

julia> @btime sumcart_iter($B)
  406.500 ns (0 allocations: 0 bytes)
2557.734188081178

@N5N3 N5N3 added the performance Must go faster label May 17, 2022
@N5N3 N5N3 requested a review from johnnychen94 May 17, 2022 13:38
rng = indices[1]
I = state[1] + step(rng)
valid = __is_valid_range(I, rng) && state[1] != last(rng)
if N == 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if I get the idea -- when will N != 1 here?

Copy link
Member Author

@N5N3 N5N3 May 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

N is the dimension of the CartesianIndices:

  1. If N > 1, the outermost dimension uses __is_valid_range to preserve the performance improvement introduced in add StepRange support for CartesianIndices #37829.
  2. If N == 1, just use state[1] != last(rng), as __is_valid_range pervents vectorization.

Copy link
Member

@johnnychen94 johnnychen94 May 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But by calling __inc(state.I, iter.indices, Val(ndims(iter))) as in you did in R404, because of the type annotion state::Tuple{Int}, indices::Tuple{OrdinalRangeInt}, this method R420 would only be called when length(iter.indices) == ndims(iter) == 1, right?

Copy link
Member Author

@N5N3 N5N3 May 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's also called at R435 as the last input ndim is passed deeper without any change.

BTW, all the test failures should be invalid state. Since we have

julia> iterate(1:2:typemax(Int), typemax(Int)-1)
(-9223372036854775808, -9223372036854775808)

I think it's OK to replace them.

@johnnychen94
Copy link
Member

johnnychen94 commented May 17, 2022

This is good for me as long as the test passes. Also ping @vchuravy and @timholy as this was originally written in #31011

@N5N3
Copy link
Member Author

N5N3 commented May 18, 2022

Some local BaseBenchmark:

          "index" => 7-element BenchmarkTools.BenchmarkGroup:
                  tags: ["sum", "simd"]
                  ("sumeach", "SubArray{Int32, 2, Matrix{Int32}, Tuple{UnitRange{Int64}, UnitRange{Int64}}, false}") => TrialJudgement(-85.57% => improvement)
                  ("sumcartesian", "SubArray{Int32, 2, Matrix{Int32}, Tuple{UnitRange{Int64}, UnitRange{Int64}}, false}") => TrialJudgement(-87.51% => improvement)
                  ("sumcartesian", "1:100000") => TrialJudgement(-100.00% => improvement)
                  ("sumeach", "SubArray{Int32, 2, BaseBenchmarks.ArrayBenchmarks.ArrayLS{Int32, 2}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}}, false}") => TrialJudgement(-87.62% => improvement)
                  ("sumcartesian", "SubArray{Int32, 2, BaseBenchmarks.ArrayBenchmarks.ArrayLS{Int32, 2}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}}, false}") => TrialJudgement(-87.48% => improvement)
                  ("sumcartesian", "SubArray{Int32, 2, Matrix{Int32}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}}, true}") => TrialJudgement(-84.04% => improvement)
                  ("sumcartesian", "100000:-1:1") => TrialJudgement(-100.00% => improvement)

@N5N3 N5N3 force-pushed the cart-auto-simd branch from d105525 to db66c7a Compare May 18, 2022 10:54
N5N3 and others added 3 commits June 18, 2022 15:15
@N5N3 N5N3 force-pushed the cart-auto-simd branch from db66c7a to 66b8e2d Compare June 18, 2022 07:15
@N5N3
Copy link
Member Author

N5N3 commented Jun 18, 2022

With today's master (LLVM14), this PR also helps vectorizing 3d CartesianIndices (Very limited though)

@N5N3
Copy link
Member Author

N5N3 commented Oct 8, 2023

No need after #51606.

@N5N3 N5N3 closed this Oct 8, 2023
@jonas-schulze
Copy link
Contributor

It's a bit sad that (subjectively) so many PRs get forgotten about for so long.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants