Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix overflow in CartesianIndices iteration #31011

Merged
merged 4 commits into from
Mar 29, 2019
Merged

fix overflow in CartesianIndices iteration #31011

merged 4 commits into from
Mar 29, 2019

Conversation

vchuravy
Copy link
Member

@vchuravy vchuravy commented Feb 8, 2019

Make iteration over CartesianIndices blazing fast.
Fixes #30998, by avoiding the potential overflow.

LLVM was right..., based on the given semantics it
is not possible to determine the loop-bounds of the
loop in:

function vecfail(A)
     @inbounds begin
         acc = zero(eltype(A))
         N = length(A)
         for I in CartesianIndices(A)
              acc = Base.FastMath.add_fast(acc, A[I])
         end
     end
     return acc
end

Since it is possible for the iteration to overflow:

julia> I = CartesianIndices(1:typemax(Int64));

julia> i= last(I)
CartesianIndex(9223372036854775807,)

julia> iterate(I, i)
(CartesianIndex(-9223372036854775808,), CartesianIndex(-9223372036854775808,))

@vchuravy vchuravy requested review from Keno and timholy February 8, 2019 16:56
base/multidimensional.jl Outdated Show resolved Hide resolved
@vchuravy vchuravy force-pushed the vc/cartesian branch 2 times, most recently from 1aac477 to eebadbc Compare February 9, 2019 03:04
@vchuravy
Copy link
Member Author

vchuravy commented Feb 9, 2019

I should add that this helps vectorize 1D CartesianIndices, but doesn't help with ND. The loop we lower two has two induction variables. instead of being a inner and a outer for-loop.

I would love for:

function vecfail(A)
            @inbounds begin
                acc = zero(eltype(A))
                for I in CartesianIndices(A)
                     acc = Base.FastMath.add_fast(acc, A[I])
                end
            end
            return acc
       end

to be more or less equivalent to

using Base.Cartesian
@generated function vecfail(A::AbstractArray{T, N}) where {T, N}
             quote
             @inbounds begin
                acc = zero(eltype(A))
                @nloops $N i A begin
                     acc = Base.FastMath.add_fast(acc, @nref($N, A, i))
                end
            end
            return acc
            end
       end

I know eachindex with arrays that support linear indexing will already get me there.

@vchuravy
Copy link
Member Author

vchuravy commented Feb 9, 2019

Ok updated once again after I realized that

state == last(iter) && return nothing

is basically equivalent to:

        idxend = map(last, iter.indices)
        all(map(>=, state.I, idxend)) && return nothing

Since the latter already accepts invalid state. I also experimented with checked arithmetic, but that didn't help with vectorisation.

Copy link
Member

@timholy timholy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! I think between this suggestion and my refinement in the comment below, we may have a fix (or close to it) for #9080! That's one of the few bugs whose number I know by memory, so to me that's a big deal.

See this gist and then:

julia> includet("/tmp/iterc.jl")

julia> using BenchmarkTools

julia> @btime sumcart_manual($A)
  102.337 ms (0 allocations: 0 bytes)
5.0001993217249475e7

julia> @btime sumcart_iter($A)
  131.513 ms (0 allocations: 0 bytes)
5.0001993217249475e7

julia> @btime sumcart_newiter($A)
  105.207 ms (0 allocations: 0 bytes)
5.0001993217249475e7

@@ -334,9 +338,11 @@ module IteratorsMD
iterfirst, iterfirst
end
@inline function iterate(iter::CartesianIndices, state)
# If we increment before the condition check, we run the
# risk of an integer overflow. This intentionally allows for invalid state.
state == last(iter) && return nothing
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the ND case, what if we weave this check together with inc? Here we check all of the entries of state and iter, and if there are any differences we bail. But then we potentially check them all again in inc. It seems you could be running this check on each entry as you go, and then bail if you've exhausted the tuple.

Something like this:

@inline function iterate(iter::CartesianIndices, state)
    inc((), state.I, first(iter).I, last(iter).I)
end

# increment & carry
@inline inc(out, ::Tuple{}, ::Tuple{}, ::Tuple{}) = nothing
@inline function inc(out, state, start, stop)
    if state[1] < stop[1]
        nextstate = CartesianIndex(out..., state[1]+1, tail(state)...)
        return nextstate, nextstate
    end
    return inc((out..., start[1]), tail(state), tail(start), tail(stop))
end

Now historically Julia's inference has not liked arguments that grow, but experimenting with @code_warntype up to 8 dimensions suggests it's OK in this case. Cross-checking with @vtjnash.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I went a slightly different approach that constructs the tuple in anycase and propagates a boolean to check if it is valid.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this still solve #9080?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:/ No.

julia> @benchmark sumcart_iter($A)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     190.778 ms (0.00% GC)
  median time:      191.650 ms (0.00% GC)
  mean time:        191.665 ms (0.00% GC)
  maximum time:     192.456 ms (0.00% GC)
  --------------
  samples:          27
  evals/sample:     1

julia> @benchmark sumcart_manual($A)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     102.647 ms (0.00% GC)
  median time:      103.522 ms (0.00% GC)
  mean time:        103.663 ms (0.00% GC)
  maximum time:     105.404 ms (0.00% GC)
  --------------
  samples:          49
  evals/sample:     1

@timholy
Copy link
Member

timholy commented Feb 9, 2019

With regards to the multidimensional case, note that @simd strips off the first dimension's iteration variable to create two loops, an outer n-1-dimensional iteration and an inner 1-dimensional iteration. So at least in the typical case you can vectorize across the first dimension. Would be nice to need that and get the equivalent of the @nloops approach.

@vchuravy
Copy link
Member Author

vchuravy commented Feb 9, 2019

Would be nice to need that and get the equivalent of the @nloops approach.

I wonder if Polly could do things like this, but right now both -loop-analysis and -scalar-evolution get confused about the 2D case. (I will punt on that though)

@vchuravy
Copy link
Member Author

vchuravy commented Feb 9, 2019

Interesting using Tim's variant I do see a speedup 200->150, but not the same speedup as if I am using _newiter from the gist 200->100.

@timholy
Copy link
Member

timholy commented Feb 9, 2019

but not the same speedup as if I am using _newiter from the gist 200->100

Just to check, are you saying that if you "wire this in" to Base.iterate and then rely on lowering to reduce the for loop to an iterate loop, it's slower than the version I produced by manual lowering? If so, what's the difference between the two @code_lowereds?

@vchuravy
Copy link
Member Author

Just to check, are you saying that if you "wire this in" to Base.iterate and then rely on lowering to reduce the for loop to an iterate loop, it's slower than the version I produced by manual lowering? If so, what's the difference between the two @code_lowereds?

Exactly! I rewrote the newiter version to match more closely what the iter version produces:

using Base: tail

@inline function myiterate(iter::CartesianIndices)
    iterfirst, iterlast = first(iter), last(iter)
    if any(map(>, iterfirst.I, iterlast.I))
        return nothing
    end
    iterfirst, iterfirst
end
@inline function myiterate(iter::CartesianIndices, state)
    inc((), state.I, first(iter).I, last(iter).I)
end
# 0-d cartesian ranges are special-cased to iterate once and only once
myiterate(iter::CartesianIndices{0}, done=false) = done ? nothing : (CartesianIndex(), true)

# increment & carry
@inline inc(out, ::Tuple{}, ::Tuple{}, ::Tuple{}) = nothing
@inline function inc(out, state::Tuple{Int}, start::Tuple{Int}, stop::Tuple{Int})
    if state[1] < stop[1]
        nextstate = CartesianIndex(out..., state[1]+1)
        return nextstate, nextstate
    end
    return nothing
end
@inline function inc(out, state, start, stop)
    if state[1] < stop[1]
        nextstate = CartesianIndex(out..., state[1]+1, tail(state)...)
        return nextstate, nextstate
    end
    return inc((out..., start[1]), tail(state), tail(start), tail(stop))
end


function sumcart_newiter(A)
    s = 0.0
    iter = CartesianIndices(A)
    ret = myiterate(iter)
    @inbounds while !(ret === nothing)
        I, state = ret
        s += A[I]
        ret = myiterate(iter, state)
    end
    return s
end

The only difference is that this a head-loop and the iteration loop produces a check + tail-loop. For the curious I put both produced LLVM IR into godbolt and setup MCA for both https://godbolt.org/z/8owxJT

@vchuravy
Copy link
Member Author

Ok... before a755c6a

julia> @benchmark sumcart_iter($A)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     158.716 ms (0.00% GC)
  median time:      163.086 ms (0.00% GC)
  mean time:        174.294 ms (0.00% GC)
  maximum time:     288.133 ms (0.00% GC)
  --------------
  samples:          29
  evals/sample:     1

after:

julia> @benchmark sumcart_iter($A)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     105.531 ms (0.00% GC)
  median time:      106.831 ms (0.00% GC)
  mean time:        107.862 ms (0.00% GC)
  maximum time:     120.262 ms (0.00% GC)
  --------------
  samples:          47
  evals/sample:     1

This is a a bit odd since LLVM prefers do ... while loops, and definitely requires a nanosoldier run.

@nanosoldier runbenchmarks(ALL, vs = "@5a6b972282a91ceea422c479433e7d1413fea2b4")

@jekbradbury
Copy link
Contributor

What does this lowering change look like in terms of the equivalent Julia syntax? I think there’s an example of for loop lowering in the manual that might need to be adjusted.

Copy link
Member

@timholy timholy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really fascinating. I am not the best person to comment on LLVM's preferences for various lowered forms, and certainly not the changes to julia-synax.scm. It will be interesting to see the nanosoldier run.

@inline function inc(state, start, stop)
# increment post check to avoid integer overflow
@inline inc(out, ::Tuple{}, ::Tuple{}, ::Tuple{}) = nothing
@inline function inc(out, state::Tuple{Int}, start::Tuple{Int}, stop::Tuple{Int})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we even need this method anymore? I think it's covered by the generic case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. I added it because this is a common pattern elsewhere (probably steming from a time when inference liked code like this more).

base/multidimensional.jl Outdated Show resolved Hide resolved
@vchuravy
Copy link
Member Author

What does this lowering change look like in terms of the equivalent Julia syntax? I think there’s an example of for loop lowering in the manual that might need to be adjusted.

Since we don't have a do ... while loop in Julia. Loop lowering on master currently looks like this:

next = iterate(iter)
if !(next === nothing)
   @label loop
   I, state = next
  # ...
  next = iterate(iter, state)
  if !(next === nothing)
    @goto loop
  end
end

my change changes it back to how it was on 0.6, but with the new iteration protocol

next = iterate(iter)
while !(next === nothing)
  I, state = next
  ...
  next = iterate(iter, state)
end

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@vchuravy
Copy link
Member Author

The nanosoldier results are confusing and all over the board. How can this change lead to half the memory usage....

@@ -95,6 +95,10 @@ module IteratorsMD
# access to index tuple
Tuple(index::CartesianIndex) = index.I

# equality
import Base: ==
==(a::CartesianIndex{N}, b::CartesianIndex{N}) where N = a.I == b.I
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for the import, this can just be Base.:(==)(...), which is more explicit.

@ararslan
Copy link
Member

The results will be noisier the more benchmarks are run at once, so if there are a few in particular you're interested in, you could do those specifically instead of ALL. That should give more reliable results.

@vchuravy
Copy link
Member Author

sigh

Current Julia:

julia> A = zeros(3,3)
3×3 Array{Float64,2}:
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0

julia> nextind(A, CartesianIndex(3,3))
CartesianIndex(1, 4)

nextind(CartesianIndices((1:3, 1:typemax(Int))), CartesianIndex(3,typemax(Int)))
CartesianIndex(1, -9223372036854775808)

Looking at the docs for nextind the expected behaviour might be a BoundsError?

@vchuravy
Copy link
Member Author

Okay. I decided not to experiment with loop-form and just focus on fixing the over/underflows.

Notes:

  • Lowering of cartesian loops with N>=2 is something I still want to further investigate, but the general realisation is that SCEV doesn't like it and can't figure out that the loops are bounded.
    A potential idea might be to have an IR pass that converts loops of this form to loop nests (hopefully solving this issue in general) and avoiding the kind of strip-mining transformation @simd does.
  • I am unsure about the semantics for nextind usage generally seems to allow one illegal index to be returned and then calling nextind on that is supposed to give you a BoundsError.
  • I rewrote findnext to avoid the overflow, but it would be cleaner to directly call iterate, but it is not clear to me whether start can be expected to be a proper iterator state.

@vchuravy
Copy link
Member Author

I am unsure about the semantics for nextind usage generally seems to allow one illegal index to be returned and then calling nextind on that is supposed to give you a BoundsError.

I went back to the form valid, i = inc(), which allows nextind/prevind to produce the nextindex. These might still underflow/overflow so I will leave the changes for findnext et.al, but I could also revert those and open a separate PR.

This also sets us up with a CGF that I believe to be amenable to be transformed into a loop nest.

@vchuravy vchuravy added this to the 1.2 milestone Mar 18, 2019
@vchuravy
Copy link
Member Author

There seems to be an inferrability issue:

  Test threw exception
  Expression: #= /tmp/julia/share/julia/test/reducedim.jl:55 =# @inferred(maximum(Areduc, dims=region)) ≈ safe_maximum(Areduc, region)
  return type Array{Float64,4} does not match inferred return type Array{_A,4} where _A
  Stacktrace:
   [1] error(::String) at ./error.jl:33
   [2] top-level scope at /tmp/julia/share/julia/test/reducedim.jl:55
   [3] top-level scope at /home/travis/build/JuliaLang/julia/usr/share/julia/stdlib/v1.2/Test/src/Test.jl:1186
   [4] include at ./boot.jl:325 [inlined]
   [5] include_relative(::Module, ::String) at ./loading.jl:1042
   [6] include at ./Base.jl:31 [inlined]
   [7] include(::String) at /tmp/julia/share/julia/test/testdefs.jl:13
   [8] top-level scope at /tmp/julia/share/julia/test/testdefs.jl:23
   [9] top-level scope at /home/travis/build/JuliaLang/julia/usr/share/julia/stdlib/v1.2/Test/src/Test.jl:1113
   [10] top-level scope at /tmp/julia/share/julia/test/testdefs.jl:22
   [11] top-level scope at util.jl:289
   [12] top-level scope at /tmp/julia/share/julia/test/testdefs.jl:20

Haven't had time to take a deeper look.

vchuravy and others added 3 commits March 28, 2019 09:47
This allows LLVM to vectorize the 1D CartesianIndices case,
as well as fixing an overflow bug for:

```julia
CartesianIndices(((typemax(Int64)-2):typemax(Int64),))
```

Co-authored-by: Yingbo Ma <[email protected]>
@vchuravy
Copy link
Member Author

I plan to merge this tomorrow.

@mbauman
Copy link
Member

mbauman commented Mar 28, 2019

I wanna see another nanosoldier run first. There were some fairly major regressions in array iteration in the run in February.

@nanosoldier runbenchmarks("array" || "union", vs = ":master")

base/array.jl Outdated Show resolved Hide resolved
base/array.jl Outdated Show resolved Hide resolved
@vchuravy
Copy link
Member Author

Fair enough, the run in February was the experiment in changing the loop structure for the iteration protocol.

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

Co-Authored-By: vchuravy <[email protected]>
@vchuravy
Copy link
Member Author

@mbauman The nanosoldier results look clean. Are you okay with squash merging this as is?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants