
Vectorization failure for CartesianIndices #30998

Closed
vchuravy opened this issue Feb 7, 2019 · 0 comments
Labels
compiler:simd instruction-level vectorization performance Must go faster

Comments

@vchuravy
Member

vchuravy commented Feb 7, 2019

@peterahrens sent me on a goose chase, which led me to a peculiar observation. Given a relatively simple reduction loop (intentionally not using @simd here):

function vecfail(A)
    @inbounds begin
        acc = zero(eltype(A))
        N = length(A)
        for I in CartesianIndices(A)
            acc = Base.FastMath.add_fast(acc, A[I])
        end
    end
    return acc
end

Before loop-vectorize, this turns into the following nice loop (https://godbolt.org/z/aIU_3u):

; ...
L6:                                               ; preds = %L6.lr.ph, %L6
  %value_phi13 = phi i64 [ 0 , %L6.lr.ph ], [ %13, %L6 ]
  %value_phi2 = phi float [ 0.000000e+00, %L6.lr.ph ], [ %12, %L6 ]
  %9 = add i64 %value_phi13, -1
  %10 = getelementptr inbounds float, float addrspace(13)* %8, i64 %9
  %11 = load float, float addrspace(13)* %10, align 4
  %12 = fadd fast float %11, %value_phi2
  %13 = add i64 %value_phi13, 1
  ; %13 = add nsw i64 %value_phi13, 1
  %14 = icmp sgt i64 %13, %5
  ; %14 = icmp eq i64 %13, %5
  br i1 %14, label %L10, label %L6
; ...
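
For reference, one way to inspect the generated IR from the REPL (a sketch; the Godbolt link above shows the exact pipeline the snippet came from, whereas code_llvm prints the fully optimized IR):

using InteractiveUtils

A = rand(Float32, 64, 64)
# Print the optimized LLVM IR for vecfail; when vectorization fails,
# a scalar reduction loop like %L6 above is still recognizable.
code_llvm(stdout, vecfail, Tuple{typeof(A)}; optimize = true)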

If one replaces either %13 or %14 with the commented-out variant below it, the function vectorizes just fine.

I came up with two fixes to Base.

Change iterate so that the termination check is performed on the freshly incremented state:

@eval Base.IteratorsMD begin
    @inline function iterate(iter::CartesianIndices, state)
        nextstate = CartesianIndex(inc(state.I, first(iter).I, last(iter).I))
        nextstate.I[end] == (last(iter.indices[end]) + 1) && return nothing
        nextstate, nextstate
    end
end

Change inc to use an unsafe_inc whose increment carries the nsw flag:

@eval Base.IteratorsMD begin
    function unsafe_inc(i::Int64)
        Base.llvmcall("""
            %i = add nsw i64 %0, 1
            ret i64 %i
            """, Int64, Tuple{Int64}, i)
    end
    @inline inc(state::Tuple{Int}, start::Tuple{Int}, stop::Tuple{Int}) = (unsafe_inc(state[1]),)
    @inline function inc(state, start, stop)
        if state[1] < stop[1]
            return (unsafe_inc(state[1]), tail(state)...)
        end
        newtail = inc(tail(state), tail(start), tail(stop))
        (start[1], newtail...)
    end
end
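
A rough way to check whether either patch helps is to search the optimized IR for packed instructions; has_vector_body below is a hypothetical helper written for this issue, not an existing API:

using InteractiveUtils

# Hypothetical helper: true if the optimized IR for f(argtypes...)
# contains vector instructions such as `<4 x float>`.
function has_vector_body(f, argtypes)
    io = IOBuffer()
    code_llvm(io, f, argtypes; optimize = true)
    return occursin(r"<\d+ x float>", String(take!(io)))
end

A = rand(Float32, 64, 64)
@show has_vector_body(vecfail, Tuple{typeof(A)})  # expected to become true once a fix is applied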

It would be nice if we could convince LLVM to do this reasoning for us.
Here is some ScalarEvolution data I collected as well; note that for the failing version the backedge-taken count is unpredictable, while for the working version it is computed exactly as (-2 + %5):

➜  $OPT -analyze -scalar-evolution vecfailsum.opt.ll -S
Printing analysis 'Scalar Evolution Analysis' for function 'julia_vecfailsum_12014':
Classifying expressions for: @julia_vecfailsum_12014
  %1 = call %jl_value_t*** @julia.ptls_states()
  -->  %1 U: full-set S: full-set
  %2 = addrspacecast %jl_value_t addrspace(10)* %0 to %jl_value_t addrspace(11)*
  -->  %2 U: full-set S: full-set
  %3 = bitcast %jl_value_t addrspace(11)* %2 to %jl_array_t addrspace(11)*
  -->  %2 U: full-set S: full-set
  %4 = getelementptr inbounds %jl_array_t, %jl_array_t addrspace(11)* %3, i64 0, i32 1
  -->  (8 + %2)<nsw> U: full-set S: full-set
  %5 = load i64, i64 addrspace(11)* %4, align 8
  -->  %5 U: full-set S: full-set
  %7 = bitcast %jl_value_t addrspace(11)* %2 to float addrspace(13)* addrspace(11)*
  -->  %2 U: full-set S: full-set
  %8 = load float addrspace(13)*, float addrspace(13)* addrspace(11)* %7, align 8
  -->  %8 U: full-set S: full-set
  %value_phi13 = phi i64 [ 1, %L6.lr.ph ], [ %13, %L6 ]
  -->  {1,+,1}<%L6> U: full-set S: full-set		Exits: <<Unknown>>		LoopDispositions: { %L6: Computable }
  %9 = add i64 %value_phi13, -1
  -->  {0,+,1}<%L6> U: full-set S: full-set		Exits: <<Unknown>>		LoopDispositions: { %L6: Computable }
  %10 = getelementptr inbounds float, float addrspace(13)* %8, i64 %9
  -->  {%8,+,4}<%L6> U: full-set S: full-set		Exits: <<Unknown>>		LoopDispositions: { %L6: Computable }
  %13 = add i64 %value_phi13, 1
  -->  {2,+,1}<%L6> U: full-set S: full-set		Exits: <<Unknown>>		LoopDispositions: { %L6: Computable }
Determining loop execution counts for: @julia_vecfailsum_12014
Loop %L6: Unpredictable backedge-taken count. 
Loop %L6: Unpredictable max backedge-taken count. 
Loop %L6: Unpredictable predicated backedge-taken count. 

➜  $OPT -analyze -scalar-evolution vecsum.opt.ll -S
Printing analysis 'Scalar Evolution Analysis' for function 'julia_vecsum_12012':
Classifying expressions for: @julia_vecsum_12012
  %1 = call %jl_value_t*** @julia.ptls_states()
  -->  %1 U: full-set S: full-set
  %2 = addrspacecast %jl_value_t addrspace(10)* %0 to %jl_value_t addrspace(11)*
  -->  %2 U: full-set S: full-set
  %3 = bitcast %jl_value_t addrspace(11)* %2 to %jl_array_t addrspace(11)*
  -->  %2 U: full-set S: full-set
  %4 = getelementptr inbounds %jl_array_t, %jl_array_t addrspace(11)* %3, i64 0, i32 1
  -->  (8 + %2)<nsw> U: full-set S: full-set
  %5 = load i64, i64 addrspace(11)* %4, align 8
  -->  %5 U: full-set S: full-set
  %7 = bitcast %jl_value_t addrspace(11)* %2 to float addrspace(13)* addrspace(11)*
  -->  %2 U: full-set S: full-set
  %8 = load float addrspace(13)*, float addrspace(13)* addrspace(11)* %7, align 8
  -->  %8 U: full-set S: full-set
  %value_phi13 = phi i64 [ 1, %L7.lr.ph ], [ %13, %L7 ]
  -->  {1,+,1}<%L7> U: [1,0) S: [1,0)		Exits: (-1 + %5)		LoopDispositions: { %L7: Computable }
  %9 = add i64 %value_phi13, -1
  -->  {0,+,1}<%L7> U: [0,-1) S: [0,-1)		Exits: (-2 + %5)		LoopDispositions: { %L7: Computable }
  %10 = getelementptr inbounds float, float addrspace(13)* %8, i64 %9
  -->  {%8,+,4}<%L7> U: full-set S: full-set		Exits: (-8 + (4 * %5) + %8)		LoopDispositions: { %L7: Computable }
  %13 = add i64 %value_phi13, 1
  -->  {2,+,1}<%L7> U: [2,1) S: [2,1)		Exits: %5		LoopDispositions: { %L7: Computable }
Determining loop execution counts for: @julia_vecsum_12012
Loop %L7: backedge-taken count is (-2 + %5)
Loop %L7: max backedge-taken count is -2
Loop %L7: Predicated backedge-taken count is (-2 + %5)
 Predicates:

Loop %L7: Trip multiple is 1
@vchuravy vchuravy added performance Must go faster upstream The issue is with an upstream dependency, e.g. LLVM external dependencies Involves LLVM, OpenBLAS, or other linked libraries compiler:simd instruction-level vectorization labels Feb 7, 2019
@vchuravy vchuravy removed external dependencies Involves LLVM, OpenBLAS, or other linked libraries upstream The issue is with an upstream dependency, e.g. LLVM labels Feb 8, 2019