
Vectorization failure for CartesianIndices #30998

Closed
vchuravy opened this issue Feb 7, 2019 · 0 comments
Labels
compiler:simd instruction-level vectorization performance Must go faster

Comments

@vchuravy
Member

vchuravy commented Feb 7, 2019

@peterahrens sent me on a goose chase, which led me to a peculiar observation. Given a relatively simple reduction loop (intentionally not using @simd here):

function vecfail(A)
    @inbounds begin
        acc = zero(eltype(A))
        N = length(A)
        for I in CartesianIndices(A)
            acc = Base.FastMath.add_fast(acc, A[I])
        end
    end
    return acc
end

Before loop-vectorize, this turns into the following nice loop (https://godbolt.org/z/aIU_3u):

; ...
L6:                                               ; preds = %L6.lr.ph, %L6
  %value_phi13 = phi i64 [ 0 , %L6.lr.ph ], [ %13, %L6 ]
  %value_phi2 = phi float [ 0.000000e+00, %L6.lr.ph ], [ %12, %L6 ]
  %9 = add i64 %value_phi13, -1
  %10 = getelementptr inbounds float, float addrspace(13)* %8, i64 %9
  %11 = load float, float addrspace(13)* %10, align 4
  %12 = fadd fast float %11, %value_phi2
  %13 = add i64 %value_phi13, 1
  ; %13 = add nsw i64 %value_phi13, 1
  %14 = icmp sgt i64 %13, %5
  ; %14 = icmp eq i64 %13, %5
  br i1 %14, label %L10, label %L6
; ...
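
For reference, one way to inspect the generated IR from the REPL (a sketch; the Godbolt link above shows the exact pipeline the snippet came from, whereas code_llvm prints the fully optimized IR):

using InteractiveUtils

A = rand(Float32, 64, 64)
# Print the optimized LLVM IR for vecfail; when vectorization fails,
# a scalar reduction loop like %L6 above is still recognizable.
code_llvm(stdout, vecfail, Tuple{typeof(A)}; optimize = true)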

If one replaces either %13 or %14 with the commented-out variant below it, the function vectorizes just fine.

I came up with two fixes to Base.

Change iterate so that the termination check is performed on the freshly incremented state:

@eval Base.IteratorsMD begin
    @inline function iterate(iter::CartesianIndices, state)
        nextstate = CartesianIndex(inc(state.I, first(iter).I, last(iter).I))
        nextstate.I[end] == (last(iter.indices[end]) + 1) && return nothing
        nextstate, nextstate
    end
end

Change inc to use an unsafe_inc whose increment carries the nsw flag:

@eval Base.IteratorsMD begin
    function unsafe_inc(i::Int64)
        Base.llvmcall("""
            %i = add nsw i64 %0, 1
            ret i64 %i
            """, Int64, Tuple{Int64}, i)
    end
    @inline inc(state::Tuple{Int}, start::Tuple{Int}, stop::Tuple{Int}) = (unsafe_inc(state[1]),)
    @inline function inc(state, start, stop)
        if state[1] < stop[1]
            return (unsafe_inc(state[1]), tail(state)...)
        end
        newtail = inc(tail(state), tail(start), tail(stop))
        (start[1], newtail...)
    end
end
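
A rough way to check whether either patch helps is to search the optimized IR for packed instructions; has_vector_body below is a hypothetical helper written for this issue, not an existing API:

using InteractiveUtils

# Hypothetical helper: true if the optimized IR for f(argtypes...)
# contains vector instructions such as `<4 x float>`.
function has_vector_body(f, argtypes)
    io = IOBuffer()
    code_llvm(io, f, argtypes; optimize = true)
    return occursin(r"<\d+ x float>", String(take!(io)))
end

A = rand(Float32, 64, 64)
@show has_vector_body(vecfail, Tuple{typeof(A)})  # expected to become true once a fix is applied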

It would be nice if we could convince LLVM to do this reasoning for us.
Here is some ScalarEvolution data I collected as well; note that for the failing version the backedge-taken count is unpredictable, while for the working version it is computed exactly as (-2 + %5):

➜  $OPT -analyze -scalar-evolution vecfailsum.opt.ll -S
Printing analysis 'Scalar Evolution Analysis' for function 'julia_vecfailsum_12014':
Classifying expressions for: @julia_vecfailsum_12014
  %1 = call %jl_value_t*** @julia.ptls_states()
  -->  %1 U: full-set S: full-set
  %2 = addrspacecast %jl_value_t addrspace(10)* %0 to %jl_value_t addrspace(11)*
  -->  %2 U: full-set S: full-set
  %3 = bitcast %jl_value_t addrspace(11)* %2 to %jl_array_t addrspace(11)*
  -->  %2 U: full-set S: full-set
  %4 = getelementptr inbounds %jl_array_t, %jl_array_t addrspace(11)* %3, i64 0, i32 1
  -->  (8 + %2)<nsw> U: full-set S: full-set
  %5 = load i64, i64 addrspace(11)* %4, align 8
  -->  %5 U: full-set S: full-set
  %7 = bitcast %jl_value_t addrspace(11)* %2 to float addrspace(13)* addrspace(11)*
  -->  %2 U: full-set S: full-set
  %8 = load float addrspace(13)*, float addrspace(13)* addrspace(11)* %7, align 8
  -->  %8 U: full-set S: full-set
  %value_phi13 = phi i64 [ 1, %L6.lr.ph ], [ %13, %L6 ]
  -->  {1,+,1}<%L6> U: full-set S: full-set		Exits: <<Unknown>>		LoopDispositions: { %L6: Computable }
  %9 = add i64 %value_phi13, -1
  -->  {0,+,1}<%L6> U: full-set S: full-set		Exits: <<Unknown>>		LoopDispositions: { %L6: Computable }
  %10 = getelementptr inbounds float, float addrspace(13)* %8, i64 %9
  -->  {%8,+,4}<%L6> U: full-set S: full-set		Exits: <<Unknown>>		LoopDispositions: { %L6: Computable }
  %13 = add i64 %value_phi13, 1
  -->  {2,+,1}<%L6> U: full-set S: full-set		Exits: <<Unknown>>		LoopDispositions: { %L6: Computable }
Determining loop execution counts for: @julia_vecfailsum_12014
Loop %L6: Unpredictable backedge-taken count. 
Loop %L6: Unpredictable max backedge-taken count. 
Loop %L6: Unpredictable predicated backedge-taken count. 

➜  $OPT -analyze -scalar-evolution vecsum.opt.ll -S
Printing analysis 'Scalar Evolution Analysis' for function 'julia_vecsum_12012':
Classifying expressions for: @julia_vecsum_12012
  %1 = call %jl_value_t*** @julia.ptls_states()
  -->  %1 U: full-set S: full-set
  %2 = addrspacecast %jl_value_t addrspace(10)* %0 to %jl_value_t addrspace(11)*
  -->  %2 U: full-set S: full-set
  %3 = bitcast %jl_value_t addrspace(11)* %2 to %jl_array_t addrspace(11)*
  -->  %2 U: full-set S: full-set
  %4 = getelementptr inbounds %jl_array_t, %jl_array_t addrspace(11)* %3, i64 0, i32 1
  -->  (8 + %2)<nsw> U: full-set S: full-set
  %5 = load i64, i64 addrspace(11)* %4, align 8
  -->  %5 U: full-set S: full-set
  %7 = bitcast %jl_value_t addrspace(11)* %2 to float addrspace(13)* addrspace(11)*
  -->  %2 U: full-set S: full-set
  %8 = load float addrspace(13)*, float addrspace(13)* addrspace(11)* %7, align 8
  -->  %8 U: full-set S: full-set
  %value_phi13 = phi i64 [ 1, %L7.lr.ph ], [ %13, %L7 ]
  -->  {1,+,1}<%L7> U: [1,0) S: [1,0)		Exits: (-1 + %5)		LoopDispositions: { %L7: Computable }
  %9 = add i64 %value_phi13, -1
  -->  {0,+,1}<%L7> U: [0,-1) S: [0,-1)		Exits: (-2 + %5)		LoopDispositions: { %L7: Computable }
  %10 = getelementptr inbounds float, float addrspace(13)* %8, i64 %9
  -->  {%8,+,4}<%L7> U: full-set S: full-set		Exits: (-8 + (4 * %5) + %8)		LoopDispositions: { %L7: Computable }
  %13 = add i64 %value_phi13, 1
  -->  {2,+,1}<%L7> U: [2,1) S: [2,1)		Exits: %5		LoopDispositions: { %L7: Computable }
Determining loop execution counts for: @julia_vecsum_12012
Loop %L7: backedge-taken count is (-2 + %5)
Loop %L7: max backedge-taken count is -2
Loop %L7: Predicated backedge-taken count is (-2 + %5)
 Predicates:

Loop %L7: Trip multiple is 1
@vchuravy vchuravy added performance Must go faster upstream The issue is with an upstream dependency, e.g. LLVM external dependencies Involves LLVM, OpenBLAS, or other linked libraries compiler:simd instruction-level vectorization labels Feb 7, 2019
@vchuravy vchuravy removed external dependencies Involves LLVM, OpenBLAS, or other linked libraries upstream The issue is with an upstream dependency, e.g. LLVM labels Feb 8, 2019