Trivial code change causes vectorization failure #30933

Closed
robsmith11 opened this issue Feb 1, 2019 · 11 comments
Labels
compiler:codegen (Generation of LLVM IR and native code), upstream (The issue is with an upstream dependency, e.g. LLVM)

Comments

robsmith11 (Contributor) commented Feb 1, 2019

The first function below fails to vectorize, but if I make a trivial change (substitute f*f for s at its two uses), as shown in the second function, then Julia produces nicely vectorized LLVM IR.

I don't see any reason why they shouldn't both vectorize, so I think it might be a bug in the Julia optimizer.

@fastmath function fastlog1(x::Float32)::Float32
    xi = reinterpret(Int32, x)
    e = (xi - Int32(1059760811)) & Int32(-8388608)
    m = reinterpret(Float32, xi - e)
    i = e * 1.19209290f-7
    f = m - 1f0
    s = f * f
    r = 0.230836749f0 * f + -0.279208571f0
    t = 0.331826031f0 * f + -0.498910338f0
    r = r * s + t
    r = r * s + f
    i * 0.693147182f0 + r
end

@fastmath function fastlog2(x::Float32)::Float32
    xi = reinterpret(Int32, x)
    e = (xi - Int32(1059760811)) & Int32(-8388608)
    m = reinterpret(Float32, xi - e)
    i = e * 1.19209290f-7
    f = m - 1f0
    # s = f * f 
    r = 0.230836749f0 * f + -0.279208571f0
    t = 0.331826031f0 * f + -0.498910338f0
    r = r * (f*f) + t  # replaced s here
    r = r * (f*f) + f  # and here
    i * 0.693147182f0 + r
end

@fastmath function test(f)
    s = 0.0
    for i in Int32(1):Int32(1_000_000_000)
        s += f(Float32(i))
    end
    s
end

julia> @btime test(fastlog1)
  2.366 s (0 allocations: 0 bytes)
1.9723269760895107e10
julia> @btime test(fastlog2)
  524.180 ms (0 allocations: 0 bytes)
1.9723269761215004e10

julia> versioninfo()                                                                                                   
Julia Version 1.2.0-DEV.221
Commit 8fa0645ef1 (2019-01-28 01:13 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E3-1220 v5 @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
Keno (Member) commented Feb 3, 2019

Since they both inline equally well, this decision will come down to the legality/cost model in LLVM. The easiest way to figure that out is to dump the LLVM IR of the one that doesn't vectorize and run it through opt with -debug-only=loop-vectorize -pass-remarks='vector'. That will probably indicate why it decided not to vectorize.
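
A sketch of that invocation (the input path is illustrative, and -debug-only only works with an assertions-enabled build of opt):

opt -O3 -debug-only=loop-vectorize -pass-remarks='vector' -S /tmp/fastlog1.ll > /dev/null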

Keno added the upstream and compiler:codegen labels on Feb 3, 2019
robsmith11 (Contributor, Author) commented

@Keno

Are you suggesting that the difference is a failure of LLVM to optimize the IR? To me (someone who admittedly doesn't understand much of this), it appears the difference comes from Julia generating explicitly vectorized IR for fastlog2 but not for fastlog1, i.e. before the code even gets to LLVM.

Keno (Member) commented Feb 4, 2019

Correct, Julia does not have a vectorizer. The output of code_llvm is already optimized.

robsmith11 (Contributor, Author) commented

Hmm, but isn't the following vectorized to run over 4 values at a time? (Taken from the output of @code_llvm test(fastlog2).)

vector.body: ; preds = %vector.body, %top
  %index = phi i32 [ 0, %top ], [ %index.next, %vector.body ]
  %vec.phi = phi <4 x double> [ zeroinitializer, %top ], [ %40, %vector.body ]
  %vec.phi7 = phi <4 x double> [ zeroinitializer, %top ], [ %41, %vector.body ]
  %vec.ind = phi <4 x i32> [ <i32 1, i32 2, i32 3, i32 4>, %top ], [ %vec.ind.next, %vector.body ]
  %step.add = add <4 x i32> %vec.ind, <i32 4, i32 4, i32 4, i32 4>
; @ /tmp/g.jl:32 within `test'
; ┌ @ float.jl:60 within `Type'
   %0 = sitofp <4 x i32> %vec.ind to <4 x float>
   %1 = sitofp <4 x i32> %step.add to <4 x float>
; └
; ┌ @ /tmp/g.jl:16 within `fastlog2'
; │┌ @ essentials.jl:381 within `reinterpret'
    %2 = bitcast <4 x float> %0 to <4 x i32>
    %3 = bitcast <4 x float> %1 to <4 x i32>
; │└
; │ @ /tmp/g.jl:17 within `fastlog2'
; │┌ @ fastmath.jl:257 within `sub_fast'
; ││┌ @ int.jl:52 within `-'
     %4 = add <4 x i32> %2, <i32 -1059760811, i32 -1059760811, i32 -1059760811, i32 -1059760811>
     %5 = add <4 x i32> %3, <i32 -1059760811, i32 -1059760811, i32 -1059760811, i32 -1059760811>
; │└└

Nothing like that appears in the output of @code_llvm test(fastlog1).

Keno (Member) commented Feb 4, 2019

Yes, as I said, @code_llvm prints the LLVM IR after LLVM's optimization passes.

robsmith11 (Contributor, Author) commented

Ah, sorry, I thought you meant "the output of code_llvm will then be sent to LLVM to be optimized". Thanks for clearing up my confusion.

Keno (Member) commented Feb 4, 2019

Ah, sorry, I see how that could be misread.

vchuravy (Member) commented Feb 4, 2019

To see the IR that Julia produces, you can use @code_llvm optimize=false. You can also use the environment variable JULIA_LLVM_ARGS to pass -pass-remarks-analysis=loop-vectorize or -pass-remarks=vector.
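
A sketch of both invocations (test.jl here stands for a hypothetical script containing the definitions above; JULIA_LLVM_ARGS is read at startup, so it must be set before Julia launches):

julia> @code_llvm optimize=false test(fastlog1)

$ JULIA_LLVM_ARGS="-pass-remarks-analysis=loop-vectorize -pass-remarks=vector" julia test.jl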

robsmith11 (Contributor, Author) commented Feb 4, 2019

Running with JULIA_LLVM_ARGS="-pass-remarks-analysis=loop-vectorize -pass-remarks=vector":
Slow version:

remark: fastmath.jl:161:0: SLP vectorized with cost -2 and with tree size 5
remark: fastmath.jl:163:0: loop not vectorized: instruction return type cannot be vectorized

Fast version:

remark: /tmp/test.jl:31:0: vectorized loop (vectorization width: 4, interleaved count: 2)

I wanted to try passing the IR to a more recent version of LLVM's opt, but I couldn't figure out a way to get Julia to output a complete module; the IR from @code_llvm refers to undefined types.

vchuravy (Member) commented Feb 4, 2019
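
A minimal sketch of one way to write out a complete, self-contained module that an external opt can consume (the optimize and dump_module keywords of code_llvm are assumed here):

using InteractiveUtils

# dump_module=true emits the whole module (types and declarations included)
# rather than just the function body; optimize=false gives the IR before
# LLVM's optimization passes run.
open("/tmp/fastlog1.ir", "w") do io
    code_llvm(io, test, Tuple{typeof(fastlog1)}; optimize=false, dump_module=true)
end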

robsmith11 (Contributor, Author) commented

Thanks to @vchuravy's help, I think I've tracked down the issue. Using the unoptimized IR, I was able to reproduce the vectorization failure with LLVM 6:

opt -slp-vectorizer -mcpu=native -S /tmp/fastlog1.ir | opt -O3 -mcpu=native -S

If I remove the slp-vectorizer pass, it vectorizes nicely. The good news is that this appears to be fixed in LLVM 7 (slp-vectorizer no longer breaks the subsequent loop vectorization), so it should work fine by default in Julia once LLVM is upgraded.
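
For completeness, the same pipeline against LLVM 7 (the opt-7 binary name is an assumption and varies by installation):

opt-7 -slp-vectorizer -mcpu=native -S /tmp/fastlog1.ir | opt-7 -O3 -mcpu=native -S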
