Trivial code change causes vectorization failure #30933

Closed
robsmith11 opened this issue Feb 1, 2019 · 11 comments
Labels
compiler:codegen (Generation of LLVM IR and native code), upstream (The issue is with an upstream dependency, e.g. LLVM)

Comments

robsmith11 (Contributor) commented Feb 1, 2019

The first function below fails to vectorize, but if I make a trivial change (substitute f*f for s at its two uses), as shown in the second function, then Julia produces nicely vectorized LLVM IR.

I don't see any reason why they shouldn't both vectorize, so I think it might be a bug in the Julia optimizer.

@fastmath function fastlog1(x::Float32)::Float32
    xi = reinterpret(Int32, x)
    e = (xi - Int32(1059760811)) & Int32(-8388608)
    m = reinterpret(Float32, xi - e)
    i = e * 1.19209290f-7
    f = m - 1f0
    s = f * f
    r = 0.230836749f0 * f + -0.279208571f0
    t = 0.331826031f0 * f + -0.498910338f0
    r = r * s + t
    r = r * s + f
    i * 0.693147182f0 + r
end

@fastmath function fastlog2(x::Float32)::Float32
    xi = reinterpret(Int32, x)
    e = (xi - Int32(1059760811)) & Int32(-8388608)
    m = reinterpret(Float32, xi - e)
    i = e * 1.19209290f-7
    f = m - 1f0
    # s = f * f 
    r = 0.230836749f0 * f + -0.279208571f0
    t = 0.331826031f0 * f + -0.498910338f0
    r = r * (f*f) + t  # replaced s here
    r = r * (f*f) + f  # and here
    i * 0.693147182f0 + r
end

@fastmath function test(f)
    s = 0.0
    for i in Int32(1):Int32(1_000_000_000)
        s += f(Float32(i))
    end
    s
end

julia> @btime test(fastlog1)
  2.366 s (0 allocations: 0 bytes)
1.9723269760895107e10
julia> @btime test(fastlog2)
  524.180 ms (0 allocations: 0 bytes)
1.9723269761215004e10

julia> versioninfo()                                                                                                   
Julia Version 1.2.0-DEV.221
Commit 8fa0645ef1 (2019-01-28 01:13 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E3-1220 v5 @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
Keno (Member) commented Feb 3, 2019

Since they both inline equally well, this decision will come down to the legality/cost model in LLVM. The easiest way to figure that out is to dump the LLVM IR of the one that doesn't vectorize and run it through opt with -debug-only=loop-vectorize -pass-remarks='vector'. That will probably indicate why it decided not to vectorize.
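
A sketch of that invocation (the input path is illustrative, and -debug-only only works with an assertions-enabled build of opt):

opt -O3 -debug-only=loop-vectorize -pass-remarks='vector' -S /tmp/fastlog1.ll > /dev/null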

Keno added the upstream and compiler:codegen labels on Feb 3, 2019
robsmith11 (Contributor, Author) commented

@Keno

Are you suggesting that the difference is a failure of LLVM to optimize the IR? To me (someone who admittedly doesn't understand much of this), it appears the difference comes from Julia generating explicitly vectorized IR for fastlog2 but not for fastlog1, i.e. before the code even gets to LLVM.

Keno (Member) commented Feb 4, 2019

Correct, Julia does not have a vectorizer. The output of code_llvm is already optimized.

robsmith11 (Contributor, Author) commented

Hmm, but isn't the following vectorized to run over 4 values at a time? (Taken from the output of @code_llvm test(fastlog2).)

vector.body: ; preds = %vector.body, %top
  %index = phi i32 [ 0, %top ], [ %index.next, %vector.body ]
  %vec.phi = phi <4 x double> [ zeroinitializer, %top ], [ %40, %vector.body ]
  %vec.phi7 = phi <4 x double> [ zeroinitializer, %top ], [ %41, %vector.body ]
  %vec.ind = phi <4 x i32> [ <i32 1, i32 2, i32 3, i32 4>, %top ], [ %vec.ind.next, %vector.body ]
  %step.add = add <4 x i32> %vec.ind, <i32 4, i32 4, i32 4, i32 4>
; @ /tmp/g.jl:32 within `test'
; ┌ @ float.jl:60 within `Type'
   %0 = sitofp <4 x i32> %vec.ind to <4 x float>
   %1 = sitofp <4 x i32> %step.add to <4 x float>
; └
; ┌ @ /tmp/g.jl:16 within `fastlog2'
; │┌ @ essentials.jl:381 within `reinterpret'
    %2 = bitcast <4 x float> %0 to <4 x i32>
    %3 = bitcast <4 x float> %1 to <4 x i32>
; │└
; │ @ /tmp/g.jl:17 within `fastlog2'
; │┌ @ fastmath.jl:257 within `sub_fast'
; ││┌ @ int.jl:52 within `-'
     %4 = add <4 x i32> %2, <i32 -1059760811, i32 -1059760811, i32 -1059760811, i32 -1059760811>
     %5 = add <4 x i32> %3, <i32 -1059760811, i32 -1059760811, i32 -1059760811, i32 -1059760811>
; │└└

Nothing like that appears in the output of @code_llvm test(fastlog1).

Keno (Member) commented Feb 4, 2019

Yes, as I said, @code_llvm prints the LLVM IR after LLVM's optimization passes.

robsmith11 (Contributor, Author) commented

Ah, sorry, I thought you meant "the output of code_llvm will then be sent to LLVM to be optimized". Thanks for clearing up my confusion.

Keno (Member) commented Feb 4, 2019

Ah, sorry, I see how that could be misread.

vchuravy (Member) commented Feb 4, 2019

To see the IR that Julia produces, you can use @code_llvm optimize=false. You can also use the environment variable JULIA_LLVM_ARGS to pass -pass-remarks-analysis=loop-vectorize or -pass-remarks=vector.
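
A sketch of both invocations (test.jl here stands for a hypothetical script containing the definitions above; JULIA_LLVM_ARGS is read at startup, so it must be set before Julia launches):

julia> @code_llvm optimize=false test(fastlog1)

$ JULIA_LLVM_ARGS="-pass-remarks-analysis=loop-vectorize -pass-remarks=vector" julia test.jl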

robsmith11 (Contributor, Author) commented Feb 4, 2019

Running with JULIA_LLVM_ARGS="-pass-remarks-analysis=loop-vectorize -pass-remarks=vector":
Slow version:

remark: fastmath.jl:161:0: SLP vectorized with cost -2 and with tree size 5
remark: fastmath.jl:163:0: loop not vectorized: instruction return type cannot be vectorized

Fast version:

remark: /tmp/test.jl:31:0: vectorized loop (vectorization width: 4, interleaved count: 2)

I wanted to try passing the IR to a more recent version of LLVM's opt, but I couldn't figure out a way to get Julia to output a complete module; the IR from @code_llvm refers to undefined types.

vchuravy (Member) commented Feb 4, 2019
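
A minimal sketch of one way to write out a complete, self-contained module that an external opt can consume (the optimize and dump_module keywords of code_llvm are assumed here):

using InteractiveUtils

# dump_module=true emits the whole module (types and declarations included)
# rather than just the function body; optimize=false gives the IR before
# LLVM's optimization passes run.
open("/tmp/fastlog1.ir", "w") do io
    code_llvm(io, test, Tuple{typeof(fastlog1)}; optimize=false, dump_module=true)
end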

robsmith11 (Contributor, Author) commented

Thanks to @vchuravy's help, I think I've tracked down the issue. Using the unoptimized IR, I was able to reproduce the vectorization failure with LLVM 6:

opt -slp-vectorizer -mcpu=native -S /tmp/fastlog1.ir | opt -O3 -mcpu=native -S

If I remove the slp-vectorizer pass, it vectorizes nicely. The good news is that this appears to be fixed in LLVM 7 (slp-vectorizer no longer breaks the subsequent loop vectorization), so it should work fine by default in Julia once LLVM is upgraded.
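
For completeness, the same pipeline against LLVM 7 (the opt-7 binary name is an assumption and varies by installation):

opt-7 -slp-vectorizer -mcpu=native -S /tmp/fastlog1.ir | opt-7 -O3 -mcpu=native -S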
