Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extrema is slower than maximum + minimum #31442

Open
mbauman opened this issue Mar 22, 2019 · 31 comments
Open

Extrema is slower than maximum + minimum #31442

mbauman opened this issue Mar 22, 2019 · 31 comments
Labels
compiler:simd instruction-level vectorization fold sum, maximum, reduce, foldl, etc. good first issue Indicates a good issue for first-time contributors to Julia help wanted Indicates that a maintainer wants help on an issue or pull request performance Must go faster

Comments

@mbauman
Copy link
Member

mbauman commented Mar 22, 2019

On AVX-512 systems, calling extrema(x) is slower than calling maximum and then minimum. Looks like we need to incorporate the same techniques from #30320 over to extrema in order to get it to vectorize in a similar manner.

@mbauman mbauman added performance Must go faster help wanted Indicates that a maintainer wants help on an issue or pull request arrays [a, r, r, a, y, s] good first issue Indicates a good issue for first-time contributors to Julia labels Mar 22, 2019
@StefanKarpinski StefanKarpinski added the compiler:simd instruction-level vectorization label Mar 22, 2019
@ghost
Copy link

ghost commented Mar 23, 2019

@mbauman Can you provide the test case you used? I ran the following code:

arrs = []

for i in 1:100000
    insert!(arrs, 1, rand(1000))
end

println("running extrema")
@time map(arrs) do arr
    extrema(arr)
end

println("running minimum, maximum")
@time map(arrs) do arr
    (minimum(arr), maximum(arr))
end

and got this output:

running extrema
  0.520760 seconds (301.98 k allocations: 14.811 MiB)
running minimum, maximum
  0.741328 seconds (279.30 k allocations: 13.814 MiB)

julia version:

julia version 1.1.0

@mbauman
Copy link
Member Author

mbauman commented Mar 25, 2019

Yes, that test case reproduces the behavior on master for me. It won't on 1.1.0 as the PR that improved maximum and minimum was after 1.1.0. Of course, note that you'll also need to have a relatively recent computer to see the difference.

@nlw0
Copy link
Contributor

nlw0 commented Apr 14, 2019

I got a slower extrema in my Core i5 machine.

running extrema
  314.316 ms (100003 allocations: 4.58 MiB)
running minimum, maximum
  224.977 ms (100003 allocations: 4.58 MiB)

@nlw0
Copy link
Contributor

nlw0 commented Apr 14, 2019

I would like to try working on this, and it would be great to hear what kind of patch is expected. I was thinking maybe mapreduce_impl could be turned into a generated function that can handle multiple functions simultaneously. Does that sound good?

One question I have is about the unrolling. Ideally this should be adapted for each architecture and data type. How much of this should be up to LLVM or to Julia? Should we prioritize generating code that the LLVM can vectorize? The alternative I imagine would be controlling parameters like this unrolling depending on the architecture.

@nlw0
Copy link
Contributor

nlw0 commented Apr 15, 2019

Thinking a bit more about this, maybe the best of worlds would be if this function could simply be implemented with a general fold-left, and it should result in a good vectorized machine code relying on the LLVM automatic vectorization. Should we concentrate our efforts on making sure that a generic fold works better in more cases, or should we first make sure that a few specific functions have high-performance implementations?

@mbauman
Copy link
Member Author

mbauman commented Apr 15, 2019

I was thinking of the much simpler option of copy-pasting the techniques used in maximum/minimum (here:

julia/base/reduce.jl

Lines 481 to 518 in 2b49d26

a1 = @inbounds A[first]
v1 = mapreduce_first(f, op, a1)
v2 = v3 = v4 = v1
chunk_len = 256
start = first + 1
simdstop = start + chunk_len - 4
while simdstop <= last - 3
# short circuit in case of NaN
v1 == v1 || return v1
v2 == v2 || return v2
v3 == v3 || return v3
v4 == v4 || return v4
@inbounds for i in start:4:simdstop
v1 = _fast(op, v1, f(A[i+0]))
v2 = _fast(op, v2, f(A[i+1]))
v3 = _fast(op, v3, f(A[i+2]))
v4 = _fast(op, v4, f(A[i+3]))
end
checkbounds(A, simdstop+3)
start += chunk_len
simdstop += chunk_len
end
v = op(op(v1,v2),op(v3,v4))
for i in start:last
@inbounds ai = A[i]
v = op(v, f(ai))
end
# enforce correct order of 0.0 and -0.0
# e.g. maximum([0.0, -0.0]) === 0.0
# should hold
if isbadzero(op, v)
for i in first:last
x = @inbounds A[i]
isgoodzero(op,x) && return x
end
end
return v
) into the body of the extrema implementation (here:

julia/base/operators.jl

Lines 471 to 483 in 2b49d26

y = iterate(itr)
y === nothing && throw(ArgumentError("collection must be non-empty"))
(v, s) = y
vmin = vmax = f(v)
while true
y = iterate(itr, s)
y === nothing && break
(x, s) = y
fx = f(x)
vmax = max(fx, vmax)
vmin = min(fx, vmin)
end
return (vmin, vmax)
).

Revamping mapreduce to support multiple functions is a much bigger task (and I'm not sure we want to add that complication to the already complex mapreduce machinery, either), and while improving LLVM's automatic vectorization is always a worthy goal it's particularly challenging in cases like these.

@nlw0
Copy link
Contributor

nlw0 commented Apr 15, 2019

I am still trying to understand why the specialization of mapreduce_impl was needed in the first place, and what are the symmetries between minimum and extrema. I am reading the history of the file and I cannot make complete sense of it.

The generic mapreduce_impl contains a @simd for line, and produces vectorized code to op=max from what I can tell. The only reason I see for the specialized function is the short-circuiting. Are we interested in having this on extrema too? Also, these specialized functions are not using @simd for, but a while instead. Is this on purpose?

This idea I had at first of supporting multiple functions doesn't really make sense, sorry. I am thinking now on something much simpler, which would be to implement this function as a fold like:
myextrema(arr) = Base.mapfoldl(identity, ((mi,ma),v)->(min(mi,v),max(ma,v)), arr[2:end], init=(arr[1],arr[1]))

Which of course we can specialize if it makes sense. What I see now is that this doesn't seem to vectorize, and I wonder if just refactoring mapfoldl_impl would do it, and that is what I see as something that might be nice to investigate, since it might also improve other functions based on it.

It doesn't seem to matter for this specific function, though, because from what I can tell the current extrema is already vectorized. So I suppose in the end this is just about using _min_fast and _max_fast?

@nlw0
Copy link
Contributor

nlw0 commented Apr 15, 2019

I quickly tried to use the _min_fast functions and saw no change. And the machine code is already showing the vcmpordsd instruction. Maybe this will really require something like that unrolling?

@mbauman
Copy link
Member Author

mbauman commented Apr 15, 2019

Note that just seeing vector instructions in @code_native set doesn't actually mean the processor is using the parallelism of the SIMD unit... it can also just be using SIMD instructions with single values. For example, take a look at @code_native max(1.0, 2.0) — that'll show a whole slew of "SIMD" instructions (including that vcmpordsd) operating on that single data. It's much easier to look at @code_llvm to see if things are vectorizing.

For example, you can see that extrema actually does vectorize nicely for integers:

julia> @code_llvm extrema(rand(Int, 1000))
# ... contains things like:
vector.ph:                                        ; preds = %L40.lr.ph.L40.lr.ph.split_crit_edge
      %n.vec = and i64 %27, -32
      %ind.end = or i64 %n.vec, 1
      %ind.end26 = or i64 %n.vec, 2
      %minmax.ident.splatinsert = insertelement <8 x i64> undef, i64 %24, i32 0
      %minmax.ident.splat = shufflevector <8 x i64> %minmax.ident.splatinsert, <8 x i64> undef, <8 x i32> zeroinitializer
      %minmax.ident.splatinsert59 = insertelement <8 x i64> undef, i64 %24, i32 0
      %minmax.ident.splat60 = shufflevector <8 x i64> %minmax.ident.splatinsert59, <8 x i64> undef, <8 x i32> zeroinitializer
      br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
      %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
      %vec.phi = phi <8 x i64> [ %minmax.ident.splat, %vector.ph ], [ %48, %vector.body ]
# ...

Key things to see are the vector.body label, all the <8 x i64> vectors. If you look at @code_llvm extrema(rand(1000)) with floating point values, though, you'll see no vector.body, no <N x double> vectors, and lots of plain old double comparisons.

@chriselrod
Copy link
Contributor

chriselrod commented Apr 15, 2019

vcmpordsd isn't a SIMD instruction. sd means "single double". If it was named pd for "packed double", it would be SIMD.

@nlw0
Copy link
Contributor

nlw0 commented Apr 16, 2019

Alright, thanks! I need to get more used with reading the code_llvm output. What is a good reference for Intel instructions? And what does the 'v' mean in this case?

Anyways I guess it all means we really need to change this loop somehow to enable the vectorization. Will try something later.

@mbauman
Copy link
Member Author

mbauman commented Apr 16, 2019

I'm not an assembly expert but here's my understanding: The v prefix is for a "VEX" (vector extensions) encoded assembly instruction. It's working on the xmm register; that's a register in the SIMD hardware. SIMD units have scalar arithmetic because shuffling data back to the other registers can be costly. I imagine the way it's implemented is that it actually does the same packed operation and just discards the other bits.

I find LLVM IR imminently more understandable and readable — and the giant language reference page is a good resource.

@nlw0
Copy link
Contributor

nlw0 commented Apr 16, 2019

OK, it took me long enough understand the whole thing and go trough my grief process. It seems the LLVM will really refuse to vectorize it unless we do this unrolling. I have prepared a first draft that shows a speedup. It is still missing many details like the chunky traversal.

I hope I can prepare a PR by tomorrow, but it would be great to hear any comments. In especial, would this use of @ntuple be fine?

julia> using BenchmarkTools
julia> using Test
julia> N=2^16;
julia> mylist = rand(N);
julia> minornan(x,y) = ifelse(isnan(x) || x < y, x, y)
minornan (generic function with 1 method)

julia> maxornan(x,y) = ifelse(isnan(x) || x > y, x, y)
maxornan (generic function with 1 method)

julia> function myext(itr)
           V=8
           @inbounds begin
               vmin = Base.Cartesian.@ntuple 8 j->itr[j]
               vmax = Base.Cartesian.@ntuple 8 j->itr[j]
               @simd for i in V:V:length(itr)-V
                   vmin = Base.Cartesian.@ntuple 8 j->minornan(vmin[j], itr[i+j])
                   vmax = Base.Cartesian.@ntuple 8 j->maxornan(vmax[j], itr[i+j])
               end
           end
           (minimum(vmin), maximum(vmax))
       end
myext (generic function with 1 method)

julia> minmaxextrema(aa) = (minimum(aa), maximum(aa))
minmaxextrema (generic function with 1 method)

julia> @btime extrema(xx) setup = (xx=rand(N));
  193.335 μs (1 allocation: 32 bytes)

julia> @btime minmaxextrema(xx) setup = (xx=rand(N));
  77.339 μs (1 allocation: 32 bytes)

julia> @btime myext(xx) setup = (xx=rand(N));
  27.389 μs (1 allocation: 32 bytes)

julia> @test all(extrema(xx) == myext(xx)  for xx in [rand(N) for _ in 1:100])
Test Passed

@nlw0
Copy link
Contributor

nlw0 commented Apr 17, 2019

I was checking how things are working on the LLVM side, and I see the relevant patch was merged months ago and made it to 8.0. Wouldn't this be a good candidate for the Julia LLVM patches, then? It wasn't the case before because the LLVM patch was still open. I'm proposing to fix the issue by a Julia LLVM patch plus updated extrema and etc implementations that can rely on the LLVM vectorization.

@nlw0
Copy link
Contributor

nlw0 commented Apr 17, 2019

I tried including the tiny D53680 LLVM patch into Julia, not sure I did anything wrong but nothing came out of it.

I would really like to understand better what could be happening, because I am not so sure there is just some small thing lacking on LLVM 6.0 that may soon just go away. Testing with clang 6.0.0 I can produce vectorized code as long as fast-mode is used, nicely adapted to e.g. AVX-512. And it all works fine for integers. On Julia it works fine with integers and floats with +, but I can't get vectorization of min even with fast-mode, or even doing just a ifelse(x < y, x, y) without caring for NaNs.

I was trying to use some LLVM intrinsics to see if it had any effect, like maxnum(x, y) = ccall("llvm.maxnum.f64", llvmcall, Float64, (Float64, Float64), x, y);, but nothing. I am thinking maybe I could try other more complex LLVM IR code until automatic vectorization happens.

@mbauman
Copy link
Member Author

mbauman commented Apr 17, 2019

Wow, nice investigations! When I tagged this as looking for help, I was thinking it'd be a simple copy-pasting of the technique used in #30320 — but you've definitely taken it to the next level. This is great.

On the Julia side, yes, I think it'd be just fine to use the Cartesian macros since they're also used elsewhere in the multidimensional.jl file. We had been slowly replacing them as they're a bit hard to read, but they're here for now. You could also just write out (or get the macro to write out) the expression literally:

julia> @macroexpand Base.Cartesian.@ntuple 8 j->minornan(vmin[j], itr[i+j])
:((minornan(vmin[1], itr[i + 1]), minornan(vmin[2], itr[i + 2]), minornan(vmin[3], itr[i + 3]), minornan(vmin[4], itr[i + 4]), minornan(vmin[5], itr[i + 5]), minornan(vmin[6], itr[i + 6]), minornan(vmin[7], itr[i + 7]), minornan(vmin[8], itr[i + 8])))

A few other notes: you'll need to grab the final elements from the iterator (if it's not an even multiple of V), and this implementation will only work well for 1-based IndexLinear AbstractArrays. You can generalize it to offset arrays with firstindex and lastindex, and then you can restrict it via dispatch to only apply to ::AbstractArray.

I'm not as familiar with the LLVM side of things, but it may indeed be worth investigating what version 8 might bring. I don't know how far we are from version 8 compatibility — 7 was just added a few weeks ago. That patch might not stand alone as a patch to version 6.

@nlw0
Copy link
Contributor

nlw0 commented Apr 17, 2019

This code vectorizes on clang, using just comparisons and in fast mode: https://godbolt.org/z/tF8SSH

float mymin(float x, float y) {
    return (x<y)?x:y;
}
float maximum(float*itr) {
    float v1=0.0;
    for(int i=0; i<6553600; i+=1) {
        v1=mymin(v1,itr[i]);
    }
    return v1;
}

This (addition) vectorizes in Julia even without fast mode (fast mode is still being set in the ir from what i understand):
myadd(x,y)= Base.llvmcall(""" %3 = fadd fast double %1, %0 ret double %3 """, Float64, Tuple{Float64,Float64}, x, y)

This does not vectorize in Julia even with fast mode:
mymin(x,y)= Base.llvmcall(""" %3 = fcmp fast olt double %0, %1 %4 = select i1 %3, double %0, double %1 ret double %4 """, Float64, Tuple{Float64,Float64}, x, y)

@nlw0
Copy link
Contributor

nlw0 commented Apr 18, 2019

I did the following test today: first I produced logs from the LLVM passes for 4 functions, calculating the sum and min from a vector of integers or floats (with fast-math). It is possible to see that the functions are very similar up to the point the loop vectorizer runs, when it doesn't work for the case of the min on floats, as expected. I've put the logs here.

I can't make much out of it, but maybe someone has any idea about the point we see the first difference, at line 174 where the float version reads

  br label %L88, !dbg !116

and the int

  %min.iters.check = icmp ult i64 %42, 16, !dbg !115
  br i1 %min.iters.check, label %scalar.ph, label %vector.ph, !dbg !115

After that I also generated dumps from clang, where the vectorization happens. The dumped IR can be run with llvmcall with slight modifications, and then we can find out vectorization still doesn't happen for floats in Julia. I guess that means the Julia IR is fine and Clang isn't doing anything special there, so it may have something to do with the way LLVM is being configured by Julia?

using BenchmarkTools
using Test

mymin(itr::Vector{Float32})=
Base.llvmcall("""
  %ptr = inttoptr i64 %0 to float*

  %2 = load float, float* %ptr, align 4
  br label %5

; <label>:3:                                      ; preds = %5
  %4 = phi float [ %11, %5 ]
  ret float %4

; <label>:5:                                      ; preds = %5, %1
  %6 = phi i64 [ 1, %1 ], [ %12, %5 ]
  %7 = phi float [ %2, %1 ], [ %11, %5 ]
  %8 = getelementptr inbounds float, float* %ptr, i64 %6
  %9 = load float, float* %8, align 4
  %10 = fcmp fast olt float %7, %9
  %11 = select i1 %10, float %7, float %9
  %12 = add nuw nsw i64 %6, 1
  %13 = icmp eq i64 %12, 65536
  br i1 %13, label %3, label %5
""", Float32, Tuple{Ptr{Float32}}, pointer(itr))

mymin(itr::Vector{Int32})=
Base.llvmcall("""
  %ptr = inttoptr i64 %0 to i32*

  %2 = load i32, i32* %ptr, align 4
  br label %5

; <label>:3:                                      ; preds = %5
  %4 = phi i32 [ %11, %5 ]
  ret i32 %4

; <label>:5:                                      ; preds = %5, %1
  %6 = phi i64 [ 1, %1 ], [ %12, %5 ]
  %7 = phi i32 [ %2, %1 ], [ %11, %5 ]
  %8 = getelementptr inbounds i32, i32* %ptr, i64 %6
  %9 = load i32, i32* %8, align 4
  %10 = icmp slt i32 %7, %9
  %11 = select i1 %10, i32 %7, i32 %9
  %12 = add nuw nsw i64 %6, 1
  %13 = icmp eq i64 %12, 65536
  br i1 %13, label %3, label %5
""", Int32, Tuple{Ptr{Int32}}, pointer(itr))

data = rand(Float32, 2^16)
# data = rand(Int32, 2^16)
@test mymin(data) == minimum(data)
@code_native mymin(data)

EDIT: running this code with JULIA_LLVM_ARGS="-pass-remarks-analysis=loop-vectorize -pass-remarks=vector" julia -O3 --math-mode=fast testext.jl (following comments on #30933) I get for float
remark: /home/user/simpleext.jl:4:0: loop not vectorized: value that could not be identified as reduction is used outside the loop
and for int
remark: /home/user/simpleext.jl:27:0: vectorized loop (vectorization width: 8, interleaved count: 4)

@nlw0
Copy link
Contributor

nlw0 commented Apr 18, 2019

I have finally seen trouble while working strictly outside Julia: I produced the LLVM IR from the int and float versions, and I was not able to produce vectorized code with opt -fp-contract=fast -O3 -mem2reg -S min_f32_clang.ll, even though it worked with int.

https://gist.github.com/nlw0/eb50a88845eba90b6b859154d2f2383d

@nlw0
Copy link
Contributor

nlw0 commented Apr 21, 2019

I finally started looking at the LLVM analysis log for the loop vectorizer via JULIA_LLVM_ARGS="--debug-only=loop-vectorize" ./usr/bin/julia-debug --math-mode=fast using this test code:

function myfun(itr)
    @inbounds begin
        v = itr[1]
        @simd for y in itr[2:end]
            @fastmath v = ifelse(v < y,v,y)
        end
        v
    end
end

# a = collect(0x00000001:0x00010000)
a = collect(1f0:1f4)
myfun(a)

Straight away what you can see is that vcat(1f0:1f4) also does not get vectorized, while it does in the integer version and with a similar function in clang --- which also appears to do a lot more unrolling too. I take it as a clue there really may be an underlying issue worth of attention. (EDIT: actually not tested yet on clang using the same llvm) (EDIT2: actually it's different for floats because apparently the loop size is not pre-computed in this case, and I don't know what would be the proper way to do it)

@nlw0
Copy link
Contributor

nlw0 commented Apr 22, 2019

I've run Julia under gdb and made a trace of the moment the loop vectorizer diverges between the integer and float versions of the minimum loop. The loop starts with two phi nodes, and vectorization apparently fails when at the second node the test if (RecurrenceDescriptor::isReductionPHI(... fails for floats, while it works for ints. Not sure what any of that means, but maybe someone has an idea. The relevant code is the function canVectorizeInstrs in https://github.com/llvm/llvm-project/blob/llvmorg-6.0.1/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

https://gist.github.com/nlw0/58ed9fda8e8944a9cb5e5a20f6038fcf

@nlw0
Copy link
Contributor

nlw0 commented Apr 22, 2019

I'm writing a lot of stuff here, only some of which makes actual sense, so apologies for the noise. I believe I am finally getting close to the root cause of this issue, though.

The LLVM loop vectorizer is expecting the function to carry a "no-nans-fp-math" attribute. The test is done here: https://github.com/llvm/llvm-project/blob/d359f2096850c68b708bc25a7baca4282945949f/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp#L5324

As far as I can tell Julia never sets this, and fast-math mode only affects instructions down at the leaves of the expression trees. This is why there is no optimization even though fast mode is active. I believe this is where this function attribute is set by clang https://github.com/llvm-mirror/clang/blob/31c307cf96ef1b6a4d0c542d18ebfd7a307ae9b0/lib/CodeGen/CGCall.cpp#L1747-L1748

I tried hacking this attribute it, but there seems to be something else still missing. Hope I can make some more progress soon.

@nlw0
Copy link
Contributor

nlw0 commented Apr 24, 2019

OK, there was just something missing still in order to see it in @code_llvm, but it is actually working! I have now managed to take this:

function mymin(itr)
    @inbounds begin
        v = itr[1]
        @simd for y in itr[2:65536]
            @fastmath v = ifelse(v < y,v,y)
        end
        v
    end
end
```assembly
And have `@code_native` show this:
    vminps  %ymm1, %ymm0, %ymm0
    vminps  %ymm2, %ymm0, %ymm0
    vminps  %ymm3, %ymm0, %ymm0
This is the nasty patch, it basically hacks the `Function` objects in `jitlayers.cpp` to set the `"no-nans-fp-math"` in all of them. Although `@code_llvm` doesn't work, and I assume it just generates the LLVM objects trough a different route, we can actually see the vectorized IR in the LLVM log:
```llvm
*** IR Dump After Loop Vectorization ***
; Function Attrs: sspstrong
define float @julia_myfun_15967(%jl_value_t addrspace(10)* nonnull align 16 dereferenceable(40)) #0 !dbg !5 {
top:
  %1 = call %jl_value_t*** @julia.ptls_states()
  call void @llvm.dbg.value(metadata %jl_value_t addrspace(10)* null, metadata !17, metadata !DIExpression(DW_OP_deref)), !dbg !18
  call void @llvm.dbg.value(metadata %jl_value_t addrspace(10)* %0, metadata !17, metadata !DIExpression(DW_OP_deref)), !dbg !18
  %2 = addrspacecast %jl_value_t addrspace(10)* %0 to %jl_value_t addrspace(11)*, !dbg !19
  %3 = bitcast %jl_value_t addrspace(11)* %2 to %jl_array_t addrspace(11)*, !dbg !19
  %4 = getelementptr inbounds %jl_array_t, %jl_array_t addrspace(11)* %3, i64 0, i32 1, !dbg !19
  %5 = load i64, i64 addrspace(11)* %4, align 8, !dbg !19, !tbaa !29
  %6 = icmp slt i64 %5, 1, !dbg !33
  br i1 %6, label %L19, label %L10.lr.ph, !dbg !36

L10.lr.ph:                                        ; preds = %top
  %7 = bitcast %jl_value_t addrspace(11)* %2 to float addrspace(13)* addrspace(11)*
  %8 = load float addrspace(13)*, float addrspace(13)* addrspace(11)* %7, align 8, !tbaa !37, !nonnull !4
  %min.iters.check = icmp ult i64 %5, 32, !dbg !39
  br i1 %min.iters.check, label %scalar.ph, label %vector.ph, !dbg !39

vector.ph:                                        ; preds = %L10.lr.ph
  %n.mod.vf = urem i64 %5, 32, !dbg !39
  %n.vec = sub i64 %5, %n.mod.vf, !dbg !39
  br label %vector.body, !dbg !39

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ], !dbg !40
  %vec.phi = phi <8 x float> [ <float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000>, %vector.ph ], [ %30, %vector.body ]
  %vec.phi10 = phi <8 x float> [ <float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000>, %vector.ph ], [ %31, %vector.body ]
  %vec.phi11 = phi <8 x float> [ <float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000>, %vector.ph ], [ %32, %vector.body ]
  %vec.phi12 = phi <8 x float> [ <float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000, float 0x7FF0000000000000>, %vector.ph ], [ %33, %vector.body ]
  %broadcast.splatinsert = insertelement <8 x i64> undef, i64 %index, i32 0, !dbg !40
  %broadcast.splat = shufflevector <8 x i64> %broadcast.splatinsert, <8 x i64> undef, <8 x i32> zeroinitializer, !dbg !40
  %induction = add <8 x i64> %broadcast.splat, <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>, !dbg !40
  %induction7 = add <8 x i64> %broadcast.splat, <i64 8, i64 9, i64 10, i64 11, i64 12, i64 13, i64 14, i64 15>, !dbg !40
  %induction8 = add <8 x i64> %broadcast.splat, <i64 16, i64 17, i64 18, i64 19, i64 20, i64 21, i64 22, i64 23>, !dbg !40
  %induction9 = add <8 x i64> %broadcast.splat, <i64 24, i64 25, i64 26, i64 27, i64 28, i64 29, i64 30, i64 31>, !dbg !40
  %9 = add i64 %index, 0, !dbg !40
  %10 = add i64 %index, 8, !dbg !40
  %11 = add i64 %index, 16, !dbg !40
  %12 = add i64 %index, 24, !dbg !40
  %13 = getelementptr inbounds float, float addrspace(13)* %8, i64 %9, !dbg !43
  %14 = getelementptr inbounds float, float addrspace(13)* %8, i64 %10, !dbg !43
  %15 = getelementptr inbounds float, float addrspace(13)* %8, i64 %11, !dbg !43
  %16 = getelementptr inbounds float, float addrspace(13)* %8, i64 %12, !dbg !43
  %17 = getelementptr float, float addrspace(13)* %13, i32 0, !dbg !43
  %18 = bitcast float addrspace(13)* %17 to <8 x float> addrspace(13)*, !dbg !43
  %wide.load = load <8 x float>, <8 x float> addrspace(13)* %18, align 4, !dbg !43, !tbaa !48
  %19 = getelementptr float, float addrspace(13)* %13, i32 8, !dbg !43
  %20 = bitcast float addrspace(13)* %19 to <8 x float> addrspace(13)*, !dbg !43
  %wide.load13 = load <8 x float>, <8 x float> addrspace(13)* %20, align 4, !dbg !43, !tbaa !48
  %21 = getelementptr float, float addrspace(13)* %13, i32 16, !dbg !43
  %22 = bitcast float addrspace(13)* %21 to <8 x float> addrspace(13)*, !dbg !43
  %wide.load14 = load <8 x float>, <8 x float> addrspace(13)* %22, align 4, !dbg !43, !tbaa !48
  %23 = getelementptr float, float addrspace(13)* %13, i32 24, !dbg !43
  %24 = bitcast float addrspace(13)* %23 to <8 x float> addrspace(13)*, !dbg !43
  %wide.load15 = load <8 x float>, <8 x float> addrspace(13)* %24, align 4, !dbg !43, !tbaa !48
  %25 = fcmp fast olt <8 x float> %vec.phi, %wide.load, !dbg !51
  %26 = fcmp fast olt <8 x float> %vec.phi10, %wide.load13, !dbg !51
  %27 = fcmp fast olt <8 x float> %vec.phi11, %wide.load14, !dbg !51
  %28 = fcmp fast olt <8 x float> %vec.phi12, %wide.load15, !dbg !51
  %29 = extractelement <8 x i1> %25, i32 0, !dbg !51
  %30 = select <8 x i1> %25, <8 x float> %vec.phi, <8 x float> %wide.load, !dbg !51
  %31 = select <8 x i1> %26, <8 x float> %vec.phi10, <8 x float> %wide.load13, !dbg !51
  %32 = select <8 x i1> %27, <8 x float> %vec.phi11, <8 x float> %wide.load14, !dbg !51
  %33 = select <8 x i1> %28, <8 x float> %vec.phi12, <8 x float> %wide.load15, !dbg !51
  %index.next = add i64 %index, 32, !dbg !40
  %34 = icmp eq i64 %index.next, %n.vec, !dbg !40
  br i1 %34, label %middle.block, label %vector.body, !dbg !40, !llvm.loop !54

With that I suppose we can start thinking of a proper solution, and since this involves code generation and maybe decisions about what @fastmath means I imagine some core developers with experience in this part of Julia would probably like to make a call here?

@nlw0
Copy link
Contributor

nlw0 commented Apr 24, 2019

I did some tests in my hacked Julia compiler, benchmarking maximum and extrema using @btime mymaximum(d) setup = (d = randn(Float32, 2^20)) and I got these results:

func time
maximum stdlib 626.831 μs
maximum simple loop 79.141 μs
extrema stdlib 6.945 ms
extrema simple loop 160.655 μs

@jakubwro
Copy link

Hi, during the development of my package I encountered a problem with extrema

signal1 = rand(10000000)
signal2 = rand(10000000)

Now I want to find minimum and maximum in one run

julia> @time Iterators.flatten((signal1, signal2)) |> extrema
  0.243334 seconds (20.00 M allocations: 610.352 MiB, 18.44% gc time)
(3.8733525054013285e-8, 0.9999999173562155)

I expect this not to allocate at all but it allocates much more (610 MiB) that the size of both arrays summed (152 MiB)

julia> @time [signal1;signal2;] |> extrema
  0.201026 seconds (7 allocations: 152.588 MiB, 24.41% gc time)
(3.8733525054013285e-8, 0.9999999173562155)

First I thought it is a problem with flatten iterator, but it looks ok for me and causes problems only with extrema function

julia> @time Iterators.flatten((signal1, signal2)) |> sum
  0.022848 seconds (8 allocations: 256 bytes)
1.0000864996770173e7

julia> @time Iterators.flatten((signal1, signal2)) |> minimum
  0.050592 seconds (8 allocations: 256 bytes)
3.8733525054013285e-8

When I defined my own extrema implementation problem does not occur.

function my_extrema(itr)
    vmin = vmax = first(itr)
    for item in itr
        vmax = max(item, vmax)
        vmin = min(item, vmin)
    end
    return (vmin, vmax)
end
julia> @time Iterators.flatten((signal1, signal2)) |> my_extrema
  0.074841 seconds (7 allocations: 240 bytes)
(3.8733525054013285e-8, 0.9999999173562155)

I'm not sure if this is related to this issue, but clearly something is wrong with this function, I cannot reason what and it happens only with flatten iterator over a tuple of arrays.

@nalimilan
Copy link
Member

Can you file a new issue? The problem seems much more serious than this issue if you can get a better performance just with a simpler implementation.

@jakubwro
Copy link

jakubwro commented Jan 14, 2020

Can you file a new issue? The problem seems much more serious than this issue if you can get a better performance just with a simpler implementation.

@nalimilan, I think I am close to fix that, so please give me some time and will open pull request instead of issue, or at least an issue with more info.

@jakubwro
Copy link

Can you file a new issue? The problem seems much more serious than this issue if you can get a better performance just with a simpler implementation.

@nalimilan, I think I am close to fix that, so please give me some time and will open pull request instead of issue, or at least an issue with more info.

Much more complicated than I expected and probably needs to be fixed after code lowering: #34385

@wizebt
Copy link

wizebt commented Feb 18, 2020

Here you go with the simple and fast implementation I'm using routinely (without the NaN test as it is not applicable, remove extra isnan test )

function min_max(I)
	a = b = I[1]
	for i = 2:length(I)
		isnan(I[i]) && continue
		if I[i] > b 
			b = I[i]
		elseif I[i] < a 
			a = I[i]
		end
	end
	a, b
end

Now lest bench functions

julia> x=rand(Float32,1000000);

julia> @btime minimum($x)
  430.751 μs (0 allocations: 0 bytes)
1.1920929f-6

julia> @btime maximum($x)
  435.949 μs (0 allocations: 0 bytes)
0.9999995f0

julia> @btime min_max($x)
  955.375 μs (0 allocations: 0 bytes)
(1.1920929f-6, 0.9999995f0)

julia> x[200000:700000] .= NaN;

julia> @btime min_max($x)
  707.323 μs (0 allocations: 0 bytes)
(1.7881393f-6, 0.9999995f0)

Without isnan test

julia> @btime min_max($x)
  680.122 μs (0 allocations: 0 bytes)
(1.7881393f-6, 0.9999995f0)

@wizebt
Copy link

wizebt commented Feb 18, 2020

Hi, during the development of my package I encountered a problem with extrema

signal1 = rand(10000000)
signal2 = rand(10000000)

Now I want to find minimum and maximum in one run

julia> @time Iterators.flatten((signal1, signal2)) |> extrema
  0.243334 seconds (20.00 M allocations: 610.352 MiB, 18.44% gc time)
(3.8733525054013285e-8, 0.9999999173562155)

I expect this not to allocate at all but it allocates much more (610 MiB) that the size of both arrays summed (152 MiB)

julia> @time [signal1;signal2;] |> extrema
  0.201026 seconds (7 allocations: 152.588 MiB, 24.41% gc time)
(3.8733525054013285e-8, 0.9999999173562155)

First I thought it is a problem with flatten iterator, but it looks ok for me and causes problems only with extrema function

julia> @time Iterators.flatten((signal1, signal2)) |> sum
  0.022848 seconds (8 allocations: 256 bytes)
1.0000864996770173e7

julia> @time Iterators.flatten((signal1, signal2)) |> minimum
  0.050592 seconds (8 allocations: 256 bytes)
3.8733525054013285e-8

When I defined my own extrema implementation problem does not occur.

function my_extrema(itr)
    vmin = vmax = first(itr)
    for item in itr
        vmax = max(item, vmax)
        vmin = min(item, vmin)
    end
    return (vmin, vmax)
end
julia> @time Iterators.flatten((signal1, signal2)) |> my_extrema
  0.074841 seconds (7 allocations: 240 bytes)
(3.8733525054013285e-8, 0.9999999173562155)

I'm not sure if this is related to this issue, but clearly something is wrong with this function, I cannot reason what and it happens only with flatten iterator over a tuple of arrays.

Your extrema functions works however from an algorithm optim perspective why should you scan twice the array to get min and max ?

@jakubwro
Copy link

Your extrema functions works however from an algorithm optim perspective why should you scan twice the array to get min and max ?

The problem I pointed is not about this particular calculation but rather too much memory allocation that manifests when using extrema with flatten iterator and potentially other cases that follow similar iteration pattern.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler:simd instruction-level vectorization fold sum, maximum, reduce, foldl, etc. good first issue Indicates a good issue for first-time contributors to Julia help wanted Indicates that a maintainer wants help on an issue or pull request performance Must go faster
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants