Extrema is slower than maximum + minimum #31442
@mbauman Can you provide the test case you used? I ran the following code:
and got this output:
julia version:
Yes, that test case reproduces the behavior on master for me. It won't on 1.1.0, as the PR that improved maximum and minimum came after 1.1.0. Of course, note that you'll also need a relatively recent computer to see the difference.
I got a slower
I would like to try working on this, and it would be great to hear what kind of patch is expected. One question I have is about the unrolling: ideally this should be adapted for each architecture and data type. How much of this should be up to LLVM, and how much to Julia? Should we prioritize generating code that LLVM can vectorize? The alternative I imagine would be controlling parameters like this unrolling depending on the architecture.
Thinking a bit more about this, maybe the best of both worlds would be if this function could simply be implemented with a general fold-left, and that should result in good vectorized machine code, relying on LLVM's automatic vectorization. Should we concentrate our efforts on making sure that a generic fold works better in more cases, or should we first make sure that a few specific functions have high-performance implementations?
I was thinking of the much simpler option of copy-pasting the techniques used in maximum/minimum (lines 481 to 518 in 2b49d26) into the current extrema implementation (lines 471 to 483 in 2b49d26).
Revamping mapreduce to support multiple functions is a much bigger task (and I'm not sure we want to add that complication to the already complex mapreduce machinery, either), and while improving LLVM's automatic vectorization is always a worthy goal, it's particularly challenging in cases like these.
I am still trying to understand why the specialization makes such a difference compared to the generic implementation. The idea I had at first of supporting multiple functions doesn't really make sense, sorry. I am thinking now of something much simpler, which would be to implement this function as a fold, which of course we can specialize if it makes sense. What I see now is that this doesn't seem to vectorize, and I wonder if just refactoring it would help. It doesn't seem to matter for this specific function, though, because from what I can tell the current implementation doesn't vectorize either.
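For concreteness, a fold-based version of that idea might look like the sketch below; this is my own sketch, not Base's implementation or the code that was originally posted:

```julia
# `extrema` expressed as a single left fold: map each element to a (lo, hi)
# pair and combine pairs with elementwise min/max.
extrema_fold(itr) = mapfoldl(x -> (x, x),
                             ((lo1, hi1), (lo2, hi2)) -> (min(lo1, lo2), max(hi1, hi2)),
                             itr)

extrema_fold([3, 1, 4, 1, 5])  # (1, 5)
```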
I quickly tried to use the
Note that just seeing vector instructions in the `@code_native` output isn't enough to conclude the loop was vectorized. For example, you can see that the instructions can still be operating on a single element at a time. Key things to see are the packed operations working across the full vector registers.
Alright, thanks! I need to get more used to reading the code_llvm output. What is a good reference for Intel instructions? And what does the 'v' mean in this case? Anyway, I guess it all means we really need to change this loop somehow to enable the vectorization. Will try something later.
I'm not an assembly expert, but here's my understanding: the v prefix is for a "VEX" (vector extensions) encoded assembly instruction; it's working on the vector registers, but only on a single value at a time. I find LLVM IR eminently more understandable and readable, and the giant language reference page is a good resource.
OK, it took me long enough to understand the whole thing and go through my grief process. It seems LLVM will really refuse to vectorize it unless we do this unrolling. I have prepared a first draft that shows a speedup. It is still missing many details like the chunky traversal. I hope I can prepare a PR by tomorrow, but it would be great to hear any comments. In particular, would this use of the Cartesian macros be acceptable?
I was checking how things are working on the LLVM side, and I see the relevant patch was merged months ago and made it into 8.0. Wouldn't this be a good candidate for the Julia LLVM patches, then? It wasn't the case before because the LLVM patch was still open. I'm proposing to fix the issue with a Julia LLVM patch plus an updated implementation.
I tried including the tiny D53680 LLVM patch into Julia; I'm not sure whether I did something wrong, but nothing came out of it. I would really like to understand better what could be happening, because I am not so sure there is just some small thing lacking in LLVM 6.0 that may soon just go away. Testing with clang 6.0.0 I can produce vectorized code as long as fast mode is used, nicely adapted to e.g. AVX-512, and it all works fine for integers. In Julia it works fine with integers, but not with floats. I was trying to use some LLVM intrinsics to see if they had any effect.
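As a side note, one way to call an LLVM intrinsic directly from Julia is `ccall` with the `llvmcall` calling convention. The comment doesn't say which intrinsics were tried, so `llvm.minnum.f32` below is only my example:

```julia
# Call the llvm.minnum.f32 intrinsic directly (sketch for illustration only).
llvm_minnum(a::Float32, b::Float32) =
    ccall("llvm.minnum.f32", llvmcall, Float32, (Float32, Float32), a, b)

llvm_minnum(1.5f0, -2.0f0)  # -2.0f0
```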
Wow, nice investigations! When I tagged this as looking for help, I was thinking it'd be a simple copy-pasting of the technique used in #30320, but you've definitely taken it to the next level. This is great. On the Julia side, yes, I think it'd be just fine to use the Cartesian macros since they're also used elsewhere in the multidimensional.jl file. We had been slowly replacing them as they're a bit hard to read, but they're here for now. You could also just write out (or get the macro to write out) the expression literally:

```julia
julia> @macroexpand Base.Cartesian.@ntuple 8 j->minornan(vmin[j], itr[i+j])
:((minornan(vmin[1], itr[i + 1]), minornan(vmin[2], itr[i + 2]), minornan(vmin[3], itr[i + 3]), minornan(vmin[4], itr[i + 4]), minornan(vmin[5], itr[i + 5]), minornan(vmin[6], itr[i + 6]), minornan(vmin[7], itr[i + 7]), minornan(vmin[8], itr[i + 8])))
```

A few other notes: you'll need to grab the final elements from the iterator (if its length is not an even multiple of the unroll width). I'm not as familiar with the LLVM side of things, but it may indeed be worth investigating what version 8 might bring. I don't know how far we are from version 8 compatibility; 7 was just added a few weeks ago. That patch might not stand alone as a patch to version 6.
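To make that concrete, an unrolled reduction in the spirit of the discussion might look like the sketch below; `minornan`, the unroll width of 8, and the scalar tail are my assumptions for illustration, not the eventual patch:

```julia
using Base.Cartesian: @nexprs, @ntuple

# NaN-propagating minimum, matching the `minornan` name used above (assumed definition).
minornan(a, b) = ifelse(isnan(a) | (a < b), a, b)

function unrolled_minimum(itr::Vector{Float32})
    n = length(itr)
    n >= 8 || return foldl(minornan, itr)   # fall back for short inputs
    @inbounds begin
        # Eight independent accumulators so the reduction can be vectorized/pipelined.
        vmin = @ntuple 8 j -> itr[j]
        i = 8
        while i + 8 <= n
            vmin = @ntuple 8 j -> minornan(vmin[j], itr[i + j])
            i += 8
        end
        # Combine the eight accumulators into one value.
        v = vmin[1]
        @nexprs 7 j -> (v = minornan(v, vmin[j + 1]))
        # Scalar tail for the remaining elements.
        while i < n
            i += 1
            v = minornan(v, itr[i])
        end
        return v
    end
end
```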
This code vectorizes on clang, using just comparisons and in fast mode: https://godbolt.org/z/tF8SSH
This (addition) vectorizes in Julia even without fast mode (fast mode is still being set in the IR, from what I understand), while the equivalent min loop does not vectorize in Julia even with fast mode.
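For reference, the two kinds of loops being contrasted look roughly like the sketch below; this is my reconstruction, since the exact snippets from the comment were lost in formatting:

```julia
# A sum reduction: reported above to vectorize even without fast math.
function mysum(itr::Vector{Float32})
    s = 0.0f0
    @inbounds @simd for i in eachindex(itr)
        s += itr[i]
    end
    return s
end

# The analogous min reduction: at the time of this issue it did not
# vectorize, even with @fastmath.
function mymin_loop(itr::Vector{Float32})
    v = itr[1]
    @inbounds @simd for i in 2:length(itr)
        @fastmath v = ifelse(v < itr[i], v, itr[i])
    end
    return v
end
```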
I did the following test today: first I produced logs from the LLVM passes for 4 functions, calculating the sum and the min of a vector of integers or floats (with fast-math). It is possible to see that the functions are very similar up to the point where the loop vectorizer runs, when it doesn't work for the case of the min on floats, as expected. I've put the logs here. I can't make much out of it, but maybe someone has an idea about the point where we see the first difference, at line 174, where the float version reads `br label %L88, !dbg !116` and the int version reads:

```llvm
%min.iters.check = icmp ult i64 %42, 16, !dbg !115
br i1 %min.iters.check, label %scalar.ph, label %vector.ph, !dbg !115
```

After that I also generated dumps from clang, where the vectorization happens. The dumped IR can be run with `llvmcall` with slight modifications, and then we can see that vectorization still doesn't happen for floats in Julia. I guess that means the Julia IR is fine and Clang isn't doing anything special there, so it may have something to do with the way LLVM is being configured by Julia?

```julia
using BenchmarkTools
using Test
# Hand-written LLVM IR: a scalar loop taking the minimum of 2^16 Float32 values.
mymin(itr::Vector{Float32})=
Base.llvmcall("""
%ptr = inttoptr i64 %0 to float*
%2 = load float, float* %ptr, align 4
br label %5
; <label>:3: ; preds = %5
%4 = phi float [ %11, %5 ]
ret float %4
; <label>:5: ; preds = %5, %1
%6 = phi i64 [ 1, %1 ], [ %12, %5 ]
%7 = phi float [ %2, %1 ], [ %11, %5 ]
%8 = getelementptr inbounds float, float* %ptr, i64 %6
%9 = load float, float* %8, align 4
%10 = fcmp fast olt float %7, %9
%11 = select i1 %10, float %7, float %9
%12 = add nuw nsw i64 %6, 1
%13 = icmp eq i64 %12, 65536
br i1 %13, label %3, label %5
""", Float32, Tuple{Ptr{Float32}}, pointer(itr))
# The same loop for Int32 values.
mymin(itr::Vector{Int32})=
Base.llvmcall("""
%ptr = inttoptr i64 %0 to i32*
%2 = load i32, i32* %ptr, align 4
br label %5
; <label>:3: ; preds = %5
%4 = phi i32 [ %11, %5 ]
ret i32 %4
; <label>:5: ; preds = %5, %1
%6 = phi i64 [ 1, %1 ], [ %12, %5 ]
%7 = phi i32 [ %2, %1 ], [ %11, %5 ]
%8 = getelementptr inbounds i32, i32* %ptr, i64 %6
%9 = load i32, i32* %8, align 4
%10 = icmp slt i32 %7, %9
%11 = select i1 %10, i32 %7, i32 %9
%12 = add nuw nsw i64 %6, 1
%13 = icmp eq i64 %12, 65536
br i1 %13, label %3, label %5
""", Int32, Tuple{Ptr{Int32}}, pointer(itr))
data = rand(Float32, 2^16)
# data = rand(Int32, 2^16)
@test mymin(data) == minimum(data)
@code_native mymin(data)
```

EDIT: running this code with
I have finally seen trouble while working strictly outside Julia: I produced the LLVM IR from the int and float versions, and I was not able to produce vectorized code with https://gist.github.com/nlw0/eb50a88845eba90b6b859154d2f2383d |
I finally started looking at the LLVM analysis log for the loop vectorizer, for the following test function:

```julia
function myfun(itr)
    @inbounds begin
        v = itr[1]
        @simd for y in itr[2:end]
            @fastmath v = ifelse(v < y, v, y)
        end
        v
    end
end

# a = collect(0x00000001:0x00010000)
a = collect(1f0:1f4)
myfun(a)
```

Straight away what you can see is that
I've run Julia under gdb and made a trace of the moment the loop vectorizer diverges between the integer and float versions of the min function: https://gist.github.com/nlw0/58ed9fda8e8944a9cb5e5a20f6038fcf
I'm writing a lot of stuff here, only some of which makes actual sense, so apologies for the noise. I believe I am finally getting close to the root cause of this issue, though. The LLVM loop vectorizer is expecting the function to carry a "no-nans-fp-math" attribute. The test is done here: https://github.com/llvm/llvm-project/blob/d359f2096850c68b708bc25a7baca4282945949f/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp#L5324

As far as I can tell Julia never sets this, and fast-math mode only affects instructions down at the leaves of the expression trees. This is why there is no optimization even though fast mode is active. I believe this is where this function attribute is set by clang: https://github.com/llvm-mirror/clang/blob/31c307cf96ef1b6a4d0c542d18ebfd7a307ae9b0/lib/CodeGen/CGCall.cpp#L1747-L1748

I tried hacking this attribute in, but there seems to be something else still missing. Hope I can make some more progress soon.
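One way to check whether the attribute shows up on a Julia function is to dump the whole module, since attribute groups are attached to the function definition rather than to individual instructions. This is only a sketch, relying on `code_llvm`'s `raw` and `dump_module` keyword arguments, and is not how the attribute hack itself was done:

```julia
using InteractiveUtils

fastmin(x::Float32, y::Float32) = @fastmath min(x, y)

# Dump the full module so the function-level attribute groups are included,
# then look for "no-nans-fp-math" in the printed IR.
buf = IOBuffer()
code_llvm(buf, fastmin, Tuple{Float32,Float32}; raw=true, dump_module=true)
ir = String(take!(buf))
println(occursin("no-nans-fp-math", ir) ? "attribute present" : "attribute absent")
```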
OK, there was still something missing in order to see it in `@code_native`. Now I can compile the function below and have `@code_native` show vectorized assembly for it.

```julia
function mymin(itr)
    @inbounds begin
        v = itr[1]
        @simd for y in itr[2:65536]
            @fastmath v = ifelse(v < y, v, y)
        end
        v
    end
end
```
With that I suppose we can start thinking of a proper solution, and since this involves code generation and maybe decisions about what
I did some tests in my hacked Julia compiler, benchmarking
Hi, during the development of my package I encountered a problem with `extrema`. I want to find the minimum and maximum in one run over a flatten iterator. I expect this not to allocate at all, but it allocates much more (610 MiB) than the size of both arrays summed (152 MiB). First I thought it was a problem with the flatten iterator, but that looks OK to me and causes problems only with the `extrema` function.
When I define my own extrema implementation, the problem does not occur:

```julia
function my_extrema(itr)
    vmin = vmax = first(itr)
    for item in itr
        vmax = max(item, vmax)
        vmin = min(item, vmin)
    end
    return (vmin, vmax)
end
```
I'm not sure if this is related to this issue, but clearly something is wrong with this function. I cannot work out what, and it happens only with a flatten iterator over a tuple of arrays.
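A minimal reproduction along these lines might look like the sketch below; the element type and array sizes are my assumptions, not the numbers from the report above:

```julia
using BenchmarkTools

a = rand(Float64, 10^7)
b = rand(Float64, 10^7)
flat = Iterators.flatten((a, b))

@btime extrema($flat)      # allocates far more than expected on the versions discussed here
@btime my_extrema($flat)   # the hand-written loop above stays allocation-free
```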
Can you file a new issue? The problem seems much more serious than this one if you can get better performance just with a simpler implementation.
@nalimilan, I think I am close to fixing that, so please give me some time and I will open a pull request instead of an issue, or at least an issue with more info.
Much more complicated than I expected and probably needs to be fixed after code lowering: #34385
Here you go with the simple and fast implementation I'm using routinely (if the NaN test is not applicable to your data, remove the extra isnan test).
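The posted code did not survive the formatting here, so the following is only a sketch of that kind of single-pass min/max with an isnan early exit, not the commenter's exact implementation:

```julia
function minmax_scan(x::AbstractVector{T}) where {T<:AbstractFloat}
    lo = hi = first(x)
    @inbounds for i in eachindex(x)
        xi = x[i]
        isnan(xi) && return (T(NaN), T(NaN))   # drop this check if NaNs cannot occur
        lo = ifelse(xi < lo, xi, lo)
        hi = ifelse(xi > hi, xi, hi)
    end
    return (lo, hi)
end
```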
Now let's benchmark the functions:
Without the isnan test:
Your extrema function works; however, from an algorithm-optimization perspective, why should you scan the array twice to get the min and max?
The problem I pointed out is not about this particular calculation, but rather the excessive memory allocation that shows up when using extrema with a flatten iterator, and potentially in other cases that follow a similar iteration pattern.
On AVX-512 systems, calling `extrema(x)` is slower than calling `maximum` and then `minimum`. Looks like we need to incorporate the same techniques from #30320 over to `extrema` in order to get it to vectorize in a similar manner.
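A quick way to reproduce the comparison (a sketch; timings depend on the machine and Julia version):

```julia
using BenchmarkTools

x = rand(Float32, 2^20)

@btime extrema($x)
@btime (minimum($x), maximum($x))
```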