Faster minimum/maximum/extrema #45581

Draft · mikmoore wants to merge 12 commits into base: master from the floatminimummaximum branch

Conversation

@mikmoore (Contributor) commented Jun 4, 2022

This PR uses a new method to greatly accelerate minimum/maximum/extrema for IEEEFloat values. The concept is to convert floats to signed integers that respect the same ordering that min/max do for floats. Since integers can be min/max'd much more quickly than floats, this leads to a considerable speedup. This is a minor extension to the concept used by isless since #39090. I used the changes here to slightly accelerate that version of isless as well.

Note that the old versions had special-cased _mapreduce_impl for min/max. The reckless application of these led to very poor performance for integer minimum/maximum. This implementation eliminates those specializations, improving integer performance as well.
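
For readers unfamiliar with the trick, here is a minimal sketch of the kind of order-preserving map involved. The names orderint/orderfloat are illustrative only (they are not the functions used in this PR), it is Float64-only, and NaN handling is ignored:

orderint(x::Float64) = (i = reinterpret(Int64, x); ifelse(i < 0, i ⊻ typemax(Int64), i))  # order-preserving Float64 -> Int64
orderfloat(i::Int64) = reinterpret(Float64, ifelse(i < 0, i ⊻ typemax(Int64), i))         # inverse map

x = randn(1000);
orderfloat(minimum(orderint, x)) == minimum(x)  # true for NaN-free input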

Benchmark:

using BenchmarkTools
a = randn(1000);
b = zeros(1000);
c = @. ifelse(rand() < 0.1, missing, a);
j = rand(Int,1000);
@btime minimum($a); # "typical" case
@btime minimum($b); # pathological case for old version
@btime extrema($a);
@btime minimum($c); # with missing
@btime minimum($j); # integer performance
@btime searchsortedlast($a, x) setup=(x=randn()); # isless performance

Before:

  812.222 ns (0 allocations: 0 bytes)
  1.620 μs (0 allocations: 0 bytes)
  4.629 μs (0 allocations: 0 bytes)
  1.220 μs (0 allocations: 0 bytes)
  262.315 ns (0 allocations: 0 bytes)
  14.729 ns (0 allocations: 0 bytes)

This PR:

  173.764 ns (0 allocations: 0 bytes)
  173.721 ns (0 allocations: 0 bytes)
  278.276 ns (0 allocations: 0 bytes)
  954.839 ns (0 allocations: 0 bytes)
  92.453 ns (0 allocations: 0 bytes)
  11.311 ns (0 allocations: 0 bytes)

#43725: EDIT: never mind - it sounds like I didn't get that benchmark right

Results are similar for maximum.
This idea does not support values that may be Missing whereas the original specialization did. Interestingly, eliminating that specialization (none of the new code here is run with Missing) still led to a performance improvement.

For Float64, this PR makes minimum/maximum 4.7x faster (more in pathological case), extrema 16x faster, and isless 1.3x faster. Note that sort already has specializations for floats, which remains faster than the naive application of this updated isless. This also makes minimum/maximum 2.8x faster for Int.

The technical idea here is finished, but I could use some help on the architecture. The current version makes _mapreduce(f,op,...) check whether a registered acceleration function exists for op at the return type of f. It only takes effect if the return_type of f applied to the input is known to be an IEEEFloat. In the case that such a function exists, it wraps f in the accelerating transformation and applies an appropriate inverse transformation to the result. Even if the architectural idea is sound, I'm not sure I put the corresponding functions in reasonable places within the code.
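
As a rough illustration of the hook described above (all names here are hypothetical and the registry is Float64-only, re-using the illustrative orderint/orderfloat maps from the first sketch; this is a sketch of the idea, not the PR's actual code):

# Hypothetical registry: op => a function of the element type returning (pre, integer op, post).
const ACCELERATIONS = Dict{Any,Any}(min => T -> (orderint, min, orderfloat))

function _mapreduce_sketch(f, op, A)
    T = Base._return_type(f, Tuple{eltype(A)})      # inferred element type of f.(A)
    if T <: Base.IEEEFloat && haskey(ACCELERATIONS, op)
        pre, intop, post = ACCELERATIONS[op](T)     # order-preserving map, integer op, inverse map
        return post(mapreduce(pre ∘ f, intop, A))   # reduce in the integer domain, map the result back
    end
    return mapreduce(f, op, A)                      # generic fallback
end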

@N5N3 (Member) commented Jun 5, 2022

I'm afraid you didn't set up #43725 successfully, as all-zero input should not be a pain there.

This solution is good. The biggest concern might be that it treats NaNs as ordered numbers:

julia> using QNaNs

julia> minimum([qnan(2),qnan(1)]) |> qnan
1
julia> minimum([qnan(1),qnan(2)]) |> qnan
1

I'm not sure whether @tkf would be happy with this. (This would generate inference-dependent results.)
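
For context, a brief sketch of why this happens, assuming the bit-pattern mapping sketched earlier: NaNs with different payloads reinterpret to different integers, so an integer-based min deterministically prefers one payload over the other instead of returning whichever NaN the reduction encounters first.

julia> n1 = reinterpret(Float64, reinterpret(Int64, NaN) | 1);  # a NaN with payload 1

julia> n2 = reinterpret(Float64, reinterpret(Int64, NaN) | 2);  # a NaN with payload 2

julia> reinterpret(Int64, n1) < reinterpret(Int64, n2)          # the payload bits impose an order
true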

If the core devs are happy with this change, I think we'd better do this optimization at the mapreduce_impl level. This would help solve the current breakage of our type-based reduce_empty.

Also, I think it would be good to accelerate slow dimensional reductions as well, e.g. minimum(a, dims = (1, 3)).
To achieve this, I think we'd better make floatorder_min return a float and define a new reduction function to handle them.
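
For illustration, such a pairwise function might look like the sketch below (my reading of the suggestion, not code from the PR; it re-uses the hypothetical orderint map from the earlier sketch and ignores NaNs):

fast_min(x::Float64, y::Float64) = ifelse(orderint(x) < orderint(y), x, y)

As mikmoore notes in the reply below, using such a function as the reduction operator re-applies the transformation at every step, which is where the performance penalty comes from.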

N5N3 requested a review from tkf (June 5, 2022 02:11)
N5N3 added the performance ("Must go faster") label (Jun 5, 2022)
@mikmoore (Contributor, Author) commented Jun 5, 2022

Thanks for the feedback.

I imagined I might have had that benchmarking wrong for that other PR. I removed the comparison from my earlier post.

You are correct that this version of minimum would differ from a nonspecializing reduce(min,...) in its preference among multiple NaNs. This may be objectionable.

I've fixed the breaking-empty-collections issue, although it still looks brittle to me.

I agree that this should also apply to dimensional reductions. I'm not sure I understand what you mean by "make floatorder_min return a float". While it's totally possible to define a pairwise min function based on this method, using it in a reduction involves redundant transformations to the integer space and so carries a significant performance penalty (such that other methods tend to perform better). Can you elaborate a little?
For these reductions, the representations are bit-compatible so I might imagine doing the dimensional reduction into a reinterpret(Int64,...) array, then applying the inverse-transforms in-place and reinterpreting as Float64. Although that approach feels brittle, too.
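
A rough sketch of that idea (illustrative only; orderint/orderfloat are the hypothetical maps from the earlier sketch, not names from this PR, and NaNs are again ignored):

A  = rand(20, 50);
Ri = mapreduce(orderint, min, A; dims = 1)   # reduce in the integer domain
R  = map(orderfloat, Ri)                     # invert the map; could also be done in place on a reinterpreted view
R == minimum(A; dims = 1)                    # true for NaN-free input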

The perfect world would involve a float min function that would vectorize under reduce -- none of this specialization would be needed at all. Unfortunately, I don't know compilers well enough to win that fight. My tinkering has been unsuccessful.

@nalimilan (Member) commented:
Radix sort uses a similar approach:

julia/base/sort.jl, lines 1434 to 1444 (at bd8dbc3):

uint_map(x::Float32, ::Left) = ~reinterpret(UInt32, x)
uint_unmap(::Type{Float32}, u::UInt32, ::Left) = reinterpret(Float32, ~u)
uint_map(x::Float32, ::Right) = reinterpret(UInt32, x)
uint_unmap(::Type{Float32}, u::UInt32, ::Right) = reinterpret(Float32, u)
UIntMappable(::Type{Float32}, ::Union{Left, Right}) = UInt32
uint_map(x::Float64, ::Left) = ~reinterpret(UInt64, x)
uint_unmap(::Type{Float64}, u::UInt64, ::Left) = reinterpret(Float64, ~u)
uint_map(x::Float64, ::Right) = reinterpret(UInt64, x)
uint_unmap(::Type{Float64}, u::UInt64, ::Right) = reinterpret(Float64, u)
UIntMappable(::Type{Float64}, ::Union{Left, Right}) = UInt64

Cc: @LilithHafner

@LilithHafner (Member) left a comment:

It would be nice to re-use code from Sort.jl, but the specifications here are sufficiently different with respect to handling NaNs that I don't think re-use is viable.

(Review threads on base/reduce.jl and base/float.jl; outdated/resolved.)
Co-authored-by: Lilith Orion Hafner <[email protected]>
@mikmoore (Contributor, Author) commented Jun 6, 2022

I'm increasingly unhappy with how I've bolted this together. Hijacking the f function to do the transformation seems a little fragile. While the implementation here isn't too complicated, I'm not too excited to write the dimensional version (for, e.g., minimum(...; dims=...)). I'm open to other ideas if people can think of something more composable.

@N5N3 (Member) commented Jun 6, 2022

@mikmoore f85099c is a very primitive extension for slow dimensional reduction.
I removed the acceleration for the foldl branch, as I believe we should implement a general fast mapreduce_impl for non-linear reduction.

@mikmoore (Contributor, Author) commented Jun 6, 2022

> @mikmoore f85099c is a very primitive extension for slow dimensional reduction.

I like that approach much better! I'll incorporate it when I have some time.

@mcabbott (Contributor) commented Jan 6, 2023

I didn't manage to build this branch, but doing @eval Base ... to load its new functions into master, I see the following problems [probably just a mistake]:

julia> extrema([1,2,Inf])
(1.0, 3.999999999999999)

julia> extrema([1,2,NaN])
(NaN, -1.3482698511467371e308)

These don't occur with `maximum`. They do occur with Float32.

@N5N3 (Member) commented Jan 6, 2023

How did you set it up?
My local build has pulled in a later version of this PR, and it has no such problem.

julia> extrema([1,2,Inf])
(1.0, Inf)

julia> extrema([1,2,NaN])
(NaN, NaN)

julia> extrema([1,2,NaN32])
(NaN32, NaN32)

julia> extrema([1,2,Inf32])
(1.0f0, Inf32)

@N5N3 (Member) commented Jan 6, 2023

@mikmoore would you mind if I push f85099c here, if you have no time to finish this?
Since triage said we don't care about which NaN is returned, I think it would be good to finish this PR and see whether ARM platforms really need it.

@mikmoore (Contributor, Author) commented Jan 6, 2023

Sorry. I did spend some time a bit ago incorporating the sort of reorganization you suggested. I pushed that just now but wouldn't be upset if you integrated parts or all of your changes instead.

I spent some more time on this over the holidays but have been stuck on minimum!/maximum!. It's been a week or two, but as I recall they're causing allocations and are not as fast as they should be -- probably something to do with not specializing function arguments, but I've been unsuccessful in my attempts to fix it. Until that gets fixed, I don't think this can be finished. Feel free to take a shot at that as well, if you like.

Here is the set I've been using to test this:

using BenchmarkTools
a = randn(1000);
b = @. ifelse(a <= 0, zero(eltype(a)), a); # convert all negatives to +0.0
d = @. ifelse(rand() < 0.1, missing, a); # convert some values to missing
m = reshape(a,20,50);
r = Matrix{Float64}(undef,size(m,1),1); c = Matrix{Float64}(undef,1,size(m,2));
j = rand(Int,1000);
@btime minimum($a); # "typical" case
@btime minimum($b); # pathological case for old version
@btime minimum($m;dims=1);
@btime minimum($m;dims=2);
@btime minimum!($c,$m);
@btime minimum!($r,$m);
@btime map!(minimum,$c,eachcol($m)); # this appears to be faster than the previous but shouldn't be
@btime map!(minimum,$r,eachrow($m)); # this appears to be faster than the previous but shouldn't be
@btime extrema($a);
@btime minimum($d); # with missing
@btime minimum($j); # integer performance
@btime searchsortedlast($a, x) setup=(x=randn()); # isless performance

mikmoore marked this pull request as draft (January 6, 2023 05:59)
mcabbott mentioned this pull request (Jan 6, 2023)
@N5N3 (Member) commented Jan 6, 2023

> have been stuck on minimum!/maximum!.

I'm afraid your change didn't touch minimum!/maximum!, as they call mapreducedim! directly.
And I think mapreducedim! won't like _makefast_mapreduction, as there's no guarantee that the dest array has the same eltype as f.(A).
(And only the dim1 reduction would benefit from the optimization.)
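
To illustrate that point (my example, not from the thread): the destination passed to mapreducedim! may have a different element type than f.(A), so it cannot simply be reinterpreted into the integer domain.

A = rand(3, 4);
R = fill(Inf32, 1, 4);                    # Float32 destination, Float64 source
Base.mapreducedim!(identity, min, R, A)   # legal: each result is converted on assignment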

N5N3 added 2 commits (January 6, 2023 15:42):
  "It helps to avoid extra allocation (caused by non-specialization)"
  "The accuracy won't be better."
N5N3 force-pushed the floatminimummaximum branch from 6077812 to 068082a (January 6, 2023 07:50)
@N5N3 (Member) commented Jan 6, 2023

With 413b739 I have

julia> @btime minimum!($c,$m);
  861.905 ns (0 allocations: 0 bytes)

julia> @btime minimum!($r,$m);
  450.000 ns (0 allocations: 0 bytes)

julia> @btime map!(minimum,$c,eachcol($m));
  1.220 μs (0 allocations: 0 bytes)

julia> @btime map!(minimum,$r,eachrow($m));
  915.789 ns (0 allocations: 0 bytes)

So I think the original concern has been resolved.

@LilithHafner (Member) left a comment:

Have you considered using dispatch and a lazy map instead of hooks? i.e.

mapreduce_impl(f, op, A::AbstractArrayOrBroadcasted, [...]) =
    mapreduce_impl(_mapped_eltype(f, A), f, op, A, [...])
mapreduce_impl(_, f, op, A::AbstractArrayOrBroadcasted, [...]) =
    # existing implementation
for op in (min, max)
    @eval mapreduce_impl(::IEEEFloat, f, ::typeof($op), A::AbstractArrayOrBroadcasted, [...]) =
        postprocess(mapreduce_impl(f, fast_op, Iterators.map(preprocess, A), [...]))
end

(Review threads on base/reduce.jl and base/float.jl; outdated/resolved.)
Co-Authored-By: Lilith Orion Hafner <[email protected]>
@N5N3 (Member) commented Jan 6, 2023

Iterators.map would return a Generator and fall back to mapfoldl. I'm not sure we want to add a similar hook there.

@LilithHafner (Member) commented:

I mistakenly assumed that Iterators.map(::AbstractArray)::AbstractArray 💔

@LilithHafner (Member) commented:

What about

for op in (min, max)
    @eval mapreduce_impl(::IEEEFloat, f, ::typeof($op), A::AbstractArrayOrBroadcasted, [...]) =
        postprocess(mapreduce_impl(preprocess ∘ f, fast_op, A, [...]))
end

@N5N3 (Member) commented Jan 6, 2023

@mikmoore tried this at the _mapreduce level (see 81a0ff5).
But this approach doesn't seem to work well with mapreducedim! (as we only want to activate the optimization for dim1 reductions).
For me it makes sense to keep mapreducedim! and mapreduce_impl sharing a similar code style.

@LilithHafner (Member) commented:

That makes sense. Thanks for humoring me. It's frustrating how complex the mapreduce implementation is.

(Inline review thread; diff context around the changed line:)

    end
-   R[i1,IR] = r
+   R[i1,IR] = post(r)
    end
else
    @inbounds for IA in CartesianIndices(indsAt)
@mikmoore (Contributor, Author) commented Jan 7, 2023:

Shouldn't it be possible to apply the same series of transformations when reducing along non-first dimensions? Or am I missing where that has already been handled?

EDIT: maybe now I see. This was your comment about the destination array possibly being of a different type. Could this still be done if it has a same-size type (or at least the exact same type) so that they are reinterpret-able?

A Member replied:

Yes, exactly. And all the optimizations here are reinterpret-able.

@N5N3 (Member) commented Jan 7, 2023:

I don't think non-dim1 cases would be accelerated much, as we would have to add a map!(pre, dest, dest) before the kernel loop and a map!(post, dest, dest) after it.
reinterpret might be 0-cost, but flipneg + add offset + memory access are not.

And non-dim1 minimum/maximum is vectorized quite well:

julia> A = randn(512, 512);

julia> @btime minimum($A, dims = 1);
  65.400 μs (7 allocations: 4.33 KiB)

julia> @btime minimum($A, dims = 2);
  56.500 μs (7 allocations: 4.33 KiB)

On my PC it's faster than dim1 with the optimization.

@N5N3 (Member) commented Jan 7, 2023:

A simple benchmark:

f(b, c) = begin
    pre, op, post = Base._makefast_reducution(min, Float64)  # PR helper: pre-transform, fast op, post-transform
    Base.initarray!(b, identity, min, true, c)               # initialize the destination for a min reduction
    map!(pre, b, b)                                          # pre-transform the destination in place
    Base.mapreducedim!(pre, op, b, c);                       # map pre over c and reduce with op into b
    map!(post, b, b)                                         # undo the transform on the result
end

A = randn(512, 512); B = randn(512);
----------------------------------------
julia> @btime f($B, $A);
  58.500 μs (0 allocations: 0 bytes)

julia> @btime minimum!($B, $A);
  57.700 μs (0 allocations: 0 bytes)

EDIT: profiling shows that the overhead of map!(pre, b, b) & map!(post, b, b) is negligible in this case.

@dpinol commented Oct 9, 2024

Hi,
any chance this PR could be moved forward?
Thanks

@mikmoore (Contributor, Author) commented Oct 9, 2024

I'm increasingly un-enamored with this PR. However, I still think that the current specializations should be removed (or at least limited to IEEEFloat), and the isless improvement seems fine.

But the rest of it is a lot of code and complication (and maintenance) for some modest improvements. Further, I suspect that this would be a regression on ARM systems, where the desired min/max semantics are available as native instructions.

Currently, the reduce family of functions is just too complicated to make specializations like this palatable. I suspect that's why those of us who worked on this eventually burned out. And there's further work happening on those families (universal pairwise reduce, changes to init semantics) that may require some adjustment here.

But feel free to pick this up and get it over the line yourself, if you like. It would be easier (almost done, probably?) if you abandoned dimensional reduction and merely aimed to improve the whole-collection case.
