Faster minimum/maximum/extrema #45581
base: master
Conversation
I'm afraid you didn't set up #43725 successfully, as an all-zero input should not be a pain there. This solution is good. The biggest concern might be how it treats NaN payloads:

```julia
julia> using QNaNs

julia> minimum([qnan(2), qnan(1)]) |> qnan
1

julia> minimum([qnan(1), qnan(2)]) |> qnan
1
```

I'm not sure whether @tkf would be happy with this. (This would generate an inference-dependent result.) If the core-devs are happy with this change, […]. Also, I think it would be good to also accelerate slow dimensional reduction, e.g. […]
Thanks for the feedback. I imagined I might have had that benchmarking wrong for that other PR; I removed the comparison from my earlier post. You are correct that this version of […].

I've fixed the breaking-empty-collections issue, although it still looks brittle to me. I agree that this should also apply to dimensional reductions. I'm not sure I understand what you mean by "make […]". The perfect world would involve a float […].
Radix sort uses a similar approach: lines 1434 to 1444 at bd8dbc3.
Cc: @LilithHafner
It would be nice to re-use the machinery from Sort.jl, but I believe the specifications here are sufficiently different with respect to handling NaNs that re-use isn't viable.
Co-authored-by: Lilith Orion Hafner <[email protected]>
I'm increasingly unhappy with how I've bolted this together. Hijacking the […]
I didn't manage to build this branch, but doing […] shows the following problems:

```julia
julia> extrema([1,2,Inf])
(1.0, 3.999999999999999)

julia> extrema([1,2,NaN])
[…]
```
How did you set it up? I get:

```julia
julia> extrema([1,2,Inf])
(1.0, Inf)

julia> extrema([1,2,NaN])
(NaN, NaN)

julia> extrema([1,2,NaN32])
(NaN32, NaN32)

julia> extrema([1,2,Inf32])
(1.0f0, Inf32)
```
Sorry. I did spend some time a while ago incorporating the sort of reorganization you suggested. I pushed that just now, but wouldn't be upset if you integrated parts or all of your changes instead. I spent some more time on this over the holidays but have been stuck on […].

Here is the set I've been using to test this:

```julia
using BenchmarkTools
a = randn(1000);
b = @. ifelse(a <= 0, zero(eltype(a)), a);        # convert all negatives to +0.0
d = ifelse.(rand(length(a)) .< 0.1, missing, a);  # convert some values to missing
m = reshape(a, 20, 50);
r = Matrix{Float64}(undef, size(m, 1), 1); c = Matrix{Float64}(undef, 1, size(m, 2));
j = rand(Int, 1000);

@btime minimum($a); # "typical" case
@btime minimum($b); # pathological case for old version
@btime minimum($m; dims=1);
@btime minimum($m; dims=2);
@btime minimum!($c, $m);
@btime minimum!($r, $m);
@btime map!(minimum, $c, eachcol($m)); # this appears to be faster than the previous but shouldn't be
@btime map!(minimum, $r, eachrow($m)); # this appears to be faster than the previous but shouldn't be
@btime extrema($a);
@btime minimum($d); # with missing
@btime minimum($j); # integer performance
@btime searchsortedlast($a, x) setup=(x=randn()); # isless performance
```
I'm afraid your change didn't touch the […]
It helps to avoid extra allocation (caused by non-specialization). The accuracy won't be better.

Force-pushed from 6077812 to 068082a.
With 413b739 I have:

```julia
julia> @btime minimum!($c, $m);
  861.905 ns (0 allocations: 0 bytes)

julia> @btime minimum!($r, $m);
  450.000 ns (0 allocations: 0 bytes)

julia> @btime map!(minimum, $c, eachcol($m));
  1.220 μs (0 allocations: 0 bytes)

julia> @btime map!(minimum, $r, eachrow($m));
  915.789 ns (0 allocations: 0 bytes)
```

So I think the original concern got resolved.
Have you considered using dispatch and a lazy map instead of hooks? I.e.

```julia
mapreduce_impl(f, op, A::AbstractArrayOrBroadcasted, [...]) =
    mapreduce_impl(_mapped_eltype(f, A), f, op, A, [...])

mapreduce_impl(_, f, op, A::AbstractArrayOrBroadcasted, [...]) =
    # existing implementation

for op in (min, max)
    @eval mapreduce_impl(::IEEEFloat, f, typeof($op), A::AbstractArrayOrBroadcasted, [...]) =
        postprocess(mapreduce_impl(f, fast_op, Iterators.map(preprocess, A), [...]))
end
```
Co-Authored-By: Lilith Orion Hafner <[email protected]>
I mistakenly assumed that […]
What about

```julia
for op in (min, max)
    @eval mapreduce_impl(::IEEEFloat, f, typeof($op), A::AbstractArrayOrBroadcasted, [...]) =
        postprocess(mapreduce_impl(preprocess ∘ f, fast_op, A, [...]))
end
```
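To make the suggested shape concrete, here is a self-contained toy sketch of the pattern (all names here are hypothetical, `Base._return_type` is a non-public inference API, and the real `mapreduce_impl` carries extra index arguments; this is not the PR's code):

```julia
# Dispatch on the mapped element type, computed once up front.
_mapped_eltype(f, A) = Base._return_type(f, Tuple{eltype(A)})  # non-public, inference-dependent

my_mapreduce(f, op, A) = my_mapreduce(_mapped_eltype(f, A), f, op, A)

# Fallback: no acceleration registered for this element type / operation.
my_mapreduce(::Type, f, op, A) = mapreduce(f, op, A)

# Order-preserving Float64 <-> Int64 maps (NaN handling omitted; see the discussion above).
pre(x::Float64) = (i = reinterpret(Int64, x); ifelse(i < 0, i ⊻ typemax(Int64), i))
post(i::Int64)  = reinterpret(Float64, ifelse(i < 0, i ⊻ typemax(Int64), i))

# Accelerated path: compose the map with `f`, reduce in the integer domain, then invert.
my_mapreduce(::Type{Float64}, f, ::typeof(min), A) = post(mapreduce(pre ∘ f, min, A))
```

With these definitions, `my_mapreduce(identity, min, [3.0, -1.5, 2.0])` returns `-1.5` via the integer path; the whole decision reduces to ordinary method dispatch on a computed type, with no hook registry required.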
That makes sense. Thanks for humoring me. It's frustrating how complex the […]
```diff
 end
-R[i1,IR] = r
+R[i1,IR] = post(r)
 end
 else
 @inbounds for IA in CartesianIndices(indsAt)
```
Shouldn't it be possible to apply the same series of transformations when reducing along non-first dimensions? Or am I missing where that has already been handled?

EDIT: maybe now I see. This was your comment about the destination array possibly being of a different type. Could this still be done if it has a same-size type (or at least the exact same type) so that they are `reinterpret`-able?
Yes, exactly. And all the optimizations here are `reinterpret`-able.
I don't think non-dim1 cases would be accelerated a lot, as we will have to add a `map!(pre, dest, dest)` before the kernel loop and a `map!(post, dest, dest)` after it. `reinterpret` might be 0-cost, but `flipneg` + add offset + memory access is not. And non-dim1 minimum/maximum is vectorized quite well:

```julia
julia> A = randn(512, 512);

julia> @btime minimum($A, dims = 1);
  65.400 μs (7 allocations: 4.33 KiB)

julia> @btime minimum($A, dims = 2);
  56.500 μs (7 allocations: 4.33 KiB)
```

On my PC it's faster than dim1 with the optimization.
A simple bench:

```julia
f(b, c) = begin
    pre, op, post = Base._makefast_reducution(min, Float64)
    Base.initarray!(b, identity, min, true, c)
    map!(pre, b, b)
    Base.mapreducedim!(pre, op, b, c)
    map!(post, b, b)
end
A = randn(512, 512); B = randn(512);
```

```julia
julia> @btime f($B, $A);
  58.500 μs (0 allocations: 0 bytes)

julia> @btime minimum!($B, $A);
  57.700 μs (0 allocations: 0 bytes)
```

EDIT: profiling shows that the overhead of `map!(pre, b, b)` and `map!(post, b, b)` is negligible in this case.
Hi, […]
I'm increasingly un-enamored with this PR. However, I still think that the current specializations should be removed (or at least limited only to […]).

But the rest of it is a lot of code and complication (and maintenance) for some modest improvements. Further, I suspect that this would be a regression on AMD systems, where the desired […]. Currently, the […].

But feel free to pick this up and get it over the line yourself, if you like. It would be easier (almost done, probably?) if you abandoned dimensional reduction and merely aimed to improve the whole-collection case.
This PR uses a new method to greatly accelerate `minimum`/`maximum`/`extrema` for `IEEEFloat` values. The concept is to convert floats to signed integers that respect the same ordering that `min`/`max` do for floats. Since integers can be min/max'd much more quickly than floats, this leads to a considerable speedup. This is a minor extension to the concept used by `isless` since #39090. I used the changes here to slightly accelerate that version of `isless` as well.

Note that the old versions had special-cased `_mapreduce_impl` for `min`/`max`. The reckless application of these led to very poor performance for integer `minimum`/`maximum`. This implementation eliminates those specializations, improving integer performance as well.

Benchmark:
Before: […]

This PR: […]

#43725: EDIT: never mind, it sounds like I didn't get that benchmark right.

Results are similar for `maximum`.

This idea does not support values that may be `Missing`, whereas the original specialization did. Interestingly, eliminating that specialization (none of the new code here is run with `Missing`) still led to a performance improvement.
For `Float64`, this PR makes `minimum`/`maximum` 4.7x faster (more in the pathological case), `extrema` 16x faster, and `isless` 1.3x faster. Note that `sort` already has specializations for floats, which remain faster than the naive application of this updated `isless`. This also makes `minimum`/`maximum` 2.8x faster for `Int`.

The technical idea here is finished, but I could use some help on the architecture. The current version makes `_mapreduce(f, op, ...)` check whether a registered acceleration function exists for `op` at the return type of `f` applied to the input. It only takes effect if the `return_type` of `f` applied to the input is known to be an `IEEEFloat`. In the case that such a function exists, it wraps `f` in the accelerating transformation and applies an appropriate inverse transformation to the result. Even if the architectural idea is sound, I'm not sure I put the corresponding functions in reasonable places within the code.
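For readers puzzling over the ordering claim, here is a minimal sketch of the kind of map involved (the name `ordered_bits` and the NaN-free assumption are mine; the PR's actual transforms also handle NaN and the `maximum`/`extrema` directions):

```julia
# Map Float64 bit patterns to Int64 so that, NaNs aside, integer `<` agrees with float `<`.
# Positive floats already compare correctly as signed integers; for negative floats,
# flipping the non-sign bits (xor with typemax) reverses their order, as required.
ordered_bits(x::Float64) = (i = reinterpret(Int64, x); ifelse(i < 0, i ⊻ typemax(Int64), i))

# Strictly increasing on non-NaN floats, including the -0.0 < 0.0 distinction
# that Julia's min/max respect:
xs = [-Inf, -1.5, -0.0, 0.0, 2.0, Inf]
@assert issorted(ordered_bits.(xs))
@assert ordered_bits(-0.0) < ordered_bits(0.0)
```

Because the map is monotone, taking `min`/`max` of the mapped integers and inverting the map afterwards recovers the float result, which is what lets the reduction kernel run entirely on integers.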