performance regression for min_fast and Float16 on master #48848

Closed
matthias314 opened this issue Mar 1, 2023 · 16 comments
Labels: float16, performance, regression

Comments

@matthias314 (Contributor)

Current master is almost 6x slower than before:

julia> v = rand(Float16, 2^10);
julia> @btime @fastmath foldl(min, $v)
  1.103 μs (0 allocations: 0 bytes)   # 1.9.0-beta4
  6.352 μs (0 allocations: 0 bytes)   # 1.10.0-DEV.680

max_fast is not affected, nor are Float32 and Float64.

Julia Version 1.10.0-DEV.680
Commit e2390597241 (2023-03-01 10:07 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 1 on 4 virtual cores
@gbaraldi added the performance, regression, and bisect wanted labels on Mar 1, 2023
@christiangnrd (Contributor)

I tested this on:

  • M2 Max MacOS
  • M1 MacOS
  • Raspberry Pi 4 RPi OS
  • Ryzen 3700X Linux
  • Ryzen 3700X Windows
  • i9-12900 Linux

And I am pretty sure this is an x64-specific bug, because I was able to reproduce the regression on both AMD and Intel processors on Linux and Windows. With the ARM processors on both macOS and Linux, I saw a significant performance improvement, anywhere from 3x (Raspberry Pi) to 30x (M1 and M2).

@giordano (Contributor) commented Mar 8, 2023

With the ARM processors on both MacOS and Linux, I saw a significant performance improvement anywhere from 3x (rpi) to 30x (M1 and M2).

Between which versions? A speedup might be masked by #43327 on M1/M2 which have native Float16, but that doesn't apply to the Raspberry Pi, I think.

@gbaraldi (Member) commented Mar 8, 2023

OK, this is a very odd failure. Look at the following unoptimized IRs.
On 1.9 we generate

    %5 = fcmp olt half %0, %1, !dbg !7
; └└
; ┌ @ essentials.jl:575 within `ifelse`
   %6 = zext i1 %5 to i8, !dbg !14
   %7 = trunc i8 %6 to i1, !dbg !14
   %8 = xor i1 %7, true, !dbg !14
   %9 = select i1 %8, half %1, half %0, !dbg !14
; └
  ret half %9, !dbg !13

But on master we generate

   %5 = fcmp fast olt half %0, %1, !dbg !7
; └└
; ┌ @ essentials.jl:586 within `ifelse`
   %6 = zext i1 %5 to i8, !dbg !12
   %7 = trunc i8 %6 to i1, !dbg !12
   %8 = xor i1 %7, true, !dbg !12
   %9 = select i1 %8, half %1, half %0, !dbg !12
; └
  ret half %9, !dbg !11

Note the fast flag on the fcmp.

After optimization we get
1.9

define half @julia_min_fast_597(half %0, half %1) #0 {
top:
; ┌ @ essentials.jl:575 within `ifelse`
   %2 = fpext half %0 to float
   %3 = fpext half %1 to float
   %4 = fcmp olt float %2, %3
   %5 = select i1 %4, half %0, half %1
; └
  ret half %5
}

master

define half @julia_min_fast_303(half %0, half %1) #0 {
top:
; ┌ @ essentials.jl:586 within `ifelse`
   %2 = call fast half @llvm.minnum.f16(half %0, half %1)
; └
  ret half %2
}

And it seems that the codegen for @llvm.minnum.f16 isn't as good as ours.
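
For anyone who wants to reproduce the comparison, the IR above can be dumped with code_llvm (a sketch of the invocation, assuming Base.FastMath.min_fast as the target; exact specialization numbers will differ per session):

using InteractiveUtils  # provides code_llvm

# Unoptimized IR: shows the `fast` flag that the fast-math op puts on the fcmp.
code_llvm(Base.FastMath.min_fast, (Float16, Float16); optimize=false)

# Optimized IR: on master this collapses to a call to @llvm.minnum.f16.
code_llvm(Base.FastMath.min_fast, (Float16, Float16); optimize=true)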

@gbaraldi (Member) commented Mar 8, 2023

Native code
1.9

   pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset %rbp, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register %rbp
        movl    %esi, %eax
        movzwl  %di, %ecx
        vmovd   %ecx, %xmm0
        vcvtph2ps       %xmm0, %xmm0
        movzwl  %ax, %ecx
        vmovd   %ecx, %xmm1
        vcvtph2ps       %xmm1, %xmm1
        vucomiss        %xmm0, %xmm1
        cmoval  %edi, %eax
                                        # kill: def $ax killed $ax killed $eax
        popq    %rbp
        .cfi_def_cfa %rsp, 8
        retq

master

        push    rbp
        .cfi_def_cfa_offset 16
        .cfi_offset rbp, -16
        mov     rbp, rsp
        .cfi_def_cfa_register rbp
        movzx   eax, di
        vmovd   xmm0, eax
        vcvtph2ps       xmm0, xmm0
        movzx   eax, si
        vmovd   xmm1, eax
        vcvtph2ps       xmm1, xmm1
        vminss  xmm2, xmm1, xmm0
        vcmpunordss     xmm0, xmm0, xmm0
        vblendvps       xmm0, xmm2, xmm1, xmm0
        vcvtps2ph       xmm0, xmm0, 4
        vmovd   eax, xmm0
                                        # kill: def $ax killed $ax killed $eax
        pop     rbp
        .cfi_def_cfa rsp, 8
        ret

@oscardssmith

@christiangnrd (Contributor) commented Mar 8, 2023

Between which versions? A speedup might be masked by #43327 on M1/M2 which have native Float16, but that doesn't apply to the Raspberry Pi, I think.

I was comparing 1.9.0-rc1 on all of them against 1.10.0-DEV.753 on some and 1.10.0-DEV.759 on the others.

@gbaraldi (Member) commented Mar 8, 2023

Just for reference, the optimization described above makes this code much faster for Float32.

@christiangnrd (Contributor)

3ff718650875b3783176d6bdc5e7b271ec573561 is the first bad commit
commit 3ff718650875b3783176d6bdc5e7b271ec573561
Author: Chris Elrod <[email protected]>
Date:   Tue Dec 27 05:26:37 2022 -0500

    Define `gt_fast` and `ge_fast` (#47972)
    
    The important thing here is that `>` is used in `max_fast`, `min_fast`, and `minmax_fast`.
    These functions did not SIMD in reductions because `>` didn't have an associated fast op.
    This PR fixes that.

 base/fastmath.jl | 4 ++++
 1 file changed, 4 insertions(+)

@giordano (Contributor)

CC @chriselrod

@chriselrod (Contributor)

Just for reference, that optimization done above makes this code very much faster on Float32.

Yes, with 2^10 elements without the PR:

julia> @btime foldl(@fastmath(min), $v32)
  954.714 ns (0 allocations: 0 bytes)
0.0009765625f0

With the PR:

julia> @btime foldl(@fastmath(min), $v32)
  47.266 ns (0 allocations: 0 bytes)
0.00048828125f0

Bad codegen of @llvm.minnum.f16 sounds like an LLVM issue.
If you don't have hardware with native support, I suggest manually converting to Float32 and then truncating when done.

julia> @btime foldl(min, $v)
  1.314 μs (0 allocations: 0 bytes)
Float16(0.0004883)

julia> @btime foldl(@fastmath(min), $v)
  4.861 μs (0 allocations: 0 bytes)
Float16(0.0004883)

julia> @btime Float16(foldl(@fastmath(min), Iterators.map(Float32,$v)))
  69.507 ns (0 allocations: 0 bytes)
Float16(0.0004883)

This will yield much better performance on Julia master than you can get on 1.9.

I assume if you do have hardware support, LLVM's @llvm.minnum.f16 will actually do a good job.

@matthias314 (Contributor, Author) commented Mar 11, 2023

Bad codegen of @llvm.minnum.f16 sounds like an LLVM issue.

If we have to wait for the LLVM folks to correct this, could the following be a temporary fix:

import Base.FastMath: min_fast, max_fast
min_fast(x::Float16, y::Float16) = -max_fast(-x, -y)

Then

julia> v = rand(Float16, 2^10);
julia> @btime foldl(max_fast, $v);
  1.113 μs (0 allocations: 0 bytes)
julia> @btime foldl(min_fast, $v);
  6.353 μs (0 allocations: 0 bytes)  # original min_fast
  1.111 μs (0 allocations: 0 bytes)  # definition above

If you don't have hardware with native support, I suggest manually converting to Float32 and then truncating when done.

Would it make sense to make this approach the default definition of sum, prod, maximum, minimum and extrema for Float16? On my computer this would give the following speed-up factors: 42 for sum, 2.9 for prod, 9.5 for maximum and 14 for minimum.
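
A minimal sketch of what such Float32-promoting definitions could look like (the helper and names are hypothetical, just to illustrate the idea; this is not what Base currently defines):

# Hypothetical helpers: promote each Float16 element to Float32, reduce in
# Float32, and only truncate the final result back to Float16.
via_f32(op, v::AbstractArray{Float16}) = Float16(mapreduce(Float32, op, v))

minimum16(v::AbstractArray{Float16}) = via_f32(min, v)  # hypothetical name
maximum16(v::AbstractArray{Float16}) = via_f32(max, v)  # hypothetical name
sum16(v::AbstractArray{Float16})     = via_f32(+, v)    # hypothetical name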

@N5N3 (Member) commented Mar 11, 2023

Would it make sense to make this approach the default definition of

Perhaps we can combine it with _makefast_reduction once we land #45581.

@matthias314 (Contributor, Author)

It seems that the problem has disappeared on 1.10-alpha1:

julia> v = rand(Float16, 2^10);
julia> @btime @fastmath foldl(min, $v);
  1.216 μs (0 allocations: 0 bytes)
julia> @btime @fastmath foldl(max, $v);
  1.192 μs (0 allocations: 0 bytes)

The timings are now almost the same (but around 10% slower than on 1.9).

@giordano (Contributor) commented Nov 1, 2024

This seems to be fixed on master:

$ julia +1.11 -q
julia> using BenchmarkTools

julia> v = rand(Float16, 2^10);

julia> @btime @fastmath foldl(min, $v);
  5.350 μs (0 allocations: 0 bytes)

vs

$ julia +nightly -q
julia> using BenchmarkTools

julia> v = rand(Float16, 2^10);

julia> @btime @fastmath foldl(min, $v);
  1.573 μs (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.12.0-DEV.1525
Commit da74ef1933b (2024-10-31 18:15 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × AMD Ryzen 9 3950X 16-Core Processor
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, znver2)
Threads: 1 default, 0 interactive, 1 GC (on 32 virtual cores)

The LLVM code on v1.12 is also much better and has a proper vector body:

julia> code_llvm(foldl, (typeof(Base.FastMath.min_fast), Vector{Float16}); debuginfo=:none)
; [...]
vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.phi = phi <16 x half> [ %minmax.ident.splat, %vector.ph ], [ %7, %vector.body ]
  %vec.phi51 = phi <16 x half> [ %minmax.ident.splat, %vector.ph ], [ %8, %vector.body ]
  %vec.phi52 = phi <16 x half> [ %minmax.ident.splat, %vector.ph ], [ %9, %vector.body ]
  %vec.phi53 = phi <16 x half> [ %minmax.ident.splat, %vector.ph ], [ %10, %vector.body ]
  %offset.idx = shl i64 %index, 1
  %2 = or disjoint i64 %offset.idx, 4
  %3 = getelementptr i8, ptr %invariant.gep, i64 %2
  %4 = getelementptr half, ptr %3, i64 16
  %5 = getelementptr half, ptr %3, i64 32
  %6 = getelementptr half, ptr %3, i64 48
  %wide.load = load <16 x half>, ptr %3, align 2
  %wide.load54 = load <16 x half>, ptr %4, align 2
  %wide.load55 = load <16 x half>, ptr %5, align 2
  %wide.load56 = load <16 x half>, ptr %6, align 2
  %7 = call fast <16 x half> @llvm.minnum.v16f16(<16 x half> %vec.phi, <16 x half> %wide.load)
  %8 = call fast <16 x half> @llvm.minnum.v16f16(<16 x half> %vec.phi51, <16 x half> %wide.load54)
  %9 = call fast <16 x half> @llvm.minnum.v16f16(<16 x half> %vec.phi52, <16 x half> %wide.load55)
  %10 = call fast <16 x half> @llvm.minnum.v16f16(<16 x half> %vec.phi53, <16 x half> %wide.load56)
  %index.next = add nuw i64 %index, 64
  %11 = icmp eq i64 %index.next, %n.vec
  br i1 %11, label %middle.block, label %vector.body
; [...]

By contrast, on v1.11:

julia> code_llvm(foldl, (typeof(Base.FastMath.min_fast), Vector{Float16}); debuginfo=:none)
; Function Signature: foldl(typeof(Base.FastMath.min_fast), Array{Float16, 1})
define half @julia_foldl_7171(ptr noundef nonnull align 8 dereferenceable(24) %"itr::Array") #0 {
top:
  %0 = getelementptr inbounds i8, ptr %"itr::Array", i64 16
  %.size.sroa.0.0.copyload = load i64, ptr %0, align 8
  %.not.not = icmp eq i64 %.size.sroa.0.0.copyload, 0
  br i1 %.not.not, label %L84, label %L35

L35:                                              ; preds = %top
  %1 = load ptr, ptr %"itr::Array", align 8
  %2 = load half, ptr %1, align 2
  %.not58.not68.not = icmp eq i64 %.size.sroa.0.0.copyload, 1
  br i1 %.not58.not68.not, label %L91, label %L76

L76:                                              ; preds = %L76, %L35
  %3 = phi i64 [ %value_phi2071, %L76 ], [ 1, %L35 ]
  %value_phi2071 = phi i64 [ %6, %L76 ], [ 2, %L35 ]
  %value_phi1970 = phi half [ %7, %L76 ], [ %2, %L35 ]
  %4 = getelementptr inbounds half, ptr %1, i64 %3
  %5 = load half, ptr %4, align 2
  %6 = add i64 %value_phi2071, 1
  %7 = call fast half @llvm.minnum.f16(half %value_phi1970, half %5)
  %exitcond.not = icmp eq i64 %value_phi2071, %.size.sroa.0.0.copyload
  br i1 %exitcond.not, label %L91, label %L76

L84:                                              ; preds = %top
  call void @j_reduce_empty_7183() #11
  unreachable

L91:                                              ; preds = %L76, %L35
  %value_phi19.lcssa = phi half [ %2, %L35 ], [ %7, %L76 ]
  ret half %value_phi19.lcssa
}

I don't know if this was fixed by an upgrade to a newer LLVM version or perhaps #55486, which simplified the code generation for Float16.

@giordano closed this as completed on Nov 1, 2024
@matthias314 (Contributor, Author)

I'm not saying you should reopen this, but I want to point out that, compared to 1.10, I now see a 2-3x slowdown for both min_fast and max_fast. For Julia 1.10.5 I get

julia> v = rand(Float16, 2^10);

julia> @b @fastmath foldl(min, $v)
1.200 μs

julia> @b @fastmath foldl(max, $v)
1.191 μs

For master it's

julia> @b @fastmath foldl(min, $v)
3.021 μs

julia> @b @fastmath foldl(max, $v)
2.256 μs

Julia Version 1.12.0-DEV.1526
Commit 7715cf287a (2024-10-31 23:07 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, skylake)

@oscardssmith (Member)

I believe it's the expected outcome that Float16 will be slow on a CPU without native support for it. Now that there is hardware that does support Float16, the path we're taking is to optimize the routines for that hardware (on hardware without native support, you don't want to use Float16 for compute anyway).
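
In practice that means the pattern suggested earlier in this thread: keep the data stored as Float16 but do the arithmetic in Float32. Just the idiom, not a Base API:

v = rand(Float16, 2^10)        # storage stays compact
Float16(minimum(Float32, v))   # compute in Float32, truncate only the result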

@giordano (Contributor) commented Nov 1, 2024

Can't reproduce:

$ for version in 1.10 1.11 nightly; do julia +${version} -e '@show VERSION; using BenchmarkTools; v = rand(Float16, 2^10); @btime @fastmath foldl(min, $v); @btime @fastmath foldl(max, $v)'; done
VERSION = v"1.10.6"
  3.319 μs (0 allocations: 0 bytes)
  3.320 μs (0 allocations: 0 bytes)
VERSION = v"1.11.1"
  5.657 μs (0 allocations: 0 bytes)
  1.128 μs (0 allocations: 0 bytes)
VERSION = v"1.12.0-DEV.1535"
  1.586 μs (0 allocations: 0 bytes)
  1.165 μs (0 allocations: 0 bytes)

Same machine as above (no native Float16)
