performance regression for min_fast and Float16 on master #48848

Closed
matthias314 opened this issue Mar 1, 2023 · 16 comments
Labels: float16, performance, regression

Comments

@matthias314 (Contributor)

Current master is almost 6x slower than before:

julia> v = rand(Float16, 2^10);
julia> @btime @fastmath foldl(min, $v)
  1.103 μs (0 allocations: 0 bytes)   # 1.9.0-beta4
  6.352 μs (0 allocations: 0 bytes)   # 1.10.0-DEV.680

max_fast is not affected, nor are Float32 and Float64.

Julia Version 1.10.0-DEV.680
Commit e2390597241 (2023-03-01 10:07 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 1 on 4 virtual cores
@gbaraldi added the performance, regression, and bisect wanted labels on Mar 1, 2023
@christiangnrd (Contributor)

I tested this on:

  • M2 Max MacOS
  • M1 MacOS
  • Raspberry Pi 4 RPi OS
  • Ryzen 3700X Linux
  • Ryzen 3700X Windows
  • i9-12900 Linux

And I am pretty sure this is an x64-specific bug, because I was able to reproduce the regression on both AMD and Intel processors on Linux and Windows. With the ARM processors on both macOS and Linux, I saw a significant performance improvement, anywhere from 3x (Raspberry Pi) to 30x (M1 and M2).

@giordano (Contributor) commented Mar 8, 2023

With the ARM processors on both MacOS and Linux, I saw a significant performance improvement anywhere from 3x (rpi) to 30x (M1 and M2).

Between which versions? A speedup might be masked by #43327 on M1/M2 which have native Float16, but that doesn't apply to the Raspberry Pi, I think.

@gbaraldi (Member) commented Mar 8, 2023

OK, this is a very odd failure. Look at the following unoptimized IRs.
On 1.9 we generate

    %5 = fcmp olt half %0, %1, !dbg !7
; └└
; ┌ @ essentials.jl:575 within `ifelse`
   %6 = zext i1 %5 to i8, !dbg !14
   %7 = trunc i8 %6 to i1, !dbg !14
   %8 = xor i1 %7, true, !dbg !14
   %9 = select i1 %8, half %1, half %0, !dbg !14
; └
  ret half %9, !dbg !13

But on master we generate

   %5 = fcmp fast olt half %0, %1, !dbg !7
; └└
; ┌ @ essentials.jl:586 within `ifelse`
   %6 = zext i1 %5 to i8, !dbg !12
   %7 = trunc i8 %6 to i1, !dbg !12
   %8 = xor i1 %7, true, !dbg !12
   %9 = select i1 %8, half %1, half %0, !dbg !12
; └
  ret half %9, !dbg !11

Note the fast flag on the fcmp.

After optimization we get
1.9

define half @julia_min_fast_597(half %0, half %1) #0 {
top:
; ┌ @ essentials.jl:575 within `ifelse`
   %2 = fpext half %0 to float
   %3 = fpext half %1 to float
   %4 = fcmp olt float %2, %3
   %5 = select i1 %4, half %0, half %1
; └
  ret half %5
}

master

define half @julia_min_fast_303(half %0, half %1) #0 {
top:
; ┌ @ essentials.jl:586 within `ifelse`
   %2 = call fast half @llvm.minnum.f16(half %0, half %1)
; └
  ret half %2
}

And it seems that the codegen for @llvm.minnum.f16 isn't as good as ours.
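
For anyone who wants to reproduce the comparison, the IR above can be dumped with code_llvm (a sketch of the invocation, assuming Base.FastMath.min_fast as the target; exact specialization numbers will differ per session):

using InteractiveUtils  # provides code_llvm

# Unoptimized IR: shows the `fast` flag that the fast-math op puts on the fcmp.
code_llvm(Base.FastMath.min_fast, (Float16, Float16); optimize=false)

# Optimized IR: on master this collapses to a call to @llvm.minnum.f16.
code_llvm(Base.FastMath.min_fast, (Float16, Float16); optimize=true)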

@gbaraldi (Member) commented Mar 8, 2023

Native code
1.9

   pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset %rbp, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register %rbp
        movl    %esi, %eax
        movzwl  %di, %ecx
        vmovd   %ecx, %xmm0
        vcvtph2ps       %xmm0, %xmm0
        movzwl  %ax, %ecx
        vmovd   %ecx, %xmm1
        vcvtph2ps       %xmm1, %xmm1
        vucomiss        %xmm0, %xmm1
        cmoval  %edi, %eax
                                        # kill: def $ax killed $ax killed $eax
        popq    %rbp
        .cfi_def_cfa %rsp, 8
        retq

master

        push    rbp
        .cfi_def_cfa_offset 16
        .cfi_offset rbp, -16
        mov     rbp, rsp
        .cfi_def_cfa_register rbp
        movzx   eax, di
        vmovd   xmm0, eax
        vcvtph2ps       xmm0, xmm0
        movzx   eax, si
        vmovd   xmm1, eax
        vcvtph2ps       xmm1, xmm1
        vminss  xmm2, xmm1, xmm0
        vcmpunordss     xmm0, xmm0, xmm0
        vblendvps       xmm0, xmm2, xmm1, xmm0
        vcvtps2ph       xmm0, xmm0, 4
        vmovd   eax, xmm0
                                        # kill: def $ax killed $ax killed $eax
        pop     rbp
        .cfi_def_cfa rsp, 8
        ret

@oscardssmith

@christiangnrd (Contributor) commented Mar 8, 2023

Between which versions? A speedup might be masked by #43327 on M1/M2 which have native Float16, but that doesn't apply to the Raspberry Pi, I think.

I was comparing 1.9.0-rc1 on all of them against 1.10.0-DEV.753 on some and 1.10.0-DEV.759 on the others.

@gbaraldi (Member) commented Mar 8, 2023

Just for reference, the optimization described above makes this code much faster for Float32.

@christiangnrd (Contributor)

3ff718650875b3783176d6bdc5e7b271ec573561 is the first bad commit
commit 3ff718650875b3783176d6bdc5e7b271ec573561
Author: Chris Elrod <[email protected]>
Date:   Tue Dec 27 05:26:37 2022 -0500

    Define `gt_fast` and `ge_fast` (#47972)
    
    The important thing here is that `>` is used in `max_fast`, `min_fast`, and `minmax_fast`.
    These functions did not SIMD in reductions because `>` didn't have an associated fast op.
    This PR fixes that.

 base/fastmath.jl | 4 ++++
 1 file changed, 4 insertions(+)

@giordano (Contributor)

CC @chriselrod

@chriselrod (Contributor)

Just for reference, that optimization done above makes this code very much faster on Float32.

Yes, with 2^10 elements without the PR:

julia> @btime foldl(@fastmath(min), $v32)
  954.714 ns (0 allocations: 0 bytes)
0.0009765625f0

With the PR:

julia> @btime foldl(@fastmath(min), $v32)
  47.266 ns (0 allocations: 0 bytes)
0.00048828125f0

Bad codegen of @llvm.minnum.f16 sounds like an LLVM issue.
If you don't have hardware with native support, I suggest manually converting to Float32 and then truncating when done.

julia> @btime foldl(min, $v)
  1.314 μs (0 allocations: 0 bytes)
Float16(0.0004883)

julia> @btime foldl(@fastmath(min), $v)
  4.861 μs (0 allocations: 0 bytes)
Float16(0.0004883)

julia> @btime Float16(foldl(@fastmath(min), Iterators.map(Float32,$v)))
  69.507 ns (0 allocations: 0 bytes)
Float16(0.0004883)

This will yield much better performance on Julia master than you can get on 1.9.

I assume if you do have hardware support, LLVM's @llvm.minnum.f16 will actually do a good job.

@matthias314 (Contributor, Author) commented Mar 11, 2023

Bad codegen of @llvm.minnum.f16 sounds like an LLVM issue.

If we have to wait for the LLVM folks to correct this, could the following be a temporary fix:

import Base.FastMath: min_fast, max_fast
min_fast(x::Float16, y::Float16) = -max_fast(-x, -y)

Then

julia> v = rand(Float16, 2^10);
julia> @btime foldl(max_fast, $v);
  1.113 μs (0 allocations: 0 bytes)
julia> @btime foldl(min_fast, $v);
  6.353 μs (0 allocations: 0 bytes)  # original min_fast
  1.111 μs (0 allocations: 0 bytes)  # definition above

If you don't have hardware with native support, I suggest manually converting to Float32 and then truncating when done.

Would it make sense to make this approach the default definition of sum, prod, maximum, minimum and extrema for Float16? On my computer this would give the following speed-up factors: 42 for sum, 2.9 for prod, 9.5 for maximum and 14 for minimum.
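
A minimal sketch of what such Float32-promoting definitions could look like (the helper and names are hypothetical, just to illustrate the idea; this is not what Base currently defines):

# Hypothetical helpers: promote each Float16 element to Float32, reduce in
# Float32, and only truncate the final result back to Float16.
via_f32(op, v::AbstractArray{Float16}) = Float16(mapreduce(Float32, op, v))

minimum16(v::AbstractArray{Float16}) = via_f32(min, v)  # hypothetical name
maximum16(v::AbstractArray{Float16}) = via_f32(max, v)  # hypothetical name
sum16(v::AbstractArray{Float16})     = via_f32(+, v)    # hypothetical name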

@N5N3 (Member) commented Mar 11, 2023

Would it make sense to make this approach the default definition of

Perhaps we can combine it with _makefast_reduction once we land #45581.

@matthias314 (Contributor, Author)

It seems that the problem has disappeared on 1.10-alpha1:

julia> v = rand(Float16, 2^10);
julia> @btime @fastmath foldl(min, $v);
  1.216 μs (0 allocations: 0 bytes)
julia> @btime @fastmath foldl(max, $v);
  1.192 μs (0 allocations: 0 bytes)

The timings are now almost the same (but around 10% slower than on 1.9).

@giordano (Contributor) commented Nov 1, 2024

This seems to be fixed on master:

$ julia +1.11 -q
julia> using BenchmarkTools

julia> v = rand(Float16, 2^10);

julia> @btime @fastmath foldl(min, $v);
  5.350 μs (0 allocations: 0 bytes)

vs

$ julia +nightly -q
julia> using BenchmarkTools

julia> v = rand(Float16, 2^10);

julia> @btime @fastmath foldl(min, $v);
  1.573 μs (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.12.0-DEV.1525
Commit da74ef1933b (2024-10-31 18:15 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × AMD Ryzen 9 3950X 16-Core Processor
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, znver2)
Threads: 1 default, 0 interactive, 1 GC (on 32 virtual cores)

The LLVM code on v1.12 is also much better and has a proper vector body:

julia> code_llvm(foldl, (typeof(Base.FastMath.min_fast), Vector{Float16}); debuginfo=:none)
; [...]
vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.phi = phi <16 x half> [ %minmax.ident.splat, %vector.ph ], [ %7, %vector.body ]
  %vec.phi51 = phi <16 x half> [ %minmax.ident.splat, %vector.ph ], [ %8, %vector.body ]
  %vec.phi52 = phi <16 x half> [ %minmax.ident.splat, %vector.ph ], [ %9, %vector.body ]
  %vec.phi53 = phi <16 x half> [ %minmax.ident.splat, %vector.ph ], [ %10, %vector.body ]
  %offset.idx = shl i64 %index, 1
  %2 = or disjoint i64 %offset.idx, 4
  %3 = getelementptr i8, ptr %invariant.gep, i64 %2
  %4 = getelementptr half, ptr %3, i64 16
  %5 = getelementptr half, ptr %3, i64 32
  %6 = getelementptr half, ptr %3, i64 48
  %wide.load = load <16 x half>, ptr %3, align 2
  %wide.load54 = load <16 x half>, ptr %4, align 2
  %wide.load55 = load <16 x half>, ptr %5, align 2
  %wide.load56 = load <16 x half>, ptr %6, align 2
  %7 = call fast <16 x half> @llvm.minnum.v16f16(<16 x half> %vec.phi, <16 x half> %wide.load)
  %8 = call fast <16 x half> @llvm.minnum.v16f16(<16 x half> %vec.phi51, <16 x half> %wide.load54)
  %9 = call fast <16 x half> @llvm.minnum.v16f16(<16 x half> %vec.phi52, <16 x half> %wide.load55)
  %10 = call fast <16 x half> @llvm.minnum.v16f16(<16 x half> %vec.phi53, <16 x half> %wide.load56)
  %index.next = add nuw i64 %index, 64
  %11 = icmp eq i64 %index.next, %n.vec
  br i1 %11, label %middle.block, label %vector.body
; [...]

By contrast, on v1.11:

julia> code_llvm(foldl, (typeof(Base.FastMath.min_fast), Vector{Float16}); debuginfo=:none)
; Function Signature: foldl(typeof(Base.FastMath.min_fast), Array{Float16, 1})
define half @julia_foldl_7171(ptr noundef nonnull align 8 dereferenceable(24) %"itr::Array") #0 {
top:
  %0 = getelementptr inbounds i8, ptr %"itr::Array", i64 16
  %.size.sroa.0.0.copyload = load i64, ptr %0, align 8
  %.not.not = icmp eq i64 %.size.sroa.0.0.copyload, 0
  br i1 %.not.not, label %L84, label %L35

L35:                                              ; preds = %top
  %1 = load ptr, ptr %"itr::Array", align 8
  %2 = load half, ptr %1, align 2
  %.not58.not68.not = icmp eq i64 %.size.sroa.0.0.copyload, 1
  br i1 %.not58.not68.not, label %L91, label %L76

L76:                                              ; preds = %L76, %L35
  %3 = phi i64 [ %value_phi2071, %L76 ], [ 1, %L35 ]
  %value_phi2071 = phi i64 [ %6, %L76 ], [ 2, %L35 ]
  %value_phi1970 = phi half [ %7, %L76 ], [ %2, %L35 ]
  %4 = getelementptr inbounds half, ptr %1, i64 %3
  %5 = load half, ptr %4, align 2
  %6 = add i64 %value_phi2071, 1
  %7 = call fast half @llvm.minnum.f16(half %value_phi1970, half %5)
  %exitcond.not = icmp eq i64 %value_phi2071, %.size.sroa.0.0.copyload
  br i1 %exitcond.not, label %L91, label %L76

L84:                                              ; preds = %top
  call void @j_reduce_empty_7183() #11
  unreachable

L91:                                              ; preds = %L76, %L35
  %value_phi19.lcssa = phi half [ %2, %L35 ], [ %7, %L76 ]
  ret half %value_phi19.lcssa
}

I don't know if this was fixed by an upgrade to a newer LLVM version or perhaps #55486, which simplified the code generation for Float16.

@giordano closed this as completed on Nov 1, 2024
@matthias314 (Contributor, Author)

I'm not saying you should reopen this, but I want to point out that, compared to 1.10, I now see a 2-3x slowdown for both min_fast and max_fast. For Julia 1.10.5 I get

julia> v = rand(Float16, 2^10);

julia> @b @fastmath foldl(min, $v)
1.200 μs

julia> @b @fastmath foldl(max, $v)
1.191 μs

For master it's

julia> @b @fastmath foldl(min, $v)
3.021 μs

julia> @b @fastmath foldl(max, $v)
2.256 μs

Julia Version 1.12.0-DEV.1526
Commit 7715cf287a (2024-10-31 23:07 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, skylake)

@oscardssmith (Member)

I believe it's the expected outcome that Float16 will be slow on a CPU without native support for it. Now that there is hardware that does support Float16, the path we're taking is to optimize the routines for that hardware (on hardware without native support, you don't want to use Float16 for compute anyway).
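
In practice that means the pattern suggested earlier in this thread: keep the data stored as Float16 but do the arithmetic in Float32. Just the idiom, not a Base API:

v = rand(Float16, 2^10)        # storage stays compact
Float16(minimum(Float32, v))   # compute in Float32, truncate only the result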

@giordano (Contributor) commented Nov 1, 2024

Can't reproduce:

$ for version in 1.10 1.11 nightly; do julia +${version} -e '@show VERSION; using BenchmarkTools; v = rand(Float16, 2^10); @btime @fastmath foldl(min, $v); @btime @fastmath foldl(max, $v)'; done
VERSION = v"1.10.6"
  3.319 μs (0 allocations: 0 bytes)
  3.320 μs (0 allocations: 0 bytes)
VERSION = v"1.11.1"
  5.657 μs (0 allocations: 0 bytes)
  1.128 μs (0 allocations: 0 bytes)
VERSION = v"1.12.0-DEV.1535"
  1.586 μs (0 allocations: 0 bytes)
  1.165 μs (0 allocations: 0 bytes)

Same machine as above (no native Float16)
