performance regression for `min_fast` and `Float16` on master #48848
I tested this on:
And I am pretty sure this is an x64-architecture-specific bug, because I was able to reproduce the regression on AMD and Intel processors on both Linux and Windows. With the ARM processors on both macOS and Linux, I saw a significant performance improvement, anywhere from 3x (Raspberry Pi) to 30x (M1 and M2).
Between which versions? A speedup might be masked by #43327 on M1/M2, which have native Float16, but that doesn't apply to the Raspberry Pi, I think.
Ok, this is a very odd failure. Look at the following non-optimized IRs. On 1.9:

```llvm
%5 = fcmp olt half %0, %1, !dbg !7
; └└
; ┌ @ essentials.jl:575 within `ifelse`
%6 = zext i1 %5 to i8, !dbg !14
%7 = trunc i8 %6 to i1, !dbg !14
%8 = xor i1 %7, true, !dbg !14
%9 = select i1 %8, half %1, half %0, !dbg !14
; └
ret half %9, !dbg !13
```

But on master we generate

```llvm
%5 = fcmp fast olt half %0, %1, !dbg !7
; └└
; ┌ @ essentials.jl:586 within `ifelse`
%6 = zext i1 %5 to i8, !dbg !12
%7 = trunc i8 %6 to i1, !dbg !12
%8 = xor i1 %7, true, !dbg !12
%9 = select i1 %8, half %1, half %0, !dbg !12
; └
ret half %9, !dbg !11
```

Note the `fast` flag on the comparison. After optimization we get, on 1.9:

```llvm
define half @julia_min_fast_597(half %0, half %1) #0 {
top:
; ┌ @ essentials.jl:575 within `ifelse`
  %2 = fpext half %0 to float
  %3 = fpext half %1 to float
  %4 = fcmp olt float %2, %3
  %5 = select i1 %4, half %0, half %1
; └
  ret half %5
}
```

and on master:

```llvm
define half @julia_min_fast_303(half %0, half %1) #0 {
top:
; ┌ @ essentials.jl:586 within `ifelse`
  %2 = call fast half @llvm.minnum.f16(half %0, half %1)
; └
  ret half %2
}
```

And it seems that the codegen for `llvm.minnum.f16` is quite bad.
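For anyone who wants to regenerate these dumps, something along the following lines should work (a sketch; the exact label numbers, `!dbg` metadata, and `essentials.jl` line numbers will differ between builds):

```julia
using Base.FastMath: min_fast

x, y = Float16(1.0), Float16(2.0)

# Unoptimized IR, where the `fast` flag on the fcmp is visible
@code_llvm optimize=false min_fast(x, y)

# Optimized IR, where master lowers the ifelse to a call to @llvm.minnum.f16
@code_llvm optimize=true min_fast(x, y)
```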
Native code on 1.9:

```asm
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset %rbp, -16
movq %rsp, %rbp
.cfi_def_cfa_register %rbp
movl %esi, %eax
movzwl %di, %ecx
vmovd %ecx, %xmm0
vcvtph2ps %xmm0, %xmm0
movzwl %ax, %ecx
vmovd %ecx, %xmm1
vcvtph2ps %xmm1, %xmm1
vucomiss %xmm0, %xmm1
cmoval %edi, %eax
# kill: def $ax killed $ax killed $eax
popq %rbp
.cfi_def_cfa %rsp, 8
retq
```

and on master:

```asm
push rbp
.cfi_def_cfa_offset 16
.cfi_offset rbp, -16
mov rbp, rsp
.cfi_def_cfa_register rbp
movzx eax, di
vmovd xmm0, eax
vcvtph2ps xmm0, xmm0
movzx eax, si
vmovd xmm1, eax
vcvtph2ps xmm1, xmm1
vminss xmm2, xmm1, xmm0
vcmpunordss xmm0, xmm0, xmm0
vblendvps xmm0, xmm2, xmm1, xmm0
vcvtps2ph xmm0, xmm0, 4
vmovd eax, xmm0
# kill: def $ax killed $ax killed $eax
pop rbp
.cfi_def_cfa rsp, 8
ret
```
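These listings can be regenerated with `@code_native` (a sketch; the first listing above is in AT&T syntax and the second in Intel syntax, which can be selected with the `syntax` keyword):

```julia
using Base.FastMath: min_fast

x, y = Float16(1.0), Float16(2.0)

# AT&T syntax, matching the 1.9 listing above
@code_native syntax=:att debuginfo=:none min_fast(x, y)

# Intel syntax, matching the master listing above
@code_native syntax=:intel debuginfo=:none min_fast(x, y)
```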
I was comparing 1.9.0-rc1 on all of them against 1.10.0-DEV.753 on some and 1.10.0-DEV.759 on others.
Just for reference, the optimization done above makes this code very much faster on
CC @chriselrod
Yes, with

```julia
julia> @btime foldl(@fastmath(min), $v32)
  954.714 ns (0 allocations: 0 bytes)
0.0009765625f0
```

With the PR:

```julia
julia> @btime foldl(@fastmath(min), $v32)
  47.266 ns (0 allocations: 0 bytes)
0.00048828125f0
```

Bad codegen of

```julia
julia> @btime foldl(min, $v)
  1.314 μs (0 allocations: 0 bytes)
Float16(0.0004883)

julia> @btime foldl(@fastmath(min), $v)
  4.861 μs (0 allocations: 0 bytes)
Float16(0.0004883)

julia> @btime Float16(foldl(@fastmath(min), Iterators.map(Float32,$v)))
  69.507 ns (0 allocations: 0 bytes)
Float16(0.0004883)
```

This will yield much better performance on Julia master than you can get on 1.9. I assume if you do have hardware support, LLVM's
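The last benchmark line above is essentially a caller-side widening workaround. Packaged as a helper it might look like this (a sketch; the name `fastmin16` is made up here):

```julia
using BenchmarkTools

# Hypothetical helper: reduce in Float32 and narrow the result back to Float16.
# This sidesteps the poor scalar @llvm.minnum.f16 lowering on CPUs without
# native Float16 arithmetic.
fastmin16(v::AbstractVector{Float16}) =
    Float16(foldl(@fastmath(min), Iterators.map(Float32, v)))

v = rand(Float16, 2^10)
@btime fastmin16($v)
```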
If we have to wait for the LLVM folks to correct this, could the following be a temporary fix:
Then
Would it make sense to make this approach the default definition of
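One possible shape for such a temporary fix, assuming the idea is to do the fast-math comparison in `Float32` and only narrow back at the end (a sketch, not the actual code proposed in this comment):

```julia
# Hypothetical Float16 methods for the fast-math min/max that widen to Float32,
# avoiding the scalar @llvm.minnum.f16 path on CPUs without native Float16.
import Base.FastMath: min_fast, max_fast

min_fast(x::Float16, y::Float16) = Float16(min_fast(Float32(x), Float32(y)))
max_fast(x::Float16, y::Float16) = Float16(max_fast(Float32(x), Float32(y)))
```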
Perhaps we can combine it with
It seems that the problem has disappeared on 1.10-alpha1:
The timings are now almost the same (but around 10% slower than on 1.9).
This seems to be fixed on nightly:

```
$ julia +1.11 -q

julia> using BenchmarkTools

julia> v = rand(Float16, 2^10);

julia> @btime @fastmath foldl(min, $v);
  5.350 μs (0 allocations: 0 bytes)
```

vs

```
$ julia +nightly -q

julia> using BenchmarkTools

julia> v = rand(Float16, 2^10);

julia> @btime @fastmath foldl(min, $v);
  1.573 μs (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.12.0-DEV.1525
Commit da74ef1933b (2024-10-31 18:15 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × AMD Ryzen 9 3950X 16-Core Processor
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, znver2)
Threads: 1 default, 0 interactive, 1 GC (on 32 virtual cores)
```

LLVM code on v1.12 is also much better and it has a proper vector body:

```julia
julia> code_llvm(foldl, (typeof(Base.FastMath.min_fast), Vector{Float16}); debuginfo=:none)
; [...]
vector.body:                          ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.phi = phi <16 x half> [ %minmax.ident.splat, %vector.ph ], [ %7, %vector.body ]
  %vec.phi51 = phi <16 x half> [ %minmax.ident.splat, %vector.ph ], [ %8, %vector.body ]
  %vec.phi52 = phi <16 x half> [ %minmax.ident.splat, %vector.ph ], [ %9, %vector.body ]
  %vec.phi53 = phi <16 x half> [ %minmax.ident.splat, %vector.ph ], [ %10, %vector.body ]
  %offset.idx = shl i64 %index, 1
  %2 = or disjoint i64 %offset.idx, 4
  %3 = getelementptr i8, ptr %invariant.gep, i64 %2
  %4 = getelementptr half, ptr %3, i64 16
  %5 = getelementptr half, ptr %3, i64 32
  %6 = getelementptr half, ptr %3, i64 48
  %wide.load = load <16 x half>, ptr %3, align 2
  %wide.load54 = load <16 x half>, ptr %4, align 2
  %wide.load55 = load <16 x half>, ptr %5, align 2
  %wide.load56 = load <16 x half>, ptr %6, align 2
  %7 = call fast <16 x half> @llvm.minnum.v16f16(<16 x half> %vec.phi, <16 x half> %wide.load)
  %8 = call fast <16 x half> @llvm.minnum.v16f16(<16 x half> %vec.phi51, <16 x half> %wide.load54)
  %9 = call fast <16 x half> @llvm.minnum.v16f16(<16 x half> %vec.phi52, <16 x half> %wide.load55)
  %10 = call fast <16 x half> @llvm.minnum.v16f16(<16 x half> %vec.phi53, <16 x half> %wide.load56)
  %index.next = add nuw i64 %index, 64
  %11 = icmp eq i64 %index.next, %n.vec
  br i1 %11, label %middle.block, label %vector.body
; [...]
```

Instead on v1.11:

```julia
julia> code_llvm(foldl, (typeof(Base.FastMath.min_fast), Vector{Float16}); debuginfo=:none)
; Function Signature: foldl(typeof(Base.FastMath.min_fast), Array{Float16, 1})
define half @julia_foldl_7171(ptr noundef nonnull align 8 dereferenceable(24) %"itr::Array") #0 {
top:
  %0 = getelementptr inbounds i8, ptr %"itr::Array", i64 16
  %.size.sroa.0.0.copyload = load i64, ptr %0, align 8
  %.not.not = icmp eq i64 %.size.sroa.0.0.copyload, 0
  br i1 %.not.not, label %L84, label %L35

L35:                                  ; preds = %top
  %1 = load ptr, ptr %"itr::Array", align 8
  %2 = load half, ptr %1, align 2
  %.not58.not68.not = icmp eq i64 %.size.sroa.0.0.copyload, 1
  br i1 %.not58.not68.not, label %L91, label %L76

L76:                                  ; preds = %L76, %L35
  %3 = phi i64 [ %value_phi2071, %L76 ], [ 1, %L35 ]
  %value_phi2071 = phi i64 [ %6, %L76 ], [ 2, %L35 ]
  %value_phi1970 = phi half [ %7, %L76 ], [ %2, %L35 ]
  %4 = getelementptr inbounds half, ptr %1, i64 %3
  %5 = load half, ptr %4, align 2
  %6 = add i64 %value_phi2071, 1
  %7 = call fast half @llvm.minnum.f16(half %value_phi1970, half %5)
  %exitcond.not = icmp eq i64 %value_phi2071, %.size.sroa.0.0.copyload
  br i1 %exitcond.not, label %L91, label %L76

L84:                                  ; preds = %top
  call void @j_reduce_empty_7183() #11
  unreachable

L91:                                  ; preds = %L76, %L35
  %value_phi19.lcssa = phi half [ %2, %L35 ], [ %7, %L76 ]
  ret half %value_phi19.lcssa
}
```

I don't know if this was fixed by an upgrade to a newer LLVM version or perhaps #55486, which simplified the code generation for
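One way to narrow down the LLVM-upgrade hypothesis is to compare the bundled LLVM version on the two builds (a minimal sketch; `versioninfo()` above already reports the same information):

```julia
# Run on both 1.11 and nightly: if the reported LLVM versions match,
# the improvement is more likely due to #55486 than to an LLVM upgrade.
@show VERSION
@show Base.libllvm_version
```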
I'm not saying that you should reopen this, but I just want to point out that compared to 1.10 I now see a 2-3x slowdown for both

For master it's
I believe it's the expected outcome that Float16 will be slow without a CPU that supports it. Now that there is actually hardware with Float16 support, the path we're taking is to optimize the routines for that hardware (since on hardware that doesn't support it, you don't want to use Float16 for compute anyway).
Can't reproduce:

```
$ for version in 1.10 1.11 nightly; do julia +${version} -e '@show VERSION; using BenchmarkTools; v = rand(Float16, 2^10); @btime @fastmath foldl(min, $v); @btime @fastmath foldl(max, $v)'; done
VERSION = v"1.10.6"
  3.319 μs (0 allocations: 0 bytes)
  3.320 μs (0 allocations: 0 bytes)
VERSION = v"1.11.1"
  5.657 μs (0 allocations: 0 bytes)
  1.128 μs (0 allocations: 0 bytes)
VERSION = v"1.12.0-DEV.1535"
  1.586 μs (0 allocations: 0 bytes)
  1.165 μs (0 allocations: 0 bytes)
```

Same machine as above (no native Float16).
Current master is almost 6x slower than before:
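A reproduction of the kind used throughout this thread looks roughly like the following (a sketch; the vector size and the use of `foldl` are assumptions based on the later comments, and no timings are claimed here):

```julia
using BenchmarkTools

v = rand(Float16, 2^10)

# Fast-math reduction that hits Base.FastMath.min_fast on Float16
@btime @fastmath foldl(min, $v)

# For comparison: the non-fast-math path and the wider element types
@btime foldl(min, $v)
@btime @fastmath foldl(min, $(rand(Float32, 2^10)))
@btime @fastmath foldl(min, $(rand(Float64, 2^10)))
```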
`max_fast` is not affected, nor are `Float32` and `Float64`.