
Improve accuracy of conversions to floating-point numbers (#129) #138

Merged (1 commit) on Nov 22, 2019

Conversation

kimikage
Collaborator

This fixes (mitigates) issue #129.

@kimikage
Collaborator Author

kimikage commented Nov 17, 2019

Benchmark

see:
#129 (comment)
#129 (comment)
#129 (comment)

Known Issues for Users

Rounding error

In some cases, a 1-lsb error still occurs. (See also: #129 (comment).)

For example:

julia> x = reinterpret(N4f28, 0b1111_1111_1111_1111_1111_1111_0111_0000)
15.999999523N4f28

julia> Float32(x)
16.0f0

julia> Float32(BigFloat(Rational(x)))
15.999999f0

Workaround: use division.
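The division workaround can be sketched with Base alone (the constants below mirror the N4f28 example above, where rawone(N4f28) == 2^28 - 1; no package is needed):

```julia
# Division-based conversion of the raw bits from the example above.
# Dividing by rawone rounds exactly once, so the 1-lsb error does not appear here.
raw = 0xFFFFFF70                   # same bits as 0b1111_1111_1111_1111_1111_1111_0111_0000
f_div = Float32(raw / (2^28 - 1))  # 2^28 - 1 == rawone(N4f28)
```

Here f_div is 15.999999f0, matching Float32(BigFloat(Rational(x))) rather than the multiplication-based 16.0f0.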

Slowdown

With a few exceptions, the accuracy comes at the price of speed.
The slowdown is particularly noticeable in the following cases:

  • N8f8, N7f9, N6f10, N5f11, N4f12, N3f13, N2f14, N1f15 -> Float32
    N16

  • N15f17, N8f24, N7f25, N6f26, N3f29, N1f31 -> Float64
    N32

Workaround: use scaledual. (See also: #130.)
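The idea behind scaledual can be sketched with Base alone: hoist the reciprocal of rawone out of the loop and pay one cheap multiplication per element. (The raw UInt8 data standing in for N0f8 values below is illustrative.)

```julia
raw  = UInt8[0x00, 0x33, 0xff]   # stand-ins for N0f8 raw values
b    = 1.0 / 255                 # "dual" scale factor (1 / rawone), computed once
vals = b .* Float64.(raw)        # one multiply per element, no per-element division
```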

Slowdown with ReinterpretArray

The conversion methods are designed to be suitable for SIMD vectorization. However, the current ReinterpretArray is not SIMD-friendly (cf. #125, JuliaLang/julia#28980), so it causes a further slowdown.
In particular, in the cases Normed{UInt32,f} -> Float32 where 2 <= f <= 24, LLVM generates a destructive conditional branch.

Workaround: use Array instead of ReinterpretArray (i.e. avoid rand(::Type{<:Normed}, sz::Dims)).
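A minimal Base-only illustration of this workaround (Int32 stands in for a Normed raw type here):

```julia
raw = UInt32[0x00000001, 0xffffffff]
r = reinterpret(Int32, raw)   # lazy ReinterpretArray: currently SIMD-hostile
a = collect(r)                # plain Vector{Int32}: SIMD-friendly in hot loops
```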

Conversions to BigFloat

This PR does not change the conversions to BigFloat. I think BigFloat users need accuracy rather than speed.

Workaround: use Rational as the first step of a two-step conversion.
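Sketched with Base types only (the raw value and rawone below are illustrative; in the package, Rational(x::Normed) presumably builds the same exact ratio from the raw value and rawone):

```julia
raw, rone = 0x33, 0xff        # hypothetical N0f8 raw value and its rawone
r = Int(raw) // Int(rone)     # step 1: exact Rational (51//255 == 1//5)
big = BigFloat(r)             # step 2: round exactly once, at BigFloat precision
```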

@kimikage kimikage force-pushed the issue129 branch 5 times, most recently from a782be8 to 2cfddd3 on November 19, 2019 13:42
Member

@timholy timholy left a comment

This is great! Perhaps check the @pure comment below, then merge at will!

src/normed.jl Outdated
macro exp2(n)
:(_exp2(Val($(esc(n)))))
end
@generated _exp2(::Val{N}) where {N} = (x = exp2(N); :($x))
Member

Could both of these be Base.@pure? With @generated one should technically provide a fallback (e.g., https://github.com/JuliaLang/julia/blob/e5ee8da70a646f1f54730e727932ebaeaee256d5/base/namedtuple.jl#L75-L83) in preparation for the day when it becomes more common to ship precompiled binaries.
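The fallback pattern being referenced is Julia's "optionally generated function": `if @generated` lets one definition provide both a generation-time constant and an ordinary runtime fallback. A minimal sketch (the name exp2val is illustrative, not from the package):

```julia
function exp2val(::Val{N}) where {N}
    if @generated
        x = exp2(N)
        :($x)        # generated path: folded to a constant at generation time
    else
        exp2(N)      # fallback path: runs if code generation is unavailable
    end
end
```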

Collaborator Author

@kimikage kimikage Nov 19, 2019

Oh! _exp2(::Val{N}) where {N} = exp2(N) seems to work well! (not fully tested)
I found that a generated function can work around the integer-division problems, and then I thought the same technique could be used for exp2, so I wrote it as shown above.

However, I don't know of any substitute for the generated inv_rawone().
Should I add a fallback? I am looking for documentation about @generated.

Collaborator Author

@kimikage kimikage Nov 20, 2019

Although I do not fully understand the mechanism of @generated, I added a fallback following that example.

inv_rawone(x) = (@generated) ? (y = 1.0 / rawone(x); :($y)) : 1.0 / rawone(x)

BTW, I am thinking of moving these low-level utility functions and macros into a new file (e.g. "src/utilities.jl").
Surprisingly, rawone is defined in "src/normed.jl" even though it can be called with a Fixed argument. Moreover, although there is no noticeable overhead, the implementation of rawone takes a detour:

interpret> FixedPointNumbers.rawone(1N5f3)
rawone(v) in FixedPointNumbers at ~/.julia/packages/FixedPointNumbers/870tH/src/normed.jl:38
 one(x::Normed) in FixedPointNumbers at ~/.julia/packages/FixedPointNumbers/870tH/src/normed.jl:37
  oneunit(x::Normed) in FixedPointNumbers at ~/.julia/packages/FixedPointNumbers/870tH/src/normed.jl:36
   one(::Type{T}) where T<:Normed in FixedPointNumbers at ~/.julia/packages/FixedPointNumbers/870tH/src/normed.jl:34
    oneunit(::Type{T}) where T<:Normed in FixedPointNumbers at ~/.julia/packages/FixedPointNumbers/870tH/src/normed.jl:32
  #unused# = Normed{UInt8,3}
  T = Normed{UInt8,3}
 31  function oneunit(::Type{T}) where T <: Normed
 32      T(typemax(rawtype(T)) >> (8 * sizeof(T) - nbitsfrac(T)), 0)
     end

Even though we can get the rawone value (which isa Unsigned) in oneunit(::Type{T}), it is converted to Normed and then reinterpreted back to Unsigned.

Should I refactor zero, one, oneunit and rawone first, as part of #139? Or keep the utility functions in "src/normed.jl" for the moment?

Member

You're right that making this more straightforward seems desirable.

Member

With

diff --git a/src/normed.jl b/src/normed.jl
index 737657b..07f5b86 100644
--- a/src/normed.jl
+++ b/src/normed.jl
@@ -114,7 +114,7 @@ end
 _exp2(::Val{N}) where {N} = exp2(N)
 
 # for Julia v1.0, which does not fold `div_float` before inlining
-inv_rawone(x) = (@generated) ? (y = 1.0 / rawone(x); :($y)) : 1.0 / rawone(x)
+@pure inv_rawone(x) = 1.0 / rawone(x)
 
 function (::Type{T})(x::Normed) where {T <: AbstractFloat}
     # The following optimization for constant division may cause rounding errors.

I get

julia> x = N8f24(15.2)
15.2N8f24

julia> @code_llvm Float32(x)

;  @ /home/tim/.julia/dev/FixedPointNumbers/src/normed.jl:159 within `Float32'
define float @julia_Float32_17656(%jl_value_t addrspace(10)*, { i32 } addrspace(11)* nocapture nonnull readonly dereferenceable(4)) {
top:
;  @ /home/tim/.julia/dev/FixedPointNumbers/src/normed.jl:160 within `Float32'
; ┌ @ Base.jl:20 within `getproperty'
   %2 = getelementptr inbounds { i32 }, { i32 } addrspace(11)* %1, i64 0, i32 0
; └
; ┌ @ int.jl:498 within `unsafe_trunc'
; │┌ @ int.jl:464 within `rem'
    %3 = load i32, i32 addrspace(11)* %2, align 4
; └└
;  @ /home/tim/.julia/dev/FixedPointNumbers/src/normed.jl:169 within `Float32'
; ┌ @ promotion.jl:349 within `<' @ int.jl:49
   %4 = icmp sgt i32 %3, -1
; └
  %5 = select i1 %4, double 0.000000e+00, double 0x4070000010000010
;  @ /home/tim/.julia/dev/FixedPointNumbers/src/normed.jl:170 within `Float32'
; ┌ @ float.jl:60 within `Float64'
   %6 = sitofp i32 %3 to double
; └
; ┌ @ float.jl:410 within `muladd'
   %7 = fmul contract double %6, 0x3E70000010000010
   %8 = fadd contract double %7, %5
; └
;  @ /home/tim/.julia/dev/FixedPointNumbers/src/normed.jl:170 within `Float32' @ float.jl:251
  %9 = fptrunc double %8 to float
;  @ /home/tim/.julia/dev/FixedPointNumbers/src/normed.jl:170 within `Float32'
  ret float %9
}

on Julia 1.0. It is sometimes joked that @pure means "ask Jameson if it's OK," but here I think it's pretty straightforward.

Collaborator Author

@kimikage kimikage Nov 21, 2019

Could you go into more detail?
This is a matter of speed, not correctness. In the "Matrix of Vec4" case, "master" is slower than the "simple div" except in the BigFloat cases.
#129 (comment)

The default inline_cost_threshold is 100, and the cost of div_float is 20.
For example, if we want to use 4 packed numbers, the cost of a single method call (the callee) should be reduced to 100 / 4 = 25, or inlining into the caller may fail and the SIMD vectorization will be incomplete. In addition, failed SIMD vectorization causes the broadcast optimization to fail. Therefore one div_float (i.e. the "simple div") is OK, but two div_floats are not. Although an intermediate variable can reduce the number of div_floats, there are still mul_float, fptrunc and other small operations.
Julia's optimizer is surely more complicated than this, but in my experience this explains the slowdown on Julia v1.0.
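A Base-only sketch of this cost argument (the types, names and UInt16 scale are illustrative, not the package's code): keeping the callee down to a single multiplication leaves it cheap enough to inline, so the loop can vectorize.

```julia
tofloat(i::UInt16) = Float32(i) * Float32(1 / 0xffff)   # one mul, no div_float
function convert_all!(out::Vector{Float32}, raw::Vector{UInt16})
    @inbounds @simd for k in eachindex(raw, out)
        out[k] = tofloat(raw[k])   # cheap callee: inlines, loop vectorizes
    end
    return out
end
```

Replacing the multiplication with a per-element division raises the callee's cost by div_float and can tip the caller past the inlining threshold.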

Member

Oh, I wasn't looking at vectorization, sorry. The point is that the LLVM IR is the last stage before native code.

If @pure doesn't work, let's keep your @generated version.

Collaborator Author

The point is that the LLVM IR is the last stage before native code.

I get the point, but even if it's correct in the LLVM IR...

Especially, in cases of Normed{UInt32,f}->Float32 where 2 <= f <= 24, the LLVM generates a destructive conditional branch.

  %5 = select i1 %4, double 0.000000e+00, double 0x4070000010000010

-> 💥 💥 💥

        vxorpd  %xmm0, %xmm0, %xmm0
        testl   %eax, %eax
        jns     L30
        vmovsd  .LCPI639_0, %xmm0       # xmm0 = mem[0],zero
L30:

I think there is almost no way to intervene in the LLVM backend from Julia scripts. By using -0.0 instead of 0.0, the conditional branch (jns) can be removed, but that version stalls under SIMD. So I gave up!
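For reference, a scalar Base-only reconstruction of the kind of code in question, with constants for N8f24 matching timholy's earlier IR dump (c == 1/rawone(N8f24), and the offset 0x4070000010000010 == 2^32/(2^24 - 1); the function name is illustrative):

```julia
function n8f24_to_float32(raw::UInt32)
    s = reinterpret(Int32, raw)                        # signed view of the bits
    c = 1.0 / (2^24 - 1)                               # 1 / rawone(N8f24)
    offset = ifelse(s < 0, 2.0^32 / (2^24 - 1), 0.0)   # undoes s == raw - 2^32
    return Float32(muladd(Float64(s), c, offset))      # the muladd from the IR
end
```

The `ifelse` is what becomes the `select`, and on the scalar path the backend may lower it to the `jns` branch shown above.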

Member

I probably don't know as much about SIMD as you do, but is select actually destructive? It's not technically a branch. See https://github.com/eschnett/SIMD.jl#representing-simd-vector-types-in-julia (search for select) and http://blog.llvm.org/2011/12/llvm-31-vector-changes.html (suggesting it went into LLVM 3.1).

Of course, whether LLVM actually applies these here is another question.

I didn't mean to dissuade you from the @generated solution, by all means adopt that if it works better. I personally would also not stress too much about Julia 1.0 performance, if it's really painful. "Use a more recent Julia version" is a perfectly fine response to a performance concern.

Collaborator Author

@kimikage kimikage Nov 22, 2019

In general, select is not destructive. Since this is a rare (technically interesting but personally hateful) case, I just left a comment about it. This is a problem with the non-SIMD LLVM IR, such as the one you dumped above. Moreover, although I don't like the branch, it is still reasonable in terms of binary size (i.e. cache efficiency) and the high-performance branch prediction of modern CPUs. (This use case unfortunately does not play well with branch prediction, but the LLVM backend cannot know that.)

I didn't mean to dissuade you from the @generated solution, by all means adopt that if it works better.

I am aware of that; it is just my hesitation. The following comment, which I wrote above:

BTW, I am thinking of moving these low-level utility functions and macros to a new file (e.g. "src/utilities.jl")

implicitly suggests that perhaps I will add more @generated functions (e.g. div_rawones(a, b) = rawone(a) / rawone(b)) without users noticing. I do not think it is good to (over-)use @generated, but it is certainly powerful.

Now, I am not going to resist the temptation of the powerful @generated. This commit is far from what I originally envisioned, but I am happy because float(3N5f3) == 3 comes true! 😆
