fma operation? #6330
Comments
I think we need some sort of interface, but I'm not sure what would be best. @stevengj mentioned that LLVM automatically rewrites multiply/add expressions when hardware fma is available, but it would be useful to have some control over this: in fact this is one of the recommended "should"s of the IEEE 754 standard (section 10.4). In particular, it would be useful to:
Note that if you go this route, you should really have access to four intrinsics:
(Plus a lot more if you go to SIMD versions, since there are vector instructions that allow you to interleave fma and fms within the vector, for example.) The C99 standard screwed up in providing just one. We had to roll our own in FFTW. @matteo-frigo may know more about the current state of compiler enlightenment with these sorts of instructions. The right thing would be for LLVM (and Julia) to preserve an FMA when it occurs explicitly in the code, so that you can get the desired FMA just by parenthesizing your code in the desired way. As I mentioned on the mailing list, gcc didn't always do this, but I don't know where LLVM stands. In theory, I guess there should also be a way to prevent FMAs, although I've never had a practical need for this.
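For illustration, a minimal sketch of what four such variants could look like, expressed on top of a correctly rounded fma. The names below are hypothetical (roughly following hardware mnemonics, not an existing Julia API); negating an argument is exact, so each remains a single fused operation with one rounding:

```julia
# Hypothetical helper names (not an existing Julia API); each is one
# fused multiply-add variant with a single rounding at the end.
fmadd(a, b, c)  = fma(a, b, c)      #   a*b + c
fmsub(a, b, c)  = fma(a, b, -c)     #   a*b - c
fnmadd(a, b, c) = fma(-a, b, c)     #  -a*b + c
fnmsub(a, b, c) = fma(-a, b, -c)    #  -a*b - c
```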
Oh gawd. Do not want.
Just to make it more interesting, here are a couple of issues with FMA, in addition to those pointed out by stevengj.

First, the FMA implemented by PowerPC and by Intel differ in (at least) one subtle way. PowerPC computes +/-(a*b +/- c), whereas Intel implements +/-(a*b) +/- c. The two differ when the result is -0; for example, -(+0*+0 - +0) = -0, whereas +0 - +0*+0 = +0. So a pedantic (I dare not say correct) compiler cannot translate the PowerPC expression into the x86 expression.

Second, common-subexpression elimination makes things interesting. For example, FFTs tend to compute pairs of expressions of the form a +/- b*c, which reduces to two fma's. However, a sufficiently smart compiler will tend to compute t = b*c; a +/- t, which is three flops. Historically in FFTW, we were able to defeat gcc-3.x with some hackish workaround, but gcc-4.[0-2] had a more formidable optimizer that was always isolating the common subexpression no matter what. I haven't tried more recent gcc's or llvm in this respect, but that's something to watch for.
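As a small illustration of the common-subexpression point (the names below are just for this sketch): the butterfly pair a +/- b*c written as two fused operations versus the three-flop form a CSE-happy compiler tends to produce. The fused form avoids the intermediate rounding of b*c, so the two can differ in the last bit:

```julia
# Two fused operations: a + b*c and a - b*c, each with a single rounding.
butterfly_fused(a, b, c) = (fma(b, c, a), fma(-b, c, a))

# What CSE tends to produce instead: three flops, with b*c rounded once
# up front and then reused, so the additions are no longer fused.
function butterfly_cse(a, b, c)
    t = b*c
    (a + t, a - t)
end
```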
OpenCL has two fma operations, fma and mad. The former (fma) specifies that the fused, more accurate semantics is required; unless the equivalent of -ffast-math is specified, this can be expensive if the hardware does not support it. (fma() also exists in C.) The latter (mad) is simply a combined multiply-add operation that simplifies the compiler's task of generating efficient code; it will only translate to a hardware instruction if this is faster than a multiply-add sequence. Which of these do we want for Julia? Both? fma() is clearly useful -- if you need the increased accuracy, you need it. mad() is a nice convenience to have, in particular for types (matrices?) where the compiler otherwise would not be able to synthesize the combined operation.
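For a concrete sense of the accuracy side of that distinction, a small sketch using the fma and muladd functions as Julia ended up providing them (fma demands the singly rounded result; muladd leaves the choice to the compiler):

```julia
x = 1.0 + 2.0^-27       # x^2 = 1 + 2^-26 + 2^-54 exactly

x*x - 1.0               # == 2^-26: the 2^-54 term is lost when x*x is rounded
fma(x, x, -1.0)         # == 2^-26 + 2^-54: a single rounding keeps it
muladd(x, x, -1.0)      # either of the above, whichever is cheapest on the target
```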
I've been thinking about this recently as well. From my own limited experience, there are 3 common situations:

1. you want a combined multiply-add for speed or convenience, and don't particularly care whether it is actually fused;
2. you want the fused (singly rounded) operation for accuracy, even if it has to be emulated in software;
3. you want the fused operation only when a hardware instruction is available, and would otherwise use a different algorithm entirely.

I'm not sure where the situation mentioned by @matteo-frigo falls, or if it deserves another category.

(1) is perhaps the most common: …
(2) is the current behaviour. If we were following either of the first two or the macro approach above, we might want to consider an …

(3) I'm not sure the best way to handle this. We could have a global flag (such as the `NATIVEFMA` used below), e.g.:

```julia
function mulpi_ext(x::Float64)
    m = 3.141592653589793
    m_hi = 3.1415926218032837
    m_lo = 3.178650954705639e-8

    y_hi = m*x
    if NATIVEFMA
        # with fma()
        y_lo = fma(m_lo, x, fma(m_hi, x, -y_hi))
    else
        # without fma(): split x into high and low parts (Dekker splitting)
        u = 134217729.0*x # 134217729 = 2^27 + 1 (0x1p27 + 1)
        x_hi = u - (u - x)
        x_lo = x - x_hi
        y_lo = x_hi*m_lo + (x_lo*m_hi + ((x_hi*m_hi - y_hi) + x_lo*m_lo))
    end
    Double64(y_hi, y_lo)
end
```

Admittedly, it would be a bit messy, but mostly would be restricted to library code. This also assumes that the compiler is able to replace the …
Isn't your case (3) the same as (1)? Whether an instruction is available in hardware or needs to be open-coded as a sequence of instructions only affects performance, hence this is a choice about how fast things should run. The fma() operation (without intermediate rounding) is part of the IEEE standard; we should provide it in any case. Your explicit code transformation is equivalent to @fastmath or -ffast-math. There are typically a few similar IEEE features where the programmer may want to specify "I didn't think about associativity here; feel free to re-associate for performance" or "I don't care about inf/nan/denormal semantics here; feel free to optimize".

I'd go for expressing semantics only in the code, with muladd() as a convenience function and fma() as per the IEEE standard. Optimizations and allowed transformations come later, either via macros or via compiler options.
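A typical example of the convenience case is polynomial evaluation by Horner's rule, where the exact rounding of each step doesn't matter but a hardware fma is welcome if present; a sketch using the muladd() spelling proposed above:

```julia
# Evaluate coeffs[1] + coeffs[2]*x + coeffs[3]*x^2 + ... by Horner's rule.
# muladd() says "combined multiply-add, fused or not, whichever is faster",
# so the last bit of the result may depend on the target hardware.
function horner(x, coeffs)
    acc = coeffs[end]
    for i in length(coeffs)-1:-1:1
        acc = muladd(acc, x, coeffs[i])
    end
    return acc
end

horner(2.0, (1.0, 3.0, 0.5))   # 1 + 3*2 + 0.5*2^2 = 9.0
```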
The distinction is that in (1) you're happy for the fallback to be simply `a*b + c`. The other point is that in (3) you probably want to write the fallback separately, as software fma's are woefully inefficient: e.g. see the openlibm code.

+1 for a …
I think this issue can now be closed, as #9855 covers the indicator issue.
LLVM has a floating-point fma operation: http://llvm.org/docs/LangRef.html#llvm-fmuladd-intrinsic

Does it make sense to have an `fma` function? It makes sense for floating-point, can be emulated for integers and other numeric types with a simple fallback, and could allow matrix code doing `A*B + C` to avoid temporaries.
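A minimal sketch of those two ideas, under stated assumptions: the fma_fallback name is made up for illustration, and the 5-argument mul! used for the in-place accumulation is today's LinearAlgebra spelling, not something this issue specifies.

```julia
using LinearAlgebra

# Simple fallback: for integers and other exact types there is no
# intermediate rounding, so an ordinary multiply-add already behaves
# like a fused one.
fma_fallback(a, b, c) = a*b + c

# For matrices, a combined multiply-add can avoid materializing the
# A*B temporary: 5-argument mul! computes C = A*B*α + C*β in place.
fma_fallback(A::AbstractMatrix, B::AbstractMatrix, C::AbstractMatrix) =
    mul!(copy(C), A, B, true, true)

A = [1 2; 3 4]; B = [5 6; 7 8]; C = [1 0; 0 1]
fma_fallback(A, B, C) == A*B + C   # true
```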