MulAdd and MulAddAssign #387

Open · Pr0methean opened this issue on Jan 5, 2024 · 9 comments
Labels: C-feature-request (Category: a feature request, i.e. not implemented / a PR)

Comments

@Pr0methean

AVX2 and ARM NEON have fused multiply-add instructions, so it would be useful to be able to emit them explicitly through implementations of MulAdd and MulAddAssign. FMA is the basis of peak-FLOP/s figures of merit, so this would likely improve performance on matrix-multiplication benchmarks.
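
To sketch what the request amounts to: the impls would have to live inside portable-simd (or num-traits) because of the orphan rule, and could delegate to the existing fused operation. A minimal sketch, assuming only the num_traits trait definitions and std::simd items; none of this exists today:

```rust
use std::simd::{LaneCount, Simd, StdFloat, SupportedLaneCount};
use num_traits::{MulAdd, MulAddAssign};

// Hypothetical: must live in portable-simd or num-traits (orphan rule).
impl<const N: usize> MulAdd for Simd<f32, N>
where
    LaneCount<N>: SupportedLaneCount,
{
    type Output = Self;

    #[inline]
    fn mul_add(self, a: Self, b: Self) -> Self {
        StdFloat::mul_add(self, a, b) // fused: self * a + b, one rounding
    }
}

impl<const N: usize> MulAddAssign for Simd<f32, N>
where
    LaneCount<N>: SupportedLaneCount,
{
    #[inline]
    fn mul_add_assign(&mut self, a: Self, b: Self) {
        *self = StdFloat::mul_add(*self, a, b);
    }
}
```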

Pr0methean added the C-feature-request label on Jan 5, 2024
@calebzulawski
Member

mul_add is supported by the StdFloat trait: https://godbolt.org/z/MvoaM5ddW
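
For reference, a minimal usage sketch of that existing API (nightly, requires std):

```rust
#![feature(portable_simd)]
use std::simd::{f32x4, StdFloat};

fn fma(a: f32x4, b: f32x4, c: f32x4) -> f32x4 {
    // Fused multiply-add: a * b + c with a single rounding at the end.
    a.mul_add(b, c)
}

fn main() {
    let r = fma(f32x4::splat(2.0), f32x4::splat(3.0), f32x4::splat(1.0));
    assert_eq!(r, f32x4::splat(7.0));
}
```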

@ds84182

ds84182 commented Sep 8, 2024

> mul_add is supported by the StdFloat trait: https://godbolt.org/z/MvoaM5ddW

StdFloat is not available in no_std environments. It would be nice to have an "imprecise" version of mul_add that falls back to mul + add, rather than a call to fma, when a fused instruction is not available.
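
A minimal sketch of that fallback, assuming a compile-time check of the x86 "fma" target feature (the feature name differs per architecture, and the fused branch still needs StdFloat, i.e. std, which is exactly the gap being described):

```rust
#![feature(portable_simd)]
use std::simd::{f32x4, StdFloat};

// Sketch only: fused when the target is statically known to have FMA,
// plain mul + add otherwise (instead of a slow libm fma call).
#[inline]
fn mul_add_imprecise(a: f32x4, b: f32x4, c: f32x4) -> f32x4 {
    if cfg!(target_feature = "fma") {
        a.mul_add(b, c) // lowers to a hardware FMA instruction
    } else {
        a * b + c // two roundings, but no libm call
    }
}
```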

@programmerjake
Member

programmerjake commented Sep 9, 2024

I've been thinking for a while that Rust should have a function exactly like llvm.fmuladd, for both scalars and vectors, where it's only ever an fma or a separate fmul/fadd pair, with no other options. In particular, it doesn't carry the other "fast math" optimizations that tend to screw up all your numeric properties and/or make things UB.

We could use LLVM's naming, but Rust already uses mul_add for fused multiply-add (which can be a good or a bad thing, depending on how you look at it). Maybe name the new function mul_add_opt_round, since the only difference is that it may round between the mul and the add, whereas mul_add only rounds at the end, never in between.
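
To make that rounding difference concrete, here is a scalar case (stable Rust) where the intermediate rounding is observable; the inputs are chosen so that the product's low bits fall exactly where rounding decides the outcome:

```rust
fn main() {
    let a = 1.0f32 + f32::EPSILON;       // 1 + 2^-23
    let b = 1.0f32 - f32::EPSILON / 2.0; // 1 - 2^-24
    let c = -1.0f32;

    let fused = a.mul_add(b, c); // rounds once, at the end: 2^-24 - 2^-47
    let split = a * b + c;       // a * b rounds to exactly 1.0, so this is 0.0

    assert_ne!(fused, split);
    println!("fused = {fused:e}, split = {split:e}");
}
```

A mul_add_opt_round-style operation could legitimately return either of these two values, but nothing else.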

@HadrienG2

HadrienG2 commented Nov 22, 2024

I fully agree that something like llvm.fmuladd would be very useful. I've had to implement it by hand way too many times.

As far as naming is concerned, I would advise finding a naming convention that can also work for #235 (reductions). From both of these issues I get the same general feeling: a tradeoff between floating-point output reproducibility and efficient hardware implementation, where the perspectives of people who prioritize each concern seem so irreconcilable that providing two APIs, each prioritizing one of the concerns, sounds like the pragmatic choice.

In both cases, it seems to me that providing a "relaxed" variant of the operation that does the computation as efficiently as possible for the target architecture, at the expense of providing non-reproducible output across targets, would be desirable in addition to the existing "reproducible" variant. Of course, we may also want to improve the performance/precision compromise of the "reproducible" variant before stabilizing it, as discussed in #235.

In the C/C++ world, the tradition is to call "relaxed" operations "fast" operations. This would give us names like fast_mul_add or fast_reduce_add. Following this performance-focused naming convention would improve familiarity for C/C++ devs, at the expense of arguably not stressing enough, in the name, that the output of these operations is not reproducible.

Alternatively, we could use a different adjective that is not familiar to C/C++ devs but better expresses the underlying design compromise, e.g. relaxed_mul_add/relaxed_reduce_add.

@calebzulawski
Member

I didn't realize at first that fmuladd is a different intrinsic from fma and isn't just hypothetical. It should be pretty easy to add that intrinsic to Rust.

@andy-thomason

Be a little careful of StdFloat::mul_add, as LLVM has an awful habit of converting it into multiple glibc scalar calls on the default architecture. This may have changed now, so correct me if I'm wrong.

@programmerjake
Member

Note that llvm.fmuladd and fast math mean different things. llvm.fmuladd without any fast-math flags is guaranteed to always be equivalent to either fma or a separate fmul and fadd, chosen non-deterministically, with no other options. Fast-math flags allow way more options, such as reassociation, treating 0.0 and -0.0 as identical, treating Infinity/NaN as UB, reducing precision arbitrarily, etc. So please don't give them both the same name (fast_<op>/relaxed_<op>/etc.), because they're very much not the same thing.
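
In Rust terms, a rough sketch of the distinction, assuming the nightly core_intrinsics fast-math functions (which are UB on non-finite values); the fmuladd-style relaxed form has no scalar equivalent yet, so f32::mul_add stands in for its always-fused half:

```rust
#![feature(core_intrinsics)]
#![allow(internal_features, dead_code)]
use std::intrinsics::{fadd_fast, fmul_fast};

// Fast math: the optimizer may reassociate, and NaN/infinite values are UB.
unsafe fn fast_fma(a: f32, b: f32, c: f32) -> f32 {
    fadd_fast(fmul_fast(a, b), c)
}

// llvm.fmuladd-style: exactly fma(a, b, c) or (a * b) + c, nothing else,
// and fully defined for NaN and infinities. f32::mul_add is the
// always-fused case; the relaxed form is what this thread asks for.
fn strict_fma(a: f32, b: f32, c: f32) -> f32 {
    a.mul_add(b, c)
}
```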

@Pr0methean
Author

> In the C/C++ world, the tradition is to call "relaxed" operations "fast" operations. This would give us names like fast_mul_add or fast_reduce_add. Following this performance-focused naming convention would improve familiarity for C/C++ devs, at the expense of arguably not stressing enough, in the name, that the output of these operations is not reproducible.
>
> Alternatively, we could use a different adjective that is not familiar to C/C++ devs but better expresses the underlying design compromise, e.g. relaxed_mul_add/relaxed_reduce_add.

Java has java.lang.Math and java.lang.StrictMath; I'd favor doing something similar by making some operations instance methods on a MathContext struct, which would have an enum member specifying relaxed or strict. (It could also have members specifying rounding modes etc., as in other languages, or providing access to x87 extended precision on x86_64.)
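
A minimal sketch of that idea; every name here is hypothetical, and the Relaxed arm uses a plain mul + add as a stand-in for an llvm.fmuladd-style lowering:

```rust
// Hypothetical API, loosely mirroring java.lang.Math vs. StrictMath.
#[derive(Clone, Copy)]
pub enum FmaPolicy {
    Strict,  // always fused: reproducible across targets
    Relaxed, // fused or mul + add, whichever the target does fastest
}

#[derive(Clone, Copy)]
pub struct MathContext {
    pub fma: FmaPolicy,
    // Could also carry a rounding mode, x87 extended precision, ...
}

impl MathContext {
    pub fn mul_add(&self, a: f32, b: f32, c: f32) -> f32 {
        match self.fma {
            FmaPolicy::Strict => a.mul_add(b, c),
            // Stand-in for a relaxed (llvm.fmuladd-style) lowering:
            FmaPolicy::Relaxed => a * b + c,
        }
    }
}
```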

@HadrienG2

HadrienG2 commented Nov 23, 2024

> Be a little careful of StdFloat::mul_add, as LLVM has an awful habit of converting it into multiple glibc scalar calls on the default architecture. This may have changed now, so correct me if I'm wrong.

Indeed, if LLVM sees an "unconditional" FMA instruction and cannot prove that the target always has hardware FMA, then it (1) scalarizes everything and (2) introduces a layer of libm call indirection. llvm.fmuladd is about fixing that: it generates MUL+ADD pairs instead of an FMA when target FMA support cannot be proven at compile time.

matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue on Dec 3, 2024:

Add simd_relaxed_fma intrinsic

Adds compiler support for rust-lang/portable-simd#387 (comment)

r? `@workingjubilee`

cc `@RalfJung` is this kind of nondeterminism a problem for miri/opsem?
rust-timer added a commit to rust-lang-ci/rust that referenced this issue on Dec 3, 2024: Rollup merge of rust-lang#133395 - calebzulawski:simd_relaxed_fma, r=workingjubilee (same change as above)
bjorn3 pushed a commit to rust-lang/rustc_codegen_cranelift that referenced this issue on Dec 4, 2024 (same change)
antoyo pushed a commit to rust-lang/rustc_codegen_gcc that referenced this issue on Dec 11, 2024 (same change)
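
A usage sketch of the intrinsic those commits added, assuming the path and signature from rust-lang#133395 (nightly only; the exact location may still change):

```rust
#![feature(core_intrinsics, portable_simd)]
#![allow(internal_features)]
use core::intrinsics::simd::simd_relaxed_fma;
use std::simd::f32x4;

fn relaxed_fma(a: f32x4, b: f32x4, c: f32x4) -> f32x4 {
    // Either fma(a, b, c) or a * b + c; which one is unspecified,
    // but no other fast-math behavior is permitted.
    unsafe { simd_relaxed_fma(a, b, c) }
}
```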

6 participants