MulAdd and MulAddAssign #387

Open · Pr0methean opened this issue on Jan 5, 2024 · 9 comments
Labels: C-feature-request (Category: a feature request, i.e. not implemented / a PR)

Comments

@Pr0methean

AVX2 and ARM NEON have fused multiply-add instructions, so it would be useful to be able to emit them explicitly through implementations of MulAdd and MulAddAssign. FMA is the basis of peak-FLOP/s figures of merit, so this would likely improve performance on matrix-multiplication benchmarks.
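
To sketch what the request amounts to: the impls would have to live inside portable-simd (or num-traits) because of the orphan rule, and could delegate to the existing fused operation. A minimal sketch, assuming only the num_traits trait definitions and std::simd items; none of this exists today:

```rust
use std::simd::{LaneCount, Simd, StdFloat, SupportedLaneCount};
use num_traits::{MulAdd, MulAddAssign};

// Hypothetical: must live in portable-simd or num-traits (orphan rule).
impl<const N: usize> MulAdd for Simd<f32, N>
where
    LaneCount<N>: SupportedLaneCount,
{
    type Output = Self;

    #[inline]
    fn mul_add(self, a: Self, b: Self) -> Self {
        StdFloat::mul_add(self, a, b) // fused: self * a + b, one rounding
    }
}

impl<const N: usize> MulAddAssign for Simd<f32, N>
where
    LaneCount<N>: SupportedLaneCount,
{
    #[inline]
    fn mul_add_assign(&mut self, a: Self, b: Self) {
        *self = StdFloat::mul_add(*self, a, b);
    }
}
```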

Pr0methean added the C-feature-request label on Jan 5, 2024
@calebzulawski
Member

mul_add is supported by the StdFloat trait: https://godbolt.org/z/MvoaM5ddW
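
For reference, a minimal usage sketch of that existing API (nightly, requires std):

```rust
#![feature(portable_simd)]
use std::simd::{f32x4, StdFloat};

fn fma(a: f32x4, b: f32x4, c: f32x4) -> f32x4 {
    // Fused multiply-add: a * b + c with a single rounding at the end.
    a.mul_add(b, c)
}

fn main() {
    let r = fma(f32x4::splat(2.0), f32x4::splat(3.0), f32x4::splat(1.0));
    assert_eq!(r, f32x4::splat(7.0));
}
```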

@ds84182

ds84182 commented Sep 8, 2024

> mul_add is supported by the StdFloat trait: https://godbolt.org/z/MvoaM5ddW

StdFloat is not available in no_std environments. It would be nice to have an "imprecise" version of mul_add that falls back to mul + add, rather than a call to fma, when a fused instruction is not available.
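
A minimal sketch of that fallback, assuming a compile-time check of the x86 "fma" target feature (the feature name differs per architecture, and the fused branch still needs StdFloat, i.e. std, which is exactly the gap being described):

```rust
#![feature(portable_simd)]
use std::simd::{f32x4, StdFloat};

// Sketch only: fused when the target is statically known to have FMA,
// plain mul + add otherwise (instead of a slow libm fma call).
#[inline]
fn mul_add_imprecise(a: f32x4, b: f32x4, c: f32x4) -> f32x4 {
    if cfg!(target_feature = "fma") {
        a.mul_add(b, c) // lowers to a hardware FMA instruction
    } else {
        a * b + c // two roundings, but no libm call
    }
}
```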

@programmerjake
Member

programmerjake commented Sep 9, 2024

I've been thinking for a while that Rust should have a function exactly like llvm.fmuladd, for both scalars and vectors, where it's only ever an fma or a separate fmul/fadd pair, with no other options. In particular, it doesn't carry the other "fast math" optimizations that tend to screw up all your numeric properties and/or make things UB.

We could use LLVM's naming, but Rust already uses mul_add for fused multiply-add (which can be a good or a bad thing, depending on how you look at it). Maybe name the new function mul_add_opt_round, since the only difference is that it may round between the mul and the add, whereas mul_add only rounds at the end, never in between.
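
To make that rounding difference concrete, here is a scalar case (stable Rust) where the intermediate rounding is observable; the inputs are chosen so that the product's low bits fall exactly where rounding decides the outcome:

```rust
fn main() {
    let a = 1.0f32 + f32::EPSILON;       // 1 + 2^-23
    let b = 1.0f32 - f32::EPSILON / 2.0; // 1 - 2^-24
    let c = -1.0f32;

    let fused = a.mul_add(b, c); // rounds once, at the end: 2^-24 - 2^-47
    let split = a * b + c;       // a * b rounds to exactly 1.0, so this is 0.0

    assert_ne!(fused, split);
    println!("fused = {fused:e}, split = {split:e}");
}
```

A mul_add_opt_round-style operation could legitimately return either of these two values, but nothing else.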

@HadrienG2

HadrienG2 commented Nov 22, 2024

I fully agree that something like llvm.fmuladd would be very useful. I've had to implement it by hand way too many times.

As far as naming is concerned, I would advise finding a naming convention that can also work for #235 (reductions). From both of these issues I get the same general feeling: a tradeoff between floating-point output reproducibility and efficient hardware implementation, where the perspectives of people who prioritize each concern seem so irreconcilable that providing two APIs, each prioritizing one of the concerns, sounds like the pragmatic choice.

In both cases, it seems to me that providing a "relaxed" variant of the operation that does the computation as efficiently as possible for the target architecture, at the expense of providing non-reproducible output across targets, would be desirable in addition to the existing "reproducible" variant. Of course, we may also want to improve the performance/precision compromise of the "reproducible" variant before stabilizing it, as discussed in #235.

In the C/C++ world, the tradition is to call "relaxed" operations "fast" operations. This would give us names like fast_mul_add or fast_reduce_add. Following this performance-focused naming convention would improve familiarity for C/C++ devs, at the expense of arguably not stressing enough, in the name, that the output of these operations is not reproducible.

Alternatively, we could use a different adjective that is not familiar to C/C++ devs but better expresses the underlying design compromise, e.g. relaxed_mul_add/relaxed_reduce_add.

@calebzulawski
Member

I didn't realize at first that fmuladd is a different intrinsic from fma and isn't just hypothetical. It should be pretty easy to add that intrinsic to Rust.

@andy-thomason

Be a little careful of StdFloat::mul_add, as LLVM has an awful habit of converting it into multiple glibc scalar calls on the default architecture. This may have changed now, so correct me if I'm wrong.

@programmerjake
Member

Note that llvm.fmuladd and fast math mean different things. llvm.fmuladd without any fast-math flags is guaranteed to always be equivalent to either fma or a separate fmul and fadd, chosen non-deterministically, with no other options. Fast-math flags allow way more options, such as reassociation, treating 0.0 and -0.0 as identical, treating Infinity/NaN as UB, reducing precision arbitrarily, etc. So please don't give them both the same name (fast_<op>/relaxed_<op>/etc.), because they're very much not the same thing.
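
In Rust terms, a rough sketch of the distinction, assuming the nightly core_intrinsics fast-math functions (which are UB on non-finite values); the fmuladd-style relaxed form has no scalar equivalent yet, so f32::mul_add stands in for its always-fused half:

```rust
#![feature(core_intrinsics)]
#![allow(internal_features, dead_code)]
use std::intrinsics::{fadd_fast, fmul_fast};

// Fast math: the optimizer may reassociate, and NaN/infinite values are UB.
unsafe fn fast_fma(a: f32, b: f32, c: f32) -> f32 {
    fadd_fast(fmul_fast(a, b), c)
}

// llvm.fmuladd-style: exactly fma(a, b, c) or (a * b) + c, nothing else,
// and fully defined for NaN and infinities. f32::mul_add is the
// always-fused case; the relaxed form is what this thread asks for.
fn strict_fma(a: f32, b: f32, c: f32) -> f32 {
    a.mul_add(b, c)
}
```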

@Pr0methean
Author

> In the C/C++ world, the tradition is to call "relaxed" operations "fast" operations. This would give us names like fast_mul_add or fast_reduce_add. Following this performance-focused naming convention would improve familiarity for C/C++ devs, at the expense of arguably not stressing enough, in the name, that the output of these operations is not reproducible.
>
> Alternatively, we could use a different adjective that is not familiar to C/C++ devs but better expresses the underlying design compromise, e.g. relaxed_mul_add/relaxed_reduce_add.

Java has java.lang.Math and java.lang.StrictMath; I'd favor doing something similar by making some operations instance methods on a MathContext struct, which would have an enum member specifying relaxed or strict. (It could also have members specifying rounding modes etc., as in other languages, or providing access to x87 extended precision on x86_64.)
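
A minimal sketch of that idea; every name here is hypothetical, and the Relaxed arm uses a plain mul + add as a stand-in for an llvm.fmuladd-style lowering:

```rust
// Hypothetical API, loosely mirroring java.lang.Math vs. StrictMath.
#[derive(Clone, Copy)]
pub enum FmaPolicy {
    Strict,  // always fused: reproducible across targets
    Relaxed, // fused or mul + add, whichever the target does fastest
}

#[derive(Clone, Copy)]
pub struct MathContext {
    pub fma: FmaPolicy,
    // Could also carry a rounding mode, x87 extended precision, ...
}

impl MathContext {
    pub fn mul_add(&self, a: f32, b: f32, c: f32) -> f32 {
        match self.fma {
            FmaPolicy::Strict => a.mul_add(b, c),
            // Stand-in for a relaxed (llvm.fmuladd-style) lowering:
            FmaPolicy::Relaxed => a * b + c,
        }
    }
}
```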

@HadrienG2

HadrienG2 commented Nov 23, 2024

> Be a little careful of StdFloat::mul_add, as LLVM has an awful habit of converting it into multiple glibc scalar calls on the default architecture. This may have changed now, so correct me if I'm wrong.

Indeed, if LLVM sees an "unconditional" FMA instruction and cannot prove that the target always has hardware FMA, then it (1) scalarizes everything and (2) introduces a layer of libm call indirection. llvm.fmuladd is about fixing that: it generates MUL+ADD pairs instead of an FMA when target FMA support cannot be proven at compile time.

matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue on Dec 3, 2024:

Add simd_relaxed_fma intrinsic

Adds compiler support for rust-lang/portable-simd#387 (comment)

r? `@workingjubilee`

cc `@RalfJung` is this kind of nondeterminism a problem for miri/opsem?
rust-timer added a commit to rust-lang-ci/rust that referenced this issue on Dec 3, 2024: Rollup merge of rust-lang#133395 - calebzulawski:simd_relaxed_fma, r=workingjubilee (same change as above)
bjorn3 pushed a commit to rust-lang/rustc_codegen_cranelift that referenced this issue on Dec 4, 2024 (same change)
antoyo pushed a commit to rust-lang/rustc_codegen_gcc that referenced this issue on Dec 11, 2024 (same change)
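
A usage sketch of the intrinsic those commits added, assuming the path and signature from rust-lang#133395 (nightly only; the exact location may still change):

```rust
#![feature(core_intrinsics, portable_simd)]
#![allow(internal_features)]
use core::intrinsics::simd::simd_relaxed_fma;
use std::simd::f32x4;

fn relaxed_fma(a: f32x4, b: f32x4, c: f32x4) -> f32x4 {
    // Either fma(a, b, c) or a * b + c; which one is unspecified,
    // but no other fast-math behavior is permitted.
    unsafe { simd_relaxed_fma(a, b, c) }
}
```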

6 participants