
hypotenuse function for f32, f64 #15

Open

skewballfox wants to merge 13 commits into main

Conversation

@skewballfox (Author) commented Nov 9, 2024

I created a Hypot trait with a single function. It's been tested for avx2 and avx2fma; I don't have the hardware to test the avx512 implementation. I noticed a small difference from std hypot for some normal number pairs for f64 (hypot(0.1e-38, 1e-38) differed by roughly 1e-54).

I wasn't sure whether to add hypot to automath or keep it separate; it's currently separate. Neon might take a little longer.

Let me know if I need to change anything. Also, can you test the avx512 implementation?

EDIT: a bit more information on the difference from std hypot. I've found the difference is easy to trigger: if you provide two f64 values that are normal but either very small or very large (it kicks in around 1e-24 and 1e25) and exactly one order of magnitude apart (e.g. 1e-24 and 1e-25), the result differs from std hypot by exactly 1 ulp. I should probably add a test to ensure that results stay within 1 ulp of the std function.

This isn't really a rigorous analysis, just eyeball testing; it might vary for other inputs.
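
For reference, a check along those lines might look like the sketch below. The helper names are hypothetical (hypot_under_test simply stands in for whichever implementation this PR exposes), and the ULP helper only handles positive, finite values:

// Stand-in for the implementation under test; replace with the PR's routine.
fn hypot_under_test(a: f64, b: f64) -> f64 {
    a.hypot(b) // placeholder so the sketch compiles
}

// ULP distance between two positive, finite f64 values (no handling of signs,
// NaNs, or infinities in this sketch).
fn ulp_distance(a: f64, b: f64) -> u64 {
    a.to_bits().abs_diff(b.to_bits())
}

#[test]
fn hypot_stays_within_1_ulp_of_std() {
    let pairs = [(1e-24_f64, 1e-25_f64), (1e25, 1e26), (0.1e-38, 1e-38)];
    for (a, b) in pairs {
        let ours = hypot_under_test(a, b);
        assert!(ulp_distance(ours, a.hypot(b)) <= 1, "failed for ({a}, {b})");
    }
}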

@ChillFish8 (Owner)

Yup, will give this a look over and test tomorrow 👍

@skewballfox (Author)

Hey, out of curiosity, why not use libm for the non-nightly no_std math implementations?

Also, a bit out of scope for the PR, but I changed the fast-math implementation to use the sqrt intrinsic.

@ChillFish8 (Owner)

> Hey, out of curiosity, why not use libm for the non-nightly no_std math implementations?

Mostly to control the behaviour of the API and to keep the core lib zero-dependency and easy to audit.

> Also, a bit out of scope for the PR, but I changed the fast-math implementation to use the sqrt intrinsic.

This change doesn't make sense: the intrinsic is already stable as f32::sqrt() and f64::sqrt(), so the change doesn't alter the actual logic. Also, this change breaks no_std builds because sqrt isn't available there, which is why StdMath has a compile-time check and falls back to an approximate sqrt implementation.
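
For illustration, the kind of compile-time fallback described above might look roughly like this sketch; the function name and the Newton-Raphson fallback are assumptions, not the actual StdMath code:

// std builds: use the stable sqrt method, which lowers to the hardware
// sqrt instruction on most targets.
#[cfg(feature = "std")]
#[inline(always)]
fn sqrt_f32(x: f32) -> f32 {
    x.sqrt()
}

// no_std builds: sqrt isn't available, so fall back to an approximation.
#[cfg(not(feature = "std"))]
#[inline(always)]
fn sqrt_f32(x: f32) -> f32 {
    if x <= 0.0 {
        return if x == 0.0 { 0.0 } else { f32::NAN };
    }
    // Bit-level initial guess refined by a few Newton-Raphson iterations;
    // noticeably less accurate than the hardware/libm sqrt.
    let mut guess = f32::from_bits((x.to_bits() >> 1) + 0x1fc0_0000);
    for _ in 0..4 {
        guess = 0.5 * (guess + x / guess);
    }
    guess
}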

@ChillFish8 (Owner)

Btw, I have approved CI runs for you :D That should let you check that ARM & NEON work. I will test the AVX512 stuff but haven't got around to doing it yet, sorry!

@skewballfox (Author) commented Nov 16, 2024

> This change doesn't make sense: the intrinsic is already stable as f32::sqrt() and f64::sqrt(), so the change doesn't alter the actual logic. Also, this change breaks no_std builds because sqrt isn't available there, which is why StdMath has a compile-time check and falls back to an approximate sqrt implementation.

Ah, I didn't realize any of the compiler intrinsics had a dependency on the standard library. I reverted it; I'm just using the direct calculation with no_std for now, and I'll circle back around once I get the safe API implemented.

@skewballfox (Author) commented Nov 23, 2024

A couple of things I wanted to get feedback on:

  • Running cargo clippy --fix changes some code not related to the PR, mainly removing the lifetime on the memory loader. I wanted to make sure that change is actually correct before committing it.
  • What's the relationship between the tests in impl, op, and export? Should I move the hypot tests currently in impl_test to the op hypot file?
  • For the no_std implementation without libm, the tests are almost certainly going to fail for all of the SIMD implementations because of the difference in accuracy of the square root functions. I could port sqrt from libm, but it's a pretty gnarly bit of code.

I think I might have figured out how to fix the 1 ulp difference. I need to change the scaling so it is only applied when values fall outside a certain range.
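
A sketch of that idea (an assumption about the fix, not the PR's final code): only rescale when the larger magnitude sits outside a range where squaring is safe, and use the naive form otherwise. The thresholds below are illustrative:

fn hypot_conditional(a: f64, b: f64) -> f64 {
    let hi = a.abs().max(b.abs());
    let lo = a.abs().min(b.abs());
    if hi == 0.0 {
        return 0.0;
    }
    // Roughly the range in which hi * hi can neither overflow to infinity
    // nor underflow towards the subnormals.
    let safe_range = 1e-150..=1e150;
    if safe_range.contains(&hi) {
        // No scaling needed: the naive form avoids the extra rounding steps
        // that the scaled path introduces.
        (a * a + b * b).sqrt()
    } else {
        let ratio = lo / hi;
        hi * (ratio * ratio + 1.0).sqrt()
    }
}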

@skewballfox marked this pull request as ready for review on November 24, 2024 at 16:59.
@skewballfox (Author)

So, about the 1 ulp difference: it turned out not to be a problem with the scaling; it's how the floating-point error is being corrected for. I haven't had much success porting the error-correction process mentioned in the paper, but in either case I think being within 1 ulp is probably an acceptable level of accuracy in most cases.

I can maybe try to fix it in a later PR.

Let me know if there is something I should correct. I think the PR is ready for review, though the clippy CI check will likely throw an error due to the lifetime on the memory loader and a few unused type parameters in core_routine_boilerplate (both unrelated to changes from this PR).

/// and a point (`x`, `y`) on the Euclidean plane.
unsafe fn hypot(a: Self::Register, b: Self::Register) -> Self::Register;
#[inline(always)]
unsafe fn hypot_dense(

@ChillFish8 (Owner):

Nit: please add docs for this function as well, and just add a note under # Safety that it depends on the safety requirements of SimdRegister<T>.

@skewballfox (Author):

> just add a note under # Safety that it depends on the safety requirements of SimdRegister

You mean for the docs of op_hypot, correct?

@ChillFish8 (Owner):

Yeah, technically it depends on both, but I guess if you satisfy op_hypot you satisfy the register requirements anyway.
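
For illustration, docs in the style being discussed could look like the following, shown here on the hypot signature quoted above (the same note would apply to hypot_dense); the exact wording is an assumption, not the PR's actual docs:

/// Computes the element-wise hypotenuse of the packed values in `a` and `b`.
///
/// # Safety
///
/// This function inherits the safety requirements of the `SimdRegister<T>`
/// implementation in use; the CPU features that implementation requires must
/// be available on the running machine.
unsafe fn hypot(a: Self::Register, b: Self::Register) -> Self::Register;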

use crate::mem_loader::{IntoMemLoader, MemLoader};

#[inline(always)]
/// A generic vector hypot over one vector and single value.

@ChillFish8 (Owner):

Is this correct? Technically mem loaders will accept:

  • single & single
  • single & vec
  • vec & single
  • vec & vec

result = [0; dims]

for i in range(dims):
    result[i] = a[i].hypot(b[i])

@ChillFish8 (Owner):

Might be useful to have this be a bit more in-depth about what hypot is actually doing, as just calling the function doesn't give much insight into the actual math and operation ordering going on.

@skewballfox (Author) commented Dec 7, 2024:

Would this be better?

result = [0; dims]

// assume |a| >= |b| for all pairs (a, b)
for i in range(dims):
    let hi = a[i].abs()
    let lo = b[i].abs()
    let scale = 1/hi
    result[i] = hi * sqrt((lo*scale)^2 + 1)

return result

@ChillFish8 (Owner):

Looks good. I'm tempted to have it be abs(a[i]) just so it is consistent; other than that, it's much clearer.
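
For reference, a scalar Rust rendering of that pseudocode might look like this (a sketch of the math only, not the SIMD code in the PR):

fn hypot_scaled(a: f64, b: f64) -> f64 {
    // Order the magnitudes so `hi` is the larger of the two.
    let (hi, lo) = if a.abs() >= b.abs() {
        (a.abs(), b.abs())
    } else {
        (b.abs(), a.abs())
    };
    if hi == 0.0 {
        return 0.0;
    }
    // Scaling by 1/hi keeps the squared term in [0, 1], so it can neither
    // overflow nor underflow; the final multiply undoes the scaling.
    let ratio = lo * (1.0 / hi);
    hi * (ratio * ratio + 1.0).sqrt()
}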

use crate::mem_loader::{IntoMemLoader, MemLoader};

/// Various arithmetic operations over vectors.
pub trait NumericOps: Sized + Copy {

@ChillFish8 (Owner):

A little bit of bikeshedding: if this is called NumericOps, does this open us up to a bit of pain if we later want to add a function that falls under this category but for i32, i8, etc.? As used here, NumericOps is really more like FloatOps or similar.

@skewballfox (Author):

No worries about the bikeshedding.

What functions would fall into this category that would be valid for integers? I can change the name of the trait/file to hypot, but I imagine there would be other float-specific ops that would be useful to add at some point.

What naming scheme do you think we should go with?

@ChillFish8 (Owner) commented Dec 7, 2024:

Not sure. I suspect we are more likely to have float-only ops than int-only ops for this lib. A trait for just the hypot safe op is probably going to become a bit verbose pretty quickly; we could put it under MiscFloatOps as a bit of a catch-all and then come up with a better-fitting name if we collect a few more functions under this category.
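
A rough sketch of how that catch-all could look; only the MiscFloatOps name comes from the discussion, the method name and signature here are assumptions:

/// Miscellaneous float-only operations over vectors (name still under discussion).
pub trait MiscFloatOps: Sized + Copy {
    /// Element-wise `hypot` of `a` and `b`, written into `result`.
    fn hypot_vertical(a: &[Self], b: &[Self], result: &mut [Self]);
}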

@ChillFish8 (Owner) commented Dec 1, 2024

It looks pretty great overall. The comments on the SIMD arithmetic ops themselves were very useful for digesting and understanding what's going on.

There are a few notes around code style and a little bit of naming debate but nothing insane.

I think one thing missing is a dedicated high-level function like cfavml::hypot().

@skewballfox (Author)

So, I've run into a bit of a roadblock. For the macOS CI run, the tests are failing specifically for subnormal values in the NEON implementation. The difference seems to be pretty significant (the exponents don't match), and I've hit the limit of what I can figure out without being able to test directly on the hardware.

@ChillFish8 (Owner)

I'll try to have a look tomorrow and see if I can spot what's up with it.

@ChillFish8 (Owner)

Shoot, I forgot about this. I'll try and look at it when I get the chance, sorry!

@skewballfox (Author)

Oh, thank god. Who in their right mind would think the natural way to order the args for a fused multiply-add function would be addend first?

@ChillFish8 (Owner)

Ah, you ran into that issue 😅 Yeah, I hit the same thing when first doing the NEON stuff: the accumulator sits in a different position in the argument order, which is... unhelpful.
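
For anyone hitting the same thing, the mismatch is just in where the accumulator goes in the intrinsic signatures; a minimal sketch (not the crate's actual wrappers):

#[cfg(target_arch = "aarch64")]
use core::arch::aarch64::{float32x4_t, vfmaq_f32};
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::{__m256, _mm256_fmadd_ps};

#[cfg(target_arch = "aarch64")]
#[target_feature(enable = "neon")]
unsafe fn fma_neon(acc: float32x4_t, a: float32x4_t, b: float32x4_t) -> float32x4_t {
    // vfmaq_f32(acc, a, b) computes acc + (a * b) -- accumulator comes first.
    vfmaq_f32(acc, a, b)
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "fma")]
unsafe fn fma_avx2(a: __m256, b: __m256, acc: __m256) -> __m256 {
    // _mm256_fmadd_ps(a, b, acc) computes (a * b) + acc -- accumulator comes last.
    _mm256_fmadd_ps(a, b, acc)
}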
