
hypotenuse function for f32, f64 #15

Open

skewballfox wants to merge 13 commits into main

Conversation

@skewballfox (Author) commented Nov 9, 2024

I created a Hypot trait with a single function. It's been tested for avx2 and avx2fma; I don't have the hardware to test the avx512 implementation. I noticed a small difference from std hypot for some normal number pairs for f64 (hypot(0.1e-38, 1e-38) differed by roughly 1e-54).

I wasn't sure whether to add hypot to automath or keep it separate; it's currently separate. Neon might take a little longer.

Let me know if I need to change anything. Also, can you test the avx512 implementation?

EDIT: a bit more information on the difference from std hypot. I've found the difference is easy to trigger: if you provide two f64 values that are normal but either very small or very large (it kicks in around 1e-24 and 1e25) and exactly one order of magnitude apart (e.g. 1e-24 and 1e-25), the result differs from std hypot by exactly 1 ulp. I should probably add a test to ensure that results stay within 1 ulp of the std function.

This isn't really a rigorous analysis, just eyeball testing; it might vary for other inputs.
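
For reference, a check along those lines might look like the sketch below. The helper names are hypothetical (hypot_under_test simply stands in for whichever implementation this PR exposes), and the ULP helper only handles positive, finite values:

// Stand-in for the implementation under test; replace with the PR's routine.
fn hypot_under_test(a: f64, b: f64) -> f64 {
    a.hypot(b) // placeholder so the sketch compiles
}

// ULP distance between two positive, finite f64 values (no handling of signs,
// NaNs, or infinities in this sketch).
fn ulp_distance(a: f64, b: f64) -> u64 {
    a.to_bits().abs_diff(b.to_bits())
}

#[test]
fn hypot_stays_within_1_ulp_of_std() {
    let pairs = [(1e-24_f64, 1e-25_f64), (1e25, 1e26), (0.1e-38, 1e-38)];
    for (a, b) in pairs {
        let ours = hypot_under_test(a, b);
        assert!(ulp_distance(ours, a.hypot(b)) <= 1, "failed for ({a}, {b})");
    }
}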

@ChillFish8 (Owner)

Yup, will give this a look over and test tomorrow 👍

@skewballfox (Author)

Hey, out of curiosity, why not use libm for the non-nightly no_std math implementations?

Also, a bit out of scope for the PR, but I changed the fast-math implementation to use the sqrt intrinsic.

@ChillFish8 (Owner)

> Hey, out of curiosity, why not use libm for the non-nightly no_std math implementations?

Mostly to control the behaviour of the API and to keep the core lib zero-dependency and easy to audit.

> Also, a bit out of scope for the PR, but I changed the fast-math implementation to use the sqrt intrinsic.

This change doesn't make sense: the intrinsic is already stable as f32::sqrt() and f64::sqrt(), so the change doesn't alter the actual logic. Also, this change breaks no_std builds because sqrt isn't available there, which is why StdMath has a compile-time check and falls back to an approximate sqrt implementation.
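
For illustration, the kind of compile-time fallback described above might look roughly like this sketch; the function name and the Newton-Raphson fallback are assumptions, not the actual StdMath code:

// std builds: use the stable sqrt method, which lowers to the hardware
// sqrt instruction on most targets.
#[cfg(feature = "std")]
#[inline(always)]
fn sqrt_f32(x: f32) -> f32 {
    x.sqrt()
}

// no_std builds: sqrt isn't available, so fall back to an approximation.
#[cfg(not(feature = "std"))]
#[inline(always)]
fn sqrt_f32(x: f32) -> f32 {
    if x <= 0.0 {
        return if x == 0.0 { 0.0 } else { f32::NAN };
    }
    // Bit-level initial guess refined by a few Newton-Raphson iterations;
    // noticeably less accurate than the hardware/libm sqrt.
    let mut guess = f32::from_bits((x.to_bits() >> 1) + 0x1fc0_0000);
    for _ in 0..4 {
        guess = 0.5 * (guess + x / guess);
    }
    guess
}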

@ChillFish8 (Owner)

Btw, I have approved CI runs for you :D That should let you check that ARM & NEON work. I will test the AVX512 stuff but haven't got around to doing it yet, sorry!

@skewballfox (Author) commented Nov 16, 2024

> This change doesn't make sense: the intrinsic is already stable as f32::sqrt() and f64::sqrt(), so the change doesn't alter the actual logic. Also, this change breaks no_std builds because sqrt isn't available there, which is why StdMath has a compile-time check and falls back to an approximate sqrt implementation.

Ah, I didn't realize any of the compiler intrinsics had a dependency on the standard library. I reverted it; I'm just using the direct calculation with no_std for now, and I'll circle back around once I get the safe API implemented.

@skewballfox (Author) commented Nov 23, 2024

A couple of things I wanted to get feedback on:

  • Running cargo clippy --fix changes some code not related to the PR, mainly removing the lifetime on the memory loader. I wanted to make sure that change is actually correct before committing it.
  • What's the relationship between the tests in impl, op, and export? Should I move the hypot tests currently in impl_test to the op hypot file?
  • For the no_std implementation without libm, the tests are almost certainly going to fail for all of the SIMD implementations because of the difference in accuracy of the square root functions. I could port sqrt from libm, but it's a pretty gnarly bit of code.

I think I might have figured out how to fix the 1 ulp difference. I need to change the scaling so it is only applied when values fall outside a certain range.
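
A sketch of that idea (an assumption about the fix, not the PR's final code): only rescale when the larger magnitude sits outside a range where squaring is safe, and use the naive form otherwise. The thresholds below are illustrative:

fn hypot_conditional(a: f64, b: f64) -> f64 {
    let hi = a.abs().max(b.abs());
    let lo = a.abs().min(b.abs());
    if hi == 0.0 {
        return 0.0;
    }
    // Roughly the range in which hi * hi can neither overflow to infinity
    // nor underflow towards the subnormals.
    let safe_range = 1e-150..=1e150;
    if safe_range.contains(&hi) {
        // No scaling needed: the naive form avoids the extra rounding steps
        // that the scaled path introduces.
        (a * a + b * b).sqrt()
    } else {
        let ratio = lo / hi;
        hi * (ratio * ratio + 1.0).sqrt()
    }
}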

@skewballfox marked this pull request as ready for review on November 24, 2024 at 16:59.
@skewballfox (Author)

So, about the 1 ulp difference: it turned out not to be a problem with the scaling; it's how the floating-point error is being corrected for. I haven't had much success porting the error-correction process mentioned in the paper, but in either case I think being within 1 ulp is probably an acceptable level of accuracy in most cases.

I can maybe try to fix it in a later PR.

Let me know if there is something I should correct. I think the PR is ready for review, though the clippy CI check will likely throw an error due to the lifetime on the memory loader and a few unused type parameters in core_routine_boilerplate (both unrelated to changes from this PR).

/// and a point (`x`, `y`) on the Euclidean plane.
unsafe fn hypot(a: Self::Register, b: Self::Register) -> Self::Register;
#[inline(always)]
unsafe fn hypot_dense(

@ChillFish8 (Owner):

Nit: please add docs for this function as well, and just add a note under # Safety that it depends on the safety requirements of SimdRegister<T>.

@skewballfox (Author):

> just add a note under # Safety that it depends on the safety requirements of SimdRegister

You mean for the docs of op_hypot, correct?

@ChillFish8 (Owner):

Yeah, technically it depends on both, but I guess if you satisfy op_hypot you satisfy the register requirements anyway.
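
For illustration, docs in the style being discussed could look like the following, shown here on the hypot signature quoted above (the same note would apply to hypot_dense); the exact wording is an assumption, not the PR's actual docs:

/// Computes the element-wise hypotenuse of the packed values in `a` and `b`.
///
/// # Safety
///
/// This function inherits the safety requirements of the `SimdRegister<T>`
/// implementation in use; the CPU features that implementation requires must
/// be available on the running machine.
unsafe fn hypot(a: Self::Register, b: Self::Register) -> Self::Register;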

use crate::mem_loader::{IntoMemLoader, MemLoader};

#[inline(always)]
/// A generic vector hypot over one vector and single value.

@ChillFish8 (Owner):

Is this correct? Technically mem loaders will accept:

  • single & single
  • single & vec
  • vec & single
  • vec & vec

result = [0; dims]

for i in range(dims):
    result[i] = a[i].hypot(b[i])

@ChillFish8 (Owner):

Might be useful to have this be a bit more in-depth about what hypot is actually doing, as just calling the function doesn't give much insight into the actual math and operation ordering going on.

@skewballfox (Author) commented Dec 7, 2024:

Would this be better?

result = [0; dims]

// assume |a| >= |b| for all pairs (a, b)
for i in range(dims):
    let hi = a[i].abs()
    let lo = b[i].abs()
    let scale = 1/hi
    result[i] = hi * sqrt((lo*scale)^2 + 1)

return result

@ChillFish8 (Owner):

Looks good. I'm tempted to have it be abs(a[i]) just so it is consistent; other than that, it's much clearer.
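
For reference, a scalar Rust rendering of that pseudocode might look like this (a sketch of the math only, not the SIMD code in the PR):

fn hypot_scaled(a: f64, b: f64) -> f64 {
    // Order the magnitudes so `hi` is the larger of the two.
    let (hi, lo) = if a.abs() >= b.abs() {
        (a.abs(), b.abs())
    } else {
        (b.abs(), a.abs())
    };
    if hi == 0.0 {
        return 0.0;
    }
    // Scaling by 1/hi keeps the squared term in [0, 1], so it can neither
    // overflow nor underflow; the final multiply undoes the scaling.
    let ratio = lo * (1.0 / hi);
    hi * (ratio * ratio + 1.0).sqrt()
}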

use crate::mem_loader::{IntoMemLoader, MemLoader};

/// Various arithmetic operations over vectors.
pub trait NumericOps: Sized + Copy {

@ChillFish8 (Owner):

A little bit of bikeshedding: if this is called NumericOps, does this open us up to a bit of pain if we later want to add a function that falls under this category but for i32, i8, etc.? As used here, NumericOps is really more like FloatOps or similar.

@skewballfox (Author):

No worries about the bikeshedding.

What functions would fall into this category that would be valid for integers? I can change the name of the trait/file to hypot, but I imagine there would be other float-specific ops that would be useful to add at some point.

What naming scheme do you think we should go with?

@ChillFish8 (Owner) commented Dec 7, 2024:

Not sure. I suspect we are more likely to have float-only ops than int-only ops for this lib. A trait for just the hypot safe op is probably going to become a bit verbose pretty quickly; we could put it under MiscFloatOps as a bit of a catch-all and then come up with a better-fitting name if we collect a few more functions under this category.
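
A rough sketch of how that catch-all could look; only the MiscFloatOps name comes from the discussion, the method name and signature here are assumptions:

/// Miscellaneous float-only operations over vectors (name still under discussion).
pub trait MiscFloatOps: Sized + Copy {
    /// Element-wise `hypot` of `a` and `b`, written into `result`.
    fn hypot_vertical(a: &[Self], b: &[Self], result: &mut [Self]);
}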

@ChillFish8 (Owner) commented Dec 1, 2024

It looks pretty great overall. The comments on the SIMD arithmetic ops themselves were very useful for digesting and understanding what's going on.

There are a few notes around code style and a little bit of naming debate but nothing insane.

I think one thing missing is a dedicated high-level function like cfavml::hypot().

@skewballfox (Author)

So, I've run into a bit of a roadblock. For the macOS CI run, the tests are failing specifically for subnormal values in the NEON implementation. The difference seems to be pretty significant (the exponents don't match), and I've hit the limit of what I can figure out without being able to test directly on the hardware.

@ChillFish8 (Owner)

I'll try to have a look tomorrow and see if I can spot what's up with it.

@ChillFish8 (Owner)

Shoot, I forgot about this. I'll try and look at it when I get the chance, sorry!

@skewballfox (Author)

Oh, thank god. Who in their right mind would think the natural way to order the args for a fused multiply-add function would be addend first?

@ChillFish8 (Owner)

Ah, you ran into that issue 😅 Yeah, I hit the same thing when first doing the NEON stuff: the accumulator sits in a different position in the argument order, which is... unhelpful.
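
For anyone hitting the same thing, the mismatch is just in where the accumulator goes in the intrinsic signatures; a minimal sketch (not the crate's actual wrappers):

#[cfg(target_arch = "aarch64")]
use core::arch::aarch64::{float32x4_t, vfmaq_f32};
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::{__m256, _mm256_fmadd_ps};

#[cfg(target_arch = "aarch64")]
#[target_feature(enable = "neon")]
unsafe fn fma_neon(acc: float32x4_t, a: float32x4_t, b: float32x4_t) -> float32x4_t {
    // vfmaq_f32(acc, a, b) computes acc + (a * b) -- accumulator comes first.
    vfmaq_f32(acc, a, b)
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "fma")]
unsafe fn fma_avx2(a: __m256, b: __m256, acc: __m256) -> __m256 {
    // _mm256_fmadd_ps(a, b, acc) computes (a * b) + acc -- accumulator comes last.
    _mm256_fmadd_ps(a, b, acc)
}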
