Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementation-dependent reciprocal [sqrt] approximation instructions #4

Open
stoklund opened this issue Apr 14, 2017 · 17 comments
Open
Labels
instruction-proposal out of scope Out of scope for this proposal, can be explored in a subsequent proposal.

Comments

@stoklund
Copy link

The proposal in WebAssembly/simd#1 includes these instructions:

  • f32x4.reciprocalApproximation(a: v128) -> v128,
  • f32x4.reciprocalSqrtApproximation(a: v128) -> v128,
  • f64x2.reciprocalApproximation(a: v128) -> v128, and
  • f64x2.reciprocalSqrtApproximation(a: v128) -> v128.

The corresponding scalar instructions are mentioned in the future features design document.

These instructions are available in the ARM, Intel, and MIPS SIMD instruction sets. They are typically implemented as table-driven piece-wise linear approximations on the interval [0.5; 1) or [0.25; 1) and then extended to the full exponent domain. However, the exact nature of the approximation is implementation-dependent, so different ISAs will get different results.

The approximations are either used as-is for low-precision graphics or physics code, or they are used as starting points and refined with one or more steps of Newton-Raphson.

@jfbastien
Copy link
Member

My main objection to doing this early is that the results differ per platform, which WebAssembly has tried very hard to avoid. These instructions have a history of providing great perf when precision doesn't matter, so I see them as somewhat special.

I think we need to quantify these gains across "tier 1" ISAs, in the context where they're usually employed. We then need to also document what kind of results they return in different implementations so that our spec can provide some bounds (cf. JavaScript's original sin / cos for how not to do this).

@stoklund
Copy link
Author

Fortunately, these functions are better behaved than sin/cos, so it should be easy to specify a quite tight maximum relative error along with an allowance for weird results near 0. For example, Intel promises |Relative Error| ≤ 1.5 ∗ 2−12, with subnormal inputs mapped to ±∞.

There should also be less room for software shenanigans since straight up computing 1/x or 1/sqrt(x) in one or two instructions is much faster than computing sin(x).

@billbudge
Copy link

A concern I have is that f64x2 is not supported well on platforms other than Intel. Implementations would be forced to lower these opcodes to the equivalent scalar ops, probably leading to zero or negative performance improvement. So I'm wondering if the f64x2 type should be included.

@jfbastien
Copy link
Member

@billbudge (hi Bill!) is this specifically about f64x2 approximation instructions, or about f64x2 in general?

@billbudge
Copy link

Hi JF. Yes, f64x2 in general. We think it will be hard enough to get clear performance wins with f32x4, so somewhat skeptical about f64x2, especially on platforms like ARM.

@jfbastien
Copy link
Member

In ARMv8 A64 supports f64x2. I'm happy with this, but I agree that this type is related to the "portable performance" bar I suggested.

I think it's a separate issue from approximation instructions though. Could you file it?

stoklund referenced this issue in WebAssembly/simd Apr 25, 2017
To be debated in #3.

Add a complete list of omitted operations to Overview.md.
stoklund referenced this issue in WebAssembly/simd Apr 25, 2017
Fold the portable-simd.md specification and the WebAssembly mapping into
a single SIMD.md document that is specific to WebAssembly. This is
easier to read and avoids a lot of confusion.

- Don't attempt to specify floating-point semantics again. Just refer to
  existing WebAssembly behavior.
- Remove the minNum and maxNum operations which were never intended to
  be included in WebAssembly.
- Clarify the trapping behavior of the float-to-int conversions.
- Remove the descriptions of recipriocal [sqrt] approximations. See #3.
- Rename all operations to use snake_case.
- Add *.const instructions for all the new types.
- Remove the partial load and store operators as suggested by
  @sunfishcode. They were removed from asm.js too.
@lemaitre
Copy link

As long as the constraint about the reproducibility of the results across clients is not soften, this question has absolutely no answer.
It is not possible to propose such a function that can give the exact same result across architectures.
SSE, Neon, Altivec, AVX512: all those ISAs have different instructions giving different results.

But if this constraint disappears, then I think the best way would be to have something like this:

  • f32x4.reciprocalApproximation(a: v128, prec: int8) -> v128
  • f32x4.reciprocalSqrtApproximation(a: v128, prec: int8) -> v128
  • f64x2.reciprocalApproximation(a: v128, prec: int8) -> v128
  • f64x2.reciprocalSqrtApproximation(a: v128, prec: int8) -> v128
    will compute an approximation of f(a) with at least prec bits correct.
    This instruction will call a fast hardware instruction, and then recover accuracy with some Newton-Raphson iterations to achieve the expected accuracy.
    This will improve the portability of such a function, but will still give "slightly" different results from different architectures.

@gnzlbg
Copy link

gnzlbg commented May 24, 2018

From the ISAs with reciprocal square-root estimate which one has the largest error?

Some (most?) ISAs also provide instructions that perform single Newton-Raphson iterations for this operation, so an alternative here would be to just specify an upper bound on the error, and use the reciprocal square-root instruction of the ISA when it satisfies the upper bound, and add one or more NR-iterations to that result as required in ISAs where this is not the case.

@lemaitre
Copy link

lemaitre commented May 24, 2018

From the ISAs with reciprocal square-root estimate which one has the largest error?

I would say Neon, but I don't remember well. Most documents I have found don't specify the precision of this instruction. And some give a maximum error of 1/4096 (12 bits) which would be the same as x86, but in my experience, it's less accurate than intel's: either arm is less accurate than 12 bits, or Intel is more accurate...

Some (most?) ISAs also provide instructions that perform single Newton-Raphson iterations for this operation

Only Neon supports such an instruction. With other ISAs, you need to implement the NR iteration yourself with a bunch of MUL/FMA.

an alternative here would be to just specify an upper bound on the error, and use the reciprocal square-root instruction of the ISA when it satisfies the upper bound, and add one or more NR-iterations to that result as required in ISAs where this is not the case.

I tend to think that's the way to do it.
Actually, with what I proposed, the accuracy requirement was not fixed but embedded within the instruction.

Some extra thoughts:
reciprocal (square root) on double precision is not available one SSE and AVX (but is available on Neon 64, VSX and AVX512).
It is possible to do a fast and rough estimation in that case though.

AVX512 has now instructions that are more accurate for those operations: 14 and 28 bits

@gnzlbg
Copy link

gnzlbg commented May 24, 2018

Do you know the name of the VSX instruction that supports 64-bit rsqrte on powerpc? The 32-bit instruction also has an error of 1/4096 like the SSE instructions.

MIPS MSA supports 32-bit and 64-bit reciprocal square-root estimate with an error of at most 2 ULPs.


Also worst case one can implement these by doing a full vector square root, which most ISAs support, followed by a divide. That's slow, but would still be a conforming implementation.

@gnzlbg
Copy link

gnzlbg commented May 24, 2018

It seems that PowerPC implements this for 64-bit floats with 14-bit precision. So 12-bit precision for 32-bit and 14-bit precision for 64-bit look like "reasonable" upper bounds on the error to me.

@lemaitre
Copy link

lemaitre commented May 24, 2018

Do you know the name of the VSX instruction that supports 64-bit rsqrte on powerpc?

On 64 bits, there is xvrsqrtesp for F32 and xvrsqrtedp for F64 (both are available through the C intrinsics vec_rsqrte).
I would assume they have the same precision, I haven't put any effort to confirm this feeling.

MIPS MSA supports 32-bit and 64-bit reciprocal square-root estimate with an error of at most 2 ULPs.

I never used MIPS, so you're telling me.

It seems that PowerPC implements this for 64-bit floats with 14-bit precision. So 12-bit precision for 32-bit and 14-bit precision for 64-bit look like "reasonable" upper bounds on the error to me.

I have to disagree with your conclusion: One way to implement the fast reciprocal for F64 when you don't have any instruction is to convert your input fo F32, call the fast reciprocal on F32, and convert back to F64.
Doing that you "inherit" the precision from F32. By setting the F64 accuracy requirement to be slightly higher than for F32, you clearly forbid such an implementation.

Actually, that's why I would recommend the precision not to be fixed: The instruction could take an immediate to specify the accuracy required, and when the assembly is generated: the fast reciprocal instruction will be used and enough NR iterations will be generated to achieve the accuracy required depending on the actual hardware.

So users should not write NR themselves and just specify they want for instance a 20 bit accurate result and that's it.
However, I think the accuracy requirement should be an immediate and not a runtime argument, but that's up to debate.

Also worst case one can implement these by doing a full vector square root, which most ISAs support, followed by a divide. That's slow, but would still be a conforming implementation.

I totally agree with you on that point. By the way, if the target precision is too high, that implementation can be the fastest implementation possible.

@gnzlbg
Copy link

gnzlbg commented May 24, 2018

I have to disagree with your conclusion: One way to implement the fast reciprocal for F64 when you don't have any instruction is to convert your input fo F32, call the fast reciprocal on F32, and convert back to F64.
Doing that you "inherit" the precision from F32. By setting the F64 accuracy requirement to be slightly higher than for F32, you clearly forbid such an implementation.

While I think this is what some ISAs actually do (does PowerPC 64 do this?), the largest representable f32 is 3.4 x 10^38 while the largest representable f64 is 1.80 x 10^308. There is a whole range of f64s that just are not representable in f32s. At that point, why expose the version for f64 at all? I'd argue that it would be better to just not expose it, and let the user do the truncation to f32 however it sees fit (INFINITY, clamp, etc.).

Actually, that's why I would recommend the precision not to be fixed: The instruction could take an immediate to specify the accuracy required, and when the assembly is generated: the fast reciprocal instruction will be used and enough NR iterations will be generated to achieve the accuracy required depending on the actual hardware.

This is actually a really interesting idea.

However, I think the accuracy requirement should be an immediate and not a runtime argument,

I think making them an immediate is reasonable: those who want run-time behavior can just branch on it (e.g. have a switch that calls the intrinsic for different immediates), that's what the compiler will have to generate if it were to accept a run-time argument anyways (although the compiler might be able to be a bit more clever about the switch since it might know for a particular target which options are available).

@lemaitre
Copy link

While I think this is what some ISAs actually do (does PowerPC 64 do this?), the largest representable f32 is3.4 x 10^38while the largest representablef64is1.80 x 10^308. There is a whole range off64s that just are not representable inf32s. At that point, why expose the version forf64at all? I'd argue that it would be better to just not expose it, and let the user do the truncation tof32` however it sees fit.

I never said that was the only way to implement it. That's true this specific implementation does not support input outside the range of F32.

Another way to implement it is with some bit hacks: https://en.wikipedia.org/wiki/Fast_inverse_square_root
This article presents only the F32 bit hack (which is not worth anymore), but it can also be done for F64, and it is still worth depending on the accuracy you need.
It requires more NR iterations, but nothing fancy.
This technique support the whole range of F64 (except subnormals)

I think making them an immediate is reasonable: those who want run-time behavior can just branch on it (e.g. have a switch that calls the intrinsic for different immediates), that's what the compiler will have to generate if it were to accept a run-time argument anyways (although the compiler might be able to be a bit more clever about the switch since it might know for a particular target which options are available).

Agreed. I don't see any valid use case for a runtime precision requirement anyway.

penzn referenced this issue in penzn/simd Aug 20, 2019
Extended loads: 32->64 bit ops
@ngzhian ngzhian transferred this issue from WebAssembly/simd Mar 17, 2021
@ngzhian
Copy link
Member

ngzhian commented Sep 8, 2021

So users should not write NR themselves and just specify they want for instance a 20 bit accurate result and that's it.
However, I think the accuracy requirement should be an immediate and not a runtime argument, but that's up to debate.

This is an interesting area to explore. Implementing the NR step seems simple enough on non-NEON backends. I would need some help figuring out how many steps of NR to perform to achieve the specified accuracy:

From https://www.felixcloutier.com/x86/rcpps, the relative error is |Relative Error| ≤ 1.5 * 2−12, and from https://en.wikipedia.org/wiki/Division_algorithm#Newton.E2.80.93Raphson_division : "the number of correct digits in the result roughly doubles for every iteration". So if the user wants 28 bits of accuracy for a float, how many rounds do I need?

(Sorry I know nothing about numerical analysis, any pointers here will be helpful)

@Maratyszcza
Copy link
Collaborator

Here is an old unpublished paper of mine. While its main topic is not related to reciprocal (sqrt) approximation operations, it does contain probably the best description and semi-analytical model of the subject in Section 4.

@ngzhian
Copy link
Member

ngzhian commented Mar 22, 2022

As discussed in https://github.com/WebAssembly/meetings/blob/main/simd/2022/SIMD-02-18.md#aob (search for reciprocal), this is likely out of scope, adding a label to indicate so.

@ngzhian ngzhian added the out of scope Out of scope for this proposal, can be explored in a subsequent proposal. label Mar 22, 2022
kangwoosukeq pushed a commit to prosyslab/v8 that referenced this issue Apr 28, 2022
These were originally proposed as a part of the fixed-width SIMD
proposal, and were then migrated to the relaxed-simd proposal
which also deems these operations out of scope.

Github issue: WebAssembly/relaxed-simd#4

Bug: v8:12284
Change-Id: I65ceb6dfd25c43cf49bd7ec5b5ecd6b32cc3516a
Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/3595970
Reviewed-by: Thibaud Michaud <[email protected]>
Commit-Queue: Deepti Gandluri <[email protected]>
Cr-Commit-Position: refs/heads/main@{#80125}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
instruction-proposal out of scope Out of scope for this proposal, can be explored in a subsequent proposal.
Projects
None yet
Development

No branches or pull requests

7 participants