What about subnormals? #148

Closed
jfbastien opened this issue Jun 11, 2015 · 72 comments

@jfbastien
Member

Forking from: #141 (comment)

Should denormal behavior be:

  • Unspecified (say so in Nondeterminism.md).
  • Fully specified as IEEE-754 compliance.
  • Fully specified as IEEE-754 compliance for scalars, and do something for SIMD because existing hardware (especially ARMv7 NEON) doesn't support denormals.
  • Specified as DAZ/FTZ (not IEEE-754 compliant).

We should probably let ourselves change this based on developer feedback, but I'd like to make some decision for MVP. I suggested on esdiscuss that JavaScript go fully unspecified, and just do DAZ/FTZ because it's often faster. Yes, x86 is better than it used to be, but that's not universal: assuming it is ignores current hardware and doesn't look towards what new hardware will do. I like leaving the door open :-)
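To make the options above concrete, here is a minimal Python sketch of what flushing means for a single multiplication. It models float64 only, and the helper names (`is_subnormal`, `flush`, `mul_ftz`, `mul_daz`) are made up for illustration. Real hardware differs in the details of when a result counts as tiny, so treat this as an approximation of DAZ/FTZ rather than a definition.

```python
import math
import sys

MIN_NORMAL = sys.float_info.min  # smallest positive normal float64, 2**-1022


def is_subnormal(x):
    """True for nonzero values whose magnitude is below the normal range."""
    return x != 0.0 and abs(x) < MIN_NORMAL


def flush(x):
    """Replace a subnormal with a zero of the same sign (illustrative helper)."""
    return math.copysign(0.0, x) if is_subnormal(x) else x


def mul_ieee(a, b):
    # Gradual underflow: subnormal inputs and results are kept.
    return a * b


def mul_ftz(a, b):
    # Flush-to-zero: a subnormal *result* becomes zero.
    return flush(a * b)


def mul_daz(a, b):
    # Denormals-are-zero: subnormal *inputs* are treated as zero.
    return flush(a) * flush(b)
```

With `a = sys.float_info.min` and `b = 0.5`, `mul_ieee` returns a subnormal while `mul_ftz` returns zero. `mul_daz` leaves that product intact, because both inputs are normal.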

For @sunfishcode searchability, I'll use the word "subnormals" too :-)

@jfbastien jfbastien added this to the Public Announcement milestone Jun 11, 2015
@sunfishcode sunfishcode changed the title from "What about denormals?" to "What about subnormals?" Jun 11, 2015
@sunfishcode
Member

For @jfbastien searchability, I'll say the word "denormals" too :-). But IEEE 754-2008 came out about 7 years ago, so it's time to be up to date :-).

@titzer

titzer commented Jun 11, 2015

I would argue for IEEE 754 compliance from the beginning. The rationale is that hardware and software all tend toward the standard over time. I think we'd be uncorking a long-term bottle of annoyance and incompatibility by deviating.

As for subnormals, it seems the same cycle has repeated multiple times in floating point history. It's been tempting for hardware to cut corners and do FTZ or something else, but it has always tended back to full IEEE 754 compliance. Even GPUs are implementing full IEEE now. Of our tier 1 platforms, the only one I am aware of where subnormals are not implemented at all is Float32x4 on Arm NEON (i.e. SIMD). Float64x2 on NEON is fully IEEE compliant. Scalar arithmetic on Arm is of course compliant.

The subnormal situation seems to come down to SIMD, specifically the case of Arm NEON above.

Based on conversations with hardware designers, microarchitectures have gotten so much better at superscalar floating point, even Arm cores, that it might be acceptable to spec both scalar Float32 and Float32 SIMD operations as IEEE, and then simply not implement Float32x4 with SIMD instructions on Arm. But this is something we should measure and motivate when SIMD is coming into WebAsm. At that point, if the performance really justifies weakening the spec, we can weaken the spec. If we weaken it now instead, it would be hard to tighten up the spec later.

@jfbastien
Member Author

My objection to that is: which developers care about denormals? Developers usually learn about them when they have stray denormals in their compute kernel and code goes orders of magnitude slower. I'm still looking for someone who wants denormal support.

@lukewagner
Member

If we specify FTZ as a developer-controlled mode, we're not weakening the spec (as in, we're not making semantics any looser); we're giving strictly more power to developers and this is something that they are specifically asking for (in our discussions with asm.js-using gamedevs). If we consider that wasm will always be run on emerging platforms (which often start w/ terrible denormal perf) and very old platforms, then this will be a consistent feature request, not one that will get definitely fixed. Furthermore, from my discussions with Intel, even with new, optimized denormals, they're not equivalent in speed to normal numbers and they're also not optimized for all ops. This is why setting FTZ is standard practice; it's just one less perf cliff to worry about.

@jfbastien
Member Author

Setting DAZ/FTZ should be a global property that's set once for the entire wasm application, though?

  • Or can it be set/unset arbitrarily and mess with AOT compilation?
  • What of threading setting/unsetting it?
  • Wasm shares its process with other modules and non-wasm code. What does that imply when setting/unsetting?

@lukewagner
Member

I would expect it to be a global flag on a wasm module that cannot be toggled and can thus be assumed for AOT/cached compilation. That still leaves questions w/ dynamic linking but I'd default to the simple option of: if you try to dynamically link a wasm module w/ a different flag than you, loading fails.
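A minimal sketch of that load-time check, in Python for brevity. The `Module` record, its `flush_subnormals` field, and the `LinkError` type are all hypothetical names chosen for illustration; nothing here reflects an actual wasm loader API.

```python
from dataclasses import dataclass


class LinkError(Exception):
    """Raised when two modules disagree on their subnormal mode."""


@dataclass(frozen=True)
class Module:
    name: str
    # Hypothetical module-wide flag, fixed when the module is compiled,
    # so an AOT or cached compilation can rely on it.
    flush_subnormals: bool


def link(host, guest):
    """Refuse to link modules whose flags differ; otherwise return the pair."""
    if host.flush_subnormals != guest.flush_subnormals:
        raise LinkError(
            f"{guest.name} was compiled with flush_subnormals="
            f"{guest.flush_subnormals}, but {host.name} uses "
            f"{host.flush_subnormals}"
        )
    return (host, guest)
```

Because the flag is immutable and checked once at link time, no code inside either module ever needs to test or toggle the mode at run time.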

@titzer

titzer commented Jun 12, 2015

I want to see data, and the burden of proof for violating IEEE 754 is very high in my book. Even if the performance gains are huge, they'll likely be spotty, i.e. on only a small number of platforms. I'd recommend that we find a way to spec it in an "opt-in to nondeterminism because I want speed" fashion. But mostly I trust the IEEE 754 specification and the numerical expertise that went into it more than hardware manufacturers on the cutting edge.

I think the phase where we gather sufficient data to motivate a digression from IEEE 754 hasn't come yet, so spec'ing nondeterminism or deviance seems premature now.

I think mandating FTZ is a no-go, since mode-switching seems to be really expensive on Intel.


@lukewagner
Member

To be clear, it's not nondeterminism that is being discussed: it's (deterministically) flushing denormals to zero. Also, we have had reports (e.g. and these guys iirc) specifically about people hitting denormal perf problems in JS. That setting FTZ (globally, not toggling dynamically) is standard practice for whole domains (games, signal processing) demonstrates that this is something developers expect.

@pizlonator
Contributor

I agree with titzer.

I don’t think there is much long-term value to specifying floating point behavior that is not IEEE.

-Filip


@angryoctopus

I think the vast majority of developers that encounter denormals discover them while troubleshooting unexpected performance hits. In the DSP world it tends to be a very common performance gotcha so I would advocate defaulting FTZ with an optional global IEEE-compliant mode.

@mikolalysenko

My preference would be full IEEE 754 support, or barring that to use DAZ. Undefined behavior would be a very terrible decision in my opinion. Inconsistent semantics make it very difficult to implement algorithms from computational geometry, which require exactly computing things like the sign of a determinant for example. A common technique to speed up these calculations is to use a floating point filter as a quick check before falling back to a more expensive exact arithmetic test. If the floating point behavior is not specified, then it becomes much more difficult (and in some cases impossible) to construct such a filter.
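For readers unfamiliar with the technique, here is a rough Python sketch of a filtered 2D orientation test. The error-bound constant follows the one commonly used in Shewchuk-style `orient2d` predicates, and it is an assumption here, not something taken from this thread. That bound is derived for IEEE round-to-nearest arithmetic with no overflow or underflow, which is exactly the kind of assumption that stops being checkable if the treatment of tiny values is left unspecified.

```python
from fractions import Fraction

EPS = 2.0 ** -53                      # unit roundoff for float64
ERR_BOUND = (3.0 + 16.0 * EPS) * EPS  # assumed filter constant (Shewchuk-style)


def sign(x):
    return (x > 0) - (x < 0)


def orient2d_exact(ax, ay, bx, by, cx, cy):
    """Slow path: exact rational arithmetic on the float inputs."""
    ax, ay, bx, by, cx, cy = map(Fraction, (ax, ay, bx, by, cx, cy))
    return sign((ax - cx) * (by - cy) - (ay - cy) * (bx - cx))


def orient2d(ax, ay, bx, by, cx, cy):
    """Sign of the determinant: 1 if a, b, c turn counterclockwise,
    -1 if clockwise, 0 if collinear."""
    left = (ax - cx) * (by - cy)
    right = (ay - cy) * (bx - cx)
    det = left - right
    # Fast path: trust the float result only when it clearly exceeds the
    # worst-case rounding error of the computation above.
    if abs(det) > ERR_BOUND * (abs(left) + abs(right)):
        return sign(det)
    return orient2d_exact(ax, ay, bx, by, cx, cy)
```

Most inputs take the fast path, and only near-degenerate cases pay for the exact fallback. The filter is only as trustworthy as the error bound, which in turn depends on knowing how the underlying arithmetic rounds.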

@sunfishcode
Member

Here is a discussable proposal which I believe gives most people what they want, though it makes some tradeoffs (as any proposal must):

  • Default to full support for subnormals.
  • A function attribute sets either "standard" or "maybe_flush", with "maybe_flush" meaning that it's nondeterministic whether subnormals are flushed as input and/or output at each operation within the function body (and does not apply within called functions).
  • Float32x4 is not accelerated on 32-bit NEON in "standard" mode, but is in "maybe_flush" mode.

The following questions seem interesting:

Is "standard" the right default? Losing Float32x4 on 32-bit NEON by default is not pretty (though compilers and tools could help detect problems and guide developers to solutions). Compiler flags are a nuisance. However, abrupt underflow is also sometimes problematic, and it's non-IEEE, so it's a question of priorities and perhaps also short-term versus long-term.

Is function-body the right scope for mode switching? It's somewhat fine-grained, but also gives implementations a natural optimization boundary, because dealing with mode changes in the middle of a function is awkward. Inlining can blur such boundaries, but optimizers would at least have the option of declining to do inlining (or other interprocedural optimizations) across boundaries where the modes differ. And implementations might be able to avoid the cost of mode switching across many function boundaries when the mode doesn't actually change.

Is "maybe_flush" what we want, or would a straight "flush" be better? "maybe_flush" avoids requiring CPUs to have both DAZ and FTZ flags. And, some implementations may wish to stay in "standard" mode in some cases. But, it does introduce nondeterminism which could lead to different problems.
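One way to picture the nondeterminism in "maybe_flush" is to enumerate the set of results an implementation would be allowed to produce for a single operation. The Python sketch below does this for a float64 multiplication. The function name `allowed_products` and the choice to flush both inputs together are simplifying assumptions made for illustration, not part of the proposal.

```python
import itertools
import math
import sys

MIN_NORMAL = sys.float_info.min  # smallest positive normal float64


def flush(x):
    """Replace a subnormal with a same-signed zero; leave other values alone."""
    if x != 0.0 and abs(x) < MIN_NORMAL:
        return math.copysign(0.0, x)
    return x


def allowed_products(a, b, mode):
    """Return every result a conforming implementation could give for a * b."""
    if mode == "standard":
        return {a * b}
    if mode != "maybe_flush":
        raise ValueError(f"unknown mode: {mode}")
    results = set()
    # The implementation may or may not flush inputs, and may or may not
    # flush the output, independently at each operation.
    for flush_inputs, flush_output in itertools.product((False, True), repeat=2):
        x, y = (flush(a), flush(b)) if flush_inputs else (a, b)
        r = x * y
        results.add(flush(r) if flush_output else r)
    return results
```

In "standard" mode the set always has exactly one element. In "maybe_flush" mode it has more than one element whenever a subnormal is involved, which is the cost of not requiring hardware to offer both DAZ and FTZ.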

@juj

juj commented Jul 2, 2015

What is the advantage of hardcoding in the spec that mode switching must occur on function boundaries, and not in other places?


@pizlonator
Contributor

This makes perfect sense to me.

-Fil


@sunfishcode
Member

Structured mode switching, rather than just arbitrary dynamic mode switching, means that one can always statically determine the mode for any operation, which is an important property. Putting mode switches at function boundaries achieves this, though another option would be to have a mode-switch AST node which would be like a block node but would set the mode within its lexical extent.

Between function attributes and AST nodes, I chose function attributes because it gives implementations a few more options for avoiding mode switching costs. However, AST nodes would give applications some more flexibility, so we can consider both choices here.
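To illustrate the "statically determine the mode for any operation" property, here is a small Python sketch of the lexically scoped variant. The `Op` and `ModeBlock` node types and the operation names are hypothetical stand-ins, not actual wasm AST nodes.

```python
from dataclasses import dataclass


@dataclass
class Op:
    name: str  # e.g. a floating point operation


@dataclass
class ModeBlock:
    mode: str   # "standard" or "maybe_flush"
    body: list  # nested Op and ModeBlock nodes


def annotate(nodes, mode="standard"):
    """Walk the tree once and pair every operation with its subnormal mode.

    The mode depends only on lexical nesting, so it is known before any
    code runs. No dynamic state needs to be tracked.
    """
    annotated = []
    for node in nodes:
        if isinstance(node, ModeBlock):
            annotated.extend(annotate(node.body, node.mode))
        else:
            annotated.append((node.name, mode))
    return annotated
```

A function attribute is the special case where the only `ModeBlock` allowed is the one wrapping an entire function body.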

@titzer

titzer commented Jul 2, 2015

This proposal sounds pretty reasonable to me. I prefer the "opt-in" to nondeterminism option. The AST node has the advantage of allowing the source producer to do inlining without changing the semantics.


@jfbastien
Member Author

Should it be an AST node, or a per-operation property? That won't cause code bloat because of the way we specify operations.

@titzer

titzer commented Jul 2, 2015

I'd also be fine(r) if this was a SIMD only feature, where vector operations could specify IEEE, FTZ, or DontCare.


@pizlonator
Contributor

That sounds OK, though I would stick with IEEE and DontCare.

Having an FTZ attribute would mean mode switching in embedded scenarios - like if a native app uses JavaScriptCore.framework and then the JS code goes and loads some wasm (this is something that I think we’ll eventually want to support). The rest of the app will almost certainly be IEEE.

-Filip


@jfbastien
Member Author

For use cases like games I think DontCare isn't sufficient: folks actually want Fastest. If the HW makes denorms free then great, but otherwise they want FTZ.

@pizlonator
Contributor

Right, they want Fastest. FTZ won’t be Fastest if you have to mode-switch on every native API boundary.

How about renaming DontCare to Fastest? The point is: “I care less about the semantics of denorms than I care about how fast my code runs”.

-Filip


@pizlonator
Contributor

This is sounding pretty convincing. Question: is the only available method of battling this slow down to enable FTZ, or are there other tricks that people use also?

-Filip

On Jul 10, 2015, at 12:40 PM, Shannon Smith [email protected] wrote:

Any developer who has spent time with audio DSP hits the denormal problem pretty quickly. Any exponential decays or IIR filters, unless carefully designed, exhibit a pretty massive performance hit (at least 10x) once they drop into the denormal range.

One of the reasons I am excited about wasm is that it will enable things that really aren't possible with JavaScript due to performance, like audio DSP, video codecs, etc., where control over things like denormals is critical.



@jfbastien
Member Author

As for the non-browser uses of wasm, I think they are interesting but that’s not the killer app. The killer app here is the browser, so let’s optimize for that.

@pizlonator I was thinking about in-browser usecases that are fully wasm, with little to no JS glue around. Yes, out-of-browser is also a usecase to which my argument applies, but I agree we can mostly ignore it for this discussion.

@angryoctopus

This is sounding pretty convincing. Question: is the only available method of battling this slow down to enable FTZ, or are there other tricks that people use also?

If you can't enable FTZ (e.g. when doing DSP in a Java VM), there are several tricks you can use, such as testing for and flushing denormal values explicitly, or injecting inaudible noise into filters to keep them out of the denormal range.

These sorts of workarounds will work with wasm, but it would be nicer to have a fastmath flag (or better yet, a strictmath flag with fastmath being the default).
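Both workarounds can be sketched on a decaying one-pole filter state. This is a toy Python model using float64, with made-up parameter names (`coeff`, `flush`, `dither`); it counts how many steps the state spends in the subnormal range, and it says nothing about actual timing.

```python
import sys

MIN_NORMAL = sys.float_info.min  # smallest positive normal float64


def one_pole_decay(steps, coeff=0.5, flush=False, dither=0.0, state=1.0):
    """Run state = state * coeff + dither and count subnormal states.

    flush=True  applies the explicit test-and-flush workaround.
    dither>0    models injecting a tiny constant offset into the filter.
    Returns (final_state, number_of_steps_with_a_subnormal_state).
    """
    subnormal_steps = 0
    for _ in range(steps):
        state = state * coeff + dither
        if flush and abs(state) < MIN_NORMAL:
            state = 0.0  # explicit flush, done in software
        if state != 0.0 and abs(state) < MIN_NORMAL:
            subnormal_steps += 1
    return state, subnormal_steps
```

Without either workaround the state passes through the subnormal range on its way to zero. With `flush=True` it jumps straight to zero, and with a small `dither` it settles at a tiny normal value and never becomes subnormal. Both cost extra work per sample, which is why a mode flag is attractive.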

@pizlonator
Contributor

To be clear though, do you want a fast-math flag in the style of Java, a fast-math flag in the style of C compilers, or a fast-math flag that just means FTZ?

-Filip


@pizlonator
Contributor

Thought about this argument more, and I’m no longer so happy with deterministic FTZ even if it’s module-wide.

I expect that wasm users will modularize their code. This is the tendency in any language that supports modules: you create separate modules for separate things.

A deterministic module-wide FTZ setting will make cross-module calls slower in cases where there is a settings mismatch.
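The concern can be modeled with a toy Python sketch that counts control-register writes on cross-module calls. The `FpEnv` class and `cross_module_call` helper are hypothetical; real engines would do this in generated call stubs, and the cost of each switch is hardware-dependent.

```python
class FpEnv:
    """Toy stand-in for the per-thread floating point control state."""

    def __init__(self):
        self.ftz = False
        self.switches = 0  # number of simulated control-register writes

    def set_ftz(self, value):
        # Only pay for a switch when the mode actually changes.
        if self.ftz != value:
            self.ftz = value
            self.switches += 1


def cross_module_call(env, callee_ftz, body):
    """Enter the callee's mode, run it, then restore the caller's mode."""
    caller_ftz = env.ftz
    env.set_ftz(callee_ftz)
    try:
        return body()
    finally:
        env.set_ftz(caller_ftz)
```

When caller and callee agree, the call costs no switches at all. When they disagree, every call pays for two, which is what makes fine-grained calls between mismatched modules the worst case.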

I’m still not convinced about this FTZ thing. Another thing that occurred to me about empirically observed FTZ slow-downs is that they may be due to the presence of denormals changing the convergence characteristics of a numerical fixpoint - that is the fixpoint may take longer to converge. Of course it’s sad when native code exhibits different behavior in wasm than it did natively, but that ship has already sailed. I still don’t see evidence that the lack of FTZ prevents people from writing performant code; it feels like a nice-to-have. And having an FTZ setting that sometimes makes fine-grained cross-module calls slow seems broken.

-Filip

On Jul 10, 2015, at 12:46 AM, Luke Wagner [email protected] wrote:

If we define FTZ module-wide (which was my expectation at the beginning of this thread), then I wouldn't expect the mode flipping to be significant. I asked Intel people about the costs and they said that, while it's not something you want to be doing in a loop, it's not super expensive and has gotten cheaper. Thus, if you factor in the cost of doing the call and associated bookkeeping, I expect the mode flipping wouldn't be significant. There is the use case where someone writes a wasm module full of tiny-bodied functions that are called repeatedly from JS (say, some Math library), but these are simply cases where devs shouldn't set ftz.



@angryoctopus

To be clear though, do you want a fast-math flag in the style of Java, a fast-math flag in the style of C compilers, or a fast-math flag that just means FTZ?

I was thinking more in line with C compilers where IEEE compliance is not guaranteed and denormal support may be disabled. Thinking about it more however, it would probably need to be more explicit.

@jfbastien
Member Author

What @davidsehr proposed is to formalize a full math model, with more than just control on denormals. If we're doing scoped attributes then allow developers to wrap regions where reassociation is OK, where FP contraction can be done, and so on.
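As a reminder of why reassociation needs an explicit opt-in, here is a tiny Python (float64) example in which regrouping a sum changes the result. The function names are illustrative only.

```python
def sum_left_to_right(a, b, c):
    # The grouping a strict compiler must preserve.
    return (a + b) + c


def sum_reassociated(a, b, c):
    # The grouping an optimizer might pick if reassociation is allowed.
    return a + (b + c)


# With a large cancellation, the small term survives in one grouping
# and is absorbed by rounding in the other.
strict = sum_left_to_right(1e16, -1e16, 1.0)
relaxed = sum_reassociated(1e16, -1e16, 1.0)
```

Here `strict` is `1.0`, while `relaxed` is `0.0`, because `-1e16 + 1.0` rounds back to `-1e16`. A scoped attribute would let developers mark the regions where that kind of difference is acceptable.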

I suggest we discuss denormals in this issue, and figure out a wider math model in another issue, potentially punting to post-MVP: figuring out the denormal default matters IMO, but I think we can agree on other math behavior for MVP (essentially, not fast math).

@lukewagner
Member

@pizlonator I see the theoretical multi-module app situation you're talking about (once we have dynamic linking, that is), but if we have a clear default mode (as a non-normative note in the spec and in llvm-wasm) then 99% of modules will all have that default mode. It would also make sense to issue a console warning when dynamically linking heterogeneous ftz-mode modules.

@kg
Contributor

kg commented Jul 11, 2015

I'd argue concerns about the cost of the FTZ switch at module boundaries are also less relevant since the cost of calling out of/into an asm.js module is already elevated in SpiderMonkey (or was the last time I checked, anyhow). The overhead there can be aggressively optimized over time, but you're still going to effectively be transitioning between runtime environments, which means argument values (unless they're ints or floats in registers) are being marshaled into/out of the heap and various other setup is happening. I suspect there will always be some overhead involved here, so the introduction of more in the case of FTZ state mismatch is reasonable given the upside (superior, predictable performance in applications that need FTZ).

There are definitely scenarios where people will want to call into/out of wasm a lot, and in those cases we'll want to strongly discourage the use of FTZ. But the same is true for many existing native APIs - IIRC DirectX on Win32 is rather opinionated about x87 modes etc and it's just something game developers deal with.

@sunfishcode
Member

@jfbastien To address the need for a full math model, I created #260.

@pizlonator
Contributor

On Jul 11, 2015, at 1:58 AM, Katelyn Gadd [email protected] wrote:

I'd argue concerns about the cost of the FTZ switch at module boundaries are also less relevant since the cost of calling out of/into an asm.js module is already elevated in SpiderMonkey (or was the last time I checked, anyhow). The overhead there can be aggressively optimized over time, but you're still going to effectively be transitioning between runtime environments, which means argument values (unless they're ints or floats in registers) are being marshaled into/out of the heap and various other setup is happening. I suspect there will always be some overhead involved here, so the introduction of more in the case of FTZ state mismatch is reasonable given the upside (superior, predictable performance in applications that need FTZ).

Such an overhead doesn't exist in JSC. I don't want the spec to require me to have inter-module call overhead.

There are definitely scenarios where people will want to call into/out of wasm a lot, and in those cases we'll want to strongly discourage the use of FTZ. But the same is true for many existing native APIs - IIRC DirectX on Win32 is rather opinionated about x87 modes etc and it's just something game developers deal with.

This is not about native call overhead. This is inter-module call overhead between wasm modules.

I think it's silly to design wasm just for games, and to have a module mode flag that masquerades as a performance feature but could cause slow downs due to mode switching.

-Filip



@kg
Contributor

kg commented Jul 11, 2015

How is that different from the myriad of existing performance techniques, though?

Tools like PGO can reduce your performance or break your application if guided by bad data/configured incorrectly. You might opt to use a lookup table in a scenario where it's actually more expensive than the computation due to memory characteristics. You might hand-inline some logic into your JavaScript, pushing its size over a threshold and causing some JS engines not to optimize it (FWIW, this can happen in .NET too). In the bad old days on x86, MMX and x87 shared registers so if you mixed those two you paid an enormous mode switch cost to bounce between them. Threading a performance-sensitive algorithm can reduce performance if it ends up highly contended on a lock or atomic.

There are very few optimizations you can make thoughtlessly that have no chance of hurting performance. Optimization is something that has to be an informed decision. FTZ is the same. AFAIK we're talking about an optional FTZ flag that defaults to off, so the vast majority of developers will be fine with the default and not turn it on. Many of those developers will be turning it on because their native application already had FTZ enabled, so they were paying that cost to begin with.

FWIW my SpiderMonkey example was not to imply that JSC is exactly like SpiderMonkey, but to imply that there will probably be some sort of overhead for JS<->WASM or Module<->Module transitions in most engines (eventually, if not right when the MVP is implemented). The design is already making various performance sacrifices for good reasons.

We could always make FTZ an advisory flag so it's spec-compliant for JSC to ignore it, and then we'll find out whether users care or not :-)

@pizlonator
Contributor

On Jul 11, 2015, at 11:44 AM, Katelyn Gadd [email protected] wrote:

How is that different from the myriad of existing performance techniques, though?

Tools like PGO can reduce your performance or break your application if guided by bad data/configured incorrectly. You might opt to use a lookup table in a scenario where it's actually more expensive than the computation due to memory characteristics. You might hand-inline some logic into your JavaScript, pushing its size over a threshold and causing some JS engines not to optimize it (FWIW, this can happen in .NET too). In the bad old days on x86, MMX and x87 shared registers so if you mixed those two you paid an enormous mode switch cost to bounce between them. Threading a performance-sensitive algorithm can reduce performance if it ends up highly contended on a lock or atomic.

There are very few optimizations you can make thoughtlessly that have no chance of hurting performance. Optimization is something that has to be an informed decision. FTZ is the same. AFAIK we're talking about an optional FTZ flag that defaults to off, so the vast majority of developers will be fine with the default and not turn it on. Many of those developers will be turning it on because their native application already had FTZ enabled, so they were paying that cost to begin with.

The issue here is that we are introducing the need for calls between wasm modules (and JS<->wasm calls) to have an overhead where previously there was no need for any such overhead. And we are doing it to support an alleged optimization that relies on non-compliance with IEEE. And we have unreliable data supporting the allegation that it's an optimization at all.

FWIW my SpiderMonkey example was not to imply that JSC is exactly like SpiderMonkey, but to imply that there will probably be some sort of overhead for JS<->WASM or Module<->Module transitions in most engines (eventually, if not right when the MVP is implemented). The design is already making various performance sacrifices for good reasons.

I don't know what SpiderMonkey does, but the JSC approach being taken in our prototype will not have inter module call overhead. We'd rather go down the path of reducing module overheads rather than increasing them.

We could always make FTZ an advisory flag so it's spec-compliant for JSC to ignore it, and then we'll find out whether users care or not :-)

An FTZ advisory flag would be sort of OK and definitely better than a mandatory one.


@titzer

titzer commented Jul 12, 2015

I think the larger issue here is that dealing with FTZ modes feels somewhat
premature, and it's a consideration in very strong need of data showing a
real-world impact on a set of representative benchmarks.
We don't yet have a fully functional implementation, let alone a maximally
performant one, and no benchmarks on which to base real measurements.
There's been a lot of hearsay and anecdotes about potential slowdowns, but
for me the bar for deviating from IEEE should be really really high. It's
easy to add FTZ or nondeterministic denormals later and much harder to take
them away, so I think we should be careful about adding a controversial
performance feature that could fail to pan out and also cause us grief
later.


@kg
Contributor

kg commented Jul 12, 2015

Consistent complaints from people who work on realtime audio and multimedia software are not 'hearsay' and FTZ is only controversial if you're talking about wanting to leave it out of an environment to simplify things. Mind you, simplification is a noble goal. But please don't miscategorize an important feature for real-world workloads, heavily used in existing production applications, as a 'controversial performance feature that could fail to pan out.'

If it's an advisory flag that people only use if they need it, the only way it would cause us grief is if applications ship with it for measurable real performance gains and then somehow we end up with architectural reasons to regret it later (like because we implemented it wrong). I'm not sure how badly we could mess up a module-wide FTZ flag. If we're concerned, we can punt with an explicit statement that we will 'do it right' post-MVP.

@pizlonator
Contributor

Exactly. The course that would make me happiest is to do it right post-MVP, and not mention FTZ in the MVP. The downside of adding FTZ to the MVP in the currently proposed forms is:

Downside of a Nondeterministic FTZ flag: it’s nondeterministic, which can lead to divergence between implementations. My own experience with FTZ is that some codes unexpectedly require either the presence of FTZ or the lack of it because it influences how some numeric fixpoint converges.

Downside of a Deterministic FTZ flag: it cannot be polyfilled and we can’t ever kill it. It also raises the bar for how much work is needed to achieve a compliant implementation.
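To illustrate the convergence sensitivity mentioned above, here is a minimal sketch that emulates flush-to-zero in software and counts how many halvings it takes for a value to reach zero. The helper names `flush` and `steps_until_zero` are made up for this example, and the emulation only approximates what a hardware FTZ mode does (it flushes results, not inputs).

```python
import math
import sys

# Smallest positive normal double (2**-1022); anything smaller in magnitude
# and nonzero is subnormal.
MIN_NORMAL = sys.float_info.min


def flush(x: float) -> float:
    """Emulate FTZ on a result: subnormal values become a same-signed zero."""
    if x != 0.0 and abs(x) < MIN_NORMAL:
        return math.copysign(0.0, x)
    return x


def steps_until_zero(ftz: bool) -> int:
    """Halve 1.0 repeatedly and count the steps until the value is 0.0."""
    x = 1.0
    steps = 0
    while x != 0.0:
        x = x * 0.5
        if ftz:
            x = flush(x)
        steps += 1
    return steps


ieee_steps = steps_until_zero(ftz=False)  # passes through the subnormal range
ftz_steps = steps_until_zero(ftz=True)    # stops at the first subnormal result
```

The two loops terminate after a different number of iterations, so any code whose stopping condition depends on reaching zero (or on a tiny residual) can behave differently depending on whether an engine flushes.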

I think I understand your argument in favor of FTZ: it is something that is beneficial to enable per-process in native apps that do audio, and those who do it feel strongly about it. I take it as a given that they feel strongly about it because they know things about this that I don’t. But I also know that it’s not the only way to get good performance in such code - you can chop away the denormals yourself if you really care, and people sometimes do this. This makes me suspect that FTZ may be more of a convenience nice-to-have than a performance showstopper.
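As a rough sketch of what "chopping away the denormals yourself" can look like, the example below clamps the feedback state of a decaying one-pole recurrence so it never lingers in the subnormal range. The names `chop_denormal` and `one_pole_decay` are hypothetical, and Python doubles stand in for whatever sample type real audio code would use; production code often uses a larger threshold or adds a tiny offset instead.

```python
import math
import sys

# Smallest positive normal double; nonzero values below this are subnormal.
MIN_NORMAL = sys.float_info.min


def chop_denormal(x: float) -> float:
    """Return x unchanged, or a same-signed zero if x is subnormal."""
    return math.copysign(0.0, x) if 0.0 < abs(x) < MIN_NORMAL else x


def one_pole_decay(state: float, coeff: float, n: int) -> float:
    """Run n steps of y[k] = coeff * y[k-1] with silent input.

    The feedback state is chopped after every step, so a decaying tail
    drops straight to zero instead of crawling through subnormals.
    """
    for _ in range(n):
        state = chop_denormal(coeff * state)
    return state
```

The cost of this approach is an extra compare per sample on the feedback path, which is the trade-off against relying on a hardware mode.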

Also, arguments about the performance of FTZ in native code aren’t directly transferable to wasm given wasm’s early state, for the following reasons:

  • In wasm we don’t have a notion of enabling FTZ per-process, so we have to resort to something else. That puts us in somewhat uncharted territory. We do not have hard data on the cost of FTZ mode switching across all of our target architectures, and we don’t have hard data on the cost of denormals across all of our target architectures. Hence, whatever we do here now, we would do it in the blind: while we know that whole-process FTZ is profitable, we don’t know if that profitability will hold when you consider the costs of inter-module mode flipping.
  • In wasm there will be other kinds of overheads already. If audio code uses FTZ to get a 1% speed-up, the speed-up in wasm will probably be less than 1%. That’s because wasm will have some as-yet unknown overhead from other things (memory accesses, lack of undef). The higher those base overheads, the less things like FTZ matter. This makes it difficult to make claims about the overhead of FTZ in wasm based on reports of FTZ overhead in native code.
  • We don’t know exactly when critical mass adoption of wasm will happen, and what the dominant CPUs will be once that happens. Undoubtedly some CPU(s) that seem important today will seem less important then, and also, there may be some new CPU(s) with entirely new constraints. This makes it profitable to defer semantic-changing perf features that are motivated by the CPUs of today. We should defer these things to when, as @titzer said, we have a well-optimized wasm implementation and we can run real apps on that implementation. To me that means that we should do it post-MVP.
  • Adding FTZ later is so easy! On the other hand, removing it is impossible. So, I believe that the bar for adding FTZ right now should be: does the lack of FTZ prevent widespread adoption of the MVP? I doubt that this will be the dominant issue influencing whether people try out wasm.
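The dilution argument in the second bullet can be put into numbers with a small sketch. It assumes, purely for illustration, that FTZ saves the same absolute amount of time in wasm as in native code and that wasm's other overheads are simply added on top; `relative_speedup` and the figures used are made up, not measurements.

```python
def relative_speedup(native_time: float, ftz_saving: float,
                     wasm_overhead: float) -> float:
    """Fraction of total run time saved by FTZ.

    native_time:   run time without FTZ in native code
    ftz_saving:    absolute time FTZ saves (assumed identical in wasm)
    wasm_overhead: extra time added by wasm (assumed purely additive)
    """
    return ftz_saving / (native_time + wasm_overhead)


# Hypothetical figures: a 1% native win, and a 20% wasm base overhead.
native_gain = relative_speedup(100.0, 1.0, 0.0)
wasm_gain = relative_speedup(100.0, 1.0, 20.0)
```

Under these assumptions the native 1% gain shrinks to roughly 0.83% in wasm, and it keeps shrinking as the base overhead grows.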

-Filip

@lukewagner
Member

I expect FTZ isn't something we're going to see broadly across benchmarks; it'll have 0 impact on 99.9% of apps and a 2x slowdown on the 0.1% of apps that happen to run into slow denormal ops on hot paths. But this is exactly the description of a post-MVP feature, so maybe that is the right path. Starting with less nondeterminism, one less thing to implement, and more polyfill fidelity in v.1 is a good consolation prize.

For now or post-MVP, I had one idea for a refinement on how to define FTZ: give modules a list of global options which are ignored if not known to the browser. Make "FTZ" an optional feature (one that could be permanently not-implemented while still being conforming). Engines which want "fastest" could unconditionally set-and-forget "FTZ". Codes that really want mandatory FTZ could feature test and, if FTZ wasn't present, take a different code path that did explicit denormal flushing. I wonder if llvm-wasm could even include a flag that did all this automatically (scoped or globally).
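A minimal sketch of the shape of this idea, with Python standing in for both the engine and the module: requested options the engine doesn't know are silently dropped, and the application feature-tests the result to choose a code path. Every name here (`negotiate_options`, `pick_kernel`, the option strings, the kernel labels) is hypothetical; nothing like this exists in any wasm design document.

```python
def negotiate_options(requested, supported):
    """Return the requested options the engine honors.

    Unknown or unsupported options are ignored rather than rejected, so an
    engine that never implements "FTZ" is still conforming.
    """
    return {opt for opt in requested if opt in supported}


def pick_kernel(active_options):
    """Feature test: rely on engine FTZ if granted, else flush explicitly."""
    if "FTZ" in active_options:
        return "plain"           # engine flushes subnormals for us
    return "explicit_flush"      # fall back to manual denormal flushing


# An engine that implements FTZ, and one that permanently does not.
with_ftz = negotiate_options(["FTZ", "some-future-option"], {"FTZ"})
without_ftz = negotiate_options(["FTZ", "some-future-option"], set())
```

In this sketch `pick_kernel(with_ftz)` selects the plain path and `pick_kernel(without_ftz)` selects the explicit-flush path, which is the split a toolchain flag could generate automatically.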

@kg
Contributor

kg commented Jul 14, 2015

@jfbastien
Member Author

As discussed today: we'll wait for data before coming to a conclusion. Leave bug #148 open, don't change FAQ with #260 just yet. Re-discuss when @titzer and @pizlonator can discuss over higher-throughput medium than github issues.

@MikeHolman
Member

So until this morning I thought this conversation was about some mode for allowing FTZ, with subnormals by default. And I think we can get data and implement something good, so I wasn't too worried about this.

But having FTZ by default would incur a nontrivial penalty to every call across the FFI.

If you are very adamant about FTZ, then maybe we should move up the mode switching to an MVP issue, which could allow us to skirt the whole debate about defaults.

@sunfishcode
Member

I am currently proposing we fix this with #271.

@sunfishcode
Member

#271 is now merged.
