Relaxed SIMD #1401
Comments
Thanks @ngzhian @Maratyszcza. This is super useful. It would be helpful if the instructions in this extension were not limited to this top 5. We had evaluated float min/max, float-int conversions, etc., and found that they can offer significant performance gains depending on the platform. Having these as new instruction variants rather than as a relaxed-simd-mode extension (mentioned above) will help minimize potential non-determinism introduced by engines. |
Yes, that list is not exhaustive, I hope we can come up with more when we eventually have a repo (similar to how we proposed and merged instructions for SIMD).
Good idea, I'll add this snippet to the description, thanks. |
Thanks. Related: #1393 (comment)
Is there a near-term plan to have the CG phase 0/1 poll? |
A tricky thing about Relaxed SIMD and related operators is the meaning of "consistent across runs". It's obviously valuable for a "program" to be able to assume that rounding is consistent across a "run", to avoid discontinuities etc., but in wasm in general, it's not always clear what the scope of a "run" is, for example for long-lived suspended and resumed instances, or instance graphs distributed across multiple underlying hosts. I have an idea for how to avoid this, and I'm curious what others think: Introduce a new opaque type, Initially, the only way to obtain an
    (global $fp (import "host.fp" "default") fpenv)
    (func $foo (param f32) (param f32) (param f32) (result f32)
      local.get 0
      local.get 1
      local.get 2
      global.get $fp
      f32.qfma
    )
(The names "host.fp" and "default" here are just for illustration; this is something we'd need to figure out.) Instead of saying In typical implementations, A downside is wasm code size; the import and |
I agree with this, but I'm not convinced it's worth guaranteeing this kind of strong internal consistency in the spec. We could leave it up to distributed and migratory engines to choose for themselves whether to provide this guarantee, and engines that are not distributed or migratory would be able to provide it without doing anything extra. The purpose of these instructions is to dramatically improve performance, but engines with strong determinism requirements won't be able to realize any performance improvement from them. In practice, I expect engines with such determinism requirements to not want to implement these instructions to begin with, so introducing extra complexity to make them deterministic would not be worth it. |
FMA seems to be a counterexample to this; it's available on lots of CPUs these days (and that link is out of date; there are many more now), so for example, many server environments today can comfortably depend on having FMA. It has wide applicability, including in domains where both determinism and distributed compute are important (linear algebra over tiled datasets). And it delivers major speedups (the ~30% number mentioned above). |
Are there cases where, if the instruction is not "consistent across runs" (for some definition), it will still be useful, and not confusing? I'm thinking of defining it like:
The determinism is local, and for this particular example there are always only 2 possible cases. But it isn't consistent, in that if you have the distributed execution, one host can choose to do a single rounding and another host can choose to do two roundings. If the application is robust to such differences, then they get the most performance out of this. |
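To make the "single rounding versus two roundings" distinction concrete, here is a minimal Python sketch. It is an illustration only, not how any engine implements the instruction: it works on doubles rather than f32, and the helper names `fma_single_rounding` and `fma_two_roundings` are made up for this example. The fused result is emulated by doing the arithmetic exactly with `fractions.Fraction` and rounding once at the end.

```python
from fractions import Fraction


def fma_single_rounding(a: float, b: float, c: float) -> float:
    """Compute a*b + c exactly, then round once to the nearest double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def fma_two_roundings(a: float, b: float, c: float) -> float:
    """Round the product to a double, then round the sum a second time."""
    return a * b + c


# Inputs chosen so that the exact product 1 - 2**-60 is not representable.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

fused = fma_single_rounding(a, b, c)   # keeps the tiny term: -2**-60
unfused = fma_two_roundings(a, b, c)   # product rounds to 1.0, so the sum is 0.0

print(fused, unfused)
```

Both values are valid outcomes under the definition sketched in the comment above; an application that is robust to this difference could tolerate either host's choice.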
There surely do exist applications that are robust to FMA rounding differently each time it's invoked. But there are also applications where absolute precision isn't important, but avoiding discontinuities is. For example, in many graphics programs, it may not be noticeable if the position or color of a particular object is slightly different from what a fully precise computation might show, but it may be noticeable if there's a bump in a line or a boundary in a gradient. I've seen GPU drivers split FMAs into discrete multiplies and adds in places where they can fit just the multiply or the add into the hardware pipeline in a particular place in the code, and I've seen actual applications be broken as a result. If we want consistency, we should say so in the spec. |
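The discontinuity concern can be sketched as follows. This is a hypothetical illustration in Python on doubles: `shade` stands in for any per-sample computation, and the `fused` flag stands in for whatever choice the hardware or engine makes. Two neighbouring tiles that evaluate the same boundary sample agree when they make the same choice and disagree when they do not.

```python
from fractions import Fraction


def mul_add(a: float, b: float, c: float, fused: bool) -> float:
    """a*b + c with either one rounding (fused) or two roundings (unfused)."""
    if fused:
        return float(Fraction(a) * Fraction(b) + Fraction(c))
    return a * b + c


def shade(x: float, fused: bool) -> float:
    """Hypothetical per-sample computation: (1 + x) * (1 - x) - 1."""
    return mul_add(1.0 + x, 1.0 - x, -1.0, fused)


boundary = 2.0**-30  # sample shared by two neighbouring tiles

# Both tiles make the same choice: the seam is invisible.
consistent_seam = shade(boundary, True) - shade(boundary, True)

# One tile fuses and the other does not: the same sample yields two values.
mixed_seam = shade(boundary, True) - shade(boundary, False)

print(consistent_seam, mixed_seam)
```

Neither result is "more wrong" in isolation; the visible artifact comes only from mixing the two behaviors within one computation.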
What if we solve the problem in a different direction by instead clarifying what we mean by "run" in the spec? We could add language to the effect of "FMA may have this or that behavior, but an instance may not observe both behaviors and all instances in a store must observe the same behavior." Again, for non-distributed, non-migratory engines, this imposes no additional burden. Distributed/migratory engines would still have to do whatever they would do with the explicit |
The store is abstract, just a way for the spec to talk about entities that can be linked. Can a store span suspend/resume cycles? Can it span networks? The spec just says if you can link exports to imports, it's a store, which gives embedders a lot of flexibility. If we start using the store to hold miscellaneous configuration information, it would give the store an identity, making these questions more complex. The point of |
I see your point about the store being an imperfect abstraction for this use case. If an engine snapshots an instance's memory and reinstantiates the module with that snapshotted memory on a different machine, it seems that you could call that a different store but still break the program if FMA semantics change. I guess I'm not convinced that letting producers express fine-grained intent here is useful enough to be worth this extra complexity, though. I can see that a use case could be constructed, but I also know that the users I work with who really want to use FMA won't need this. I'm of course biased, though, because I only work closely with Web users, for whom this wouldn't make a difference. Do you know of users who are eager for this fine-grained control? If there are none now, perhaps we could either make no guarantees or find a way to specify coarse-grained guarantees about semantic consistency in this proposal. Assuming there are no users now, I would be happy to revisit this and do another version with finer-grained control in the future when there are users ready to take advantage of that. |
One of the interesting properties of wasm is its virtualizability. All interactions with the outside world go through imports and exports, so it's straightforward to have WebAssembly instances with no concept of "the machine I'm on" as a distinct identity they can interact with. With WASI, we're working on an ecosystem where programs don't know about "the filesystem namespace of the machine I'm on" or "the local network configuration of the machine I'm on" (substitute in "the container I'm in" or "the VM I'm in" as needed ;-)). This means "the machine I'm on" could change over time, and "the machine I'm on" could be different from "the machine other instances I'm linked to are on". We can do this, without prearranged configuration, because the relationships between instances are completely described in their imports and exports. I myself and others are building on systems which will take advantage of this property in general. There are cases where we can get by without this property. For example with FMA, if we're ok limiting our scope to just servers, then we can maybe just depend on FMA being available everywhere. However, on one hand, I can't guarantee we'll always limit our scope to just servers. And on the other, the proposal here already has more than just FMA, and more things may be added in the future. I understand this is adding complexity up front, but my concern is that if we don't preserve this property of wasm, it will be difficult to re-introduce. And it doesn't seem like it's that complex for producers to produce or for minimal consumers to ignore. |
How is this addressed for NaNs in the |
The bits of a NaN produced by an |
@sunfishcode, if we did add an You're right that Wasm is generally meant to be deterministic in all the ways that matter and should be easily virtualizable. But this proposal is explicitly meant to be an exception to that rule, which is why it was split out of the SIMD proposal. I would not expect heterogenous runtimes that want to provide determinism to implement this proposal at all. Code that wants to run on runtimes that provide this proposal as well as runtimes that do not provide this proposal should use the feature detection proposal to accomplish that. |
Well, the NaN bits are formalized to be non-deterministic, but in practice they're platform-specific. I agree that it's rare for programs to care about the bits of NaNs beyond what is specified, but I wouldn't be surprised if some programs are implicitly relying on the fact that those bits at least behave the same "across runs". This is not to dismiss the concern @sunfishcode is raising. I just wonder if the scope of the concern is broader than this proposal. If it is, then for virtualizing engines, I wonder if what would make sense is to, when compiling a module, track any platform-specific behaviors it has and then, when instantiating a module, record those specifics of the current platform and make sure to only move instances to platforms with the same specifics. (Or to restrict the code to a platform-independent subset of wasm.) Thoughts? |
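The tracking idea in the comment above could look roughly like the following Python sketch. Everything here is an assumption made for illustration: the operator name `f32x4.recip_approx`, the behavior labels, and the three helper functions are hypothetical, and a real engine would track this inside its compiler and instance metadata rather than in dictionaries.

```python
# Hypothetical set of operators whose results depend on the platform.
# Only "f32.qfma" appears in this thread; the other name is invented.
PLATFORM_SPECIFIC_OPS = {"f32.qfma", "f32x4.recip_approx"}


def platform_specific_ops(module_ops):
    """At compile time, collect the platform-specific operators a module uses."""
    return set(module_ops) & PLATFORM_SPECIFIC_OPS


def record_behaviors(module_ops, platform):
    """At instantiation, record how the current platform implements them."""
    return {op: platform[op] for op in platform_specific_ops(module_ops)}


def can_migrate(recorded, target_platform):
    """Allow migration only if the target matches every recorded behavior."""
    return all(target_platform.get(op) == how for op, how in recorded.items())


# Two hypothetical hosts that agree on FMA but not on the reciprocal table.
host_a = {"f32.qfma": "single-rounding", "f32x4.recip_approx": "table-a"}
host_b = {"f32.qfma": "single-rounding", "f32x4.recip_approx": "table-b"}

fma_only = record_behaviors(["f32.mul", "f32.qfma"], host_a)
uses_recip = record_behaviors(["f32x4.recip_approx"], host_a)

print(can_migrate(fma_only, host_b), can_migrate(uses_recip, host_b))
```

In this sketch an instance that only uses the FMA operator may move from `host_a` to `host_b`, while one that uses the approximate reciprocal may not.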
@tlively I outlined a use case above which is ok with FMA being nondeterministic, but which needs FMA to be consistent within a run. Discontinuities break real applications, even when full precision and full determinism aren't needed. Is @RossTate Re: NaNs: With NaNs we can at least tell developers "don't do that", whereas the FMA situations discussed here occur in common usage. And in some use cases, we can bypass the NaN problem using wasm engine flags to canonicalize NaNs. In all, the story indeed isn't completely watertight, but it's a fairly small problem in practice. Re: Tracking platform-specific behaviors: Yes, we can avoid breaking an instance using special tracking and maintaining extra state, but it doesn't address the linking issue, where we want to know up front if two instances need to see the same behavior. |
@sunfishcode, sorry, in previous comment I used "nondeterministic" and "deterministic" where I really meant "inconsistent within a run" and "consistent within a run." I updated it, so now it hopefully makes more sense. It's not that |
As a thought exercise, if we were to go the route of including the That said, what would it mean for Approximate reciprocal/reciprocal sqrt instructions? Unless I'm missing something, these will give slightly different results on different architectures and it doesn't seem preventable with |
@dtig We might think of |
I'd like to point out that we can't fully encapsulate non-determinism inside |
If a run involves moving execution to a different machine, approximate reciprocal instructions would produce different results if they lower to |
I’m still not sure |
@Maratyszcza It isn't about guaranteeing that we can always migrate, but about making application requirements explicit. @RossTate Any ideas you have for other ways to solve the problem are welcome. |
@sunfishcode at the extreme, we could document something equivalent to |
I think using an import also suffers from the same problem as trying to specify consistency in terms of the store; the consistency of the identity of the imported value depends on the same underlying notion of the lifetime of "a run." In the worst case, a spec-compliant engine could be created that had a different value for the import every time a host function calls into the instance. On each host-to-wasm call, this engine would actually reinstantiate the module with a different import value and replay the execution trace on the new instance before making the call. Obviously this would be an outlandish thing to do, but it demonstrates that using an import on its own does not provide any strong formal guarantees about internal consistency from the program's point of view. The best we can do is to tie the internal consistency guarantee to the lifetime of something that already exists in the spec, which would probably be an instance. Then it would be up to the engines to clarify what they consider to be the lifetime of an instance. The outlandish engine above would document that instances only live as long as it takes the module to return from a host-to-wasm call, and more realistic engines would document that instance identity does not change when an instance is migrated to a different machine. If we want programs to be able to opt in or out of internal consistency, it would be simplest to provide two versions of each instruction (or a discriminating immediate) to allow producers to express that directly. No need for an import or anything else. (We should still discuss later whether that should be a goal of this proposal, though.) |
It’s an interesting problem. I think I know various semantic techniques to formalize possible semantics, so I’m more interested in the pragmatics. For that, it would be helpful to understand the scenario in more depth. @sunfishcode, could you give a more detailed example of both something you want to allow and something you want to prevent, and then could you explain how you see type imports as helping distinguish the two cases? That’d give me a lot more to work with and offer suggestions for (next week, when I have access to a computer again). |
@tlively Instantiation is an observable event, from the perspective of a program. In general, hosts can't arbitrarily tear down and re-instantiate without consequences, including potentially losing data. I'm ok if programs can't rely on consistent floating-point nondeterminism across their instances being torn down and re-instantiated. @RossTate Here's a summary: Some applications are ok if certain floating-point operators are nondeterministic, but they may rely on the operators being deterministic "at runtime", so that they don't see sudden discontinuities. We want a way to say that all the instances that make up a "run of a program" see the same behavior as each other. Imports are a way to express this: we can have one instance import an Thinking this through more, I have now also thought of a way to do this without first-class values. The idea of "intrinsic functions" for wasm has come up several times, but it's never been clear what the difference between an "instruction" and an "intrinsic function" is. Perhaps nondeterminism is now such a difference. If we made |
Thanks for the info, @sunfishcode! If one were to try to enforce this statically, I don't think imports would be enough. Consider that a module can import multiple For what you're describing, I think you'd need an extension like an effect system. The effect of platform-specific instructions would indicate the specific Does that make sense? |
@RossTate I left out a little too much detail in my summary :-}. We don't need mechanisms that fail type checking if the program uses multiple What we need is just a way for one piece of code to ask to be consistent with another piece of code. Whether we import globals or functions, imports give us that: the way a piece of code asks to be consistent with another piece of code is to use the same imported value. Imports can also be re-exported to other modules to provide the same value across module boundaries. |
@sunfishcode, how would you deal with this point that @Maratyszcza raised?
|
@tlively It isn't about guaranteeing that we can always migrate, but about giving code a way to state its needs. If module A imports, uses, and re-exports an |
@sunfishcode I'm worried that you are expecting imports/exports to be able to express more than they can (or at least than they can express easily). Could you give an example of a multi-instance program that should validate (because of consistent use of |
@RossTate, I don't think that using the "wrong" fpenv is supposed to be a validation error, but it might allow the program to observe inconsistent semantics that it was not expecting or otherwise it might over-constrain a distributed host. |
@RossTate There is no example of a program that should not validate :-). This isn't about catching programs using accidentally inconsistent Imagine assigning every possible floating-point reciprocal algorithm a unique integer value. With that, So there's no magic here and no invalidation. It works like plain values. |
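The "it works like plain values" point can be sketched in Python. This is a loose analogy under stated assumptions, not the proposed wasm design: the integer encoding, the `qfma` helper, and the `Module` class are all hypothetical, and FMA on doubles is used in place of a reciprocal algorithm because it is easy to emulate exactly.

```python
from fractions import Fraction

# Hypothetical encoding: each integer names one concrete FMA behavior.
TWO_ROUNDINGS = 0
SINGLE_ROUNDING = 1


def qfma(a: float, b: float, c: float, fpenv: int) -> float:
    """a*b + c, with the rounding behavior selected by the fpenv value."""
    if fpenv == SINGLE_ROUNDING:
        return float(Fraction(a) * Fraction(b) + Fraction(c))
    return a * b + c


class Module:
    """Stand-in for a wasm instance that imports an fpenv and re-exports it."""

    def __init__(self, imported_fpenv: int):
        self.fpenv = imported_fpenv  # re-exported as a plain attribute

    def compute(self, a: float, b: float, c: float) -> float:
        return qfma(a, b, c, self.fpenv)


host_default = SINGLE_ROUNDING
module_a = Module(host_default)      # imports the host's value
module_b = Module(module_a.fpenv)    # imports the value re-exported by A
unrelated = Module(TWO_ROUNDINGS)    # never asked to be consistent with A

args = (1.0 + 2.0**-30, 1.0 - 2.0**-30, -1.0)
print(module_a.compute(*args), module_b.compute(*args), unrelated.compute(*args))
```

Nothing is validated or rejected here: `module_a` and `module_b` agree simply because they were handed the same value, and `unrelated` is free to differ.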
Oh, okay. Sorry, you had said you needed to know required compatibilities "up front", which threw me off. What you're describing then seems to need to be able to affect compilation (unless you want these to actually compile to function calls), so would |
Reciprocal algorithms implemented in hardware have very few limitations imposed on them (only a maximum relative error on x86). They are lookup tables hard-coded in the processor implementation. There is no API to get the parameters of these tables, so the only way to emulate them is to dump 16 GB representing a 4-byte output for each 4-byte input. |
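The 16 GB figure follows from simple arithmetic, sketched here in Python under the assumption that every 32-bit input pattern gets its own 4-byte table entry (the variable names are only for illustration):

```python
# One table entry per possible 4-byte (32-bit) input pattern.
num_inputs = 2**32
bytes_per_output = 4

table_bytes = num_inputs * bytes_per_output
table_gib = table_bytes / 2**30  # size in binary gigabytes

print(table_bytes, table_gib)
```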
@Maratyszcza Yes, that's correct. Fortunately though, I'm not looking to emulate anything here. I'm primarily looking to give applications a way to state their needs so that I can avoid doing things that would break them. @RossTate In most implementations, the idea is that there is only one possible If wasm added a pre-import mechanism, we could switch to using it for |
Relaxed SIMD moved to phase 1 at the CG meeting earlier today. We now have our own repository at https://github.com/WebAssembly/relaxed-simd; please file issues and continue discussions there. I filed WebAssembly/relaxed-simd#1 to capture what I think is the main discussion here, around "consistency" and the various suggestions on how to specify that. Thank you everyone for all the discussion, and I look forward to more! |
Relaxed SIMD adds a set of useful instructions that introduce local non-determinism where the results of the instructions may vary based on hardware support.
The SIMD proposal focuses on getting a set of SIMD instructions that will speed up real-world use cases while staying true to the deterministic core of the language. However, some instructions that could unlock even more performance were not included because of their architecture-dependent semantics. These instructions include:
These instructions have been suggested in multiple places: FMA (1, 2, 3), approximate reciprocal/reciprocal sqrt (1). Such instructions have also been mentioned as part of future features.
There is a soft dependency on the feature-detection proposal, which will allow code to determine whether certain instructions are supported by the hardware and can be safely relied on.
Non-determinism: The non-determinism in this proposal is limited to the result of an individual instruction and is consistent across runs. There are no global controls or flags involved. This means that given the following pseudocode:
w can have different values depending on available hardware support. Multiple usages of the instruction will return the same result w, so the instruction is internally consistent.
Initial prototypes indicate a performance improvement of ~30% on modern CPU architectures. The alternative, which is to provide a deterministic FMA result using emulation, would be too slow to be of any use.
Potential extension: introduce a relaxed mode for existing SIMD instructions. Such a mode would be tied to the feature-detection proposal: if relaxed mode is supported, the existing SIMD instructions will have non-deterministic behavior, e.g. NaN canonicalization and the FP IEEE compliance modes used by developers (e.g. no-honor-inf, no-signed-zeros, no-trapping-math).
Keywords (for SEO): Fast SIMD
Co-champions: Marat Dukhan (@Maratyszcza) and Zhi An Ng (@ngzhian)