SIMD + Data races #209
Yeah, I used "volatile" for lack of a better word. Atomic would be incorrect because the results are not and cannot be atomic.
Ah yes, that's a problem. However, speculative reads are allowed in LLVM, they just return undef, so if they implement the "freeze" operation, you could combine them (do a speculative read and then freeze the result) to non-atomically read from a racy memory location without introducing UB.
You're not the first to want a "freeze" operation ;3
In theory, LLVM's atomic memcpy can express the desired semantics, and a memcpy of constant size equal to a SIMD register width could then be optimized to a SIMD load. But LLVM doesn't know how to do that optimization. Volatile loads will work in practice, and I claim (to some dispute) that it is correct to rely on them working; see also this thread about the interaction between atomics and volatile. But they have stronger semantics than necessary – e.g. they can't be elided if unused.
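A minimal sketch of the volatile-load approach described above (the function name is mine; whether this is actually guaranteed to be race-free is exactly the point in dispute):

```rust
use std::arch::x86_64::__m128i;

/// Read 16 bytes from a potentially racy location via a volatile load.
/// Safety: `ptr` must be valid for 16-byte reads and 16-byte aligned.
unsafe fn racy_simd_read(ptr: *const __m128i) -> __m128i {
    // Volatile semantics are stronger than needed here: the load can
    // never be elided or duplicated, even if the result goes unused.
    std::ptr::read_volatile(ptr)
}
```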
Specifically, part of this dispute is that LLVM does not actually say that they are guaranteed to work that way, and LLVM devs were not very sympathetic to adding such a guarantee to their language. We cannot reasonably rely on things not guaranteed to us by LLVM. But yes, it seems like what you want is a relaxed SIMD load here, one that's permitted to tear but not subject to the UB rules. That's the "right semantics" here, I think we all agree?
Which SIMD reads are you talking about? IIRC, on stable Rust, the aligned load / store |
Sounds right to me.
I am talking about those intrinsics, but to be clear: there's no actual bug here, I'm just trying to work out what I can rely on. What you're saying sounds plausible, but it's super unclear to me that this is actually guaranteed, and we may need to relax the wording around what constitutes a data race, or else define these intrinsics as being "synchronised".
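For concreteness, a sketch of the kind of stable aligned-load intrinsic presumably under discussion; on the Rust side it is a plain, non-atomic 128-bit load:

```rust
use std::arch::x86_64::{__m128i, _mm_load_si128};

/// Safety: `ptr` must be valid for 16-byte reads and 16-byte aligned.
unsafe fn load(ptr: *const __m128i) -> __m128i {
    // Typically compiles to a single movdqa, but carries no atomicity
    // or synchronization guarantee at the language level.
    _mm_load_si128(ptr)
}
```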
No, that is certainly not the case. These all count as non-atomic accesses unless explicitly stated otherwise. The only time you may use x86-level reasoning for anything is
I see no bug here. Only and exclusively atomic operations synchronize with anything. These operations are not atomic, so the compiler can reorder, duplicate and eliminate them however it wants.
So are they volatile? Because that is the only mechanism I know of to actually get HW-level semantics into the surface language in some sense. For these things, I think the x86 instruction is mostly meant as a demonstration, to avoid having to spell out SIMD in the abstract machine -- but from what I learned, the entire purpose of using these intrinsics is to make the compiler optimize them, which immediately implies they are not volatile and thus are not guaranteed to map to anything.
@gnzlbg That's just plain wrong. Many std::arch intrinsics map to plain old LLVM IR constructs which have all the usual rules, including loads and stores for intrinsics such as
It is naively appealing to think of intrinsics as "mapping to some hardware instruction", but in reality intrinsics are and should be subject to compiler optimizations, and have behavior that balances optimization power with allowing programmers to achieve good codegen. They provide access to some hardware capabilities but do not automatically import all aspects of the underlying hardware into the language semantics. That is often not even reasonably possible (e.g., recall the now-deprecated intrinsics for reading and writing EFLAGS). Guaranteeing specific sequences of binary code is the domain of inline assembly, not of intrinsics.
The Intel spec for
which performs a single atomic read of 16 bytes from memory. So AFAICT,
The spec does not mention "volatile" anywhere nor if these intrinsics are ordered with respect to other intrinsics or not.
Sure, AFAICT the implementation of
If this is a correct reading of Intel's documentation, then not even ICC implements this behavior. A strict "intrinsic used => instruction occurs in binary" reading is plainly false, and even if you retreat to some kind of behavioral as-if rule (difficult to define, but let's assume you manage) you'll find discrepancies such as
The simplest explanation for this and many other observations is that you're interpreting intrinsics in an overly literal way. "Generate this specific instruction and graft whatever it does in hardware onto the program semantics" is almost never a reasonable approach for an optimizing compiler. It's always some middle ground between preserving the ability to reason about the program and being able to generate code that makes use of the hardware functional units that the programmer wants to program. This is not some radical new idea but, as I've said and illustrated before, how intrinsics work even in the C compilers that set all the precedents we're trying to emulate in Rust.
Would it also be unsound for LLVM to reorder two
I think you are just experiencing a not-very-carefully written spec here. The spec works when looking at compiler backends, but by applying it to the surface language you are basically pretending the Rust memory model is the same as that of the hardware, when that is clearly not the case.
Those are good questions, the spec doesn't say.
For I see your point that probably the spec only provides
FWIW, for the intrinsics we do not guarantee that the same instruction sequence that Intel prescribes is generated (even though many users open issues related to that), but we do guarantee that whatever we generate is semantically equivalent to whatever the Intel documentation specifies (EDIT: whatever that might be).
The load is likely guaranteed to not tear but that does not mean it is atomic in the sense of the C11 memory model.
Crucially, this is using equivalence on the Rust Abstract Machine, not the x86 spec. And for the Rust Abstract Machine, the fact that these instructions are not "atomic" in the sense of the concurrency memory model makes a huge difference.
FWIW, by atomic I mean in the LLVM memory model sense, where the load would be atomic with at least an unordered ordering.
I see no reason why it should be treated like that. And I see a lot of reasons it should not be treated like that, as it will inhibit optimizations.
So IIUC what you are saying is that we do not / cannot guarantee that
I said nothing like that. Tearing is not even a concept in the Rust Abstract Machine -- there is no way, on that level, to observe whether an access tore or not. The way I see it, these intrinsics are hints to the compiler to please use certain efficient HW instructions and registers to achieve the specified effect. But the only guaranteed part is that, e.g., this thing will load 128 bits (or whatever it is) of memory into the return value somehow. If the compiler thinks it can do that more efficiently than by emitting
These intrinsics must be more than hints, because as you mention we do guarantee some semantics for them (e.g. loading 16 bytes from memory), and right now we have RFC'ed that if we expose an intrinsic, we guarantee whatever the vendor spec guarantees for it. For the cases in which the intrinsic does not currently work or cannot work according to the vendor spec (e.g. the cases @rkruppe mentioned above: EFLAGS, MXCSR, etc.), we just don't provide the intrinsic. Unsafe code is allowed to, and often does, rely on the guaranteed semantics of the intrinsics for correctness; e.g., there is quite a bit of stable Rust crypto code out there relying on the AES intrinsics actually using the CPU crypto instructions - if those were "just hints", those intrinsics would be useless and all that code would become insecure. (EDIT: another example is how the ARM intrinsics are being used as a replacement for inline assembly, where the intrinsic semantics are "do a volatile write of value X to register Y".)
For the semantics that we do guarantee, we currently consider it correct for the compiler to emit any instruction sequence that preserves those semantics and is at least as efficient as the instruction sequence "recommended" (or specified, depending on how you read the spec) by the vendor. That allows const propagation, but it wouldn't allow, e.g., breaking the semantics of the program by emitting a store instead of a load, or replacing an atomic load with two smaller loads iff the vendor specifies that the load must be atomic, etc.
Sure, iff the vendor intrinsic does not synchronize in any way, then that would be a valid program transformation.
@gnzlbg IMO, there is a certain threshold in the desire for machine control beyond which intrinsics shouldn't be used, and (possibly inline) assembly should be used instead. But I'm not skilled enough at language lawyering to tell where exactly this point lies. To give some context, in my professional community (scientific computation), intrinsics are typically used to perform vectorization by hand when the compiler's optimizer is not smart enough to do it on its own. In this context, defining intrinsics as mere syntactic sugar for inline assembly which provides little more than automatic register allocation would be unhelpful, because it would have the effect of killing almost all other compiler optimizations, forcing even more hardware-specific micro-optimization work to be done by hand. For example, if compilers cannot be trusted to optimize a SIMD store to a memory location immediately followed by a SIMD load from the same location into nothingness, then some SIMD abstractions like linear algebra libraries cannot be made as performant as reasonably well-written custom assembly at all, at least not without resorting to horrible code generation hacks like C++'s expression templates. In this respect, I'm more sympathetic to the view expressed by @RalfJung here, although I will readily admit that many current compilers treat intrinsics in a very cautious fashion that is more aligned with the view that you are expressing. Of course, if the vendor intrinsic spec is very narrow-minded and forces very specific hardware semantics upon the compiler, that's a different problem, and then compliant compilers have to behave as you describe...
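For illustration, a minimal sketch of the store/load round trip mentioned above (the function is mine; the intrinsics are the real SSE ones):

```rust
use std::arch::x86_64::{__m128, _mm_load_ps, _mm_store_ps};

/// Safety: `buf` must be valid for 16-byte reads/writes and 16-byte aligned.
unsafe fn roundtrip(buf: *mut f32, v: __m128) -> __m128 {
    _mm_store_ps(buf, v);
    // If intrinsics were treated like inline assembly, the compiler could
    // not fold this pair away; treating them as ordinary loads/stores lets
    // an optimizer simply return `v` here.
    _mm_load_ps(buf)
}
```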
@HadrienG2 It all boils down to what the particular intrinsic guarantees to its users. For example, the embedded WG got team blessing for adding intrinsics to
There are dozens of such intrinsics exposed in
Speaking personally, I see the existence of those intrinsics as a hint that the need for inline assembly is becoming more pressing in this particular community... But I'm well aware that finite person-power and design tradeoffs are a thing 😉
I think there's a need for both "do this exactly" and also "just achieve this semantic effect, I don't care how". And there's also a need to be clearer about which is which.
@Lokathor Agreed! I think inline assembly (once stabilized) should become the tool of choice for "do this exactly", freeing up most intrinsics to just "do something that has similar semantics".
(Who needs inline assembly when you can just link an extern object file into the build? ;3 ) |
(It allows extra optimizations for the code around the inline assembly snippet; e.g., the compiler may not need to spill all registers to memory and reload them after the fact. Also, there are times where you don't want to have a CALL instruction around, such as when aiming for maximal performance or doing very low-level stuff that messes with the active thread's stack.)
Why, because of timing attacks? What if the compiler (for some reason) emitted a code sequence that did not use the CPU crypto instructions but was guaranteed to be constant-time? |
That's the only thing I'm asking about: What are the semantics of the operation you are proposing?
I'm not in agreement with that, but resolving that issue is orthogonal to solving @Diggsey's problem.
I do not see how it is orthogonal -- from what you said here, you seem to suggest that the existing intrinsics already solve their problem.
Basically the same as atomic
That said, we could also reasonably expect LLVM to use SIMD for that by itself, and report this as a bug to LLVM.
I'm not sure if this is what you meant, but if you use assembly to introduce a data-race, that's still a data-race.
What do you want to achieve? For example, today, you can write

```rust
use std::arch::x86_64::__m128;
use std::mem::transmute;
use std::sync::atomic::{AtomicU64, Ordering::Relaxed};

#[repr(align(16))]
pub struct Atomics(AtomicU64, AtomicU64);

pub extern "C" fn foo(x: &Atomics) -> __m128 {
    // Two relaxed 64-bit atomic loads, combined into one 128-bit value.
    unsafe { transmute([x.0.load(Relaxed), x.1.load(Relaxed)]) }
}
```

and LLVM not emitting a single
If instead what your algorithm needs is to truly perform a single 128-bit load into a SIMD register, as @comex mentioned you can use
If your program has a data-race, volatile doesn't fix it.
I think the point is that data races have reasonably well-defined semantics at the hardware layer, even though they are undefined at the Rust Abstract Machine layer. So a data race between two snippets of assembly code (inline or otherwise) should probably be allowed. (Which, if we take it just a bit further, also means that data races involving only volatile reads and writes should probably be allowed as well, given that volatile means "just defer to hardware semantics". But then volatile would have hardware-defined data race semantics, which is not terribly helpful in the context of a portable high-level program.)
I could imagine a way to specify inline assembly and/or volatile that makes this work, but I have no idea if LLVM implements that. In particular for volatile, I am fairly sure that LLVM is free to reorder non-volatile operations wrt. volatile operations; that demonstrates that data races are not allowed.
This is taking us dangerously close to #152 territory, so to avoid derailing the present thread further I will reply there.
Unknown code is not allowed to introduce errors in the Rust abstract machine. For example, if you pass a
Now, if Rust passes a
So I think that a data race between unknown code is UB and we should not allow that. Now, "calling" into an inline assembly block isn't just like calling into an FFI function, because for the inline assembly block we currently can specify things like "clobbers all memory" (is that a data race?), is "volatile", etc., that we can't specify for FFI functions (yet - one can always open an inline assembly block that does an FFI call..). If, among other things, no loads and stores of any kind can be reordered around certain calls into unknown code, those calls "synchronize" in certain ways. We could, e.g., have a clobber/constraint or function attribute that says that certain unknown code performs an atomic read/write of a particular variable with a particular ordering. AFAIK such a thing doesn't exist in LLVM, but at the language level we could do that.
(*) There are trickier cases, like say you pass a
Data races are defined in terms of the memory model: in this case, the Rust memory model. When writing assembly that memory model is irrelevant because you're writing code for actual hardware, not an abstract machine.
I was originally interested in whether the approach used by hashbrown to efficiently scan hashmap buckets could be used to implement a concurrent hashmap (coincidentally, the exact use-case was one where elements would never be removed from the hashmap, simplifying things considerably). However, I think this question is relevant even without such a concrete example: I mean it's clear from the amount of disagreement here that this area could do with some clarification. Everybody seems to have different expectations.
I feel like this is rehashing the discussion with @comex and @Lokathor
This doesn't make much sense. How is it an error in the Rust abstract machine, when it's not running on the Rust abstract machine - it could be a language with an entirely different memory model? If it's unknown code, then the optimizer is not allowed to assume anything about it.
Anyway, to get back on topic, it sounds like either the atomic memcpy or the atomic load/transmute option might be able to do the right thing, assuming LLVM gets the requisite optimisations... It would also need to figure out that the "destination" of the memcpy can be promoted to a register, and doesn't actually need to be written back to memory. However, it's rather unsatisfying to hope that the optimiser is doing the right thing (hence why the SIMD instructions exist at all!) and I could definitely see cases where you might want to have, e.g., an "Acquire" memory ordering on the operation. That would be very problematic, because I don't see how LLVM could ever optimise a sequence of "acquire" atomic loads into a SIMD load, because the atomic loads would have slightly stronger guarantees.
If you plan to call this assembly code from Rust, then the Rust memory model dictates how the Rust code around the assembly is optimized, and a data-race will cause a mis-optimization.
The optimizer does assume that unknown code won't introduce undefined behavior in Rust.
If the SIMD load doesn't need to be atomic, then you have many options:
none of which guarantees that the exact code you want will be generated. If the SIMD load doesn't need to be atomic but you want guarantees about what specific instruction is generated, that's what
If the SIMD load needs to be atomic, then there are no options, because the hardware doesn't support atomic SIMD loads, so there is no machine code that the compiler could reasonably generate for that operation.
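For illustration, a sketch of the inline-assembly route using the `asm!` syntax that later stabilized (not yet available when this thread was written); it pins the exact instruction at the cost of making the operation opaque to the optimizer:

```rust
use std::arch::asm;
use std::arch::x86_64::__m128i;

/// Safety: `ptr` must be valid for 16-byte reads and 16-byte aligned.
unsafe fn load_exact(ptr: *const __m128i) -> __m128i {
    let out: __m128i;
    // Force a literal movdqa; the compiler treats the block as opaque.
    asm!("movdqa {out}, xmmword ptr [{ptr}]",
         out = out(xmm_reg) out,
         ptr = in(reg) ptr);
    out
}
```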
Well, on x86 every load is acquire, so SIMD acquire loads are a thing.
It's really unfortunate that we use the word "atomic" for this, because operations like a SIMD load are not atomic (in the regular way the word is used), but they could be equivalent to a "composite atomic operation". It's even more unfortunate because memory ordering can apply to the compound operation as a whole, but not to the atomic parts. I feel like it should be possible to express this to LLVM, e.g. have "atomic memcpy" take a memory ordering that applies to the operation as a whole, but where the "atoms" are unordered relative to each other.
@RalfJung see this comment. While that happens to be how we observe that many x86 CPUs implement this, nothing guarantees that this is the case. Also, loads from unaligned memory addresses tear on x86, so not every load behaves like that =/
ARM doesn't have hardware support for 256-bit SIMD so a generic (non-platform-specific) intrinsic for implementing such loads cannot guarantee an atomic access in general.
I think this is one of the most promising alternatives discussed. It looks like a generally useful intrinsic to have that's not just tied to SIMD, and if LLVM doesn't emit optimal code for it, it looks like it would be simpler to fix that for that intrinsic than to, e.g., try to implement optimizations that coalesce multiple atomic accesses into a single SIMD load or similar.
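A hypothetical Rust-side model of such an intrinsic (the name and shape are mine, loosely modeled on LLVM's llvm.memcpy.element.unordered.atomic): each element is copied atomically, elements are copied in no particular order, and the copy as a whole may tear at element boundaries.

```rust
use std::sync::atomic::{AtomicU64, Ordering::Relaxed};

/// Copy `len` u64 elements: per-element atomic, unordered as a whole.
/// Safety: `dst` and `src` must be valid for `len` elements.
unsafe fn atomic_element_memcpy_u64(dst: *mut u64, src: *const AtomicU64, len: usize) {
    for i in 0..len {
        // Each element is read atomically; the loop as a whole is not.
        let v = (*src.add(i)).load(Relaxed);
        dst.add(i).write(v);
    }
}
```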
Well then take the largest bitwidth they have, same question.
I can't make sense of a "compound atomic acquire" operation that has an "ordering as a whole". What is that supposed to be? Acquire is defined via the reads-from relation, but if this SIMD access does two reads there's two reads-from edges; should it synchronize via both... or what? |
For x86, your question was answered with a "No, SIMD acquire loads are not a thing". So if there is some architecture for which such loads are a thing, that would be an architecture-dependent feature anyway. For ARM in particular, I have no idea what precise semantics the ISA guarantees for 128-bit or 64-bit SIMD loads (cc @parched). These are the docs I currently use, and I don't think they say.
I expected that to refer to what LLVM says:
I can try to explain with an example (imagine we have two adjacent memory locations, A, B initialized to zero). Thread 1:
Thread 2:
Now if thread 2 observes that B=1, then it must also observe that A=1. This means that we cannot optimize thread 2 to this:
Because SIMD does not provide that guarantee.
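A hedged reconstruction of the elided example above (variable names, types, and orderings are my guesses):

```rust
use std::sync::atomic::{AtomicU32, Ordering::{Acquire, Release}};

// Two adjacent memory locations, both initialized to zero.
static A: AtomicU32 = AtomicU32::new(0);
static B: AtomicU32 = AtomicU32::new(0);

// Thread 1: publishes A before B.
fn thread1() {
    A.store(1, Release);
    B.store(1, Release);
}

// Thread 2: if the acquire load of B sees 1, the load of A must see 1.
// Replacing both loads with a single SIMD load would lose this guarantee,
// because the SIMD load does not synchronize with the release stores.
fn thread2() {
    if B.load(Acquire) == 1 {
        assert_eq!(A.load(Acquire), 1);
    }
}
```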
There should be an edge from the "last effective read", but we don't specify what order (within a single thread) those reads happen in. |
@Diggsey I think a way to formalize the compound
(Not sure if that definition can be made to work with
@HadrienG2 that's a better way of explaining it, although I think I had in mind that it would synchronise if it observes any of the writes from the compound store, as that is closer to what the hardware guarantees. |
@Diggsey I would be wary of claiming that this is guaranteed to always work on hardware architectures with a weak memory model. In my opinion, a definition based on "all the writes" is safer in this respect (since "all writes" is a superset of "any write"), and it doesn't actually forbid much on the developer's and optimizer's side. |
I did not intend to claim that, I was talking about the specific case of SIMD on x86
I think that doesn't quite work, because the writer may not be using SIMD. The writer may write a single atomic u32, and I would want a SIMD-based reader to acquire the changes from that writer if it sees the u32 that was written as part of the SIMD value. (Assuming everything is aligned correctly.) In my example of the concurrent hash map, if I observe any of the hashes to match the hash I'm looking for, and then I try to read the corresponding value, I would want to guarantee that the value had already been written. (Assuming the writer writes the value first, and then stores the hash.)
(Continuation of my previous post which raced with your comment being published) Even if it could be made to work, a definition based on "any write" could preclude useful optimizations on a hardware with a weak memory model. Consider a hypothetical but suspiciously ARM-like 32-bit weak memory processor which is being asked to perform a 256-bit atomic memcpy with
With an "observing any write is enough to synchronize" definition, however, this compilation is unsound, and all individual operations must have acquire or release ordering, which can be much more costly in terms of memory barriers when doing large atomic memcpys. (Written after your post)
Oooh, that's nasty. IIRC, the situation on this front is that LLVM tried to specify either atomics or volatile with a writer and reader that use operations of different width, but C++11 doesn't, and it seems we aren't quite ready to accept LLVM-ish major extensions to the C++11 memory model in Rust at this point in time.
That being said, your use case could be made to work under an "all writes must have been observed" definition, if we specified atomics of heterogeneous width to work such that if all u32 writes are
What would not be allowed, but is not relevant to your use case as far as I understand, is the reverse scenario of an
Fair.
That's just element-wise atomic, it doesn't need anything new to be explained. Just a loop.
That sounds like a new beast in the world of weak memory concurrency. Not sure if it is implementable in hardware, but it's not something LLVM exposes to my knowledge, so there are some way more fundamental questions to be answered here first. Also, everything I said here applies. I'd rather we restricted ourselves to what actually seems feasible instead of designing some novel SIMD extensions for LLVM.^^
Ah, now that sounds somewhat familiar. There are some models for mixed-size atomics. But AFAIK (a) all of them are operational, no axiomatic models exist (which is good IMO, as mentioned elsewhere I strongly prefer operational models), but (b) they are all hardware-level models, not language-level models. That means they can be defined in terms of syntactic data dependencies and also do not have to think about some of the crazy weak behaviors that only optimizations can introduce. You can find some of these papers here.
If you want an acquire SIMD load, how about just doing a relaxed/unordered SIMD load followed by an acquire fence (
On x86, both acquire and release fences expand to literally nothing (they're just a compiler fence). To rephrase that from another perspective, SIMD load instructions are at least as strong as a series of relaxed loads (in unspecified order) followed by an acquire fence, and SIMD store instructions are at least as strong as a release fence followed by a series of relaxed stores.
Well, either that, or those instructions are unsound to use in an atomic context at all, but I don't think that's true.
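A sketch of that suggestion (the function name is mine; note that the plain SIMD load remains non-atomic at the Rust level, so this models the intended hardware behavior rather than removing the data-race UB):

```rust
use std::arch::x86_64::{__m128i, _mm_load_si128};
use std::sync::atomic::{fence, Ordering::Acquire};

/// Safety: `ptr` must be valid for 16-byte reads and 16-byte aligned.
unsafe fn simd_load_acquire(ptr: *const __m128i) -> __m128i {
    let v = _mm_load_si128(ptr);
    // On x86 this fence emits no instruction; it only prevents the
    // compiler from reordering later memory operations before the load.
    fence(Acquire);
    v
}
```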
The problem is that on x86 (and probably on other archs as well), SIMD loads and stores are architecturally not guaranteed to be atomic in the sense of being indivisible, and are provably not atomic on some CPU models. They are best modeled as an unordered stream of atomic load or store operations, LLVM atomic memcpy style.
Indeed. I'm just saying that they do have ordering "as a whole" (with respect to other instructions), which can be modeled by adding a fence before/after that stream of atomic store/load operations.
Triage:
So there's not really anything left to be tracked here; new issues should be opened for the potentially missing features.
The Rustonomicon defines a data race like so: two or more threads concurrently accessing a location of memory, where one or more of the accesses is a write, and one or more of them is unsynchronized.
AIUI, all data races are automatically UB.
However, there is useful behaviour whose obvious implementation is forbidden in Rust with this definition. One example of this is with an atomic write, but an unsynchronised read:
(assuming thread 2 doesn't care whether it observes the changes from thread 1, or even if it observes partial changes)
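A hedged sketch of the pattern being described (names are mine):

```rust
use std::sync::atomic::{AtomicU32, Ordering::Relaxed};

static FLAG: AtomicU32 = AtomicU32::new(0);

// Thread 1: an atomic write.
fn writer() {
    FLAG.store(1, Relaxed);
}

// Thread 2: an unsynchronized plain read of the same location.
// Under the definition above this is a data race, hence UB, even
// though this thread does not care what value it observes.
fn reader() -> u32 {
    unsafe { *FLAG.as_ptr() }
}
```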
For most cases, you can work around this by just doing a relaxed atomic read instead. The data race goes away and you just have to worry about normal race conditions. As a bonus, you avoid seeing partial results.
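And the workaround, for contrast (a minimal sketch):

```rust
use std::sync::atomic::{AtomicU32, Ordering::Relaxed};

// A relaxed atomic read of the same location: no data race, no UB;
// only ordinary race conditions remain, and no torn values are seen.
fn reader_fixed(flag: &AtomicU32) -> u32 {
    flag.load(Relaxed)
}
```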
However, if you want to do a SIMD or other "wide" read then it becomes impossible in Rust. You would have to use assembly to avoid introducing UB.
My question is: if one wanted to support this in Rust, what would be a valid way to do it? Is this even OK to do at the assembly level? We can't introduce atomic versions of SIMD types, because SIMD is inherently not atomic. Would there need to be "volatile" versions of all SIMD operations whose only difference is that they don't introduce UB?