
SIMD + Data races #209

Closed
Diggsey opened this issue Oct 1, 2019 · 67 comments
Labels
A-memory Topic: Related to memory accesses C-open-question Category: An open question that we should revisit

Comments


Diggsey commented Oct 1, 2019

The Rustonomicon defines a data race like so:

  • two or more threads concurrently accessing a location of memory
  • one of them is a write
  • one of them is unsynchronized

AIUI, all data races are automatically UB.

However, there is useful behaviour whose obvious implementation is forbidden in Rust with this definition. One example of this is with an atomic write, but an unsynchronised read:

  • thread 1: atomic write
  • thread 2: non-atomic read

(assuming thread 2 doesn't care whether it observes the changes from thread 1, or even if it observes partial changes)

For most cases, you can work around this by just doing a relaxed atomic read instead. The data race goes away and you just have to worry about normal race conditions. As a bonus, you avoid seeing partial results.

However, if you want to do a SIMD or other "wide" read then it becomes impossible in Rust. You would have to use assembly to avoid introducing UB.

My question is: if one wanted to support this in Rust, what would be a valid way to do it? Is this even OK to do at the assembly level? We can't introduce atomic versions of SIMD types, because SIMD is inherently not atomic. Would there need to be "volatile" versions of all SIMD operations whose only difference is that they don't introduce UB?


Lokathor commented Oct 1, 2019

  1. Volatile doesn't affect whether an operation is thread-safe or not. As far as I know, Rust doesn't currently have atomic_volatile reads; you only get one or the other.
  2. I believe the data race definition is an LLVM-level rule that gets propagated up into Rust, so there's no way for Rust to just set aside the rule.


Diggsey commented Oct 1, 2019

Yeah, I used "volatile" for lack of a better word. Atomic would be incorrect because the results are not and cannot be atomic.

I believe the data race definition is an LLVM level rule that gets propagated up into rust, so there's no way for rust to just set aside the rule.

Ah yes, that's a problem. However, speculative reads are allowed in LLVM, they just return undef, so if they implement the "freeze" operation, you could combine them (do a speculative read and then freeze the result) to non-atomically read from a racy memory location without introducing UB.


Lokathor commented Oct 1, 2019

You're not the first to want a "freeze" operation ;3


comex commented Oct 2, 2019

In theory, LLVM's atomic memcpy can express the desired semantics, and a memcpy of constant size equal to a SIMD register width could then be optimized to a SIMD load. But LLVM doesn't know how to do that optimization.

Volatile loads will work in practice, and I claim (to some dispute) that it is correct to rely on them working; see also this thread about the interaction between atomics and volatile. But they have stronger semantics than necessary – e.g. they can't be elided if unused.

@RalfJung RalfJung added C-open-question Category: An open question that we should revisit A-memory Topic: Related to memory accesses labels Oct 9, 2019

RalfJung commented Oct 9, 2019

Volatile loads will work in practice, and I claim (to some dispute) that it is correct to rely on them working

Specifically, part of this dispute is that LLVM does not actually say that they are guaranteed to work that way, and LLVM devs were not very sympathetic to adding such a guarantee to their language. We cannot reasonably rely on things not guaranteed to us by LLVM.

But yes it seems like what you want is a relaxed SIMD load here, one that's permitted to tear but not subject to the UB rules. That's the "right semantics" here, I think we all agree?


gnzlbg commented Oct 10, 2019

However, if you want to do a SIMD or other "wide" read then it becomes impossible in Rust. You would have to use assembly to avoid introducing UB.

Which SIMD reads are you talking about? IIRC, on stable Rust, the aligned load / store core::arch::x86_64 intrinsics all synchronize with Release / Acquire and are atomic because they are specified to map to some x86_64 instruction, and the semantics of that instruction hold. If LLVM is not taking their ordering into account, then we might want to open an LLVM bug.


Diggsey commented Oct 10, 2019

But yes it seems like what you want is a relaxed SIMD load here, one that's permitted to tear but not subject to the UB rules. That's the "right semantics" here, I think we all agree?

Sounds right to me.

Which SIMD reads are you talking about? IIRC, on stable Rust, the aligned load / store core::arch::x86_64 intrinsics all synchronize with Release / Acquire and are atomic because they are specified to map to some x86_64 instruction, and the semantics of that instruction hold. If LLVM is not taking their ordering into account, then we might want to open an LLVM bug.

I am talking about those intrinsics, but to be clear: there's no actual bug here, I'm just trying to work out what I can rely on. What you're saying sounds plausible, but it's super unclear to me that this is actually guaranteed, and we may need to relax the wording around what constitutes a data race, or else define these intrinsics as being "synchronised".


RalfJung commented Oct 10, 2019

the aligned load / store core::arch::x86_64 intrinsics all synchronize with Release / Acquire and are atomic because they are specified to map to some x86_64 instruction, and the semantics of that instruction hold.

No, that is certainly not the case. These all count as non-atomic accesses unless explicitly stated otherwise. The only time you may use x86-level reasoning for anything is volatile. I don't think you can even rely on two of these intrinsics synchronizing with each other as that would seriously constrain reordering in the optimizer -- usually the opposite of what you want with SIMD.

If LLVM is not taking their ordering into account, then we might want to open an LLVM bug.

I see no bug here. Only and exclusively atomic operations synchronize with anything. These operations are not atomic, so the compiler can reorder, duplicate and eliminate them however it wants.

they are specified to map to some x86_64 instruction

So are they volatile? Because that is the only mechanism I know to actually get HW-level semantics into the surface language in some sense. For these things, I think the x86 instruction is mostly meant as demonstration to avoid having to spell out SIMD in the abstract machine -- but from what I learned, the entire purpose of using these intrinsics is to make the compiler optimize them, which immediately implies they are not volatile and thus do not guarantee to map to anything.


hanna-kruppe commented Oct 10, 2019

@gnzlbg That's just plain wrong.

Many std::arch intrinsics map to plain old LLVM IR constructs which have all the usual rules, including loads and stores for intrinsics such as _mm_load_ps. This is also consistent with how other compilers (C and C++) handle such intrinsics.

It is naively appealing to think of intrinsics as "mapping to some hardware instruction" but in reality intrinsics are and should be subject to compiler optimizations, and have behavior that balances optimization power with allowing programmers to achieve good codegen. They provide access to some hardware capabilities but do not automatically import all aspects of the underlying hardware into the language semantics. That is often not even reasonably possible (e.g., recall the now-deprecated intrinsics for reading and writing EFLAGS). Guaranteeing specific sequences of binary code is the domain of inline assembly, not of intrinsics.


gnzlbg commented Oct 10, 2019

These operations are not atomic, so the compiler can reorder, duplicate and eliminate them however it wants.

The Intel spec for _mm_load_ps says that it generates a movaps instruction, which is an atomic instruction (no tearing) and specifies its semantics also as:

dst[127:0] := MEM[mem_addr+127:mem_addr]

which performs a single atomic read of 16 bytes from memory.

So AFAICT, _mm_load_ps is at least an atomic load with an Unordered ordering, and it would be unsound for LLVM to transform it into a load that tears.

So are they volatile?

The spec does not mention "volatile" anywhere nor if these intrinsics are ordered with respect to other intrinsics or not.

Many std::arch intrinsics map to plain old LLVM IR constructs which have all the usual rules, including loads and stores for intrinsics such as _mm_load_ps.

Sure, AFAICT the implementation of _mm_load_ps is incorrect, because the current implementation technically allows the load to tear, even though currently that does not happen in practice.


hanna-kruppe commented Oct 10, 2019

The Intel spec for _mm_load_ps says that it generates a movaps instruction, which is an atomic instruction (no tearing) and specifies its semantics also as:

dst[127:0] := MEM[mem_addr+127:mem_addr]

which performs a single atomic read of 16 bytes from memory.

So AFAICT, _mm_load_ps is at least an atomic load with an Unordered ordering, and it would be unsound for LLVM to transform it into a load that tears.

If this is a correct reading of Intel's documentation, then not even ICC implements this behavior. A strict "intrinsic used => instruction occurs in binary" reading is plainly false, and even if you retreat to some kind of behavioral as-if rule (difficult to define but let's assume you manage) you'll find discrepancies such as _mm_load_ps(NULL) loading just one byte from address 0 into al and then calling abort() even though that is certainly not how movaps from address 0 behaves.

The simplest explanation for this and many other observations is that you're interpreting intrinsics in an overly literal way. "Generate this specific instruction and graft whatever it does in hardware onto the program semantics" is almost never a reasonable approach for an optimizing compiler. It's always some middle ground between preserving ability to reason about the program and being able to generate code that makes use of the hardware functional units that the programmer wants to program. This is not some radical new idea but, as I've said and illustrated before, how intrinsics work even in the C compilers that set all the precedents we're trying to emulate in Rust.


RalfJung commented Oct 10, 2019

So AFAICT, _mm_load_ps is at least an atomic load with an Unordered ordering, and it would be unsound for LLVM to transform it into a load that tears.

Would it also be unsound for LLVM to reorder two _mm_load_ps to different locations? Or to remove a dead load? Or to do any other optimization that would be illegal on volatile? If the answer to any of these questions is "yes", then the semantics of _mm_load_ps is very far from that of movaps. All of these transformations can change the observable behavior of an assembly program (when observed by other assembly programs or IO registers).

I think you are just experiencing a not-very-carefully written spec here. The spec works when looking at compiler backends, but by applying it to the surface language you are basically pretending the Rust memory model is the same as that of the hardware, when that is clearly not the case.


gnzlbg commented Oct 10, 2019

Would it also be unsound for LLVM to reorder two _mm_load_ps to different locations? Or to remove a dead load? Or to do any other optimization that would be illegal on volatile?

Those are good questions, the spec doesn't say.

you'll find discrepancies such as _mm_load_ps(NULL) loading just one byte from address 0 into al and then calling abort() even though that is certainly not how movaps from address 0 behaves.

For _mm_load_ps(NULL), the Intel spec also provides an attempt at "operational semantics" for that intrinsic ("reads 16 bytes atomically from address 0"), and this also does not match the codegen that gets produced.

I see your point that probably the spec only provides movaps as an example of one instruction sequence that might get generated, but I doubt that the operational semantics provided in the spec for the intrinsics are just examples as well, and for _mm_load_ps I read those as performing a single atomic load.

And even if you retreat to some kind of behavioral as-if rule (difficult to define but let's assume you manage)

FWIW for the intrinsics we do not guarantee that the same instruction sequence that intel prescribes is generated (even though many users open issues related to that), but we do guarantee that whatever we generate is semantically equivalent to whatever the Intel documentation specifies (EDIT: whatever that might be).

RalfJung commented:

for _mm_load_ps I read those as performing a single atomic load.

The load is likely guaranteed to not tear but that does not mean it is atomic in the sense of the C11 memory model.

but we do guarantee that whatever we generate is semantically equivalent to whatever the Intel documentation specifies (EDIT: whatever that might be).

Crucially, this is using equivalence on the Rust Abstract Machine, not the x86 spec. And for the Rust Abstract Machine, the fact that these instructions are not "atomic" in the sense of the concurrency memory model makes a huge difference.


gnzlbg commented Oct 10, 2019

The load is likely guaranteed to not tear but that does not mean it is atomic in the sense of the C11 memory model.

FWIW by atomic I mean in the LLVM memory model sense, where the load would be atomic with at least an unordered ordering.

RalfJung commented:

I see no reason why it should be treated like that. And I see a lot of reasons it should not be treated like that, as it will inhibit optimizations.


gnzlbg commented Oct 10, 2019

I see no reason why it should be treated like that.

So IIUC what you are saying is that we do not / cannot guarantee that _mm_load_ps does not tear?

RalfJung commented:

I said nothing like that. Tearing is not even a concept in the Rust Abstract Machine -- there is no way, on that level, to observe whether an access tore or not.

The way I see it, these intrinsics are hints to the compiler to please use certain efficient HW instructions and registers to achieve the specified effect. But the only guaranteed part is that, e.g., this thing will load 128 bits (or whatever it is) of memory into the return value somehow. If the compiler thinks it can do that more efficiently than by emitting movaps (such as doing const-prop because it knows the resulting value), it can do that. If the register is spilled and the current thread does not write to the affected memory, the compiler is free to re-load the thing into the register again later (which assumes absence of data races). And so on, just like for "normal" loads.


comex commented Oct 10, 2019

movaps is not guaranteed to be atomic at an ISA level. It is atomic in practice on most x86 CPUs, as a side effect of how they implement memory caching, but the link describes a CPU that does allow it to tear. Only cmpxchg16b can perform guaranteed-atomic 128-bit accesses.


gnzlbg commented Oct 11, 2019

The way I see it, these intrinsics are hints to the compiler to please use certain efficient HW instructions and registers to achieve the specified effect. But the only guaranteed part is that, e.g., this thing will load 128 bytes (or whatever it is) of memory into the return value somehow.

These intrinsics must be more than hints because, as you mention, we do guarantee some semantics for them (e.g. loading 16 bytes from memory), and right now, we have RFC'ed that if we expose the intrinsic we guarantee whatever the vendor spec guarantees for them. For the cases in which the intrinsic does not currently work or cannot work according to the vendor spec (e.g. the cases @rkruppe mentioned above: EFLAGS, MXCSR, etc.), we just don't provide the intrinsic. Unsafe code is allowed to and often does rely on the guaranteed semantics of the intrinsics for correctness, e.g., there is quite a bit of stable Rust crypto code out there relying on the AES intrinsics actually using the CPU crypto instructions - if those were "just hints" those intrinsics would be useless and all that code would become insecure. (EDIT: another example is how the ARM intrinsics are being used as a replacement for inline assembly, where the intrinsic semantics are "do a volatile write of value X to register Y".)

If the compiler thinks it can do that more efficiently than by emitting movaps (such as doing const-prop because it knows the resulting value), it can do that.

For the semantics that we do guarantee, currently, we consider it correct for the compiler to emit any instruction sequence that preserves those semantics and that it is at least as efficient as the instruction sequence "recommended" (or specified, depending on how you read the spec) by the vendor. That allows const propagation, but it wouldn't allow, e.g., breaking the semantics of the program by emitting a store instead of a load, or by replacing an atomic load by two smaller loads iff the vendor specifies that the load must be atomic, etc.

If the register is spilled and the current thread does not write to the affected memory, the compiler is free to re-load the thing into the register again later (which assumes absence of data races)

Sure, iff the vendor intrinsic does not synchronize in any way, then that would be a valid program transformation.


HadrienG2 commented Oct 11, 2019

@gnzlbg IMO, there is a certain threshold in desire for machine control where intrinsics shouldn't be used, and (possibly inline) assembly should be used instead. But I'm not skilled enough at language lawyering to tell where exactly this point lies.

To give some context, in my professional community (scientific computation), intrinsics are typically used to perform vectorization by hand when the compiler's optimizer is not smart enough to do it on its own. In this context, defining intrinsics as mere syntactic sugar for inline assembly which provides little more than automatic register allocation would be unhelpful, because it would have the effect of killing almost all other compiler optimizations, forcing even more hardware-specific micro-optimization work to be done by hand.

For example, if compilers cannot be trusted to optimize a SIMD store to a memory location immediately followed by a SIMD load from the same location into nothingness, then some SIMD abstractions like linear algebra libraries cannot be made as performant as reasonably well-written custom assembly at all, at least not without resorting to horrible code generation hacks like C++'s expression templates.

In this respect, I'm more sympathetic to the view expressed by @RalfJung here, although I will readily admit that many current compilers treat intrinsics in a very cautious fashion that is more aligned with the view that you are expressing here.

Of course, if the vendor intrinsic spec is very narrow-minded and forces very specific hardware semantics upon the compiler, that's a different problem, and then compliant compilers have to behave as you describe...


gnzlbg commented Oct 11, 2019

@HadrienG2 It all boils down to what the particular intrinsic guarantees to its users. For example, the embedded WG got team blessing for adding intrinsics to core::arch whose guaranteed semantics are "emits ASM instruction X operating on registers Y and Z in a volatile way", with the objective of being able to write embedded Rust on stable without having to wait for inline assembly (i.e. instead of writing inline assembly, you just call intrinsics... which AFAICT doesn't necessarily work, but if you only need one instruction it's fine). Saying that such semantics are a "hint" and that the compiler is allowed to generate a different instruction that operates on different registers makes them all useless.

There are dozens of such intrinsics exposed in core::arch::arm/aarch64 (e.g. see this module: https://github.com/rust-lang/stdarch/tree/master/crates/core_arch/src/acle).


HadrienG2 commented Oct 11, 2019

Speaking personally, I see the existence of those intrinsics as a hint that the need for inline assembly is becoming more pressing in this particular community... But I'm well aware that finite person-power and design tradeoffs are a thing 😉

Lokathor commented:

I think there's a need for both "do this exactly" and also "just do a semantic effect, I don't care how". And there's also a need to be more clear about which is which.

HadrienG2 commented:

@Lokathor Agreed ! I think inline assembly (once stabilized) should become the tool of choice for "do this exactly", freeing up most intrinsics to just "do something that has similar semantics".

Lokathor commented:

(Who needs inline assembly when you can just link an extern object file into the build? ;3 )


HadrienG2 commented Oct 11, 2019

(It allows extra optimizations for code around the inline assembly snippet, e.g. the compiler may not need to spill all registers to memory and reload them after the fact. Also, there are times where you don't want to have a CALL instruction around, such as when aiming for maximal performance or doing very low level stuff that messes with the active thread's stack)


comex commented Oct 11, 2019

Unsafe code is allowed to and often does rely on the guaranteed semantics of the intrinsics for correctness, e.g., there is quite a bit of stable Rust crypto code out there relying on the AES intrinsics actually using the CPU crypto instructions - if those were "just hints" those intrinsics would be useless and all that code would become insecure.

Why, because of timing attacks? What if the compiler (for some reason) emitted a code sequence that did not use the CPU crypto instructions but was guaranteed to be constant-time?


gnzlbg commented Oct 12, 2019

That particular comment was proposing a new, not-yet-existing operation to solve the OP's problem.

That's the only thing I'm asking about: What are the semantics of the operation you are proposing?

Aren't we still in the business of trying to clarify the semantics of the currently existing operations, or are we in agreement on that?

I'm not in agreement with that, but resolving that issue is orthogonal to solving @Diggsey's problem.

RalfJung commented:

I'm not in agreement with that, but resolving that issue is orthogonal to solving @Diggsey's problem.

I do not see how it is orthogonal -- from what you said here, you seem to suggest that the existing intrinsics already solve their problem.

What are the semantics of the operation you are proposing?

Basically the same as atomic memcpy, but with a hint to the compiler to please use a vector load. And "atomic memcpy" has the semantics of copying the elements in a loop using relaxed accesses. (This is the Rust semantics. In LLVM the accesses are "unordered atomic", which is weaker than "relaxed atomic", but Rust currently does not have "unordered".)

That said, we could also reasonably expect LLVM to use SIMD for that by itself, and report this as a bug to LLVM.


gnzlbg commented Oct 12, 2019

@Diggsey

For most cases, you can work around this by just doing a relaxed atomic read instead. The data race goes away and you just have to worry about normal race conditions. As a bonus, you avoid seeing partial results.

However, if you want to do a SIMD or other "wide" read then it becomes impossible in Rust. You would have to use assembly to avoid introducing UB.

I'm not sure if this is what you meant, but if you use assembly to introduce a data-race, that's still a data-race.

My question is: if one wanted to want to support this in Rust, what would be a valid way to do it? Is this even OK to do at the assembly level? We can't introduce atomic versions of SIMD types, because SIMD is inherently not atomic.

What do you want to achieve?

For example, today, you can write

use core::arch::x86_64::__m128;
use core::mem::transmute;
use core::sync::atomic::{AtomicU64, Ordering::Relaxed};

#[repr(align(16))]
pub struct Atomics(AtomicU64, AtomicU64);

pub extern "C" fn foo(x: &Atomics) -> __m128 {
    // Two relaxed 64-bit loads, transmuted into a single 128-bit vector.
    unsafe { transmute([x.0.load(Relaxed), x.1.load(Relaxed)]) }
}

and LLVM not emitting a single movaps for this looks like an LLVM bug (https://llvm.godbolt.org/z/57zkef). This extends to 256-bit and 512-bit vectors using vmovaps instead. IIUC what @comex proposes here might be a more robust way for enabling this optimization (EDIT: which LLVM also doesn't do, yet), but if you only care about relaxed loads, something like the code above should work fine once the bug is fixed.

If instead what your algorithm needs is to truly perform a single 128-bit load into a SIMD register, as @comex mentioned you can use core::arch::x86_64::cmpxchg16b for that; however, there is no way to do the same for 256-bit and 512-bit vectors. This isn't a Rust limitation, but a hardware limitation.

Would there need to be "volatile" versions of all SIMD operations whose only difference is that they don't introduce UB?

If your program has a data-race, volatile doesn't fix it.


HadrienG2 commented Oct 12, 2019

I'm not sure if this is what you meant, but if you use assembly to introduce a data-race, that's still a data-race.

I think the point is that data races have reasonably well-defined semantics at the hardware layer, even though they are undefined at the Rust Abstract Machine layer. So a data race between two snippets of assembly code (inline or otherwise) should probably be allowed.

(Which, if we take it just a bit further, also means that data races involving only volatile reads and writes should probably be allowed as well, given that volatile means "just defer to hardware semantics". But then volatile would have hardware-defined data race semantics, which is not terribly helpful in the context of a portable high-level program.)

RalfJung commented:

So a data race between two snippets of assembly code (inline or otherwise) should probably be allowed.

I could imagine a way to specify inline assembly and/or volatile that makes this work, but I have no idea if LLVM implements that. In particular for volatile, I am fairly sure that LLVM is free to reorder non-volatile operations wrt. volatile operations; that demonstrates that data races are not allowed.

HadrienG2 commented:

This is taking us dangerously close to #152 territory, so to avoid derailing the present thread further I will reply there.


gnzlbg commented Oct 12, 2019

@HadrienG2

I think the point is that data races have reasonably well-defined semantics at the hardware layer, even though they are undefined at the Rust Abstract Machine layer. So a data race between two snippets of assembly code (inline or otherwise) should probably be allowed.

Unknown code is not allowed to introduce errors in the Rust abstract machine. For example, if you pass a &T to some unknown code, and the code writes through it, that would be an error. (*)

Now, if Rust passes a &T to some Atomic_ type to two different threads of execution, each of which uses unknown code to perform an "unsynchronized" access, that would also be an error in the Rust abstract machine. If we were to allow that, you could just implement an intrinsic that allows performing unsynchronized accesses on top of unknown code.

So I think that a data race between unknown code is UB and we should not allow that.

Now, "calling" into an inline assembly block isn't just like calling into an FFI function, because for the inline assembly block we currently can specify things like "clobbers all memory" (is that a data-race?), is "volatile", etc. that we can't specify for FFI functions (yet; one can always open an inline assembly block that does an FFI call). If, among other things, no loads and stores of any kind can be reordered around certain calls into unknown code, those calls "synchronize" in certain ways. We could, e.g., have a clobber/constraint or function attribute that says that certain unknown code performs an atomic read/write of a particular variable with a particular ordering. AFAIK such a thing doesn't exist in LLVM, but at the language level we could do that.

(*) There are trickier cases, like say you pass a &mut bool to unknown code, and that code writes 42 to it, but before returning back to Rust, it writes 0 to it. I think this is also an error in the Rust abstract machine, and if a Rust binary using unknown code were running on top of something valgrind-like, we would be able to tell.

HadrienG2 commented:

@gnzlbg Please correct me if I'm wrong, but this sounds like a continuation of the "volatile and data race" topics that I've brought back to #152 , so I'll reply there with a full quote.


Diggsey commented Oct 13, 2019

I'm not sure if this is what you meant, but if you use assembly to introduce a data-race, that's still a data-race.

Data races are defined in terms of the memory model: in this case, the Rust memory model. When writing assembly that memory model is irrelevant because you're writing code for actual hardware, not an abstract machine.

What do you want to achieve?
...

I was originally interested in whether the approach used by hashbrown to efficiently scan hashmap buckets could be used to implement a concurrent hashmap (coincidentally, the exact use-case was one where elements would never be removed from the hashmap, simplifying things considerably).

However, I think this question is relevant even without such a concrete example: I mean it's clear from the amount of disagreement here that this area could do with some clarification. Everybody seems to have different expectations.

If your program has a data-race, volatile doesn't fix it.

I feel like this is rehashing the discussion with @comex and @Lokathor

Now, if Rust passes a &T to some Atomic_ to two different threads of execution, each of which uses unknown code to perform "unsynchronized" access, that would also be an error in the Rust abstract machine. If we were to allow that, you could just implement an intrinsic that allows performing unsynchronized accesses on top of unknown code.

So I think that a data race between unknown code is UB and we should not allow that.

This doesn't make much sense. How is it an error in the Rust abstract machine when the code isn't running on the Rust abstract machine? It could be written in a language with an entirely different memory model. If it's unknown code, then the optimizer is not allowed to assume anything about it.

Anyway, to get back on topic, it sounds like either the atomic memcpy or the atomic load/transmute option might be able to do the right thing assuming LLVM gets the requisite optimisations... It would also need to figure out that the "destination" of the memcpy can be promoted to a register, and doesn't actually need to be written back to memory.

However, it's rather unsatisfying to hope that the optimiser is doing the right thing (hence why the SIMD instructions exist at all!) and I could definitely see cases where you might want to have e.g. an "Acquire" memory ordering on the operation. That would be very problematic because I don't see how LLVM could ever optimise a sequence of "acquire" atomic loads into a SIMD load, because the atomic loads would have slightly stronger guarantees.
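As a concrete illustration of the "atomic load + transmute" idea (a hypothetical sketch, not an existing API): a single relaxed 64-bit atomic load whose bits are then reinterpreted as two u32 lanes. Whether the optimiser ever widens several such loads into one SIMD load is not guaranteed.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Sketch: one relaxed 64-bit atomic load, split into two u32 lanes.
// The load itself is race-free on the abstract machine; the lane
// split is just bit manipulation and involves no extra memory access.
fn load_two_lanes(src: &AtomicU64) -> [u32; 2] {
    let v = src.load(Ordering::Relaxed);
    [(v & 0xffff_ffff) as u32, (v >> 32) as u32]
}
```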

@gnzlbg
Contributor

gnzlbg commented Oct 13, 2019

Data races are defined in terms of the memory model: in this case, the Rust memory model. When writing assembly that memory model is irrelevant because you're writing code for actual hardware, not an abstract machine

If you plan to call this assembly code from Rust, then the Rust memory model dictates how the Rust code around the assembly is optimized, and a data-race will cause a mis-optimization.

If it's unknown code, then the optimizer is not allowed to assume anything about it.

The optimizer does assume that unknown code won't introduce undefined behavior in Rust.

However, it's rather unsatisfying to hope that the optimiser is doing the right thing (hence why the SIMD instructions exist at all!) and I could definitely see cases where you might want to have e.g. an "Acquire" memory ordering on the operation. That would be very problematic because I don't see how LLVM could ever optimise a sequence of "acquire" atomic loads into a SIMD load, because the atomic loads would have slightly stronger guarantees.

If the SIMD load doesn't need to be atomic, then you have many options:

  • atomic memcpy
  • multiple atomic loads/stores
  • compiler-fences + __mm_load_ps

none of which guarantees that the exact code you want will be generated. If the SIMD load doesn't need to be atomic but you want guarantees about what specific instruction is generated, that's what asm! is for, so with appropriate synchronization (compiler-fences or the right constraints) you might be able to use that.

If the SIMD load needs to be atomic, then there are no options, because the hardware doesn't support atomic SIMD loads, so there is no machine code that the compiler could reasonably generate for that operation.
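The second option in the list above (multiple atomic loads) can be sketched as follows: a 128-bit-wide read expressed as four relaxed u32 loads. This is race-free on the abstract machine, but, as noted, nothing guarantees the compiler coalesces it into a single SIMD instruction.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Sketch of the "multiple atomic loads" option: four relaxed u32
// loads standing in for one 128-bit read. The elements may be read
// in any order and may observe tearing between them, matching the
// guarantees a hardware SIMD load actually provides.
fn wide_relaxed_read(src: &[AtomicU32; 4]) -> [u32; 4] {
    [
        src[0].load(Ordering::Relaxed),
        src[1].load(Ordering::Relaxed),
        src[2].load(Ordering::Relaxed),
        src[3].load(Ordering::Relaxed),
    ]
}
```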

@RalfJung
Member

If the SIMD load needs to be atomic, then there are no options, because the hardware doesn't support atomic SIMD loads, so there is no machine code that the compiler could reasonably generate for that operation.

Well on x86 every load is acquire, so SIMD acquire loads are a thing.
But for other platforms, I don't know. I mean, does hardware even specify that e.g. on ARM, a 256bit-wide SIMD load is actually fully cache coherent and cannot observe partial writes?

@Diggsey
Author

Diggsey commented Oct 14, 2019

If the SIMD load needs to be atomic, then there are no options, because the hardware doesn't support atomic SIMD loads

It's really unfortunate that we use the word "atomic" for this because operations like a SIMD load are not atomic (in the regular way the word is used), but they could be equivalent to a "composite atomic operation". It's even more unfortunate because memory ordering can apply to the compound operation as a whole, but not to the atomic parts.

I feel like it should be possible to express this to LLVM, eg. have "atomic memcpy" take a memory ordering that applies to the operation as a whole, but where the "atoms" are unordered relative to each other.

@gnzlbg
Contributor

gnzlbg commented Oct 14, 2019

Well on x86 every load is acquire, so SIMD acquire loads are a thing.

@RalfJung see this comment. While that happens to be how we observe that many x86 CPUs implement this, nothing guarantees that this is the case. Also, loads from unaligned memory addresses tear on x86, so not every load behaves like that =/

But for other platforms, I don't know. I mean, does hardware even specify that e.g. on ARM, a 256bit-wide SIMD load is actually fully cache coherent and cannot observe partial writes?

ARM doesn't have hardware support for 256-bit SIMD so a generic (non-platform-specific) intrinsic for implementing such loads cannot guarantee an atomic access in general.

@Diggsey

I feel like it should be possible to express this to LLVM, eg. have "atomic memcpy" take a memory ordering that applies to the operation as a whole, but where the "atoms" are unordered relative to each other.

I think this is one of the most promising alternatives discussed. It looks like a generally useful intrinsic to have that's not just tied to SIMD, and if LLVM doesn't emit optimal code for it, it looks like it would be simpler to fix that for that intrinsic than to, e.g., try to implement optimizations that coalesce multiple atomic accesses into a single SIMD load or similar.

@RalfJung
Member

ARM doesn't have hardware support for 256-bit SIMD so a generic (non-platform-specific) intrinsic for implementing such loads cannot guarantee an atomic access in general.

Well then take the largest bitwidth they have, same question.

I feel like it should be possible to express this to LLVM, eg. have "atomic memcpy" take a memory ordering that applies to the operation as a whole, but where the "atoms" are unordered relative to each other.

I can't make sense of a "compound atomic acquire" operation that has an "ordering as a whole". What is that supposed to be? Acquire is defined via the reads-from relation, but if this SIMD access does two reads there's two reads-from edges; should it synchronize via both... or what?

@gnzlbg
Contributor

gnzlbg commented Oct 14, 2019

Well then take the largest bitwidth they have, same question.

For x86 your question was answered with a "No, SIMD acquire loads are not a thing". So if there is some architecture for which such loads are a thing, that would be an architecture-dependent feature anyway. For ARM in particular, I have no idea what precise semantics the ISA guarantees for 128-bit or 64-bit SIMD loads (cc @parched). These are the docs I currently use, and I don't think they do say.

I can't make sense of a "compound atomic acquire" operation that has an "ordering as a whole".

I expected that to refer to what LLVM says:

the copy between buffers uses a sequence of unordered atomic load/store operations that are a positive integer multiple of the element_size in size.

[...] This intrinsic does not provide any additional ordering guarantees over those provided by a set of unordered loads from the source location and stores to the destination.

@Diggsey
Author

Diggsey commented Oct 14, 2019

I can't make sense of a "compound atomic acquire" operation that has an "ordering as a whole". What is that supposed to be?

I can try to explain with an example (imagine we have two adjacent memory locations, A, B initialized to zero).

Thread 1:

  • Write 1 to A (Release)
  • Write 1 to B (Release)

Thread 2:

  • Read B (Acquire)
  • Read A (Acquire)

Now if thread 2 observes that B=1, then it must also observe that A=1

This means that we cannot optimize thread 2 to this:

  • SIMD read (A, B)

Because SIMD does not provide that guarantee.
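The two-thread example above can be written as runnable Rust: once the reader's Acquire load observes B == 1, release/acquire ordering guarantees it also observes A == 1, which is exactly the guarantee a plain SIMD read of (A, B) would not give.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::thread;

static A: AtomicU32 = AtomicU32::new(0);
static B: AtomicU32 = AtomicU32::new(0);

fn writer() {
    A.store(1, Ordering::Release);
    B.store(1, Ordering::Release);
}

fn reader() -> u32 {
    // Spin until the write to B becomes visible.
    while B.load(Ordering::Acquire) == 0 {}
    // Having acquired B == 1, this read is guaranteed to see A == 1.
    A.load(Ordering::Acquire)
}

fn main() {
    let t = thread::spawn(writer);
    assert_eq!(reader(), 1);
    t.join().unwrap();
}
```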

Acquire is defined via the reads-from relation, but if this SIMD access does two reads there's two reads-from edges; should it synchronize via both... or what?

There should be an edge from the "last effective read", but we don't specify what order (within a single thread) those reads happen in.

@HadrienG2

HadrienG2 commented Oct 14, 2019

@Diggsey I think a way to formalize the compound Acquire/Release semantics that you have in mind without breaking SIMD support is to say that code doing a compound Acquire load synchronizes with code doing a compound Release store only if it observes all the writes from said compound store.

(Not sure if that definition can be made to work with SeqCst though.)

@Diggsey
Author

Diggsey commented Oct 14, 2019

@HadrienG2 that's a better way of explaining it, although I think I had in mind that it would synchronise if it observes any of the writes from the compound store, as that is closer to what the hardware guarantees.

@HadrienG2

HadrienG2 commented Oct 14, 2019

@Diggsey I would be wary of claiming that this is guaranteed to always work on hardware architectures with a weak memory model. In my opinion, a definition based on "all the writes" is safer in this respect (since "all writes" is a superset of "any write"), and it doesn't actually forbid much on the developer's and optimizer's side.

@Diggsey
Author

Diggsey commented Oct 14, 2019

I did not intend to claim that; I was talking about the specific case of SIMD on x86.

In my opinion, a definition based on "all the writes" is safer in this respect, and doesn't actually forbid much on the developer's and optimizer's side.

I think that doesn't quite work because the writer may not be using SIMD. The writer may write a single atomic u32, and I would want a SIMD-based reader to acquire the changes from that writer if it sees the u32 that was written as part of the SIMD value. (Assuming everything is aligned correctly)

In my example of the concurrent hash map, if I observe any of the hashes to match the hash I'm looking for, and then I try to read the corresponding value, I would want to guarantee that the value had already been written. (Assuming the writer writes the value first, and then stores the hash)
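The writer-first discipline described above can be sketched as a hypothetical bucket type for such an insert-only map (the names and layout here are illustrative, not hashbrown's actual design): the writer stores the value first and publishes the hash byte with Release, so a reader whose Acquire load sees a matching hash is guaranteed to see the value too.

```rust
use std::sync::atomic::{AtomicU32, AtomicU8, Ordering};

// Hypothetical bucket for an insert-only concurrent map.
struct Bucket {
    hash: AtomicU8, // 0 means "empty"
    value: AtomicU32,
}

impl Bucket {
    const fn new() -> Self {
        Bucket { hash: AtomicU8::new(0), value: AtomicU32::new(0) }
    }

    fn insert(&self, h: u8, v: u32) {
        self.value.store(v, Ordering::Relaxed);
        self.hash.store(h, Ordering::Release); // publish last
    }

    fn lookup(&self, h: u8) -> Option<u32> {
        if self.hash.load(Ordering::Acquire) == h {
            // The Acquire load of a matching hash synchronizes with
            // the writer's Release store, so the value is visible.
            Some(self.value.load(Ordering::Relaxed))
        } else {
            None
        }
    }
}
```

The open question in the thread is whether `lookup` could scan many hash bytes at once with one SIMD load while retaining this ordering guarantee.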

@HadrienG2

HadrienG2 commented Oct 14, 2019

(Continuation of my previous post which raced with your comment being published)

Even if it could be made to work, a definition based on "any write" could preclude useful optimizations on a hardware with a weak memory model.

Consider a hypothetical but suspiciously ARM-like 32-bit weak memory processor which is being asked to perform a 256-bit atomic memcpy with Acquire/Release synchronization. With an "all the writes must have been observed in order to synchronize" definition, this pseudo-assembly is legal:

// Thread 1
st.relaxed r1 -> [target]
st.relaxed r2 -> [target+4]
st.relaxed r3 -> [target+8]
st.relaxed r4 -> [target+12]
st.relaxed r5 -> [target+16]
st.relaxed r6 -> [target+20]
st.relaxed r7 -> [target+24]
st.release r8 -> [target+28]

// Thread 2
ld.acquire [target+28] -> r8
ld.relaxed [target] -> r1
ld.relaxed [target+4] -> r2
ld.relaxed [target+8] -> r3
ld.relaxed [target+12] -> r4
ld.relaxed [target+16] -> r5
ld.relaxed [target+20] -> r6
ld.relaxed [target+24] -> r7

With an "observing any write is enough to synchronize" definition, however, this compilation is unsound, and all individual operations must have acquire or release ordering, which can be much more costly in terms of memory barriers when doing large atomic memcpys.


(Written after your post)

In my opinion, a definition based on "all the writes" is safer in this respect, and doesn't actually forbid much on the developer's and optimizer's side.

I think that doesn't quite work because the writer may not be using SIMD. The writer may write a single atomic u32, and I would want a SIMD-based reader to acquire the changes from that writer if it sees the u32 that was written as part of the SIMD value. (Assuming everything is aligned correctly)

Oooh, that's nasty. IIRC, the situation on this front is that LLVM tried to specify either atomics or volatile with a writer and reader that use operations of different width, but C++11 doesn't, and it seems we aren't quite ready to accept LLVM-ish major extensions to the C++11 memory model in Rust at this point in time.

That being said, your use case could be made to work under an "all writes must have been observed" definition, if we specified atomics of heterogeneous width to work such that if all u32 writes are Release and the compound load is Acquire, then the compound load synchronizes with every individual u32 Release store that it observed.

What would not be allowed, but is not relevant to your use case as far as I understand, is the reverse scenario of an u32 Acquire load synchronizing with a partial view of a 256-bit Release store.

@RalfJung
Member

RalfJung commented Oct 14, 2019

For x86 your question was answered with a "No, SIMD acquire loads are not a thing".

Fair.

I expected that to refer to what LLVM says :

That's just element-wise atomic, it doesn't need anything new to be explained. Just a loop.

There should be an edge from the "last effective read", but we don't specify what order (within a single thread) those reads happen in.

That sounds like a new beast in the world of weak memory concurrency. Not sure if it is implementable in hardware, but it's not something LLVM exposes to my knowledge, so there are some way more fundamental questions to be answered here first. Also, everything I said here applies.

I'd rather we restricted ourselves to what actually seems feasible instead of designing some novel SIMD extensions for LLVM.^^

I think that doesn't quite work because the writer may not be using SIMD. The writer may write a single atomic u32, and I would want a SIMD-based reader to acquire the changes from that writer if it sees the u32 that was written as part of the SIMD value. (Assuming everything is aligned correctly)

Ah, now that sounds somewhat familiar. There are some models for mixed-size atomics. But AFAIK (a) they are all operational; no axiomatic models exist (which is good IMO, as mentioned elsewhere I strongly prefer operational models), but (b) they are all hardware-level models, not language-level models. That means they can be defined in terms of syntactic data dependencies and also do not have to think about some of the crazy weak behaviors that only optimizations can introduce. You can find some of these papers here.

@comex

comex commented Oct 15, 2019

If you want an acquire SIMD load, how about just doing a relaxed/unordered SIMD load followed by an acquire fence (std::sync::atomic::fence)? Similarly, if you want a release store, do a relaxed/unordered store preceded by a release fence.

On x86, both acquire and release fences expand to literally nothing (they're just a compiler fence). To rephrase that from another perspective, SIMD load instructions are at least as strong as a series of relaxed loads (in unspecified order) followed by an acquire fence, and SIMD store instructions are at least as strong as a release fence followed by a series of relaxed stores. Well, either that, or those instructions are unsound to use in an atomic context at all, but I don't think that's true.
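The fence-based suggestion above can be sketched like this: a "wide acquire read" built from relaxed per-element loads followed by an acquire fence, plus the mirror-image release store. On x86 both fences compile down to a pure compiler barrier.

```rust
use std::sync::atomic::{fence, AtomicU32, Ordering};

// Relaxed per-element loads, then an acquire fence: anything that
// happened-before a release store we observed is now visible.
fn wide_acquire_read(src: &[AtomicU32; 4]) -> [u32; 4] {
    let out = [
        src[0].load(Ordering::Relaxed),
        src[1].load(Ordering::Relaxed),
        src[2].load(Ordering::Relaxed),
        src[3].load(Ordering::Relaxed),
    ];
    fence(Ordering::Acquire);
    out
}

// Mirror image: a release fence, then relaxed per-element stores.
fn wide_release_store(dst: &[AtomicU32; 4], vals: [u32; 4]) {
    fence(Ordering::Release);
    for (slot, v) in dst.iter().zip(vals) {
        slot.store(v, Ordering::Relaxed);
    }
}
```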

@HadrienG2

HadrienG2 commented Oct 15, 2019

The problem is that on x86 (and probably on other archs as well), SIMD loads and stores are architecturally not guaranteed to be atomic in the sense of being indivisible, and are provably not atomic on some CPU models. They are best modeled as an unordered stream of atomic load or store operations, LLVM atomic memcpy style.

@comex

comex commented Oct 15, 2019

Indeed. I'm just saying that they do have ordering "as a whole" (with respect to other instructions), which can be modeled by adding a fence before/after that stream of atomic store/load operations.

@RalfJung
Member

Triage:

  • we don't want SIMD accesses to behave any different from other accesses when it comes to data races
  • if there is a need for SIMD atomic accesses, we should figure out how to expose them (that's part of the larger question of "atomic accesses for more types", and largely t-libs)
  • we can have masked SIMD reads where the masked-out parts are not read as far as the AM is concerned (they just contain poison), so there's no race on the masked-out part -- also mostly a t-libs thing

So there's not really anything left to be tracked here; new issues should be opened for the potentially missing features.
