-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Non-temporal stores (and _mm_stream operations in stdarch) break our memory model #114582
Comments
I should have also Cc @rust-lang/opsem |
I believe there are additional options: C) have the backend lower the non-temporal store to either
whichever seems more beneficial. The sfence can be deferred up to the next atomic release (or stronger) and things that might contain one. D) add a |
It's not clear to me that this "breaks the model" as opposed to "extends the model". Why can't we have nontemporal stores as an explicit part of the atomics model, and say that they are not necessarily synchronized by a release-acquire pair to another thread? |
Is "deferred sfence" a thing we can tell LLVM to do? Also I think if we lower _mm_stream_ps in a way that includes an sfence, people will be very surprised. We should rather not expose _mm_stream_ps than expose something that's not the same operation.
No that doesn't work. We still need to explain how and why code has UB when it violates safety. Our current memory model has no way to make this code be UB. This is an intrinsic, not a library function. It is specified by the operational semantics, not a bunch of axioms in the "safety" comment. The safety comment is merely reflecting the constraints that arise from the operational semantics. Making the safety comment the spec would mean making the spec axiomatic and that is something we certainly don't want to do.
We can, but then we'll have to develop our own model. The C++ model does not have such stores. And process-wise, we certainly can't have a library feature like stdarch just change the memory model. That requires an RFC. It seems fairly clear that the impact of |
Not that I know. But it may be something llvm should be doing when a store is annotated with
I assume long as the sfence is sunk far enough they might not care. Just like mixing AVX and SSE can result in VZEROUPPER being inserted by the backend.
I see. That distinction isn't always obvious. But it makes sense if we want the compiler to optimize around the intrinsics more than around FFI calls.
Can we say we use the C++ model with the modification that an axiom (all stores are ordered with release operations) is turned into a requirement (all stores must be ordered with release operations). That seems minimally invasive. |
Vendor intrinsics are sort of halfway between standard Rust and inline asm. The semantics of the vendor intrinsic is whatever asm the vendor says it is, and it's the responsibility of the user of the intrinsic to understand what that means (or doesn't) on the Rust AM and use the intrinsics in a way consistent on the AM. Ideally, writing Is this complicated by the extra function boundaries imposed by us exposing vendor intrinsics as Slightly reinterpreting: it's perfectly acceptable, even expected for Miri to error when encountering vendor intrinsics. Using them steps outside the AM in a similar manner to inline asm, and as such you're relying on target specifics for coherency and sacrificing the ability to use AM-level sanitizers like Miri except on a best-effort basis. Footnotes
|
FFI calls don't help, FFI calls are only allowed to perform operations that could also be performed by Rust code (insofar as Rust-controlled state is affected). Getting the program into a state where release/acquire synchronization does not work any more is not allowed in any way, not even via FFI. Like, imagine the following code: static mut DATA: usize = 0;
static INIT: AtomicBool = AtomicBool::new(false);
thread::spawn(|| {
while INIT.load(Acquire) == false {}
let data = DATA;
if data != data { unreachable_unchecked(); }
});
some_function(&mut DATA, 42);
INIT.store(true, Release); This code is obviously sound. The compiler is allowed to change this code such that the spawned thread becomes while INIT.load(Acquire) == false {}
if DATA != DATA { unreachable_unchecked(); } However, if Therefore, it is UB to leave an inline assembly block or FFI operation with any "pending non-temporal stores that haven't been guarded by a fence yet". The correctness of compilation relies on this assumption.
No, that's not how defining a memory model works. You'll need to define a new kind of memory access that's even weaker than non-atomic accesses and adjust the entire model to account for them.
That doesn't work, they need to be phrased in terms of the Rust AM. Lucky enough they are mostly about what happens to certain values of SIMD type, so the vendor semantics directly translate to Rust AM semantics. But when it comes to synchronization, this approach falls flat. If
There's no way around that, but I hope we don't have many intrinsics that have global synchronization effects.
Even vendor intrinsics are subject to the rule that governs all FFI and inline assembly: you can only do things to Rust-visible state that you could also have done in Rust. |
I hold that while nontemporal_store as intended is not covered by the memory model, it is a) not a huge extension to add it to C/C++'s (though that's ugly to do for a single special case) b) there's a reasonable way you can model what users actually would use it for inside the existing model c) as other people said, no user cares about the purity of the rust AM when doing this, even if you do. The potential extension to the current C++ intro.races (using C++ mainly because there's nicer html for it) is something like:
Obviously this makes happens before even more nontransitive, which is ugly, and it's totally possible I have the details wrong the same way every C/C++ spec has done, but it's really not implausible to do. Alternatively within the existing rules you can completely ignore hardware and say that they act like this (pretending nontemporal_store is monomorphic on u8): // blah blah boilerplate
#[derive(Clone, Copy)]
struct SendMe<T>(T);
unsafe impl<T> Send for SendMe<T> {}
unsafe impl<T> Sync for SendMe<T> {}
#[thread_local]
static mut BUFFER: Option<SendMe<&'static Mutex<Vec<(*mut u8, u8)>>>> = None;
pub unsafe fn nontemporal_store(ptr: *mut u8, val: u8) {
let buf = if let Some(b) = BUFFER {
b
} else {
let b = SendMe(&*Box::leak(Box::new(Mutex::new(Vec::new()))));
BUFFER = Some(b);
b
};
std::thread::spawn(move || {
std::thread::sleep_ms(rand::random());
let buf = buf;
do_sfence(&buf.0);
});
buf.0.lock().unwrap().push((ptr, val));
}
pub fn sfence() {
unsafe {
if let Some(b) = BUFFER {
do_sfence(&b.0)
}
}
}
fn do_sfence(buffer: &'static Mutex<Vec<(*mut u8, u8)>>) {
let mut buffer = buffer.lock().unwrap();
for (ptr, val) in buffer.drain(..) {
unsafe { *ptr = val; }
}
} ie you do vaguely what the cpu will actually do - buffer all nontemporal stores until (some indeterminate time later, or you do a fencing operation). This does not permit |
While this may be true for intrinsics to be coherently usable from [language], it's not quite true in practice; the Intel documentation is that I think my position can mostly be summed up as that if using the vendor intrinsic is (available and) immediate UB in GCC C, GCC C++, LLVM C, LLVM C++, and/or MSVC C++1 (i.e. due to breaking the atomic memory model), it's "fine" if using the intrinsic is immediate UB in rustc Rust. The behavior of the vendor intrinsic is the same as it is in C: an underspecified and incoherent extension to the memory model that might or might not work (i.e. UB). Though given that presumably the intrinsic should work as intended on Intel's compiler, and I think the Intel oneAPI C++ Compiler is an LLVM, it's presumably handled "good enough" in LLVM for Intel's purposes. I'd be fine with marking vendor intrinsics as "morally questionable" (for lack of a better descriptor) and potentially even deprecating them if the backend can't handle them coherently2, but I wouldn't want us to neuter them to use different operations. It's assumed at a level equivalent to inline asm that if a developer is using vendor intrinsics that they know what they're doing. It's "just" (giant scare quotes) that knowing what you're doing with My justification for not neutering the implementation is roughly that with Especially if this family of vendor intrinsics is UB in LLVM C++, this feels like a question that needs to bubble up to Intel's compiler team on how they think the intrinsic should be handled. Because this is a vendor intrinsic, it should behave however the vendor says it should, but we'd absolutely be justified to put a warning on it if it's fundamentally broken. But we shouldn't change how it behaves without input from the vendor. I see maybe 2½+1 resolutions:
but hard choosing one without vendor input seems improper. I find it interesting that NT would be "easier" for Rust if it weren't same-thread consistent, because then it could probably be modeled as an access from an anonymous thread (i.e. exempted from the sequenced-before relation), as modeled by talchas. But it is, so this observation is mostly irrelevant. I make no attempt to say whether the relaxation of the model is accurate, nor whether it breaks the existing proof body built on the current model4. Footnotes
|
Oh yes, I originally wrote that as consuming the buffer and forgot to shift it to drain. (fixed) |
Since I got somewhat nerd-sniped: A spitball extension to the cppreference memory ordering description.Ultimately I think this came out in similar effect to talchas's two rules, but I worked it out a bit further. I believe the most straightforward way to extend the C++ atomic model to support nontemporal writes would be to split off a concept of weakly sequenced-before for nontemporal writes in the causal chain, like how happens-before is split to handle consume causality. The accuracy of my extension is predicated on my understanding of the happens-before relation qualifiers being accurate.Namely, that inter-thread happens-before is happens-before which includes crossing a thread boundary, simply happens-before excludes consume operations (the dependency-ordered-before relation), and strongly happens-before only matters for establishing the global seq_cst ordering As such I'm not particularly interested in sussing out the exact difference of strongly happens-before and simply happens-before, since as Though I am part of the heathen group which would prefer acq_rel be permitted for non-rmw operations, with the semantics of choosing acq/rel as appropriate based on if it's a load/store. Maybe that lessens the validity of my opinions somewhat, but I legitimately believe that a significant part of the reason that "just use seq_cst" is a thing is not because seq_cst is stronger than acq_rel (I find that the added guarantees actually make the behavior harder to reason about, and it doesn't even make an interleaving based understanding of many thread interactions any less wrong) but rather more because you can just use seq_cst everywhere, instead of needing to choose between acq, rel, or acq_rel based on if the operation is a load, store, or rmw. It would make confusion around (the non-existence of) "release loads" and "acquire stores" worse, but I personally think the benefit of being able to use acq_rel everywhere when consistently using acq_rel syncronization outweighs that, since in that paradigm the use of an aquire or release rmw operation being part-relaxed is more immediately evident. We define Downstream sequenced-before requirements are adjusted as:
(This is the dangerous part. The notes about consume ordering w.r.t. simply happens-before and strongly happens-before w.r.t. consume should now say consume and nontemporal. That nontemporal ends up handled similarly to how consume is (i.e. the split between transitive inter-thread happens-before and non-transitive happens-before) I think makes the problems fairly clear, given the known problems with consume ordering. But if consume is formally coherent in the current definition, even if no compiler actually treats it as weaker than acq_rel, then nontemporal can be coherent with this (or a similar) formalization, even if no compiler will actually treat it as weaker than a standard write.) Why does carries a dependency into exclude nontemporal writes? It only matters with release-consume synchronization, but if a chain of weakly sequenced-before This doesn't cover fences, but neither does the cppreference memory ordering description. If my understanding of fences is accurate — that an atomic fence (enters Phrasing other fences in these terms:Given atomic store (any order)
(In the above, and binds more tightly than NB: since atomic operations (including fences) are not nontemporal writes, the weakly sequenced-before and sequenced-before relations between them are definitionally identical. A primary benefit of adding nontemporal operations in this manner is that it should be trivially true that if no nontemporal operations are involved, nothing changes. This means that the extension cannot make the situation any worse for people relying on the memory order guarantees than it currently is; at worst nontemporal operations go from immediate UB to broken in a different manner. It could still of course muck things up for people in charge of maintaining the memory order (compilers), but I believe talchas's surface language The only terms which are impacted are happens-before, visible (side effect), visible sequence of side-effects, and the newly added weakly sequenced-before. The chance of me ever actually proposing this extension anywhere is effectively 0. But I do, to the full extent legally possible, waive my copyright for the contents of this post/comment, releasing it into the public domain. If you want to do something with this wording, feel free. However, even if the above would potentially work as a definition, what it definitely does not show is whether nontemporal stores are currently nonsense (given compilers' models) and "miscompiled" (given this model) by existing compiler transforms. I think it's probably fine — a code transformation relying on weakly sequenced-before (nontemporal) being sequenced-before would necessarily require inter-thread reasoning, and I don't think any existing compiler transforms do such, since "no sane compiler would optimize atomics" — but that's an extremely strong assertion to make. Separately, and actually somewhat relevant — Rust might still not be able to simply use the C++20 memory model completely as-is if we permit mixed-size atomic operations. Though of course, even if the formal C++ description leaves such as a UB data race, the extension to make them at least not race is fairly trivial, and IIRC I don't think anyone expects them to actually synchronize. Though I do think people would expect mixed-size atomic operations to be coherent w.r.t. modification order, and those rules talk about some "atomic object |
because LLVM currently compiles all but sequentially-consistent fences to no-ops on x86, I think it currently miscompiles non-temporal stores if they're anything remotely like normal stores, because you can have acquire/release fences as much as you please and LLVM will happily compile them to nothing. imho the extra guarantees provided by sequential consistency shouldn't be required to make non-temporal stores behave. |
there's also non-temporal loads which seem to also be just as much fun! |
Non-temporal loads, in practice, are effectively normal loads, because their optimization is so conditional it can rarely trigger according to the letter of the ISA (it requires close reading of the fine print to tease this out). Because it is such a theoretical optimization, my understanding is it is largely unimplemented, and the one vestige is that the |
Yes, you need the instructions that happen to be generated by sequential consistency fences, or explicit calls to arch intrinsics/asm like sfence. If you wanted nontemporal stores to behave like normal stores in the language model, you wouldn't need a fence at all; requiring a release fence and pessimizing its codegen for other cases would be a weird worst of all worlds imo. (Requiring a seqcst fence that is in practice always going to generate an acceptable instruction or requiring an explicit sfence both seem fine to me) And yeah, for loads note the "if the memory source is WC (write combining) memory type" in the instruction description - WC is not the normal memory type, it's the weak memory type and if your rust code built for x86_64-any-target-at-all has access to any memory of that type it's already broken wrt synchronization. (NT stores just treat any memory like it's WC) |
well, we still need to be able to have Rust properly handle it because WC memory is often returned by 3D graphics APIs (e.g. Vulkan) for memory-mapped video memory. |
Well uh that'll be fun if you wanted to program to spec, because any store there is basically a nontemporal store and isn't flagged as such in any way to the compiler. It'll work of course so long as you either sfence manually in the right places or never try to have another thread do the commit (which presumably will do the right sync, but needs to happen on the cpu that did the WC write), since any memory given to you like that will be known by the compiler to be exposed, and asm sfence/etc would be marked as touching all exposed memory. (Don't make an &mut of it though probably? Who knows what llvm thinks the rules around noalias + asm-memory are) |
It requires an entire new access class, so it is a bigger change than anything that happened since 2011. It also requires splitting "synchronizes-with" into "synchronization through seqcst fences" (which does synchronize nontemporal accesses) and everything else (which does not synchronize nontemporal accesses). So this is a huge change to the model, and all consistency theorems (like DRF), compiler transformations, lowering schemes etc will have to be carefully reconsidered. We currently have proofs showing that the x86 instructions compilers use to implement memory operations give the right semantics. It would be a mistake to lower our standards below this, mistakes have been made in the past. It is just false to claim that this is a simple extension. Any attempt to do this needs to start with something like https://plv.mpi-sws.org/scfix/paper.pdf, the formulation of the model in the standard is just too vague to be reliable. (According to the authors of that paper, the standard doesn't even reflect the SC fix properly even though that was the intention of the standard authors.) The C++ memory model needed many rounds of iteration with formal memory model researchers to get into its current state. It still has one major issue (out-of-thin-air), but the original version (before Mark Betty's thesis) was a lot worse and even after that some things still had to be fixed (like SCfix). Anyone who claims they can just quickly modify this model and be sure not to reintroduce any of these problems is vastly underestimating the complexity of weak memory models.
Again that just doesn't work. We have a very clear rule for inline assembly and FFI: they can only do things that Rust code could also have done, insofar as Rust-visible state is affected. It is completely and utterly meaningless to take an operation from one semantics and just take it into another, completely different semantics. |
Well they don't say much about how nontemporal stores are supposed to behave, so it's unclear if they are miscompiling or just implementing a surprising (and unwritten) spec. I opened llvm/llvm-project#64521 to find out. |
To be clear, the point I'm making is actually that if the definition as "does this assembly sequence" has busted and unusable semantics on the AM, then the vendor intrinsic is busted and unusable, which is IIUC in agreement with you. The main point where I disagree is that, given Intel has defined a busted and unusable intrinsic, the correct behavior is for us to expose the intrinsic as-is, busted and unusable though it may be, potentially with an editorial note saying as much. basically saying the same thing again twice over but I didn't want to delete itWe don't specify the behavior of vendor intrinsics, the vendor does, and if the vendor specified an unusable intrinsic, that's on them to fix, not us. Changing the implementation of the intrinsic to be busted but still usable would be exposing an incorrect intrinsic. I fully agree that our current semantics don't support the intrinsic, though I do wish it was possible to temporarily suspend the AM's knowledge of a memory location in regions that a spurious or reordered load can't reach, in order to make it, well, not supported, but marginally less unsupported. But, and despite understanding the implications of UB, I do believe "it's always UB (but might accidentally happen to do what you want sometimes)" (the current situation) is preferable to not doing what the intrinsic says on the tin because what it says is "causes UB" (but obfuscated). (Though the intrinsic might actually be soundly usable on a pointer which is never deferenced, retagged, nor otherwise accessed from Rust code, such that the Rust AM is never aware of the AM-invalid allocated object even existing. But such isn't anywhere near practical so is mostly irrelevant even if potentially sound.) on a memory model extension
While this is true of the Intel semantics, I would personally expect the memory model to not include sfence effects into a seqcst fence (my spitball formalization deliberately doesn't) and require users to either write both or to rely on the target architecture having a strong memory ordering plus "no sane compiler would optimize atomics." However, further discussion on extending the model to support NT stores (as opposed to documenting/mitigating for what we have currently) should probably stick to Zulip. Any such movement is at best years out. I only included this point here since this post has other (maybe) useful content. |
(I have edited the OP to explicitly state that adjusting the memory model is an option in principle, but not one I consider realistic.)
I don't see why we would expose a busted intrinsic. We already have inline assembly, and these days it is stable (it was not when stdarch got stabilized), so I think we are totally free to tell Intel that we won't expose operations that don't compose with the rest of our language -- for reasons that are entirely on them, since they told the world that their hardware is TSO(-like) and then also added operations which subvert TSO. If these intrinsics weren't stable yet, is anyone seriously suggesting we would have stabilized them, knowing what we do now? I would be shocked if that were the case. The intrinsic even says in its doc comment that this will likely never be stabilized! Sadly whoever implemented |
I wonder if there is some way that we can argue that
is equivalent to a single inline assembly block with both of these operations. The CPU is in a strange state between those two inline assembly blocks, but can we formulate some conditions under which that strange state cannot have negative side-effects? IOW, can we weaken the rule that inline asm can only do things which Rust could have done, to something where one inline asm block gets the machine into a state that does not fully match what Rust could have done (but the difference only affects a few operations), and then a 2nd inline asm block fixes up the state? Basically what's important is that there is no release operation between the two intrinsics. If the programmer ensures this is the case, then -- can we ensure the compiler doesn't introduce such operations? At the very least this means ensuring the inline asm blocks don't get reordered with release operations, but is that sufficient? I am a bit worried this might be too fragile, but OTOH a principle of "it's fine if machine state the Rust AM doesn't currently need is inconsistent" does seem useful. It's just hard for me to predict the impact of this on optimizations. |
On Thu, Aug 10, 2023 at 11:27:32PM -0700, Ralf Jung wrote:
But actually my proposal wouldn't even make it so that doing two MOVNT to the same location would be UB -- it would just be non-deterministic which write would win.
That's not a fatal flaw; sometimes being non-deterministic for
performance is OK, as long as there's a well-defined way to be
deterministic if you want to be, such as by adding additional barriers.
|
Yeah but it was my intent for that to be UB.^^ (And barriers make it defined, of course.) That is achieved by my later proposal, which can be summarized very succinctly: a nontemporal store is like starting a new background thread that will do the actual store using non-atomic ordering at some point in the future. The fence is waiting for all background threads that were spawned by nontemporal stores of the current thread. The hope is that all the behavior of nontemporal stores can be described with this model. If we want to define that behavior even without fences we should go with something closer to @talchas' original proposal. This can be summarized as: there is a per-thread write-back buffer for nontemporal stores. A background thread flushes the entire buffer at non-deterministic intervals using non-atomic writes. (There is no partial flushing, it's always flushing the entire thing.) The fence flushes the entire buffer right then and there. In both cases, |
The According to Intel documentation, it looks like a non-temporal store, even though Rust documentation does not mention it.
|
|
Proposal: Can we lint on these intrinsics for now (by marking them as deprecated, or implementing a dedicated lint), warning people that they are unsound, and figure out how to address them longer term in the future? This was discussed in the lang team meeting. We recognize that there needs to be more in-depth discussion on this, but a lint would be a useful band-aid for the future. If, on the other hand, we agree that it is possible to use these lints safely, we should update the documentation on how to do that, and consider adding a lint that looks for unsound uses of them. |
If we do add a lint, we should make sure to tell people that if they do want to live on the wild side and use these (as clearly some people do), that they need to add the |
I'm still in favor of the approach described above:
|
@rustbot labels -I-lang-nominated I'm going to unnominate because, while this issue is important, T-lang is not likely to have much more to say about it without a document that explores the issue and the feasible resolutions in some detail. Or additionally or alternatively, T-lang would likely have something more to say about a PR adding e.g. a lint (as suggested above) or about a specific addition to the documentation. Please nominate any of those things, or re-nominate this issue if there's new information and a specific question that T-lang might be able to answer. |
This comment (linked from the top comment) was supposed to be such a "document" / summary. |
Sorry but, as far as I can tell, "a document that explores the issue in detail" means nothing if the information provided so far is insufficient. I am assuming that T-lang does not literally mean they need us to copy and paste that comment's body text into a HackMD link. May we have the pleasure of T-lang specifying what they feel is actually needed in terms of "exploring the issue in detail"? Admittedly, the text offered is incomplete. @tmandry I haven't said it yet in this thread, but the problem actually does not start or end with these intrinsics, so asking about linting on them alone is a distraction. The property attributed to "non-temporal stores" is a property that can apply to anywhere in memory if the page is mapped appropriately, and sometimes such interactions can happen for graphics API code which exposes the ability to intentionally map memory pages appropriately to userspace, and which is used to obtain high-performance graphics (in the sense that for the past 10-20 years we have had "high-performance graphics" on computers). Because it makes a lot of things a great deal more efficient for certain write-heavy sequences. Unlike "you can open |
Yeah, that's fair. I think my preferred way to proceed is a PR against stdarch that
Once we have that PR we can ask t-lang for FCP on that I guess. I'd be happy to write the docs but I'd probably need help with the inline asm part. |
That sounds like a reasonable way forward to me, @RalfJung. For the larger questions, it sounds like we should probably have a design meeting (or an RFC) dedicated to the topic. |
I've opened rust-lang/stdarch#1534 to update the stdarch docs. However I need help updating the implementations of these intrinsics to move away from the Where can I find out which instruction-specific LLVM functions exist? (I mean functions like |
And the last step, moving away from LLVM's |
i missed most of the existence of this issue, but thank you for also updating the documentation on i'd also like to, after the fact, thank folks here for concluding on the inline assembly implementation for
to which i very strongly agree. i have a very specific expectation for i realize the conversation here is focused mostly on the unsoundness of the existing implementation of these intrinsics, but i'd like to emphasize as a user of these instructions that their primary value is for the microarchitectural hardware effect. it's not particularly useful to any use i've seen that these stores are write-combining and weakly ordered, or that corresponding non-temporal loads are also WC and weakly-ordered. in fact (opinion warning) it's something of an annoying x86/x86_64 implementation detail rather than an expected or desired behavior. the primary value of these intrinsics is to avoid pulling known-unlikely data into caches only because it happened to be needed once. using write-combining semantics to achieve that effect is, in my understanding, mostly a convenience to reuse an existing x86 memory type, rather than invent a new type used only for a handful of instructions. so, i hope this underlines why simply implementing these intrinsics as regular loads an stores would be so surprising. one other earlier comment i'd like to understand better, on a hypothetical alternative to having simply stabilized these intrinsics a while ago:
would it be correct to phrase this a little more precisely as "after a nontemporal store and until the next sfence, any concurrent access to the memory nontemporally stored to is UB"? this is an important distinction as where i've typically used non-temporal instructions is to write to a buffer that i may not ever read again from the Rust function doing writing in the first place. in those cases it's on me to ensure (sfence or other target-appropriate) ordering guarantees, but it would not result in UB in terms of the Rust AM? of course, "simply integrate the Rust AM, cross-language requirements, and subtle target architecture semantics and make sure you got it all right" is a tall order and part of why i cannot, for example, Just Point To where i do or would use and one other thought on why these are useful intrinsics to have, re.
Enhanced REP MOVSB, and the related finally, a kind of meta thought on the entire notion of these instrinsics as someone who would use them: in 2022 i'd considered using
if i'd known to look at that all said, rust-lang/stdarch#1534 and rust-lang/stdarch#1541 seem quite reasonable, so thanks again. |
I am not sure we have concluded yet. rust-lang/stdarch#1541 is still open and I honestly don't know if people generally think that this is a good idea or not. I'd prefer if we could use an LLVM intrinsic like for most of the other vendor intrinsics, but I don't know if LLVM would even accept taking such an intrinsic given that they have this
What you were quoting was an earlier, different proposal. What you said is very different from the text you were quoting. So no it wouldn't be correct to view what you said as a rephrasing of what you quoted. When I wrote "release operations are UB" I meant exactly that, I don't see how what you are saying is even similar to what I said. But anyway please let's not discuss an outdated proposal that has since been overhauled. The updated proposal is what landed in rust-lang/stdarch#1534. |
Rollup merge of rust-lang#128149 - RalfJung:nontemporal_store, r=jieyouxu,Amanieu,Jubilee nontemporal_store: make sure that the intrinsic is truly just a hint The `!nontemporal` flag for stores in LLVM *sounds* like it is just a hint, but actually, it is not -- at least on x86, non-temporal stores need very special treatment by the programmer or else the Rust memory model breaks down. LLVM still treats these stores as-if they were normal stores for optimizations, which is [highly dubious](llvm/llvm-project#64521). Let's avoid all that dubiousness by making our own non-temporal stores be truly just a hint, which is possible on some targets (e.g. ARM). On all other targets, non-temporal stores become regular stores. ~~Blocked on rust-lang/stdarch#1541 propagating to the rustc repo, to make sure the `_mm_stream` intrinsics are unaffected by this change.~~ Fixes rust-lang#114582 Cc `@Amanieu` `@workingjubilee`
…ieu,Jubilee nontemporal_store: make sure that the intrinsic is truly just a hint The `!nontemporal` flag for stores in LLVM *sounds* like it is just a hint, but actually, it is not -- at least on x86, non-temporal stores need very special treatment by the programmer or else the Rust memory model breaks down. LLVM still treats these stores as-if they were normal stores for optimizations, which is [highly dubious](llvm/llvm-project#64521). Let's avoid all that dubiousness by making our own non-temporal stores be truly just a hint, which is possible on some targets (e.g. ARM). On all other targets, non-temporal stores become regular stores. ~~Blocked on rust-lang/stdarch#1541 propagating to the rustc repo, to make sure the `_mm_stream` intrinsics are unaffected by this change.~~ Fixes rust-lang/rust#114582 Cc `@Amanieu` `@workingjubilee`
We definitely need fences there, see: https://doc.rust-lang.org/core/arch/x86/fn._mm_sfence.html Even more interesting, the discussion on NT stores in Rust: rust-lang/rust#114582 And on broken NT stores in LLVM: llvm/llvm-project#64521 Oh boy ...
…ieu,Jubilee nontemporal_store: make sure that the intrinsic is truly just a hint The `!nontemporal` flag for stores in LLVM *sounds* like it is just a hint, but actually, it is not -- at least on x86, non-temporal stores need very special treatment by the programmer or else the Rust memory model breaks down. LLVM still treats these stores as-if they were normal stores for optimizations, which is [highly dubious](llvm/llvm-project#64521). Let's avoid all that dubiousness by making our own non-temporal stores be truly just a hint, which is possible on some targets (e.g. ARM). On all other targets, non-temporal stores become regular stores. ~~Blocked on rust-lang/stdarch#1541 propagating to the rustc repo, to make sure the `_mm_stream` intrinsics are unaffected by this change.~~ Fixes rust-lang/rust#114582 Cc `@Amanieu` `@workingjubilee`
I recently discovered this funny little intrinsic with the great comment
Unfortunately, the comment is wrong: this has become stable, through vendor intrinsics like
_mm_stream_ps
.Why is that a problem? Well, turns out non-temporal stores completely break our memory model. The following assertion can fail under the current compilation scheme used by LLVM:
The assertion can fail because the CPU may order MOVNT after later MOV (for different locations), so the nontemporal_store might occur after the release store. Sources for this claim:
This is a big problem -- we have a memory model that says you can use release/acquire operations to synchronize any (including non-atomic) memory accesses, and we have memory accesses which are not properly synchronized by release/acquire operations.
So what could be done?
_mm_stream
intrinsics without it and mark them as deprecated to signal that they don't match the expected semantics of the underlying hardware operation. People should use inline assembly instead and then it is their responsibility to have an sfence at the end of their asm block to restore expected synchronization behavior.Thanks a lot to @workingjubilee and @the8472 for their help in figuring out the details of nontemporal stores.
Cc @rust-lang/lang @Amanieu
Also see the nomination comment here.
The text was updated successfully, but these errors were encountered: