[RFC] AtomicPerByte (aka "atomic memcpy") #3301
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,291 @@ | ||
- Feature Name: `atomic_memcpy` | ||
- Start Date: 2022-08-14 | ||
- RFC PR: [rust-lang/rfcs#3301](https://github.com/rust-lang/rfcs/pull/3301) | ||
- Rust Issue: [rust-lang/rust#0000](https://github.com/rust-lang/rust/issues/0000) | ||
|
||
# Summary | ||
|
||
This is a proposal to add `AtomicPerByte<T>`, to represent _tearable atomics_. | ||
This makes it possible to properly implement a _sequence lock_ in Rust. | ||
|
||
# The Problem | ||
|
||
It's currently not possible to implement an efficient and perfectly | ||
(theoretically) correct sequence lock in Rust. | ||
|
||
Unlike most locking mechanisms, a sequence lock doesn't prevent a race | ||
to access the data it protects. | ||
Instead, it detects a race only after the load operation already happened, | ||
and retries it if the load operation raced with a write operation. | ||
|
||
A sequence lock in Rust looks something like this: | ||
|
||
```rust | ||
// Incomplete example | ||
|
||
pub struct SeqLock<T> { | ||
seq: AtomicUsize, | ||
data: UnsafeCell<T>, | ||
} | ||
|
||
unsafe impl<T: Copy + Send> Sync for SeqLock<T> {} | ||
|
||
impl<T: Copy> SeqLock<T> { | ||
/// Safety: Only call from one thread. | ||
pub unsafe fn write(&self, value: T) { | ||
self.seq.fetch_add(1, Relaxed); | ||
write_data(&self.data, value, Release); | ||
self.seq.fetch_add(1, Release); | ||
} | ||
|
||
pub fn read(&self) -> T { | ||
loop { | ||
let s1 = self.seq.load(Acquire); | ||
let data = read_data(&self.data, Acquire); | ||
let s2 = self.seq.load(Relaxed); | ||
if s1 & 1 == 0 && s1 == s2 { | ||
return unsafe { assume_valid(data) }; | ||
} | ||
} | ||
} | ||
} | ||
``` | ||
|
||
The `write_data` and `read_data` calls can happen concurrently. | ||
The `write` method increments the counter before and after, | ||
such that the counter is odd during `write_data`. | ||
The `read` function will repeat `read_data` until | ||
the counter was identical and even both before and after reading. | ||
This way, `assume_valid` is only ever called on data that was | ||
not the result of a race. | ||
|
||
A big question is how to implement `write_data`, `read_data`, and `assume_valid` | ||
in Rust in an efficient way while satisfying the memory model. | ||
|
||
The somewhat popular `seqlock` crate and similar implementations found in the ecosystem | ||
all use a regular non-atomic write (preceded by an atomic fence) for writing, | ||
and `ptr::read_volatile` (followed by an atomic fence) for reading. | ||
This "works" "fine", but is technically undefined behavior. | ||
|
||
|
||
The C++ and Rust memory models don't allow for data races, | ||
so they don't allow for a data race to be detected after the fact; | ||
that's too late. | ||
|
||
All of the data would have to be written and read through | ||
atomic operations to prevent a data race. | ||
We don't need the atomicity of the data as a whole though; | ||
it's fine if there's tearing, since we re-start on a race anyway. | ||
|
||
Additionally, memory fences technically only "interact" with atomic operations, not with volatile operations. | ||
|
||
# The C++ Solution | ||
|
||
C++'s [P1478] proposes the addition of these two functions to the C++ standard | ||
library to solve this problem: | ||
|
||
```cpp | ||
void *atomic_load_per_byte_memcpy(void *dest, const void *source, size_t, memory_order); | ||
void *atomic_store_per_byte_memcpy(void *dest, const void *source, size_t, memory_order); | ||
``` | ||
|
||
The first one is effectively a series of `AtomicU8::load`s followed by a memory fence, | ||
while the second one is basically a memory fence followed by a series of `AtomicU8::store`s. | ||
Except the implementation can be much more efficient. | ||
The implementation is allowed to load/store the bytes in any order, | ||
and doesn't have to operate on individual bytes. | ||
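
As a rough illustration of the description above, the two operations could be sketched in Rust as loops of relaxed byte accesses paired with a fence. The function names `load_per_byte` and `store_per_byte` are hypothetical and are not part of P1478 or of this RFC; a real implementation would be free to use wider accesses and a different order.

```rust
use std::sync::atomic::{fence, AtomicU8, Ordering};

/// Hypothetical stand-in for an acquire `atomic_load_per_byte_memcpy`:
/// relaxed byte loads, followed by an acquire fence.
fn load_per_byte(src: &[AtomicU8], dest: &mut [u8]) {
    assert_eq!(src.len(), dest.len());
    for (d, s) in dest.iter_mut().zip(src) {
        *d = s.load(Ordering::Relaxed);
    }
    fence(Ordering::Acquire);
}

/// Hypothetical stand-in for a release `atomic_store_per_byte_memcpy`:
/// a release fence, followed by relaxed byte stores.
fn store_per_byte(dest: &[AtomicU8], src: &[u8]) {
    assert_eq!(src.len(), dest.len());
    fence(Ordering::Release);
    for (d, s) in dest.iter().zip(src) {
        d.store(*s, Ordering::Relaxed);
    }
}
```

Because every byte is accessed atomically, a concurrent load and store never constitute a data race; the loaded bytes may simply be a mix of old and new values.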
Review comment: The "load/store bytes in any order" part is quite tricky, and I think means that the specification needs to be more complicated to allow for that. I was originally thinking this would be specified as a series of

Review comment: For a memcpy (meaning the two regions are exclusive) you generally want to copy using increasing address order ("forward") on all hardware I've ever heard of. Even if a forward copy isn't faster (which it often is), it's still the same speed as a reverse copy. I suspect the "any order is allowed" is just left in as wiggle room for potentially strange situations where somehow a reverse order copy would improve performance.

Review comment: A loop of relaxed load/store operations followed/preceded by an acquire/release fence already effectively allows for the relaxed operations to happen in any order, right?
In the C++ paper they are basically as:
and

Review comment: Yes, relaxed loads/stores to different locations can be reordered, so specifying their order is moot under the as-if rule.
Hm... but usually fences and accesses are far from equivalent. If we specify them like this, calling code can rely on the presence of these fences. For example changing a 4-byte atomic acquire memcpy to an AtomicU32 acquire load would not be correct (even if we know everything is initialized and aligned etc). Fences make all preceding/following relaxed accesses potentially induce synchronization, whereas release/acquire accesses only do that for that particular access. |
||
|
||
Review comment: What are the rules for mixed accesses? Can I use an atomic memcpy to access memory that is, at the same time, also accessed via an Also see rust-lang/miri#2303.

Review comment: I think mixed accesses have to be forbidden still. Even if we want to support them in a future memory model (unclear), I believe the C++ spec for this was pretty careful to forbid them. Currently allowing them will also break things like tsan, if nothing else... and the fact that they're also more or less forbidden by even Intel's arch manual makes allowing them in the memory model hard to argue for. Anyway, the current API doesn't allow anything like this, but I suppose I agree this should be documented as intentional.

Review comment: imho mixed atomics should be fine...wasm and javascript have them after all, even on intel. imho C++ doesn't have them because they didn't want to go through the hassle of specifying how they'd work and C++ has TBAA, making mixing between two non-char types illegal even with normal load/stores on a single thread.

Review comment: Mixed atomics also don't work with locks, right? C++ allows locks for its atomic implementation.

Review comment: From the x86 architecture manual 3A/I 8.1.2.2:
It does work in practice though, for the most part. I'm not sure we want to rely on this given that the architecture manual says not to do it.

Review comment: I agree for now we should keep ruling them out, because I assume that's what C++ does -- and that should probably be explicitly documented for atomic memcpy, because that operation is much more likely to be used in unsafe type-punning code than the regular Atomic types.
Would be interesting to see what wasm/JS engines do with them on intel, if they just ignore what intel says in their manual or if they took some precautions. But that is probably a separate discussion.

Review comment: The proposed atomic memcpy in C++ takes

Review comment: Apparently this is mostly reliable: rust-lang/unsafe-code-guidelines#345 |
||
The memory order can only be relaxed, acquire (for load), and release (for store). | ||
Sequentially consistent ordering for these operations is disallowed, | ||
since it's not obvious what that means for these tearable operations. | ||
|
||
# The Rust Solution | ||
|
||
While C++'s solution can be easily copy-pasted into Rust with a nearly identical signature, | ||
it wouldn't fit with the rest of our atomic APIs. | ||
|
||
All our atomic operations happen through the `Atomic*` types, | ||
and we don't have atomic operations that operate on raw pointers. | ||
(Other than as unstable intrinsics.) | ||
|
||
Adding this functionality as a variant on `copy_nonatomic`, similar to the C++ solution, | ||
would not be very ergonomic and can easily result in subtle bugs causing undefined behavior. | ||
|
||
Instead, I propose to add an `AtomicPerByte<T>` type | ||
similar to our existing atomic types: a `Sync` storage for a `T` | ||
that can be written to and read from by multiple threads concurrently. | ||
|
||
The `SeqLock` implementation above would use this type instead of an `UnsafeCell`. | ||
It'd no longer need an unsafe `Sync` implementation, | ||
since the `AtomicPerByte<T>` type can be shared between threads safely. | ||
|
||
This type has a (safe!) `store` method consuming a `T`, | ||
and a (safe!) `load` method producing a `MaybeUninit<T>`. | ||
The `MaybeUninit` type is used to represent the potentially invalid state | ||
the data might be in, since it might be the result of tearing during a race. | ||
|
||
Only after confirming that there was no race and the data is valid | ||
can one safely use `MaybeUninit::assume_init` to get the actual `T` out. | ||
|
||
```rust | ||
pub struct SeqLock<T> { | ||
seq: AtomicUsize, | ||
data: AtomicPerByte<T>, | ||
} | ||
|
||
impl<T: Copy> SeqLock<T> { | ||
/// Safety: Only call from one thread. | ||
pub unsafe fn write(&self, value: T) { | ||
self.seq.fetch_add(1, Relaxed); | ||
self.data.store(value, Release); | ||
self.seq.fetch_add(1, Release); | ||
} | ||
|
||
pub fn read(&self) -> T { | ||
loop { | ||
let s1 = self.seq.load(Acquire); | ||
let data = self.data.load(Acquire); | ||
let s2 = self.seq.load(Relaxed); | ||
if s1 & 1 == 0 && s1 == s2 { | ||
return unsafe { data.assume_init() }; | ||
} | ||
} | ||
} | ||
} | ||
``` | ||
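
Since `AtomicPerByte<T>` does not exist yet, the following is only a minimal sketch of the same sequence lock that compiles today. It assumes a hypothetical `PerByte<N>` helper holding `[u8; N]` in an array of `AtomicU8`; because every byte pattern is a valid `[u8; N]`, its `load` can return the array directly instead of a `MaybeUninit`. The names `PerByte` and the const-generic `SeqLock<N>` are illustrative, not part of the proposal.

```rust
use std::sync::atomic::{
    fence, AtomicU8, AtomicUsize,
    Ordering::{Acquire, Relaxed, Release},
};

/// Hypothetical stand-in for `AtomicPerByte<[u8; N]>`.
pub struct PerByte<const N: usize>([AtomicU8; N]);

impl<const N: usize> PerByte<N> {
    pub fn new(value: [u8; N]) -> Self {
        Self(value.map(AtomicU8::new))
    }

    /// Release store: a release fence followed by relaxed byte stores.
    pub fn store(&self, value: [u8; N]) {
        fence(Release);
        for (slot, byte) in self.0.iter().zip(value) {
            slot.store(byte, Relaxed);
        }
    }

    /// Acquire load: relaxed byte loads followed by an acquire fence.
    /// The result may be torn if it raced with a `store`.
    pub fn load(&self) -> [u8; N] {
        let out: [u8; N] = std::array::from_fn(|i| self.0[i].load(Relaxed));
        fence(Acquire);
        out
    }
}

pub struct SeqLock<const N: usize> {
    seq: AtomicUsize,
    data: PerByte<N>,
}

impl<const N: usize> SeqLock<N> {
    pub fn new(value: [u8; N]) -> Self {
        Self { seq: AtomicUsize::new(0), data: PerByte::new(value) }
    }

    /// Assumes a single writer. Concurrent writers cannot cause undefined
    /// behavior in this sketch, but readers could then accept mixed values.
    pub fn write(&self, value: [u8; N]) {
        self.seq.fetch_add(1, Relaxed);
        self.data.store(value);
        self.seq.fetch_add(1, Release);
    }

    pub fn read(&self) -> [u8; N] {
        loop {
            let s1 = self.seq.load(Acquire);
            let data = self.data.load();
            let s2 = self.seq.load(Relaxed);
            // Accept only if no write was in progress before or after the load.
            if s1 & 1 == 0 && s1 == s2 {
                return data;
            }
        }
    }
}
```

The byte-at-a-time loops are the part the proposed type is meant to replace with something the compiler can lower efficiently.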
|
||
# Full API Overview | ||
|
||
The `AtomicPerByte<T>` type can be thought of as | ||
the `Sync` (data race free) equivalent of `MaybeUninit<T>`. | ||
It can contain a `T`, but it might be invalid in various ways | ||
due to concurrent store operations. | ||
Its interface resembles a mix of the interfaces of `MaybeUninit` and the atomic types. | ||
|
||
```rust | ||
#[repr(transparent)] | ||
struct AtomicPerByte<T> { inner: UnsafeCell<MaybeUninit<T>> } | ||
|
||
unsafe impl<T: Send> Sync for AtomicPerByte<T> {} | ||
|
||
impl<T> AtomicPerByte<T> { | ||
pub const fn new(value: T) -> Self; | ||
pub const fn uninit() -> Self; | ||
|
||
pub fn store(&self, value: T, ordering: Ordering); | ||
|
||
pub fn load(&self, ordering: Ordering) -> MaybeUninit<T>; | ||
|
||
pub fn store_from(&self, src: &MaybeUninit<T>, ordering: Ordering); | ||
pub fn load_to(&self, dest: &mut MaybeUninit<T>, ordering: Ordering); | ||
|
||
pub fn store_from_slice(this: &[Self], src: &[MaybeUninit<T>], ordering: Ordering); | ||
pub fn load_to_slice(this: &[Self], dest: &mut [MaybeUninit<T>], ordering: Ordering); | ||
|
||
pub const fn into_inner(self) -> MaybeUninit<T>; | ||
|
||
pub const fn as_ptr(&self) -> *const T; | ||
pub const fn as_mut_ptr(&self) -> *mut T; | ||
|
||
pub const fn get_mut(&mut self) -> &mut MaybeUninit<T>; | ||
pub const fn get_mut_slice(this: &mut [Self]) -> &mut [MaybeUninit<T>]; | ||
|
||
pub const fn from_mut(value: &mut MaybeUninit<T>) -> &mut Self; | ||
pub const fn from_mut_slice(slice: &mut [MaybeUninit<T>]) -> &mut [Self]; | ||
} | ||
|
||
impl<T> Debug for AtomicPerByte<T>; | ||
impl<T> From<MaybeUninit<T>> for AtomicPerByte<T>; | ||
``` | ||
|
||
Note how the entire interface is safe. | ||
All potential unsafety is captured by the use of `MaybeUninit`. | ||
|
||
The load functions panic if the `ordering` is not `Relaxed` or `Acquire`. | ||
The store functions panic if the `ordering` is not `Relaxed` or `Release`. | ||
The slice functions panic if the slices are not of the same length. | ||
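
The panic rules above could be checked with helpers along these lines. This is a sketch only; the helper names are hypothetical and the RFC does not prescribe how the checks are implemented or what the panic messages say.

```rust
use std::sync::atomic::Ordering;

/// Hypothetical check used by `load`, `load_to` and `load_to_slice`.
fn assert_load_ordering(ordering: Ordering) {
    match ordering {
        Ordering::Relaxed | Ordering::Acquire => {}
        other => panic!("invalid ordering for a per-byte atomic load: {other:?}"),
    }
}

/// Hypothetical check used by `store`, `store_from` and `store_from_slice`.
fn assert_store_ordering(ordering: Ordering) {
    match ordering {
        Ordering::Relaxed | Ordering::Release => {}
        other => panic!("invalid ordering for a per-byte atomic store: {other:?}"),
    }
}

/// Hypothetical check used by the slice functions.
fn assert_same_len(this_len: usize, other_len: usize) {
    assert_eq!(this_len, other_len, "slices must have the same length");
}
```

This mirrors how the existing atomic types already panic on, for example, a `Release` load.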
|
||
# Drawbacks | ||
|
||
- In order for this to be efficient, we need an additional intrinsic hooking into | ||
special support in LLVM. (Which LLVM needs to have anyway for C++.) | ||
Comment on lines +208 to +209

Review comment: How do you plan to implement this until LLVM implements this? I don't think it is necessary to explain the implementation details in the RFC, but if we provide an unsound implementation until the as yet unmerged C++ proposal is implemented in LLVM in the future, that seems to be a problem. (Also, if the language provides the functionality necessary to implement this soundly in Rust, the ecosystem can implement this soundly as well without inline assembly.)

Review comment: I haven't looked into the details yet of what's possible today with LLVM. There's a few possible outcomes:
I'm not fully sure yet which of these are feasible. |
||
|
||
- It's not immediately obvious this type behaves like a `MaybeUninit`, | ||
making it easy to forget to manually drop any values that implement `Drop`. | ||
|
||
This could be solved by requiring `T: Copy`, or by using a better name for this type. (See alternatives below.) | ||
|
||
Very clear documentation might be enough, though. | ||
|
||
- `MaybeUninit<T>` today isn't as ergonomic as it should be. | ||
|
||
For a simple `Copy` type like `u8` it might be nicer to be able to use types like `&[u8]` | ||
rather than `&[MaybeUninit<u8>]`, etc. | ||
(But that's a larger problem affecting many other things, like `MaybeUninit`'s interface, | ||
`Read::read_buf`, etc. Maybe this should be solved separately.) | ||
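
The first drawback in the list above comes from how `MaybeUninit` itself behaves today: it never runs the destructor of its contents. A small sketch of that behavior, using a hypothetical `Noisy` type and `DROPS` counter for illustration:

```rust
use std::mem::MaybeUninit;
use std::sync::atomic::{AtomicUsize, Ordering};

// Counts how many times `Noisy::drop` has run.
static DROPS: AtomicUsize = AtomicUsize::new(0);

struct Noisy;

impl Drop for Noisy {
    fn drop(&mut self) {
        DROPS.fetch_add(1, Ordering::Relaxed);
    }
}

/// Returns the drop count after a `MaybeUninit<Noisy>` goes out of scope,
/// and the count after a second one is explicitly unwrapped and dropped.
fn drops_observed() -> (usize, usize) {
    {
        // Going out of scope does not run `Noisy::drop`.
        let _forgotten = MaybeUninit::new(Noisy);
    }
    let after_scope = DROPS.load(Ordering::Relaxed);

    let slot = MaybeUninit::new(Noisy);
    // SAFETY: `slot` was initialized by `MaybeUninit::new` just above.
    drop(unsafe { slot.assume_init() });
    (after_scope, DROPS.load(Ordering::Relaxed))
}
```

A value obtained from `AtomicPerByte::load` would need the same explicit `assume_init` step before its destructor can run.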
|
||
# Alternatives | ||
|
||
- Instead of a type, this could all be just two functions on raw pointers, | ||
such as something like `std::ptr::copy_nonoverlapping_load_atomic_per_byte`. | ||
|
||
This means having to use `UnsafeCell` and more unsafe code wherever this functionality is used. | ||
|
||
It'd be inconsistent with the other atomic operations. | ||
We don't have e.g. `std::ptr::load_atomic` that operates on pointers either. | ||
Comment on lines +227 to +233

Review comment: It would still be convenient to have both the

Review comment: Thinking more about this, I believe I got two things mixed up! The Does
otherwise this could lead to mixing non-perfectly-overlapping atomic accesses in the

Review comment: I would expect

Review comment: With "element-wise atomic copy" I mean whether the internal implementation of To complete the example of how this could be unsound when If "perform a single atomic

Review comment: I don't think any of that is even defined in the spec; these are all unspecified implementation details. The only alignment that I feel like I don't understand the question, since this seems obvious. Or are you repeating #3301 (comment) here? I am assuming this thread is disjoint from the one where we are discussing the x86 "concurrent atomic accesses to the not-perfectly-overlapping locations" problem.

Review comment: I think the question here is different than the linked comment though touching on similar areas. By my understanding it is: Can Consider the following situation, explained rather much but i'd rather overexplain than introduce more confusion by underexplaining:

    static ARRAY: [AtomicPerByte<u16>; LEN] = [AtomicPerByte::new(0); LEN];
    fn thread_1() {
        let new = [MaybeUninit::new(1u16); LEN];
        AtomicPerByte::store_from_slice(&ARRAY, &new, Ordering::Release);
    }

If
However, this is unsound, because another thread could store to a non-exactly overlapping slice:

    fn thread_2() {
        let new = [MaybeUninit::new(2u16); 3];
        let lower = &ARRAY[0..3];
        AtomicPerByte::store_from_slice(&lower, &new, Ordering::Release);
    }

In the case that ARRAY happened to be 8-byte aligned, (and
Note that on Phrased differently:
Thus,
is an argument to add additional functions

    unsafe fn atomic_load_per_byte_memcpy<T>(src: &[AtomicPerByte<T>], dest: &mut [T], ordering: Ordering);
    unsafe fn atomic_store_per_byte_memcpy<T>(src: &[T], dest: &[AtomicPerByte<T>], ordering: Ordering);

that have the additional requirement that no non-exactly-overlapping slice of the provided slice of

Review comment: Despite what the manual says, Intel has a "verbal" guarantee that non-exactly-overlapping atomics is fine: rust-lang/unsafe-code-guidelines#345 (comment). It may tear across cache lines (which is fine) but will not put the CPU into an undefined state. This situation is unlikely to change given that need to keep C++'s byte-wise memcpy working too.

Review comment: While this is true, in that linked issue and elsewhere in this pr this has been discussed and my readings of these discussions is a disposition to respect the manual and treat it as undefined, even if it works in practice. If that changes, then that does solve this issue, but otherwise this remains a problem (though it could be not solved at all or a solution could be postponed)

Review comment: That mailing list post was sent on 6 April, all prior discussions were done without this piece of information. The only comment on this RFC on non-exactly-overlapping access since then was yours, so I don't think we can say there's a disposition yet, especially in light of the new information. In general I don't think "stick to the documentation" is a good approach when dealing with proprietary platforms, otherwise we can't ship anything for macOS at all. If the vendor is receptive, we should nag them to change the documentation like @m-ou-se has done before with MS MicrosoftDocs/sdk-api#447, but even if they don't change it, we shouldn't have our hands tied by it. In this case, the C++ proposal is committed to allowing multi-byte access optimisations, and they are privy to more information from vendors, so I think sticking to their approach should be safe.

Review comment: Yes, it's a totally reasonable (and probably good) resolution to this and other issues. I'm not trying to argue against (or necessarily for) that, but rather to clearly explain the issue that was raised in this particular sub thread, over which there seemed to be confusion that was not resolved. |
||
|
||
- Require `T: Copy` for `AtomicPerByte<T>`, such that we don't need to worry about | ||
duplicating non-`Copy` data. | ||
|
||
There are valid use cases with non-`Copy` data, though, such as [in crossbeam-deque](https://github.com/crossbeam-rs/crossbeam/blob/2d9e7e0f81d3dd3efb1975b6379ea8b05fcf9bdd/crossbeam-deque/src/deque.rs#L60-L78). | ||
Also, not all "memcpy'able" data is always marked as `Copy` (e.g. to prevent implicit copies). | ||
|
||
- Leave this to the ecosystem, outside of the standard library. | ||
|
||
Since this requires special compiler support, a community crate would | ||
have to use (platform specific) inline assembly | ||
or (probably technically unsound) hacks like volatile operations. | ||
|
||
- Use a new `MaybeTorn<T>` instead of `MaybeUninit<T>`. | ||
|
||
`AtomicPerByte` doesn't _have_ to support uninitialized bytes, | ||
but it does need a wrapper type to represent potentially torn values. | ||
|
||
If Rust had a `MaybeTorn<T>`, we could make it possible to load types like `[bool; _]` or even `f32` without any unsafe code, | ||
since, for those types, combining bytes from different values always results in a valid value. | ||
|
||
However, the use cases for this are very limited, it would require a new trait to mark the types for which this is valid, | ||
and it makes the API a lot more complicated or verbose to use. | ||
|
||
Also, such an API for safely handling torn values can be built on top of the proposed API, | ||
so we can leave that to a (niche) ecosystem crate. | ||
|
||
- Don't allow an uninitialized state. | ||
|
||
Even if we use `MaybeUninit<T>` to represent a 'potentially torn value', | ||
we could still attempt to design an API where we do not allow an uninitialized state. | ||
|
||
It might seem like that results in a much simpler API with `MaybeUninit<T>` replaced by `T` in | ||
methods like `into_inner()` and `get_mut()`, but that is not the case: | ||
|
||
As long as `store()` can be called concurrently by multiple threads, | ||
it is not only the `load()` method that can result in a torn value, | ||
since the `AtomicPerByte<T>` object itself might end up storing a torn value. | ||
|
||
Therefore, even if we disallow uninitialized values, | ||
every method will still have `MaybeUninit<T>` in its signature, | ||
at which point we lose basically all benefits of removing the uninitialized state. | ||
|
||
Removing the uninitialized state does result in a big downside for users who need to add that state back, | ||
as the interface of an `AtomicPerByte<MaybeUninit<T>>` would result in doubly wrapped `MaybeUninit<MaybeUninit<T>>` in many places, | ||
which can be quite unergonomic and confusing. | ||
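
To illustrate the observation in the `MaybeTorn<T>` alternative above that some types stay valid under tearing, here is a small sketch for `f32`. The `TearableF32` type is hypothetical and uses only relaxed per-byte accesses; it is not the proposed API, just a demonstration that no unsafe code is needed when every byte combination is a valid value.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

/// Hypothetical per-byte atomic storage for an `f32`.
pub struct TearableF32([AtomicU8; 4]);

impl TearableF32 {
    pub fn new(value: f32) -> Self {
        Self(value.to_ne_bytes().map(AtomicU8::new))
    }

    pub fn store(&self, value: f32) {
        for (slot, byte) in self.0.iter().zip(value.to_ne_bytes()) {
            slot.store(byte, Ordering::Relaxed);
        }
    }

    /// May combine bytes from several concurrent stores, but any four
    /// bytes form some valid `f32`, so no `MaybeUninit` is required.
    pub fn load(&self) -> f32 {
        let bytes: [u8; 4] = std::array::from_fn(|i| self.0[i].load(Ordering::Relaxed));
        f32::from_ne_bytes(bytes)
    }
}
```

A torn result here is still a meaningless number, which is why callers would normally pair such a load with a validity check like the sequence counter.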
|
||
# Unresolved questions | ||
|
||
- Should we require `T: Copy`? | ||
|
||
|
||
There might be some valid use cases for non-`Copy` data, | ||
but it's easy to accidentally cause undefined behavior by using `load` | ||
to make an extra copy of data that shouldn't be copied. | ||
|
||
- Naming: `AtomicPerByte`? `TearableAtomic`? `NoDataRace`? `NotQuiteAtomic`? | ||
Review comment: Given these options and considering what the C++ paper chose,
|
||
|
||
[P1478]: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p1478r7.html |
Review comment: There's something very subtle here that I had not appreciated until a few weeks ago: we have to ensure that the `load` here cannot return an outdated value that would prevent us from noticing a seqnum bump.
The reason this is the case is that if there is a concurrent `write`, and if any part of `data` reads from that write, then we have a release-acquire pair, so then we are guaranteed to see at least the first `fetch_add` from `write`, and thus we will definitely see a version conflict. OTOH if the `s1` reads-from some second `fetch_add` in `write`, then that forms a release-acquire pair, and we will definitely see the full data.
So, all the release/acquire are necessary here. (I know this is not a seqlock tutorial, and @m-ou-se is certainly aware of this, but it still seemed worth pointing out -- many people reading this will not be aware of this.)
(This is related to this comment by @cbeuw.)

Review comment: Yeah exactly. This is why people are sometimes asking for a "release-load" operation. This second load operation needs to happen "after" the `read_data()` part, but the usual (incorrect) `read_data` implementation doesn't involve atomic operations or a memory ordering, so they attempt to solve this issue with a memory ordering on that final load, which isn't possible. The right solution is a memory ordering on the `read_data()` operation.

Review comment: Under a reordering based atomic model (as CPUs use), a release load makes sense and works. Release loads don't really work unless they are also RMWs (`fetch_add(0)`) under the C11 model.

Review comment: Yeah, the famous seqlock paper discusses "read dont-modify write" operations.