-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tracking issue for improving std::sync::{Mutex, RwLock, Condvar} #93740
Comments
Nevermind it does not have timeouts for |
I don't think we need to use hash tables like parking lot for targets without futex support. We could just store the queues inside Mutex. I think the only downside is size, but that'll still be better than the current boxed pthread mutex though. |
Also it doesn't support |
Oh. This is a subject extremely near-and-dear to my heart -- I even started on a pre-RFC for atomic wait/wake ops, like the ones that C++ got recently. That said, much of my motivation was hoping it would find a way to unstick us here, and I this is clearly a higher priority. For futex APIs, I wrote a blog post on them at one point. https://shift.click/blog/futex-like-apis I wrote about some of the options here for system APIs, which might be helpful as reference. That said, I got asked about Darwin's For Darwin, it I think it's quite unlikely we could use That said, there's... a pretty cursed option for a workaround on these targets. Too cursed to use? Not sure... possibly. That is:
So, maybe? Given that the OS provides (If not, oh well, someday an API shaped mostly like that is bound to end up as part of the libc someday, after it's copied to C from C++ in C9Z 😛) (EDIT: Thinking about it more, it's fairly likely we should have some justification for this approach, since looking at the libc++ code, enough happens between what our all would be that it could defeat the point. And the effort spent here to make it work would be better spent on improving our fallback) All that said, there are a few options on various platforms that could probably help implement the fallback thread parker, in case there are regressions (which plausibly there wouldn't be if we use
That said, a lot of this probably should be in the a tail of followups, and only if it's worthwhile. Footnotes
|
Please make sure you implement a fair or eventually fair (this is what parking lot does and needs the hashmap for) lock to prevent starvation. |
I don't think pthread mutexes are fair so the new implementation shouldn't be required to be fair either. |
SRWLOCK is also unfair too, More generally: There are a bunch of locking features that would be nice to have in an ideal world, but I think they start becoming contradictory, or only achievable easily on one platform or another -- for example, I believe the parking-lot style of lock won't respect OS priorities (unless there's a clever trick I don't know about or something). Trying to solve all of this probably has no solution, and I'm not sure that we're in the right position decide which things we want to promise yet1... assuming we even want to make these sorts of promises (and I'm not sure we do). Footnotes
|
Apple switched the default from PTHREAD_MUTEX_POLICY_FAIRSHARE_NP to the newly introduced PTHREAD_MUTEX_POLICY_FIRSTFIT_NP in macOS 10.14 (iOS and tvOS 12.0, watchOS 5.0), so they are not fair for recent macOS either (by default). |
Wouldn't implementing lock algorithms on top of futex et al. effectively mean that both the |
@nagisa Yes. That's an issue with all platform-specific code, I suppose. But we do plan to also have the CI test the generic implementation (parking lot) on Linux as well, so it gets test coverage. |
I'm now reading through the various popular lock implementation to see how those are implemented. Starting with musl. MutexesMusl's Its implementation is relatively simple. The When unlocking, a 0 is unconditionally written to the lock, and when the high bit was set or the waiter count was nonzero, a one thread is awoken through the futex. It's not entirely clear to me how useful the waiter count is. This mechanism would also work correctly with only the 'contended'-bit. The waiter count is only used by the spin loop while locking, to avoid spinning and sleep directly when there's other waiters, to avoid stealing the lock in front of others. This makes things a little bit more 'fair'. See commit f5fb20b. Condition variablesMusl's The lock protects a doubly linked list pointed at by the two pointers. This doubly linked list is the waiting queue, where each node is stored on the stack of a waiting thread. Each node contains a state (waiting, signalled, or leaving) and a futex that the thread is waiting on. Reader-writer locksMusl's Read-unlocking a contended lock will wake up one thread. Write-unlocking a contended lock will wake up all threads. The latter could cause another writer thread and many reader threads to wake up at the same time, potentially resulting in the writer thread winning the race and all the reader threads going back to sleep again. The In general it's interesting to see how many implementation details are relevant to Posix, but not to us here in Rust. For example, commit c68de0b mentions how Posix allows destroying and unmapping a mutex even if another thread hasn't returned from Tl;dr: Musl uses trivial rwlocks and mutexes, except it separately counts waiters to avoid spinning when there's other waiters. Rwlocks don't prioritize writers because Posix wants reentrant reader locks. Condition variables are not trivial; they keep a waiter queue as linked list in user space, and requeue threads one by one. |
Next: glibc's mutexes. (Glibc's condition variables and rwlocks are a bit more complicated. I'll come back to those later.) MutexesGlibc's mutexes are 40 bytes and support a lot of different properties. The 'normal' kind, ( The normal mutex, unlike the many other variants, is very simple. It has three states. 0 for unlocked, 1 for locked without waiters, and 2 for locked with waiters. Locking tries to compare-and-swap 0 to 1, or otherwise swaps in a 2 and futex-wait's for things to change. Unlocking swaps a 0 in place, and futex-wake's one waiting thread if it was set to 2. Unlike musl's implementation, it does not try to spin at any point, and directly switches to sleeping if the compare-and-swap fails. It's basically identical to the one I wrote in my experiment: https://github.com/m-ou-se/futex-lock-experiment/blob/ab0e9a135aff5f3b2e01218a3281cbc552d76653/src/lib.rs#L30-L109 |
|
Lack of timed wait is a problem, and libstd supports back to 10.7+ (and I'm not sure what's required to reconsider this). |
Ah, fair point for the 10.7+ requirement. Does the new |
It's supported via condvar: https://doc.rust-lang.org/std/sync/struct.Condvar.html |
Continuing with glibc's condition variables: Condition variablesGlibc's condition variables are 48 bytes, and all of them are used in normal operation. Unlike musl's implementation, this one does not keep a queue in user space, as the same implementation is used for process-shared condition variables, which cannot refer to any memory addresses that might be different in different processes. It keeps a 64-bit sequence number that's incremented by every new waiter, and a second 64-bit number that indicates which of those have been signalled. Within the structure, there are futexes and some other fields for exactly two 'groups' of waiters: G1 and G2. Waiters add themselves to G2, and signals wake up threads in G1. When G1 is empty, signalling first swaps the two groups. This is a workaround for bug 13165, which is about time-traveling signals that end up waking future waiters rather than current waiters. Two bits of the condition variable represent its internal lock (a trivial three state mutex), which protects the state of the condition variable. There is some additional complexity about destroying a condition variable while there are still waiters. However, this is irrelevant for our Rust implementation, since we don't allow dropping things that are still in use. This implementation raises the following questions for a Rust implementation:
|
More on glibc's condition variables: Their old design before the bugfix for the time-traveling signals is documented here in pseudocode: https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/DESIGN-condvar.txt;h=4845251c7586c9667cd33c38af3701910d65eaa8;hb=c0ff3befa9861171498dd29666d32899e9d8145b It keeps a (64-bit) counter for the number of signals that have been sent, and another for the number of threads that have been woken up. The difference between those two is effectively the number of signals ('tokens') that can still be consumed. If a waiting thread wakes up from the futex-wait operation and there are no tokens left (i.e. the counters are equal), it goes back to sleep. That's how a future thread can consume a token and prevent another waiting thread from waiting up from |
llvm libc for comparison: |
Finishing up glibc: Reader-writer locksGlibc's Part of its complexity is due to supporting these different modes within the same structure. Another part of the complexity is due to posix allowing destruction in cases where the Rust borrow checker would not allow it. Both of these are not relevant to our implementation in Rust. The reader counter also contains three state bits. Using it directly as a futex is avoided, in part because that number can rapidly change as readers come and go. Therefore, one of the state bits (whether the lock is in reader or writer mode) is duplicated in a separate futex. It also includes a mechanism through which writers can hand over a writer lock to each other without unlocking, which is only used if the lock is set to a mode that prefers writers. It has nine states, each of which is either part of the 'read phase' (1 through 4a) or the 'write phase' (5 through 8). Even the idle state is split in both a 'read phase' (1) and a 'write phase' (5) idle state. It's entirely symmetrical, except for one more state (4a) in the read phase which is used in the non-recursive writer-preferred mode to block new waiters even when the lock is already read-locked. The transition from one of the idle states to the locked state in the other phase (2 or 7) goes through a intermediary state (3 or 6), during which a thread trying to lock the lock in the 'current phase' (read/write) can prevent the phase change and steal the lock. (This intermediary state is not used for the 'try lock' functions.) The locked states (2 and 7) are split into two states where one indicates there are waiters of the opposite phase (4/4a and 8). Below is a state transition diagram of this lock. The dashed arrows are only used if writers are preferred. The thin arrows are only used by the 'try lock' functions. The thread ID of the current writer that's stored inside the lock is only used for debugging and deadlock detection when the holder of a write lock tries to acquire a read lock. |
Next up: Wine We don't have the source code to Microsoft's SRW locks and condition variables, but we do have the source code to Wine, which implements these as well. However, while they have to be the same size as Microsoft's implementation, they are implemented very differently and you'll see very different bit patterns in them if you inspect them while they are in use. Reader-writer locks (and Mutexes)On Windows we use SRW locks for both The lock and unlock functions are very simple. Write-locking increments the waiter counter (using a 16-bit operation), and then uses a 32-bit CAS operation on both counters at once to acquire the lock if the owner counter is zero by setting it to Read-locking uses a CAS operation to increment the number of owners only if there's no waiting writers and the lock isn't write-locked. So, waiting writers block new readers. If the conditions aren't met, it futex-wait's until they are. Interestingly, two different addresses are used within the SRW lock as two different futexes. Readers wait on the address of the SRW lock itself and use the full 32-bit value, while writers wait only on the second half which contains the 16-bit owner counter. Unlocking the last reader lock will wake a single thread from the writer queue. Unlocking a writer lock will wake a single writer if there was any waiting (indicated by the waiting counter), or wake all readers if there are no waiting writers. Mixing 16-bit and 32-bit atomic operations on overlapping memory is not common, and the Intel manual (vol. 3A) seems to suggest this is not a great idea:
Also, surprisingly, no overflow checks are done. If either of the counter overflows, unexpected things happen. Condition variablesWine's condition variables are trivial. Just like SRW locks, they are the size of a pointer, but Wine's implementation only uses 32-bits. It is simply a 32-bit counter that gets incremented on every notification. It does not keep track of which lock it is used together with, nor does it make use of requeueing. It does not keep track of any information to prevent a futex-wake operation when there's no thread waiting, etc. It is identical to Footnotes
|
Next up: Windows Now this one is a bit tricky because the source code isn't available, but I made some guesses and have good reason to believe they are accurate. Reader-writer locks (and Mutexes)On Windows we use SRW locks for both When there's no contention, the number of active read locks is stored in the counter. For a write lock, the counter is 0. The least significant bit is used to indicate whether the lock is locked (regardless of read or write mode). An unlocked lock is entirely zero. When there are threads waiting, that is indicated by one of the four flag bits, and the counter is no longer used and instead used as a pointer to a doubly linked list of waiting threads. The nodes of that list are stored on the stack of the waiting threads. Each node also contains a pointer to the last node in the queue, so you can quickly get there from the first pointer. (The lock only stores a pointer to the first node.) The last node contains the number of active read locks, since the pointer took the place of that counter within the lock itself. Each node contains the thread id of the waiting thread, and an additional flag that indicates whether the thread is waiting to acquire a read or write lock. The thread id is used with the undocumented I'm assuming some of the flag bits in the lock are used as a mutex to protect modifications to the queue. A waiting writer on a read-locked lock prevent new readers from acquiring the lock, to avoid writer starvation, but it doesn't exactly prefer writers in general: it ~roughly attempts to handle the requests in order. SRW locks are not fair. Like most lock implementations, it first spins a while before going to sleep, to avoid expensive syscalls when the lock is only locked for a very short time. Spinning seems configurable through some global. It also seems to support an alternative to spinning based on the Condition variablesCondition variables on Windows are also a single pointer. They work very similar to SRW locks, storing a pointer with flag bits in the least significant bits, with the pointer pointing to a linked list over waiting threads. The nodes even have the same structure as the nodes used by SRW locks.
Here's a diagram showing an SRW lock in different states: A condition variable looks nearly identical, except for the locked states and the four flag bits. tl;dr: SRW locks and condition variables on Windows are thread parker based and keep their waiting queue as a linked list in user space through nodes on the stack of the waiting threads. Thread parking is natively supported on Windows through an undocumented API. It prefers writers to avoid writer starvation. They spin before sleeping, and can use the special |
These constructors have been const since Rust 1.63 (rust-lang/rust#93740). It's pretty easy for us to make them const too, which allows code that relies on them being const to correctly compile with Shuttle. The one exception is that HashMap::new isn't const, and our Condvar implementation uses a HashMap to track waiters. I took the easy way out and just used a vector as an association list instead -- we shouldn't expect large numbers of waiters on the same condvar, so this shouldn't be too inefficient.
These constructors have been const since Rust 1.63 (rust-lang/rust#93740). It's pretty easy for us to make them const too, which allows code that relies on them being const to correctly compile with Shuttle. The one exception is that HashMap::new isn't const, and our Condvar implementation uses a HashMap to track waiters. I took the easy way out and just used a vector as an association list instead -- we shouldn't expect large numbers of waiters on the same condvar, so this shouldn't be too inefficient.
These constructors have been const since Rust 1.63 (rust-lang/rust#93740). It's pretty easy for us to make them const too, which allows code that relies on them being const to correctly compile with Shuttle. The one exception is that HashMap::new isn't const, and our Condvar implementation uses a HashMap to track waiters. I took the easy way out and just used a vector as an association list instead -- we shouldn't expect large numbers of waiters on the same condvar, so this shouldn't be too inefficient.
Replace pthread `RwLock` with custom implementation This is one of the last items in rust-lang#93740. I'm doing `RwLock` first because it is more self-contained and has less tradeoffs to make. The motivation is explained in the documentation, but in short: the pthread rwlock is slow and buggy and `std` can do much better. I considered implementing a parking lot, as was discussed in the tracking issue, but settled for the queue-based version because writing self-balancing binary trees is not fun in Rust... This is a rather complex change, so I have added quite a bit of documentation to help explain it. Please point out any part that could be explained better. ~~The read performance is really good, I'm getting 4x the throughput of the pthread version and about the same performance as usync/parking_lot on an Apple M1 Max in the usync benchmark suite, but the write performance still falls way behind what usync and parking_lot achieve. I tried using a separate queue lock like what usync uses, but that didn't help. I'll try to investigate further in the future, but I wanted to get some eyes on this first.~~ [Resolved](rust-lang#110211 (comment)) r? `@m-ou-se` CC `@kprotty`
So, with a bit of trickery, we could soon switch over For |
These constructors have been const since Rust 1.63 (rust-lang/rust#93740). It's pretty easy for us to make them const too, which allows code that relies on them being const to correctly compile with Shuttle. The one exception is that HashMap::new isn't const, and our Condvar implementation uses a HashMap to track waiters. I took the easy way out and just used a vector as an association list instead -- we shouldn't expect large numbers of waiters on the same condvar, so this shouldn't be too inefficient.
@m-ou-se I'm working on macOS and I was quite surprised that my malfunction tests panicked on OOM in very surprising places, like locking a mutex. After some investigation I found this issue and your comment:
I looked into the implementation of the But this doesn't mean the mutex is unmovable. I failed to find any reason why it should not be movable, provided we make sure it's 8B aligned in Rust, since there is no self-referentiality. The address of the I.e., adding I didn't check the other sync primitives in detail, but it seems to be the same in green. So what's preventing us from simply declaring Thanks in advance. |
Rust does allow this: let m: Mutex = ...;
mem::forget(m.lock().unwrap()); // never unlocks
let move_locked_mutex = Box::new(m); |
I have an almost-finished patch sitting around that switches macOS to Edit: uploaded it as draft, see #122408 |
That's true. In case a second But this has nothing to do with the problem at hand on macOS - the same potential issue is present on all platforms.
Great, thanks! I hope it makes it in the next Rust release :-). If I understand it correctly, the implementation relies on presence of the respective symbols and if they are not found, then a fallback is used? Anyway, I'm now in progress of installing 14.4 :-). |
We need it to reliably deadlock (or abort the program) for soundness. |
That would be unsound, and a critical bug. Lucky enough we are ensuring that this is not the case. This is leaving aside the question whether macOS guarantees that the lock can be moved at all, even when it is not currently held. AFAIK POSIX does not provide such a guarantee. You made a claim based on the current macOS POSIX lock implementation, but it's not really relevant what the implementation does, it's only relevant what Apple guarantees will be true for the current and all future implementations. |
Thanks. Even after more than 2 years doing systems programming in Rust I'm still discovering where I made thinking mistakes at the beginning (too much C++ influence over the past 30 years :-).
No, I'm afraid, there is no such guarantee. As mentioned, I investigated the coding behind
...this is absolutely correct and we can't rely on it.
For Anyway, let's close the discussion here - I definitely understand the need to make the library sound. OTOH, random panics/aborts at unexpected locations (mutex lock) due to OOM situations are not optimal (especially since undocumented). Therefore, I'm looking forward to the change by @joboet switching macOS |
Of course we need soundness guarantees. But in general, there's a way to safely use platform implementations that don't guarantee a property you need, but that actually have it in practice:
In parallel, you also ask the platform vendor to make the guarantee — either for all platform versions, or in a way that is detectable for future platform versions. If they make it for all supported platform versions, then you can remove the fallback. This assumes that there is some way to implement the fallback. It is also subject to temporary performance regressions for new platform versions until they've been marked as okay. But that's what asking the platform vendor to guarantee the property is for. It can be a reasonable strategy if you think that the vendor is likely to guarantee the property but just aren't sure how long that will take. |
This is a closed issue, there's no point in attempting a discussion here. If you have a concrete issue you think should be fixed, or a concrete change you want to propose, please file a new issue. "Make macOS Mutex/RwLock/Condvar not need heap allocations" does seem like a goal worth tracking. |
Here we go: #131005 |
A few weeks ago, on January 12th, the @rust-lang/libs team discussed a plan to improve
std::sync::Mutex
,std::sync::Condvar
, andstd::sync::RwLock
.On most platforms, these structures are currently wrappers around their
pthread
equivalent, such aspthread_mutex_t
. These types are not movable, however, forcing us to wrap them in aBox
, resulting in an allocation and indirection for our lock types. This also gets in the way of aconst
constructor for these types, which makesstatic
locks more complicated than necessary.@Amanieu presented his library,
parking_lot
, which implements the 'parking lot' structure from WebKit. Rather than letting the operating system handle any waiting queues, this manages the queues in user space in a global hash table indexed by the address of the lock. This allows for flexible 1-byte mutexes with almost no restrictions.There has been an attempt at merging
parking_lot
into std, as a replacement for the implementation of std's locking primitives. However, many different blockers came up and that PR did not get merged. (See also my RustConf talk for some of this history.) Many of these blockers have been resolved since.One of the problems with replacing std's lock implementations by
parking_lot
is thatparking_lot
allocates memory for its global hash table. A Rust program can define its own custom allocator, and such a custom allocator will likely use the standard library's locks, creating a cyclic dependency problem where you can't allocate memory without locking, but you can't lock without first allocating the hash table.A possible solution is to use a
static
table with a fixed number of entries, which does not scale with the number of threads. This could work well in many situations, but can become problematic in highly parallel programs and systems. Another possible solution is to skip the global allocator and directly allocate this hash table with a native system API such asmmap
on Unix. (See this thread for some discussion.)In our meeting, we discussed what's natively available on various operating systems/platforms:
On Windows, we have already switched to Windows SRW locks, which do not require allocation as they can be moved. They are small (32-bit), efficient, and
const
constructible. Just for Windows, there does not seem much advantage to switching toparking_lot
, although there might be some (small) performance difference.On both Linux and some BSDs there's a native
futex()
syscall, which provides functionality similar (but more primitive) than a parking lot. Nearly equivalent functionality is available in Wasm. While not trivial, it's possible to implement our synchronization primitives directly based on such a futex. This makes all the primitives small (32-bit), efficient, andconst
constructible.On macOS and some other Unixes there doesn't seem to be anything that's similar futex syscall. (
parking_lot
usespthread
for its thread parking operations on those platforms.)Some smaller platforms that are (partially) supported by the standard library have their own non-trivial implementation of the standard locks, which are hard to maintain and have not gotten much validation.
After some discussion, the consensus was to providing the locks as 'thinnest possible wrapper' around the native lock APIs as long as they are still small, efficient, and
const
constructible. This means SRW locks on Windows, and futex-based locks on Linux, some BSDs, and Wasm. And for other platforms without suitable native APIs, aparking_lot
-based implementation using one of the possible workarounds for the allocation problem.This means that on platforms like Linux and Windows, the operating system will be responsible for managing the waiting queues of the locks, such that any kernel improvements and features like debugging facilities in this area are directly available for Rust programs.
It also means that porting the standard library to a new platform will be easier, and maintainance of the std implementation for the less common platforms will become easier. These platforms will now only have to implement a thread parker and can rely on a performant and correct implementation of the standard locks on top of that.
Update: We've decided to not directly use parking lot's one-byte (two bit) locks, but instead use the equivalent of its internal
WordLock
or Windows' SRW locks, which are one pointer in size and require no global state in the program. That solves the allocation problem.Update 2: To maintain priority inheritance of the native locks (e.g. on macOS), we've kept using pthread locks on non-futex unix platforms, at least for now. Using #97647, we've made it possible for the locks to have
const fn new
.Primary goals:
macOSstd::sync::{Mutex, Condvar, RwLock}::new
intoconst
functions: Make {Mutex, Condvar, RwLock}::new() const. #97791Possible later goals:
Condvar::wait_rwlock
to makeCondvar
usable withRwLock
s?Send
ingMutexGuard
s to other threads?To do:
Condvar
requirements to allow for unboxed mutexes. Relax promises about condition variable. #76932sys_common::Mutex
to have a separateMovableMutex
type to allow unboxing on some platforms. Split sys_common::Mutex in StaticMutex and MovableMutex. #77147Box
es inMutex
andCondvar
on Windows and Wasm. Unbox mutexes and condvars on some platforms #77380Box
fromRwLock
on Windows and Wasm. Multiple improvements to RwLocks #84687WaitOnAddress
andNtWaitForKeyedEvent
in the standard library.) Add fast WaitOnAddress-based thread parker for Windows. #77618parking_lot
if we want to use that on Windows too. But if we just use SRW locks, this was not necessary to unblock the lock improvements.futex()
syscall in Miri to allow Miri to run standard library tests. Implement futex_wait and futex_wake. miri#1568Mutex
andCondvar
usingfutex()
.ReentrantMutex
usingfutex()
: Replace ReentrantMutex by a futex-based one on Linux. #95727RwLock
usingfutex()
: Replace RwLock by a futex based one on Linux #95801ReentrantMutex
on all platforms: Use a single ReentrantMutex implementation on all platforms. #96042atomic.wait
/atomic.notify
based locks on Wasm. Use sys::unix::locks::futex* on wasm+atomics. #96206new
functionsconst
. Make {Mutex, Condvar, RwLock}::new() const. #97791Instant::now()
, to avoid a cyclic dependency.std::sync::{Mutex, Condvar}
in std::sys's thread parker, to avoid a cyclic dependency. std: directly use pthread in UNIX parker implementation #96393RwLock
with custom implementation #110211)Write some blog post about all this.(Tracking issue for improving std::sync::{Mutex, RwLock, Condvar} #93740 (comment))The text was updated successfully, but these errors were encountered: