runtime: improve scaling of lock2 #68578
Comments
Nice analysis! It looks sound to me. In theory there's a latency benefit to spinning, but if you have tons of threads spinning then that can create contention that's much worse than if a few of the threads just went to sleep. I think it's been a looong time since anyone had a crack at the runtime's lock implementation.

I also wonder if there's a sweet spot on spinning, wherein for very short critical sections it's beneficial to have maybe one or two threads spinning (so some small limit), and the unlock path just wakes another thread to spin, expecting that one of the current spinners will win. It certainly wouldn't be a fair lock, but maybe it would be better?
Thanks for taking a look, @mknyszek.

Yes, I think that limiting the number of spinning threads to one or two is part of the answer. If we can afford the bookkeeping (and we may need to!), I think the

An analogy on my mind for how to solve this is a baseball lineup: Most of the threads are "in the dugout", sleeping. One thread is "at bat", maybe doing many acquire/release pairs in a row. I don't know if that's a solution, or if it just pushes the problem back one level. But maybe that's close enough to be a solution in practice? Maybe the "dugout" is an MCS or CLH queue. (MCS seems good, except for requiring a spin during unlock.)

I'm working from The Art of Multiprocessor Programming. I've got the "revised first edition", which appears to be a 2012 update to a 2008 original. Are MCS and CLH still the best available algorithms for workloads like runtime.mutex has? @aclements, if you know (and you often do)? Or @chabbimilind, I see you've published in this area since 2012. Thanks!

About fairness: In lock_sema.go (GOOS=darwin, windows, and several others), new waiters push themselves onto a stack, so the thread at the bottom of the stack will stay there until all of the other threads have had their turn. So maybe unfairness isn't a problem? Except that when the critical section is short, threads wake up very often and so the stack probably gets cleared out on a regular basis.

A thread being able to capture the mutex for short periods of time (rather than always queueing behind other waiting threads) looks important for performance; we wouldn't be able to average ~15 ns per channel operation if the channel's mutex needs to be handed to a different thread each time. But I suspect we also don't want the full amount of unfairness we'd get if the stack in lock_sema.go worked as the code appears to intend.
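For readers who want the textbook shape of those queue locks, here is a minimal CLH-style sketch in ordinary Go. The names are illustrative and this is not a proposal for runtime.mutex, which would also need to avoid allocation and put waiters fully to sleep rather than spin:

```go
// Package clhsketch shows a textbook CLH queue lock. Each waiter spins on a
// flag owned by its predecessor, so waiters wait on distinct cache lines
// instead of all hammering the one shared mutex word.
package clhsketch

import "sync/atomic"

type clhNode struct {
	locked atomic.Bool // true while this node's owner holds or wants the lock
}

type clhLock struct {
	tail atomic.Pointer[clhNode]
}

func newCLHLock() *clhLock {
	l := &clhLock{}
	l.tail.Store(new(clhNode)) // sentinel node, already "unlocked"
	return l
}

// lock enqueues a fresh node behind the current tail and waits for the
// predecessor to release. It returns the node that must be passed to unlock.
func (l *clhLock) lock() *clhNode {
	node := new(clhNode)
	node.locked.Store(true)
	pred := l.tail.Swap(node)
	for pred.locked.Load() {
		// Busy-wait; a runtime version would procyield/osyield or sleep here.
	}
	return node
}

// unlock hands the lock to the successor (if any) by clearing our own flag.
func (l *clhLock) unlock(node *clhNode) {
	node.locked.Store(false)
}
```

The property that makes this family attractive as a "dugout" is that a release touches only the releasing thread's own node, so a handoff doesn't require every waiter to reload and CAS the same shared word.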
The basic MCS lock does not allow you to exit from the wait queue. You may be looking for one of these:

I noticed something in the current https://github.com/golang/go/blob/master/src/runtime/lock_futex.go code in
Change https://go.dev/cl/601396 mentions this issue:
Thanks for the references! It'll take me a little while to digest those.

I don't think that lock2 needs to allow timeouts. It looks like the more advanced mutex implementations also require more than one word of memory (though we can usually borrow some from the M while it's in a lock2 call). There's a mutex in each

Sometimes the limiting factor in an app is a channel's mutex (#57070), but often it's an explicit process-wide singleton like
It's been like that since at least the 1.0 release: https://github.com/golang/go/blob/go1/src/pkg/runtime/lock_futex.c#L66

It's also unusual that the access of
I was curious about the relative speed of procyield and osyield.

On amd64,

On arm64,

Maybe the speed of those changes based on how much bus activity there is? But I'm surprised to see more than a 100x difference between Intel and ARM processor performance there, with lock2 not accounting for that at all.

On my i7 Linux machine,

I'm not surprised by a 5–7x difference in Linux performance on different hardware. I'm not surprised by a 19x difference between how Linux and macOS implement a yield to the OS scheduler. What surprises me is that on my i7 Linux machine,

In BenchmarkChanContended, a single channel operation takes around 15 ns. So the lock can be acquired, used, and released over 100 times before an i7 performance core finishes a single

At the very least, it's not tuned well for the hardware I happen to have.

Details from benchstat below:
Change https://go.dev/cl/601597 mentions this issue:
I wrote https://go.dev/cl/601597 to implement the idea of "a spinning thread announces itself, which allows unlock2 to skip the wake, which allows lock2 to truly sleep".

That sketch (at PS 1) is a bit slower than the current three-state mutex when there's no contention, and it degrades quickly at first. But then it stops getting slower; "it scales". Its throughput holds steady at 1/4 of the zero-contention speed, even as the number of contending threads increases to 20 (the size of my test machine). In contrast, the performance of the baseline lock_futex.go degrades with each additional contending thread. At 8 threads, the two implementations are equal, and at 20 the performance of the baseline has slowed an additional 3x.

Another benefit that's not shown here: As the number of contending threads increases, nearly all of them are asleep. They're using less electricity and the OS can put them to use running other programs.

The microbenchmark of course has its limits: it would be easy for us to increase the argument to

Below are the benchstat results of CL 601597 PS 1 versus its parent on my test machine, in text and visual form. In the chart, I've converted the number of microseconds per benchmark iteration into the number of channel operations per second (each ChanContended op is 100 sends and 100 receives). The experiment holds steady around 13.5 million channel operations per second, while the baseline continues to slow down with each additional contending thread (to 4.5 million ops per second at 20 threads).
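To make the "announced spinner" idea concrete, here is a toy, user-space rendering of it. The bit layout, names, and stubbed sleep/wake calls are illustrative only and do not match CL 601597. A real version would also track whether any thread is asleep (so unlock doesn't issue a wake syscall every time), and futex-style sleeping would pass the expected state word to close the race between deciding to sleep and actually sleeping:

```go
// Package spinbitsketch is a toy rendering of "one announced spinner":
// unlock skips the expensive wake whenever a spinner has promised to
// re-check the lock word soon.
package spinbitsketch

import "sync/atomic"

const (
	lockedBit   uint32 = 1 << 0 // the mutex is held
	spinningBit uint32 = 1 << 1 // one waiter is spinning and will notice an unlock
)

// wake and sleep stand in for futexwakeup/futexsleep.
func wake(state *uint32)  {}
func sleep(state *uint32) {}

func lockSketch(state *uint32) {
	for {
		old := atomic.LoadUint32(state)
		if old&lockedBit == 0 {
			if atomic.CompareAndSwapUint32(state, old, old|lockedBit) {
				return
			}
			continue
		}
		if old&spinningBit == 0 {
			// Become the one announced spinner...
			if !atomic.CompareAndSwapUint32(state, old, old|spinningBit) {
				continue
			}
			// ...then spin until the holder drops the lock, and take it while
			// retiring from spinning duty in a single CAS.
			for {
				cur := atomic.LoadUint32(state)
				if cur&lockedBit != 0 {
					continue // a procyield-style pause would go here
				}
				if atomic.CompareAndSwapUint32(state, cur, (cur|lockedBit)&^spinningBit) {
					return
				}
			}
		}
		// A spinner is already on duty, so we can park. Once the spinner
		// acquires the lock and clears its bit, later unlocks wake sleepers.
		sleep(state)
	}
}

func unlockSketch(state *uint32) {
	for {
		old := atomic.LoadUint32(state)
		if atomic.CompareAndSwapUint32(state, old, old&^lockedBit) {
			if old&spinningBit == 0 {
				wake(state) // no spinner on duty: wake a sleeper to retry
			}
			return
		}
	}
}
```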
I read Rust's implementation, https://github.com/rust-lang/rust/blob/1.80.0/library/std/src/sys/sync/mutex/futex.rs . It looks similar in structure to our current lock_futex.go implementation. It does not do

On arm64, Rust's Mutex implements the busy loop pause as
Change https://go.dev/cl/602176 mentions this issue:
These are delay primitives for lock2. If a mutex isn't immediately available, we can use procyield to tell the processor to wait for a moment, or osyield to allow the OS to run a different process or thread if one is waiting. We expect a processor-level yield to be faster than an os-level yield, and for both of them to be fast relative to entering a full sleep (via futexsleep or semasleep).

Each architecture has its own way of hinting to the processor that it's in a spin-wait loop, so procyield presents an architecture-independent interface for use in lock_futex.go and lock_sema.go.

Measure the (single-threaded) speed of these to confirm.

For #68578

Change-Id: I90cd46ea553f2990395aceb048206285558c877e
Reviewed-on: https://go-review.googlesource.com/c/go/+/601396
LUCI-TryBot-Result: Go LUCI <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
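The osyield half of that comparison has a rough user-space analogue on Linux, where osyield boils down to the sched_yield system call; procyield has no portable user-space equivalent. This sketch is illustrative and is not the benchmark added by the CL:

```go
// Linux-only sketch: time an OS-level yield, the closest user-space analogue
// of the runtime's osyield.
package lock_test

import (
	"syscall"
	"testing"
)

func BenchmarkOSYieldAnalogue(b *testing.B) {
	for i := 0; i < b.N; i++ {
		syscall.Syscall(syscall.SYS_SCHED_YIELD, 0, 0, 0)
	}
}
```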
The speed of handing off a mutex to a waiting thread is sensitive to the configuration of the spinning section of lock2. Measure that latency directly, to complement our existing benchmarks of mutex throughput.

For #68578

Change-Id: I7637684bcff62eb05cc008491f095f653d13af4b
Reviewed-on: https://go-review.googlesource.com/c/go/+/602176
Reviewed-by: Dmitri Shuralyov <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
Auto-Submit: Rhys Hiltner <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
Change https://go.dev/cl/604375 mentions this issue:
Measure throughput of a single mutex with all threads contending. Do not attempt to measure fairness/starvation.

The ChanContended benchmark works somewhat well for this (interacting with the mutex is a large contributor to its results), but it's better to be clear about what we're attempting to measure.

For #68578

Change-Id: Ie397b4c363bfcd5afddf796a81cd6c34ebf8551b
Reviewed-on: https://go-review.googlesource.com/c/go/+/604375
Reviewed-by: David Chase <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
Auto-Submit: Rhys Hiltner <[email protected]>
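For the shape of such a throughput-only benchmark, here is a user-space analogue built on sync.Mutex. It only illustrates the structure: sync.Mutex parks goroutines rather than OS threads, so it does not reproduce the runtime.mutex behavior under discussion:

```go
package mutex_test

import (
	"sync"
	"testing"
)

// BenchmarkMutexThroughputAnalogue has every worker fight over one mutex with
// a very short critical section; only overall ops/sec matters, and fairness
// is deliberately not measured.
func BenchmarkMutexThroughputAnalogue(b *testing.B) {
	var mu sync.Mutex
	var counter int
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			mu.Lock()
			counter++ // tiny critical section
			mu.Unlock()
		}
	})
}
```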
Change https://go.dev/cl/606901 mentions this issue:
Change https://go.dev/cl/606900 mentions this issue:
For #68578

Change-Id: Idecfdbb793f46560dd69287af9170c07cf4ee973
Reviewed-on: https://go-review.googlesource.com/c/go/+/606900
Reviewed-by: Keith Randall <[email protected]>
Reviewed-by: Mauri de Souza Meneguzzo <[email protected]>
Auto-Submit: Rhys Hiltner <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
Reviewed-by: Keith Randall <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
For #68578

Change-Id: Ia9580579bfc4709945bfcf6ec3803d5d11812187
Reviewed-on: https://go-review.googlesource.com/c/go/+/606901
Reviewed-by: Mauri de Souza Meneguzzo <[email protected]>
Reviewed-by: Keith Randall <[email protected]>
Reviewed-by: Keith Randall <[email protected]>
Reviewed-by: Martin Möhrmann <[email protected]>
Auto-Submit: Rhys Hiltner <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
Change https://go.dev/cl/617618 mentions this issue:
Change https://go.dev/cl/620435 mentions this issue:
Add a design doc describing the general approach of the "spinbit" mutex protocol, and the details of the two drafts that implement it.

Based on futex, for linux/amd64: https://go.dev/cl/601597
For all GOOS values and four architectures: https://go.dev/cl/620435

For golang/go#68578

Change-Id: Ie9665085c9b8cf1741deeb431acfa12fba550b63
Reviewed-on: https://go-review.googlesource.com/c/proposal/+/617618
Auto-Submit: Rhys Hiltner <[email protected]>
Reviewed-by: Rhys Hiltner <[email protected]>
Commit-Queue: Rhys Hiltner <[email protected]>
Change https://go.dev/cl/622996 mentions this issue:
Change https://go.dev/cl/622997 mentions this issue:
Change https://go.dev/cl/622995 mentions this issue:
When multiple threads all need to acquire the same runtime.mutex, make sure that none of them has to wait for too long. Measure how long a single thread can capture the mutex, and how long individual other threads must go between having a turn with the mutex.

For #68578

Change-Id: I56ecc551232f9c2730c128a9f8eeb7bd45c2d3b5
Reviewed-on: https://go-review.googlesource.com/c/go/+/622995
Auto-Submit: Rhys Hiltner <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
Reviewed-by: Cherry Mui <[email protected]>
Implement sema{create,sleep,wakeup} in terms of the futex syscall when available. Split the lock2/unlock2 implementations out of lock_sema.go and lock_futex.go (which they shared with runtime.note) to allow swapping in new implementations of those. Let futex-based platforms use the semaphore-based mutex implementation. Control that via the new "spinbitmutex" GOEXPERIMENT value, disabled by default.

This lays the groundwork for a "spinbit" mutex implementation; it does not include the new mutex implementation.

For #68578.

Change-Id: I091289c85124212a87abec7079ecbd9e610b4270
Reviewed-on: https://go-review.googlesource.com/c/go/+/622996
Reviewed-by: Michael Knyszek <[email protected]>
Reviewed-by: Cherry Mui <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
Change https://go.dev/cl/628515 mentions this issue:
Change https://go.dev/cl/629416 mentions this issue:
Change https://go.dev/cl/629415 mentions this issue:
Change https://go.dev/cl/629417 mentions this issue:
Simplify some flow control, as suggested on https://go.dev/cl/620435.

The MutexCapture microbenchmark shows a bit of throughput improvement at moderate levels of contention, and little change to capture and starvation. (Note that the capture and starvation figures below are in terms of power-of-two buckets multiplied by throughput, so they either follow similar patterns or move by a factor of two.)

For #68578

goos: linux
goarch: amd64
pkg: runtime
cpu: 13th Gen Intel(R) Core(TM) i7-13700H
                 │     old     │                 new                  │
                 │   sec/op    │    sec/op      vs base               │
MutexCapture       18.21n ±  0%   18.35n ±  0%   +0.77% (p=0.000 n=10)
MutexCapture-2     21.46n ±  8%   21.05n ± 12%        ~ (p=0.796 n=10)
MutexCapture-3     22.56n ±  9%   22.59n ± 18%        ~ (p=0.631 n=10)
MutexCapture-4     22.85n ±  5%   22.74n ±  2%        ~ (p=0.565 n=10)
MutexCapture-5     22.84n ±  5%   22.50n ± 14%        ~ (p=0.912 n=10)
MutexCapture-6     23.33n ± 14%   22.22n ±  3%   -4.78% (p=0.004 n=10)
MutexCapture-7     27.04n ± 14%   23.78n ± 15%        ~ (p=0.089 n=10)
MutexCapture-8     25.44n ± 10%   23.03n ±  6%   -9.48% (p=0.004 n=10)
MutexCapture-9     25.56n ±  7%   24.39n ± 11%        ~ (p=0.218 n=10)
MutexCapture-10    26.77n ± 10%   24.00n ±  7%  -10.33% (p=0.023 n=10)
MutexCapture-11    27.02n ±  7%   24.55n ± 15%   -9.18% (p=0.035 n=10)
MutexCapture-12    26.71n ±  8%   24.96n ±  8%        ~ (p=0.148 n=10)
MutexCapture-13    25.58n ±  4%   25.82n ±  5%        ~ (p=0.271 n=10)
MutexCapture-14    26.86n ±  6%   25.91n ±  7%        ~ (p=0.529 n=10)
MutexCapture-15    25.12n ± 13%   26.16n ±  4%        ~ (p=0.353 n=10)
MutexCapture-16    26.18n ±  4%   26.21n ±  9%        ~ (p=0.838 n=10)
MutexCapture-17    26.04n ±  4%   25.85n ±  5%        ~ (p=0.363 n=10)
MutexCapture-18    26.02n ±  7%   25.93n ±  5%        ~ (p=0.853 n=10)
MutexCapture-19    25.67n ±  5%   26.21n ±  4%        ~ (p=0.631 n=10)
MutexCapture-20    25.50n ±  6%   25.99n ±  8%        ~ (p=0.404 n=10)
geomean            24.73n         24.02n         -2.88%

                 │      old       │                  new                  │
                 │ sec/streak-p90 │ sec/streak-p90  vs base               │
MutexCapture        76.36m ±   0%    76.96m ±   0%   +0.79% (p=0.000 n=10)
MutexCapture-2     10.609µ ±  50%    5.390µ ± 119%        ~ (p=0.579 n=10)
MutexCapture-3      5.936µ ±  93%    5.782µ ±  18%        ~ (p=0.684 n=10)
MutexCapture-4      5.849µ ±   5%    5.820µ ±   2%        ~ (p=0.579 n=10)
MutexCapture-5      5.849µ ±   5%    5.759µ ±  14%        ~ (p=0.912 n=10)
MutexCapture-6      5.975µ ±  14%    5.687µ ±   3%   -4.81% (p=0.004 n=10)
MutexCapture-7      6.921µ ±  14%    6.086µ ±  18%        ~ (p=0.165 n=10)
MutexCapture-8      6.512µ ±  10%    5.894µ ±   6%   -9.50% (p=0.004 n=10)
MutexCapture-9      6.544µ ±   7%    6.245µ ±  11%        ~ (p=0.218 n=10)
MutexCapture-10     6.962µ ±  11%    6.144µ ±   7%  -11.76% (p=0.023 n=10)
MutexCapture-11     6.938µ ±   7%    6.284µ ± 130%        ~ (p=0.190 n=10)
MutexCapture-12     6.838µ ±   8%    6.408µ ±  13%        ~ (p=0.404 n=10)
MutexCapture-13     6.549µ ±   4%    6.608µ ±   5%        ~ (p=0.271 n=10)
MutexCapture-14     6.877µ ±   8%    6.634µ ±   7%        ~ (p=0.436 n=10)
MutexCapture-15     6.433µ ±  13%    6.697µ ±   4%        ~ (p=0.247 n=10)
MutexCapture-16     6.702µ ±  10%    6.711µ ± 116%        ~ (p=0.796 n=10)
MutexCapture-17     6.730µ ±   3%    6.619µ ±   5%        ~ (p=0.225 n=10)
MutexCapture-18     6.663µ ±   7%    6.716µ ±  13%        ~ (p=0.853 n=10)
MutexCapture-19     6.570µ ±   5%    6.710µ ±   4%        ~ (p=0.529 n=10)
MutexCapture-20     6.528µ ±   6%    6.775µ ±  11%        ~ (p=0.247 n=10)
geomean             10.66µ           10.00µ          -6.13%

                 │      old       │                  new                  │
                 │ sec/starve-p90 │ sec/starve-p90  vs base               │
MutexCapture-2     10.609µ ±  50%    5.390µ ± 119%        ~ (p=0.579 n=10)
MutexCapture-3      184.8µ ±  91%    183.9µ ±  48%        ~ (p=0.436 n=10)
MutexCapture-4      388.8µ ± 270%    375.6µ ± 280%        ~ (p=0.436 n=10)
MutexCapture-5      807.2µ ±  83%   2880.9µ ±  85%        ~ (p=0.105 n=10)
MutexCapture-6      2.272m ±  61%    2.173m ±  34%        ~ (p=0.280 n=10)
MutexCapture-7      1.351m ± 125%    2.990m ±  70%        ~ (p=0.393 n=10)
MutexCapture-8      3.328m ±  97%    3.064m ±  96%        ~ (p=0.739 n=10)
MutexCapture-9      3.526m ±  91%    3.081m ±  47%  -12.62% (p=0.015 n=10)
MutexCapture-10     3.641m ±  86%    3.228m ±  90%  -11.34% (p=0.005 n=10)
MutexCapture-11     3.324m ± 109%    3.190m ±  71%        ~ (p=0.481 n=10)
MutexCapture-12     3.519m ±  77%    3.200m ± 106%        ~ (p=0.393 n=10)
MutexCapture-13     3.353m ±  91%    3.368m ±  99%        ~ (p=0.853 n=10)
MutexCapture-14     3.314m ± 101%    3.396m ± 286%        ~ (p=0.353 n=10)
MutexCapture-15     3.534m ±  83%    3.397m ±  91%        ~ (p=0.739 n=10)
MutexCapture-16     3.485m ±  90%    3.436m ± 116%        ~ (p=0.853 n=10)
MutexCapture-17     6.516m ±  48%    3.452m ±  88%        ~ (p=0.190 n=10)
MutexCapture-18     6.645m ± 105%    3.439m ± 108%        ~ (p=0.218 n=10)
MutexCapture-19     6.521m ±  46%    4.907m ±  42%        ~ (p=0.529 n=10)
MutexCapture-20     6.532m ±  47%    3.516m ±  89%        ~ (p=0.089 n=10)
geomean             1.919m           1.783m          -7.06%

Change-Id: I36106e1baf8afd132f1568748d1b83b797fa260e
Reviewed-on: https://go-review.googlesource.com/c/go/+/629415
Reviewed-by: Michael Knyszek <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
Auto-Submit: Rhys Hiltner <[email protected]>
Part of my experience with runtime.mutex values is that process-wide demand for a single mutex can slow the program to a crawl. I don't think that alone will come as a surprise, though the magnitude of slowdown is more than I'd expect. The main surprise is that programs can have a hard time recovering from that performance cliff once they've fallen off.
I've described a case of "larger magnitude than I expected" as #57070. I've described (perhaps indirectly) "performance cliff that defies self-recovery" as #61426, #61427, and #61428. The runtime's ChanContended benchmark shows clear degradation with increasing GOMAXPROCS, which I suspect is the same problem.
Here's what I think is happening:
For some users of runtime.mutex values, the critical section is very fast. When exiting the critical section, unlock2 checks to see if there's another thread that has gone to sleep while waiting for the mutex. If unlock2 finds one, it wakes a single lock2 sleeper. While waking the other thread, unlock2 removes that sleeper's note within the mutex. That looks different in lock_sema.go and lock_futex.go, either popping a single entry off of the linked list or a full state reset to 0.
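A simplified sketch of that unlock path, following the three-state protocol in lock_futex.go (futexwakeup is stubbed here; in the runtime it performs a futex wake on the mutex word):

```go
package lockfutexsketch

import "sync/atomic"

const (
	mutexUnlocked uint32 = 0
	mutexLocked   uint32 = 1
	mutexSleeping uint32 = 2 // locked, and waiters may be asleep in futexsleep
)

// futexwakeup stands in for the runtime's FUTEX_WAKE wrapper.
func futexwakeup(addr *uint32, cnt uint32) {}

func unlock2Sketch(key *uint32) {
	// Full state reset to 0, as described above.
	old := atomic.SwapUint32(key, mutexUnlocked)
	if old == mutexSleeping {
		// A waiter may be asleep: wake exactly one to retry the lock.
		futexwakeup(key, 1)
	}
}
```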
Now, the previous user of the mutex (recent caller of unlock2) is awake, and the lock2 caller that it woke is awake. If the previous user needs the same mutex again, it enters lock2 and we now have two non-sleeping threads attempting to acquire the mutex. Whichever one wins, it will soon call unlock2. If the caller of unlock2 finds that another thread is sleeping in lock2 waiting for this mutex, it will wake that thread.
Given enough time, a caller of lock2 will give up on the procyield/osyield section and go to sleep. But, how long does it take a thread to give up? And how long does it take a thread to complete another pass through the critical section, call unlock2, and wake an additional thread?
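For reference, the wait loop in lock_futex.go's lock2 is shaped roughly like the sketch below (simplified, with procyield, osyield, and futexsleep stubbed). The budget before sleeping is a handful of processor-level spin rounds plus a single OS-level yield:

```go
package lockfutexsketch2

import "sync/atomic"

const (
	mutexUnlocked uint32 = 0
	mutexLocked   uint32 = 1
	mutexSleeping uint32 = 2

	activeSpin    = 4  // rounds of processor-level spinning (multi-core only)
	activeSpinCnt = 30 // spin-wait hints (e.g. PAUSE) per round
	passiveSpin   = 1  // rounds of OS-level yielding before sleeping
)

func procyield(cycles int)                {} // CPU spin-wait hint
func osyield()                            {} // e.g. sched_yield
func futexsleep(addr *uint32, val uint32) {} // sleep while *addr == val

func lock2Sketch(key *uint32) {
	// Speculative grab.
	wait := atomic.SwapUint32(key, mutexLocked)
	if wait == mutexUnlocked {
		return
	}
	for {
		// Try for the lock, spinning at the processor level.
		for i := 0; i < activeSpin; i++ {
			for atomic.LoadUint32(key) == mutexUnlocked {
				if atomic.CompareAndSwapUint32(key, mutexUnlocked, wait) {
					return
				}
			}
			procyield(activeSpinCnt)
		}
		// Try for the lock, yielding to the OS scheduler.
		for i := 0; i < passiveSpin; i++ {
			for atomic.LoadUint32(key) == mutexUnlocked {
				if atomic.CompareAndSwapUint32(key, mutexUnlocked, wait) {
					return
				}
			}
			osyield()
		}
		// Give up: record that a sleeper exists, then sleep.
		if atomic.SwapUint32(key, mutexSleeping) == mutexUnlocked {
			return
		}
		wait = mutexSleeping
		futexsleep(key, mutexSleeping)
	}
}
```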
It looks like the time a thread in lock2 takes to go to sleep is over 100 times longer than the runtime's fastest critical sections. Preliminary measurements on an M1 MacBook Air with darwin/arm64 show ~6000 ns of spinning between sleeps and ~14 ns for an uncontended channel operation. That means, I think, that nearly all of the threads that need a runtime.mutex for a quick critical section will be awake the whole time. Being awake means repeatedly loading (about 5 times per wake cycle) and attempting CAS on the mutex value.
As more threads demand access to the mutex's cache line, updating it becomes more expensive. Acquiring the mutex involves updating that cache line. Going to sleep and waking sleepers also require updates to that cache line.
BenchmarkChanContended does 100 sends and 100 receives on a buffered channel with no channel-based waiting. That's 200 lock2/unlock2 pairs per "op". Each additional thread allowed by GOMAXPROCS reduces the whole-process throughput (with minor exceptions near the limit of physical cores, of around 1%).
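The benchmark's shape is roughly the sketch below, written from the description above rather than copied from the runtime's chan_test.go; the buffer is sized so that goroutines never block on channel capacity and the cost stays in the channel's internal runtime.mutex:

```go
package chan_test

import (
	"runtime"
	"testing"
)

// One "op" is 100 sends plus 100 receives on a single shared buffered channel.
func BenchmarkChanContendedSketch(b *testing.B) {
	c := make(chan int, 100*runtime.GOMAXPROCS(0))
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			for i := 0; i < 100; i++ {
				c <- 0
			}
			for i := 0; i < 100; i++ {
				<-c
			}
		}
	})
}
```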
Below is the behavior I see on an Intel i7-13700H (linux/amd64, 6 P cores with 2 threads each plus 8 E cores). When the process is allowed to use 4 threads, the whole-process throughput is 1/2 of what it was with 1 thread. With 8 threads, throughput is halved again. With 12, it's halved yet again. At GOMAXPROCS=20 the 200 channel operations take an average of 44 µs, so on average there's an unlock2 call every 220 ns, each with a fresh opportunity to wake a sleeping thread.
And here's the behavior I see on an M1 MacBook Air (darwin/arm64, 4 P cores plus 4 E cores). With 5 threads, throughput is less than half of what the program can do with one thread.
Another perspective is to consider how much on-CPU time the process has. The data below show that during 1.78 seconds of wall-clock time, the process's 20 threads were on-CPU for a total of 27.74 seconds within lock2 calls. Those threads were not sleeping!
In short, we have lock2 implementations that appear on the page to let threads sleep but which in practice make them all spin, and spinning threads at least correlate with (and I think also cause) slower lock handoffs.
I think we can do better. I have some ideas on how, but at this point I'd love feedback on whether others agree with my analysis and on the state of the art here today.
CC @golang/runtime