runtime: improve scaling of lock2 #68578
Comments
Nice analysis! It looks sound to me. In theory there's a latency benefit to spinning, but if you have tons of threads spinning then that can create contention that's much worse than if a few of the threads just went to sleep. I think it's been a looong time since anyone had a crack at the runtime's lock implementation.

I also wonder if there's a sweet spot on spinning, wherein for very short critical sections it's beneficial to have maybe one or two threads spinning (so some small limit), and the unlock path just wakes another thread to spin, expecting that one of the current spinners will win. It certainly wouldn't be a fair lock, but maybe it would be better?
Thanks for taking a look, @mknyszek.

Yes, I think that limiting the number of spinning threads to one or two is part of the answer. If we can afford the bookkeeping (and we may need to!), I think the

An analogy on my mind for how to solve this is a baseball lineup: Most of the threads are "in the dugout", sleeping. One thread is "at bat", maybe doing many acquire/release pairs in a row. I don't know if that's a solution, or if it just pushes the problem back one level. But maybe that's close enough to be a solution in practice? Maybe the "dugout" is an MCS or CLH queue. (MCS seems good, except for requiring a spin during unlock.)

I'm working from The Art of Multiprocessor Programming. I've got the "revised first edition", which appears to be a 2012 update to a 2008 original. Are MCS and CLH still the best available algorithms for workloads like runtime.mutex has? @aclements, if you know (and you often do)? Or @chabbimilind, I see you've published in this area since 2012. Thanks!

About fairness: In lock_sema.go (GOOS=darwin, windows, and several others), new waiters push themselves onto a stack, so the thread at the bottom of the stack will stay there until all of the other threads have had their turn. So maybe unfairness isn't a problem? Except that when the critical section is short, threads wake up very often and so the stack probably gets cleared out on a regular basis.

A thread being able to capture the mutex for short periods of time (rather than always queueing behind other waiting threads) looks important for performance; we wouldn't be able to average ~15 ns per channel operation if the channel's mutex needs to be handed to a different thread each time. But I suspect we also don't want the full amount of unfairness we'd get if the stack in lock_sema.go worked as the code appears to intend.
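For readers who want the textbook shape of those queue locks, here is a minimal CLH-style sketch in ordinary Go. The names are illustrative and this is not a proposal for runtime.mutex, which would also need to avoid allocation and put waiters fully to sleep rather than spin:

```go
// Package clhsketch shows a textbook CLH queue lock. Each waiter spins on a
// flag owned by its predecessor, so waiters wait on distinct cache lines
// instead of all hammering the one shared mutex word.
package clhsketch

import "sync/atomic"

type clhNode struct {
	locked atomic.Bool // true while this node's owner holds or wants the lock
}

type clhLock struct {
	tail atomic.Pointer[clhNode]
}

func newCLHLock() *clhLock {
	l := &clhLock{}
	l.tail.Store(new(clhNode)) // sentinel node, already "unlocked"
	return l
}

// lock enqueues a fresh node behind the current tail and waits for the
// predecessor to release. It returns the node that must be passed to unlock.
func (l *clhLock) lock() *clhNode {
	node := new(clhNode)
	node.locked.Store(true)
	pred := l.tail.Swap(node)
	for pred.locked.Load() {
		// Busy-wait; a runtime version would procyield/osyield or sleep here.
	}
	return node
}

// unlock hands the lock to the successor (if any) by clearing our own flag.
func (l *clhLock) unlock(node *clhNode) {
	node.locked.Store(false)
}
```

The property that makes this family attractive as a "dugout" is that a release touches only the releasing thread's own node, so a handoff doesn't require every waiter to reload and CAS the same shared word.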
The basic MCS lock does not allow you to exit from the wait queue. You may be looking for one of these:

I noticed something in the current https://github.com/golang/go/blob/master/src/runtime/lock_futex.go code in
Change https://go.dev/cl/601396 mentions this issue:
Thanks for the references! It'll take me a little while to digest those.

I don't think that lock2 needs to allow timeouts. It looks like the more advanced mutex implementations also require more than one word of memory (though we can usually borrow some from the M while it's in a lock2 call). There's a mutex in each

Sometimes the limiting factor in an app is a channel's mutex (#57070), but often it's an explicit process-wide singleton like
It's been like that since at least the 1.0 release: https://github.com/golang/go/blob/go1/src/pkg/runtime/lock_futex.c#L66

It's also unusual that the access of
I was curious about the relative speed of procyield and osyield.

On amd64,

On arm64,

Maybe the speed of those changes based on how much bus activity there is? But I'm surprised to see more than a 100x difference between Intel and ARM processor performance there, with lock2 not accounting for that at all.

On my i7 Linux machine,

I'm not surprised by a 5–7x difference in Linux performance on different hardware. I'm not surprised by a 19x difference between how Linux and macOS implement a yield to the OS scheduler. What surprises me is that on my i7 Linux machine,

In BenchmarkChanContended, a single channel operation takes around 15 ns. So the lock can be acquired, used, and released over 100 times before an i7 performance core finishes a single

At the very least, it's not tuned well for the hardware I happen to have.

Details from benchstat below:
Change https://go.dev/cl/601597 mentions this issue:
I wrote https://go.dev/cl/601597 to implement the idea of "a spinning thread announces itself, which allows unlock2 to skip the wake, which allows lock2 to truly sleep".

That sketch (at PS 1) is a bit slower than the current three-state mutex when there's no contention, and it degrades quickly at first. But then it stops getting slower; "it scales". Its throughput holds steady at 1/4 of the zero-contention speed, even as the number of contending threads increases to 20 (the size of my test machine). In contrast, the performance of the baseline lock_futex.go degrades with each additional contending thread. At 8 threads, the two implementations are equal, and at 20 the performance of the baseline has slowed an additional 3x.

Another benefit that's not shown here: As the number of contending threads increases, nearly all of them are asleep. They're using less electricity and the OS can put them to use running other programs.

The microbenchmark of course has its limits: it would be easy for us to increase the argument to

Below are the benchstat results of CL 601597 PS 1 versus its parent on my test machine, in text and visual form. In the chart, I've converted the number of microseconds per benchmark iteration into the number of channel operations per second (each ChanContended op is 100 sends and 100 receives). The experiment holds steady around 13.5 million channel operations per second, while the baseline continues to slow down with each additional contending thread (to 4.5 million ops per second at 20 threads).
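To make the "announced spinner" idea concrete, here is a toy, user-space rendering of it. The bit layout, names, and stubbed sleep/wake calls are illustrative only and do not match CL 601597. A real version would also track whether any thread is asleep (so unlock doesn't issue a wake syscall every time), and futex-style sleeping would pass the expected state word to close the race between deciding to sleep and actually sleeping:

```go
// Package spinbitsketch is a toy rendering of "one announced spinner":
// unlock skips the expensive wake whenever a spinner has promised to
// re-check the lock word soon.
package spinbitsketch

import "sync/atomic"

const (
	lockedBit   uint32 = 1 << 0 // the mutex is held
	spinningBit uint32 = 1 << 1 // one waiter is spinning and will notice an unlock
)

// wake and sleep stand in for futexwakeup/futexsleep.
func wake(state *uint32)  {}
func sleep(state *uint32) {}

func lockSketch(state *uint32) {
	for {
		old := atomic.LoadUint32(state)
		if old&lockedBit == 0 {
			if atomic.CompareAndSwapUint32(state, old, old|lockedBit) {
				return
			}
			continue
		}
		if old&spinningBit == 0 {
			// Become the one announced spinner...
			if !atomic.CompareAndSwapUint32(state, old, old|spinningBit) {
				continue
			}
			// ...then spin until the holder drops the lock, and take it while
			// retiring from spinning duty in a single CAS.
			for {
				cur := atomic.LoadUint32(state)
				if cur&lockedBit != 0 {
					continue // a procyield-style pause would go here
				}
				if atomic.CompareAndSwapUint32(state, cur, (cur|lockedBit)&^spinningBit) {
					return
				}
			}
		}
		// A spinner is already on duty, so we can park. Once the spinner
		// acquires the lock and clears its bit, later unlocks wake sleepers.
		sleep(state)
	}
}

func unlockSketch(state *uint32) {
	for {
		old := atomic.LoadUint32(state)
		if atomic.CompareAndSwapUint32(state, old, old&^lockedBit) {
			if old&spinningBit == 0 {
				wake(state) // no spinner on duty: wake a sleeper to retry
			}
			return
		}
	}
}
```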
I read Rust's implementation, https://github.com/rust-lang/rust/blob/1.80.0/library/std/src/sys/sync/mutex/futex.rs . It looks similar in structure to our current lock_futex.go implementation. It does not do

On arm64, Rust's Mutex implements the busy loop pause as
Change https://go.dev/cl/602176 mentions this issue:
These are delay primitives for lock2. If a mutex isn't immediately available, we can use procyield to tell the processor to wait for a moment, or osyield to allow the OS to run a different process or thread if one is waiting. We expect a processor-level yield to be faster than an os-level yield, and for both of them to be fast relative to entering a full sleep (via futexsleep or semasleep).

Each architecture has its own way of hinting to the processor that it's in a spin-wait loop, so procyield presents an architecture-independent interface for use in lock_futex.go and lock_sema.go.

Measure the (single-threaded) speed of these to confirm.

For #68578

Change-Id: I90cd46ea553f2990395aceb048206285558c877e
Reviewed-on: https://go-review.googlesource.com/c/go/+/601396
LUCI-TryBot-Result: Go LUCI <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
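The osyield half of that comparison has a rough user-space analogue on Linux, where osyield boils down to the sched_yield system call; procyield has no portable user-space equivalent. This sketch is illustrative and is not the benchmark added by the CL:

```go
// Linux-only sketch: time an OS-level yield, the closest user-space analogue
// of the runtime's osyield.
package lock_test

import (
	"syscall"
	"testing"
)

func BenchmarkOSYieldAnalogue(b *testing.B) {
	for i := 0; i < b.N; i++ {
		syscall.Syscall(syscall.SYS_SCHED_YIELD, 0, 0, 0)
	}
}
```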
The speed of handing off a mutex to a waiting thread is sensitive to the configuration of the spinning section of lock2. Measure that latency directly, to complement our existing benchmarks of mutex throughput.

For #68578

Change-Id: I7637684bcff62eb05cc008491f095f653d13af4b
Reviewed-on: https://go-review.googlesource.com/c/go/+/602176
Reviewed-by: Dmitri Shuralyov <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
Auto-Submit: Rhys Hiltner <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
Change https://go.dev/cl/604375 mentions this issue:
Measure throughput of a single mutex with all threads contending. Do not attempt to measure fairness/starvation.

The ChanContended benchmark works somewhat well for this (interacting with the mutex is a large contributor to its results), but it's better to be clear about what we're attempting to measure.

For #68578

Change-Id: Ie397b4c363bfcd5afddf796a81cd6c34ebf8551b
Reviewed-on: https://go-review.googlesource.com/c/go/+/604375
Reviewed-by: David Chase <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
Auto-Submit: Rhys Hiltner <[email protected]>
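For the shape of such a throughput-only benchmark, here is a user-space analogue built on sync.Mutex. It only illustrates the structure: sync.Mutex parks goroutines rather than OS threads, so it does not reproduce the runtime.mutex behavior under discussion:

```go
package mutex_test

import (
	"sync"
	"testing"
)

// BenchmarkMutexThroughputAnalogue has every worker fight over one mutex with
// a very short critical section; only overall ops/sec matters, and fairness
// is deliberately not measured.
func BenchmarkMutexThroughputAnalogue(b *testing.B) {
	var mu sync.Mutex
	var counter int
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			mu.Lock()
			counter++ // tiny critical section
			mu.Unlock()
		}
	})
}
```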
Change https://go.dev/cl/606901 mentions this issue:
Change https://go.dev/cl/606900 mentions this issue:
For #68578

Change-Id: Idecfdbb793f46560dd69287af9170c07cf4ee973
Reviewed-on: https://go-review.googlesource.com/c/go/+/606900
Reviewed-by: Keith Randall <[email protected]>
Reviewed-by: Mauri de Souza Meneguzzo <[email protected]>
Auto-Submit: Rhys Hiltner <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
Reviewed-by: Keith Randall <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
For #68578

Change-Id: Ia9580579bfc4709945bfcf6ec3803d5d11812187
Reviewed-on: https://go-review.googlesource.com/c/go/+/606901
Reviewed-by: Mauri de Souza Meneguzzo <[email protected]>
Reviewed-by: Keith Randall <[email protected]>
Reviewed-by: Keith Randall <[email protected]>
Reviewed-by: Martin Möhrmann <[email protected]>
Auto-Submit: Rhys Hiltner <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
Change https://go.dev/cl/617618 mentions this issue:
Change https://go.dev/cl/620435 mentions this issue:
Add a design doc describing the general approach of the "spinbit" mutex protocol, and the details of the two drafts that implement it.

Based on futex, for linux/amd64: https://go.dev/cl/601597
For all GOOS values and four architectures: https://go.dev/cl/620435

For golang/go#68578

Change-Id: Ie9665085c9b8cf1741deeb431acfa12fba550b63
Reviewed-on: https://go-review.googlesource.com/c/proposal/+/617618
Auto-Submit: Rhys Hiltner <[email protected]>
Reviewed-by: Rhys Hiltner <[email protected]>
Commit-Queue: Rhys Hiltner <[email protected]>
Change https://go.dev/cl/622996 mentions this issue:
Change https://go.dev/cl/622997 mentions this issue:
Change https://go.dev/cl/622995 mentions this issue:
When multiple threads all need to acquire the same runtime.mutex, make sure that none of them has to wait for too long. Measure how long a single thread can capture the mutex, and how long individual other threads must go between having a turn with the mutex.

For #68578

Change-Id: I56ecc551232f9c2730c128a9f8eeb7bd45c2d3b5
Reviewed-on: https://go-review.googlesource.com/c/go/+/622995
Auto-Submit: Rhys Hiltner <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
Reviewed-by: Cherry Mui <[email protected]>
Implement sema{create,sleep,wakeup} in terms of the futex syscall when available. Split the lock2/unlock2 implementations out of lock_sema.go and lock_futex.go (which they shared with runtime.note) to allow swapping in new implementations of those. Let futex-based platforms use the semaphore-based mutex implementation. Control that via the new "spinbitmutex" GOEXPERIMENT value, disabled by default.

This lays the groundwork for a "spinbit" mutex implementation; it does not include the new mutex implementation.

For #68578.

Change-Id: I091289c85124212a87abec7079ecbd9e610b4270
Reviewed-on: https://go-review.googlesource.com/c/go/+/622996
Reviewed-by: Michael Knyszek <[email protected]>
Reviewed-by: Cherry Mui <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
Change https://go.dev/cl/628515 mentions this issue:
Change https://go.dev/cl/629416 mentions this issue:
Change https://go.dev/cl/629415 mentions this issue:
Change https://go.dev/cl/629417 mentions this issue:
Simplify some flow control, as suggested on https://go.dev/cl/620435.

The MutexCapture microbenchmark shows a bit of throughput improvement at moderate levels of contention, and little change to capture and starvation. (Note that the capture and starvation figures below are in terms of power-of-two buckets multiplied by throughput, so they either follow similar patterns or move by a factor of two.)

For #68578

goos: linux
goarch: amd64
pkg: runtime
cpu: 13th Gen Intel(R) Core(TM) i7-13700H
                 │     old     │                 new                  │
                 │   sec/op    │    sec/op      vs base               │
MutexCapture       18.21n ±  0%   18.35n ±  0%   +0.77% (p=0.000 n=10)
MutexCapture-2     21.46n ±  8%   21.05n ± 12%        ~ (p=0.796 n=10)
MutexCapture-3     22.56n ±  9%   22.59n ± 18%        ~ (p=0.631 n=10)
MutexCapture-4     22.85n ±  5%   22.74n ±  2%        ~ (p=0.565 n=10)
MutexCapture-5     22.84n ±  5%   22.50n ± 14%        ~ (p=0.912 n=10)
MutexCapture-6     23.33n ± 14%   22.22n ±  3%   -4.78% (p=0.004 n=10)
MutexCapture-7     27.04n ± 14%   23.78n ± 15%        ~ (p=0.089 n=10)
MutexCapture-8     25.44n ± 10%   23.03n ±  6%   -9.48% (p=0.004 n=10)
MutexCapture-9     25.56n ±  7%   24.39n ± 11%        ~ (p=0.218 n=10)
MutexCapture-10    26.77n ± 10%   24.00n ±  7%  -10.33% (p=0.023 n=10)
MutexCapture-11    27.02n ±  7%   24.55n ± 15%   -9.18% (p=0.035 n=10)
MutexCapture-12    26.71n ±  8%   24.96n ±  8%        ~ (p=0.148 n=10)
MutexCapture-13    25.58n ±  4%   25.82n ±  5%        ~ (p=0.271 n=10)
MutexCapture-14    26.86n ±  6%   25.91n ±  7%        ~ (p=0.529 n=10)
MutexCapture-15    25.12n ± 13%   26.16n ±  4%        ~ (p=0.353 n=10)
MutexCapture-16    26.18n ±  4%   26.21n ±  9%        ~ (p=0.838 n=10)
MutexCapture-17    26.04n ±  4%   25.85n ±  5%        ~ (p=0.363 n=10)
MutexCapture-18    26.02n ±  7%   25.93n ±  5%        ~ (p=0.853 n=10)
MutexCapture-19    25.67n ±  5%   26.21n ±  4%        ~ (p=0.631 n=10)
MutexCapture-20    25.50n ±  6%   25.99n ±  8%        ~ (p=0.404 n=10)
geomean            24.73n         24.02n         -2.88%

                 │      old       │                  new                  │
                 │ sec/streak-p90 │ sec/streak-p90  vs base               │
MutexCapture        76.36m ±   0%    76.96m ±   0%   +0.79% (p=0.000 n=10)
MutexCapture-2     10.609µ ±  50%    5.390µ ± 119%        ~ (p=0.579 n=10)
MutexCapture-3      5.936µ ±  93%    5.782µ ±  18%        ~ (p=0.684 n=10)
MutexCapture-4      5.849µ ±   5%    5.820µ ±   2%        ~ (p=0.579 n=10)
MutexCapture-5      5.849µ ±   5%    5.759µ ±  14%        ~ (p=0.912 n=10)
MutexCapture-6      5.975µ ±  14%    5.687µ ±   3%   -4.81% (p=0.004 n=10)
MutexCapture-7      6.921µ ±  14%    6.086µ ±  18%        ~ (p=0.165 n=10)
MutexCapture-8      6.512µ ±  10%    5.894µ ±   6%   -9.50% (p=0.004 n=10)
MutexCapture-9      6.544µ ±   7%    6.245µ ±  11%        ~ (p=0.218 n=10)
MutexCapture-10     6.962µ ±  11%    6.144µ ±   7%  -11.76% (p=0.023 n=10)
MutexCapture-11     6.938µ ±   7%    6.284µ ± 130%        ~ (p=0.190 n=10)
MutexCapture-12     6.838µ ±   8%    6.408µ ±  13%        ~ (p=0.404 n=10)
MutexCapture-13     6.549µ ±   4%    6.608µ ±   5%        ~ (p=0.271 n=10)
MutexCapture-14     6.877µ ±   8%    6.634µ ±   7%        ~ (p=0.436 n=10)
MutexCapture-15     6.433µ ±  13%    6.697µ ±   4%        ~ (p=0.247 n=10)
MutexCapture-16     6.702µ ±  10%    6.711µ ± 116%        ~ (p=0.796 n=10)
MutexCapture-17     6.730µ ±   3%    6.619µ ±   5%        ~ (p=0.225 n=10)
MutexCapture-18     6.663µ ±   7%    6.716µ ±  13%        ~ (p=0.853 n=10)
MutexCapture-19     6.570µ ±   5%    6.710µ ±   4%        ~ (p=0.529 n=10)
MutexCapture-20     6.528µ ±   6%    6.775µ ±  11%        ~ (p=0.247 n=10)
geomean             10.66µ           10.00µ          -6.13%

                 │      old       │                  new                  │
                 │ sec/starve-p90 │ sec/starve-p90  vs base               │
MutexCapture-2     10.609µ ±  50%    5.390µ ± 119%        ~ (p=0.579 n=10)
MutexCapture-3      184.8µ ±  91%    183.9µ ±  48%        ~ (p=0.436 n=10)
MutexCapture-4      388.8µ ± 270%    375.6µ ± 280%        ~ (p=0.436 n=10)
MutexCapture-5      807.2µ ±  83%   2880.9µ ±  85%        ~ (p=0.105 n=10)
MutexCapture-6      2.272m ±  61%    2.173m ±  34%        ~ (p=0.280 n=10)
MutexCapture-7      1.351m ± 125%    2.990m ±  70%        ~ (p=0.393 n=10)
MutexCapture-8      3.328m ±  97%    3.064m ±  96%        ~ (p=0.739 n=10)
MutexCapture-9      3.526m ±  91%    3.081m ±  47%  -12.62% (p=0.015 n=10)
MutexCapture-10     3.641m ±  86%    3.228m ±  90%  -11.34% (p=0.005 n=10)
MutexCapture-11     3.324m ± 109%    3.190m ±  71%        ~ (p=0.481 n=10)
MutexCapture-12     3.519m ±  77%    3.200m ± 106%        ~ (p=0.393 n=10)
MutexCapture-13     3.353m ±  91%    3.368m ±  99%        ~ (p=0.853 n=10)
MutexCapture-14     3.314m ± 101%    3.396m ± 286%        ~ (p=0.353 n=10)
MutexCapture-15     3.534m ±  83%    3.397m ±  91%        ~ (p=0.739 n=10)
MutexCapture-16     3.485m ±  90%    3.436m ± 116%        ~ (p=0.853 n=10)
MutexCapture-17     6.516m ±  48%    3.452m ±  88%        ~ (p=0.190 n=10)
MutexCapture-18     6.645m ± 105%    3.439m ± 108%        ~ (p=0.218 n=10)
MutexCapture-19     6.521m ±  46%    4.907m ±  42%        ~ (p=0.529 n=10)
MutexCapture-20     6.532m ±  47%    3.516m ±  89%        ~ (p=0.089 n=10)
geomean             1.919m           1.783m          -7.06%

Change-Id: I36106e1baf8afd132f1568748d1b83b797fa260e
Reviewed-on: https://go-review.googlesource.com/c/go/+/629415
Reviewed-by: Michael Knyszek <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
Auto-Submit: Rhys Hiltner <[email protected]>
Part of my experience with runtime.mutex values is that process-wide demand for a single mutex can slow the program to a crawl. I don't think that alone will come as a surprise, though the magnitude of slowdown is more than I'd expect. The main surprise is that programs can have a hard time recovering from that performance cliff once they've fallen off.
I've described a case of "larger magnitude than I expected" as #57070. I've described (perhaps indirectly) "performance cliff that defies self-recovery" as #61426, #61427, and #61428. The runtime's ChanContended benchmark shows clear degradation with increasing GOMAXPROCS, which I suspect is the same problem.
Here's what I think is happening:
For some users of runtime.mutex values, the critical section is very fast. When exiting the critical section, unlock2 checks to see if there's another thread that has gone to sleep while waiting for the mutex. If unlock2 finds one, it wakes a single lock2 sleeper. While waking the other thread, unlock2 removes that sleeper's note within the mutex. That looks different in lock_sema.go and lock_futex.go, either popping a single entry off of the linked list or a full state reset to 0.
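A simplified sketch of that unlock path, following the three-state protocol in lock_futex.go (futexwakeup is stubbed here; in the runtime it performs a futex wake on the mutex word):

```go
package lockfutexsketch

import "sync/atomic"

const (
	mutexUnlocked uint32 = 0
	mutexLocked   uint32 = 1
	mutexSleeping uint32 = 2 // locked, and waiters may be asleep in futexsleep
)

// futexwakeup stands in for the runtime's FUTEX_WAKE wrapper.
func futexwakeup(addr *uint32, cnt uint32) {}

func unlock2Sketch(key *uint32) {
	// Full state reset to 0, as described above.
	old := atomic.SwapUint32(key, mutexUnlocked)
	if old == mutexSleeping {
		// A waiter may be asleep: wake exactly one to retry the lock.
		futexwakeup(key, 1)
	}
}
```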
Now, the previous user of the mutex (recent caller of unlock2) is awake, and the lock2 caller that it woke is awake. If the previous user needs the same mutex again, it enters lock2 and we now have two non-sleeping threads attempting to acquire the mutex. Whichever one wins, it will soon call unlock2. If the caller of unlock2 finds that another thread is sleeping in lock2 waiting for this mutex, it will wake that thread.
Given enough time, a caller of lock2 will give up on the procyield/osyield section and go to sleep. But, how long does it take a thread to give up? And how long does it take a thread to complete another pass through the critical section, call unlock2, and wake an additional thread?
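For reference, the wait loop in lock_futex.go's lock2 is shaped roughly like the sketch below (simplified, with procyield, osyield, and futexsleep stubbed). The budget before sleeping is a handful of processor-level spin rounds plus a single OS-level yield:

```go
package lockfutexsketch2

import "sync/atomic"

const (
	mutexUnlocked uint32 = 0
	mutexLocked   uint32 = 1
	mutexSleeping uint32 = 2

	activeSpin    = 4  // rounds of processor-level spinning (multi-core only)
	activeSpinCnt = 30 // spin-wait hints (e.g. PAUSE) per round
	passiveSpin   = 1  // rounds of OS-level yielding before sleeping
)

func procyield(cycles int)                {} // CPU spin-wait hint
func osyield()                            {} // e.g. sched_yield
func futexsleep(addr *uint32, val uint32) {} // sleep while *addr == val

func lock2Sketch(key *uint32) {
	// Speculative grab.
	wait := atomic.SwapUint32(key, mutexLocked)
	if wait == mutexUnlocked {
		return
	}
	for {
		// Try for the lock, spinning at the processor level.
		for i := 0; i < activeSpin; i++ {
			for atomic.LoadUint32(key) == mutexUnlocked {
				if atomic.CompareAndSwapUint32(key, mutexUnlocked, wait) {
					return
				}
			}
			procyield(activeSpinCnt)
		}
		// Try for the lock, yielding to the OS scheduler.
		for i := 0; i < passiveSpin; i++ {
			for atomic.LoadUint32(key) == mutexUnlocked {
				if atomic.CompareAndSwapUint32(key, mutexUnlocked, wait) {
					return
				}
			}
			osyield()
		}
		// Give up: record that a sleeper exists, then sleep.
		if atomic.SwapUint32(key, mutexSleeping) == mutexUnlocked {
			return
		}
		wait = mutexSleeping
		futexsleep(key, mutexSleeping)
	}
}
```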
It looks like the time a thread in lock2 takes to go to sleep is over 100 times longer than the runtime's fastest critical sections. Preliminary measurements on an M1 MacBook Air with darwin/arm64 show ~6000 ns of spinning between sleeps and ~14 ns for an uncontended channel operation. That means, I think, that nearly all of the threads that need a runtime.mutex for a quick critical section will be awake the whole time. Being awake means repeatedly loading (about 5 times per wake cycle) and attempting CAS on the mutex value.
As more threads demand access to the mutex's cache line, updating it becomes more expensive. Acquiring the mutex involves updating that cache line. Going to sleep and waking sleepers also require updates to that cache line.
BenchmarkChanContended does 100 sends and 100 receives on a buffered channel with no channel-based waiting. That's 200 lock2/unlock2 pairs per "op". Each additional thread allowed by GOMAXPROCS reduces the whole-process throughput (with minor exceptions near the limit of physical cores, of around 1%).
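The benchmark's shape is roughly the sketch below, written from the description above rather than copied from the runtime's chan_test.go; the buffer is sized so that goroutines never block on channel capacity and the cost stays in the channel's internal runtime.mutex:

```go
package chan_test

import (
	"runtime"
	"testing"
)

// One "op" is 100 sends plus 100 receives on a single shared buffered channel.
func BenchmarkChanContendedSketch(b *testing.B) {
	c := make(chan int, 100*runtime.GOMAXPROCS(0))
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			for i := 0; i < 100; i++ {
				c <- 0
			}
			for i := 0; i < 100; i++ {
				<-c
			}
		}
	})
}
```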
Below is the behavior I see on an Intel i7-13700H (linux/amd64, 6 P cores with 2 threads each plus 8 E cores). When the process is allowed to use 4 threads, the whole-process throughput is 1/2 of what it was with 1 thread. With 8 threads, throughput is halved again. With 12, it's halved yet again. At GOMAXPROCS=20 the 200 channel operations take an average of 44 µs, so on average there's an unlock2 call every 220 ns, each with a fresh opportunity to wake a sleeping thread.
And here's the behavior I see on an M1 MacBook Air (darwin/arm64, 4 P cores plus 4 E cores). With 5 threads, throughput is less than half of what the program can do with one thread.
Another perspective is to consider how much on-CPU time the process has. The data below show that during 1.78 seconds of wall-clock time, the process's 20 threads were on-CPU for a total of 27.74 seconds within lock2 calls. Those threads were not sleeping!
In short, we have lock2 implementations that appear on the page to let threads sleep but which in practice make them all spin, and spinning threads at least correlate with (and I think also cause) slower lock handoffs.
I think we can do better. I have some ideas on how, but at this point I'd love feedback on whether others agree with my analysis and on the state of the art here today.
CC @golang/runtime