runtime: optimization to reduce P churn #32113
What about just enforcing a minimum delay between when a G is created and when it can be stolen? That gives the local P time to finish the spawning G (finish = either done or block) and pick up the new G itself. The delay would be on the order of the overhead to move a G between processors (syscalls, cache warmup, etc.). The tricky part is to not even wake the remote P when the goroutine is queued. We want a timer somehow that can be cancelled if the G is started locally.
Yes, most of the waste is generated by the wakeup call itself. Ensuring that the other P does not steal the G is probably a minor improvement, but you're still going to waste a ton of cycles (maybe even doing these wakeups twice -- on (1) and (4)). I think using a timer gets much trickier. This is the reason I have limited the proposal to compiler-identified sequences of "chansend(block=true); chanrecv(block=true)" calls. It's possible that the system thread could be preempted between those calls, but if the system is busy (though Ps in this process may still be idle) it's probably even more valuable to not waste useless cycles.
(Totally open to a timer, but I'm concerned about replacing a P wakeup with a kick to sysmon in order to enforce the timer, which solves the locality issue but still burns cycles.)
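To make the pattern concrete, here is a minimal sketch (illustrative code, not from the issue) of the shape being discussed: a client sends a request and immediately blocks on the response, so at the runtime level each iteration is exactly the "chansend(block=true); chanrecv(block=true)" pair mentioned above.

```go
package main

// Illustrative only: the send on req is immediately followed by a blocking
// receive on resp, so the sending goroutine is about to block anyway.
// Waking a remote P to run the server goroutine mostly burns cycles that
// the local P could have used to run it directly.
func main() {
	req := make(chan int)
	resp := make(chan int)
	go func() { // "server" goroutine
		for r := range req {
			resp <- r * 2
		}
	}()
	for i := 0; i < 1000; i++ {
		req <- i // chansend(block=true): may wake a remote P...
		<-resp   // chanrecv(block=true): ...while this G blocks right away
	}
}
```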
Also see #8903, which was about a similar problem. I don't remember all the details exactly now, but as far as I remember my proposal was somewhat more generic, but yours wins in simplicity and is most likely safer from potential negative effects in corner cases.
This has come up repeatedly. Obviously it is easy to recognize and fuse a send immediately followed by a receive. It's harder to see that in more complex code that would benefit from the optimization, though. We've fiddled with heuristics in the runtime to try to wait a little bit before stealing a G from a P, and so on. Probably more tuning is needed. It's unclear this needs to be a proposal, unless you are proposing a language change, and it sounds like you've backed away from that. The way forward with a suggestion like this is to try implementing it and see how much of an improvement (and how general of an improvement) it yields.
/cc @randall77 @aclements |
I backed away from a language change proposal based on the assumption that it would likely not be accepted. My personal preference would be to have an operation like <~ that immediately switches to the other goroutine if currently waiting. (And behaves like a normal channel operation if busy.) But I realize that the existence of this operator might be confusing. I think it's unclear how much of an impact this would have in general. This is probably just a tiny optimization that doesn't matter in the general case, but can help in a few very specific ones. For us, it might let us structure some goroutine interactions much more efficiently. I hacked something together, and it seems like there's a decent effect on microbenchmarks at least (unless I screwed something up). Code:
Before:
After:
The system time is telling at 20x the baseline, and the extra 14% in CPU usage is indicative of an additional P waking up with nothing to do. (Or maybe it occasionally successfully steals the goroutine, which is also bad.) Assuming this small optimization is readily acceptable -- what's the best way to group those operations and transform the channel calls? The runtime bits are straightforward, but any up-front guidance on the compiler side is appreciated. Otherwise, I'm just planning to call a specialized scan in walkstmt list, but maybe there's a better way.
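For reference, a ping-pong microbenchmark of the general kind under discussion (a generic sketch, not the code linked above) looks like this:

```go
package churn

import "testing"

// BenchmarkPingPong bounces a token between two goroutines over unbuffered
// channels, exercising the send-then-immediately-block pattern. Under the
// stock scheduler, each send may wake a remote P even though the sender
// blocks on the very next line.
func BenchmarkPingPong(b *testing.B) {
	ping := make(chan struct{})
	pong := make(chan struct{})
	go func() {
		for range ping {
			pong <- struct{}{}
		}
	}()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		ping <- struct{}{} // wakes the peer...
		<-pong             // ...then blocks immediately
	}
	close(ping)
}
```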
Given that there is no language change here anymore, I'm going to move this to being a regular issue.
I've started looking into this. I've got a very naive implementation (probably very similar to Adin's) to use with his microbenchmark. Combined with Fixed time (
Fixed iterations (
I've included both since the different fixed dimensions change the interpretation. E.g., the first case has higher cycles after because it is simply able to do a lot more work. And it still does nearly double the iterations in 30% less CPU time (== far less time stalled)! This certainly looks worthwhile from the micro-benchmark perspective. The questions remaining to me are whether we can efficiently and reliably detect these scenarios, and whether they affect many programs.
For future reference, here's @amscanne's prototype: amscanne@eee812b. This is a bit more advanced than mine, as I haven't made any compiler changes yet.
Change https://golang.org/cl/254817 mentions this issue: |
@prattmic Michael, I have one question: does amscanne/go@eee812b need to modify any APIs, and does the user's program need to be aware of it? How does the compiler make the decision?
Neither @amscanne's prototype nor mine changes any language syntax or APIs. Rather, the compiler detects a channel send immediately followed by a channel receive and, rather than emitting the typical runtime calls for each operation separately, emits a fused operation that hands off to the woken goroutine without waking another P. Both prototypes are rudimentary and would probably hurt performance for many programs due to poor decisions, and would need more refinement.
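In sketch form, the shape such a compiler pass recognizes and the lowering involved look like this (illustrative; `runtime.chansendrecv` is a made-up name, and the real detection lives in the linked prototypes):

```go
package sketch

// A blocking send immediately followed by a blocking receive normally
// lowers to two separate runtime calls:
//
//	req <- v     // runtime.chansend1(req, &v)
//	out = <-resp // runtime.chanrecv1(resp, &out)
//
// The prototypes instead emit one fused call -- hypothetically
// runtime.chansendrecv(req, &v, resp, &out) -- which readies the receiver
// without wakep(), so the current P runs it directly once this goroutine
// blocks on the receive.
func roundTrip(req, resp chan int, v int) int {
	req <- v
	return <-resp
}
```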
Some synchronization patterns require the ability to simultaneously wake and sleep a goroutine. For the sleep package, this is the case when a waker must be asserted when a subsequent fetch is imminent. Currently, this operation results in significant P churn in the runtime, which ping-pongs execution between multiple system threads and cores and consumes a significant amount of host CPU (and because of the context switches, this can be significantly worse with mitigations for side channel vulnerabilities). The solution is to introduce a dedicated mechanism for a synchronous switch which does not wake another runtime P (see golang/go#32113). This can be used by the `AssertAndFetch` API in the sleep package. The benchmark results for this package are very similar to raw channel operations for all cases, with the exception of operations that do not wait. The primary advantage is more precise control over scheduling. This will be used in a subsequent change. ``` BenchmarkGoAssertNonWaiting BenchmarkGoAssertNonWaiting-8 261364384 4.976 ns/op BenchmarkGoSingleSelect BenchmarkGoSingleSelect-8 20946358 57.77 ns/op BenchmarkGoMultiSelect BenchmarkGoMultiSelect-8 6071697 197.0 ns/op BenchmarkGoWaitOnSingleSelect BenchmarkGoWaitOnSingleSelect-8 4978051 235.4 ns/op BenchmarkGoWaitOnMultiSelect BenchmarkGoWaitOnMultiSelect-8 2309224 520.2 ns/op BenchmarkSleeperAssertNonWaiting BenchmarkSleeperAssertNonWaiting-8 447325033 2.657 ns/op BenchmarkSleeperSingleSelect BenchmarkSleeperSingleSelect-8 21488844 55.19 ns/op BenchmarkSleeperMultiSelect BenchmarkSleeperMultiSelect-8 21851674 54.89 ns/op BenchmarkSleeperWaitOnSingleSelect BenchmarkSleeperWaitOnSingleSelect-8 2860327 416.4 ns/op BenchmarkSleeperWaitOnSingleSelectSync BenchmarkSleeperWaitOnSingleSelectSync-8 2741733 427.1 ns/op BenchmarkSleeperWaitOnMultiSelect BenchmarkSleeperWaitOnMultiSelect-8 2867484 418.1 ns/op BenchmarkSleeperWaitOnMultiSelectSync BenchmarkSleeperWaitOnMultiSelectSync-8 2789158 427.9 ns/op ``` PiperOrigin-RevId: 406873844
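For orientation, a rough sketch of the caller-side pattern the commit describes; the Sleeper/Waker shapes follow gVisor's pkg/sleep, but treat the exact signatures here as assumptions rather than the package's documented API:

```go
package sketch

import "gvisor.dev/gvisor/pkg/sleep"

// serve sketches the wake-and-sleep pairing (signatures assumed, not
// authoritative): a goroutine asserts a peer's Waker and immediately goes
// back to waiting for its own work. Fusing the two steps lets the runtime
// switch straight to the woken goroutine instead of waking another P.
func serve(s *sleep.Sleeper, peer *sleep.Waker) {
	for {
		// Roughly equivalent to peer.Assert() followed by s.Fetch(true),
		// minus the remote P wakeup in between.
		s.AssertAndFetch(peer)
		// ... handle the work signaled by the fetched waker ...
	}
}
```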
Background: the scheduler very easily gets into the "P churn" problem stated by Adin in golang#32113. This problem is more serious in gVisor because the futex() syscall, used to wake and idle an M, is a much heavier operation from GR0 into HR0. Adin proposed adding context semantics to the scheduler to decide whether we need to wake a new M; call it a local strategy. Here we propose another way to solve this problem; call it a global strategy. When we need to decide whether to start a new M, in addition to the existing condition of an extra P, we calculate the number of runnable Gs and the number of running Ps. When the number of runnable Gs <= the number of running Ps * factor, do not start another M, as those already-running Ms will steal Gs from this P. We tried a factor of 1.5, but then switched to 1. The mechanism applies when we ready a G; we also add it to handoffp(). Previously, handoffp() would wake an M if either of two conditions held:

- the local runq is not empty;
- the global runq is not empty.

We constrain the second condition by comparing the number of running Gs with the number of running Ps. (Note that "running P" here has a different meaning from the one used above for wakep().) Two concerns were raised about this method:

- Does it add too much contention when we do the G/P counting? As a side note, we usually set GOMAXPROCS to 4 or 8. As our results show, CPU utilization is much lower, and we don't see much contention in the flame graph.
- Does it worsen latency? Yes, it incurs a small regression in latency, but the CPU utilization seems a big enough advantage.

Signed-off-by: Shi Liu <[email protected]> Signed-off-by: Jielong Zhou <[email protected]> Signed-off-by: Yong He <[email protected]> Signed-off-by: Jianfeng Tan <[email protected]>
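Reduced to a sketch, the proposed gate is a comparison like the following (a toy restatement, not the actual patch; the real runtime must read scheduler state carefully when counting):

```go
package sched

// shouldWakeM sketches the proposed global gate: wake a new M only when the
// runnable Gs exceed what the already-running Ps can absorb by stealing.
// A factor of 1 reflects the final choice described above (1.5 was tried
// first).
func shouldWakeM(runnableGs, runningPs, factor int) bool {
	return runnableGs > runningPs*factor
}
```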
The most recently goready()'d G on each P is given a special position in the P's runqueue, p.runnext. Other Ps steal p.runnext only as a last resort, and usleep(3) before doing so: findRunnable() => stealWork() => runqsteal() => runqgrab(). As documented in runqgrab(), this is to reduce thrashing of Gs between Ps in cases where one goroutine wakes another and then "almost immediately" blocks.

On Linux, usleep() is implemented by invoking the nanosleep system call. Syscall timeouts in the Linux kernel are subject to timer slack, as documented by the man page for syscall prctl, section "PR_SET_TIMERSLACK". Experimentally, short timeouts can expect to expire 50 microseconds late regardless of other system activity. Thus, on Linux, usleep(3) typically sleeps for at least 53 microseconds, more than 17x longer than intended.

A P must be in the spinning state in order to attempt work-stealing. While at least one P is spinning, wakep() will refuse to wake a new spinning P. One P sleeping in runqgrab() thus prevents further threads from being woken in response to e.g. goroutine wakeups *globally* (throughout the process). Futex wake-to-wakeup latency is approximately 20 microseconds, so sleeping for 53 microseconds can significantly increase goroutine wakeup latency by delaying thread wakeup.

Fix this by timestamping Gs when they are runqput() into p.runnext, and causing runqgrab() to indicate to findRunnable() that it should loop if p.runnext is not yet stealable.

Alternative fixes considered:

- osyield() on Linux as we do on a few other platforms. On Linux, osyield() is implemented by the sched_yield system call, which IIUC causes the calling thread to yield its timeslice to any thread on its runqueue that it would not preempt on wakeup, potentially introducing even larger latencies on busy systems. See also https://www.realworldtech.com/forum/?threadid=189711&curpostid=189752 for a case against sched_yield on semantic grounds.
- Replace the usleep() with a spin loop in-place. This tends to waste the spinning P's time, since it can't check other runqueues, and the number of calls to runqgrab() - and therefore sleeps - is linear in the number of Ps. Empirically, it introduces regressions not observed in this change.
- Change thread timer slack using prctl(PR_SET_TIMERSLACK). In practice, user programs will have been tuned based on the default timer slack value, so tampering with this may introduce regressions into existing programs.

Unfortunately, this is a load-bearing bug. In programs with goroutines that frequently wake up goroutines and then immediately block, this bug significantly reduces overhead from useless thread wakeups in wakep(). In golang.org/x/benchmarks, this manifests most clearly as regressions in benchmark dustin_broadcast. To avoid this regression, we need to intentionally throttle wakep() => acquirem(). Thus, this change also introduces a "need-wakep()" prediction mechanism, which causes goready() and newproc() to call wakep() only if the calling goroutine is predicted not to immediately block. To handle mispredictions, sysmon is changed to wakep() if it detects underutilization. The current prediction algorithm is simple, but appears to be effective; it can be improved in the future as warranted.
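A toy model of the mechanism (types, field names, and the grace value here are illustrative, not the CL's):

```go
package sched

import "time"

// Toy model of the fix: a runnext slot guarded by a timestamp. A stealer
// that finds the slot too fresh gets retry=true and loops in its
// find-runnable logic instead of sleeping in usleep(3), which Linux timer
// slack stretches to ~50µs.
type G struct{ readyAt time.Time }

type P struct{ runnext *G }

const grace = 5 * time.Microsecond // illustrative; the real value differs

// putRunnext records when gp became runnext.
func (p *P) putRunnext(gp *G) {
	gp.readyAt = time.Now()
	p.runnext = gp
}

// stealRunnext returns the G only if it is old enough to steal; otherwise
// it tells the caller to retry rather than sleep.
func (p *P) stealRunnext() (gp *G, retry bool) {
	gp = p.runnext
	if gp == nil {
		return nil, false
	}
	if time.Since(gp.readyAt) < grace {
		return nil, true
	}
	p.runnext = nil
	return gp, false
}
```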
Results from golang.org/x/benchmarks (baseline is go1.20.1; experiment is go1.20.1 plus this change; baseline column first):

```
baseline:   ./bent-bench/20230303T173250.baseline.stdout (sweet: ./sweet/results/<bench>/baseline.results)
experiment: ./bent-bench/20230303T173250.experiment.stdout (sweet: ./sweet/results/<bench>/experiment.results)

shortname: ajstarks_deck_generate
goos: linux
goarch: amd64
pkg: github.com/ajstarks/deck/generate
cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
sec/op:
  Arc-12      3.857µ ± 5%  3.753µ ± 5%  ~ (p=0.424 n=10)
  Polygon-12  7.074µ ± 6%  6.969µ ± 4%  ~ (p=0.190 n=10)
  geomean     5.224µ       5.114µ       -2.10%

shortname: aws_jsonutil
pkg: github.com/aws/aws-sdk-go/private/protocol/json/jsonutil
sec/op:
  BuildJSON-12   5.602µ ± 3%  5.600µ ± 2%  ~ (p=0.896 n=10)
  StdlibJSON-12  3.843µ ± 2%  3.828µ ± 2%  ~ (p=0.224 n=10)
  geomean        4.640µ       4.630µ       -0.22%

shortname: benhoyt_goawk_1_18
pkg: github.com/benhoyt/goawk/interp
sec/op:
  RecursiveFunc-12        17.79µ ± 3%  17.65µ ± 3%  ~ (p=0.436 n=10)
  RegexMatch-12           815.8n ± 4%  823.3n ± 1%  ~ (p=0.353 n=10)
  RepeatExecProgram-12    21.30µ ± 6%  21.69µ ± 3%  ~ (p=0.052 n=10)
  RepeatNew-12            79.21n ± 4%  79.73n ± 3%  ~ (p=0.529 n=10)
  RepeatIOExecProgram-12  41.83µ ± 1%  42.07µ ± 2%  ~ (p=0.796 n=10)
  RepeatIONew-12          1.195µ ± 3%  1.196µ ± 2%  ~ (p=1.000 n=10)
  geomean                 3.271µ       3.288µ       +0.54%

shortname: bindata
pkg: github.com/kevinburke/go-bindata
sec/op:
  Bindata-12  316.2m ± 5%  309.7m ± 4%  ~ (p=0.436 n=10)
B/s:
  Bindata-12  20.71Mi ± 5%  21.14Mi ± 4%  ~ (p=0.436 n=10)
B/op:
  Bindata-12  183.0Mi ± 0%  183.0Mi ± 0%  ~ (p=0.353 n=10)
allocs/op:
  Bindata-12  5.790k ± 0%  5.789k ± 0%  ~ (p=0.358 n=10)

shortname: bloom_bloom
pkg: github.com/bits-and-blooms/bloom/v3
sec/op:
  SeparateTestAndAdd-12  414.6n ± 4%  413.9n ± 2%  ~ (p=0.895 n=10)
  CombinedTestAndAdd-12  425.8n ± 9%  419.8n ± 8%  ~ (p=0.353 n=10)
  geomean                420.2n       416.9n       -0.78%

shortname: capnproto2
pkg: zombiezen.com/go/capnproto2
sec/op:
  TextMovementBetweenSegments-12  320.5µ ± 5%  318.4µ ± 10%  ~ (p=0.579 n=10)
  Growth_MultiSegment-12          13.63m ± 1%  13.87m ± 2%   +1.71% (p=0.029 n=10)
  geomean                         2.090m       2.101m        +0.52%
B/s:
  Growth_MultiSegment-12  73.35Mi ± 1%  72.12Mi ± 2%  -1.68% (p=0.027 n=10)
B/op:
  Growth_MultiSegment-12  1.572Mi ± 0%  1.572Mi ± 0%  ~ (p=0.320 n=10)
allocs/op:
  Growth_MultiSegment-12  21.00 ± 0%  21.00 ± 0%  ~ (p=1.000 n=10) ¹
  ¹ all samples are equal

shortname: cespare_mph
pkg: github.com/cespare/mph
sec/op:
  Build-12  32.72m ± 2%  32.49m ± 1%  ~ (p=0.280 n=10)

shortname: commonmark_markdown
pkg: gitlab.com/golang-commonmark/markdown
sec/op:
  RenderSpecNoHTML-12        10.09m ± 2%  10.18m ± 3%  ~ (p=0.796 n=10)
  RenderSpec-12              10.19m ± 1%  10.11m ± 3%  ~ (p=0.684 n=10)
  RenderSpecBlackFriday2-12  6.793m ± 5%  6.946m ± 2%  ~ (p=0.063 n=10)
  geomean                    8.872m       8.944m       +0.81%

shortname: dustin_broadcast
pkg: github.com/dustin/go-broadcast
sec/op:
  DirectSend-12          570.5n ± 7%  355.2n ± 2%  -37.74% (p=0.000 n=10)
  ParallelDirectSend-12  549.0n ± 5%  360.9n ± 3%  -34.25% (p=0.000 n=10)
  ParallelBrodcast-12    788.7n ± 2%  486.0n ± 4%  -38.37% (p=0.000 n=10)
  MuxBrodcast-12         788.6n ± 4%  471.5n ± 6%  -40.21% (p=0.000 n=10)
  geomean                664.4n       414.0n       -37.68%

shortname: dustin_humanize
pkg: github.com/dustin/go-humanize
sec/op:
  ParseBigBytes-12  1.964µ ± 5%  1.941µ ± 3%  ~ (p=0.289 n=10)

shortname: ericlagergren_decimal
pkg: github.com/ericlagergren/decimal/benchmarks
sec/op:
  Pi/foo=ericlagergren_(Go)/prec=100-12   147.5µ ± 2%  147.5µ ± 1%  ~ (p=0.912 n=10)
  Pi/foo=ericlagergren_(GDA)/prec=100-12  329.6µ ± 1%  332.1µ ± 2%  ~ (p=0.063 n=10)
  Pi/foo=shopspring/prec=100-12           680.5µ ± 4%  688.6µ ± 2%  ~ (p=0.481 n=10)
  Pi/foo=apmckinlay/prec=100-12           2.541µ ± 4%  2.525µ ± 3%  ~ (p=0.218 n=10)
  Pi/foo=go-inf/prec=100-12               169.5µ ± 3%  170.7µ ± 3%  ~ (p=0.218 n=10)
  Pi/foo=float64/prec=100-12              4.136µ ± 3%  4.162µ ± 6%  ~ (p=0.436 n=10)
  geomean                                 62.38µ       62.66µ       +0.45%

shortname: ethereum_bitutil
pkg: github.com/ethereum/go-ethereum/common/bitutil
sec/op:
  FastTest2KB-12            130.4n ± 1%  131.5n ± 1%  ~ (p=0.093 n=10)
  BaseTest2KB-12            624.8n ± 2%  983.0n ± 2%  +57.32% (p=0.000 n=10)
  Encoding4KBVerySparse-12  21.48µ ± 3%  22.20µ ± 3%  +3.37% (p=0.005 n=10)
  geomean                   1.205µ       1.421µ       +17.94%
B/op:
  Encoding4KBVerySparse-12  9.750Ki ± 0%  9.750Ki ± 0%  ~ (p=1.000 n=10) ¹
  ¹ all samples are equal
allocs/op:
  Encoding4KBVerySparse-12  15.00 ± 0%  15.00 ± 0%  ~ (p=1.000 n=10) ¹
  ¹ all samples are equal

shortname: ethereum_core
pkg: github.com/ethereum/go-ethereum/core
sec/op:
  PendingDemotion10000-12       96.72n ± 4%  98.55n ± 2%  ~ (p=0.055 n=10)
  FuturePromotion10000-12       2.128n ± 3%  2.093n ± 3%  ~ (p=0.896 n=10)
  PoolBatchInsert10000-12       642.6m ± 2%  642.1m ± 5%  ~ (p=0.796 n=10)
  PoolBatchLocalInsert10000-12  805.2m ± 2%  826.6m ± 4%  ~ (p=0.105 n=10)
  geomean                       101.6µ       102.3µ       +0.69%

shortname: ethereum_corevm
pkg: github.com/ethereum/go-ethereum/core/vm
sec/op:
  OpDiv128-12  137.4n ± 3%  139.5n ± 1%  +1.56% (p=0.024 n=10)

shortname: ethereum_ecies
pkg: github.com/ethereum/go-ethereum/crypto/ecies
sec/op:
  GenerateKeyP256-12   15.67µ ± 6%  15.66µ ± 3%  ~ (p=0.971 n=10)
  GenSharedKeyP256-12  51.09µ ± 6%  52.09µ ± 4%  ~ (p=0.631 n=10)
  GenSharedKeyS256-12  47.24µ ± 2%  46.67µ ± 3%  ~ (p=0.247 n=10)
  geomean              33.57µ       33.64µ       +0.21%

shortname: ethereum_ethash
pkg: github.com/ethereum/go-ethereum/consensus/ethash
sec/op:
  HashimotoLight-12  1.116m ± 5%  1.112m ± 2%  ~ (p=0.684 n=10)

shortname: ethereum_trie
pkg: github.com/ethereum/go-ethereum/trie
sec/op:
  HashFixedSize/10K-12             9.236m ± 1%  9.106m ± 1%  -1.40% (p=0.019 n=10)
  CommitAfterHashFixedSize/10K-12  19.60m ± 1%  19.51m ± 1%  ~ (p=0.796 n=10)
  geomean                          13.45m       13.33m       -0.93%
B/op:
  HashFixedSize/10K-12             6.036Mi ± 0%  6.037Mi ± 0%  ~ (p=0.247 n=10)
  CommitAfterHashFixedSize/10K-12  8.626Mi ± 0%  8.626Mi ± 0%  ~ (p=0.280 n=10)
  geomean                          7.216Mi       7.216Mi       +0.01%
allocs/op:
  HashFixedSize/10K-12             77.17k ± 0%  77.17k ± 0%  ~ (p=0.050 n=10)
  CommitAfterHashFixedSize/10K-12  79.99k ± 0%  79.99k ± 0%  ~ (p=0.391 n=10)
  geomean                          78.56k       78.57k       +0.00%

shortname: gonum_blas_native
pkg: gonum.org/v1/gonum/blas/gonum
sec/op:
  Dnrm2MediumPosInc-12      1.953µ ± 2%  1.940µ ± 5%  ~ (p=0.989 n=10)
  DasumMediumUnitaryInc-12  932.5n ± 1%  931.2n ± 1%  ~ (p=0.753 n=10)
  geomean                   1.349µ       1.344µ       -0.40%

shortname: gonum_community
pkg: gonum.org/v1/gonum/graph/community
sec/op:
  LouvainDirectedMultiplex-12  26.40m ± 1%  26.64m ± 1%  ~ (p=0.165 n=10)

shortname: gonum_lapack_native
pkg: gonum.org/v1/gonum/lapack/gonum
sec/op:
  Dgeev/Circulant10-12   41.97µ ± 6%  42.90µ ± 4%  ~ (p=0.143 n=10)
  Dgeev/Circulant100-12  12.13m ± 4%  12.30m ± 3%  ~ (p=0.796 n=10)
  geomean                713.4µ       726.4µ       +1.81%

shortname: gonum_mat
pkg: gonum.org/v1/gonum/mat
sec/op:
  MulWorkspaceDense1000Hundredth-12  89.78m ± 0%   81.48m ± 1%   -9.24% (p=0.000 n=10)
  ScaleVec10000Inc20-12              7.204µ ± 36%  8.450µ ± 35%  ~ (p=0.853 n=10)
  geomean                            804.2µ        829.7µ        +3.18%

shortname: gonum_topo
pkg: gonum.org/v1/gonum/graph/topo
sec/op:
  TarjanSCCGnp_10_tenth-12   7.251µ ± 1%  7.187µ ± 1%  -0.88% (p=0.025 n=10)
  TarjanSCCGnp_1000_half-12  74.48m ± 2%  74.37m ± 4%  ~ (p=0.796 n=10)
  geomean                    734.8µ       731.1µ       -0.51%

shortname: gonum_traverse
pkg: gonum.org/v1/gonum/graph/traverse
sec/op:
  WalkAllBreadthFirstGnp_10_tenth-12    3.517µ ± 1%  3.534µ ± 1%  ~ (p=0.343 n=10)
  WalkAllBreadthFirstGnp_1000_tenth-12  11.12m ± 6%  11.19m ± 2%  ~ (p=0.631 n=10)
  geomean                               197.8µ       198.9µ       +0.54%

shortname: gtank_blake2s
pkg: github.com/gtank/blake2s
sec/op:
  Hash8K-12  18.96µ ± 4%  18.82µ ± 5%  ~ (p=0.579 n=10)
B/s:
  Hash8K-12  412.2Mi ± 4%  415.2Mi ± 5%  ~ (p=0.579 n=10)

shortname: hugo_hugolib
pkg: github.com/gohugoio/hugo/hugolib
sec/op:
  MergeByLanguage-12           529.9n ± 1%  531.5n ± 2%  ~ (p=0.305 n=10)
  ResourceChainPostProcess-12  62.76m ± 3%  56.23m ± 2%  -10.39% (p=0.000 n=10)
  ReplaceShortcodeTokens-12    2.727µ ± 3%  2.701µ ± 7%  ~ (p=0.592 n=10)
  geomean                      44.92µ       43.22µ       -3.80%

shortname: k8s_cache
pkg: k8s.io/client-go/tools/cache
sec/op:
  Listener-12                 1.312µ ± 1%  1.199µ ± 1%  -8.62% (p=0.000 n=10)
  ReflectorResyncChanMany-12  785.7n ± 4%  796.3n ± 3%  ~ (p=0.089 n=10)
  geomean                     1.015µ       976.9n       -3.76%
B/op:
  Listener-12  16.00 ± 0%  16.00 ± 0%  ~ (p=1.000 n=10) ¹
  ¹ all samples are equal
allocs/op:
  Listener-12  1.000 ± 0%  1.000 ± 0%  ~ (p=1.000 n=10) ¹
  ¹ all samples are equal

shortname: k8s_workqueue
pkg: k8s.io/client-go/util/workqueue
sec/op:
  ParallelizeUntil/pieces:1000,workers:10,chunkSize:1-12    244.6µ ± 1%  245.9µ ± 0%  +0.55% (p=0.023 n=10)
  ParallelizeUntil/pieces:1000,workers:10,chunkSize:10-12   75.09µ ± 1%  63.54µ ± 1%  -15.37% (p=0.000 n=10)
  ParallelizeUntil/pieces:1000,workers:10,chunkSize:100-12  49.47µ ± 2%  42.45µ ± 2%  -14.19% (p=0.000 n=10)
  ParallelizeUntil/pieces:999,workers:10,chunkSize:13-12    68.51µ ± 1%  55.07µ ± 1%  -19.63% (p=0.000 n=10)
  geomean                                                   88.82µ       77.74µ       -12.47%

shortname: kanzi
pkg: github.com/flanglet/kanzi-go/benchmark
sec/op:
  BWTS-12  0.4479n ± 6%  0.4385n ± 7%  ~ (p=0.529 n=10)
  FPAQ-12  17.03m ± 3%   17.42m ± 3%   ~ (p=0.123 n=10)
  LZ-12    1.897m ± 2%   1.887m ± 4%   ~ (p=1.000 n=10)
  MTFT-12  771.2µ ± 4%   785.8µ ± 3%   ~ (p=0.247 n=10)
  geomean  57.79µ        58.01µ        +0.38%

shortname: minio
pkg: github.com/minio/minio/cmd
sec/op:
  DecodehealingTracker-12          852.8n ± 5%   866.8n ± 5%   ~ (p=0.190 n=10)
  AppendMsgReplicateDecision-12    0.5383n ± 4%  0.7598n ± 3%  +41.13% (p=0.000 n=10)
  AppendMsgResyncTargetsInfo-12    4.785n ± 2%   4.639n ± 3%   -3.06% (p=0.003 n=10)
  DataUpdateTracker-12             3.122µ ± 2%   1.880µ ± 3%   -39.77% (p=0.000 n=10)
  MarshalMsgdataUsageCacheInfo-12  110.9n ± 2%   109.4n ± 3%   ~ (p=0.101 n=10)
  geomean                          59.74n        57.50n        -3.75%
B/s:
  DecodehealingTracker-12        347.8Mi ± 5%  342.2Mi ± 6%  ~ (p=0.190 n=10)
  AppendMsgReplicateDecision-12  1.730Gi ± 3%  1.226Gi ± 3%  -29.14% (p=0.000 n=10)
  AppendMsgResyncTargetsInfo-12  1.946Gi ± 2%  2.008Gi ± 3%  +3.15% (p=0.003 n=10)
  DataUpdateTracker-12           312.5Ki ± 3%  517.6Ki ± 2%  +65.62% (p=0.000 n=10)
  geomean                        139.1Mi       145.4Mi       +4.47%
B/op:
  DecodehealingTracker-12          0.000 ± 0%  0.000 ± 0%  ~ (p=1.000 n=10) ¹
  AppendMsgReplicateDecision-12    0.000 ± 0%  0.000 ± 0%  ~ (p=1.000 n=10) ¹
  AppendMsgResyncTargetsInfo-12    0.000 ± 0%  0.000 ± 0%  ~ (p=1.000 n=10) ¹
  DataUpdateTracker-12             340.0 ± 0%  339.0 ± 1%  ~ (p=0.737 n=10)
  MarshalMsgdataUsageCacheInfo-12  96.00 ± 0%  96.00 ± 0%  ~ (p=1.000 n=10) ¹
  geomean                          ²           -0.06% ²
  ¹ all samples are equal
  ² summaries must be >0 to compute geomean
allocs/op:
  DecodehealingTracker-12          0.000 ± 0%  0.000 ± 0%  ~ (p=1.000 n=10) ¹
  AppendMsgReplicateDecision-12    0.000 ± 0%  0.000 ± 0%  ~ (p=1.000 n=10) ¹
  AppendMsgResyncTargetsInfo-12    0.000 ± 0%  0.000 ± 0%  ~ (p=1.000 n=10) ¹
  DataUpdateTracker-12             9.000 ± 0%  9.000 ± 0%  ~ (p=1.000 n=10) ¹
  MarshalMsgdataUsageCacheInfo-12  1.000 ± 0%  1.000 ± 0%  ~ (p=1.000 n=10) ¹
  geomean                          ²           +0.00% ²
  ¹ all samples are equal
  ² summaries must be >0 to compute geomean

shortname: semver
pkg: github.com/Masterminds/semver
sec/op:
  ValidateVersionTildeFail-12  854.7n ± 2%  842.7n ± 2%  ~ (p=0.123 n=10)

shortname: shopify_sarama
pkg: github.com/Shopify/sarama
sec/op:
  Broker_Open-12             212.2µ ± 1%  205.9µ ± 2%  -2.95% (p=0.000 n=10)
  Broker_No_Metrics_Open-12  132.9µ ± 1%  121.3µ ± 2%  -8.68% (p=0.000 n=10)
  geomean                    167.9µ       158.1µ       -5.86%

shortname: spexs2
pkg: github.com/egonelbre/spexs2/_benchmark
sec/op:
  Run/10k/1-12   23.29 ± 1%  23.11 ± 2%  ~ (p=0.315 n=10)
  Run/10k/16-12  5.648 ± 2%  5.462 ± 4%  -3.30% (p=0.004 n=10)
  geomean        11.47       11.23       -2.06%

shortname: sweet-biogo-igor
sec/op:
  BiogoIgor  13.53 ± 1%  13.62 ± 1%  ~ (p=0.165 n=10)
average-RSS-bytes:
  BiogoIgor  62.19Mi ± 3%  62.86Mi ± 1%  ~ (p=0.247 n=10)
peak-RSS-bytes:
  BiogoIgor  89.57Mi ± 4%  89.03Mi ± 3%  ~ (p=0.516 n=10)
peak-VM-bytes:
  BiogoIgor  766.4Mi ± 0%  766.4Mi ± 0%  ~ (p=0.954 n=10)

shortname: sweet-biogo-krishna
sec/op:
  BiogoKrishna  12.70 ± 2%  12.09 ± 3%  -4.86% (p=0.000 n=10)
average-RSS-bytes:
  BiogoKrishna  4.085Gi ± 0%  4.083Gi ± 0%  ~ (p=0.105 n=10)
peak-RSS-bytes:
  BiogoKrishna  4.174Gi ± 0%  4.173Gi ± 0%  ~ (p=0.853 n=10)
peak-VM-bytes:
  BiogoKrishna  4.877Gi ± 0%  4.877Gi ± 0%  ~ (p=0.591 n=10)

shortname: sweet-bleve-index
sec/op:
  BleveIndexBatch100  4.675 ± 1%  4.669 ± 1%  ~ (p=0.739 n=10)
average-RSS-bytes:
  BleveIndexBatch100  185.5Mi ± 1%  185.9Mi ± 1%  ~ (p=0.796 n=10)
peak-RSS-bytes:
  BleveIndexBatch100  267.5Mi ± 6%  265.0Mi ± 2%  ~ (p=0.739 n=10)
peak-VM-bytes:
  BleveIndexBatch100  1.945Gi ± 4%  1.945Gi ± 0%  ~ (p=0.725 n=10)

shortname: sweet-go-build
sec/op:
  GoBuildKubelet       51.32 ± 0%  51.38 ± 3%  ~ (p=0.105 n=10)
  GoBuildKubeletLink   7.669 ± 1%  7.663 ± 2%  ~ (p=0.579 n=10)
  GoBuildIstioctl      46.02 ± 0%  46.07 ± 0%  ~ (p=0.739 n=10)
  GoBuildIstioctlLink  8.174 ± 1%  8.143 ± 2%  ~ (p=0.436 n=10)
  GoBuildFrontend      16.17 ± 1%  16.10 ± 1%  ~ (p=0.143 n=10)
  GoBuildFrontendLink  1.399 ± 3%  1.377 ± 3%  ~ (p=0.218 n=10)
  geomean              12.23       12.18       -0.39%

shortname: sweet-gopher-lua
sec/op:
  GopherLuaKNucleotide  22.71 ± 1%  22.86 ± 1%  ~ (p=0.218 n=10)
average-RSS-bytes:
  GopherLuaKNucleotide  36.64Mi ± 2%  36.40Mi ± 1%  ~ (p=0.631 n=10)
peak-RSS-bytes:
  GopherLuaKNucleotide  43.28Mi ± 5%  41.55Mi ± 7%  ~ (p=0.089 n=10)
peak-VM-bytes:
  GopherLuaKNucleotide  699.6Mi ± 0%  699.9Mi ± 0%  +0.04% (p=0.006 n=10)

shortname: sweet-markdown
sec/op:
  MarkdownRenderXHTML  260.6m ± 4%  256.4m ± 4%  ~ (p=0.796 n=10)
average-RSS-bytes:
  MarkdownRenderXHTML  20.47Mi ± 1%  20.71Mi ± 2%  ~ (p=0.393 n=10)
peak-RSS-bytes:
  MarkdownRenderXHTML  20.88Mi ± 11%  21.73Mi ± 6%  ~ (p=0.470 n=10)
peak-VM-bytes:
  MarkdownRenderXHTML  699.2Mi ± 0%  699.3Mi ± 0%  ~ (p=0.464 n=10)

shortname: sweet-tile38
sec/op:
  Tile38WithinCircle100kmRequest      529.1µ ± 1%  530.3µ ± 1%  ~ (p=0.143 n=10)
  Tile38IntersectsCircle100kmRequest  629.6µ ± 1%  630.8µ ± 1%  ~ (p=0.971 n=10)
  Tile38KNearestLimit100Request       446.4µ ± 1%  453.7µ ± 1%  +1.62% (p=0.000 n=10)
  geomean                             529.8µ       533.4µ       +0.67%
average-RSS-bytes:
  Tile38WithinCircle100kmRequest      5.054Gi ± 1%  5.057Gi ± 1%  ~ (p=0.796 n=10)
  Tile38IntersectsCircle100kmRequest  5.381Gi ± 0%  5.431Gi ± 1%  +0.94% (p=0.019 n=10)
  Tile38KNearestLimit100Request       6.801Gi ± 0%  6.802Gi ± 0%  ~ (p=0.684 n=10)
  geomean                             5.697Gi       5.717Gi       +0.34%
peak-RSS-bytes:
  Tile38WithinCircle100kmRequest      5.380Gi ± 1%  5.381Gi ± 1%  ~ (p=0.912 n=10)
  Tile38IntersectsCircle100kmRequest  5.669Gi ± 1%  5.756Gi ± 1%  +1.53% (p=0.019 n=10)
  Tile38KNearestLimit100Request       7.013Gi ± 0%  7.011Gi ± 0%  ~ (p=0.796 n=10)
  geomean                             5.980Gi       6.010Gi       +0.50%
peak-VM-bytes:
  Tile38WithinCircle100kmRequest      6.047Gi ± 1%  6.047Gi ± 1%  ~ (p=0.725 n=10)
  Tile38IntersectsCircle100kmRequest  6.305Gi ± 1%  6.402Gi ± 2%  +1.53% (p=0.035 n=10)
  Tile38KNearestLimit100Request       7.685Gi ± 0%  7.685Gi ± 0%  ~ (p=0.955 n=10)
  geomean                             6.642Gi       6.676Gi       +0.51%
p50-latency-sec:
  Tile38WithinCircle100kmRequest      88.81µ ± 1%  89.36µ ± 1%  +0.61% (p=0.043 n=10)
  Tile38IntersectsCircle100kmRequest  151.5µ ± 1%  152.0µ ± 1%  ~ (p=0.089 n=10)
  Tile38KNearestLimit100Request       259.0µ ± 0%  259.1µ ± 0%  ~ (p=0.853 n=10)
  geomean                             151.6µ       152.1µ       +0.33%
p90-latency-sec:
  Tile38WithinCircle100kmRequest      712.5µ ± 0%  713.9µ ± 1%  ~ (p=0.190 n=10)
  Tile38IntersectsCircle100kmRequest  960.6µ ± 1%  958.2µ ± 1%  ~ (p=0.739 n=10)
  Tile38KNearestLimit100Request       1.007m ± 1%  1.032m ± 1%  +2.50% (p=0.000 n=10)
  geomean                             883.4µ       890.5µ       +0.80%
p99-latency-sec:
  Tile38WithinCircle100kmRequest      7.061m ± 1%  7.085m ± 1%  ~ (p=0.481 n=10)
  Tile38IntersectsCircle100kmRequest  7.228m ± 1%  7.187m ± 1%  ~ (p=0.143 n=10)
  Tile38KNearestLimit100Request       2.085m ± 0%  2.131m ± 1%  +2.22% (p=0.000 n=10)
  geomean                             4.738m       4.770m       +0.66%
ops/s:
  Tile38WithinCircle100kmRequest      17.01k ± 1%  16.97k ± 1%  ~ (p=0.143 n=10)
  Tile38IntersectsCircle100kmRequest  14.29k ± 1%  14.27k ± 1%  ~ (p=0.988 n=10)
  Tile38KNearestLimit100Request       20.16k ± 1%  19.84k ± 1%  -1.59% (p=0.000 n=10)
  geomean                             16.99k       16.87k       -0.67%

shortname: uber_tally
goos: linux
goarch: amd64
pkg: github.com/uber-go/tally
cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
sec/op:
  ScopeTaggedNoCachedSubscopes-12  2.867µ ± 4%  2.921µ ± 4%  ~ (p=0.579 n=10)
  HistogramAllocation-12           1.519µ ± 3%  1.507µ ± 7%  ~ (p=0.631 n=10)
  geomean                          2.087µ       2.098µ       +0.53%
B/op:
  HistogramAllocation-12  1.124Ki ± 1%  1.125Ki ± 4%  ~ (p=0.271 n=10)
allocs/op:
  HistogramAllocation-12  20.00 ± 0%  20.00 ± 0%  ~ (p=1.000 n=10) ¹
  ¹ all samples are equal

shortname: uber_zap
pkg: go.uber.org/zap/zapcore
sec/op:
  BufferedWriteSyncer/write_file_with_buffer-12  296.1n ± 12%  205.9n ± 10%  -30.46% (p=0.000 n=10)
  MultiWriteSyncer/2_discarder-12                7.528n ± 4%   7.014n ± 2%   -6.83% (p=0.000 n=10)
  MultiWriteSyncer/4_discarder-12                9.065n ± 1%   8.908n ± 1%   -1.73% (p=0.002 n=10)
  MultiWriteSyncer/4_discarder_with_buffer-12    225.2n ± 2%   147.6n ± 2%   -34.48% (p=0.000 n=10)
  WriteSyncer/write_file_with_no_buffer-12       4.785µ ± 1%   4.933µ ± 3%   +3.08% (p=0.001 n=10)
  ZapConsole-12                                  702.5n ± 1%   649.1n ± 1%   -7.62% (p=0.000 n=10)
  JSONLogMarshalerFunc-12                        1.219µ ± 2%   1.226µ ± 3%   ~ (p=0.781 n=10)
  ZapJSON-12                                     555.4n ± 1%   480.9n ± 3%   -13.40% (p=0.000 n=10)
  StandardJSON-12                                814.1n ± 1%   809.0n ± 0%   ~ (p=0.101 n=10)
  Sampler_Check/7_keys-12                        10.55n ± 2%   10.61n ± 1%   ~ (p=0.594 n=10)
  Sampler_Check/50_keys-12                       11.01n ± 0%   10.98n ± 1%   ~ (p=0.286 n=10)
  Sampler_Check/100_keys-12                      10.71n ± 0%   10.71n ± 0%   ~ (p=0.563 n=10)
  Sampler_CheckWithHook/7_keys-12                20.20n ± 2%   20.42n ± 2%   ~ (p=0.446 n=10)
  Sampler_CheckWithHook/50_keys-12               20.72n ± 2%   21.02n ± 1%   ~ (p=0.078 n=10)
  Sampler_CheckWithHook/100_keys-12              20.15n ± 2%   20.68n ± 3%   +2.63% (p=0.037 n=10)
  TeeCheck-12                                    140.8n ± 2%   140.5n ± 2%   ~ (p=0.754 n=10)
  geomean                                        87.80n        82.39n        -6.15%
```

The only large regression (in ethereum_bitutil's BaseTest2KB) appears to be spurious, as the test does not involve any goroutines (or B.RunParallel()), which profiling confirms.

Updates golang/go#18237
Related to golang/go#32113
529.1µ ± 1% 530.3µ ± 1% ~ (p=0.143 n=10) Tile38IntersectsCircle100kmRequest 629.6µ ± 1% 630.8µ ± 1% ~ (p=0.971 n=10) Tile38KNearestLimit100Request 446.4µ ± 1% 453.7µ ± 1% +1.62% (p=0.000 n=10) geomean 529.8µ 533.4µ +0.67% │ ./sweet/results/tile38/baseline.results │ ./sweet/results/tile38/experiment.results │ │ average-RSS-bytes │ average-RSS-bytes vs base │ Tile38WithinCircle100kmRequest 5.054Gi ± 1% 5.057Gi ± 1% ~ (p=0.796 n=10) Tile38IntersectsCircle100kmRequest 5.381Gi ± 0% 5.431Gi ± 1% +0.94% (p=0.019 n=10) Tile38KNearestLimit100Request 6.801Gi ± 0% 6.802Gi ± 0% ~ (p=0.684 n=10) geomean 5.697Gi 5.717Gi +0.34% │ ./sweet/results/tile38/baseline.results │ ./sweet/results/tile38/experiment.results │ │ peak-RSS-bytes │ peak-RSS-bytes vs base │ Tile38WithinCircle100kmRequest 5.380Gi ± 1% 5.381Gi ± 1% ~ (p=0.912 n=10) Tile38IntersectsCircle100kmRequest 5.669Gi ± 1% 5.756Gi ± 1% +1.53% (p=0.019 n=10) Tile38KNearestLimit100Request 7.013Gi ± 0% 7.011Gi ± 0% ~ (p=0.796 n=10) geomean 5.980Gi 6.010Gi +0.50% │ ./sweet/results/tile38/baseline.results │ ./sweet/results/tile38/experiment.results │ │ peak-VM-bytes │ peak-VM-bytes vs base │ Tile38WithinCircle100kmRequest 6.047Gi ± 1% 6.047Gi ± 1% ~ (p=0.725 n=10) Tile38IntersectsCircle100kmRequest 6.305Gi ± 1% 6.402Gi ± 2% +1.53% (p=0.035 n=10) Tile38KNearestLimit100Request 7.685Gi ± 0% 7.685Gi ± 0% ~ (p=0.955 n=10) geomean 6.642Gi 6.676Gi +0.51% │ ./sweet/results/tile38/baseline.results │ ./sweet/results/tile38/experiment.results │ │ p50-latency-sec │ p50-latency-sec vs base │ Tile38WithinCircle100kmRequest 88.81µ ± 1% 89.36µ ± 1% +0.61% (p=0.043 n=10) Tile38IntersectsCircle100kmRequest 151.5µ ± 1% 152.0µ ± 1% ~ (p=0.089 n=10) Tile38KNearestLimit100Request 259.0µ ± 0% 259.1µ ± 0% ~ (p=0.853 n=10) geomean 151.6µ 152.1µ +0.33% │ ./sweet/results/tile38/baseline.results │ ./sweet/results/tile38/experiment.results │ │ p90-latency-sec │ p90-latency-sec vs base │ Tile38WithinCircle100kmRequest 712.5µ ± 0% 713.9µ ± 1% ~ (p=0.190 n=10) Tile38IntersectsCircle100kmRequest 960.6µ ± 1% 958.2µ ± 1% ~ (p=0.739 n=10) Tile38KNearestLimit100Request 1.007m ± 1% 1.032m ± 1% +2.50% (p=0.000 n=10) geomean 883.4µ 890.5µ +0.80% │ ./sweet/results/tile38/baseline.results │ ./sweet/results/tile38/experiment.results │ │ p99-latency-sec │ p99-latency-sec vs base │ Tile38WithinCircle100kmRequest 7.061m ± 1% 7.085m ± 1% ~ (p=0.481 n=10) Tile38IntersectsCircle100kmRequest 7.228m ± 1% 7.187m ± 1% ~ (p=0.143 n=10) Tile38KNearestLimit100Request 2.085m ± 0% 2.131m ± 1% +2.22% (p=0.000 n=10) geomean 4.738m 4.770m +0.66% │ ./sweet/results/tile38/baseline.results │ ./sweet/results/tile38/experiment.results │ │ ops/s │ ops/s vs base │ Tile38WithinCircle100kmRequest 17.01k ± 1% 16.97k ± 1% ~ (p=0.143 n=10) Tile38IntersectsCircle100kmRequest 14.29k ± 1% 14.27k ± 1% ~ (p=0.988 n=10) Tile38KNearestLimit100Request 20.16k ± 1% 19.84k ± 1% -1.59% (p=0.000 n=10) geomean 16.99k 16.87k -0.67% shortname: uber_tally goos: linux goarch: amd64 pkg: github.com/uber-go/tally cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │ │ sec/op │ sec/op vs base │ ScopeTaggedNoCachedSubscopes-12 2.867µ ± 4% 2.921µ ± 4% ~ (p=0.579 n=10) HistogramAllocation-12 1.519µ ± 3% 1.507µ ± 7% ~ (p=0.631 n=10) geomean 2.087µ 2.098µ +0.53% │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │ │ B/op │ B/op vs base │ HistogramAllocation-12 1.124Ki ± 1% 1.125Ki ± 4% ~ (p=0.271 n=10) │ 
./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │ │ allocs/op │ allocs/op vs base │ HistogramAllocation-12 20.00 ± 0% 20.00 ± 0% ~ (p=1.000 n=10) ¹ ¹ all samples are equal shortname: uber_zap pkg: go.uber.org/zap/zapcore │ ./bent-bench/20230303T173250.baseline.stdout │ ./bent-bench/20230303T173250.experiment.stdout │ │ sec/op │ sec/op vs base │ BufferedWriteSyncer/write_file_with_buffer-12 296.1n ± 12% 205.9n ± 10% -30.46% (p=0.000 n=10) MultiWriteSyncer/2_discarder-12 7.528n ± 4% 7.014n ± 2% -6.83% (p=0.000 n=10) MultiWriteSyncer/4_discarder-12 9.065n ± 1% 8.908n ± 1% -1.73% (p=0.002 n=10) MultiWriteSyncer/4_discarder_with_buffer-12 225.2n ± 2% 147.6n ± 2% -34.48% (p=0.000 n=10) WriteSyncer/write_file_with_no_buffer-12 4.785µ ± 1% 4.933µ ± 3% +3.08% (p=0.001 n=10) ZapConsole-12 702.5n ± 1% 649.1n ± 1% -7.62% (p=0.000 n=10) JSONLogMarshalerFunc-12 1.219µ ± 2% 1.226µ ± 3% ~ (p=0.781 n=10) ZapJSON-12 555.4n ± 1% 480.9n ± 3% -13.40% (p=0.000 n=10) StandardJSON-12 814.1n ± 1% 809.0n ± 0% ~ (p=0.101 n=10) Sampler_Check/7_keys-12 10.55n ± 2% 10.61n ± 1% ~ (p=0.594 n=10) Sampler_Check/50_keys-12 11.01n ± 0% 10.98n ± 1% ~ (p=0.286 n=10) Sampler_Check/100_keys-12 10.71n ± 0% 10.71n ± 0% ~ (p=0.563 n=10) Sampler_CheckWithHook/7_keys-12 20.20n ± 2% 20.42n ± 2% ~ (p=0.446 n=10) Sampler_CheckWithHook/50_keys-12 20.72n ± 2% 21.02n ± 1% ~ (p=0.078 n=10) Sampler_CheckWithHook/100_keys-12 20.15n ± 2% 20.68n ± 3% +2.63% (p=0.037 n=10) TeeCheck-12 140.8n ± 2% 140.5n ± 2% ~ (p=0.754 n=10) geomean 87.80n 82.39n -6.15% The only large regression (in ethereum_bitutil's BaseTest2KB) appears to be spurious, as the test does not involve any goroutines (or B.RunParallel()), which profiling confirms. Updates golang/go#18237 Related to golang/go#32113
Change https://go.dev/cl/473656 mentions this issue: |
Background
The following is a fairly frequent pattern that appears in our code and others: goroutine1 produces work, hands it to goroutine2 over one channel, and immediately blocks waiting for the result on another.
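A minimal runnable sketch of that shape (the channel names, payload type, and the doubling step are illustrative, not the original snippet):

```go
package main

import "fmt"

func main() {
	ch1 := make(chan int) // work: goroutine1 -> goroutine2
	ch2 := make(chan int) // results: goroutine2 -> goroutine1

	// goroutine2: receive work, process it, send the result back.
	go func() {
		for work := range ch1 {
			ch2 <- work * 2 // stand-in for real processing
		}
		close(ch2)
	}()

	// goroutine1 (here, main): produce work, hand it off, then block on
	// the result. The blocking send followed by a blocking receive is
	// exactly the pair discussed below.
	for i := 0; i < 5; i++ {
		ch1 <- i
		fmt.Println(<-ch2)
	}
	close(ch1)
}
```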
The scheduler exhibits two different behaviors, depending on whether goroutine2 is busy and whether there are idle Ps.

If goroutine2 is already running, the send simply queues the data and nothing remarkable happens. If goroutine2 is parked and idle Ps exist, the send sets off a wakeup dance, roughly: (1) goroutine1's send makes goroutine2 runnable; (2) the runtime wakes an idle P, whose M comes out of idle (typically via system calls) and starts looking for work; (3) the woken P attempts to steal the now-runnable goroutine2; (4) goroutine1 reaches its blocking receive and parks, freeing its own P to pick up goroutine2 from the local run queue.

In this second case, if the woken P successfully steals goroutine2, i.e. (3) happens first, goroutine2 starts executing on the new P with cold caches. If the woken P loses the race, i.e. (4) happens first and goroutine2 runs locally, then the entire wakeup was wasted cycles. Either way, the same dance happens again when the result is sent back. In both cases, we spend a large number of cycles and pay interprocessor coordination costs for what should be a simple goroutine context switch.
There are further problems beyond the direct cost: this churn introduces unnecessary work stealing and bounces goroutines between system threads and cores, leading to locality inefficiencies.
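The cost is easy to measure with a micro-benchmark of the same pattern; this harness is my own sketch, not the benchmark quoted elsewhere in this thread:

```go
package pingpong

import "testing"

// BenchmarkPingPong bounces a value between two goroutines over a pair of
// unbuffered channels, one round trip per iteration.
func BenchmarkPingPong(b *testing.B) {
	ch1 := make(chan int)
	ch2 := make(chan int)
	go func() {
		for v := range ch1 {
			ch2 <- v // echo the "result" back immediately
		}
	}()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		ch1 <- i
		<-ch2
	}
	close(ch1)
}
```

Running it with `go test -bench=PingPong -cpu=1,4` is instructive: with a single P there is no idle P to wake, so the dance described above disappears.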
Ideal schedule
With an oracle, the ideal schedule after (1) would be roughly:

- goroutine1 proceeds directly to its blocking receive on ch2 and parks; no idle P is woken.
- goroutine1's P, now free, immediately runs the just-readied goroutine2 from its local run queue.
- goroutine2 sends the result back, readying goroutine1, and parks on its next receive.
- The same P switches straight back to goroutine1.
In essence, we want to yield goroutine1's remaining time to goroutine2 in this case, or at least avoid all the wasted signaling overhead. To put it another way: if goroutine1 is about to block, then its own P fills the role of the "idle P" far more efficiently than a remote one.
Proposal
It may be possible to specifically optimize for this case in the compiler, just as certain loop patterns are optimized.
In the case where a blocking channel send is immediately followed by a blocking channel receive, I propose an optimization that tries to avoid these scheduler round trips.
Here's a rough sketch of the idea:
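One plausible shape for the lowering, sketched under the assumption that the compiler has already proven the send is immediately followed by the receive. `chansendrecv`, `chansendNoWake`, and `chanrecvBlock` are hypothetical names written in the runtime's style, not real runtime APIs; this is pseudocode, not a working patch:

```go
// Hypothetical fused call the compiler could emit for the statement pair
// `ch1 <- data; result := <-ch2`.
func chansendrecv(c1 *hchan, elem unsafe.Pointer, c2 *hchan) unsafe.Pointer {
	// Send as usual, but if a receiver is parked on c1, make it runnable
	// on this P's local run queue and skip the wakep() call that would
	// normally signal an idle P. Nothing is advertised for stealing.
	chansendNoWake(c1, elem)

	// Immediately park on c2. When this goroutine blocks, the scheduler
	// on the same P finds the just-readied goroutine at the head of the
	// local run queue and switches to it directly: one P, no idle-P
	// wakeup, no work stealing, no system calls.
	return chanrecvBlock(c2)
}
```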
Rejected alternatives
I thought about this problem a few years ago when it first caused issues, and considered the possibility of a different channel operator. Something like:
```
ch1 <~ data
```
This operator would write to the channel and immediately yield to the receiving goroutine if it was not already running (otherwise it would fall back to the existing channel behavior). Using this operator in the situation above would make the exchange much more efficient.
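For contrast, the closest approximation with today's APIs is a send followed by `runtime.Gosched()`; the helper below is my own sketch and gives none of the operator's guarantees:

```go
package churn

import "runtime"

// sendThenYield approximates the proposed `<~` semantics with existing
// APIs. Gosched only yields the current P as a hint; it neither guarantees
// that the receiver runs next nor suppresses the idle-P wakeup on the send,
// which is the waste this proposal targets.
func sendThenYield(ch1 chan<- int, ch2 <-chan int, data int) int {
	ch1 <- data
	runtime.Gosched() // best-effort yield, no direct handoff
	return <-ch2
}
```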
However, this is a language change, and it would be confusing to users: when do you use which operator? It would be better to get the effect of this optimization out of the box.
Extensions
[1] https://github.com/golang/go/blob/master/src/runtime/proc.go#L665