testing: document best practices for avoiding compiler optimizations in benchmarks #27400
Comments
Is that kind of optimization avoidance still recommended? If so, do you think that info should be put in https://golang.org/pkg/testing/#hdr-Benchmarks? I don't see an official benchmark wiki, so it seems useful to give a short explanation there.
Couldn't reproduce benchmark code elision in currently-supported Go versions.
@as I believe it can happen now with inlined calls. There is discussion of purity analysis, which might impact non-inlined pure calls later.
How come no action has been taken here? This indeed seems documentation-worthy. Does it warrant a doc update?
@gbbr I think adding some documentation around this would be fine. No one has gotten to it yet. Want to send a CL?
Based on the conversation in this thread it is still unclear to me what the documentation should say. Does Dave's 2013 blog post reflect today's best practices?
I am not able to come up with an example that illustrates the problem, and I'm not sure this really is a problem today. The example here is wrong, as is #14813, because it uses the loop's index as an argument to the function call. The other example in Dave's post here also does not show any noticeable difference between the two solutions.
Here's an example that demonstrates the problem. You have to be careful to ensure that the function is inlined, but also that the result cannot be computed at compile time.
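As a rough illustration of that failure mode (the function and the names below are purely illustrative, not the exact example from this comment):

```go
package bench

import "testing"

// mask is trivially inlinable, and its argument below is read from a
// package-level variable, so the result is not a compile-time constant.
func mask(x uint64) uint64 {
	return x &^ 0xff
}

var input uint64 = 0x12345678
var sink uint64

// BenchmarkMaskDiscard drops the result. After inlining, the loop body has
// no observable effect and can be removed, so the loop measures nothing.
func BenchmarkMaskDiscard(b *testing.B) {
	for i := 0; i < b.N; i++ {
		mask(input)
	}
}

// BenchmarkMaskSink stores the result in a package-level variable, which
// keeps the inlined computation from being eliminated as a dead store.
func BenchmarkMaskSink(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = mask(input)
	}
}
```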
Here's what I get using gc (tip) and gccgo (8.3.0):
I started one of the other discussions about this a few years ago on golang-dev. I think the situation is still quite unfortunate. To the best of my knowledge, the situation is:
Given (1) and (2), I think a lot of the current "sink" approaches are not good since they make the code uglier and they are hard to justify (they protect the benchmark against some, but not all, hypothetical future optimizations). More recently some people have suggested that using runtime.KeepAlive to mark values that you want to always be computed in the benchmark is the best approach (such as @randall77's comment here). That seems better, but is still not completely ideal in two ways:
Some comments in these discussions have seemed to suggest that writing microbenchmarks is an advanced task for experienced programmers, one that always requires some knowledge of the compiler and the system underneath, and that therefore there's nothing to be done here. I don't really buy that, though: I think that even beginners can reasonably want to write benchmarks that measure how long their function takes. I advocate that we decide on a single best approach here (which, per the discussion above, seems to be runtime.KeepAlive) and document it.
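For concreteness, the runtime.KeepAlive pattern mentioned above might look roughly like this (Fib is only a stand-in for the code under test). Note that it only keeps the result from being discarded as a dead store; it does not, for example, prevent a hypothetical future compiler from hoisting a loop-invariant call out of the loop.

```go
package bench

import (
	"runtime"
	"testing"
)

// Fib stands in for whatever function is being benchmarked.
func Fib(n int) int {
	a, b := 0, 1
	for i := 0; i < n; i++ {
		a, b = b, a+b
	}
	return a
}

func BenchmarkFib(b *testing.B) {
	var r int
	for i := 0; i < b.N; i++ {
		r = Fib(10)
	}
	// Keep r live so the compiler cannot discard the assignments above
	// as dead stores and then eliminate the (possibly inlined) calls.
	runtime.KeepAlive(r)
}
```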
I just ran into this in some mostly-for-fun code. I ended up working around it.
I don't think the compiler should apply special treatment to benchmarks. In the long run that will mean that benchmarks aren't doing what we think they are doing, which is already true to some extent, but it would be even worse.
It would be nice to have some sort of sink function as a method of testing.B.
Then any benchmark could pass the values it wants kept to that method. Possibly we could also teach the compiler something about it.
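Purely as a sketch of that idea (no such method exists in the testing package; the name and signature here are guesses), it might be declared along these lines:

```go
// Sink marks its arguments as used by the benchmark, so the compiler cannot
// eliminate the computations that produced them as dead code.
// (Hypothetical API: the testing package has no such method today.)
func (b *B) Sink(args ...interface{})
```

A benchmark would then call something like b.Sink(Fib(10)) inside its loop instead of assigning to a package-level sink variable.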
Thinking about this some more, I suspect that a callback-style API, similar in spirit to the existing RunParallel, might be more robust. Something like:

```go
package testing

// Iterate accepts an argument of any function type and zero or more arguments to pass to that function.
// Iterate then invokes the function sequentially exactly b.N times with those arguments.
//
// The compiler will not optimize through the function's arguments or return values,
// so arguments passed to the function will not be subject to constant-propagation
// optimizations and values returned by it will not be subject to dead-store elimination.
func (b *B) Iterate(f interface{}, args ...interface{})

// IterateParallel is like Iterate, but invokes the function on multiple goroutines in parallel.
//
// The number of goroutines defaults to GOMAXPROCS, and can be increased by calling
// SetParallelism before IterateParallel.
func (b *B) IterateParallel(f interface{}, args ...interface{})
```

Then the example from the blog post might look like:

```go
func BenchmarkFibComplete(b *testing.B) {
	b.Iterate(Fib, 10)
}
```

Not only more robust and intuitive, but also much more concise to boot!
@bcmills I mused about something similar at #38677 (comment)
(I filed proposal #48768 for the API from the above comment.)
I ran into this issue in a particularly insidious way recently, and want to describe it here. My goal was to benchmark a function akin to this:
So I wrote a benchmark function:
I've been burned by compilers optimizing away benchmarking code many times before, so I checked for the usual red flags: an implausibly small time per op, and times that don't grow with the input size. Neither of these happened here:
This is for input size of 400k. 100us/op doesn't sound very outlandish.
Roughly 2x growth in time per op - makes sense. But in fact, the compiler here inlined everything into the benchmark function and eliminated part of the inlined work whose results went unused. Mitigating this by adding a sink (or a runtime.KeepAlive call) fixed the measurement.
What I find particularly problematic about this scenario is that the partial elimination still produced plausible-looking timings, so the usual sanity checks didn't catch it.
Change https://go.dev/cl/505235 mentions this issue:
I was thinking about this some more, along the lines of what Bryan proposed, but with some ergonomic improvements. What if the benchmark function itself returned a closure?
The testing package calls the returned closure b.N times. The "sink" in this case is the return value of the closure that the benchmark returns. Any test setup could be done in the benchmark function itself, before the closure is returned. Possibly we could pass additional arguments to the closure as well.
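Sketching that shape purely for illustration (this is not an API the testing package provides, and the signature is a guess):

```go
// Hypothetical shape: the benchmark function returns a closure, and the
// testing package would call that closure b.N times, treating its return
// value as a sink that must not be optimized away.
func BenchmarkFib(b *testing.B) func() int {
	// Any setup runs here, before the measured iterations begin.
	return func() int {
		return Fib(10)
	}
}
```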
@randall77, I like that direction, but I think #48768 is a little clearer w.r.t. interactions with the existing testing.B methods. With the “return a closure” approach, presumably calling existing *B methods from inside the closure would need carefully specified behavior. I think we would also need to be careful to specify exactly how the closure's return value is treated.
I've just stumbled across this issue with utf16_test.BenchmarkDecodeRune. After getting weird results from this benchmark when testing a RISC-V patch, I took a look at the disassembly and noticed that the benchmark was neither calling nor inlining DecodeRune; the DecodeRune code seemed to have been completely optimised away. Assigning the return value of DecodeRune to a local variable and calling runtime.KeepAlive on that local at the end of the function fixed the issue: without the fix the benchmark reports a far lower time per op than with it.
This isn't a RISC-V-specific problem. Disassembling an amd64 build of the benchmark shows that the DecodeRune code has been optimised away on those builds as well.
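The fix described above amounts to something like the following sketch (simplified; the real utf16_test.BenchmarkDecodeRune uses different inputs and structure):

```go
package utf16_test

import (
	"runtime"
	"testing"
	"unicode/utf16"
)

func BenchmarkDecodeRune(b *testing.B) {
	var r rune
	for i := 0; i < b.N; i++ {
		// 0xD800, 0xDC00 is a valid surrogate pair, decoding to U+10000.
		r = utf16.DecodeRune(0xD800, 0xDC00)
	}
	// Keep the last result live so the inlined DecodeRune calls above
	// cannot be optimised away as dead code.
	runtime.KeepAlive(r)
}
```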
What did you do?
What did you expect to see?
I expected to see documentation on how to correctly write benchmarks that avoid compiler optimizations and examples that reflect best practices.
If the techniques described in Dave's blog post are no longer necessary, I expected to see explicit documentation to that effect.
What did you see instead?
Neither of those things.