testing: document best practices for avoiding compiler optimizations in benchmarks #27400
Comments
Is that kind of optimization avoidance still recommended? If so, do you think that info should be put in https://golang.org/pkg/testing/#hdr-Benchmarks? I don't see an official benchmark wiki, so it seems useful to give a short explanation there.
Couldn't reproduce benchmark code elision in currently-supported Go versions.
@as I believe it can happen now with inlined calls. There is discussion of purity analysis, which might impact non-inlined pure calls later.
How come no action has been taken here? This indeed seems documentation-worthy. Does it warrant a doc update?
@gbbr I think adding some documentation around this would be fine. No one has gotten to it yet. Want to send a CL?
Based on the conversation in this thread it is still unclear to me what the documentation should say. Does Dave's 2013 blog post reflect today's best practices?
I am not able to come up with an example that illustrates the problem, and I'm not sure this really is a problem today. The example here is wrong, as is #14813, because it uses the loop's index as an argument to the function call. The other example in Dave's post here also does not show any noticeable difference between the two solutions.
Here's an example that demonstrates the problem. You have to be careful to ensure that the function is inlined, but also that the result cannot be computed at compile time.
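As a rough illustration of that failure mode (the function and the names below are purely illustrative, not the exact example from this comment):

```go
package bench

import "testing"

// mask is trivially inlinable, and its argument below is read from a
// package-level variable, so the result is not a compile-time constant.
func mask(x uint64) uint64 {
	return x &^ 0xff
}

var input uint64 = 0x12345678
var sink uint64

// BenchmarkMaskDiscard drops the result. After inlining, the loop body has
// no observable effect and can be removed, so the loop measures nothing.
func BenchmarkMaskDiscard(b *testing.B) {
	for i := 0; i < b.N; i++ {
		mask(input)
	}
}

// BenchmarkMaskSink stores the result in a package-level variable, which
// keeps the inlined computation from being eliminated as a dead store.
func BenchmarkMaskSink(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = mask(input)
	}
}
```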
Here's what I get using gc (tip) and gccgo (8.3.0):
I started one of the other discussions about this a few years ago on golang-dev. I think the situation is still quite unfortunate. To the best of my knowledge, the situation is:
Given (1) and (2), I think a lot of the current "sink" approaches are not good since they make the code uglier and they are hard to justify (they protect the benchmark against some, but not all, hypothetical future optimizations). More recently some people have suggested that using runtime.KeepAlive to mark values that you want to always be computed in the benchmark is the best approach (such as @randall77's comment here). That seems better, but is still not completely ideal in two ways:
Some comments in these discussions have seemed to suggest that writing microbenchmarks is an advanced task for experienced programmers, one that always requires some knowledge of the compiler and the system underneath, and that therefore there's nothing to be done here. I don't really buy that, though: I think that even beginners can reasonably want to write benchmarks that measure how long their function takes. I advocate that we decide on a single best approach here (which, per the discussion above, seems to be runtime.KeepAlive) and document it.
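For concreteness, the runtime.KeepAlive pattern mentioned above might look roughly like this (Fib is only a stand-in for the code under test). Note that it only keeps the result from being discarded as a dead store; it does not, for example, prevent a hypothetical future compiler from hoisting a loop-invariant call out of the loop.

```go
package bench

import (
	"runtime"
	"testing"
)

// Fib stands in for whatever function is being benchmarked.
func Fib(n int) int {
	a, b := 0, 1
	for i := 0; i < n; i++ {
		a, b = b, a+b
	}
	return a
}

func BenchmarkFib(b *testing.B) {
	var r int
	for i := 0; i < b.N; i++ {
		r = Fib(10)
	}
	// Keep r live so the compiler cannot discard the assignments above
	// as dead stores and then eliminate the (possibly inlined) calls.
	runtime.KeepAlive(r)
}
```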
I just ran into this in some mostly-for-fun code. I ended up working around it.
I don't think the compiler should apply special treatment to benchmarks. In the long run that will mean that benchmarks aren't doing what we think they are doing, which is already true to some extent, but it would be even worse.
It would be nice to have some sort of sink function as a method of testing.B.
Then any benchmark could pass the values it wants kept to that method. Possibly we could also teach the compiler something about it.
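Purely as a sketch of that idea (no such method exists in the testing package; the name and signature here are guesses), it might be declared along these lines:

```go
// Sink marks its arguments as used by the benchmark, so the compiler cannot
// eliminate the computations that produced them as dead code.
// (Hypothetical API: the testing package has no such method today.)
func (b *B) Sink(args ...interface{})
```

A benchmark would then call something like b.Sink(Fib(10)) inside its loop instead of assigning to a package-level sink variable.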
Thinking about this some more, I suspect that a callback-style API, similar in spirit to the existing RunParallel, might be more robust. Something like:

```go
package testing

// Iterate accepts an argument of any function type and zero or more arguments to pass to that function.
// Iterate then invokes the function sequentially exactly b.N times with those arguments.
//
// The compiler will not optimize through the function's arguments or return values,
// so arguments passed to the function will not be subject to constant-propagation
// optimizations and values returned by it will not be subject to dead-store elimination.
func (b *B) Iterate(f interface{}, args ...interface{})

// IterateParallel is like Iterate, but invokes the function on multiple goroutines in parallel.
//
// The number of goroutines defaults to GOMAXPROCS, and can be increased by calling
// SetParallelism before IterateParallel.
func (b *B) IterateParallel(f interface{}, args ...interface{})
```

Then the example from the blog post might look like:

```go
func BenchmarkFibComplete(b *testing.B) {
	b.Iterate(Fib, 10)
}
```

Not only more robust and intuitive, but also much more concise to boot!
@bcmills I mused about something similar at #38677 (comment)
(I filed proposal #48768 for the API from the above comment.)
I ran into this issue in a particularly insidious way recently, and want to describe it here. My goal was to benchmark a function akin to this:
So I wrote a benchmark function:
I've been burned by compilers optimizing away benchmarking code many times before, so I checked for the usual red flags: an implausibly small time per op, and times that don't grow with the input size. Neither of these happened here:
This is for input size of 400k. 100us/op doesn't sound very outlandish.
Roughly 2x growth in time per op - makes sense. But in fact, the compiler here inlined everything into the benchmark function and eliminated part of the inlined work whose results went unused. Mitigating this by adding a sink (or a runtime.KeepAlive call) fixed the measurement.
What I find particularly problematic about this scenario is that the partial elimination still produced plausible-looking timings, so the usual sanity checks didn't catch it.
Change https://go.dev/cl/505235 mentions this issue:
I was thinking about this some more, along the lines of what Bryan proposed, but with some ergonomic improvements. What if the benchmark function itself returned a closure?
The testing package calls the returned closure b.N times. The "sink" in this case is the return value of the closure that the benchmark returns. Any test setup could be done in the benchmark function itself, before the closure is returned. Possibly we could pass additional arguments to the closure as well.
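Sketching that shape purely for illustration (this is not an API the testing package provides, and the signature is a guess):

```go
// Hypothetical shape: the benchmark function returns a closure, and the
// testing package would call that closure b.N times, treating its return
// value as a sink that must not be optimized away.
func BenchmarkFib(b *testing.B) func() int {
	// Any setup runs here, before the measured iterations begin.
	return func() int {
		return Fib(10)
	}
}
```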
@randall77, I like that direction, but I think #48768 is a little clearer w.r.t. interactions with the existing testing.B methods. With the “return a closure” approach, presumably calling existing *B methods from inside the closure would need carefully specified behavior. I think we would also need to be careful to specify exactly how the closure's return value is treated.
I've just stumbled across this issue with utf16_test.BenchmarkDecodeRune. After getting weird results from this benchmark when testing a RISC-V patch, I took a look at the disassembly and noticed that the benchmark was neither calling nor inlining DecodeRune; the DecodeRune code seemed to have been completely optimised away. Assigning the return value of DecodeRune to a local variable and calling runtime.KeepAlive on that local at the end of the function fixed the issue: without the fix the benchmark reports a far lower time per op than with it.
This isn't a RISC-V-specific problem. Disassembling an amd64 build of the benchmark shows that the DecodeRune code has been optimised away on those builds as well.
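The fix described above amounts to something like the following sketch (simplified; the real utf16_test.BenchmarkDecodeRune uses different inputs and structure):

```go
package utf16_test

import (
	"runtime"
	"testing"
	"unicode/utf16"
)

func BenchmarkDecodeRune(b *testing.B) {
	var r rune
	for i := 0; i < b.N; i++ {
		// 0xD800, 0xDC00 is a valid surrogate pair, decoding to U+10000.
		r = utf16.DecodeRune(0xD800, 0xDC00)
	}
	// Keep the last result live so the inlined DecodeRune calls above
	// cannot be optimised away as dead code.
	runtime.KeepAlive(r)
}
```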
What did you do?
What did you expect to see?
I expected to see documentation on how to correctly write benchmarks that avoid compiler optimizations and examples that reflect best practices.
If the techniques described in Dave's blog post are no longer necessary, I expected to see explicit documentation to that effect.
What did you see instead?
Neither of those things.