cmd/compile: mid-stack inline dispatch functions that call a function on each path #30548
Change https://golang.org/cl/164968 mentions this issue:
In the normal case, only a few words have to be updated when adding a word to a vector. When that happens, we can simply copy the rest of the words, which is much faster. However, the overhead of that makes it prohibitive for small vectors, so we check the size at the beginning.

The implementation is a bit weird to allow addVW to continue to be inlined; see #30548.

The AddVW benchmarks are surprising, but fully repeatable. The SubVW benchmarks are more or less as expected. I expect that removing the indirect function call will help both and make them a bit more normal.

```
name             old time/op  new time/op  delta
AddVW/1-8        4.27ns ± 2%  3.81ns ± 3%  -10.83%  (p=0.000 n=89+90)
AddVW/2-8        4.91ns ± 2%  4.34ns ± 1%  -11.60%  (p=0.000 n=83+90)
AddVW/3-8        5.77ns ± 4%  5.76ns ± 2%     ~     (p=0.365 n=91+87)
AddVW/4-8        6.03ns ± 1%  6.03ns ± 1%     ~     (p=0.392 n=80+76)
AddVW/5-8        6.48ns ± 2%  6.63ns ± 1%   +2.27%  (p=0.000 n=76+74)
AddVW/10-8       9.56ns ± 2%  9.56ns ± 1%   -0.02%  (p=0.002 n=69+76)
AddVW/100-8      90.6ns ± 0%  18.1ns ± 4%  -79.99%  (p=0.000 n=72+94)
AddVW/1000-8      865ns ± 0%    85ns ± 6%  -90.14%  (p=0.000 n=66+96)
AddVW/10000-8    8.57µs ± 2%  1.82µs ± 3%  -78.73%  (p=0.000 n=99+94)
AddVW/100000-8   84.4µs ± 2%  31.8µs ± 4%  -62.29%  (p=0.000 n=93+98)

name             old time/op  new time/op  delta
SubVW/1-8        3.90ns ± 2%  4.13ns ± 4%   +6.02%  (p=0.000 n=92+95)
SubVW/2-8        4.15ns ± 1%  5.20ns ± 1%  +25.22%  (p=0.000 n=83+85)
SubVW/3-8        5.50ns ± 2%  6.22ns ± 6%  +13.21%  (p=0.000 n=91+97)
SubVW/4-8        5.99ns ± 1%  6.63ns ± 1%  +10.63%  (p=0.000 n=79+61)
SubVW/5-8        6.75ns ± 4%  6.88ns ± 2%   +1.82%  (p=0.000 n=98+73)
SubVW/10-8       9.57ns ± 1%  9.56ns ± 1%   -0.13%  (p=0.000 n=77+64)
SubVW/100-8      90.3ns ± 1%  18.1ns ± 2%  -80.00%  (p=0.000 n=75+94)
SubVW/1000-8      860ns ± 4%    85ns ± 7%  -90.14%  (p=0.000 n=97+99)
SubVW/10000-8    8.51µs ± 3%  1.77µs ± 6%  -79.21%  (p=0.000 n=100+97)
SubVW/100000-8   84.4µs ± 3%  31.5µs ± 3%  -62.66%  (p=0.000 n=92+92)
```

Change-Id: I721d7031d40f245b4a284f5bdd93e7bb85e7e937
Reviewed-on: https://go-review.googlesource.com/c/go/+/164968
Run-TryBot: Josh Bleecher Snyder <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Robert Griesemer <[email protected]>
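For illustration, here is a minimal sketch of the pattern that commit message describes: check the vector length up front and dispatch to one of two helpers, keeping addVW itself small enough to remain inlineable. The helper names, threshold, and loop bodies below are assumptions made for the sketch, not the actual math/big implementation.

```go
type Word uintptr

const smallVectorLen = 32 // illustrative cutoff; the real value was tuned by benchmarks

// addVW stays tiny so it can still be inlined at call sites; the bulky
// fast path lives in a separate, hypothetical helper.
func addVW(z, x []Word, y Word) (c Word) {
	if len(z) > smallVectorLen {
		return addVWlarge(z, x, y)
	}
	return addVWsmall(z, x, y)
}

// addVWsmall: plain word-by-word addition with carry propagation.
func addVWsmall(z, x []Word, y Word) (c Word) {
	c = y
	for i, xi := range x {
		zi := xi + c
		z[i] = zi
		if zi < xi { // unsigned wraparound means a carry out
			c = 1
		} else {
			c = 0
		}
	}
	return c
}

// addVWlarge: once the carry dies out, the remaining words are unchanged,
// so copy them instead of adding zero to each one.
func addVWlarge(z, x []Word, y Word) (c Word) {
	c = y
	for i, xi := range x {
		if c == 0 {
			copy(z[i:], x[i:])
			return 0
		}
		zi := xi + c
		z[i] = zi
		if zi < xi {
			c = 1
		} else {
			c = 0
		}
	}
	return c
}
```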
A similar-but-not-identical case (I can open a separate issue if needed) is common in logging libraries like zap:

```go
// Debug logs a message at DebugLevel. The message includes any fields passed
// at the log site, as well as any fields accumulated on the logger.
func (log *Logger) Debug(msg string, fields ...Field) {
	if ce := log.check(DebugLevel, msg); ce != nil {
		ce.Write(fields...)
	}
}
```

So much so that library authors resort to suggesting that users basically perform the inlining manually, since that allows turning something like this:

```go
logger.Debug("something happened",
	zap.String("field", obj.Field()),
	zap.Int64("state", obj.State()),
)
```

into this, potentially bypassing one variadic call:

```go
if e := logger.Check(zap.DebugLevel, "something happened"); e != nil {
	e.Write(
		zap.String("field", obj.Field()),
		zap.Int64("state", obj.State()),
	)
}
```

If the compiler inlined `Debug`, this manual workaround would not be needed.
Following our …

I have a simplistic prototype that halves the cost of a call. It works for

```go
if f1() {
	f2()
}
```

but it obviously fails for

```go
if condition {
	return f1()
}
return f2()
```

even though it inlines

```go
if condition {
	return f1()
} else {
	return f2()
}
```

This prototype doesn't seem to affect the time it takes for … That made me think that maybe a better way to handle this would be to have the …
Seems worth a shot.
I think we want that added condition. I assume that we don't want to inline a(), b(). @dr2chase put together the initial heuristic, but that piece at least seems pretty clearly intentional.
Btw, for testing the impact, you may find github.com/josharian/compilecmp helpful.
The ideas here will resolve the different use-cases I was targeting in #19348 (comment). My initial use-cases were around helper methods that do a switch/if-else-if-else or that call 2 functions (see f and g below).

```go
func f() {
	if x {
		a() // cost 57
	} else if y {
		b() // cost 1
	} else {
		c() // cost 1
	}
	d() // cost 1 (second function)
	e() // cost 57 // without this call, f should be inlineable
}

func g() {
	a() // cost 57
	b() // cost 1
}
```
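To make that accounting concrete (rough arithmetic based on the annotations above and the usual budget of 80, ignoring the small per-statement charges): under today's sum accounting, f's if/else-if/else chain counts 57 + 1 + 1 = 59, and adding d (1) and e (57) puts f well past 80. Under max-of-branches accounting the chain would count max(57, 1, 1) = 57, so f is still over budget because of e, but drops to roughly 58 without it, which is why f "should be inlineable" once e is removed. g totals about 57 + 1 = 58 either way.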
Seems like we could rewrite the visitor slightly, to make it compute the max of the visits down the two arms of an if, rather than the sum.
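A toy sketch of that idea (illustrative only; this is not cmd/compile's actual cost visitor): charge an if for its condition plus the most expensive arm instead of the sum of both arms.

```go
// node is a toy AST node with an inlining cost.
type node interface{ cost() int }

// leaf models a simple statement with a fixed cost.
type leaf struct{ c int }

func (l leaf) cost() int { return l.c }

// call models a function call; non-inlinable calls are expensive.
type call struct{ inlinable bool }

func (c call) cost() int {
	if c.inlinable {
		return 1
	}
	return 57 // the cost currently charged for a call that cannot be inlined
}

// ifStmt has a condition and two arms.
type ifStmt struct {
	cond      node
	then, els []node
}

func sum(ns []node) int {
	t := 0
	for _, n := range ns {
		t += n.cost()
	}
	return t
}

// cost charges the condition plus the max of the two arms, not their sum.
func (s ifStmt) cost() int {
	thenCost, elseCost := sum(s.then), sum(s.els)
	if elseCost > thenCost {
		thenCost = elseCost
	}
	return s.cond.cost() + thenCost
}
```

With this accounting, an if whose two arms each contain one non-inlinable call is charged roughly 57 rather than 114, which is what would let the dispatch-style functions in this issue fit under the budget of 80.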
CL coming....
Change https://golang.org/cl/254796 mentions this issue:
If y'all feel like playing with the CL to see how it does with your examples, that might be good.

amd64: https://perf.golang.org/search?q=upload:20200914.8

Both did poorly on the same benchmark, which is "interesting".

PS: there's at least one failure when you run the tests, because this changed some of the output for logopt (the JSON optimization logger for LSP use).
Mystery regression debugged; if a function f looks like "return new(Thing).initThing()", and … then inlining initThing into f is a Bad Idea, because it results in avoidable heap allocation.

The "solution" to this is a hack, whereby small functions that contain allocations have a lower limit on the size of things that they will inline (a similar mechanism to not inlining into giant methods), which will tend to let them be inlined more often. This bug could be biting us already; I just happened to make it appear with this bit of tinkering. Better interleaving of escape analysis and inlining would help with this. Like I said, hack.

I am benchmarking just to see what the this-time-for-sure heuristic does for binary size and benchmark time. Interested parties are again invited to tinker with this, to see if it helps their code enough to matter.
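A sketch of the scenario being described, with hypothetical types (not the code that actually triggered the regression):

```go
type thing struct{ a, b, c int }

// initThing is tiny and normally inlineable.
func (t *thing) initThing() *thing {
	t.a, t.b, t.c = 1, 2, 3
	return t
}

// newThing has the "return new(Thing).initThing()" shape. If initThing is
// inlined here, newThing can grow past the budget and stop being inlined
// into its callers; the returned *thing then escapes and is heap-allocated.
func newThing() *thing {
	return new(thing).initThing()
}

func use() int {
	// If newThing (and, transitively, initThing) is instead inlined here,
	// escape analysis can see that t never leaves this function and may
	// keep it on the stack.
	t := newThing()
	return t.a + t.b + t.c
}
```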
There, regressions gone. Slight performance increase, slight binary size increase.
My Go time is severely restricted at the moment, but I really look forward to trying this out. And reverting my math/big hacks.
@josharian We're also discussing better hacks. I'm not thrilled at the existence of additional knobs that are not measuring precisely what we care about, yet definitely have value.
@dr2chase Gentle ping. Any update? It would be great to get this into go1.16.

For more context, I have many functions that do something to the tune of:

```go
func read1Byte() byte {
	if decodingFromBytes {
		return read1ByteFromByteSlice()
	} else if decodingFromBufferedIO {
		return read1ByteFromBufIO()
	} else {
		return read1ByteFromIO()
	}
}
```

Currently, code like this can never be inlined, and using an interface also prevented inlining from happening. The common path through read1ByteFromByteSlice() is very cheap, and I wanted that inlined. With your code change, it will be.
So, the CL is up for review, there's an open "bug", and this is low-risk -- so it might go in later rather than sooner -- i.e., it can go in after the deadline, unless I misunderstand that policy. Right now I'm working on stuff that I'd call much riskier that we need to be sure works. There's also some higher-priority inlining work going on to deal with a performance regression in 1.15, and that needs to happen first. See #41474. It's mostly done.

There's a default/automatic concern about the effect this will have on build times and binary size, since it will by default cause more things to become inlineable, hence larger binaries and more time needed to produce them, so we may want to tweak those knobs a little bit.
Change https://golang.org/cl/267419 mentions this issue:
What's special about … that it should be inlined, but that … should not be? The inlining heuristics are currently oriented around the resulting code size. This justifies inlining …

If the inlining heuristic cost assigned to function calls is too high, I think it's reasonable to just lower it. I don't see how control flow details weigh in here, since we're not (at least yet) doing any substantive control flow analysis at time of inlining.
Cutting the cost of a call in half increases text size by 3.5%.

By the way, the best-looking "floats-all-boats" version of this CL doesn't work for a 3-call if tree. It includes a budget for how much the "max accounting" can be lower than the "sum accounting", and the current works-best number there is only 68, which is 11 more than the cost of a call; 100 was also tested, and it was not as good -- the binaries were a bit larger (of course), the performance improvements were not as good, and there were more losers. All the benchmarking results used to obtain these parameters appear here.
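To put numbers on that divergence budget (this is my reading of the figures quoted above, not a statement of the CL's exact mechanics): with a call costing 57, an if/else whose two arms each make one non-inlinable call is charged 114 under sum accounting but 57 under max accounting, a gap of 57, which fits within the 68 allowance. A three-armed if tree with one call per arm has a gap of 114, which exceeds 68, so it falls back to the sum, matching the observation that the 3-call if tree doesn't benefit.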
My understanding of the inlining heuristics is that a function call takes more budget (~57 out of 80) than a simple instruction because we decided that the inclusion of a call that cannot be inlined suggests that inlining the wrapping function is of less value, so we reduced the leftover budget to about ~20. However, a call like …

It should be just as possible to inline:

```go
func a() {
	i++
	f()
}
```

as to inline

```go
func a() {
	i++
	if i > 0 {
		f()
	} else {
		g()
	}
}
```

When we started talking about mid-stack inlining, the goal was to reduce the function call cost to 1 (similar to panic, append, and other intrinsic functions). Since that is no longer on the table and a function call costs a large portion of the inlining budget, we should allow a switch where only one branch is taken to cost roughly the same as a single call. This is what the CL achieves.
It would suck, both in performance (creating unnecessary temp closures) and readability, if we unintentionally encouraged folks to resort to this.
One of the costs we're trying to model here is code size. Inlining that 3-branch if statement replaces 1 call (the callsite) with 3 calls. The fact that only one of those 3 calls can execute on a particular invocation doesn't help with code size (unless the conditions can be statically evaluated).
Putting this off till 1.17; need to do other work first. I do think the max-of-if-arms is reasonable, but we need to do a better job of not inlining in some cases. The companion CL that was supposed to help with this was too flaky; Keith and I were discussing ideas in comments there. The crude sketch is to put off the inlining decision until we can tell whether it will over-fill an otherwise inlineable caller; alternately, look ahead more carefully in the max-cost estimator. One thing I tried that seems like a win is to also impose a quota/budget on the amount that max-if-cost may diverge from sum-if-cost; this reduced the binary-size increase without doing much at all to hurt performance. Note that the interaction of max-if-cost with look-before-leap inline cost estimation will be a bit of a pain.
Before I forget, here's a writeup of what I tried, what I learned, and what I might try in the future: https://docs.google.com/document/d/17LAdIdHbcdhvxxQ2nszx4SgM_vbDGz6rDoj7BjL7VqM/edit?usp=sharing
dispatch should be inlined: It is very simple, and on either path through the function we call exactly one other simple function.
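The code block from the original report is not preserved in this copy; based on the description above, a dispatch function of roughly this shape (hypothetical names, a sketch rather than the original code) is what is being discussed:

```go
// f and g stand in for simple functions that are not themselves inlined.
func f() { /* ... */ }
func g() { /* ... */ }

// dispatch calls exactly one of f or g on any given invocation. Because each
// call to a non-inlinable function is charged heavily, the two call sites
// together push dispatch past the inlining budget.
func dispatch(which bool) {
	if which {
		f()
	} else {
		g()
	}
}
```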
It is not:

```
x.go:3:6: cannot inline dispatch: function too complex: cost 126 exceeds budget 80
```
This is because we attach a significant cost to each function call.
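For scale: the default inlining budget is 80, and (as discussed later in this thread) a call to a function that cannot be inlined is charged roughly 57, so a cost of 126 is consistent with two such calls plus a handful of cheaper nodes.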
Note that we are happy to inline this dispatch function:
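The code block is also missing here; judging from the later remark about "unnecessary temp closures", the inlineable variant was plausibly one that funnels both paths through a single call via a function value, along these lines (a guess, not the original code):

```go
func dispatch(which bool) {
	fn := f
	if which {
		fn = g
	}
	fn() // a single call site, so the charged cost fits within the budget
}
```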
Fixing this requires either some special casing to recognize common patterns (like the original dispatch) or a bit of control flow analysis.
cc @randall77 @dr2chase