cmd/compile: feedback-guided optimization #28262
Comments
Oh, hey, I have a concrete example of a potentially-relevant case for this, which is a recurring thing that bites people with microbenchmarks, but may also affect real code: In rare cases, trivial questions of code alignment can produce very large performance differences. I had a test case, at one point, where I had two functions which were getting roughly a 2x performance difference in benchmarking. (As in, one took twice as long as the other.) But the more I experimented, the more confused I became, because it turned out the functions were identical, and I could change which one was faster by reordering them in source. The microbenchmark case probably doesn't matter, but there are real-world cases where a couple of likely-concurrent hot paths end up with some kind of problem which is entirely a function of the exact location of their code -- I'm guessing either "aligned or not aligned with cache line" or "in same cache-associativity group as other hot paths", but I don't have the expertise to dive into it far enough to be sure. My guess is that there's some actual specialized code out there where inserting a handful of nops before some specific functions could produce a +/-20% performance change in the program as a whole. Figuring out which ones seems likely to be arbitrarily hard; you'd need some kind of framework for figuring out how to benchmark the functions, and how to benchmark different sets of functions in parallel, and so on. This is the sort of thing which is a huge amount of work, and which doesn't provide any noticeable benefit until it does, and then it can be inordinately significant. (The big question is whether you end up perceiving it as "sometimes we get lucky and our code runs faster" or "sometimes we get unlucky and our code runs slower"). |
More than two years have passed since the last discussion. Has there been any recent progress on FGO, or is there a plan for it? |
I hadn't remembered we even had an open issue for this! @cherrymui and @prattmic are actively doing prototype work right now to help us understand the potential wins from PGO/FGO/FDO and how to best integrate it into the build process. It's early stages, but we're tentatively hoping to have initial PGO optimizations available in 1.19. |
Here's a related idea I've been kicking around. Instead of the pipeline running program -> profile -> compiler, add an extra intermediate step: running program -> profile -> optimization file -> compiler. The optimization file consists of optimization information and directives to the compiler. There are a few advantages to this:
|
Though I am delighted to hear this, I feel compelled to say yet again: I wish we had a more predictable, repeatable mechanism to learn things like this than someone happening to ping a relevant issue. |
Is this the type of thing that could get covered in the “Compiler and Runtime Meeting Notes“ tracking issue #43930 ? That did seem useful, although I know it takes some work. |
I have received similar feedback in discussing this with one user: that they would like a way to manually specify optimizations that a PGO profile might indicate if capturing actual profiles is not possible/representative or via, as you say "offline human analysis". They specifically mentioned |
CC @josharian Yeah. I haven't been maintaining that of late. My apologies. (It's kinda fallen on the pile of "stuff I should get to".) I'll try to do better. :) Having said that, it's unlikely that PGO would have made that tracking issue. PGO started in experimental form, and we were only trying to see if it was worth pursuing. Our experimentation and the results from some external groups led us to believe PGO might make the 2022 roadmap, but we're still trying to figure out what form it will take. |
Regarding PGO in the latest .NET 6 release https://devblogs.microsoft.com/dotnet/announcing-net-6/#dynamic-pgo and https://devblogs.microsoft.com/dotnet/announcing-net-6/#dynamic-pgo (UPD: there is a 2nd Dynamic PGO paragraph closer to the end of the article) |
To echo #28262 (comment), #49688 came as a surprise to seemingly everyone except the compiler team. If #43930 isn't the right medium to keep everyone else in the loop, perhaps consider trying something else. My gut feeling is that those public weekly notes are a good starting point, and I'd like to see them resume - especially if they included a very brief summary of the main projects currently in progress. To hopefully give a bit of perspective: the lack of transparency into what is being actively considered or experimented on discourages external contributions. Using this thread as an example, it seems like Josh has been thinking about this idea for a while, and I'd bet he would contribute towards the design, testing, and perhaps even coding/reviews. That kind of external involvement simply isn't going to happen if there isn't a public way to be kept in the loop, though. |
@mvdan I'm sorry it seems that there is a lack of transparency by the Go team. The compiler team at Google hasn't done any work on PGO that is not public. The code in #49688 was developed entirely independently at Uber. The compiler team only became aware of it relatively recently when Uber reached out privately. The public #49688 is a step toward transparency and keeping everybody in the loop. It would have been better to make a public announcement about it first. Still, we would now be in the same position anyhow. The communication on this is a lesson to bear in mind the next time there is a substantial privately developed contribution. Sorry for the confusion and angst. |
Perhaps I'm not seeing the full picture - in #28262 (comment) it was described that some prototype work was already taking place, and in #49688 (comment) it seems like the design and coordination with Uber to upstream their work is currently underway. These updates are certainly welcome, and I'm really excited to see some progress in this space. However, I'd like to clarify that I don't think we'd have arrived at the same position anyway. To once again take Josh as an example, he's only learning about developments around this proposal via comments that feel like after-thoughts, leaving him little room to participate or even be kept in the loop with some detail. Which is unfortunate, given he's the author of the proposal and some of its design ideas, and has been one of the most reliable compiler contributors over the years :) Maybe the way the compiler team functions is indeed to work on this issue privately, and then come out publicly with a proposal document and possibly even a prototype implementation. I'm trying to argue, hopefully in a constructive way, that we should aim to be more transparent and welcoming to external contributors. A good first step would be to post regular and early updates somewhere like #43930, making it easy for others to be kept up to date in a timely manner. Ideally some of the team's communication channels would be public too, much like golang-tools has been doing for years, but at least weekly written updates are an easier incremental step. |
@mvdan I don't think you're missing the picture. There are a few details that could help fill you in, but most of it is current events. A bit of background: The present day: That is where we are. Uber wants us to consider their inlining implementation, and we need to take a look at it. The feedback that the compiler and runtime team could do a better job is correct, but in this instance, I feel like we communicated when we knew things. The PGO work we undertook in 2021H2 was truly an experiment. We do experimentation all the time, and when we can share results (i.e., it isn't based on internal Google data), we do. I hope this explanation helps, and I'm happy to chat at any time. |
Thanks, Jeremy - I appreciate the extra updates and context on PGO in this thread, as well as all the work that is happening here.
Just so we're clear, this implies only Google, correct? I think that's what I'm trying to get at. People like Josh certainly deserve being considered part of the compiler team, but are repeatedly kept out of relevant threads and video calls simply because they are not employed by Google :) I admit this is getting off-topic, though, so this will be my last comment on the topic here. I've already outlined how I think you could improve transparency and encourage more significant contributions from non-Googlers, and I'd rather not repeat myself as I don't have much else to add. I hope it's clear this is not an anti-Google sentiment in any way - but rather, how other companies (like Josh's Tailscale or this thread's Uber) would likely participate more regularly and actively in those teams if they were given an honest chance. |
Sorry, yes. That's how we split things up within Google. These are the people within Google who think about the C&RT; I don't mean to sound like we're alone in this fight. :) |
Change https://go.dev/cl/357330 mentions this issue: |
Performance is kind of hard to exactly quantify.

One big difference between jump tables and the old binary search scheme is that there's only 1 branch statement instead of O(n) of them. That can be both a blessing and a curse, and can make evaluating jump tables very hard to do.

The single branch can become a choke point for the hardware branch predictor. A branch table jump must fit all of its state in a single branch predictor entry (technically, a branch target predictor entry). With binary search that predictor state can be spread among lots of entries. In cases where the case selection is repetitive and thus predictable, binary search can perform better.

The big win for a jump table is that it doesn't consume so much of the branch predictor's resources. But that benefit is essentially never observed in microbenchmarks, because the branch predictor can easily keep state for all the binary search branches in a microbenchmark. So that benefit is really hard to measure.

So predictable switch microbenchmarks are ~useless - they will almost always favor the binary search scheme. Fully unpredictable switch microbenchmarks are better, as they aren't lying to us quite so much. In a perfectly unpredictable situation, a jump table will expect to incur 1-1/N branch mispredicts, where a binary search would incur lg(N)/2 of them. That makes the crossover point at about N=4. But of course switches in real programs are seldom fully unpredictable, so we'll use a higher crossover point.

Beyond the branch predictor, jump tables tend to execute more instructions per switch but have no additional instructions per case, which also argues for a larger crossover.

As far as code size goes, with this CL cmd/go has a slightly smaller code segment and a slightly larger overall size (from the jump tables themselves which live in the data segment).

This is a case where some FDO (feedback-directed optimization) would be really nice to have. #28262

Some large-program benchmarks might help make the case for this CL. Especially if we can turn on branch mispredict counters so we can see how much using jump tables can free up branch prediction resources that can be gainfully used elsewhere in the program.

name                   old time/op  new time/op  delta
Switch8Predictable     1.89ns ± 2%  1.27ns ± 3%  -32.58%  (p=0.000 n=9+10)
Switch8Unpredictable   9.33ns ± 1%  7.50ns ± 1%  -19.60%  (p=0.000 n=10+9)
Switch32Predictable    2.20ns ± 2%  1.64ns ± 1%  -25.39%  (p=0.000 n=10+9)
Switch32Unpredictable  10.0ns ± 2%   7.6ns ± 2%  -24.04%  (p=0.000 n=10+10)

Fixes #5496
Update #34381

Change-Id: I3ff56011d02be53f605ca5fd3fb96b905517c34f
Reviewed-on: https://go-review.googlesource.com/c/go/+/357330
Run-TryBot: Keith Randall <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Cherry Mui <[email protected]>
Reviewed-by: Keith Randall <[email protected]>
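For readers who want to see the two mispredict estimates from the commit message side by side, here is a minimal sketch that tabulates 1-1/N and lg(N)/2 for a few switch sizes. It only evaluates the formulas quoted above for the "perfectly unpredictable" case; the function names are made up for illustration and are not part of the compiler.

```go
package main

import (
	"fmt"
	"math"
)

// jumpTableMispredicts evaluates the commit message's estimate for a jump
// table with n equally likely, unpredictable cases: 1 - 1/N.
// (Hypothetical helper name, for illustration only.)
func jumpTableMispredicts(n int) float64 {
	return 1 - 1/float64(n)
}

// binarySearchMispredicts evaluates the commit message's estimate for the
// binary search scheme: lg(N)/2, i.e. about half of the lg(N) branches
// taken on the way down are mispredicted.
// (Hypothetical helper name, for illustration only.)
func binarySearchMispredicts(n int) float64 {
	return math.Log2(float64(n)) / 2
}

func main() {
	// Print both estimates for a range of switch sizes.
	for _, n := range []int{2, 4, 8, 16, 32} {
		fmt.Printf("N=%2d  jump table=%.3f  binary search=%.3f\n",
			n, jumpTableMispredicts(n), binarySearchMispredicts(n))
	}
}
```

This models only the fully unpredictable case. As the commit message notes, real switches are seldom fully unpredictable, and jump tables also execute more instructions per switch, which is why a higher crossover point is used in practice.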
@mvdan posted a comment at #49688 (comment) asking for a tracking bug, and updates on what's going on. This is the right tracking bug. Apologies for the lack of updates; our progress has been very on-and-off with other things going on. The current status here is:
|
Go already supports PGO. Why not close this issue? |
This is an umbrella issue for supporting feedback-guided optimization. This has been discussed offhand in a few other contexts (inlining decisions: #17566, code layout decisions: #20356, language changes: #24204, stack growth: #18138).
It is not clear what the design for FGO/PGO support might look like. In particular, what do the feedback/profile files look like?
@aclements observed during conversation at GopherCon 2018 that in order to preserve reproducible builds and efficiency, the feedback file needs to be hashed by cmd/go along with all other inputs, which suggests that it ought to be committed to version control.
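To make the hashing point concrete, here is a rough sketch of how a feedback file could be folded into a build's input hash. This is not how cmd/go actually computes its cache keys; `actionID` and the file name `feedback.prof` are hypothetical, and the only point is that changing the profile changes the hash while identical inputs reproduce it.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// actionID is a hypothetical helper (not cmd/go's real logic) that hashes
// every build input, including any feedback file, into a single key.
func actionID(inputs map[string][]byte) string {
	// Sort names so the hash does not depend on map iteration order.
	names := make([]string, 0, len(inputs))
	for name := range inputs {
		names = append(names, name)
	}
	sort.Strings(names)

	h := sha256.New()
	for _, name := range names {
		// Length-prefix the name and content so that adjacent fields
		// cannot be confused with one another.
		fmt.Fprintf(h, "%d:%s%d:", len(name), name, len(inputs[name]))
		h.Write(inputs[name])
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	src := []byte("package main\n")

	// Same source, with two different versions of a made-up feedback file.
	v1 := actionID(map[string][]byte{"main.go": src, "feedback.prof": []byte("profile v1")})
	v2 := actionID(map[string][]byte{"main.go": src, "feedback.prof": []byte("profile v2")})

	fmt.Println("profile v1:", v1)
	fmt.Println("profile v2:", v2)
	fmt.Println("same key:  ", v1 == v2)
}
```

Because the profile's contents feed the key, a build is only reproducible if everyone building uses the same profile bytes, which is the argument for committing the file to version control.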
cc @randall77 @ianlancetaylor @aclements @crawshaw @bradfitz @CAFxX (and the list could go on, but I'm sure everyone will find it)