-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cmd/compile: minimize morestack calls text footprint #29067
Comments
Note that on Intel processors in the absence of an entry in the branch target buffer backward branches are predicted as taken and forward branches are predicted as not taken. So on Intel processors there is a clear performance advantage to making the branch to be taken when more stack space is required be a forward branch rather than a backward branch. |
@ianlancetaylor I'm no expert in microarchitecture but this is what Agner writes about static prediction in "recent" archs:
I obviously can't guarantee the above is correct, still relevant to newer archs, or that there are no other factors that may cause regressions. It would have to be measured to see if it helps. |
@CAFxX I altered the function prologue a couple of years ago to be more static-branch-prediction friendly. At the time, I benchmarked it extensively, and found that on all architectures I had access to, including recently AMD64 processors, laying out the function prologue in the order we do affords a significant performance benefit. I don't have the commits handy, but there are hard numbers in the commit message. |
@josharian I suppose you are referring to 88f423e I have a question though: why would static prediction influence the runtime of super-hot functions that are being executed millions of times back-to-back during the benchmark? Why would there be no record for the branches in the BTB in that case? Wouldn't it be possible that the improvements were mostly due to better icache utilization (since, from a cursory check of the commit message, it seems cold code was moved out of the hot path)? Were you able to measure the two effects independently? |
This is more of a small papercut idea than a real issue, but every time I go through the disassembled output of objdump it comes back to mind, so I'm going to drop it here.
Currently the morestack check is laid out as follows:
if it was instead laid out like this:
we have one less jump (JMP2) and, more importantly, JMP1 can always use the more compact rel8 variant. The morestack block, in this case, would be placed in the padding between this and the previous function (because the morestack block of the previous function would have been moved as well, this should not significantly impact the overall amount of padding used)
For small functions (i.e. functions where rel8 is already used in both jumps) this would save 2 bytes per function, for functions where rel32 is used it would save 10 bytes per function. From a quick check there are ~1800 small functions and ~5000 big functions using morestack in the 1.11.2 windows compiler. This means that text size could decrease by ~86KB. This would help for #6853 and, hopefully, also improve icache hit rates.
Looking forward to when per-function metadata will make it possible to mostly eliminate morestack calls (#25999 (comment)), this could be further extended to have multiple entry points (so that we don't need to store multiple variants of functions for which we need both morestack and no morestack variants):
I'm not knowledgeable about the linker, so I don't really know how workable these two ideas are.
The text was updated successfully, but these errors were encountered: