cmd/compile: minimize morestack calls text footprint #29067

CAFxX · 2018-12-03T02:31:51Z

This is more of a small papercut idea than a real issue, but every time I go through the disassembled output of objdump it comes back to mind, so I'm going to drop it here.

Currently the morestack check is laid out as follows:

function entry point:
  check if enough stack
  if not enough stack, jump to morestack block # JMP1
  ...function body...
morestack block:
  call morestack
  jump to function entry point # JMP2

if it was instead laid out like this:

morestack block:
  call morestack
function entry point:
  check if enough stack
  if not enough stack, jump to morestack block
  ...function body...

we have one less jump (JMP2) and, more importantly, JMP1 can always use the more compact rel8 variant. The morestack block, in this case, would be placed in the padding between this and the previous function (because the morestack block of the previous function would have been moved as well, this should not significantly impact the overall amount of padding used)

For small functions (i.e. functions where rel8 is already used in both jumps) this would save 2 bytes per function, for functions where rel32 is used it would save 10 bytes per function. From a quick check there are ~1800 small functions and ~5000 big functions using morestack in the 1.11.2 windows compiler. This means that text size could decrease by ~86KB. This would help for #6853 and, hopefully, also improve icache hit rates.

Looking forward to when per-function metadata will make it possible to mostly eliminate morestack calls (#25999 (comment)), this could be further extended to have multiple entry points (so that we don't need to store multiple variants of functions for which we need both morestack and no morestack variants):

morestack block:
  call morestack
function entry point:
  check if enough stack
  if not enough stack, jump to morestack block
function entry point no-morestack:
  ...function body...

I'm not knowledgeable about the linker, so I don't really know how workable these two ideas are.

The text was updated successfully, but these errors were encountered:

ianlancetaylor · 2018-12-03T04:43:53Z

Note that on Intel processors in the absence of an entry in the branch target buffer backward branches are predicted as taken and forward branches are predicted as not taken. So on Intel processors there is a clear performance advantage to making the branch to be taken when more stack space is required be a forward branch rather than a backward branch.

CAFxX · 2018-12-03T05:12:12Z

@ianlancetaylor I'm no expert in microarchitecture but this is what Agner writes about static prediction in "recent" archs:

Static prediction in PM and Core2

These processors do not use static prediction. The predictor simply makes a random prediction the first time a branch is seen, depending on what happens to be in the BTB entry that is assigned to the new branch. There is simply a 50% chance of making the right prediction of jump or no jump, but the predicted target is correct. Branch hint prefixes have no useful effect on PM and Core2 processors.

Static prediction in AMD

A branch is predicted not taken the first time it is seen. A branch is predicted always taken after the first time it has been taken. Dynamic prediction is used only after a branch has been taken and then not taken. Branch hint prefixes have no effect.

I obviously can't guarantee the above is correct, still relevant to newer archs, or that there are no other factors that may cause regressions. It would have to be measured to see if it helps.

josharian · 2018-12-03T21:29:37Z

@CAFxX I altered the function prologue a couple of years ago to be more static-branch-prediction friendly. At the time, I benchmarked it extensively, and found that on all architectures I had access to, including recently AMD64 processors, laying out the function prologue in the order we do affords a significant performance benefit. I don't have the commits handy, but there are hard numbers in the commit message.

CAFxX · 2018-12-03T22:03:36Z

@josharian I suppose you are referring to 88f423e

I have a question though: why would static prediction influence the runtime of super-hot functions that are being executed millions of times back-to-back during the benchmark? Why would there be no record for the branches in the BTB in that case? Wouldn't it be possible that the improvements were mostly due to better icache utilization (since, from a cursory check of the commit message, it seems cold code was moved out of the hot path)? Were you able to measure the two effects independently?

CAFxX changed the title ~~Minimize morestack calls footprint~~ Minimize morestack calls text footprint Dec 3, 2018

agnivade changed the title ~~Minimize morestack calls text footprint~~ cmd/compile: minimize morestack calls text footprint Dec 3, 2018

agnivade added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Dec 3, 2018

agnivade added this to the Unplanned milestone Dec 3, 2018

CAFxX mentioned this issue Apr 16, 2019

cmd/compile, runtime: reduce function prologue overhead #31193

Open

gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 13, 2022

mknyszek added this to Go Compiler / Runtime Jul 13, 2022

mknyszek removed this from Go Compiler / Runtime Jul 13, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cmd/compile: minimize morestack calls text footprint #29067

cmd/compile: minimize morestack calls text footprint #29067

CAFxX commented Dec 3, 2018 •

edited

Loading

ianlancetaylor commented Dec 3, 2018

CAFxX commented Dec 3, 2018 •

edited

Loading

josharian commented Dec 3, 2018

CAFxX commented Dec 3, 2018 •

edited

Loading

cmd/compile: minimize morestack calls text footprint #29067

cmd/compile: minimize morestack calls text footprint #29067

Comments

CAFxX commented Dec 3, 2018 • edited Loading

ianlancetaylor commented Dec 3, 2018

CAFxX commented Dec 3, 2018 • edited Loading

josharian commented Dec 3, 2018

CAFxX commented Dec 3, 2018 • edited Loading

CAFxX commented Dec 3, 2018 •

edited

Loading

CAFxX commented Dec 3, 2018 •

edited

Loading

CAFxX commented Dec 3, 2018 •

edited

Loading