Stack usage issues (due to inlining, etc.) #13
The related gcc option is -fstack-usage (produces a .su file alongside the output .o). It's also possible to get a warning on too big stack usage, e.g.: -Wstack-usage=256.
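A minimal sketch for trying the two options, assuming a hypothetical file stack_demo.c (the file and function names are made up for illustration, not taken from the MicroPython tree). Compiling it with `gcc -c -fstack-usage -Wstack-usage=256 stack_demo.c` should write stack_demo.su next to stack_demo.o, with one line per function giving its frame size in bytes, and should warn about demo_fill_and_sum, because its buffer alone is larger than the 256-byte limit.

```c
/* stack_demo.c (hypothetical file name) */
#include <assert.h>
#include <stddef.h>

#define DEMO_BUF_SIZE 512

/* Fills the first `count` bytes of a local buffer and sums them.
 * The buffer is volatile so the compiler cannot optimize it away,
 * which keeps the 512 bytes visible in the reported frame size. */
unsigned demo_fill_and_sum(unsigned char seed, size_t count) {
    volatile unsigned char buf[DEMO_BUF_SIZE];
    unsigned sum = 0;

    assert(count <= DEMO_BUF_SIZE);
    for (size_t i = 0; i < count; i++) {
        buf[i] = (unsigned char)(seed + i);
    }
    for (size_t i = 0; i < count; i++) {
        sum += buf[i];
    }
    return sum;
}
```

The exact number reported depends on the target, ABI and optimization level, so the .su output is best compared between builds rather than read as an absolute.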
A case study: objstr.su:
This happens because static str_modulo_format() gets inlined into its single caller, mp_obj_str_binary_op(). It would seem that marking str_modulo_format() with MP_NOINLINE is the obvious fix. It helps, but the result is not uncontroversial:
So, if we call mp_obj_str_binary_op for something else than
This happens because gcc of course uses the live range packing technique, reusing stack slots for different local vars which can't be in use at the same time. Inlining opens more opportunities for such packing, hence the results above. So, there's no "absolute" rule of thumb for how to fix it; it's all compromises, as usual with optimization.
However, one "absolute" conclusion can be made from the above: the stack usage of mp_obj_str_binary_op() is really a problem. Even after deinlining str_modulo_format(), it's big. That's because it again packs together too much different code (switch branches), some of it with big allocations which apply only to a particular operation but affect the stack allocation of the entire func.
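To illustrate the packing point with a hypothetical function (none of these names come from MicroPython): the two scratch buffers below are never live at the same time, so gcc is free to give them the same stack slot, and with optimization enabled -fstack-usage would typically report a frame closer to 256 bytes than to 512. The flip side is the one described above: the frame is sized for the largest branch no matter which branch actually runs.

```c
#include <assert.h>
#include <string.h>

enum demo_transform_op { DEMO_OP_UPPER, DEMO_OP_REVERSE };

/* Writes a transformed copy of `in` to `out` and returns its length.
 * Each switch branch has its own 256-byte scratch buffer in its own scope. */
size_t demo_transform(enum demo_transform_op op, const char *in,
                      char *out, size_t out_len) {
    size_t n = strlen(in);
    assert(n < 256 && n < out_len);

    switch (op) {
    case DEMO_OP_UPPER: {
        char upper[256]; /* live only inside this branch */
        for (size_t i = 0; i < n; i++) {
            char c = in[i];
            upper[i] = (c >= 'a' && c <= 'z') ? (char)(c - 'a' + 'A') : c;
        }
        memcpy(out, upper, n);
        break;
    }
    case DEMO_OP_REVERSE: {
        char reversed[256]; /* may share a stack slot with `upper` */
        for (size_t i = 0; i < n; i++) {
            reversed[i] = in[n - 1 - i];
        }
        memcpy(out, reversed, n);
        break;
    }
    default:
        n = 0; /* unknown op: produce an empty string */
        break;
    }
    out[n] = '\0';
    return n;
}
```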
Well... https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html :
But it has minimal effect on the MicroPython codebase:
So, it chose not to inline str_modulo_format() into mp_obj_str_binary_op(), and also chose to split mp_obj_str_make_new() into 2 functions (which is unlikely to be a good move).
Another issue in stack usage is ABI stack alignment. By the latest bloat fashion, this is usually some multiple of the native word size of a platform. x86_32 is particularly affected, with the "modern ABI" having a stack alignment of 16, 4 times the native word width. Here's an example showing that such an alignment can easily cause ~2x more stack usage: a diff of -fstack-usage results for -mpreferred-stack-boundary=2 (4-byte stack alignment) vs -mpreferred-stack-boundary=4 (16-byte stack alignment, the default):
The GCC manual, e.g. https://gcc.gnu.org/onlinedocs/gcc-4.5.3/gcc/i386-and-x86_002d64-Options.html itself says:
Unfortunately, -mpreferred-stack-boundary= is really a (sub)target-specific option: for example, using it for x86_64 leads to a compile error saying that the supported values are 4..12.
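The alignment cost can be estimated with rounding arithmetic alone. This is only a rough model, assuming each frame is simply rounded up to the stack boundary; real frames also hold return addresses, saved registers and compiler-chosen padding. demo_align_up is a hypothetical helper, and the boundary values are the 4 and 16 bytes mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/* Rounds `size` up to the next multiple of `align`.
 * `align` must be a non-zero power of two (4 and 16 in the cases above). */
size_t demo_align_up(size_t size, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return (size + align - 1) & ~(align - 1);
}
```

Under this model a frame needing 20 bytes stays at 20 with 4-byte alignment but becomes 32 with 16-byte alignment, and a frame needing only 4 bytes becomes 16. Small frames are hit hardest, and those tend to dominate deep call chains.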
Another thing is that -fomit-frame-pointer apparently should be enabled. Again, it helps poor x86_32 to push ebp on the stack less often. But there could be a peculiar arch for which -fomit-frame-pointer is not implemented, thus leading to a build error.
I played with a few more gcc options related to the stack, but none of them helped with the above. So, the only solution seems to be to avoid defining any variables in "dispatcher" functions like *_binary_op, and instead just dispatch to a particular op handler, hoping for tail calls. This hoping works (for popular platforms). E.g.:
leads to good code like:
But doing it this way means duplicating the validation/arg type dispatching code. So again, what's better in a particular case needs to be considered. In the case of mp_obj_str_binary_op, it definitely makes sense to do that, because
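A hedged sketch of the dispatcher idea, not the code from this thread: every name below is hypothetical, and DEMO_NOINLINE assumes a GCC-compatible compiler that supports `__attribute__((noinline))`. The dispatcher declares no locals and every path ends in a call, so an optimizing compiler can turn those calls into jumps. The handlers are kept out of line, since a static function with a single caller would otherwise likely be inlined straight back into the dispatcher.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#if defined(__GNUC__)
#define DEMO_NOINLINE __attribute__((noinline))
#else
#define DEMO_NOINLINE
#endif

typedef enum { DEMO_BINARY_OP_ADD, DEMO_BINARY_OP_MODULO } demo_binary_op_t;

/* Cheap op: returns the length of the concatenation, needs no big locals. */
static DEMO_NOINLINE int demo_str_add(const char *lhs, const char *rhs) {
    assert(lhs != NULL && rhs != NULL); /* validation repeated per handler */
    return (int)(strlen(lhs) + strlen(rhs));
}

/* Expensive op: expands every '%' in `fmt` to `arg` in a scratch buffer and
 * returns the resulting length, or -1 if it does not fit. The 256 bytes are
 * taken from the stack only while this handler runs. */
static DEMO_NOINLINE int demo_str_modulo(const char *fmt, const char *arg) {
    char scratch[256];
    size_t used = 0;
    size_t arg_len;

    assert(fmt != NULL && arg != NULL); /* validation repeated per handler */
    arg_len = strlen(arg);
    for (const char *p = fmt; *p != '\0'; p++) {
        if (*p == '%') {
            if (used + arg_len >= sizeof(scratch)) return -1;
            memcpy(scratch + used, arg, arg_len);
            used += arg_len;
        } else {
            if (used + 1 >= sizeof(scratch)) return -1;
            scratch[used++] = *p;
        }
    }
    scratch[used] = '\0';
    return (int)strlen(scratch);
}

/* Dispatcher: no local variables, each branch is a candidate tail call. */
int demo_str_binary_op(demo_binary_op_t op, const char *lhs, const char *rhs) {
    switch (op) {
    case DEMO_BINARY_OP_ADD:
        return demo_str_add(lhs, rhs);
    case DEMO_BINARY_OP_MODULO:
        return demo_str_modulo(lhs, rhs);
    default:
        return -1; /* unsupported op */
    }
}
```

Whether the calls really become jumps depends on the optimization level and the target, so it is worth checking the generated assembly and the .su output rather than assuming it.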
But it has a controversial effect on the uPy codebase. First of all, with unwind table bloat not disabled, it has a horrendous effect on x86_32:
With crap disabled, it's more tolerable, though still bloating:
It has no effect on arm-linux-gnueabihf-gcc (because it's the default?).
Master has issues with pretty big stack usage in some (many?) cases. This is due to the fact that modern compilers prefer (are only able?) to allocate a single big static stack frame for the entire function at once, instead of growing/shrinking the frame as they go through the various subblocks of a function. For example, for
1024 + 4 bytes of stack will always be allocated, even though, as suggested by the code, 1024 of those will be needed rarely. It would be smarter to allocate those 1024 only when control flow goes into the "if" body, but this seems to be a lost skill for modern gccs and clangs. (Indeed, adjusting the stack pointer in multiple places would take more code and more cycles, so they optimized it to allocate the frame once at function start, ad absurdum, as the code above shows.)
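A hypothetical pair of functions with the shape described above (all names are made up; only the 1024-byte buffer inside an "if" follows the description). The first keeps the buffer in the "if" body, where it typically still ends up in the single frame reserved on entry. The second moves the rare work into a helper that is kept out of line, assuming a compiler that supports `__attribute__((noinline))`, so the 1024 bytes are taken only while the helper runs. How much is actually reserved depends on the compiler and optimization level; -fstack-usage shows the real numbers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#if defined(__GNUC__)
#define DEMO_NOINLINE __attribute__((noinline))
#else
#define DEMO_NOINLINE
#endif

/* Variant 1: the big buffer is scoped to the "if", but usually still
 * becomes part of the one frame allocated at function start. */
size_t demo_checksum_inline(const unsigned char *data, size_t len, int rare_path) {
    size_t sum = 0;
    assert(data != NULL || len == 0);
    if (rare_path) {
        unsigned char buf[1024];
        size_t n = len < sizeof(buf) ? len : sizeof(buf);
        memcpy(buf, data, n);
        for (size_t i = 0; i < n; i++) {
            sum += buf[i];
        }
    }
    return sum;
}

/* Variant 2: the same rare work in an out-of-line helper. */
static DEMO_NOINLINE size_t demo_rare_sum(const unsigned char *data, size_t len) {
    unsigned char buf[1024];
    size_t n = len < sizeof(buf) ? len : sizeof(buf);
    size_t sum = 0;
    memcpy(buf, data, n);
    for (size_t i = 0; i < n; i++) {
        sum += buf[i];
    }
    return sum;
}

size_t demo_checksum_split(const unsigned char *data, size_t len, int rare_path) {
    assert(data != NULL || len == 0);
    return rare_path ? demo_rare_sum(data, len) : 0;
}
```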
This is all aggravated by function inlining. As it's again a code size saving measure, it happens not just for functions explicitly marked inline, but may also happen for static functions (for example, it would always happen for a static function with a single caller).
Many times, this is a bit less serious than completely grave, because other stack-hungry functions aren't called from such functions. But they always can be: e.g. an interrupt can occur while inside such a stack-hoarding function, or the function may deal with a user (Python) type and call a Python-level method, leading to really huge stack usage.