Move large uop bodies into functions. #117224
Most uops do not change the frame, so they can use the form:

```c
frame = _UOP_func(tstate, frame, stack_pointer, [oparg]);
stack_pointer += ...;
```
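As a rough sketch of that shape (assuming CPython's internal headers; the uop name, signature details, and stack effect are made up for illustration, not real generated code):

```c
/* Hypothetical helper holding the body of one uop.  Most uops leave
   `frame` unchanged, so returning it keeps every call site uniform. */
static _PyInterpreterFrame *
_Py_FOO_func(PyThreadState *tstate, _PyInterpreterFrame *frame,
             PyObject **stack_pointer, int oparg)
{
    /* ... original uop body, reading and writing values relative to
       stack_pointer ... */
    return frame;
}

/* Call site emitted by the code generator: only the uop's net stack
   effect stays inline (here: pop oparg values, push one result). */
frame = _Py_FOO_func(tstate, frame, stack_pointer, oparg);
stack_pointer += 1 - oparg;
```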
Some experimentation notes. Just moving all uops greater than 200 bytes (39 uops): ['_LIST_EXTEND', '_CONTAINS_OP_DICT', '_CALL_METHOD_DESCRIPTOR_NOARGS', '_INIT_CALL_PY_EXACT_ARGS_0', '_INIT_CALL_PY_EXACT_ARGS_1', '_LOAD_ATTR', '_STORE_SLICE', '_BUILD_SLICE', '_BINARY_SUBSCR_LIST_INT', '_STORE_SUBSCR_LIST_INT', '_CALL_BUILTIN_FAST', '_COMPARE_OP_INT', '_BINARY_SUBSCR_DICT', '_BINARY_SUBSCR_STR_INT', '_BINARY_SUBSCR_TUPLE_INT', '_CALL_ISINSTANCE', '_CALL_LEN', '_CONTAINS_OP_SET', '_CALL_BUILTIN_CLASS', '_CALL_BUILTIN_O', '_CALL_METHOD_DESCRIPTOR_FAST', '_CALL_METHOD_DESCRIPTOR_O', '_COMPARE_OP', '_FOR_ITER_TIER_TWO', '_INIT_CALL_PY_EXACT_ARGS', '_INIT_CALL_PY_EXACT_ARGS_2', '_INIT_CALL_PY_EXACT_ARGS_3', '_INIT_CALL_PY_EXACT_ARGS_4', '_LOAD_GLOBAL', '_STORE_SUBSCR', '_BUILD_MAP', '_LOAD_SUPER_ATTR_METHOD', '_BUILD_STRING', '_CALL_BUILTIN_FAST_WITH_KEYWORDS', '_BUILD_CONST_KEY_MAP', '_CALL_METHOD_DESCRIPTOR_FAST_WITH_KEYWORDS', '_GET_ANEXT', '_STORE_NAME', '_COMPARE_OP_FLOAT']
All uops greater than 300 bytes (16 uops): ['_CALL_METHOD_DESCRIPTOR_NOARGS', '_BUILD_SLICE', '_CALL_BUILTIN_FAST', '_BINARY_SUBSCR_STR_INT', '_CALL_ISINSTANCE', '_CALL_LEN', '_CALL_BUILTIN_CLASS', '_CALL_BUILTIN_O', '_CALL_METHOD_DESCRIPTOR_FAST', '_CALL_METHOD_DESCRIPTOR_O', '_INIT_CALL_PY_EXACT_ARGS', '_INIT_CALL_PY_EXACT_ARGS_4', '_LOAD_GLOBAL', '_LOAD_SUPER_ATTR_METHOD', '_CALL_BUILTIN_FAST_WITH_KEYWORDS', '_CALL_METHOD_DESCRIPTOR_FAST_WITH_KEYWORDS']

Moving all uops greater than 350 bytes also has a net-zero effect. (A threshold of 400 would be equivalent to […].)

Looking at both code size and execution counts together may be a better metric. The dot on the far right is […].

As a first rough cut, I tried taking the uops that are in the top 25%-ile in code size and the bottom 25%-ile in frequency, which gives a set of 10 opcodes. This also has a net-zero effect on runtime.

25%-ile in both size and execution counts (10 uops): ['_CALL_METHOD_DESCRIPTOR_O', '_INIT_CALL_PY_EXACT_ARGS', '_INIT_CALL_PY_EXACT_ARGS_3', '_BUILD_MAP', '_LOAD_SUPER_ATTR_METHOD', '_BUILD_STRING', '_CALL_BUILTIN_FAST_WITH_KEYWORDS', '_LOAD_ATTR_MODULE', '_BUILD_CONST_KEY_MAP', '_STORE_NAME']

I have a couple of theories about why none of this is working. Perhaps the code size has to be pretty large before the overhead of the function call becomes worth it (and […]). Also, I wonder if part of what is "hurting" with these additional functions is that they have exits (they have […], and I only included the exit cases that were actually used in each uop). This means that each exit has an additional branch; I don't know how much that matters. There may be a better "trick" to make this work -- I have to admit that low-level C tricks aren't my forte. If there's a better way to "jump directly" somehow, let me know.

The "simple" (non-exiting) uops are marked in yellow here: […]

I thought I would try only including uops that don't have exits, but not worrying as much about code size. If I take non-exiting uops with counts of less than 10^8 and code sizes larger than 100, I get: ['_INIT_CALL_BOUND_METHOD_EXACT_ARGS', '_INIT_CALL_PY_EXACT_ARGS', '_COPY_FREE_VARS', '_SET_FUNCTION_ATTRIBUTE', '_COMPARE_OP_FLOAT']. That is 1% faster, and notably is faster for every interpreter-heavy benchmark. @gvanrossum observed that the main thing that bulks out […]
Conclusion: It seems like externalizing these uops makes things the fastest of anything I tried. There may be a trick to making uops-with-exits fast enough to externalize as well (@brandtbucher, thoughts?). That could also be left as follow-on work.
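For reference, the straightforward way to externalize a uop that does have exits is for the helper to report a status that the caller branches on, which is where the extra per-exit branch mentioned above comes from. A rough sketch (the names and the exit protocol are assumptions for illustration, not the code that was benchmarked):

```c
/* Hypothetical status returned instead of jumping from inside the helper. */
typedef enum { UOP_OK, UOP_DEOPT, UOP_ERROR } uop_result;

static uop_result
_Py_BAR_func(PyThreadState *tstate, _PyInterpreterFrame *frame,
             PyObject **stack_pointer, int oparg)
{
    /* ... uop body; a guard failure cannot `goto` a label in the caller,
       so it is reported through the return value instead ... */
    return UOP_OK;
}

/* Call site: each possible exit becomes a branch the caller must test. */
switch (_Py_BAR_func(tstate, frame, stack_pointer, oparg)) {
    case UOP_DEOPT: goto deoptimize;
    case UOP_ERROR: goto error;
    case UOP_OK:    break;
}
stack_pointer += 1 - oparg;  /* example net stack effect */
```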
Haven't thought about it too much, but one option could be to pass the continuation and exit targets into the helper and tail-call whichever one it returns:

```c
PyAPI_DATA(void) _JIT_CONTINUE;
PyAPI_DATA(void) _JIT_EXIT_TARGET;

/* The helper does the uop's work and returns the target to jump to:
   the continuation on success, or the exit target on deopt. */
jit_func next = _Py_FOO_func(tstate, frame, stack_pointer, CURRENT_OPARG(),
                             &_JIT_CONTINUE, &_JIT_EXIT_TARGET);
stack_pointer += ...;
/* Tail-call the chosen target so no extra branch or stack frame remains. */
__attribute__((musttail))
return ((jit_func)next)(frame, stack_pointer, tstate);
```
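For context, `jit_func` here would be the JIT's existing entry-point type for a compiled stencil; roughly the following (my assumption about its current shape, chosen to match the argument order in the tail call above):

```c
typedef _Py_CODEUNIT *(*jit_func)(_PyInterpreterFrame *frame,
                                  PyObject **stack_pointer,
                                  PyThreadState *tstate);
```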
At @brandtbucher's suggestion, I instead tried large uops that are more frequent, giving the set […]. I think the next thing to tackle will be a faster way to handle exits, which should make the set of uops that we can externalize larger.
An additional motivation for doing this, apart from the impact on the JIT, is the ability to analyze the micro-ops.
Many of the micro-op bodies are quite large, and are likely to bloat jitted code, harming performance.
We should move these larger bodies into helper functions in the tier2 code generator.
For example, the generated code for `_INIT_CALL_PY_EXACT_ARGS` looks like this: […]

By moving the bulk of this into a helper function, we can generate the much shorter: […]

with the helper function: […]
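The snippets referenced above aren't reproduced here, but the intended shape is roughly the following; the helper's name, signature, and exact inputs are illustrative rather than real cases-generator output:

```c
/* Illustrative prototype for a helper holding the bulk of the uop body:
   it builds the callee's frame and copies the arguments into it. */
static _PyInterpreterFrame *
_INIT_CALL_PY_EXACT_ARGS_func(PyThreadState *tstate, PyObject *callable,
                              PyObject *self_or_null, PyObject **args,
                              int oparg);

/* The generated case body then shrinks to reading its inputs off the
   evaluation stack, one call, and pushing the resulting frame. */
new_frame = _INIT_CALL_PY_EXACT_ARGS_func(tstate, callable, self_or_null,
                                          args, oparg);
```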