
Move large uop bodies into functions. #117224

Open
markshannon opened this issue Mar 25, 2024 · 6 comments
Labels: interpreter-core (Objects, Python, Grammar, and Parser dirs), performance (Performance or resource usage)

@markshannon
Member

markshannon commented Mar 25, 2024

Many of the micro-op bodies are quite large, and are likely to bloat jitted code, harming performance.

We should move these larger bodies into helper functions in the tier2 code generator.

For example, the generator code for _INIT_CALL_PY_EXACT_ARGS looks like this:

```c
case _INIT_CALL_PY_EXACT_ARGS: {
    PyObject **args;
    PyObject *self_or_null;
    PyObject *callable;
    _PyInterpreterFrame *new_frame;
    oparg = CURRENT_OPARG();
    args = &stack_pointer[-oparg];
    self_or_null = stack_pointer[-1 - oparg];
    callable = stack_pointer[-2 - oparg];
    int has_self = (self_or_null != NULL);
    STAT_INC(CALL, hit);
    PyFunctionObject *func = (PyFunctionObject *)callable;
    new_frame = _PyFrame_PushUnchecked(tstate, func, oparg + has_self);
    PyObject **first_non_self_local = new_frame->localsplus + has_self;
    new_frame->localsplus[0] = self_or_null;
    for (int i = 0; i < oparg; i++) {
        first_non_self_local[i] = args[i];
    }
    stack_pointer[-2 - oparg] = (PyObject *)new_frame;
    stack_pointer += -1 - oparg;
    break;
}
```

By moving the bulk of this into a helper function, we can generate the much shorter:

```c
case _INIT_CALL_PY_EXACT_ARGS: {
    stack_pointer = _INIT_CALL_PY_EXACT_ARGS_func(tstate, frame, stack_pointer, oparg);
    break;
}
```

with the helper function:

```c
PyObject **
_INIT_CALL_PY_EXACT_ARGS_func(PyThreadState *tstate, _PyInterpreterFrame *frame,
                              PyObject **stack_pointer, int oparg)
{
    PyObject **args;
    PyObject *self_or_null;
    PyObject *callable;
    _PyInterpreterFrame *new_frame;
    args = &stack_pointer[-oparg];
    self_or_null = stack_pointer[-1 - oparg];
    callable = stack_pointer[-2 - oparg];
    int has_self = (self_or_null != NULL);
    STAT_INC(CALL, hit);
    PyFunctionObject *func = (PyFunctionObject *)callable;
    new_frame = _PyFrame_PushUnchecked(tstate, func, oparg + has_self);
    PyObject **first_non_self_local = new_frame->localsplus + has_self;
    new_frame->localsplus[0] = self_or_null;
    for (int i = 0; i < oparg; i++) {
        first_non_self_local[i] = args[i];
    }
    stack_pointer[-2 - oparg] = (PyObject *)new_frame;
    stack_pointer += -1 - oparg;
    return stack_pointer;
}
```


@markshannon
Member Author

Most uops do not change the frame, so the form `stack_pointer = _UOP_func(tstate, frame, stack_pointer, [oparg]);` works.
If the frame is changed, then we can change the form to

```c
frame = _UOP_func(tstate, frame, stack_pointer, [oparg]);
stack_pointer += ...;
```

@mdboom
Contributor

mdboom commented Apr 4, 2024

Some experimentation notes:

Just moving _INIT_CALL_PY_EXACT_ARGS gives a 1% overall speedup. Trying to move more uops than that, I'm struggling to find any further improvement.

  • Moving all uops greater than 200 bytes (on x86) has net zero effect.
All uops greater than 200 bytes

39: ['_LIST_EXTEND', '_CONTAINS_OP_DICT', '_CALL_METHOD_DESCRIPTOR_NOARGS', '_INIT_CALL_PY_EXACT_ARGS_0', '_INIT_CALL_PY_EXACT_ARGS_1', '_LOAD_ATTR', '_STORE_SLICE', '_BUILD_SLICE', '_BINARY_SUBSCR_LIST_INT', '_STORE_SUBSCR_LIST_INT', '_CALL_BUILTIN_FAST', '_COMPARE_OP_INT', '_BINARY_SUBSCR_DICT', '_BINARY_SUBSCR_STR_INT', '_BINARY_SUBSCR_TUPLE_INT', '_CALL_ISINSTANCE', '_CALL_LEN', '_CONTAINS_OP_SET', '_CALL_BUILTIN_CLASS', '_CALL_BUILTIN_O', '_CALL_METHOD_DESCRIPTOR_FAST', '_CALL_METHOD_DESCRIPTOR_O', '_COMPARE_OP', '_FOR_ITER_TIER_TWO', '_INIT_CALL_PY_EXACT_ARGS', '_INIT_CALL_PY_EXACT_ARGS_2', '_INIT_CALL_PY_EXACT_ARGS_3', '_INIT_CALL_PY_EXACT_ARGS_4', '_LOAD_GLOBAL', '_STORE_SUBSCR', '_BUILD_MAP', '_LOAD_SUPER_ATTR_METHOD', '_BUILD_STRING', '_CALL_BUILTIN_FAST_WITH_KEYWORDS', '_BUILD_CONST_KEY_MAP', '_CALL_METHOD_DESCRIPTOR_FAST_WITH_KEYWORDS', '_GET_ANEXT', '_STORE_NAME', '_COMPARE_OP_FLOAT']

  • Moving all uops greater than 300 bytes (on x86) has net zero effect.
All uops greater than 300 bytes

16: ['_CALL_METHOD_DESCRIPTOR_NOARGS', '_BUILD_SLICE', '_CALL_BUILTIN_FAST', '_BINARY_SUBSCR_STR_INT', '_CALL_ISINSTANCE', '_CALL_LEN', '_CALL_BUILTIN_CLASS', '_CALL_BUILTIN_O', '_CALL_METHOD_DESCRIPTOR_FAST', '_CALL_METHOD_DESCRIPTOR_O', '_INIT_CALL_PY_EXACT_ARGS', '_INIT_CALL_PY_EXACT_ARGS_4', '_LOAD_GLOBAL', '_LOAD_SUPER_ATTR_METHOD', '_CALL_BUILTIN_FAST_WITH_KEYWORDS', '_CALL_METHOD_DESCRIPTOR_FAST_WITH_KEYWORDS']

Moving all uops greater than 350 bytes also has a net zero effect.

(A threshold of 400 would be equivalent to _INIT_CALL_PY_EXACT_ARGS alone, which we've already seen improves speed by 1%).

Looking at both code size and execution counts together may be a better metric:

[figure: scatter plot of uop code size vs. execution count]

The dot on the far right is _INIT_CALL_PY_EXACT_ARGS.

As a first rough cut, I tried taking the uops that are in the top 25%-ile in code size and the bottom 25%-ile in frequency, which gives us a set of 10 opcodes. This also has a net-zero effect on runtime.

25%-ile in both size and execution counts

[figure: scatter plot highlighting uops in the top 25%-ile of code size and bottom 25%-ile of execution count]

10: ['_CALL_METHOD_DESCRIPTOR_O', '_INIT_CALL_PY_EXACT_ARGS', '_INIT_CALL_PY_EXACT_ARGS_3', '_BUILD_MAP', '_LOAD_SUPER_ATTR_METHOD', '_BUILD_STRING', '_CALL_BUILTIN_FAST_WITH_KEYWORDS', '_LOAD_ATTR_MODULE', '_BUILD_CONST_KEY_MAP', '_STORE_NAME']

I have a couple of theories about why none of this is working. Perhaps the code size has to be pretty large before the overhead of the function call becomes worth it (and _INIT_CALL_PY_EXACT_ARGS is a real outlier in code size).

Also, I wonder if part of what is "hurting" with these additional functions is that they have exits (DEOPT_IF and/or EXIT_IF), unlike _INIT_CALL_PY_EXACT_ARGS. Since the function cannot just jump to a label in the caller (as would normally happen in a uop), I was handling this by returning 0, 1 or 2 from the function and handling the result in the caller:

```c
switch (result) {
case 0:
    break;
case 1:
    JUMP_TO_JUMP_TARGET();
case 2:
    JUMP_TO_ERROR();
}
```

(and only including the cases that were actually used in each uop). This means that each exit takes an additional branch; I don't know how much that matters. There may be a better "trick" to make this work -- I have to admit that low-level C tricks aren't my forte. If there's a better way to "jump directly" somehow, let me know.

The "simple" (non-exiting) uops are marked in yellow here:

[figure: same scatter plot with the "simple" (non-exiting) uops marked in yellow]

I thought I would try only including uops that don't have exits, without worrying as much about code size. If I take non-exiting uops with counts of less than 10^8 and code sizes larger than 100 bytes, I get: ['_INIT_CALL_BOUND_METHOD_EXACT_ARGS', '_INIT_CALL_PY_EXACT_ARGS', '_COPY_FREE_VARS', '_SET_FUNCTION_ATTRIBUTE', '_COMPARE_OP_FLOAT']. That is 1% faster, and notably is faster on every interpreter-heavy benchmark.

Perhaps a good selection of non-exiting uops

[figure: scatter plot highlighting this selection of non-exiting uops]

@gvanrossum observed that the main thing that bulks out _INIT_CALL_PY_EXACT_ARGS is the static inlining of _PyFrame_PushUnchecked. If all of this experimentation still points to "only externalizing _INIT_CALL_PY_EXACT_ARGS helps", it probably makes sense to just not inline _PyFrame_PushUnchecked in the context of compiling the JIT templates, which could probably be done with a couple lines of C-preprocessor hackery. That would be much simpler than modifying the generator to generate these "external" uops. UPDATE: This is not noticeably faster.

@mdboom
Contributor

mdboom commented Apr 4, 2024

Conclusion: Externalizing these uops makes things the fastest of anything I tried
(these are non-exiting uops, not too frequently executed, and with bodies more than double the size of the code needed to make the function call): ['_INIT_CALL_BOUND_METHOD_EXACT_ARGS', '_INIT_CALL_PY_EXACT_ARGS', '_COPY_FREE_VARS', '_SET_FUNCTION_ATTRIBUTE', '_COMPARE_OP_FLOAT'].

There may be a trick to making uops-with-exits fast enough to externalize as well (@brandtbucher, thoughts?). That could also be left as follow-on work.

@brandtbucher
Member

brandtbucher commented Apr 5, 2024

> This means that each exit has an additional branch ... I don't know how much that matters. There may be a better "trick" to make this work -- I have to admit that low-level C tricks aren't my forte. If there's a better way to "jump directly" somehow let me know.

Haven't thought about it too much, but one option could be to pass _JIT_CONTINUE and _JIT_JUMP_TARGET/_JIT_ERROR_TARGET into the function, and have it return the one to jump to:

```c
PyAPI_DATA(void) _JIT_CONTINUE;
PyAPI_DATA(void) _JIT_EXIT_TARGET;
jit_func next = _Py_FOO_func(tstate, frame, stack_pointer, CURRENT_OPARG(),
                             &_JIT_CONTINUE, &_JIT_EXIT_TARGET);
stack_pointer += ...;
__attribute__((musttail))
return ((jit_func)next)(frame, stack_pointer, tstate);
```

@mdboom
Contributor

mdboom commented Apr 9, 2024

At @brandtbucher's suggestion, I instead tried large uops that are more frequent, giving the set _INIT_CALL_PY_EXACT_ARGS, _COMPARE_OP_FLOAT (there just aren't that many large ones without exits). This, unfortunately, made no net change.

I think the next thing to tackle will be a faster way to handle exits which should make the set of uops that we can externalize larger.

@markshannon
Member Author

An additional motivation for doing this, apart from the impact on the JIT, is the ability to analyze the micro-ops.
Large micro-ops often have complex control flow, which makes it hard to analyze the code for stack spilling and top-of-stack caching.

brandtbucher pushed a commit to brandtbucher/cpython that referenced this issue Aug 7, 2024
blhsing pushed a commit to blhsing/cpython that referenced this issue Aug 22, 2024
@JacobCoffee JacobCoffee added performance Performance or resource usage interpreter-core (Objects, Python, Grammar, and Parser dirs) labels Oct 1, 2024