
Move large uop bodies into functions. #117224

Open
markshannon opened this issue Mar 25, 2024 · 6 comments
Labels: interpreter-core (Objects, Python, Grammar, and Parser dirs), performance (Performance or resource usage)

@markshannon
Member

markshannon commented Mar 25, 2024

Many of the micro-op bodies are quite large, and are likely to bloat jitted code, harming performance.

We should move these larger bodies into helper functions in the tier2 code generator.

For example, the generator code for _INIT_CALL_PY_EXACT_ARGS looks like this:

```c
case _INIT_CALL_PY_EXACT_ARGS: {
    PyObject **args;
    PyObject *self_or_null;
    PyObject *callable;
    _PyInterpreterFrame *new_frame;
    oparg = CURRENT_OPARG();
    args = &stack_pointer[-oparg];
    self_or_null = stack_pointer[-1 - oparg];
    callable = stack_pointer[-2 - oparg];
    int has_self = (self_or_null != NULL);
    STAT_INC(CALL, hit);
    PyFunctionObject *func = (PyFunctionObject *)callable;
    new_frame = _PyFrame_PushUnchecked(tstate, func, oparg + has_self);
    PyObject **first_non_self_local = new_frame->localsplus + has_self;
    new_frame->localsplus[0] = self_or_null;
    for (int i = 0; i < oparg; i++) {
        first_non_self_local[i] = args[i];
    }
    stack_pointer[-2 - oparg] = (PyObject *)new_frame;
    stack_pointer += -1 - oparg;
    break;
}
```

By moving the bulk of this into a helper function, we can generate the much shorter:

```c
case _INIT_CALL_PY_EXACT_ARGS: {
    stack_pointer = _INIT_CALL_PY_EXACT_ARGS_func(tstate, frame, stack_pointer, oparg);
    break;
}
```

with the helper function:

```c
PyObject **
_INIT_CALL_PY_EXACT_ARGS_func(PyThreadState *tstate, _PyInterpreterFrame *frame,
                              PyObject **stack_pointer, int oparg)
{
    PyObject **args;
    PyObject *self_or_null;
    PyObject *callable;
    _PyInterpreterFrame *new_frame;
    args = &stack_pointer[-oparg];
    self_or_null = stack_pointer[-1 - oparg];
    callable = stack_pointer[-2 - oparg];
    int has_self = (self_or_null != NULL);
    STAT_INC(CALL, hit);
    PyFunctionObject *func = (PyFunctionObject *)callable;
    new_frame = _PyFrame_PushUnchecked(tstate, func, oparg + has_self);
    PyObject **first_non_self_local = new_frame->localsplus + has_self;
    new_frame->localsplus[0] = self_or_null;
    for (int i = 0; i < oparg; i++) {
        first_non_self_local[i] = args[i];
    }
    stack_pointer[-2 - oparg] = (PyObject *)new_frame;
    stack_pointer += -1 - oparg;
    return stack_pointer;
}
```


@markshannon
Member Author

Most uops do not change the frame, so the form `stack_pointer = _UOP_func(tstate, frame, stack_pointer, [oparg]);` works.
If the frame is changed, then we can change the form to

```c
frame = _UOP_func(tstate, frame, stack_pointer, [oparg]);
stack_pointer += ...;
```

@mdboom
Contributor

mdboom commented Apr 4, 2024

Some experimentation notes:

Just moving _INIT_CALL_PY_EXACT_ARGS gives a 1% overall speedup. Trying to move more uops than that, I'm struggling to find any further improvement.

  • Moving all uops greater than 200 bytes (on x86) has net zero effect.
All uops greater than 200 bytes

39: ['_LIST_EXTEND', '_CONTAINS_OP_DICT', '_CALL_METHOD_DESCRIPTOR_NOARGS', '_INIT_CALL_PY_EXACT_ARGS_0', '_INIT_CALL_PY_EXACT_ARGS_1', '_LOAD_ATTR', '_STORE_SLICE', '_BUILD_SLICE', '_BINARY_SUBSCR_LIST_INT', '_STORE_SUBSCR_LIST_INT', '_CALL_BUILTIN_FAST', '_COMPARE_OP_INT', '_BINARY_SUBSCR_DICT', '_BINARY_SUBSCR_STR_INT', '_BINARY_SUBSCR_TUPLE_INT', '_CALL_ISINSTANCE', '_CALL_LEN', '_CONTAINS_OP_SET', '_CALL_BUILTIN_CLASS', '_CALL_BUILTIN_O', '_CALL_METHOD_DESCRIPTOR_FAST', '_CALL_METHOD_DESCRIPTOR_O', '_COMPARE_OP', '_FOR_ITER_TIER_TWO', '_INIT_CALL_PY_EXACT_ARGS', '_INIT_CALL_PY_EXACT_ARGS_2', '_INIT_CALL_PY_EXACT_ARGS_3', '_INIT_CALL_PY_EXACT_ARGS_4', '_LOAD_GLOBAL', '_STORE_SUBSCR', '_BUILD_MAP', '_LOAD_SUPER_ATTR_METHOD', '_BUILD_STRING', '_CALL_BUILTIN_FAST_WITH_KEYWORDS', '_BUILD_CONST_KEY_MAP', '_CALL_METHOD_DESCRIPTOR_FAST_WITH_KEYWORDS', '_GET_ANEXT', '_STORE_NAME', '_COMPARE_OP_FLOAT']

  • Moving all uops greater than 300 bytes (on x86) has net zero effect.
All uops greater than 300 bytes

16: ['_CALL_METHOD_DESCRIPTOR_NOARGS', '_BUILD_SLICE', '_CALL_BUILTIN_FAST', '_BINARY_SUBSCR_STR_INT', '_CALL_ISINSTANCE', '_CALL_LEN', '_CALL_BUILTIN_CLASS', '_CALL_BUILTIN_O', '_CALL_METHOD_DESCRIPTOR_FAST', '_CALL_METHOD_DESCRIPTOR_O', '_INIT_CALL_PY_EXACT_ARGS', '_INIT_CALL_PY_EXACT_ARGS_4', '_LOAD_GLOBAL', '_LOAD_SUPER_ATTR_METHOD', '_CALL_BUILTIN_FAST_WITH_KEYWORDS', '_CALL_METHOD_DESCRIPTOR_FAST_WITH_KEYWORDS']

Moving all uops greater than 350 bytes also has a net zero effect.

(A threshold of 400 would be equivalent to _INIT_CALL_PY_EXACT_ARGS alone, which we've already seen improves speed by 1%).

Looking at both code size and execution counts together may be a better metric:

[figure: scatter plot of uop code size vs. execution count]

The dot on the far right is _INIT_CALL_PY_EXACT_ARGS.

As a first rough cut, I tried taking the uops that are in the top 25%-ile in code size and the bottom 25%-ile in frequency, which gives us a set of 10 opcodes. This also has a net-zero effect on runtime.

25%-ile in both size and execution counts

[figure: scatter plot highlighting uops in the top 25%-ile of code size and bottom 25%-ile of execution count]

10: ['_CALL_METHOD_DESCRIPTOR_O', '_INIT_CALL_PY_EXACT_ARGS', '_INIT_CALL_PY_EXACT_ARGS_3', '_BUILD_MAP', '_LOAD_SUPER_ATTR_METHOD', '_BUILD_STRING', '_CALL_BUILTIN_FAST_WITH_KEYWORDS', '_LOAD_ATTR_MODULE', '_BUILD_CONST_KEY_MAP', '_STORE_NAME']

I have a couple of theories about why none of this is working. Perhaps the code size has to be pretty large before the overhead of the function call becomes worth it (and _INIT_CALL_PY_EXACT_ARGS is a real outlier in code size).

Also, I wonder if part of what is "hurting" with these additional functions is that they have exits (DEOPT_IF and/or EXIT_IF), unlike _INIT_CALL_PY_EXACT_ARGS. Since the function cannot just jump to a label in the caller (as would normally happen in a uop), I was handling this by returning 0, 1 or 2 from the function and handling the result in the caller:

```c
switch (result) {
case 0:
    break;
case 1:
    JUMP_TO_JUMP_TARGET();
case 2:
    JUMP_TO_ERROR();
}
```

(and only including the cases that were actually used in each uop). This means that each exit takes an additional branch; I don't know how much that matters. There may be a better "trick" to make this work -- I have to admit that low-level C tricks aren't my forte. If there's a better way to "jump directly" somehow, let me know.

The "simple" (non-exiting) uops are marked in yellow here:

[figure: same scatter plot with the "simple" (non-exiting) uops marked in yellow]

I thought I would try only including uops that don't have exits, without worrying as much about code size. If I take non-exiting uops with counts of less than 10^8 and code sizes larger than 100 bytes, I get: ['_INIT_CALL_BOUND_METHOD_EXACT_ARGS', '_INIT_CALL_PY_EXACT_ARGS', '_COPY_FREE_VARS', '_SET_FUNCTION_ATTRIBUTE', '_COMPARE_OP_FLOAT']. That is 1% faster, and notably is faster on every interpreter-heavy benchmark.

Perhaps a good selection of non-exiting uops

[figure: scatter plot highlighting this selection of non-exiting uops]

@gvanrossum observed that the main thing that bulks out _INIT_CALL_PY_EXACT_ARGS is the static inlining of _PyFrame_PushUnchecked. If all of this experimentation still points to "only externalizing _INIT_CALL_PY_EXACT_ARGS helps", it probably makes sense to just not inline _PyFrame_PushUnchecked in the context of compiling the JIT templates, which could probably be done with a couple lines of C-preprocessor hackery. That would be much simpler than modifying the generator to generate these "external" uops. UPDATE: This is not noticeably faster.

@mdboom
Contributor

mdboom commented Apr 4, 2024

Conclusion: Externalizing these uops makes things the fastest of anything I tried
(these are non-exiting uops, not too frequently executed, and with bodies more than double the size of the code needed to make the function call): ['_INIT_CALL_BOUND_METHOD_EXACT_ARGS', '_INIT_CALL_PY_EXACT_ARGS', '_COPY_FREE_VARS', '_SET_FUNCTION_ATTRIBUTE', '_COMPARE_OP_FLOAT'].

There may be a trick to making uops-with-exits fast enough to externalize as well (@brandtbucher, thoughts?). That could also be left as follow-on work.

@brandtbucher
Member

brandtbucher commented Apr 5, 2024

> This means that each exit has an additional branch ... I don't know how much that matters. There may be a better "trick" to make this work -- I have to admit that low-level C tricks aren't my forte. If there's a better way to "jump directly" somehow let me know.

Haven't thought about it too much, but one option could be to pass _JIT_CONTINUE and _JIT_JUMP_TARGET/_JIT_ERROR_TARGET into the function, and have it return the one to jump to:

```c
PyAPI_DATA(void) _JIT_CONTINUE;
PyAPI_DATA(void) _JIT_EXIT_TARGET;
jit_func next = _Py_FOO_func(tstate, frame, stack_pointer, CURRENT_OPARG(),
                             &_JIT_CONTINUE, &_JIT_EXIT_TARGET);
stack_pointer += ...;
__attribute__((musttail))
return ((jit_func)next)(frame, stack_pointer, tstate);
```

@mdboom
Contributor

mdboom commented Apr 9, 2024

At @brandtbucher's suggestion, I instead tried large uops that are more frequent, giving the set _INIT_CALL_PY_EXACT_ARGS, _COMPARE_OP_FLOAT (there just aren't that many large ones without exits). This, unfortunately, made no net change.

I think the next thing to tackle will be a faster way to handle exits which should make the set of uops that we can externalize larger.

@markshannon
Member Author

An additional motivation for doing this, apart from the impact on the JIT, is the ability to analyze the micro-ops.
Large micro-ops often have complex control flow, which makes it hard to analyze the code for stack spilling and top-of-stack caching.

brandtbucher pushed a commit to brandtbucher/cpython that referenced this issue Aug 7, 2024
blhsing pushed a commit to blhsing/cpython that referenced this issue Aug 22, 2024
@JacobCoffee JacobCoffee added performance Performance or resource usage interpreter-core (Objects, Python, Grammar, and Parser dirs) labels Oct 1, 2024