
Data structures and code for exits from executors. #644

Closed · markshannon opened this issue Jan 10, 2024 · 11 comments

@markshannon (Member) commented Jan 10, 2024

To support trace stitching we need data for exits embedded into trace objects.

First of all we need an exit struct:
https://github.com/faster-cpython/ideas/blob/main/3.13/engine.md#making-cold-exits-small lists the fields we need.

typedef struct _exit_data {
    _PyExecutorObject *executor; // The executor to jump to
    uint32_t target; // offset in the code object of the target code unit.
    int16_t hotness; // How hot this exit is. Start at a negative number and count up?
} _PyExitData;

Since all executors are now micro-ops based, we can get rid of the distinction between _PyExecutorObject and _PyUOpExecutorObject.
The new combined executor struct would look something like:

typedef struct _PyExecutorObject {
    PyObject_VAR_HEAD
    _PyVMData vm_data;
    union {
        void *jit_code; // Will be a function pointer
        _PyUOpInstruction *trace;
    };
    _PyExitData exits[1];
} _PyExecutorObject;
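
As a rough sketch (not part of the design above), the trailing exits array could be sized at allocation time using the usual one-element-trailing-array idiom; the allocate_executor name and exit_count parameter here are purely illustrative:

static _PyExecutorObject *
allocate_executor(int exit_count)
{
    // The struct declares exits[1], so allocate exit_count - 1 extra entries
    // to get an inline array of exit_count exit structs.
    size_t size = sizeof(_PyExecutorObject) +
                  (exit_count > 1 ? exit_count - 1 : 0) * sizeof(_PyExitData);
    _PyExecutorObject *res = (_PyExecutorObject *)PyObject_Malloc(size);
    // The caller still initializes the object header, vm_data,
    // trace/jit_code and each exit.
    return res;
}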

We will pre-allocate an array of executors for cold exits, one for each possible index into the exits array. Since there cannot be more exits than the maximum number of uops, the size of the array will be _Py_UOP_MAX_TRACE_LENGTH, currently 512.

Cold exits will be implemented as a single micro-op _COLD_EXIT with oparg set to the index in the array.
_COLD_EXIT needs to do the following:

  • Find the exit: exit = &current_executor->exits[oparg]
  • Increment exit->hotness
  • If exit->hotness == 0 then trigger generation of a new executor to replace exit->executor.
  • Otherwise drop into tier 1: next_instr = _PyCode_CODE(code) + exit->target; ENTER_TIER1();

The JIT will need to compile the code for these cold exits on start-up, but we can statically allocate the data structures.
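
A minimal sketch of what that statically allocated data could look like (the names cold_exit_uops and cold_exit_executors, and the opcode/oparg fields on _PyUOpInstruction, are assumptions for illustration, not part of the design above):

static _PyUOpInstruction cold_exit_uops[_Py_UOP_MAX_TRACE_LENGTH];
static _PyExecutorObject cold_exit_executors[_Py_UOP_MAX_TRACE_LENGTH];

static void
init_cold_exits(void)
{
    for (int i = 0; i < _Py_UOP_MAX_TRACE_LENGTH; i++) {
        // Each cold-exit executor runs a single _COLD_EXIT uop whose oparg
        // identifies the exit it services.
        cold_exit_uops[i].opcode = _COLD_EXIT;
        cold_exit_uops[i].oparg = (uint16_t)i;
        cold_exit_executors[i].trace = &cold_exit_uops[i];
        // Object header, vm_data and immortality are set up elsewhere.
    }
}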

@gvanrossum (Collaborator)

Since there cannot be more exits than half the maximum number of uops,

Uh, why?

the size of the array will be _Py_UOP_MAX_TRACE_LENGTH/2, currently 256.

Why not dynamically allocate it based on the size of the trace? (Which we know before we create the executor object, in make_executor_from_uops() in optimizer.c.)

I take it that the targets are basically moved from the _PyOptimizerObject struct into the _PyExitData struct; exit and hotness correspond to the executors and counters arrays in my defunct side-exits PR (gh-112902).

Cold exits will be implemented as a single micro-op _COLD_EXIT with oparg set to the index in the array.

I'm missing something here. My understanding was that exits are (largely) reached by calls to DEOPT_IF() (or EXIT_IF() if you wish) included in the C code for the bytecode definition in bytecodes.c. How does a new uop help here? (Fine if that's explained in another design issue, just point me to it.)

@markshannon (Member, Author)

Since there cannot be more exits than half the maximum number of uops,

Uh, why?

Because at least half will be _SET_IP/_CHECK_VALID.

Why not dynamically allocated based on the size of the trace? (Which we know before we create the executor object (in make_executor_from_uops() in optimizer.c).

There will be 256 globally (per-process); since they are const, they can be shared.

I'm missing something here. My understanding was that exits are (largely) reached by calls to DEOPT_IF() (or EXIT_IF() if you wish) included in the C code for the bytecode definition in bytecodes.c. How does a new uop help here? (Fine if that's explained in another design issue, just point me to it.)

Each EXIT_IF will have an executor/exit attached (in the exits array), which is jumped to unconditionally. It is then the job of that exit to handle counts and further optimization.
The _COLD_EXIT micro-op will perform that task.

@markshannon (Member, Author) commented Jan 10, 2024

The _COLD_EXIT code will look something like this:
(This is untested, not even compiled.)

        op(_COLD_EXIT, (--)) {
            TIER_TWO_ONLY
            _PyExitData *exit = &current_executor->exits[oparg];
            Py_DECREF(current_executor);
            exit->hotness++;
            UNLIKELY_EXIT_IF(exit->hotness < 0);  
            PyCodeObject *code = _PyFrame_GetCode(frame);
            _Py_CODEUNIT *target = _PyCode_CODE(code) + exit->target;
            _PyOptimizerObject *opt = interp->optimizer;
            _PyExecutorObject *executor = NULL;
            int optimized = opt->optimize(opt, code, target, &executor, (int)(stack_pointer - _PyFrame_Stackbase(frame)));
            ERROR_IF(optimized < 0, error);
            if (optimized) {
                exit->executor = executor;
                Py_INCREF(executor);
                GOTO_TIER_TWO();
            }
            else {
                exit->hotness = -10000; /* Choose a better number */
            }
        }

@markshannon (Member, Author) commented Jan 10, 2024

And the code for EXIT_IF will be something like this:

    _PyExitData *exit = &current_executor->exits[index];
    Py_INCREF(exit->executor);
    Py_DECREF(current_executor);
    current_executor = exit->executor;
    GOTO_TIER_TWO();

index is computed when creating the executor from the uop IR.
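
For concreteness, the initialization of the exits array when the executor is built from the uop IR might then look roughly like this (exit_count, exit_targets, the shared cold_exit_executors table and the -64 threshold are placeholders, not taken from the issue):

    for (int i = 0; i < exit_count; i++) {
        // Every exit starts cold, sharing the immortal cold-exit executor
        // whose single _COLD_EXIT uop carries the same index.
        executor->exits[i].executor = &cold_exit_executors[i]; // immortal, no incref needed
        executor->exits[i].target = exit_targets[i];           // Tier 1 offset for this exit
        executor->exits[i].hotness = -64;                      // warm-up threshold (placeholder)
    }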

@gvanrossum (Collaborator)

Oh, I see what I didn't get.

The (dynamically allocated) array of exit structs is part of the executor, and each possibly-deoptimizing uop in the trace array (or each group of uops belonging to the same Tier 1 instruction?) has a corresponding exit struct, with its own target and hotness counter.

However, the exit fields in those exit structs initially all point to one of 256 statically allocated executors (i.e., to the one whose index in that static array equals the index of the exit struct in the array of such structs in the executor). And each of those static executors contains a single _COLD_EXIT uop whose oparg is that same index.

When the hotness counter warms up, eventually the exit field in the exit struct is made to point to a new, dynamically created executor representing the code where the hot side exit continues. (There's the issue of how ownership of that executor is determined, but that's solvable.)

I'm not sure I follow the reasoning about how many cold exit executors we need (initially 1/3rd of the uops are _SET_IP and 1/3rd are _CHECK_VALID, so maybe only MAX/3 instead of MAX/2?) but I can see the attraction in making this array shorter than the trace array. Possibly the allocation of entries can be different when using the Tier 2 interpreter than for the JIT, since the index can be baked into the JIT code. Anyways, this is a minor detail; we just must be able to calculate the index of the exit struct corresponding to the current micro instruction efficiently in the Tier 2 interpreter. (Echoes of an early, later abandoned, design for the 3.11 Tier 1 "inline" cache. :-)

PS 1: I take it that the target field is moved from the UOpInstruction struct to the exit struct?

PS 2: Maybe the field referencing an executor in the exit struct should be called executor instead of exit -- the reuse of "exit" added to my confusion.

@markshannon (Member, Author) commented Jan 11, 2024

(There's the issue of how ownership of that executor is determined, but that's solvable.)

No one need own the executor, we have reference counting. Each exit->executor field would be a strong reference to the executor it points to. Since the cold exits are immortal, we don't need to worry about reference counts when initializing the field.

I'm not sure I follow the reasoning about how many cold exit executors...

1/2 is an approximation.
Currently we only need a third, but some instructions expand to several micro-ops, so we could end up needing more than half.
Maybe we should just make the array MAX for safety. As you say, it isn't important for the design.

PS 1: I take it that the target field is moved from the UOpInstruction struct to the exit struct?

Yes.

@gvanrossum (Collaborator)

Each exit->executor field would be a strong reference to the executor it points to

Makes sense (as does the rest). The destructor for executors needs to know how many exit structs an executor has, and decref the executors in those exits. The static ones can be immortal.
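
A sketch of what that destructor could look like, assuming the number of exits can be recovered from the variable-size header (here via Py_SIZE, which is an assumption about how the count is stored):

static void
uop_executor_dealloc(PyObject *op)
{
    _PyExecutorObject *self = (_PyExecutorObject *)op;
    Py_ssize_t exit_count = Py_SIZE(self);  // assumes ob_size records the exit count
    for (Py_ssize_t i = 0; i < exit_count; i++) {
        // Drop the strong reference held by each exit; the statically
        // allocated cold-exit executors are immortal, so this is harmless.
        Py_XDECREF(self->exits[i].executor);
    }
    PyObject_Free(self);
}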

@gvanrossum (Collaborator)

I would still plead for renaming the exit field.

@markshannon (Member, Author)

I would still plead for renaming the exit field.

Already done 🙂

@gvanrossum (Collaborator)

@markshannon Presumably before we can work on this, we need to implement having separate structs for uops during the trace collection/optimization chain and uops in the executor, right? Because in the executor the 'target' field lives in the exit blocks, but during optimization, 'target' is part of the instruction struct.
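
To illustrate the split (a sketch only; the struct names, field names and widths here are assumptions): during trace collection and optimization each uop keeps its own Tier 1 target, while in the finished executor the target moves into the exit data and the uop only needs to identify its exit.

// Uop as used while projecting/optimizing the trace.
typedef struct {
    uint16_t opcode;
    uint16_t oparg;
    uint32_t target;      // Tier 1 offset, kept per instruction during optimization
} _PyUOpInstructionIR;

// Uop as stored in the finished executor.
typedef struct {
    uint16_t opcode;
    uint16_t oparg;
    uint16_t exit_index;  // index into executor->exits; the target now lives there
} _PyUOpInstructionExec;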

@markshannon (Member, Author)

Done 🎉
