Register-based VM #22
Note that Victor has also worked on such a thing: see https://mail.python.org/archives/list/[email protected]/thread/X72OYMPH2HLTY4SIGVPKSTIRWL2XFY7G/ (It's actually faster!)
As an indication of how complex this transition would be, both Skip and Victor chose to keep the existing bytecode compiler and tack on a separate translation phase, written in Python!
CPython is significantly faster than it was when Victor did those experiments. So I wouldn't trust those numbers.
That might just be a comment on how difficult the compiler is to work with (but that's a separate topic).
I'm 👎 on this. Not that it is a bad idea, just that the effort and disruption outweigh the marginal benefits.

The important issue here is not "register" vs. "stack", but: is a wider instruction with explicit operands faster than the narrower instructions with implicit operand(s) that we currently have? Maybe, but is it faster than the judicious use of super-instructions (such as LOAD_FAST_LOAD_FAST) and static specializations?

**Experimental evidence**
[Virtual Machine Showdown: Stack Versus Registers](https://www.usenix.org/legacy/events%2Fvee05%2Ffull_papers/p153-yunhe.pdf) shows a 47% reduction in VM instructions executed. However, for Lua, the change from a stack machine in Lua 4 to a register machine in Lua 5 showed a 35% reduction. Since Python is even more dynamic and higher-level than Lua, I would expect the reduction to be less than that, maybe in the 25-30% range. In both cases considerable effort was put into bytecode optimization for the register machine, but not the stack machine. Another thing to consider is that branch predictors have improved a lot since 2005, so per-instruction overhead is reduced.

**Comparison with super-instructions**
Super-instructions are much simpler to implement.

**Does it help or hinder the important optimizations?**
An infinite stack is a lot easier to reason about and optimize than a fixed-size array. Consequently a "register" machine may ultimately slow things down by making specialization and compilation harder.
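The super-instruction idea mentioned above can be illustrated with a toy stack machine. This is a hedged sketch, not CPython's actual bytecode: the opcode names mirror CPython's, but the encoding (one opcode byte plus one oparg byte, two local indices packed into the oparg's nibbles) is invented for the example. The point is that LOAD_FAST_LOAD_FAST replaces two dispatches (two opcode fetches, two indirect-branch events) with one.

```c
/* Toy stack machine with a super-instruction. Hypothetical encoding:
 * each instruction is an (opcode, oparg) byte pair; for the
 * super-instruction, oparg packs two local indices into its nibbles. */
enum { LOAD_FAST, LOAD_FAST_LOAD_FAST, BINARY_ADD, RETURN_VALUE };

static int run(const unsigned char *code, const int *locals) {
    int stack[16], *sp = stack;          /* sp grows upward */
    for (;;) {
        unsigned char opcode = *code++;
        unsigned char oparg = *code++;
        switch (opcode) {
        case LOAD_FAST:
            *sp++ = locals[oparg];
            break;
        case LOAD_FAST_LOAD_FAST:
            /* One fetch/dispatch does the work of two LOAD_FASTs. */
            *sp++ = locals[oparg >> 4];
            *sp++ = locals[oparg & 0xF];
            break;
        case BINARY_ADD:
            sp[-2] = sp[-2] + sp[-1];
            sp--;
            break;
        case RETURN_VALUE:
            return *--sp;
        }
    }
}
```

With locals `{2, 3}`, the two-instruction sequence `LOAD_FAST 0; LOAD_FAST 1` and the single instruction `LOAD_FAST_LOAD_FAST 0x01` leave the same two values on the stack; only the number of trips through the dispatch switch differs.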
Agreed. One idea I had was to just go systematically over all the opcodes that currently have no arguments and, where appropriate, add a variant (or variants, like ADD_INT, ADD_CONST, ADD_FAST) that takes an extra operand. This is nearly free (the code can use a goto) and saves instruction count (the oparg field is otherwise unused) and decoding time. We can either analyze output from runs with dxpairs or statically analyze code objects to determine which opcode pairs are common.
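The "nearly free" goto trick can be sketched in a toy interpreter. This is an illustrative assumption about how such a variant might be wired up (toy opcodes, ints instead of PyObject*): ADD_FAST pushes a local and then jumps straight into the existing BINARY_ADD body, so the addition code exists only once and one dispatch is saved per fused pair.

```c
/* Toy sketch of an operand-taking variant sharing its body via goto.
 * ADD_FAST == LOAD_FAST + BINARY_ADD fused into one instruction fetch. */
enum { LOAD_FAST, BINARY_ADD, ADD_FAST, RETURN_VALUE };

static int run(const unsigned char *code, const int *locals) {
    int stack[16], *sp = stack;
    for (;;) {
        unsigned char opcode = *code++;
        unsigned char oparg = *code++;
        switch (opcode) {
        case LOAD_FAST:
            *sp++ = locals[oparg];
            break;
        case ADD_FAST:
            *sp++ = locals[oparg];   /* the LOAD_FAST half */
            goto binary_add;         /* fall into the shared add body */
        case BINARY_ADD:
        binary_add:
            sp[-2] += sp[-1];
            sp--;
            break;
        case RETURN_VALUE:
            return *--sp;
        }
    }
}
```

The goto means the variant costs a few lines in the switch rather than a duplicated implementation, which is what makes adding many such variants cheap.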
A downside of adding ADD_CONST, ADD_FAST, etc. is that if we want to specialize BINARY_ADD we need to duplicate those specializations.
Stack caching is an alternative that results in similar benefits and introduces much less code churn. |
I'm pretty curious about this. I think it would mean making (say) 3 copies of the whole dispatch loop, somewhat like below. It would severely hurt code size/locality, but I think it would also significantly decrease the number of memory reads/writes into the stack on most instructions that get executed. I don't think the bytecode compiler would have to change at all.

```c
PyObject *first_register, *second_register; // Sometimes use these instead of touching the stack

dispatch_zero: // zero of the registers in use
    switch (opcode) {
        ...
        case TARGET0(LOAD_FAST): {
            first_register = GETLOCAL(oparg);
            if (first_register == NULL) {
                ...
            }
            Py_INCREF(first_register);
            DISPATCH_ONE();
        }
        case TARGET0(STORE_FAST): {
            PyObject *value = BASIC_POP();
            SETLOCAL(oparg, value);
            DISPATCH_ZERO();
        }
        case TARGET0(POP_TOP): {
            PyObject *value = BASIC_POP();
            Py_DECREF(value);
            DISPATCH_ZERO();
        }
        case TARGET0(BINARY_SUBTRACT): {
            PyObject *right = stack_pointer[-1];
            PyObject *left = stack_pointer[-2];
            stack_pointer -= 2;
            first_register = PyNumber_Subtract(left, right);
            Py_DECREF(right);
            Py_DECREF(left);
            if (first_register == NULL)
                goto error0;
            DISPATCH_ONE();
        }
        ...
    }

dispatch_one: // one of the registers in use
    switch (opcode) {
        ...
        case TARGET1(LOAD_FAST): {
            second_register = GETLOCAL(oparg);
            if (second_register == NULL) {
                ...
            }
            Py_INCREF(second_register);
            DISPATCH_TWO();
        }
        case TARGET1(STORE_FAST): {
            SETLOCAL(oparg, first_register);
            DISPATCH_ZERO();
        }
        case TARGET1(POP_TOP): {
            Py_DECREF(first_register);
            DISPATCH_ZERO();
        }
        case TARGET1(BINARY_SUBTRACT): {
            PyObject *right = first_register;
            PyObject *left = stack_pointer[-1];
            stack_pointer -= 1;
            first_register = PyNumber_Subtract(left, right);
            Py_DECREF(right);
            Py_DECREF(left);
            if (first_register == NULL)
                goto error1;
            DISPATCH_ONE();
        }
        ...
    }

dispatch_two: // two of the registers in use
    switch (opcode) {
        ...
        case TARGET2(LOAD_FAST): {
            BASIC_PUSH(first_register);
            first_register = second_register;
            second_register = GETLOCAL(oparg);
            if (second_register == NULL) {
                ...
            }
            Py_INCREF(second_register);
            DISPATCH_TWO();
        }
        case TARGET2(STORE_FAST): {
            SETLOCAL(oparg, second_register);
            DISPATCH_ONE();
        }
        case TARGET2(POP_TOP): {
            Py_DECREF(second_register);
            DISPATCH_ONE();
        }
        case TARGET2(BINARY_SUBTRACT): {
            PyObject *left = first_register;
            first_register = PyNumber_Subtract(left, second_register);
            Py_DECREF(second_register);
            Py_DECREF(left);
            if (first_register == NULL)
                goto error2;
            DISPATCH_ONE();
        }
        ...
    }
```
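The sketch above leaves the `TARGET0/1/2` and `DISPATCH_ZERO/ONE/TWO` macros undefined. As a minimal, self-contained illustration of the same idea, here is a hedged toy version with invented opcodes, `int` values instead of `PyObject *`, and only two dispatch states (zero or one cached register); the dispatch macros become plain gotos between the two switch statements. Note how the one-register state's BINARY_SUBTRACT touches the memory stack once instead of three times.

```c
/* Toy two-state stack-caching interpreter. Hypothetical opcodes and
 * encoding (opcode byte + oparg byte); the top of stack lives in
 * first_register whenever we are in the dispatch_one state. */
enum { LOAD_CONST, BINARY_SUBTRACT, RETURN_VALUE };

static int run(const unsigned char *code, const int *consts) {
    int stack[16], *sp = stack;
    int first_register = 0;             /* cached TOS in the "one" state */
    unsigned char opcode, oparg;

dispatch_zero:                          /* zero registers in use */
    opcode = *code++; oparg = *code++;
    switch (opcode) {
    case LOAD_CONST:
        first_register = consts[oparg]; /* new TOS stays in the register */
        goto dispatch_one;
    case BINARY_SUBTRACT:               /* both operands in memory */
        first_register = sp[-2] - sp[-1];
        sp -= 2;
        goto dispatch_one;
    case RETURN_VALUE:
        return *--sp;
    }

dispatch_one:                           /* one register in use */
    opcode = *code++; oparg = *code++;
    switch (opcode) {
    case LOAD_CONST:
        *sp++ = first_register;         /* spill before caching new TOS */
        first_register = consts[oparg];
        goto dispatch_one;
    case BINARY_SUBTRACT:               /* right operand never hits memory */
        first_register = sp[-1] - first_register;
        sp--;
        goto dispatch_one;
    case RETURN_VALUE:
        return first_register;          /* no pop needed */
    }
    return 0;                           /* unreachable with well-formed code */
}
```

Computing `10 - 3` with this machine executes `LOAD_CONST 0; LOAD_CONST 1; BINARY_SUBTRACT; RETURN_VALUE` with only one stack write and one stack read, versus three writes and three reads for a plain stack machine.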
This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
Skip Montanaro has been keeping this idea alive. See his latest review on python-dev.