Skip to content

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Register-based VM #22

Closed
gvanrossum opened this issue Mar 20, 2021 · 8 comments
Closed

Register-based VM #22

gvanrossum opened this issue Mar 20, 2021 · 8 comments

Comments

@gvanrossum
Copy link
Collaborator

gvanrossum commented Mar 20, 2021

Skip Montanaro has been keeping this idea alive. See his latest review on python-dev.

This was referenced Mar 20, 2021
@gvanrossum
Copy link
Collaborator Author

gvanrossum commented Mar 22, 2021

@gvanrossum
Copy link
Collaborator Author

As an indication of how complex this transition would be, both Skip and Victor chose to keep the existing bytecode compiler and tack on a separate translation phase, written in Python!

@markshannon
Copy link
Member

CPython is significantly faster than it was when Victor did those experiments. So I wouldn't trust those numbers.

As an indication of how complex this transition would be, both Skip and Victor chose to keep the existing bytecode compiler and tack on a separate translation phase, written in Python!

That might just be a comment on how difficult the compiler is to work with (but that's a separate topic).

@markshannon
Copy link
Member

markshannon commented Mar 24, 2021

I'm 👎 on this. Not that it is a bad idea, just that the effort and disruption outweigh the marginal benefits.

The important issue here is not "register" or "stack", but is a wider instruction with explicit operands faster than the narrower instruction with implicit operand(s) that we currently have?

Maybe, but is it faster than the judicious use of super-instructions (such as LOAD_FAST_LOAD_FAST) and static specializations
(such as ADD_INT)?
Probably not, at least not by much, and it is a lot more work.

Experimental evidence

[Virtual Machine Showdown: Stack Versus Registers] (https://www.usenix.org/legacy/events%2Fvee05%2Ffull_papers/p153-yunhe.pdf) shows a 47% reduction in VM instructions executed. However, for Lua, the change from stack machine in Lua 4 to register machine in Lua 5 showed a 35% reduction. Since Python is even more dynamic and higher level than Lua, I would expect the reduction to be less than that, maybe in the 25-30% range.

In both cases considerable effort was put into bytecode optimization for the register machine, but not the stack machine.
I've not done any measurements, but improvements to the CPython bytecode since early 3.x days, have probably reduced instructions executed by at least 10%. "register" machine instructions do more work, so the number of instructions executed must be reduced to outweigh that. I don't know what the break-even point would be.

Another thing to consider is that branch predictors have improved a lot since 2005, so per-instruction overhead is reduced.

Comparison with super-instructions

Super-instructions are much simpler to implement.
Super-instructions push and pop to the stack which simplifies refcount handling a bit.

Does it help or hinder the important optimizations?

An infinite stack is a lot easier to reason about and optimize than a fixed-size array. Consequently a "register" machine may ultimately slow things down by making specialization and compilation harder.

@gvanrossum
Copy link
Collaborator Author

Agreed. One idea I had was to just go systematically over all the opcodes that currently have no arguments and where appropriate add a variant (or variants, like ADD_INT, ADD_CONST, ADD_FAST) that takes an extra operand. This is nearly free (the code can use a goto) and saves instruction count (the oparg field is otherwise unused) and decoding time. We can either analyze output from runs with dxpairs or analysis of code objects to determine which opcode pairs are common.

@markshannon
Copy link
Member

markshannon commented Mar 25, 2021

A downside of adding ADD_CONST, ADD_FAST, etc. is that if we want to specialize BINARY_ADD we need to duplicate those specializations.

@kumpera
Copy link

kumpera commented Jun 15, 2021

Stack caching is an alternative that results in similar benefits and introduces much less code churn.
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.5.4929&rep=rep1&type=pdf

@sweeneyde
Copy link

Stack caching is an alternative that results in similar benefits and introduces much less code churn.
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.5.4929&rep=rep1&type=pdf

I'm pretty curious about this. I think it would mean making (say) 3 copies of the whole dispatch loop, somewhat like below. It would severely hurt code size/locality, but I think it would also significantly decrease the number of memory reads/writes into the stack on most instructions that get executed. I don't think the bytecode compiler would have to change at all.

PyObject *first_register, *second_register; // Sometimes use these instead of touching the stack

dispatch_zero: // zero of the registers in use

    switch (opcode) {
        ...
        case TARGET0(LOAD_FAST): {
            first_register = GETLOCAL(oparg);
            if (first_register == NULL) {
                ...
            }
            Py_INCREF(first_register);
            DISPATCH_ONE();
        }
        case TARGET0(STORE_FAST): {
            PyObject *value = BASIC_POP();
            SETLOCAL(oparg, value);
            DISPATCH_ZERO();
        }
        case TARGET0(POP_TOP): {
            PyObject *value = BASIC_POP();
            Py_DECREF(value);
            DISPATCH_ZERO();
        }
        case TARGET0(BINARY_SUBTRACT): {
            PyObject *right = stack_pointer[-1];
            PyObject *left = stack_pointer[-2];
            stack_pointer -= 2;
            first_register = PyNumber_Subtract(left, right);
            Py_DECREF(right);
            Py_DECREF(left);
            if (first_register == NULL)
                goto error0;
            DISPATCH_ONE();
        }
        ...
    }

dispatch_one: // one of the registers in use

    switch (opcode) {
        ...
        case TARGET1(LOAD_FAST): {
            second_register = GETLOCAL(oparg);
            if (second_register == NULL) {
                ...
            }
            Py_INCREF(second_register);
            DISPATCH_TWO();
        }
        case TARGET1(STORE_FAST): {
            SETLOCAL(oparg, first_register);
            DISPATCH_ZERO();
        }
        case TARGET1(POP_TOP): {
            Py_DECREF(first_register);
            DISPATCH_ZERO();
        }
        case TARGET1(BINARY_SUBTRACT): {
            PyObject *right = first_register;
            PyObject *left = stack_pointer[-1];
            stack_pointer -= 1;
            first_register = PyNumber_Subtract(left, right);
            Py_DECREF(right);
            Py_DECREF(left);
            if (first_register == NULL)
                goto error1;
            DISPATCH_ONE();
        }
        ...
    }

dispatch_two: // two of the registers in use

    switch (opcode) {
        ...
        case TARGET2(LOAD_FAST): {
            BASIC_PUSH(first_register);
            first_register = second_register;
            second_register = GETLOCAL(oparg);
            if (second_register == NULL) {
                ...
            }
            Py_INCREF(second_register);
            DISPATCH_TWO();
        }
        case TARGET2(STORE_FAST): {
            SETLOCAL(oparg, second_register);
            DISPATCH_ONE();
        }
        case TARGET2(POP_TOP): {
            Py_DECREF(second_register);
            DISPATCH_ONE();
        }
        case TARGET2(BINARY_SUBTRACT): {
            PyObject *left = first_register;
            first_register = PyNumber_Subtract(left, second_register);
            Py_DECREF(second_register);
            Py_DECREF(left);
            if (first_register == NULL)
                goto error2;
            DISPATCH_ONE();
        }
        ...
    }

@faster-cpython faster-cpython locked and limited conversation to collaborators Dec 1, 2021
@gramster gramster closed this as completed Dec 1, 2021
@gramster gramster moved this to Todo in Fancy CPython Board Jan 10, 2022
@gramster gramster moved this from Todo to Other in Fancy CPython Board Jan 10, 2022

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants