Register-based VM #22
Note that Victor has also worked on such a thing: see https://mail.python.org/archives/list/[email protected]/thread/X72OYMPH2HLTY4SIGVPKSTIRWL2XFY7G/ (It's actually faster!)
As an indication of how complex this transition would be, both Skip and Victor chose to keep the existing bytecode compiler and tack on a separate translation phase, written in Python!
CPython is significantly faster than it was when Victor did those experiments. So I wouldn't trust those numbers.
That might just be a comment on how difficult the compiler is to work with (but that's a separate topic).
I'm 👎 on this. Not that it is a bad idea, just that the effort and disruption outweigh the marginal benefits.

The important issue here is not "register" vs. "stack", but: is a wider instruction with explicit operands faster than the narrower instructions with implicit operand(s) that we currently have? Maybe, but is it faster than the judicious use of super-instructions (such as LOAD_FAST_LOAD_FAST) and static specializations?

**Experimental evidence**
[Virtual Machine Showdown: Stack Versus Registers](https://www.usenix.org/legacy/events%2Fvee05%2Ffull_papers/p153-yunhe.pdf) shows a 47% reduction in VM instructions executed. However, for Lua, the change from a stack machine in Lua 4 to a register machine in Lua 5 showed a 35% reduction. Since Python is even more dynamic and higher-level than Lua, I would expect the reduction to be less than that, maybe in the 25-30% range. In both cases considerable effort was put into bytecode optimization for the register machine, but not the stack machine. Another thing to consider is that branch predictors have improved a lot since 2005, so per-instruction overhead is reduced.

**Comparison with super-instructions**
Super-instructions are much simpler to implement.

**Does it help or hinder the important optimizations?**
An infinite stack is a lot easier to reason about and optimize than a fixed-size array. Consequently a "register" machine may ultimately slow things down by making specialization and compilation harder.
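The super-instruction idea mentioned above can be illustrated with a toy stack machine. This is a hedged sketch, not CPython's actual bytecode: the opcode names mirror CPython's, but the encoding (one opcode byte plus one oparg byte, two local indices packed into the oparg's nibbles) is invented for the example. The point is that LOAD_FAST_LOAD_FAST replaces two dispatches (two opcode fetches, two indirect-branch events) with one.

```c
/* Toy stack machine with a super-instruction. Hypothetical encoding:
 * each instruction is an (opcode, oparg) byte pair; for the
 * super-instruction, oparg packs two local indices into its nibbles. */
enum { LOAD_FAST, LOAD_FAST_LOAD_FAST, BINARY_ADD, RETURN_VALUE };

static int run(const unsigned char *code, const int *locals) {
    int stack[16], *sp = stack;          /* sp grows upward */
    for (;;) {
        unsigned char opcode = *code++;
        unsigned char oparg = *code++;
        switch (opcode) {
        case LOAD_FAST:
            *sp++ = locals[oparg];
            break;
        case LOAD_FAST_LOAD_FAST:
            /* One fetch/dispatch does the work of two LOAD_FASTs. */
            *sp++ = locals[oparg >> 4];
            *sp++ = locals[oparg & 0xF];
            break;
        case BINARY_ADD:
            sp[-2] = sp[-2] + sp[-1];
            sp--;
            break;
        case RETURN_VALUE:
            return *--sp;
        }
    }
}
```

With locals `{2, 3}`, the two-instruction sequence `LOAD_FAST 0; LOAD_FAST 1` and the single instruction `LOAD_FAST_LOAD_FAST 0x01` leave the same two values on the stack; only the number of trips through the dispatch switch differs.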
Agreed. One idea I had was to just go systematically over all the opcodes that currently have no arguments and, where appropriate, add a variant (or variants, like ADD_INT, ADD_CONST, ADD_FAST) that takes an extra operand. This is nearly free (the code can use a goto) and saves instruction count (the oparg field is otherwise unused) and decoding time. We can either analyze output from runs with dxpairs or statically analyze code objects to determine which opcode pairs are common.
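The "nearly free" goto trick can be sketched in a toy interpreter. This is an illustrative assumption about how such a variant might be wired up (toy opcodes, ints instead of PyObject*): ADD_FAST pushes a local and then jumps straight into the existing BINARY_ADD body, so the addition code exists only once and one dispatch is saved per fused pair.

```c
/* Toy sketch of an operand-taking variant sharing its body via goto.
 * ADD_FAST == LOAD_FAST + BINARY_ADD fused into one instruction fetch. */
enum { LOAD_FAST, BINARY_ADD, ADD_FAST, RETURN_VALUE };

static int run(const unsigned char *code, const int *locals) {
    int stack[16], *sp = stack;
    for (;;) {
        unsigned char opcode = *code++;
        unsigned char oparg = *code++;
        switch (opcode) {
        case LOAD_FAST:
            *sp++ = locals[oparg];
            break;
        case ADD_FAST:
            *sp++ = locals[oparg];   /* the LOAD_FAST half */
            goto binary_add;         /* fall into the shared add body */
        case BINARY_ADD:
        binary_add:
            sp[-2] += sp[-1];
            sp--;
            break;
        case RETURN_VALUE:
            return *--sp;
        }
    }
}
```

The goto means the variant costs a few lines in the switch rather than a duplicated implementation, which is what makes adding many such variants cheap.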
A downside of adding ADD_CONST, ADD_FAST, etc. is that if we want to specialize BINARY_ADD we need to duplicate those specializations.
Stack caching is an alternative that results in similar benefits and introduces much less code churn. |
I'm pretty curious about this. I think it would mean making (say) 3 copies of the whole dispatch loop, somewhat like below. It would severely hurt code size/locality, but I think it would also significantly decrease the number of memory reads/writes into the stack on most instructions that get executed. I don't think the bytecode compiler would have to change at all.

```c
PyObject *first_register, *second_register; // Sometimes use these instead of touching the stack

dispatch_zero: // zero of the registers in use
    switch (opcode) {
        ...
        case TARGET0(LOAD_FAST): {
            first_register = GETLOCAL(oparg);
            if (first_register == NULL) {
                ...
            }
            Py_INCREF(first_register);
            DISPATCH_ONE();
        }
        case TARGET0(STORE_FAST): {
            PyObject *value = BASIC_POP();
            SETLOCAL(oparg, value);
            DISPATCH_ZERO();
        }
        case TARGET0(POP_TOP): {
            PyObject *value = BASIC_POP();
            Py_DECREF(value);
            DISPATCH_ZERO();
        }
        case TARGET0(BINARY_SUBTRACT): {
            PyObject *right = stack_pointer[-1];
            PyObject *left = stack_pointer[-2];
            stack_pointer -= 2;
            first_register = PyNumber_Subtract(left, right);
            Py_DECREF(right);
            Py_DECREF(left);
            if (first_register == NULL)
                goto error0;
            DISPATCH_ONE();
        }
        ...
    }

dispatch_one: // one of the registers in use
    switch (opcode) {
        ...
        case TARGET1(LOAD_FAST): {
            second_register = GETLOCAL(oparg);
            if (second_register == NULL) {
                ...
            }
            Py_INCREF(second_register);
            DISPATCH_TWO();
        }
        case TARGET1(STORE_FAST): {
            SETLOCAL(oparg, first_register);
            DISPATCH_ZERO();
        }
        case TARGET1(POP_TOP): {
            Py_DECREF(first_register);
            DISPATCH_ZERO();
        }
        case TARGET1(BINARY_SUBTRACT): {
            PyObject *right = first_register;
            PyObject *left = stack_pointer[-1];
            stack_pointer -= 1;
            first_register = PyNumber_Subtract(left, right);
            Py_DECREF(right);
            Py_DECREF(left);
            if (first_register == NULL)
                goto error1;
            DISPATCH_ONE();
        }
        ...
    }

dispatch_two: // two of the registers in use
    switch (opcode) {
        ...
        case TARGET2(LOAD_FAST): {
            BASIC_PUSH(first_register);
            first_register = second_register;
            second_register = GETLOCAL(oparg);
            if (second_register == NULL) {
                ...
            }
            Py_INCREF(second_register);
            DISPATCH_TWO();
        }
        case TARGET2(STORE_FAST): {
            SETLOCAL(oparg, second_register);
            DISPATCH_ONE();
        }
        case TARGET2(POP_TOP): {
            Py_DECREF(second_register);
            DISPATCH_ONE();
        }
        case TARGET2(BINARY_SUBTRACT): {
            PyObject *left = first_register;
            first_register = PyNumber_Subtract(left, second_register);
            Py_DECREF(second_register);
            Py_DECREF(left);
            if (first_register == NULL)
                goto error2;
            DISPATCH_ONE();
        }
        ...
    }
```
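The sketch above leaves the `TARGET0/1/2` and `DISPATCH_ZERO/ONE/TWO` macros undefined. As a minimal, self-contained illustration of the same idea, here is a hedged toy version with invented opcodes, `int` values instead of `PyObject *`, and only two dispatch states (zero or one cached register); the dispatch macros become plain gotos between the two switch statements. Note how the one-register state's BINARY_SUBTRACT touches the memory stack once instead of three times.

```c
/* Toy two-state stack-caching interpreter. Hypothetical opcodes and
 * encoding (opcode byte + oparg byte); the top of stack lives in
 * first_register whenever we are in the dispatch_one state. */
enum { LOAD_CONST, BINARY_SUBTRACT, RETURN_VALUE };

static int run(const unsigned char *code, const int *consts) {
    int stack[16], *sp = stack;
    int first_register = 0;             /* cached TOS in the "one" state */
    unsigned char opcode, oparg;

dispatch_zero:                          /* zero registers in use */
    opcode = *code++; oparg = *code++;
    switch (opcode) {
    case LOAD_CONST:
        first_register = consts[oparg]; /* new TOS stays in the register */
        goto dispatch_one;
    case BINARY_SUBTRACT:               /* both operands in memory */
        first_register = sp[-2] - sp[-1];
        sp -= 2;
        goto dispatch_one;
    case RETURN_VALUE:
        return *--sp;
    }

dispatch_one:                           /* one register in use */
    opcode = *code++; oparg = *code++;
    switch (opcode) {
    case LOAD_CONST:
        *sp++ = first_register;         /* spill before caching new TOS */
        first_register = consts[oparg];
        goto dispatch_one;
    case BINARY_SUBTRACT:               /* right operand never hits memory */
        first_register = sp[-1] - first_register;
        sp--;
        goto dispatch_one;
    case RETURN_VALUE:
        return first_register;          /* no pop needed */
    }
    return 0;                           /* unreachable with well-formed code */
}
```

Computing `10 - 3` with this machine executes `LOAD_CONST 0; LOAD_CONST 1; BINARY_SUBTRACT; RETURN_VALUE` with only one stack write and one stack read, versus three writes and three reads for a plain stack machine.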
This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
Skip Montanaro has been keeping this idea alive. See his latest review on python-dev.