Set up tagged pointers in the evaluation stack #117139
Comments
So the mentions of unboxed ints are speculative and not part of the initial implementation, right? And the tagged pointers just have their tag bit(s) removed when used in the "blob of C code" that's the instruction definition's body. For example,
would expand its body to SETLOCAL(oparg, (PyObject *)(value.bits & ~TAGS));
How do pointers receive their tag bit? Does every stack push just "or in" the tag bit(s)? I recall trying to do this manually (before we had the generator) and it was a lot of work -- the code generator should make this much simpler. It also was quite a bit slower -- have you thought about this?
My biggest question is what does the tag bit mean? Is there a specific section of PEP 703 I should read to learn more about this?
PEP 703 doesn't go into detail about the tag bits. I'd expect it to match the description in Mark's issue, except with unboxed ints not part of the initial implementation:
We should only enable this in the free-threaded build to start. We do not want to introduce any performance regressions in the default build. If it turns out to be a net win for the default build, then we can consider enabling it there. Otherwise, we can revisit that in 3.14, if we pursue tagged integers.
Yeah, most pushes just "or in" the tag bit, which is an immortal check. So, for example, cpython/Python/generated_cases.c.h Line 2922 in 40d75c2
That might be generated instead as:
#define PACK(obj) ((PyTaggedObject){.bits = (uintptr_t)obj | _Py_IsImmortal(obj)})
stack_pointer[-1] = PACK(iter);
I think we want to use tagged pointers in the locals as well as the evaluation stack, but it may be easier to start with just using tagged pointers in the evaluation stack. In that case
If we use tagged pointers for the locals as well, then we could skip the extra work in
Motivation
The primary motivation is to avoid reference count contention in multithreaded programs on things like functions/methods (in the free-threaded build). In particular, the benefit (from avoiding reference counting contention) is going to be in
For example, currently
PyObject *attr = ...;
Py_INCREF(attr);
stack_pointer[-1] = attr;
We want the generated code to instead look something like the following, but probably refactored with helper macros/functions:
PyObject *attr = ...;
PyTaggedObject attr_tag;
if (_Py_IsImmortalOrDeferred(attr)) {
    attr_tag.bits = (uintptr_t)attr | IS_DEFERRED;
}
else {
    attr_tag.obj = attr;
}
stack_pointer[-1] = attr_tag;
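The push-side logic described in this comment can be tried out in isolation. The following is a minimal sketch, not CPython code: `ToyObject`, `ToyTagged`, `TOY_TAG_DEFERRED` and the `toy_*` helpers are hypothetical stand-ins for `PyObject`, `PyTaggedObject`, `IS_DEFERRED` and the helper macros mentioned above, and the `deferred` flag is an assumed simplification of the immortal-or-deferred check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for PyObject; not CPython's real layout. */
typedef struct {
    intptr_t refcnt;
    bool deferred;      /* assumed flag: immortal or deferred-refcounted */
} ToyObject;

/* Hypothetical tag: low bit set means "do not refcount this entry". */
#define TOY_TAG_DEFERRED ((uintptr_t)1)

typedef union {
    ToyObject *obj;
    uintptr_t bits;
} ToyTagged;

/* Create a stack entry from an object. Deferred objects are tagged and
 * skip the incref; everything else takes a normal strong reference. */
static ToyTagged toy_pack_new_ref(ToyObject *op)
{
    ToyTagged t;
    /* The low bit must be free, which holds for suitably aligned objects. */
    assert(((uintptr_t)op & TOY_TAG_DEFERRED) == 0);
    if (op->deferred) {
        t.bits = (uintptr_t)op | TOY_TAG_DEFERRED;
    }
    else {
        op->refcnt++;
        t.bits = (uintptr_t)op;
    }
    return t;
}

/* Strip the tag to recover the plain pointer. */
static ToyObject *toy_unpack(ToyTagged t)
{
    return (ToyObject *)(t.bits & ~TOY_TAG_DEFERRED);
}

static bool toy_is_deferred(ToyTagged t)
{
    return (t.bits & TOY_TAG_DEFERRED) != 0;
}
```

In this sketch a function-like object with `deferred` set never has its count written on a push, which is the contention the motivation above is about.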
Okay... So from this definition
#define PACK(obj) ((PyTaggedObject){.bits = (uintptr_t)obj | _Py_IsImmortal(obj)})
I gather that the tag bit is (by default) only set if the object is immortal. From your Motivation here (and also from the section on deferred refcounting in PEP 703) I take it that the tag bit would also be set for selected objects like functions and methods. Mark's original issue describes variants of INCREF/DECREF to check for the tag bit; presumably these variants should only be used in the interpreter.
@Fidget-Spinner Good luck!
--------- Co-authored-by: Sam Gross <[email protected]>
One thing to note: If we are going to use tagged ints, all plain "borrow" transformations will need to be removed, as it is impossible to borrow a In cases where it is known that the stack ref is not an int we could add |
This PR sets up tagged pointers for CPython. The general idea is to create a separate struct _PyStackRef for everything on the evaluation stack to store the bits. This forces the C compiler to warn us if we try to cast things or pull things out of the struct directly.
Only for free threading: We tag the low bit if something is deferred, which means we skip incref and decref operations on it. This behavior may change in the future if Mark's plans to defer all objects in the interpreter loop pan out.
This implies a strict stack reference discipline is required. ALL incref and decref operations on stackrefs must use the stackref variants. It is unsafe to untag something and then do normal incref/decref ops on it.
The new incref and decref variants are called dup and close. They mimic a "handle" API operating on these stackrefs.
Please read Include/internal/pycore_stackref.h for more information!
Co-authored-by: Mark Shannon <[email protected]>
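To make the "dup and close" discipline concrete, here is a small self-contained sketch. It is an illustration under stated assumptions, not the contents of Include/internal/pycore_stackref.h: `DemoObject`, `DemoStackRef`, `DEMO_TAG_DEFERRED` and every `demo_ref_*` function are hypothetical names, and deallocation at a zero count is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical object with only a reference count. */
typedef struct {
    intptr_t refcnt;
} DemoObject;

/* Wrapping the bits in a struct means a DemoStackRef cannot be passed
 * where a DemoObject * is expected without an explicit conversion. */
typedef struct {
    uintptr_t bits;
} DemoStackRef;

#define DEMO_TAG_DEFERRED ((uintptr_t)1)

/* Wrap a deferred object: tagged, no refcount change. */
static DemoStackRef demo_ref_from_deferred(DemoObject *op)
{
    DemoStackRef ref = { (uintptr_t)op | DEMO_TAG_DEFERRED };
    return ref;
}

/* Wrap a strong reference the caller already owns (ownership moves). */
static DemoStackRef demo_ref_from_new(DemoObject *op)
{
    DemoStackRef ref = { (uintptr_t)op };
    return ref;
}

/* Borrow the plain pointer; callers must not incref/decref through it. */
static DemoObject *demo_ref_borrow(DemoStackRef ref)
{
    return (DemoObject *)(ref.bits & ~DEMO_TAG_DEFERRED);
}

/* "dup": the stackref counterpart of an incref. */
static DemoStackRef demo_ref_dup(DemoStackRef ref)
{
    if ((ref.bits & DEMO_TAG_DEFERRED) == 0) {
        demo_ref_borrow(ref)->refcnt++;
    }
    return ref;
}

/* "close": the stackref counterpart of a decref. */
static void demo_ref_close(DemoStackRef ref)
{
    if ((ref.bits & DEMO_TAG_DEFERRED) == 0) {
        demo_ref_borrow(ref)->refcnt--;
    }
}
```

The point of the sketch is the hazard named above: calling a plain decref on `demo_ref_borrow(ref)` for a tagged entry would drop a reference that was never taken.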
Fix a few wrong steals in bytecodes.c
Avoids the extra conversion from stack refs to PyObjects.
`_PyDict_SetItem_Take2` steals both the key (i.e., `sub`) and the value.
This replaces `_PyList_FromArraySteal` with `_PyList_FromStackRefSteal`. It's functionally equivalent, but takes a `_PyStackRef` array instead of an array of `PyObject` pointers. Co-authored-by: Ken Jin <[email protected]>
`BUILD_SET` should use a borrow instead of a steal. The cleanup in `_DO_CALL` `CONVERSION_FAILED` was incorrect. Co-authored-by: Ken Jin <[email protected]>
* The free-threaded GC now visits interpreter stacks to keep objects that use deferred reference counting alive.
* Interpreter frames are zero initialized in the free-threaded GC so that the GC doesn't see garbage data. This is a temporary measure until stack spilling around escaping calls is implemented.
Co-authored-by: Ken Jin <[email protected]>
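The reason zero initialization matters can be shown with a toy scanner. This is a rough sketch under assumptions, not the free-threaded GC: `GcToyObject`, `GcToyRef`, `GcToyFrame` and the `gc_toy_*` functions are hypothetical, and the scanner is assumed to walk every slot of a fixed-size frame rather than stopping at a stack pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical object; the intptr_t member keeps the low pointer bit free. */
typedef struct {
    intptr_t refcnt;
    bool reachable;
} GcToyObject;

typedef struct {
    uintptr_t bits;
} GcToyRef;

#define GC_TOY_TAG_DEFERRED ((uintptr_t)1)
#define GC_TOY_STACK_SIZE 8

typedef struct {
    GcToyRef slots[GC_TOY_STACK_SIZE];
} GcToyFrame;

/* Zero the frame so unused slots read as "empty" instead of garbage. */
static void gc_toy_frame_init(GcToyFrame *frame)
{
    memset(frame, 0, sizeof(*frame));
}

/* Walk every slot and mark deferred objects as reachable. Entries with
 * ordinary references are kept alive by their refcount, so this toy
 * scanner ignores them. Returns the number of objects marked. */
static size_t gc_toy_visit_frame(GcToyFrame *frame)
{
    size_t marked = 0;
    for (size_t i = 0; i < GC_TOY_STACK_SIZE; i++) {
        uintptr_t bits = frame->slots[i].bits;
        if (bits == 0) {
            continue;   /* empty slot, safe only because of zero init */
        }
        if (bits & GC_TOY_TAG_DEFERRED) {
            GcToyObject *op = (GcToyObject *)(bits & ~GC_TOY_TAG_DEFERRED);
            op->reachable = true;
            marked++;
        }
    }
    return marked;
}
```

Without `gc_toy_frame_init`, an unused slot could hold arbitrary bits with the low bit set, and the scanner would dereference a garbage pointer.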
Feature or enhancement
Proposal:
This issue mostly mirrors Mark's original one at faster-cpython/ideas#632.
As part of PEP 703, tagged pointers will help achieve deferred reference counting.
Tagged pointers will also help the default GIL builds in 3.14 and later by enabling unboxed ints.
Here's the initial design (open for comments):
We change the evaluation stack from `PyObject *` to `PyTaggedObject`. `PyTaggedObject` looks like this.
Note: I am limiting the tagged pointers to just the evaluation stack (this is up for debate as well). The main reason is to keep the change as minimally invasive as possible for 3.13 and not leak out any implementation details for now. For example, I don't want to support tagged pointers in C API functions. This may/will change in 3.14.
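The definition itself did not survive in this copy of the issue. Judging only from how `.obj` and `.bits` are used elsewhere in the thread, a pointer-sized union with two views is one plausible shape. The sketch below is a guess along those lines: `UnionToyObject`, `UnionToyTagged`, `UNION_TOY_TAGS` and both accessors are hypothetical names, not the proposed CPython definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for PyObject. */
typedef struct {
    intptr_t refcnt;
} UnionToyObject;

/* Two views of the same word: a plain pointer, or raw bits with tags. */
typedef union {
    UnionToyObject *obj;
    uintptr_t bits;
} UnionToyTagged;

/* Assumed mask covering every tag bit (a single low bit in this sketch). */
#define UNION_TOY_TAGS ((uintptr_t)1)

/* The stack entry stays the size of one pointer on mainstream platforms. */
_Static_assert(sizeof(UnionToyTagged) == sizeof(void *),
               "tagged entry should be pointer-sized");

/* Accessor for an entry known to carry no tag. */
static UnionToyObject *union_toy_get_plain(UnionToyTagged effect)
{
    return effect.obj;
}

/* Accessor for an entry that may carry tag bits. */
static UnionToyObject *union_toy_get_untagged(UnionToyTagged effect)
{
    return (UnionToyObject *)(effect.bits & ~UNION_TOY_TAGS);
}
```

Keeping the entry pointer-sized means the evaluation stack layout does not grow when the element type changes.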
Thanks to the cases generator, we can then autogenerate all the proper accessors to said object. For example, a stack effect of `PyObject *` will generate the code `effect.obj`, while a stack effect of `PyTaggedObject` (the new default) will generate the code `(PyObject *)(effect.bits & ~TAGS)` or something similar.
Unboxed ints will also help both free-threaded and non-free-threaded builds by reducing refcounting even more for simple arithmetic operations (and making integer operations native). However, I am aiming for deferred tags in 3.13, and anything else is a plus.
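Unboxed ints are not part of the initial design, so the following is only an illustration of the general technique, using an encoding invented for this example: `WordRef`, `WORD_TAG_INT` and the `word_*` functions are hypothetical, and the choice of low bit 1 for "small int" is an assumption rather than anything proposed in the issue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical encoding: low bit 1 = unboxed int, low bit 0 = pointer. */
#define WORD_TAG_INT ((uintptr_t)1)

typedef struct {
    uintptr_t bits;
} WordRef;

static bool word_is_int(WordRef w)
{
    return (w.bits & WORD_TAG_INT) != 0;
}

/* Encode a small int. The value must fit in one bit less than intptr_t;
 * larger values would need a boxed object instead. */
static WordRef word_from_int(intptr_t v)
{
    WordRef w = { ((uintptr_t)v << 1) | WORD_TAG_INT };
    return w;
}

/* Decode: clear the tag, reinterpret as signed, then halve. Division is
 * used instead of >> so the result does not depend on how the compiler
 * shifts negative values. Assumes two's complement conversion. */
static intptr_t word_to_int(WordRef w)
{
    assert(word_is_int(w));
    return (intptr_t)(w.bits - WORD_TAG_INT) / 2;
}

/* Add two unboxed ints with no refcount traffic at all. Returns false
 * if the sum no longer fits, in which case a caller would fall back to
 * boxed integers. */
static bool word_add(WordRef a, WordRef b, WordRef *out)
{
    intptr_t sum = word_to_int(a) + word_to_int(b);
    if (sum > INTPTR_MAX / 2 || sum < INTPTR_MIN / 2) {
        return false;
    }
    *out = word_from_int(sum);
    return true;
}
```

Both operands occupy at most one bit less than `intptr_t`, so the intermediate sum cannot overflow before the range check.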
Further note: because we want to respect immediate finalizers, we are not deferring that many objects, so this may not be a win on default builds. On free-threaded builds, I see a slight (10%) speedup on fibonacci microbenchmarks. I will benchmark the default build as well and report the results.
Has this already been discussed elsewhere?
No response given
Links to previous discussion of this feature:
No response
Linked PRs
`_PyStackRef` related bugs #122831