Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-116168: Remove extra _CHECK_STACK_SPACE uops #117242

Conversation

lazorchakp
Copy link
Contributor

@lazorchakp lazorchakp commented Mar 26, 2024

This PR reduces the number of _CHECK_STACK_SPACE operands in tier2 bytecode.

Here's the basic design:

  • Introduce a new tier2-only uop, _CHECK_STACK_SPACE_OPERAND, which contains the amount space to check in its operand.
  • during uops optimization, calculate how much stack space we will need for the whole trace. Combine all _CHECK_STACK_SPACEs into a single _CHECK_STACK_SPACE_OPERAND.

potential changes / refactors to consider:

  • Merge combine_stack_space_checks with peephole_opt or remove_unneeded_uops?
  • All the #ifdef Py_DEBUGs in the core logic here are kind of distracting. Not sure if there's a more preferable way to implement these safety checks.

I'm wide open to feedback / criticism / suggestions!

Copy link
Member

@Fidget-Spinner Fidget-Spinner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Just one comment for now.

Python/optimizer.c Outdated Show resolved Hide resolved
@lazorchakp lazorchakp marked this pull request as ready for review March 28, 2024 04:42
@lazorchakp
Copy link
Contributor Author

Not sure if this needs news, but I'm happy to add it if necessary. The failing test is a timeout that seems to be occurring on more than just my branch

Copy link
Member

@Fidget-Spinner Fidget-Spinner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome work! I just have a few comments about the tests.

@@ -957,6 +957,16 @@ iframe_getlasti(PyObject *self, PyObject *frame)
return PyLong_FromLong(PyUnstable_InterpreterFrame_GetLasti(f));
}

static PyObject *
get_co_framesize(PyObject *self, PyObject *arg) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like how comprehensive this test is. But I don't think we shouldn't test for such implementation details. We should just test the bytecode trace is correct and that's the limit.

Also, _CHECK_STACK_SPACE shouldn't affect frame size. It's a check after all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to remove the assertions for exact framesizes, but I'd still like to keep this function. It's useful to help determine that _CHECK_STACK_SPACE_OPERAND is assigned the correct operand.

To be clear, I'm not worried about my changes affecting the framesizes themselves, just that the calculated total stack space required is accurate.

}
}
}
Py_FatalError("No terminating instruction");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Py_UNREACHABLE alone is enough usually.

Suggested change
Py_FatalError("No terminating instruction");

Copy link
Member

@markshannon markshannon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this.

The approach looks sound, and the tests thorough.
I have a few suggestions, specifically that the pass by added to optimize_uops to avoid adding another pass.

def _assert_framesize_cross_platform(self, fn_obj, expected_framesize):
self.assertEqual(
_testinternalcapi.get_co_framesize(fn_obj.__code__),
expected_framesize + self.FRAMESIZE_ADJUSTMENT
Copy link
Member

@markshannon markshannon Mar 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It isn't clear why this adjustment is necessary.
Also, would it be cleaner to do the adjustment in _testinternalcapi.get_co_framesize?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It isn't clear to me either why framesizes vary across platforms 😅 but based on other comments I'm thinking that this assertion is overly specific and a bit brittle, so I'm just going to remove it. I don't think I want to adjust get_co_framesize, since I need this to report the actual framesize in every case

return co;
}

// _PUSH_FRAME is 3 ops in front of _CHECK_STACK_SPACE
Copy link
Member

@markshannon markshannon Mar 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems fragile, and I don't think you need it if you track the current stack depth record the location of the previous _CHECK_STACK_SPACE and patch it in _PUSH_FRAME.

* calls in this trace.
*/
static void
combine_stack_space_checks(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than make this a separate pass, could you integrate into optimize_uops by adding it to optimizer_bytecodes.c.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to do this, but to confirm - do you definitely prefer integration into optimize_uops over peephole_opt or remove_unneeded_uops? Any of these would prevent an extra pass.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Fidget-Spinner may also have an opinion here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both peephole_opt or optimize_uops is fine to me. peephole_opt would likely be clearer as this pass does not use any information from the abstract interpreter.

bool must_exit = false;
#endif
int depth = 0;
uint32_t space_at_depth[MAX_ABSTRACT_FRAME_DEPTH];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You only need to track the current depth, as this is a single pass algorithm.
In fact, this may be incorrect, if more than one function is called with differing stack requirements.

I think, only the following variables are needed:

  • The first check (address or offset)
  • The current stack size
  • The maximum requested stack size

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Plus the last seen stack check, to avoid hard-coding the offset between CHECK_STACK_SPACE and _PUSH_FRAME.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion - I was able to remove my makeshift stack. I had to use one additional variable to keep track of the current frame's size, since there's no way I could see to access this from _POP_FRAME (we only have the code object being returned to, not the one we're returning from)

if (first_valid_check_stack != NULL) {
assert(first_valid_check_stack->opcode == _CHECK_STACK_SPACE);
assert(max_space > 0);
assert(max_space < INT_MAX);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert(max_space < INT_MAX);
if (max_space < INT_MAX) {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe reduce the limit to UNIT16_MAX?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, setting a valid _CHECK_STACK_SPACE_OPERAND is mandatory if even a single _CHECK_STACK_SPACE is deleted during this pass, so I can't make this conditional on the value of max_space. I'm going to add some additional checks to ensure max_space is always an acceptable value and rework the exit logic

@bedevere-app
Copy link

bedevere-app bot commented Mar 28, 2024

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

@lazorchakp
Copy link
Contributor Author

I have made the requested changes; please review again

@bedevere-app
Copy link

bedevere-app bot commented Mar 30, 2024

Thanks for making the requested changes!

@markshannon: please review the changes made to this pull request.

@bedevere-app bedevere-app bot requested a review from markshannon March 30, 2024 01:55
@Fidget-Spinner
Copy link
Member

I think this is fine to merge now. Will wait a few days if Mark has any reviews left.

@gvanrossum
Copy link
Member

I am working on one last review. Looks like just some style nits.

Copy link
Member

@gvanrossum gvanrossum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great. I do have a few style nits. Once you fix those I will gladly merge it!

/* _PUSH_FRAME/_POP_FRAME's operand can be 0, a PyFunctionObject *, or a
* PyCodeObject *. Retrieve the code object if possible.
*/
static PyCodeObject *get_co(_PyUOpInstruction *op) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style nit:

Suggested change
static PyCodeObject *get_co(_PyUOpInstruction *op) {
static PyCodeObject *
get_co(_PyUOpInstruction *op)
{

@@ -4,6 +4,7 @@
import unittest
import gc
import os
import platform
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems unused?

self.assertEqual(res, 832)
self.assertIsNotNone(ex)

uops_and_operands = [(opcode, operand) for opcode, _, _, operand in list(ex)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The list() call is superfluous:

Suggested change
uops_and_operands = [(opcode, operand) for opcode, _, _, operand in list(ex)]
uops_and_operands = [(opcode, operand) for opcode, _, _, operand in ex]

(Ditto below.)

c, d = dummy13(9)
a += b + c + d
return a

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One blank line is enough. :-)

(Ditto below.)

self.assertEqual(uop_names.count("_PUSH_FRAME"), 2)
self.assertEqual(uop_names.count("_CHECK_STACK_SPACE_OPERAND"), 1)

# this hits a different case during trace projection in refcount test runs only,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now you've made me curious -- what does the refcount test do different?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gvanrossum I'm still not sure why this is happening, but here's the behavior:

On the first round of a refcount test, trace projection appears to proceed normally. We end up with a large trace that's cut off part of the way through the uops for dummy_large. On all subsequent rounds, we end up in the new_code == NULL case while emitting _PUSH_FRAME in trace projection (code).

That prevents me from having access to a code object in peephole optimization, so I can't remove the corresponding _CHECK_STACK_SPACE.

Interestingly, we never hit new_code == NULL in testing with --forever.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably refcount tests do other weird stuff and maybe somehow the func/code version cache entry is overwritten? Let's not worry about it, refcount tests are always weird.

@@ -957,6 +957,16 @@ iframe_getlasti(PyObject *self, PyObject *frame)
return PyLong_FromLong(PyUnstable_InterpreterFrame_GetLasti(f));
}

static PyObject *
get_co_framesize(PyObject *self, PyObject *arg) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style nit

Suggested change
get_co_framesize(PyObject *self, PyObject *arg) {
get_co_framesize(PyObject *self, PyObject *arg)
{

int framesize = co->co_framesize;
assert(framesize > 0);
curr_space += framesize;
if (curr_space < 0 || curr_space > INT32_MAX) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second check cannot succeed when int is 32 bits, right? I guess the compiler will then omit the check.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right. I was worried about cases where int is larger than the 32-bit operand. I'm comparing to INT32_MAX instead of UINT32_MAX to avoid compiler warnings; technically that makes this comparison 1 bit more strict than necessary, but I imagine this is an edge case that will almost never be hit anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 actually, the comparison isn't overly strict, since I'm already requiring this number to be a positive int. Anyway, I'm still convinced this is a necessary check.

assert(framesize > 0);
curr_space += framesize;
if (curr_space < 0 || curr_space > INT32_MAX) {
// overflow or won't fit in operand
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will fit in operand (which is always 64 bits) but not in the cache entry for the new instruction (which is always unsigned 32 bits) or won't be able to be converted to signed int (which it must be for the _PyThreadState_HasStackSpace() call). So the check is right but the comment feels like it is misrepresenting the situation. My suggestion:

Suggested change
// overflow or won't fit in operand
// won't fit in signed 32-bit int

@lazorchakp
Copy link
Contributor Author

I've addressed all comments and the pipeline looks good, so this is once again ready for review.

Comment on lines 1158 to 1159
# a99999 = a99998 + 1
# return a99999
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably more this, since repetitions is 10000:

Suggested change
# a99999 = a99998 + 1
# return a99999
# a9999 = a9998 + 1
# return a9999

@gvanrossum
Copy link
Member

Can you run make regen-case? Then we'll merge.

@lazorchakp
Copy link
Contributor Author

👍 just ran make regen-cases with the latest updates from main

@gvanrossum gvanrossum enabled auto-merge (squash) April 3, 2024 16:44
@gvanrossum gvanrossum merged commit 1c43468 into python:main Apr 3, 2024
52 checks passed
@lazorchakp lazorchakp deleted the gh-116168-remove-extra-_check_stack_space-ops branch April 3, 2024 17:14
@lazorchakp
Copy link
Contributor Author

Thank you all for your patience and guidance on this work! I learned a lot throughout the process and quite enjoyed it.

@gvanrossum
Copy link
Member

Thank you! Looking forward to more contributions from you.

diegorusso pushed a commit to diegorusso/cpython that referenced this pull request Apr 17, 2024
This merges all `_CHECK_STACK_SPACE` uops in a trace into a single `_CHECK_STACK_SPACE_OPERAND` uop that checks whether there is enough stack space for all calls included in the entire trace.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants