GH-123516: Improve JIT memory consumption by invalidating cold executors #123402
Conversation
Looking at this now! Since we'll probably iterate on heuristics for invalidation later, do you mind creating a new issue just for cold executor invalidation, and link that one instead?
So excited to see this! Some thoughts I had while reading:
Misc/NEWS.d/next/Core_and_Builtins/2024-08-27-21-44-14.gh-issue-116017.ZY3yBY.rst (outdated; resolved)
Co-authored-by: Bénédikt Tran <[email protected]>
Looks promising. I have a couple of comments.
Do we have any stats or performance numbers on this?
Python/optimizer.c (outdated)

```diff
@@ -182,6 +182,12 @@ _PyOptimizer_Optimize(
     if (err <= 0) {
         return err;
     }
+
+    if (++interp->executors_created >= JIT_CLEANUP_THRESHOLD) {
```
This could behave quite badly with many executors.
Suppose we have 1000 executors:
Even in the ideal case, we need to scan all 1000, every time we create 10 new ones.
But what worries me is that we could clear all executors far too easily.
Suppose the VM starts compiling some new region of code, resulting in 20 new executors.
After the first 10, we will mark all executors as cold. After the second 10 we will clear all executors.
I'm not sure what the solution is, but I think a ticker that is independent of the number of executors created would be better. Maybe the GC trigger, or something similar?
In earlier runs of the benchmarks, there was no impact on performance, even in instances where I tried pretty aggressive thresholds. In discussing this with Brandt, it seemed cheap to re-create executors in cases where we might be overzealous in invalidation.
Probably also worth mentioning that I did experiment with triggering invalidation on GC earlier, but in chatting with Brandt, it seemed like a better design to tie this to executor creation (i.e. since the invalidation isn't really coupled to the GC at all today).
I'm sure that @pablogsal probably has thoughts here too?
> I'm sure that @pablogsal probably has thoughts here too?
I think it may be sensible to do it in the eval breaker, or at any other similar point in VM execution where we know it's safe. This is equivalent to the GC (as both are triggered there), still allows us to add custom triggering logic, and decouples the run from the GC.
For context: I would still prefer not to tie this to the GC. In general we have held off coupling the GC with other things for a long time, because any bug in the GC or in GC objects is already very non-trivial, hard-to-debug spooky action at a distance, and we have learned that (1) controlling any side effects that are not purely derived from the generic algorithm and (2) keeping the GC from having to know what it's operating on have been key to keeping complexity at bay.
> I'm not sure what the solution is, but I think a ticker that is independent of the number of executors created would be better. Maybe the GC trigger, or something similar?
I think the current approach is probably good enough for now, and we can iterate on the heuristics. The potential quadratic behavior (and wipe-out-everything-ness) of the current approach is a bit scary, but is unlikely to seriously impact most programs. It can also be mitigated by a follow-up PR to change "every ten executors" to "10% more executors" or "20% more JIT memory" or some other ratio-based approach, like the GC uses.
I also think the GC is probably the wrong place for this. We could add a new dedicated eval-breaker bit, but that sort of feels like overkill. The creation of new executors is a known safe point where (a) we expect to spend a bit of time managing JIT stuff and (b) we know JIT memory pressure will increase by some fixed amount. I don't think it would be any better to just set an eval breaker bit here and leave our new JIT code almost immediately to perform this same cleanup.
@markshannon Thanks for taking a look. The preliminary results looked something like this: #123402 (comment). The approach has been updated since then, so it's worth running the benchmarks again. @brandtbucher, what are your thoughts about re-running the benchmarks at this point?
I don't believe the main heuristic has changed, and the little improvements in bit-twiddling and such are unlikely to impact performance or memory significantly. We probably don't need to re-run anything.
Just a couple notes on naming, otherwise looks like a clear improvement to me. Thanks for your patience!
Succeeded by #124443
This PR aims to improve memory usage of the JIT by periodically invalidating "cold" executors (more about preliminary results in #123402 (comment)). More specifically, this PR:

- Adds a new bit to `PyExecutorObject`'s `vm_data` called `warm`, which serves as a boolean to track whether an executor has been run (`warm` is incremented in a new uop, `_MAKE_WARM`, tacked on to the beginning of the trace)
- Every time a set number of executors are created (`JIT_CLEANUP_THRESHOLD`), we run `_Py_Executors_InvalidateCold` as part of `_PyOptimizer_Optimize` (where executors are created)
- `_Py_Executors_InvalidateCold` is a new function that enables us to traverse the linked list of executors (stored on `PyExecutorObject`) and invalidate any executor that has not been run (because it is not being used, and it is fairly cheap to recreate if need be)

I'll also note that there's more that we could try here (I've got some other experiments that I'd like to run), but I wanted to at least start by landing these changes.