
GH-123516: Improve JIT memory consumption by invalidating cold executors #123402

Closed

Conversation

@savannahostrowski (Member) commented on Aug 27, 2024

This PR aims to improve the JIT's memory usage by periodically invalidating "cold" executors (more on the preliminary results in #123402 (comment)). More specifically, this PR:

  • Introduces a new field on PyExecutorObject's vm_data called warm, a boolean that tracks whether an executor has been run (warm is set by a new uop, _MAKE_WARM, emitted at the beginning of each trace)
  • Every 10 executor creations (set via JIT_CLEANUP_THRESHOLD), runs _Py_Executors_InvalidateCold as part of _PyOptimizer_Optimize (where executors are created)
  • Adds _Py_Executors_InvalidateCold, a new function that traverses the linked list of executors (stored on PyExecutorObject) and invalidates any executor that has not been run, since an unused executor is fairly cheap to recreate if needed (a simplified sketch of this sweep follows the list)
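
The sketch below is a rough illustration of that sweep, not the actual CPython implementation: the struct layout, field types, and the valid-flag handling are simplified assumptions based on the description above.

/* Minimal sketch of the cold-executor sweep described above.
 * NOT the actual CPython code: the struct layout and field types
 * are simplified assumptions based on the PR description. */
typedef struct executor {
    struct executor *vm_next;   /* next executor in the per-interpreter list */
    int warm;                   /* set by the _MAKE_WARM uop when the trace runs */
    int valid;                  /* cleared here to invalidate the executor */
} executor;

static void
invalidate_cold_executors(executor *head)
{
    for (executor *e = head; e != NULL; e = e->vm_next) {
        if (!e->warm) {
            /* Never run since the last sweep: invalidate. It is cheap
             * to recreate if this code becomes hot again. */
            e->valid = 0;
        }
        else {
            /* Survivors are demoted back to cold, so they must run
             * again before the next sweep to stay alive. */
            e->warm = 0;
        }
    }
}

The demotion step in the else branch is what the review below refers to as marking all executors as cold.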

I'll also note that there's more that we could try here (I've got some other experiments that I'd like to run), but I wanted to at least start by landing these changes.

@brandtbucher changed the title from "[gh-116017] Improve JIT memory consumption by invalidating cold executors" to "GH-116017: Improve JIT memory consumption by invalidating cold executors" on Aug 27, 2024
[Resolved review thread on Python/optimizer.c]
@savannahostrowski added the 3.14 and topic-JIT labels on Aug 27, 2024
@brandtbucher (Member) commented:

Looking at this now!

Since we'll probably iterate on heuristics for invalidation later, do you mind creating a new issue just for cold executor invalidation and linking that one instead?

@brandtbucher (Member) left a review:

So excited to see this! Some thoughts I had while reading:

[Review threads on Include/internal/pycore_optimizer.h, Include/internal/pycore_interp.h, Python/bytecodes.c, and Python/optimizer.c — mostly resolved]
@brandtbucher self-requested a review on September 6, 2024 20:24
@markshannon (Member) left a review:

Looks promising. I have a couple of comments.

Do we have any stats or performance numbers on this?

[Resolved review thread on Include/internal/pycore_optimizer.h]
@@ -182,6 +182,12 @@ _PyOptimizer_Optimize(
    if (err <= 0) {
        return err;
    }

    if (++interp->executors_created >= JIT_CLEANUP_THRESHOLD) {
Member commented:

This could behave quite badly with many executors.

Suppose we have 1000 executors:
Even in the ideal case, we need to scan all 1000 every time we create 10 new ones.

But what worries me is that we could clear all executors far too easily.
Suppose the VM starts compiling some new region of code, resulting in 20 new executors.
After the first 10, we will mark all executors as cold. After the second 10 we will clear all executors.

I'm not sure what the solution is, but I think a ticker that is independent of the number of executors created would be better. Maybe the GC trigger, or something similar?

@savannahostrowski (Member, Author) commented:

In earlier runs of the benchmarks, there was no impact on performance, even in instances where I tried pretty aggressive thresholds. In discussing this with Brandt, it seemed cheap to recreate executors in cases where we might be overzealous in invalidation.

@savannahostrowski (Member, Author) commented:

It's probably also worth mentioning that I did experiment with triggering invalidation on GC earlier, but in chatting with Brandt, it seemed like a better design to tie this to executor creation (since the invalidation isn't really coupled to GC at all today).

I'm sure that @pablogsal probably has thoughts here too?

@pablogsal (Member) commented on Sep 13, 2024:

I'm sure that @pablogsal probably has thoughts here too?

I think it may be sensible to do it in the eval breaker, or at any other similar point in VM execution where we know it's safe to do. This is equivalent to the GC (both are triggered there), and it still allows us to add custom triggering logic while decoupling the sweep from the GC.

For context: I would still prefer not to tie this to the GC. In general we have held off on coupling the GC with other subsystems for a long time, because any bug in the GC or in GC objects is already a very non-trivial, hard-to-debug, spooky-action-at-a-distance problem, and we have learned that (1) controlling any side effects that are not purely derived from the generic algorithm and (2) keeping the GC from having to know what it's operating on have been key to keeping complexity at bay.

@brandtbucher (Member) commented on Sep 13, 2024:

I'm not sure what the solution is, but I think a ticker that is independent of the number of executors created would be better. Maybe the GC trigger, or something similar?

I think the current approach is probably good enough for now, and we can iterate on the heuristics. The potential quadratic behavior (and wipe-out-everything-ness) of the current approach is a bit scary, but is unlikely to seriously impact most programs. It can also be mitigated by a follow-up PR to change "every ten executors" to "10% more executors" or "20% more JIT memory" or some other ratio-based approach, like the GC uses.

I also think the GC is probably the wrong place for this. We could add a new dedicated eval-breaker bit, but that sort of feels like overkill. The creation of new executors is a known safe point where (a) we expect to spend a bit of time managing JIT stuff and (b) we know JIT memory pressure will increase by some fixed amount. I don't think it would be any better to just set an eval breaker bit here and leave our new JIT code almost immediately to perform this same cleanup.
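
For illustration only, a ratio-based trigger along those lines might look like the sketch below; the function, the counters, and the 10% growth factor are all hypothetical assumptions, not something this PR (or CPython) implements.

#include <stddef.h>

/* Hypothetical GC-style trigger: sweep when the executor count has grown
 * by some percentage since the last sweep, instead of every N creations.
 * All names and the 10% factor here are illustrative assumptions. */
#define JIT_CLEANUP_GROWTH_PERCENT 10

static int
should_sweep_cold_executors(size_t executors_now, size_t executors_at_last_sweep)
{
    size_t growth = executors_at_last_sweep * JIT_CLEANUP_GROWTH_PERCENT / 100;
    /* The +1 keeps small programs progressing even when 10% rounds to zero. */
    return executors_now >= executors_at_last_sweep + growth + 1;
}

Under such a scheme, 20 new executors in a program that already has 1000 would not trigger a sweep at all, avoiding the wipe-out scenario described above, and sweeps become proportionally rarer as the executor population grows, bounding the rescanning cost.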

@savannahostrowski (Member, Author) commented:

@markshannon Thanks for taking a look. The preliminary results looked something like this: #123402 (comment). The approach has been updated since then, so it's worth running the benchmarks again. @brandtbucher, what are your thoughts about re-running the benchmarks at this point?

@brandtbucher (Member) commented:

@brandtbucher, what are your thoughts about re-running the benchmarks at this point?

I don't believe the main heuristic has changed, and the little improvements in bit-twiddling and such are unlikely to impact performance or memory significantly. We probably don't need to re-run anything.

@brandtbucher (Member) left a review:

Just a couple notes on naming, otherwise looks like a clear improvement to me. Thanks for your patience!

[Resolved review threads on Include/internal/pycore_interp.h, Include/internal/pycore_optimizer.h, and Python/bytecodes.c]
@savannahostrowski (Member, Author) commented:

Succeeded by #124443

@savannahostrowski deleted the jit-mem-invalidate-10 branch on September 27, 2024 16:52