
GH-123516: Improve JIT memory consumption by invalidating cold executors #123402

Closed

Conversation

@savannahostrowski (Member) commented on Aug 27, 2024

This PR aims to improve the JIT's memory usage by periodically invalidating "cold" executors (more on the preliminary results in #123402 (comment)). More specifically, this PR:

  • Introduces a new field on PyExecutorObject's vm_data called warm, a boolean that tracks whether an executor has been run (warm is set by a new uop, _MAKE_WARM, emitted at the beginning of each trace)
  • Every 10 executor creations (set via JIT_CLEANUP_THRESHOLD), runs _Py_Executors_InvalidateCold as part of _PyOptimizer_Optimize (where executors are created)
  • Adds _Py_Executors_InvalidateCold, a new function that traverses the linked list of executors (stored on PyExecutorObject) and invalidates any executor that has not been run, since an unused executor is fairly cheap to recreate if needed (a simplified sketch of this sweep follows the list)
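
The sketch below is a rough illustration of that sweep, not the actual CPython implementation: the struct layout, field types, and the valid-flag handling are simplified assumptions based on the description above.

/* Minimal sketch of the cold-executor sweep described above.
 * NOT the actual CPython code: the struct layout and field types
 * are simplified assumptions based on the PR description. */
typedef struct executor {
    struct executor *vm_next;   /* next executor in the per-interpreter list */
    int warm;                   /* set by the _MAKE_WARM uop when the trace runs */
    int valid;                  /* cleared here to invalidate the executor */
} executor;

static void
invalidate_cold_executors(executor *head)
{
    for (executor *e = head; e != NULL; e = e->vm_next) {
        if (!e->warm) {
            /* Never run since the last sweep: invalidate. It is cheap
             * to recreate if this code becomes hot again. */
            e->valid = 0;
        }
        else {
            /* Survivors are demoted back to cold, so they must run
             * again before the next sweep to stay alive. */
            e->warm = 0;
        }
    }
}

The demotion step in the else branch is what the review below refers to as marking all executors as cold.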

I'll also note that there's more that we could try here (I've got some other experiments that I'd like to run), but I wanted to at least start by landing these changes.

@brandtbucher changed the title from "[gh-116017] Improve JIT memory consumption by invalidating cold executors" to "GH-116017: Improve JIT memory consumption by invalidating cold executors" on Aug 27, 2024
[Resolved review thread on Python/optimizer.c]
@savannahostrowski added the 3.14 and topic-JIT labels on Aug 27, 2024
@brandtbucher (Member) commented:

Looking at this now!

Since we'll probably iterate on heuristics for invalidation later, do you mind creating a new issue just for cold executor invalidation and linking that one instead?

@brandtbucher (Member) left a review:

So excited to see this! Some thoughts I had while reading:

[Review threads on Include/internal/pycore_optimizer.h, Include/internal/pycore_interp.h, Python/bytecodes.c, and Python/optimizer.c — mostly resolved]
@brandtbucher self-requested a review on September 6, 2024 20:24
@markshannon (Member) left a review:

Looks promising. I have a couple of comments.

Do we have any stats or performance numbers on this?

[Resolved review thread on Include/internal/pycore_optimizer.h]
@@ -182,6 +182,12 @@ _PyOptimizer_Optimize(
    if (err <= 0) {
        return err;
    }

    if (++interp->executors_created >= JIT_CLEANUP_THRESHOLD) {
Member commented:

This could behave quite badly with many executors.

Suppose we have 1000 executors:
Even in the ideal case, we need to scan all 1000 every time we create 10 new ones.

But what worries me is that we could clear all executors far too easily.
Suppose the VM starts compiling some new region of code, resulting in 20 new executors.
After the first 10, we will mark all executors as cold. After the second 10 we will clear all executors.

I'm not sure what the solution is, but I think a ticker that is independent of the number of executors created would be better. Maybe the GC trigger, or something similar?

@savannahostrowski (Member, Author) commented:

In earlier runs of the benchmarks, there was no impact on performance, even in instances where I tried pretty aggressive thresholds. In discussing this with Brandt, it seemed cheap to recreate executors in cases where we might be overzealous in invalidation.

@savannahostrowski (Member, Author) commented:

It's probably also worth mentioning that I did experiment with triggering invalidation on GC earlier, but in chatting with Brandt, it seemed like a better design to tie this to executor creation (since the invalidation isn't really coupled to GC at all today).

I'm sure that @pablogsal probably has thoughts here too?

@pablogsal (Member) commented on Sep 13, 2024:

I'm sure that @pablogsal probably has thoughts here too?

I think it may be sensible to do it in the eval breaker, or at any other similar point in VM execution where we know it's safe to do. This is equivalent to the GC (both are triggered there), and it still allows us to add custom triggering logic while decoupling the sweep from the GC.

For context: I would still prefer not to tie this to the GC. In general we have held off on coupling the GC with other subsystems for a long time, because any bug in the GC or in GC objects is already a very non-trivial, hard-to-debug, spooky-action-at-a-distance problem, and we have learned that (1) controlling any side effects that are not purely derived from the generic algorithm and (2) keeping the GC from having to know what it's operating on have been key to keeping complexity at bay.

@brandtbucher (Member) commented on Sep 13, 2024:

I'm not sure what the solution is, but I think a ticker that is independent of the number of executors created would be better. Maybe the GC trigger, or something similar?

I think the current approach is probably good enough for now, and we can iterate on the heuristics. The potential quadratic behavior (and wipe-out-everything-ness) of the current approach is a bit scary, but is unlikely to seriously impact most programs. It can also be mitigated by a follow-up PR to change "every ten executors" to "10% more executors" or "20% more JIT memory" or some other ratio-based approach, like the GC uses.

I also think the GC is probably the wrong place for this. We could add a new dedicated eval-breaker bit, but that sort of feels like overkill. The creation of new executors is a known safe point where (a) we expect to spend a bit of time managing JIT stuff and (b) we know JIT memory pressure will increase by some fixed amount. I don't think it would be any better to just set an eval breaker bit here and leave our new JIT code almost immediately to perform this same cleanup.
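
For illustration only, a ratio-based trigger along those lines might look like the sketch below; the function, the counters, and the 10% growth factor are all hypothetical assumptions, not something this PR (or CPython) implements.

#include <stddef.h>

/* Hypothetical GC-style trigger: sweep when the executor count has grown
 * by some percentage since the last sweep, instead of every N creations.
 * All names and the 10% factor here are illustrative assumptions. */
#define JIT_CLEANUP_GROWTH_PERCENT 10

static int
should_sweep_cold_executors(size_t executors_now, size_t executors_at_last_sweep)
{
    size_t growth = executors_at_last_sweep * JIT_CLEANUP_GROWTH_PERCENT / 100;
    /* The +1 keeps small programs progressing even when 10% rounds to zero. */
    return executors_now >= executors_at_last_sweep + growth + 1;
}

Under such a scheme, 20 new executors in a program that already has 1000 would not trigger a sweep at all, avoiding the wipe-out scenario described above, and sweeps become proportionally rarer as the executor population grows, bounding the rescanning cost.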

@savannahostrowski (Member, Author) commented:

@markshannon Thanks for taking a look. The preliminary results looked something like this: #123402 (comment). The approach has been updated since then, so it's worth running the benchmarks again. @brandtbucher, what are your thoughts about re-running the benchmarks at this point?

@brandtbucher (Member) commented:

@brandtbucher, what are your thoughts about re-running the benchmarks at this point?

I don't believe the main heuristic has changed, and the little improvements in bit-twiddling and such are unlikely to impact performance or memory significantly. We probably don't need to re-run anything.

@brandtbucher (Member) left a review:

Just a couple notes on naming, otherwise looks like a clear improvement to me. Thanks for your patience!

[Resolved review threads on Include/internal/pycore_interp.h, Include/internal/pycore_optimizer.h, and Python/bytecodes.c]
@savannahostrowski (Member, Author) commented:

Succeeded by #124443

@savannahostrowski deleted the jit-mem-invalidate-10 branch on September 27, 2024 16:52