JIT: Try to retain entry weight during profile synthesis #111971
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
cc @dotnet/jit-contrib, @AndyAyersMS PTAL. Diffs show numerous methods where we can compute a more precise call count, rather than falling back to …
There's kind of a chicken-and-egg problem here, because we're about to recompute the weights of all the blocks that are preds of the entry block, so relying on the existing weights of those blocks seems a bit odd. Can you post a simple(-ish) non-OSR example where this changes things? If the entry is a loop head (not sure that is possible anymore with the omnipresent scratch BB) there is a …
I had #110693 where I tried to compute …
Thanks for pointing this out; this seems better than relying on the old weights. One thing I notice with this approach is …
Sure. For …

Whereas if we derive the entry weight from the loop's cyclic probability: …
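A hedged, self-contained sketch of the distinction (all names and numbers are illustrative, not from the PR): if the entry block heads a loop, its synthesized weight is the external inflow scaled by the loop's cyclic probability, so the call count has to be recovered by dividing that factor back out.

```cpp
#include <cstdio>

int main()
{
    // Weight entering the entry block from outside the loop; this is the
    // method's true call count.
    double callCount = 100.0;

    // Likelihood of taking the loop's back edge on each iteration.
    double backEdgeLikelihood = 0.99;

    // Cyclic probability of the loop, as profile synthesis would compute it.
    double cyclicProbability = 1.0 / (1.0 - backEdgeLikelihood);

    // The loop header's weight is the external inflow scaled by the cyclic
    // probability, so reading the call count straight off the header would
    // overstate it by 100x in this example.
    double headerWeight = callCount * cyclicProbability;

    printf("header weight: %.1f, derived entry weight: %.1f\n",
           headerWeight, headerWeight / cyclicProbability);
    return 0;
}
```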
I'm tempted to revive this, since we'd ideally compute this early, when the profile is still consistent: I suspect some of the diffs you got on that PR had to do with OSR methods having nonsensical weights on …
Sounds right. IMO it would be best to go that route, since otherwise we may just end up churning things twice...
I think we still run into an ordering problem where we don't know how much flow makes it back into the entry block until profile synthesis has run, but perhaps we can make profile synthesis responsible for updating/setting …
> If you have a …

If the entry block has backedges, I believe we'll always find a loop there, since there is no other possible entry (ignoring OSR for the time being).
So in the case where the entry block is also a loop header, we have to cache the block's original weight before running …
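A minimal sketch of that ordering constraint, using invented stand-ins (`BasicBlock` and `SynthesizeWeights` below are simplified illustrations, not the actual JIT code):

```cpp
#include <cstdio>

using weight_t = double;

// Simplified stand-in for a flow graph block; not RyuJIT's BasicBlock.
struct BasicBlock
{
    weight_t bbWeight;
    bool     isLoopHeader;
};

// Stand-in for the synthesis pass: a loop-header entry block's weight gets
// scaled by the loop's cyclic probability, clobbering the original value.
void SynthesizeWeights(BasicBlock& entry, weight_t cyclicProbability)
{
    if (entry.isLoopHeader)
    {
        entry.bbWeight *= cyclicProbability;
    }
}

int main()
{
    BasicBlock entry{100.0, /* isLoopHeader */ true};

    // Cache the original weight first: once synthesis runs, the header's
    // weight reflects loop iterations rather than the method call count.
    const weight_t cachedEntryWeight = entry.bbWeight;
    SynthesizeWeights(entry, 50.0);

    printf("post-synthesis header weight: %.1f, retained entry weight: %.1f\n",
           entry.bbWeight, cachedEntryWeight);
    return 0;
}
```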
As a follow-up, it seems like this value should be propagated to `fgCalledCount`.
That's my plan; I'll take a look at the diffs in another PR.
Part of #107749. Follow-up to #111971 and #110693.

For methods without profile data, ensure the default call count is available throughout compilation (this had no diffs for me locally). For methods with profile data, compute the call count after synthesis runs to ensure it is available early and reasonably accurate.

I'm only seeing diffs in OSR methods locally, due to the logic in `fgFixEntryFlowForOSR` (which runs right after profile incorporation) no longer affecting `fgCalledCount`. This method guesses that the loop iterates about 100x the method call count, and scales the method entry block's weight down accordingly. This gives the impression later on that `fgCalledCount` is much lower than what we calculated using `fgEntryBB`.

The actual diffs seem to manifest largely in LSRA, which uses `fgCalledCount` to normalize block weights, though there are a few other phases that use `BasicBlock::getBBWeight` in lieu of the raw weight as well. I think we ought to consolidate our block weight strategy at some point, especially if we have newfound faith in `fgCalledCount`. For example, instead of this check in if conversion:

```
if (m_startBlock->getBBWeight(m_comp) > BB_UNITY_WEIGHT * 1.05)
```

Perhaps we could do:

```
if (m_startBlock->bbWeight > fgCalledCount * 1.05)
```

But that's for another PR.
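Assuming `getBBWeight` normalizes raw block weights against `fgCalledCount` (scaling so that the entry maps to `BB_UNITY_WEIGHT`), the two checks above are algebraically equivalent. A hedged sketch under that assumption (the helper below is a simplified model, not the real JIT code, which has extra guards):

```cpp
#include <cstdio>

using weight_t = double;

// RyuJIT's canonical entry weight is 100; the helper below is a simplified
// model of getBBWeight-style normalization.
constexpr weight_t BB_UNITY_WEIGHT = 100.0;

weight_t NormalizedWeight(weight_t bbWeight, weight_t fgCalledCount)
{
    return bbWeight / fgCalledCount * BB_UNITY_WEIGHT;
}

int main()
{
    weight_t fgCalledCount = 800.0;  // synthesized method call count
    weight_t startWeight   = 1000.0; // raw weight of the block under test

    // The existing check, phrased on normalized weights...
    bool viaNormalized =
        NormalizedWeight(startWeight, fgCalledCount) > BB_UNITY_WEIGHT * 1.05;

    // ...and the proposed check on raw weights: multiplying both sides of
    // the first comparison by fgCalledCount / BB_UNITY_WEIGHT yields it.
    bool viaRaw = startWeight > fgCalledCount * 1.05;

    printf("normalized: %d, raw: %d\n", viaNormalized, viaRaw); // both print 1
    return 0;
}
```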
Part of #107749. Prerequisite to #111915. Regardless of the profile synthesis option used, we ought to maintain the method's entry weight, which is computed by summing all non-flow weight into the entry block. Ideally, we'd use `fgCalledCount` here, but this isn't computed until after morph, and we need to tolerate the existence of multiple entry blocks for methods with OSR pre-morph.
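A hedged sketch of that computation, with simplified stand-ins for the JIT's flow-graph types (not the actual implementation): the weight that enters the entry block but does not arrive via a flow edge must be attributable to calls into the method.

```cpp
#include <cstdio>
#include <vector>
#include <algorithm>

using weight_t = double;

// Simplified stand-ins for RyuJIT's FlowEdge/BasicBlock, illustration only.
struct FlowEdge
{
    weight_t likelyWeight; // predecessor weight scaled by edge likelihood
};

struct BasicBlock
{
    weight_t              bbWeight;
    std::vector<FlowEdge> predEdges; // flow into this block, including back edges
};

// Entry weight = the entry block's weight minus all weight arriving on flow
// edges; whatever remains must have entered via a call to the method.
weight_t ComputeEntryWeight(const BasicBlock& entry)
{
    weight_t entryWeight = entry.bbWeight;
    for (const FlowEdge& pred : entry.predEdges)
    {
        entryWeight -= pred.likelyWeight;
    }
    // Clamp: an inconsistent profile can make the incoming flow exceed the
    // block's own weight.
    return std::max(entryWeight, weight_t(0));
}

int main()
{
    // Entry block with weight 10100: 100 from calls, 10000 from a back edge.
    BasicBlock entry{10100.0, {{10000.0}}};
    printf("entry weight: %.1f\n", ComputeEntryWeight(entry)); // 100.0
    return 0;
}
```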