JIT: Flowgraph Modernization and Improved Block Layout in .NET 10 #107749
Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), User Story (A single user-facing feature. Can be grouped under an epic.)
amanasifkhalid added the area-CodeGen-coreclr label on Sep 12, 2024
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
JulieLeeMSFT added the User Story label on Sep 12, 2024
This was referenced Sep 17, 2024
amanasifkhalid added a commit that referenced this issue on Oct 10, 2024
Part of #107749, and follow-up to #107927. When computing a RPO of the flow graph, ensuring that the entirety of a loop body is visited before any of the loop's successors has the benefit of keeping the loop body compact in the traversal. This is certainly ideal when computing an initial block layout, and may be preferable for register allocation, too. Thus, this change formalizes loop-aware RPO creation as part of the flowgraph API surface, and uses it for LSRA's block sequence.
rzikm pushed a commit to rzikm/dotnet-runtime that referenced this issue on Oct 11, 2024
This was referenced Oct 15, 2024
amanasifkhalid added a commit that referenced this issue on Oct 18, 2024
…ntiguity later (#108914) Part of #107749. `Compiler::fgMoveColdBlocks` currently moves cold try blocks to the end of their innermost regions. This is problematic for our 3-opt layout plans: When identifying a candidate span of blocks to reorder, assuming that all cold blocks are at the end of the method vastly simplifies our implementation. However, if we have EH regions with their own cold sections, `fgMoveColdBlocks` will interleave hot and cold blocks. To facilitate later layout passes, we can simplify `fgMoveColdBlocks` to naively move all cold blocks to the end of the method, regardless of EH region, and rely on a "fixup" pass for making EH regions contiguous again.

To start, I've tweaked `fgMoveColdBlocks` to break up try regions only. When handlers are placed in the funclet section, we don't need to do anything extra to get cold EH blocks out of the main method body's hot section. However, for jitted x86 code, we don't use the funclet model (yet), so cold handler blocks can still litter the main method body, hindering 3-opt's candidate space. I'd rather not expand on this PR's logic to rebuild handler regions if we can do something simpler, such as getting #101613 merged in, and using `fgRelocateEHRegions` to move all handlers to the end of the method under the assumption that they're cold (i.e. a pseudo-funclet region).

Moving try entry blocks in `fgMoveColdBlocks` proved painful enough that I think we're better off leaving them as-is. Leaving each try region's entry in-place gives us a nice breadcrumb for reinserting the remaining blocks, and it might be beneficial to leave these entries in the candidate span of blocks for 3-opt, so we can effectively move entire try regions just by moving the entry. For try regions that are entirely cold, I can look into calling `fgRelocateEHRegions` before `fgMoveColdBlocks` on all platforms to quickly get these out of the way.
All of this would be unnecessary if we could remove the VM's requirement of contiguous EH regions, and the codegen improvements would likely outweigh the additional VM complexity, though that's a conversation for another day.
This was referenced Oct 30, 2024
amanasifkhalid added a commit that referenced this issue on Nov 13, 2024
Part of #107749. Follow-up to #103450. This refactors 3-opt to minimize layout cost instead of maximizing layout score. This is arguably more intuitive, and it should facilitate implementing a more sophisticated cost model. This PR also adds a mechanism for evaluating the total cost of a given layout, which means we can assert at each move that we actually improved the global layout cost.
This was referenced Nov 13, 2024
mikelle-rogers pushed a commit to mikelle-rogers/runtime that referenced this issue on Dec 10, 2024
Part of dotnet#107749. Follow-up to dotnet#103450. To facilitate implementing a global variant of 3-opt alongside the greedy variant, this moves some shared components to helper methods. I want to do this as a separate PR to ensure this change is truly no-diff, and has minimal (if any) TP impact.
mikelle-rogers pushed a commit to mikelle-rogers/runtime that referenced this issue on Dec 10, 2024
…0034) Part of dotnet#107749. Prerequisite for dotnet#110026. Use postorder number-based BitVecs in RBO and block layout. Use bbID-based BitVecs in fgIncorporateProfileData. This runs early enough in the JIT frontend such that I would expect bbIDs and bbNums to be 1:1, so I don't expect any TP impact from this change. Switch descriptor creation still uses bbNums as a key into a BitVec as a workaround for BB epoch invariants -- I'll try switching this over to bbID in a follow-up to evaluate the TP cost of a sparser bitset.
mikelle-rogers pushed a commit to mikelle-rogers/runtime that referenced this issue on Dec 10, 2024
Fixes dotnet#107076. Part of dotnet#107749. Instead of relying on the "not equal" target of each test block being the next block, explicitly follow the chain of "not equal" targets.
mikelle-rogers pushed a commit to mikelle-rogers/runtime that referenced this issue on Dec 10, 2024
Part of dotnet#107749. Follow-up to dotnet#103450. Greedy 3-opt (i.e. an implementation that requires each move to be profitable on its own) is not well-suited for discovering profitable moves for backward jumps, as such movement requires an unrelated move to first place the source block lexically behind the destination block. Thus, the 3-opt implementation added in dotnet#103450 incorporates a 4-opt move for backward jumps, where we partition 1) before the destination block, 2) before the source block, and 3) directly after the source block. This 4-opt implementation can be expanded to search for the best cut point between the destination and source blocks to maximize its efficacy.
mikelle-rogers pushed a commit to mikelle-rogers/runtime that referenced this issue on Dec 10, 2024
…t#109534) Part of dotnet#107749. Follow-up to dotnet#103450. If 3-opt fails to create fallthrough on an edge because it isn't initially profitable, allow the edge to be considered again, in case future moves make it profitable.
This was referenced Jan 3, 2025
amanasifkhalid added a commit that referenced this issue on Jan 7, 2025
This was referenced Jan 9, 2025
amanasifkhalid added a commit that referenced this issue on Jan 14, 2025
Part of #107749. Enables profile checks for morph and post-morph phases. For benchmarks.run_pgo, 45383 methods are consistent before inlining; after, we're down to 37215, or 82%. By the time we make it to morph, 33461 methods (~74% of the original) are consistent; after morph, we're down to 29402 (~65%). The decline isn't too dramatic for this collection, though I imagine we fare worse elsewhere. --------- Co-authored-by: Andy Ayers <[email protected]>
amanasifkhalid added a commit that referenced this issue on Jan 21, 2025
amanasifkhalid added a commit that referenced this issue on Jan 22, 2025
This was referenced Jan 22, 2025
amanasifkhalid added a commit that referenced this issue on Jan 24, 2025
Part of #107749. Prerequisite to #111764 (comment). Add a post-phase check that ensures BBF_PROF_WEIGHT is set for every reachable block when we have profile data. This check revealed only one site -- when incorporating an inlinee's blocks -- where we couldn't rely on normal flow propagation to set the flag.
amanasifkhalid added a commit that referenced this issue on Jan 25, 2025
Part of #107749. For PGO scenarios, we plan to rely entirely on profile synthesis to propagate profile-derived weights throughout the flowgraph. This is the first step in transitioning away from the existing weight propagation heuristics. We want to keep these around for non-PGO scenarios for now, though ideally, we'll eventually be able to replace these entirely with profile synthesis (plus branch prediction heuristics when we don't have profile data), and consolidate how the profile is maintained for PGO and non-PGO scenarios.
This was referenced Jan 27, 2025
This was referenced Jan 29, 2025
amanasifkhalid added a commit that referenced this issue on Jan 30, 2025
amanasifkhalid added a commit that referenced this issue on Jan 30, 2025
Part of #107749. 3-opt layout already refuses to align branches that cross EH regions, so there doesn't seem to be much utility in reordering each EH region independently. Thus, remove 3-opt's per-region ordering constraints, and run 3-opt once.
amanasifkhalid added a commit that referenced this issue on Jan 31, 2025
Part of #107749. Prerequisite to #111915. Regardless of the profile synthesis option used, we ought to maintain the method's entry weight, which is computed by summing all non-flow weight into the entry block. Ideally, we'd use fgCalledCount here, but this isn't computed until after morph, and we need to tolerate the existence of multiple entry blocks for methods with OSR pre-morph.
amanasifkhalid added a commit that referenced this issue on Feb 1, 2025
Part of #107749. Follow-up to #111971 and #110693. For methods without profile data, ensure the default call count is available throughout compilation (this had no diffs for me locally). For methods with profile data, compute the call count after synthesis runs to ensure it is available early, and reasonably accurate.

I'm only seeing diffs in OSR methods locally, due to the logic in `fgFixEntryFlowForOSR` (which runs right after profile incorporation) no longer affecting `fgCalledCount`. This method guesses that the loop iterates about 100x the method call count, and scales the method entry block's weight down accordingly. This gives the impression later on that `fgCalledCount` is much lower than what we calculated using `fgEntryBB`. The actual diffs seem to manifest largely in LSRA, which uses `fgCalledCount` to normalize block weights, though there are a few other phases that use `BasicBlock::getBBWeight` in lieu of the raw weight as well.

I think we ought to consolidate our block weight strategy at some point, especially if we have newfound faith in `fgCalledCount`. For example, instead of this check in if conversion:

```
if (m_startBlock->getBBWeight(m_comp) > BB_UNITY_WEIGHT * 1.05)
```

Perhaps we could do:

```
if (m_startBlock->bbWeight > fgCalledCount * 1.05)
```

But that's for another PR.
Continuation of #93020. During the .NET 9 development cycle, we removed much of the JIT flowgraph implementation's implicit fall-through invariants, and introduced a new block layout strategy based on a reverse post-order traversal of the graph. For .NET 10, we'd like to push this work further in both directions, with the ultimate goals of zero dependence on lexical block ordering in the JIT's frontend, and a global cost-optimizing layout algorithm in the JIT's backend. Below is an early estimate of what each item entails:
Flowgraph Modernization
- The flow optimizations in `fgUpdateFlowGraph` that we usually run in conjunction with layout aren't designed to run after lowering, so we will likely need to decouple flow opts from layout to make this work.
- `fgNewBBinRegion` may search the block list for insertion points that won't break up existing fall-through. In the JIT frontend, it should make no difference to optimization potential if we just insert new blocks at the end of the list, or at the end of an EH region. Doing this work early should help expose frontend phases that are still sensitive to lexical block ordering (see next task).
- `Compiler::fgSplitEdge` #107419
- `optSetBlockWeights` (see Profile Data section)
- The `BBJ_NONE` block type left behind breadcrumbs in various phases that we ought to clean up, now that we can model flow explicitly.
- `fgUpdateFlowGraph`.

Block layout
Ideally, the below items get us to a state where block layout produces the "best" ordering it can, given the profile data it has on-hand. If the layout is subpar due to missing/inconsistent profile data, we can at least eliminate the layout strategy as the culprit.
Profile Maintenance
- Replace `optSetBlockWeights` with the new profile synthesis implementation. The former frequently produces nonsensical weights for loops, as it relies on a lexical traversal of the block list to identify loops. Fixing this may improve `JitOptRepeat` performance.
- If a `BBJ_THROW` block is hot, then order it as such (this particular example is not as perf-sensitive, though).
- The `BBJ_THROW` example in particular was largely handled by JIT: Move profile consistency checks to after morph #111253.

cc @dotnet/jit-contrib, @AndyAyersMS