-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JIT: implement tail merging #77103
JIT: implement tail merging #77103
Conversation
Add a phase that looks for common tail statements in a block's predecessors and merges them. Run it both before and after morph. Closes dotnet#8795. Closes dotnet#76872.
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch |
cc @dotnet/jit-contrib Depends critically on Decent code size improvements. Improvement:Regression byte ratio is 20:1 or better. Haven't looked into regressions that deeply but the few I looked at were LSRA related. Local TP runs show no cost or even a net improvement. We might be able to up the merge limit a bit more. There are still size wins to be had in ASP.NET (a no-limit x64 windows gets -127K size improvement) -- evidently a lot of async state machines have big switches with lots of common tails in the switch cases. Unlimited merging has a ~0.4% TP cost in libraries tests. |
Looks like I need to update the block flags.
|
Nice! I assume if blockA and blockB semantically the same but blockB has e.g. COMMA(NOP, tree) or even a NOP-statements - it won't be taken into account? |
Right, it is doing very literal matching right now (modulo allowing swaps). Would not be hard to make it a bit smarter and allow a bigger range of things to match. I am more worried about the cases where |
Skip past GT_NOP, no point considering those for merging. Fix logic error when finding cross jump victim -- need to assess the first block in the loop.
NAOT failure seems like it could be #76801. |
Wow, nice diffs 🙂 |
Perhaps worth running some outerloop? |
Seems like this is in reasonably good shape.
Sure. |
/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress, Fuzzlyn |
Azure Pipelines successfully started running 3 pipeline(s). |
src/coreclr/jit/fgopt.cpp
Outdated
if (GenTree::Compare(baseStmt->GetRootNode(), otherStmt->GetRootNode(), true)) | ||
{ | ||
matchedPredInfo.Push(predInfo.TopRef(j)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GenTree::Compare
does not pay attention to GTF_IND_VOLATILE
, so presumably this can do something like below, which does not seem right.
[ p1 ] [ p1 ]
[ind<volatile>(addr)] |
| |
| [ p2 ] --> | [ p2 ]
| [ind(addr)] | |
| | [ ind(addr) ]
[ block ] [ block ]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
Was actually expecting to quickly hit more bugs in GenTree::Compare
, but so far haven't run across any.
@@ -99,7 +99,7 @@ | |||
// as this is used for the managed-ret-val feature, but the debugger filters out these mappings and does not | |||
// report them in the ETW event. We should probably change this, those mappings should be useful in any case. | |||
property int32[] Debug = int32[10]( 0x0 0x6 0xe 0x12 0x1a 0x1c 0x24 0x28 0x2c 0x34 ) | |||
property int32[] Opts = int32[5]( 0x0 0x6 0x12 0x1c 0x2c ) | |||
property int32[] Opts = int32[4]( 0x0 0x6 0x12 0x1c ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this break stepping in the debuggers, or are they able to map the new IP into the shared tail? What about profilers using sampling?
We should think about if we need to add some new debug information for tooling to have a chance to handle this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes we can lose debug info here -- note we can also merge calls which may create confusing looking stack traces.
But I don't think we can express this sort of many to one mapping. So not sure what to do about it.
Nice! We should probably audit Is this purely a size-decreasing optimization, or do we also expect it to benefit performance? What is the impact on TP in the contexts where no merging is done? |
Perf is tricky to assess. There are some knock-on optimizations this can enable, but absent those, tail merging can actually hurt perf because it might increase the relative density/frequency of taken branches.
Not sure. Any suggestions on how I could measure that? |
Looks like libraries jitstress and fuzzlyn are hitting an assert:
|
I guess the easy way would be to not actually do the transformation after finding an opportunity. The harder, maybe more accurate way, would be to add an assert that triggers in the contexts without the optimization, which should allow you to produce a |
@BruceForstall any idea what this assert is guarding against? runtime/src/coreclr/jit/optimizer.cpp Lines 1871 to 1872 in 4574ccb
Offhand I don't see the problem if head branches to top, and that's what we have here:
This won't become a loop anyways. Previously BB04 and BB07 would branch to BB12. Now they cross-jump to BB05. |
I ended up handing the GT_NOP case (at top level anyways). Didn't make a lot of difference. Probably I should remove those statements in the pre-screening, but then tail merge phase might have diffs even if it did not do any merging. |
Add indir flag checking to `GenTree::Compare`.
/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress, Fuzzlyn |
Azure Pipelines successfully started running 3 pipeline(s). |
A bit tricky since it runs twice;, but I suppose I can try this. |
Jitstress should now be clean (ish...) may retry in a bit. |
Adding the ability to disable even in release. |
Here is some contextual TP data for the ASP.NET collection:
My hunch is that the methods that are tail merge candidates tend to be more complex so despite this kicking in for only 10% or so of methods it is still a net TP improvement. There is also a subset ~55K min opts methods in here which is not explicitly accounted for. Also note that in about 10% of the cases where we tail merge we were able to get the same codegen w/o tail merge. |
That 0.26% we pay for not optimizing seems a bit high, let me see if there's some way to trim it down a bit... |
Did a couple of perf tweaks. I had thought |
TP impact is down a little bit more (did not remeasure the splits above). Diffs similar to before. @dotnet/jit-contrib ping |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Looking forward to seeing it merged, want to experiment further :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some nice, elegant code.
if ((op1->gtFlags & (GTF_IND_FLAGS)) != (op2->gtFlags & (GTF_IND_FLAGS))) | ||
{ | ||
return false; | ||
} | ||
FALLTHROUGH; | ||
|
||
case GT_IND: | ||
case GT_NULLCHECK: | ||
if ((op1->gtFlags & (GTF_IND_FLAGS)) != (op2->gtFlags & (GTF_IND_FLAGS))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code is odd. GT_BLK/GT_OBJ/GT_IND/GT_NULLCHECK are all equivalent; why not share the code? Why does GT_BLK/GT_OBJ have "FALLTHROUGH" to the same code it just executed? also, (GTF_IND_FLAGS)
doesn't need to be parenthesized.
Possibly also: dotnet/perf-autofiling-issues#9468 |
Add a phase that looks for common tail statements in a block's predecessors and merges them.
Run it both before and after morph.
Closes #8795.
Closes #76872.