-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JIT: Make isRemovableJmpCandidate less restrictive on AMD64 #95493
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch Issue DetailsOn AMD64, the emitter-level jump-to-next removal optimization is skipped if succeeding a call instruction, as we need an instruction after a call to support exception handling semantics. We already compute a more specific condition for this in the block-level optimization, and reusing that condition should allow us to do the emitter-level optimization in a few more places. The size improvements were small locally, but I think this should be done if the conditions are identical between the two optimizations.
|
bcf5ed1
to
1b45b83
Compare
@dotnet/jit-contrib PTAL. Diffs are small as expected. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change leads to removing a jump-to-next after a call
that immediately precedes an epilog. This is not allowed, by the comments in this function (line 608) and the documentation in clr-abi.md: https://github.com/dotnet/runtime/blob/main/docs/design/coreclr/botr/clr-abi.md#amd64-padding-info
Note that it would be ok to convert these JMP to NOP.
(btw, if you're changing this file, can you change the reference to "X64 and ARM ABIs.docx" to be "clr-abi.md"?)
@BruceForstall thank you for clarifying that; I'll update this change to insert a nop when needed.
Sure thing. |
I've adjusted my implementation to always remove a useless jump succeeding a call instruction on AMD64 (assuming all the other optimization checks are met) and emit a nop instead. I added a new member, Size improvements were bigger locally. I noticed the newly-introduced |
@BruceForstall PTAL. SuperPMI replay failure is the "Method is of low integrity" issue. Thank you! Edit: Diffs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see a few problems:
- The NOP is not showing up in disassembly. This is a problem, because it is very confusing when looking at asm diffs or debugging code and comparing against JIT dump/disasm. We accept it for INS_align, but at least that displays something.
- It appears that you are replacing all CALL + JMP-to-next with CALL+NOP? That's not required -- only when a CALL precedes an OS epilog.
- Several places you use the constant "1" as a code size, because that is the size of an x64 NOP instruction. That needs to be well commented (if that code remains).
- We already have
emitOutputPreEpilogNOP()
. I guess this doesn't work because it happens when we are generating code and go to generate an epilog. In the problematic case, the last instruction is a JMP, we callemitOutputPreEpilogNOP()
, it doesn't see the last instruction as a CALL, and so it doesn't insert a NOP. Later on, theemitter::emitRemoveJumpToNextInst()
phase comes along and removes theJMP
. At that point, it's "too late" to insert the necessary NOP.
We might want to insert a provisional NOP after a CALL before a removable JMP. If a JMP is removed, and the next IG is an epilog, the NOP becomes non-provisional and should have actual size. Unfortunately, that's not great, because we'd also need to "zero out" these NOPs (set code size to zero), and there is no obvious place to do that.
Another (ugly?) alternative is to "bash" the JMP instrDesc with a NOP instrDesc when necessary. Then the NOP would be properly output. This might work if it's the very last instruction in an IG but is pretty gross. This is another implication of the fact we can't add/remove instrDescs once generated.
@BruceForstall thank you for the review! I adjusted my implementation to only emit the I think this implementation should cover the problematic case you mentioned with Edit: TP diffs are surprisingly big. This might have to do with the optimization being able to consider more jumps, but the previous run of |
I've determined the TP diffs are from the new logic for printing removed jumps as nops in disasms. I've tweaked this logic so that we only incur the TP hit if we're printing disasms, which doesn't seem to be a performance-critical path, anyway. |
TP diffs are better, but still a bit high in MinOpts. @BruceForstall do we have a threshold for how much TP this change is worth? I have a few ideas for improving TP a bit more, but the code is slightly less intuitive; I'll try it locally and see if it's worth a revision. Edit: Scratch that, I found reverting my changes to |
I'm surprised TP worsened on Linux x64. I'll investigate in WSL. |
@BruceForstall sorry for all the pings -- TP diffs now peak at +0.1% on Linux x64. I found casting an |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Just a few minor suggestions. The TP change is ok with me.
src/coreclr/jit/emit.h
Outdated
// a candidate for being removed if it jumps to the next instruction | ||
unsigned idjIsRemovableJmpCandidate : 1; | ||
// Indicates the jump succeeds a call instruction and precedes an OS epilog. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use "follows" instead of "succeeds"?
src/coreclr/jit/emit.h
Outdated
#else | ||
30; | ||
#endif | ||
#endif // !defined(TARGET_XARCH) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, this comment is not actually helpful.
src/coreclr/jit/emit.h
Outdated
// Indicates the jump succeeds a call instruction and precedes an OS epilog. | ||
// If this jump is removed, a nop will need to be emitted instead (see clr-abi.md for details). | ||
unsigned idjIsAfterCallBeforeEpilog : 1; | ||
#elif defined(TARGET_XARCH) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You split off amd64, so the only "xarch" left is x86
#elif defined(TARGET_XARCH) | |
#elif defined(TARGET_X86) |
#if defined(TARGET_AMD64) | ||
28; | ||
// Indicates the jump was added at the end of a BBJ_ALWAYS basic block and is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fwiw, all this bit fiddling is silly on 64-bit, where we're going to have 32 bits of padding after the unsigned
fields, in which case we should just move all the bits to the end and make them bool :1
fields. Maybe not for 32-bit, though.
But that shouldn't be done in this PR :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I'll come back to this in a follow-up PR.
Feedback changes are just comments and an ifdef. Failures from previous CI run were known NativeAOT failures. |
On AMD64, the emitter-level jump-to-next removal optimization is skipped if succeeding a call instruction, as we need an instruction after a call to support exception handling semantics. We already compute a more specific condition for this in the block-level optimization, and reusing that condition should allow us to do the emitter-level optimization in a few more places. The size improvements were small locally, but I think this should be done if the conditions are identical between the two optimizations.