-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[NativeAOT/ARM64] Generate frames compatible with Apple compact unwinding #107766
Conversation
The dominant part are the JIT changes, hence area-CodeGen-coreclr is the right label. cc @dotnet/ilc-contrib @VSadov @ivanpovazan |
Tagging subscribers to 'os-ios': @vitek-karas, @kotlarmilos, @ivanpovazan, @steveisok, @akoeplinger |
Tagging subscribers to 'os-tvos': @vitek-karas, @kotlarmilos, @ivanpovazan, @steveisok, @akoeplinger |
@jkotas would could help to move this PR forward? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jkotas would could help to move this PR forward?
The change looks good to me. Somebody from the codegen team needs to sign-off on the JIT changes.
/azp run runtime-nativeaot-outerloop |
Azure Pipelines successfully started running 1 pipeline(s). |
@EgorBo, please do code review for Filip's compact unwinding PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went over it and they sounds reasonable. @BruceForstall - you might want to double check the changes in lclvars.cpp
. I don't understand those much.
src/coreclr/jit/compiler.hpp
Outdated
@@ -2807,6 +2807,16 @@ inline | |||
{ | |||
*pBaseReg = REG_SPBASE; | |||
} | |||
#elif defined(TARGET_ARM64) | |||
if (FPbased && !codeGen->isFramePointerRequired() && varOffset < 0 && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be under compiler->IsTargetAbi(CORINFO_NATIVEAOT_ABI)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and other changes in genRestoreCalleeSavedRegisterGroup
/ genSaveCalleeSavedRegisterGroup
of codegenarm64.cpp
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The optimization is valid even for non-NativeAOT scenarios. The reason I explicitly opted not to guard it is that this is practically only reachable in stress tests that enable SaveFpLrWithAllCalleeSavedRegisters
on CoreCLR and the general guidance was that we rely on CoreCLR stress testing coverage for JIT.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes in gen[Restore/Save]CalleeSavedRegisterGroup
are guarded by genReverseAndPairCalleeSavedRegisters
which is only set for NativeAOT/Apple. Additional guards would only contribute to the throughput cost. The change where LR/FP is separated into its own condition is no-op for other ARM64 scenarios, it just moves one iteration of the loop into its own condition. We don't want to reverse the order for them so this was the easiest way to achieve the desired behavior without too many if
s.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that sounds good to me
@@ -845,12 +845,19 @@ void CodeGen::genSaveCalleeSavedRegisterGroup(regMaskTP regsMask, int spDelta, i | |||
|
|||
for (int i = 0; i < regStack.Height(); ++i) | |||
{ | |||
RegPair regPair = regStack.Bottom(i); | |||
RegPair regPair = genReverseAndPairCalleeSavedRegisters ? regStack.Top(i) : regStack.Bottom(i); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure how an #ifdef
could be used here. There's a single universal build of the ARM64 Unix JIT and it's the same for Crossgen2 and ILC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant compiler->IsTargetAbi(CORINFO_NATIVEAOT_ABI)
check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd expect that to make things worse than checking a single [more local] boolean. If this is a concern we can try to manually duplicate the loop and move the check outside of the loop.
@filipnavara I'm investigating how much trouble it would be to do a fast stack walking on macOS arm64 for profiling purpose, and I have a question. The Writing ARM64 code for Apple platforms mandates that:
So, while trying to reconnect the dots with the various issues related to this PR, I'm unclear what kind of frames the JIT today generates for Apple ARM64? Is it following the fp recommendation above? |
TL;DR: Yes, it is following the recommendation AFAICT. It doesn't follow the relative stack layout of FP/LR in relation to locals and callee saved registers. Unless you need to do precise unwinding you should be fine. The long description is https://github.com/dotnet/runtime/blob/main/docs/design/coreclr/jit/arm64-jit-frame-layout.md (not all frame types are used, eg. frame type 9). |
@EgorBo, and @kunalspathak, this PR is waiting for JIT sign off. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please merge with main? I want to run some jitstress tests because compJitSaveFpLrWithCalleeSavedRegisters
was updated.
else | ||
{ | ||
// Default configuration | ||
codeGen->SetSaveFpLrWithAllCalleeSavedRegisters((getNeedsGSSecurityCookie() && compLocallocUsed) || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this be (getNeedsGSSecurityCookie() || compLocallocUsed)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's just a moved code from here with unchanged condition. I think it was intentional.
// Decide where to save FP and LR registers. We store FP/LR registers at the bottom of the frame if there is
// a frame pointer used (so we get positive offsets from the frame pointer to access locals), but not if we
// need a GS cookie AND localloc is used, since we need the GS cookie to protect the saved return value,
// and also the saved frame pointer. See CodeGen::genPushCalleeSavedRegisters() for more details about the
// frame types. Since saving FP/LR at high addresses is a relatively rare case, force using it during stress.
// (It should be legal to use these frame types for every frame).
unwinding format. For NativeAOT/ARM64/Apple API do the following: - Save callee registers in opposite order and in pairs. - Prefer saving FP/LR on the top of the frame. Heuristics are used to avoid worse code quality outside of prolog/epilog due to addressing range limits of the ARM64 instruction set. - Added optimization to lvaFrameAddress to rewrite FP-x references to SP+y when possible. This allows efficient addressing using positive indexes when FP points to the top of the frame. It mimics similar optimization on ARM32.
I re-run the tests locally and they no longer fail. Can you retrigger the test runs please? |
/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress |
Azure Pipelines successfully started running 2 pipeline(s). |
@filipnavara - the original failures no longer repros and some tests are failing which is most likely #83659, but can you please double check if jitstress is kicking in for it? |
So they tests are failing with the new jitstress switch? |
No, they are NOT failing, and they stress the new code path. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
* main: JIT: Model GT_RETURN kills with contained operand (dotnet#111230) Update dependencies from https://github.com/dotnet/runtime-assets build 20250110.2 (dotnet#111290) [NativeAOT/ARM64] Generate frames compatible with Apple compact unwinding (dotnet#107766) Cleanup unused JIT stubs in vm (dotnet#111237) Ensure that Shuffle is marked as HW_Flag_CanBenefitFromConstantProp (dotnet#111303) Fix CMP0173 policy warning with cmake 3.31 (dotnet#110522) [RISC-V] Fix HostActivation.Tests unknown-rid (dotnet#110687) Fix accidentally duplicated global-build-step.yml in runtime-official.yml (dotnet#111302) JIT: run extra SPMI queries for arrays (dotnet#111293) Split the Runtime Shared Framework project and combine legs in the official build (dotnet#111136) Do not ignore `MemoryMarshal.TryWrite` result (dotnet#108661) Update dependencies from https://github.com/dotnet/emsdk build 20250109.1 (dotnet#111263) Clean up in Number.Formatting.cs (dotnet#110955)
@filipnavara Looks like this PR introduced GC hole like asserts in gcstress pipeline (linux-arm64), I've just confirmed it. The simplest way to reproduce is to take BufferMemmove test and run with
(many times) cc @dotnet/jit-contrib |
…ct unwinding (dotnet#107766)" This reverts commit 4536b96.
…ding (dotnet#107766) * JIT/ARM64: Add ability to generate frames compatible with Apple compact unwinding format. For NativeAOT/ARM64/Apple API do the following: - Save callee registers in opposite order and in pairs. - Prefer saving FP/LR on the top of the frame. Heuristics are used to avoid worse code quality outside of prolog/epilog due to addressing range limits of the ARM64 instruction set. - Added optimization to lvaFrameAddress to rewrite FP-x references to SP+y when possible. This allows efficient addressing using positive indexes when FP points to the top of the frame. It mimics similar optimization on ARM32. * ObjWriter: For Mach-O ARM64 try to convert the DWARF CFI unwinding codes into compact unwinding code * Disable lvaFrameAddress FP->SP optimization for OSR methods
…ding (#111451) * [NativeAOT/ARM64] Generate frames compatible with Apple compact unwinding (#107766) * JIT/ARM64: Add ability to generate frames compatible with Apple compact unwinding format. For NativeAOT/ARM64/Apple API do the following: - Save callee registers in opposite order and in pairs. - Prefer saving FP/LR on the top of the frame. Heuristics are used to avoid worse code quality outside of prolog/epilog due to addressing range limits of the ARM64 instruction set. - Added optimization to lvaFrameAddress to rewrite FP-x references to SP+y when possible. This allows efficient addressing using positive indexes when FP points to the top of the frame. It mimics similar optimization on ARM32. * ObjWriter: For Mach-O ARM64 try to convert the DWARF CFI unwinding codes into compact unwinding code * Disable lvaFrameAddress FP->SP optimization for OSR methods * Fix variable offsets used in emitGCvarLiveUpd by suppressing the FP-n => SP+m (for m = frameSize - n) optimization. --------- Co-authored-by: Kunal Pathak <[email protected]>
Contributes to #76371
The are two changes in the PR that work in tandem. The first is a JIT change for generating slightly different frame layout on NativeAOT/ARM64/Apple platforms (iOS, macOS and tvOS). The second change is ObjWriter code that recognizes the structurally compatible unwinding information and generates compact unwinding codes in the object files instead of verbose DWARF unwinding information.
For NativeAOT/ARM64/Apple ABI do the following:
lvaFrameAddress
to rewrite FP-x references to SP+y when possible. This allows efficient addressing using positive indexes when FP points to the top of the frame. It mimics similar optimization on ARM32.Each of these changes comes with some caveats:
lvaFrameAddress
. The additional increase in prolog size also causes a cascading effect where loop alignment and related code alignment (32-byte alignment for method start) significantly contribute to the code section size. In some cases the same alignment logic may also reduce the size.As you can see, this is a bit of a trade-off and it's valid to ask if it's worth it.
Firstly, let me address the size question. The loop alignment seems to be the biggest contributor to the code size variation, and I filed issue #107284 to investigate whether we can come up with a better defaults for Apple platforms. The code size changes are predominantly contained to the small fixed change in the prolog size and the additional alignment. When testing the prototype on MAUI apps the biggest culprit to increased code size was the code generated from XAML that generates methods with ton of local variables. The change in
lvaFrameAddress
practically eliminated any negative effect of the frame layout on this type of code. Without the change the regression was nearly 50% due to stack references needing an indirect load with a register and extra instruction. Outside of MAUI the biggest visible effect is on code from Regex source generator but in that case it's quite evenly split between size improvements and regressions. That suggests the regex code may be a good candidate for measuring the performance characteristics of the loop alignment. The code size regressions on few examples I tried amount to around 2% (incl. the alignment). The saved space in the DWARF unwinding section is hovering around 90% +- 3%. To put that into absolute numbers we are looking at savings around 0.7 Mb fordotnet new maui
app and around 3 Mb forSystem.Runtime.Tests
in the linked executables. The savings in the size of the unwinding information far outweigh any increase in the code size.Secondly, part of the motivation is that the Apple linker is notoriously buggy with processing the DWARF unwinding data. The compact unwinding tables are used as an index to the DWARF data for anything that cannot be expressed using compact unwinding code directly. Due to the structure of the tables this limits the effective offset of DWARF info to 24 bits and places a hard limit on the DWARF unwinding info size. At least with some versions of the Apple linker, breaking this limit results in silent corruption and runtime failures.
Comparison between `main` and PR for System.Runtime.Tests in Release build
Compare raw size of linked binaries:
Bloaty check of linked binaries:
Bloaty check of object files:
Bloaty check of object files (detailed):