-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix a bug in frame pointer chaining codegen for frames with localloc. #4651
Comments
Just to clarify. The issue happens only on frames that have localloc or very largs local var count. It works for the rest. It potentially affects debuggers and profilers that use frame chaining to unwind the stack. In the VM for GC and exception handling we use the PData/XData stype unwind info for jitted frames and that works correctly. |
@tzwlai - Can you please confirm if this affects debugger on Mac/Linux? |
I have talked to Mike on Friday. This will affect the non SOS debugger on Linux/Mac. The SOS debugger (as the GC unwinding and EH) is using the PDatas/XData unwind info for managed frames and that behaves correctly. The GDB/LLDB bt command is affected by this since it doesn't use the PData/XData (Windows) unwind debug info. @tzwlai please correct me if I am wrong. |
FYI, this will affect performance tracing on Linux and Mac. |
@brianrob: is this because performance tracing on Linux/Mac x64 uses RBP chains exclusively (and not the Windows AMD64 unwind codes that we still emit) for stack traces? |
@BruceForstall, that's correct. Linux does not know anything about Windows AMD64 unwind info. |
@brianrob if it would be possible to plug in our own stack walker, then we might be able to use the copy of the windows unwinder in CoreCLR that we have there to be able to unwind stack for exception handling and GC purposes. |
@janvorli, unfortunately, there is no extension point in perf for this. There actually aren't really any extension points to my knowledge. It's something that we could consider building, but we'd likely need to engage the maintainers early - I'm not sure if there are others that would want this as well. |
This is a discussion of the problem and proposal for a fix. The problem is that the JIT for non-Windows, x64, in functions with localloc, uses the RBP register as a frame pointer, but does not point RBP and the saved version of the caller's RBP. Thus, the "RBP frame chain" is broken, which confuses non-Windows x64 tools such as LLDB/GDB which depend on accurate RBP chains. In functions without localloc, RBP is either unused, or it points at the caller's saved RBP. If it is unused, that function simply doesn't appear in the frame chain, but the frame chain isn't "corrupted". On Windows x64, the RBP register can be freely used as an extra register (this is controlled by the ETW_EBP_FRAMED define, which is currently set to zero, thus enabling the free use of RBP). Windows x64 itself does not depend on frame chaining. Instead, it uses the Windows x64 defined unwind codes for implementing stack walking/unwinding. These are defined in the Windows x64 software conventions, here: https://msdn.microsoft.com/en-us/library/7kcdt6fy.aspx. Many non-Windows x64 .NET features also use these unwind codes for stack walks, such as the SOS debugger extension, and the .NET runtime itself. One goal of a fix is to support both the Windows x64 unwind codes as well as accurate, uncorrupted RBP chains, at the same time, in functions with localloc. The fix would ideally apply to all x64 platforms, to avoid the need for per-platform special-case code, and allow us to leverage our extensive existing Windows x64 testing assets. There are some important restrictions to consider with the Windows x64 unwind codes: Some implications of these restrictions: Note that for localloc functions with small fixed-size frames (about 240 bytes or less), we can establish the frame pointer just as we do today, after the full fixed-size stack allocation. We can easily do this for all platforms. Otherwise, for localloc functions with large fixed-size frames, the frame will be allocated as follows:
This introduces an extra SUB RSP, but more than that, it complicates the prolog logic for both main function and funclet prologs by moving various things around. An option to this design is to establish RBP chains using the tradition "push RBP; mov RBP, RSP" prolog, but don't report that in the Windows x64 unwind codes, or use that for local variable addressing. Instead, reserve another callee-saved register in this situation, e.g. R15, and use that where we previously used RBP. Report R15 as the frame pointer in the Windows x64 unwind codes. This solution has the disadvantage of "burning" a register. On advantage is that it doesn't require changing the layout of the frame. Implementation-wise, it isn't clear which will require more pervasive changes. Note that in funclets, RBP points where the RBP of the parent pointed. That means that during RBP-based stackwalks, the funclet frames, as well as any intermediary frames such as VM exception-handling frames, will not be visible. This is determined to be acceptable. |
Thanks for writing this Bruce! It is very useful.
|
I believe you're suggesting "burning" a register in the funclets only (but not the main function) to hold the "frame pointer", instead of using RBP to access locals. It needs to be a callee-saved register (if the funclet has a call); it's easiest just to designate one in advance. It might be hard to plumb through the JIT which "frame pointer register" is being used at a particular point. |
Yes. This is exactly what I meant above. Burning a register in the funclet might not be as bad for CQ since funclets are invoked for exception handling. The finally case might be a bit questionable, but I think it should be OK too. |
Agree with @LLITCHEV --the analysis is useful. I'm not sure burning a register is a big deal if it's only for localloc frames with uncommonly large fixed sizes, but if the complexity is the same, might as well implement the one that has better CQ, which I assume is the first ("two SUB") solution. Unless you want to investigate more to determine which will require more pervasive changes, I would go for the first. If you don't mind pointing out the affected code in question to save newbies like me time, that would be great but isn't required. |
There's another restriction to consider: GC encoding. For non-x86 platforms, a local variable can be reported to the GC as an offset relative to SP, Caller-SP, or a frame register. The frame register can be specified as any non-volatile register; it doesn't need to be RBP. In the localloc case here, if we established RBP in the 2-step fashion, we could continue referencing locals with RBP and reporting them that way to the GC. If we burned an extra register to be used as a frame pointer for the purpose of locals access, we could report that to the GC. Or, we could choose to report locals as CallerSP-relative. For x86 GC encoding, however, EBP is the only allowed frame pointer. |
Code quality issue aside about burning a register (real impact of which I've rarely seen in .NET code I write which is in a high-performance situation) there is something to be said about the difference this introduces in code-generation for Windows and non-Windows. I write tools that work off EBP walking and it's really annoying that only on Windows x64 do I not have access to my tools, even for source code I compile because all Microsoft compilers choose to thrash it. And an even stronger point, some of us would love to use CoreCLR x64 on Windows 7 (one of your supported platforms), where XPERF (the recommended Microsoft performance tool) does not work because you break the EBP chain. Now I have a question, can the generation of code that does not trash the EBP be a command line option for the JIT? This is particularly useful to us performance analysts because when we need to diagnose something we can at least run it using this option and turn it back off it code quality is impacted. Using a 32-bit process is not going to cut it because you can imagine they maybe using 64-bit native dlls. |
Yet another constraint: any JIT-generated code P/Invoke code that generates an InlineCallFrame on the function stack must use RBP as the frame pointer (this is x64 only). This is baked in to the VM, and it uses the saved RBP and RSP to seed further stack unwinding. It does not have the capability to use a different frame pointer register. |
@CppStars: I agree that it can be convenient to always have RBP chains sometimes. For Windows x64, though, that seems unlikely to ever happen (obviously, native code compilers won't do this). I think the problem you've seen on Windows 7 with XPERF is not because of breaking the RBP chain, it's because prior to Windows 8, the OS didn't have a way to get the unwind info generated by the JIT, so a stack would get lost at the native/managed boundary. |
General update: I went down the path of implementing the 2-SUB solution detailed above, and that proved to be very complex and invasive. I then decided to go with the "burn a register" solution. This was much cleaner, and there was lots of existing code for doing this, such as for ARM. However, I hit a dead end when I discovered the requirement mentioned above regarding InlineCallFrame. So, I'm back to investigating the original solution once again. |
Yet another potential way to fix this: On Windows, the unwind info encoding is set in stone - because of it follows Windows ABI. We cannot extend it. It is not the case on Unix. We do have our own local copy of the Windows unwind info decoder for Unix, and thus can easily add a new unwind code to get around the restriction that is getting into way. So a potential fix may look like:
This should be pretty simple fix. It is Unix-specific codepath, but it is very simple one - it should not have a material impact on how different is the Windows and Unix codegen. |
I love that idea. It should be simple to implement. One minor detail is that the UWOP_SET_FPREG code is weird in that frame pointer offset and register gets set in the UNWIND_INFO header instead of the unwind code, to provide a fast path at the beginning of VirtualUnwind() to not require parsing the unwind codes in case of unwinding from within the main body. Maybe for Linux we either always parse the unwind codes or find someplace else to store this. Presumably we can't change the UNWIND_INFO header type |
The JIT will now always create RBP chains on Unix platforms. To do this, the VM is extended with a new Unix-only AMD64 unwind code: UWOP_SET_FPREG_LARGE. The existing unwind code which is used to establish a frame pointer, UWOP_SET_FPREG, requires that the frame pointer, when established, be no more than 240 bytes offset from the stack pointer. This doesn't work well for frames that use localloc. (Large frames are ok, because we don't report the frame pointer in the unwind info except for in functions with localloc or EnC.) The new code has a 32-bit range, scaled by 16. If used, UNWIND_INFO.FrameRegister must be set to the frame pointer register, and UNWIND_INFO.FrameOffset must be set to 15 (its maximum value). This code is followed by two UNWIND_CODEs that are combined to form its 32-bit offset. This offset is then scaled by 16. This result is used as the FP register offset from SP at the time the frame pointer is established.
The JIT will now always create RBP chains on Unix platforms. This includes in functions the use localloc. To do this, the VM is extended with a new Unix-only AMD64 unwind code: UWOP_SET_FPREG_LARGE. The existing unwind code which is used to establish a frame pointer, UWOP_SET_FPREG, requires that the frame pointer, when established, be no more than 240 bytes offset from the stack pointer. This doesn't work well for frames that use localloc. (Large frames without localloc are ok, because we don't report the frame pointer in the unwind info except for in functions with localloc or EnC.) The new unwind code has a 32-bit range. If used, UNWIND_INFO.FrameRegister must be set to the frame pointer register, and UNWIND_INFO.FrameOffset must be set to 15 (its maximum value). This code is followed by two UNWIND_CODEs that are combined to form its 32-bit offset. The high 4 bits must be zero. This offset is then scaled by 16. This result is used as the FP register offset from SP at the time the frame pointer is established.
The JIT will now always create RBP chains on Unix platforms. This includes in functions the use localloc. To do this, the VM is extended with a new Unix-only AMD64 unwind code: UWOP_SET_FPREG_LARGE. The existing unwind code which is used to establish a frame pointer, UWOP_SET_FPREG, requires that the frame pointer, when established, be no more than 240 bytes offset from the stack pointer. This doesn't work well for frames that use localloc. (Large frames without localloc are ok, because we don't report the frame pointer in the unwind info except for in functions with localloc or EnC.) The new unwind code has a 32-bit range. If used, UNWIND_INFO.FrameRegister must be set to the frame pointer register, and UNWIND_INFO.FrameOffset must be set to 15 (its maximum value). This code is followed by two UNWIND_CODEs that are combined to form its 32-bit offset. The high 4 bits must be zero. This offset is then scaled by 16. This result is used as the FP register offset from SP at the time the frame pointer is established.
Fix #1977: always create RBP chains on Unix
Fix a bug in frame pointer chaining codegen (prolog code generation) for frames with localloc for AMD64 and System V AMD64.
The text was updated successfully, but these errors were encountered: