nibblemapmacros do a linear scan #93550
Comments
I edited the post above to remove a reference to JVM source code. On the .NET project we have to be careful that we don't use GPL source (or have any appearance or question that it might have been used). Sorry, I know it can be an awkward limitation, but we need to ensure potential contributors don't read related GPL source so there is no question about the origin of contributions.
@fabled I am doing some investigation here, nothing too deep though. I know that @davidwrighton was also noodling on this data structure.
@AaronRobinsonMSFT Yep, I've noodled on this a bit, and come up with a model that implements the find algorithm as an O(1) operation. It's also somewhat complicated by a desire to support the lock-free nature of reads of this structure, and iteration, and I'm not convinced it entirely meets all of those needs, but it might be a good choice if we choose to do any work in this space this release.

NOTE: while I've looked at this as a theoretical improvement, I have not looked into whether or not this would provide a performance improvement to details such as stack walking. In my previous explorations of the performance of that system, I identified that the linked list we had for finding the CodeHeap itself was a significant performance problem, but the nibble table walks never showed up in any of my performance tests. (See #79021 where I introduced the

I would expect that this algorithm would be unlikely to significantly affect the performance of much code. However, I do understand that eBPF filters have strong guarantees of forward progress, implemented by the lack of real support for iteration, which the previous theoretically unbounded iteration algorithm breaks. I think that support for being able to write an eBPF-based stack walker could be quite useful for diagnostics on Linux platforms, and efforts to make that possible seem like a good idea to me. Note that the approach below effectively implements jumps as you suggested, and I believe it could be implemented as a single linear set of if statements and operations.

Proposed new nibble map layout, lookup algorithm, and structure

The intention here is to build a data structure which permits O(1) access once we reach a lookup at the nibble map layer. The dependencies it takes are that:
The data structure must support the following algorithms
There shall be an array of 32-bit unsigned integers. Each 32-bit unsigned value may be an array of 8 4-bit nibbles, or an encoded 32-bit offset from the start of the CodeHeap memory section to the start of the associated method.
The algorithm to parse this data is as follows. Be aware that this algorithm has not actually been tested or implemented, so it is likely nearly correct, but unlikely to be completely correct.
HELPER Methods
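As a rough sketch of how such a lookup could work: the encoding below is illustrative only (the tag bit placement, nibble meaning, and helper name are my assumptions, not the actual proposal). Each 32-bit entry covers a 256-byte region; a reserved top nibble of 0xF marks an entry whose low 28 bits are an encoded method-start offset, and otherwise the entry holds 8 nibbles, one per 32-byte chunk.

```c
#include <stdint.h>

/* Toy encoding (illustrative, not the actual proposal):
 * - top 4 bits == 0xF: low 28 bits are the offset, in 4-byte code
 *   alignment units, of the method spanning this whole 256-byte region;
 * - otherwise: 8 nibbles, one per 32-byte chunk; a nonzero nibble n
 *   (1..8) means a method starts at (n-1)*4 bytes into that chunk. */
#define CODE_ALIGN   4u
#define CHUNK_SIZE   32u
#define REGION_SIZE  256u  /* 8 nibbles * 32 bytes */

#define IS_OFFSET_ENTRY(e) (((e) >> 28) == 0xFu)
#define ENTRY_OFFSET(e)    ((e) & 0x0FFFFFFFu)

/* Returns the offset of the method start covering codeOffset, or
 * UINT32_MAX if this region records none (in the proposed scheme the
 * caller could then read the preceding entry, which would use the
 * offset encoding, keeping the whole lookup O(1)). */
static uint32_t find_method_start(const uint32_t *map, uint32_t codeOffset)
{
    uint32_t region = codeOffset / REGION_SIZE;
    uint32_t entry = map[region];

    if (IS_OFFSET_ENTRY(entry))
        return ENTRY_OFFSET(entry) * CODE_ALIGN;

    /* At most 8 nibbles: a fixed bound, so this unrolls into a linear
     * set of if statements, as eBPF's forward-progress rules require. */
    int chunk = (int)((codeOffset % REGION_SIZE) / CHUNK_SIZE);
    for (int i = chunk; i >= 0; i--) {
        uint32_t nib = (entry >> (4 * i)) & 0xFu;
        if (nib != 0) {
            uint32_t start = region * REGION_SIZE
                           + (uint32_t)i * CHUNK_SIZE
                           + (nib - 1) * CODE_ALIGN;
            if (start <= codeOffset)
                return start;
        }
    }
    return UINT32_MAX;
}
```

The key property is that every path executes a bounded number of operations regardless of method size, unlike the current backward linear scan.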
Do I understand correctly that this is for looking up methods from their addresses? If so, would improvements in this area solve the issue I had in #85197 where invalid pointers would crash the VM when looking them up?
This specific improvement would not help with your issue. The data structure proposed here has the same functional properties as the nibble map; it just has different perf characteristics.
@davidwrighton - I spent a little while dissecting what you have and it seemed good to me. This was my simplified mental model of it which others might find useful. I know it doesn't include all the details, but let me know if anything it does include is inaccurate.
@AaronRobinsonMSFT No, there is only 1 method start in a 32-byte region in the algorithm I presented; in the 256-byte region up to 8 method starts could exist. Now that I've spent a few more moments thinking on this, I suspect we should do some analysis to determine the number of methods which are less than 48 and greater than 24 bytes on our 64-bit platforms, as you could reduce the amount of memory needed for this sort of map so that each 32-bit value represents 64 bytes of executable space. If it turns out that we have very few of those, it would probably be preferable to have the system do all of its alignment in terms of 8 bytes instead of 4 bytes, working with 64-byte and 512-byte regions.
Apologies, you're right. I misinterpreted the 256 and confused that with DWORDs and other units that are defined in the current implementation.
@max-charlamb is going to take a stab at prototyping #93550 (comment) (both reading and the implied writing algorithm) and to measure the perf impact
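As a back-of-the-envelope check on the memory trade-off being discussed (the helper below is hypothetical, but the constants come from this thread): one 32-bit entry per region means 256-byte granularity costs 4/256 ≈ 1.6% of the code heap, and moving to 512-byte regions halves that to ≈ 0.8%.

```c
#include <stdint.h>

/* Map overhead: one 32-bit entry per region of executable memory,
 * rounding the heap size up to a whole number of regions. */
static uint64_t map_bytes(uint64_t heap_bytes, uint64_t region_bytes)
{
    return (heap_bytes + region_bytes - 1) / region_bytes * sizeof(uint32_t);
}
```

So the halved footprint only pays off if few enough methods fall into the 24-to-48-byte range that the coarser 64-byte chunks can still record them.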
Potential alternative design

When tracing is turned on for the process (that is very common in the cloud), we transform the information stored in the nibble map to the OS unwind info format (look for

We may want to consider replacing the nibble map by storing the information in the OS unwind info format only. It would save us from maintaining the duplicate information, and it would reduce the performance hit from turning on tracing that has been reported as a perf problem by some of our customers.

cc @brianrob
@jkotas Is this a "boil the ocean" proposal? Or could we opt into the new scheme gradually, for example starting with Windows or Linux only? One thing that I liked about just replacing the nibble map is that it's a fairly self-contained project that could be prototyped and measured in a limited amount of time. I think this alternate proposal has broader scope (I think we wouldn't need the RangeSectionMap anymore either, right?) and I don't have a good handle on how much additional work we'd be taking on.
Unfortunately this does not seem like a sufficient solution to me. Parsing the SFRAME format also requires linear processing and maintaining a state machine. It is great for debuggers and post-mortem core dumps, etc. But it does not solve the problem of doing realtime unwinding from eBPF due to the timing and instruction constraints.

EDIT: I also think this is not the same thing. The nibble map is used to map RIP to methodMetaData. This needs to be done first, even before we can parse SFRAME. SFRAME is just for processing the unwinding information and cannot serve to do the initial mapping that the nibble map is used for.
...
Wonderful! When doing the timing and performance measurements, obviously there will be a different effect on the "update map" side and the "use" side. And much of it also depends on how large the JITted function is. So measuring this probably requires workloads that are designed to perf this. It's also good to measure other side effects, such as how much cache is thrashed, etc. The profiler we are doing has been made open source and donated to OpenTelemetry. In case you want and can look at Apache2-licensed things, I can drop the links to the relevant portions of our profiler on the part that does dotnet processing.
Yes, that would be fine. Also, I do not have a problem with just improving nibble map. I have mentioned it here since I heard about the run-down performance issues and stacktraces for tracing multiple times from different directions recently.
SFRAME FDE entries can be sorted to allow binary search. We would emit the SFRAME entries sorted. It would not be a linear search.
We can use the SFRAME FDE entry index for a lookup in a parallel array with methodMetaData.
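The lookup being described could look roughly like this (a sketch; the struct layout and names are illustrative, not the actual SFRAME format): binary-search the sorted entries for the last one starting at or before the RIP, and reuse that index into a parallel methodMetaData array.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in for a sorted SFRAME-style FDE table; only the
 * field needed for the search is shown. */
typedef struct { uint64_t start; } FdeEntry;

/* Returns the index of the last entry with start <= rip, or -1 when
 * rip precedes every entry. The same index can be used to look up
 * methodMetaData in a parallel array. */
static ptrdiff_t find_fde_index(const FdeEntry *fdes, size_t count, uint64_t rip)
{
    size_t lo = 0, hi = count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (fdes[mid].start <= rip)
            lo = mid + 1;   /* candidate found; look for a later one */
        else
            hi = mid;       /* entry starts past rip; search earlier */
    }
    return (ptrdiff_t)lo - 1;
}
```

Unlike the nibble map walk, this is O(log n) in the number of methods and, with a fixed iteration cap (64 halvings cover any 64-bit table), is expressible under eBPF's bounded-loop constraints.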
@jkotas, I'm pretty sure we have race conditions with the existing scheme we use on Windows that dynamically reports unwind data. This is ... ok for ETW, where lost stack frames are a bit of an acceptable thing, but we also use this data in a completely lock-free manner to handle exception unwinds, GC collections, etc., which I don't think we can handle correctly with the current OS-interacting path. (I could be wrong on this though; I'm saying this from some vague memories.) Also, having a completely lock-free path here is important for our work towards having EH throw/catch reach peak performance, so I want to tread lightly there. OTOH, the current mess of how much it costs to have profiler stacks enabled on Windows is awful and should be investigated, that's for sure. My recollection was that much of the cost/race conditions and such with the current Windows approach involves handling cases where methods are deleted.
I agree that publishing the unwind data to Windows has a by-design race condition. I am not suggesting that we replace our lock-free range map. That would make things slow and expose us to this race condition. If we were to switch to unwinding tables in the Windows OS format as the source of truth, we should be able to keep our lock-free range map and keep our lock-free lookups. Replacing the lock-free nibble map search with a lock-free binary search should be the only change on the lookup path.

We used nibble maps in NGen images originally. IIRC, deleting the nibble maps from NGen images and switching to regular unwind info as the source of truth was perf-neutral on average for throughput and came with a binary size reduction. The perf issues that I have heard about recently were about the tracing startup/rundown costs. Conversion of the nibble maps to OS unwind infos, which happens when tracing is enabled for the first time, showed up on the radar as part of the tracing startup cost.
I am currently researching the dotnet stack unwinding, and came across the nibblemap at https://github.com/dotnet/runtime/blob/main/src/coreclr/inc/nibblemapmacros.h.
It seems that the code is optimized on the assumption that JIT-generated code consists of very small code blobs which are potentially unaligned. In contrast, it performs poorly when mapping a RIP near the end of a large code blob, because a linear scan of the nibble map is required to find the function beginning. Such linear scanning is done at e.g. https://github.com/dotnet/runtime/blob/main/src/coreclr/vm/codeman.cpp#L4051-L4056 (but also at other places).
I am also currently looking into reimplementing all of this in eBPF code (on Linux). This puts some constraints on how many iterations of the nibble map scanning can be done. While some of these can be worked around, I think some other data structure could serve better here.
I believe optimizing this map would also speed up unwinding in the dotnet core runtime itself. Do you have plans to improve this map, or would you welcome contributed work on this?
The existing implementation could be improved just by adding a special encoding to indicate a jump.
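One way such a jump encoding could work (a toy encoding for illustration, not the real nibblemapmacros.h layout): reserve a tag value in a map DWORD that, instead of nibbles, stores how far back the scan may leap in one step, so the iteration count is bounded by the number of recorded entries rather than by the size of the code blob being crossed.

```c
#include <stdint.h>

/* Toy encoding: a DWORD whose top nibble is the reserved value 0xF is
 * a "jump" entry whose low 28 bits say how many DWORDs to skip back in
 * one step (assumed nonzero); any other nonzero DWORD contains a
 * method-start nibble; zero is empty. Real start nibbles here would be
 * 1..8, so 0xF is free to use as a tag. */
#define JUMP_TAG(e) (((e) >> 28) == 0xFu)
#define JUMP_LEN(e) ((e) & 0x0FFFFFFFu)

/* Scan backwards from `dword` for the nearest DWORD recording a method
 * start, following jump entries; *steps counts iterations to show the
 * bound. Returns -1 if nothing is found. */
static int64_t scan_with_jumps(const uint32_t *map, int64_t dword, int *steps)
{
    while (dword >= 0) {
        ++*steps;
        uint32_t e = map[dword];
        if (JUMP_TAG(e)) {
            dword -= (int64_t)JUMP_LEN(e);  /* leap over empty space */
            continue;
        }
        if (e != 0)
            return dword;  /* DWORD containing a start nibble */
        --dword;           /* plain one-DWORD step */
    }
    return -1;
}
```

The writer would have to maintain the jump entries as methods are allocated and freed, which is where the lock-free concerns discussed in this thread come in.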
I would like to discuss potential algorithmic / data structures improvements, and potentially contribute work towards making it happen if a mutually acceptable approach can be reached.