
nibblemapmacros do a linear scan #93550

Closed
fabled opened this issue Oct 16, 2023 · 18 comments · Fixed by #108939
Labels: area-VM-coreclr, in-pr (There is an active PR which will close this issue when it is merged)
Milestone: Future

@fabled

fabled commented Oct 16, 2023

I am currently researching dotnet stack unwinding and came across the nibblemap at https://github.com/dotnet/runtime/blob/main/src/coreclr/inc/nibblemapmacros.h.

It seems that the code is optimized on the assumption that the generated JIT code consists of very small, potentially unaligned code blobs. By contrast, it performs poorly when mapping a RIP from the end of a large code blob, because a linear scan of the nibblemap is required to find the function's beginning. Such linear scanning is done at e.g. https://github.com/dotnet/runtime/blob/main/src/coreclr/vm/codeman.cpp#L4051-L4056 (but also at other places).
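
To illustrate, a minimal sketch of the lookup as I understand it (my own simplification, not the actual nibblemapmacros.h code): each nibble covers a 32-byte bucket and holds 0 ("no method starts here") or 1-8 (the 4-byte-aligned slot where a method starts), and the lookup walks backward one bucket at a time.

uint8_t* FindStartLinear(uint32_t* pMap, uint8_t* pHeapStart, uint8_t* rip)
{
    uint32_t bucket = (uint32_t)(rip - pHeapStart) / 32;
    for (;;)
    {
        uint32_t dword  = pMap[bucket / 8];
        uint32_t nibble = (dword >> ((bucket % 8) * 4)) & 0xF;
        if (nibble != 0)
            return pHeapStart + bucket * 32 + (nibble - 1) * 4;
        if (bucket == 0)
            return NULL; // scanned off the front of the heap
        bucket--; // the linear scan this issue is about: one 32-byte bucket per iteration
    }
}

For a RIP near the end of a 1 MB method, that loop runs tens of thousands of iterations.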

I am also currently looking into reimplementing all of this in eBPF code (on Linux). This puts constraints on how many iterations of nibblemap scanning can be done. While some of these constraints can be worked around, I wonder whether some other data structure could serve better here.

I believe optimizing this map would also speed up unwinding in the dotnet core runtime itself. Do you have plans to improve this map, or would you welcome contributed work on this?

The existing implementation could be improved simply by adding a special encoding to indicate a jump.

I would like to discuss potential algorithmic/data structure improvements, and potentially contribute work towards making it happen if a mutually acceptable approach can be reached.

@noahfalk
Member

I edited the post above to remove a reference to JVM source code. On the .NET project we have to be careful that we don't use GPL source (or have any appearance or question that it might have been used). Sorry, I know it can be an awkward limitation, but we need to ensure potential contributors don't read related GPL source so there is no question about the origin of contributions.

@AaronRobinsonMSFT
Member

@fabled I am doing some investigation here, nothing too deep though. I know that @davidwrighton was also noodling on this data structure.

@davidwrighton
Member

@AaronRobinsonMSFT Yep, I've noodled on this a bit and come up with a model that implements the find algorithm as an O(1) operation... It's also somewhat complicated by a desire to support the lock-free nature of reads of this structure, and iteration, and I'm not convinced it entirely meets all of those needs, but it might be a good choice if we choose to do any work in this space this release.

NOTE: while I've looked at this as a theoretical improvement, I have not looked into whether or not this would provide a performance improvement to details such as stack walking. In my previous explorations of the performance of that system, I identified that the linked list we had for finding the CodeHeap itself was a significant performance problem, but the nibble table walks never showed up in any of my performance tests. (See #79021 where I introduced the RangeSectionMap. The RangeSectionMap also includes a linked list in its algorithm at one point, but the length of that linked list is in practice bounded to a fairly small size which I believe is something like 6.)

I would expect that this algorithm would be unlikely to significantly affect the performance of much code. However, I do understand that eBPF filters have strong guarantees of forward progress, implemented by the lack of real support for iteration, which the previous theoretically unbounded iteration algorithm breaks. I think that making it possible to write an eBPF-based stack walker could be quite useful for diagnostics on Linux platforms, and efforts to make that possible seem like a good idea to me. Note the approach below effectively implements jumps as you suggested, and I believe it could be implemented as a single linear set of if statements and operations.
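
To illustrate that last claim (a hypothetical sketch, not tested): since the lookup below reads the map at most 3 times, an eBPF port could unroll the restarts into straight-line code:

// TryLookupOnce is an assumed straight-line version of the FindMethodCode
// body below, returning a code start, NULL, or a request to restart at an
// adjusted offset.
r = TryLookupOnce(offset, &offset);
if (r == RESTART) r = TryLookupOnce(offset, &offset); // off-by-one-nibble hop
if (r == RESTART) r = TryLookupOnce(offset, &offset); // off-by-one-chunk hop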

Proposed new nibble map layout, lookup algorithm, and structure

The intention here is to build a data structure which permits O(1) access once we reach a lookup at the nibble map layer. The dependencies it takes are that:

  1. No two methods may have start addresses within a single 32byte chunk.
  2. Each start address must be at least 4 byte aligned
  3. No code heap may be larger than 4GB
  4. Each chunk of memory allocated in the codeheap which can be found via the nibble map shall be preceded by a CodeHeader. These chunks will colloquially be known as jitted methods in this document, although the more precise term is executable memory chunk allocated on the CodeHeap. (The distinction exists as there is a concept known as a CodeFragmentHeap which allocates small chunks of memory within a single allocated "method". That allocation scheme is used by various forms of stubs.)
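
A hedged illustration of dependency 4 (the layout is assumed from the description above, not taken from the runtime sources): once a lookup yields a method start, its CodeHeader is one pointer subtraction away.

CodeHeader* GetCodeHeaderFromStart(uint8_t* pMethodCodeStart)
{
    // Dependency 4: the CodeHeader immediately precedes the executable chunk.
    return ((CodeHeader*)pMethodCodeStart) - 1;
}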

The data structure must support the following algorithms:

  1. Find start of code given pointer
  2. Iterate starts of jitted methods within a heap
  3. Add a new region of jitted memory within heap

There shall be an array of 32bit unsigned integers (pHdrMap), where logically each nibble in the array corresponds to 32 bytes of memory within the jit code region.
There shall be a pointer to the start of the code heap associated with this nibble map (pCodeStart)
There shall be a count of bytes that the code heap represents (cbCodeHeapSize).

Each 32bit unsigned value may be an array of 8 4-bit nibbles, or an encoded 32bit offset from the start of the CodeHeap memory section to the start of the associated method.

  • Each nibble is indexed from the least significant to the most significant bits in the 32bit value. Each nibble is physically stored as either 0, to indicate that it is uninitialized, or as a value representing the numbers 0-14 (obtained by reading the raw value of the nibble and subtracting 1 if it is valid data).
  • Nibble 0 is special in that if it is encoded as a number greater than 8, it indicates that the 32bit value is actually an encoded 32bit offset. The 32bit offset is computed from the 28 most significant bits of the chunk plus 2 bits computed from nibble 0 by subtracting 8 from the nibble value (used as the next 2 most significant bits); the lowest bits are 0, as all code start offsets are 4 byte aligned.
  • A nibble may declare that the start region for a method is within itself by having a value between 0 and 7. These represent the offsets within a given 32byte region where the code may start.
  • A nibble may also declare that the start region for a method is in either another nibble, or described by the chunk directly preceding the current chunk. This is done by having a value between 8 and 14, which indicates to look in the nibble between 1 and 5 before this nibble. If this reaches into an earlier chunk, restart the scan from the last nibble of the previous chunk. (NOTE: Normally this indexing only needs to be done once, but for the last nibble, as it can only represent up to 6 nibbles earlier, if it specifies movement to another chunk, there will be extra handling of the 0th nibble, which will indicate a need to move to the earlier chunk.)

The algorithm to parse this data is as follows. Be aware that this algorithm has not actually been tested/implemented, so it is likely nearly correct, but unlikely to be completely correct.

// This algorithm assumes that pCodePointer has previously been bounds checked to point at some portion of the code heap.
 
uint8_t* FindMethodCode(uint8_t* pCodePointer)
{
    if (pCodePointer < pCodeStart) return NULL;
    if (pCodePointer >= (pCodeStart + cbCodeHeapSize)) return NULL;
 
    bool dataUninitialized;
    uint32_t offsetFromStartOfCodeHeap = (uint32_t)((uint8_t*)pCodePointer - (uint8_t*)pCodeStart);
RestartOffByOne:
    int chunkSearch = offsetFromStartOfCodeHeap / 256;
    uint32_t chunkLogicalStartOffset = chunkSearch * 256;
    uint32_t chunk1 = pHdrMap[chunkSearch]; // This is the only memory read in the entire algorithm. Due to the presence of gotos, this may occur up to 3 times.
    if (IsEncoded32BitOffset(chunk1))
        return pCodeStart + Decode32BitOffset(chunk1); // Decode32BitOffset yields an offset from the heap start
    int32_t nibbleIndex = (int32_t)((offsetFromStartOfCodeHeap / 32) % 8);
    int nibbleRead1 = ReadNibble(chunk1, nibbleIndex, &dataUninitialized);
    if (dataUninitialized) return NULL;
    if (nibbleRead1 < 8)
    {
        // Some code block starts within this nibble
        uint32_t codeOffset = chunkLogicalStartOffset + nibbleIndex * 32 + nibbleRead1 * 4;
        if (codeOffset <= offsetFromStartOfCodeHeap)
        {
            // We've found the start of the code
            return (pCodeStart + codeOffset); // codeOffset is relative to the heap start
        }
        else
        {
            // We've found the start of a chunk of code AFTER pCodePointer. Restart search with a pointer that corresponds to one nibble earlier in the search
            // This restart can only happen if pCodePointer refers to an actual method contained within the heap.
            if ((offsetFromStartOfCodeHeap & ~0x1F) == 0) return NULL;
            offsetFromStartOfCodeHeap = (offsetFromStartOfCodeHeap & ~0x1F) - 1;
            goto RestartOffByOne; // OffByOneNibble
        }
    }
    else
    {
        // This nibble does not have a code start within it, but lookup should check another nibble or chunk.
        int nibbleEarlierToLookAt = nibbleRead1 - 7;
        int nibbleOfInterest = nibbleIndex - nibbleEarlierToLookAt;
        // Valid results of nibbleOfInterest are the numbers -1 to 5.
        if (nibbleOfInterest == -1)
        {
            // We've found that the start of this jitted method can only be described by an earlier chunk entirely.
            // This can only happen once during the lookup, as either the earlier chunk is a 32bit offset chunk, in which case the lookup will only need to look at one chunk
            // Or it is another set of nibbles, in which case the start of the method must be within that set of nibbles.
            offsetFromStartOfCodeHeap = (offsetFromStartOfCodeHeap & ~0xFF) - 1;
            goto RestartOffByOne; // OffByOneChunk
        }
        else
        {
            int nibbleRead2 = ReadNibble(chunk1, nibbleOfInterest, &dataUninitialized);
            if (dataUninitialized) return NULL;
            if (nibbleRead2 < 8)
            {
                uint32_t codeOffset = chunkLogicalStartOffset + nibbleOfInterest * 32 + nibbleRead2 * 4;
                return (pCodeStart + codeOffset);
            }
            else
            {
                assert(nibbleIndex == 7);
                assert(nibbleRead2 == 8);
                // We've found that the start of this jitted method can only be described by an earlier chunk entirely.
                // This can only happen once during the lookup, as either the earlier chunk is a 32bit offset chunk, in which case the lookup will only need to look at one chunk
                // Or it is another set of nibbles, in which case the start of the method must be within that set of nibbles.
                offsetFromStartOfCodeHeap = (offsetFromStartOfCodeHeap & ~0xFF) - 1;
                goto RestartOffByOne; // OffByOneChunk
            }
        }
    }
}

HELPER Methods

// Valid input requires nibbleIndex to be 0-7 inclusive
uint8_t ReadNibble(uint32_t chunk, int nibbleIndex, bool *dataUninitialized)
{
    uint8_t rawNibble = (chunk >> (4 * nibbleIndex)) & 0xF;
    *dataUninitialized = rawNibble == 0;
    return rawNibble - 1;
}
 
bool IsEncoded32BitOffset(uint32_t chunk)
{
    // This works as the data structure must not allow "nibbleOfInterest" to be less than -1,
    // which means no valid nibble encoding for nibble index 0 will be greater than 8.
    bool dataUninitialized;
    uint8_t nibbleZero = ReadNibble(chunk, 0, &dataUninitialized);
    return !dataUninitialized && nibbleZero > 8;
}
 
uint32_t Decode32BitOffset(uint32_t chunk)
{
    bool dataUninitialized;
    uint32_t nibble = ReadNibble(chunk, 0, &dataUninitialized);
    return (chunk & ~0xF) + ((nibble - 8) << 2);
}

@MichalPetryka
Contributor

Do I understand correctly that this is for looking up methods from their addresses? If so, would improvements in this area solve the issue I had in #85197 where invalid pointers would crash the VM when looking them up?

@jkotas
Member

jkotas commented Oct 26, 2023

If so, would improvements in this area solve the issue I had in #85197 where invalid pointers would crash the VM when looking them up?

This specific improvement would not help with your issue. The data structure proposed here has the same functional properties as the nibble map; it just has different perf characteristics.

@noahfalk
Member

noahfalk commented Oct 26, 2023

@davidwrighton - I spent a little while dissecting what you have and it seemed good to me. This was my simplified mental model of it which others might find useful. I know it doesn't include all the details, but let me know if anything it does include is inaccurate.

  • Every 256 byte region of the code heap is represented by 32 bits of data in the map
  • If there is no method start in a 256 byte region, the 32 bits will be used to encode the offset from the start of the code heap to the nearest preceding method start (it's not the normal encoding of a 32 bit integer though)
  • If there is one or more method start in a 256 byte region, the 32 bits will be used to encode 8 nibbles. The n'th nibble corresponds to the n'th 32-byte sub-region within the 256 byte region (8*32=256). There is a particular encoding of the 0th nibble which indicates whether to interpret the 32 bits as 8 nibbles or as a single offset.
  • Within a 32 byte sub-region, code can start at at most one of the 8 4-byte aligned addresses. The nibble encodes one of those 8 starting addresses, or 0 to indicate it is uninitialized, or one of the remaining 7 values used when there is no method start in these 32 bytes. We use 7 values rather than just one to provide a hint on which preceding nibble to search next. Linearly searching backwards through the nibbles would work, but skipping some nibbles speeds up the search.
  • The total number of map entries searched is always bounded. If a search in one 32 bit data map entry doesn't find a start address then it must find the start in the preceding entry. If the preceding entry is represented by nibbles that means there was at least one start address in it. Or if the preceding entry is a 32 bit offset that also identifies a method start and terminates the search.
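
A small worked example of this model (hypothetical numbers, following the encoding above):

// Suppose a method starts at offset 1000 from the code heap start:
//   map entry index  = 1000 / 256      = 3    (each 32bit entry covers 256 bytes)
//   region start     = 3 * 256         = 768
//   nibble index     = (1000 / 32) % 8 = 7
//   slot in bucket   = (1000 % 32) / 4 = 2    (stored raw as 2 + 1 = 3)
// A lookup for any RIP in [1000, 1023] hits nibble 7, decodes 2, and
// reconstructs 768 + 7*32 + 2*4 = 1000 as the method start.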

@AaronRobinsonMSFT

This comment was marked as off-topic.

@davidwrighton
Member

@AaronRobinsonMSFT No, there is only 1 method start in a 32byte region in the algorithm I presented; at the 256 byte region level, up to 8 method starts could exist. Now that I've spent a few more moments thinking on this, I suspect we should do some analysis to determine the number of methods which are < 48 and > 24 bytes on our 64bit platforms, as you could reduce the amount of memory needed for this sort of map so that each nibble represents 64 bytes of executable space. If it turns out that we have very few of those, it would probably be preferable to have the system do all of its alignment in terms of 8 bytes instead of 4 bytes, working with 64 byte and 512 byte regions.
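
(For scale: at one nibble per 32 bytes, the map costs 4 bits per 32 bytes of code, i.e. 1/64 of the code size; one nibble per 64 bytes halves that to 1/128.)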

@AaronRobinsonMSFT
Member

AaronRobinsonMSFT commented Oct 27, 2023

@AaronRobinsonMSFT No, there is only 1 method start in a 32byte region in the algorithm I presented,

Apologies, you're right. I misinterpreted the 256 and confused it with the DWORDs and other units defined in the current implementation.

@lambdageek
Member

@max-charlamb is going to take a stab at prototyping #93550 (comment) (both reading and the implied writing algorithm) and to measure the perf impact

@jkotas
Member

jkotas commented Oct 9, 2024

Potential alternative design

When tracing is turned on for the process (which is very common in the cloud), we transform the information stored in the nibble map to the OS unwind info format (look for UnwindInfoTable::PublishUnwindInfoForExistingMethods) and then maintain the nibble map and the OS unwind info format in sync (look for UnwindInfoTable::PublishUnwindInfoForMethod). This flow only happens on Windows today. There is an effort to enable a similar scheme for dynamic code on Linux as well, based on SFRAME.

We may want to consider replacing the nibble map by storing the information in the OS unwind info format only. It would save us from maintaining duplicate information and it would reduce the performance hit from turning on tracing, which has been reported as a perf problem by some of our customers. cc @brianrob

@lambdageek
Member

lambdageek commented Oct 9, 2024

We may want to consider replacing the nibble map by storing the information in the OS unwind info format only.

@jkotas Is this a "boil the ocean" proposal, or could we opt into the new scheme gradually, for example starting with Windows or Linux only?
There's no particular reason why a platform would need to use its own format - for example, macOS could use the Linux SFRAME format? We would light up the platform's tools if we used the native format, but it wouldn't be a hard requirement? Not sure what to do with unsupported architectures (for example, SFRAME v2 seems to only define arm64 and amd64).

One thing that I liked about just replacing the nibble map is that it's a fairly self-contained project that could be prototyped and measured in a limited amount of time. I think this alternate proposal has broader scope (I think we wouldn't need the RangeSectionMap anymore either, right?) and I don't have a good handle on how much additional work we'd be taking on.

@fabled
Author

fabled commented Oct 9, 2024

When tracing is turned on for the process (which is very common in the cloud), we transform the information stored in the nibble map to the OS unwind info format (look for UnwindInfoTable::PublishUnwindInfoForExistingMethods) and then maintain the nibble map and the OS unwind info format in sync (look for UnwindInfoTable::PublishUnwindInfoForMethod).

Unfortunately this does not seem like a sufficient solution to me. Parsing the SFRAME format also requires linear processing and maintaining a state machine. It is great for debuggers, post-mortem core dumps, etc. But it does not solve the problem of doing realtime unwinding from eBPF, due to the timing and instruction constraints.

EDIT: I also think this is not the same thing. The nibble map is used to map RIP-to-methodMetaData. This needs to be done first, even before we can parse SFRAME. SFRAME is just for processing the unwinding information and cannot serve the initial-mapping purpose that the nibble map is used for.

@fabled
Author

fabled commented Oct 9, 2024

@max-charlamb is going to take a stab at prototyping #93550 (comment) (both reading and the implied writing algorithm) and to measure the perf impact

...

One thing that I liked about just replacing the nibble map is that it's a fairly self-contained project that could be prototyped and measured in a limited amount of time. I think this alternate proposal has broader scope (I think we wouldn't need the RangeSectionMap anymore either, right?) and I don't have a good handle on how much additional work we'd be taking on.

Wonderful! When doing the timing and performance measurements, there will obviously be a different effect on the "update map" side and the "use" side. And much of it also depends on how large the JITted function is. So measuring this probably requires workloads that are designed to stress this. It's also good to measure other side effects, such as how much cache is thrashed.

The profiler we are building has been made open source and donated to OpenTelemetry. In case you want and are able to look at Apache-2.0 licensed code, I can drop links to the relevant portions of our profiler that handle dotnet processing.

@jkotas
Member

jkotas commented Oct 9, 2024

@jkotas Is this a "boil the ocean" proposal, or could we opt into the new scheme gradually, for example starting with Windows or Linux only?

Yes, that would be fine. Also, I do not have a problem with just improving the nibble map. I mentioned it here since I have heard about the run-down performance issues and stacktraces for tracing multiple times from different directions recently.

Parsing the SFRAME format also requires linear processing and maintaining a state machine.

SFRAME FDE entries can be sorted to allow binary search. We would emit the SFRAME entries sorted. It would not be a linear search.

I also think this is not the same thing. The nibble map is used to map RIP-to-methodMetaData.

We can use the SFRAME FDE entry index for a lookup in a parallel array with methodMetaData.
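
A minimal sketch of that lookup (assumed names and entry layout, not the real SFRAME structures): the FDE entries are sorted by start offset, so a binary search finds the greatest start <= the RIP's offset, and the resulting index doubles as an index into a parallel methodMetaData array.

#include <stdint.h>

typedef struct { uint32_t startOffset; /* unwind data elided */ } FdeEntry;

int FindFdeIndex(const FdeEntry* fdes, int count, uint32_t ripOffset)
{
    int lo = 0, hi = count - 1, found = -1;
    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;
        if (fdes[mid].startOffset <= ripOffset)
        {
            found = mid; // candidate; a later entry may still start <= ripOffset
            lo = mid + 1;
        }
        else
        {
            hi = mid - 1;
        }
    }
    return found; // -1 if ripOffset precedes the first entry
}

// methodMetaData[FindFdeIndex(fdes, count, ripOffset)] then replaces the
// nibble map's RIP-to-methodMetaData step.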

@davidwrighton
Member

@jkotas, I'm pretty sure we have race conditions with the existing scheme we use on Windows that dynamically reports unwind data. This is... ok for ETW, where lost stack frames are somewhat acceptable, but we also use this data in a completely lock-free manner to handle exception unwinds/GC collections etc., which I don't think we can handle correctly with the current OS-interaction path. (I could be wrong on this though; I'm saying this from some vague memories.) Also, having a completely lock-free path here is important for our work towards having EH throw/catch reach peak performance, so I want to tread lightly there. OTOH, the current mess of how much it costs to have profiler stacks enabled on Windows is awful and should be investigated, that's for sure. My recollection is that much of the cost/race conditions with the current Windows approach involves handling cases where methods are deleted.

@jkotas
Member

jkotas commented Oct 10, 2024

I agree that publishing of the unwind data to Windows has a by-design race condition.

I am not suggesting that we replace our lock-free rangemap. It would make things slow and expose us to this race condition. If we were to switch to unwinding tables in the Windows OS format as the source of truth, we should be able to keep our lock-free range map and keep our lock-free lookups. Replacing the lock-free nibble map search with a lock-free binary search should be the only change on the lookup path.

We used nibble maps in NGen images originally. IIRC, deleting the nibble maps from NGen images and switching to regular unwind info as the source of truth was perf-neutral on average for throughput and came with a binary size reduction.

The perf issues that I have heard about recently were about the tracing startup/rundown costs. Conversion of the nibble maps to OS unwind infos that happens when tracing is enabled for the first time showed up on the radar as part of the tracing startup cost.

@davidwrighton
Member

  1. I am not particularly concerned about the performance cost of converting to a sorted array based representation. I agree that it is almost certainly smaller, and given the typical size of our code regions, will never have a very deep binary search. The idea of guaranteed O(1) performance that we can achieve with a fairly small modification of our existing logic is nice, but not really a driving factor for me (you'll note I wrote up this idea, but didn't push it forward last release). Notably, I have seen the sorted array in the R2R image show up as a minor perf bottleneck, but only when I did an EH tweak which wasn't quite safe and measured multithreaded EH behavior. Also, note that performance concerns have shifted since the NGEN era; while details like this simply didn't have much of an impact on EH in the past, now that things have gotten a lot faster, the lookup of the function is much more relevant than it once was. All that said, reducing memory usage is pretty much always a win, and a sorted array is a smaller amount of memory for most cases. It also has the property that we could likely remove the inline pointer to the CodeHeader from within the jitted data stream, which would be nice.
  2. A lock-free sorted array isn't a simple thing to implement. For R2R or NGen, it's a fairly trivial thing, as those are fixed and do not change, but given that we need to support allocation/free patterns given the prevalence of dynamic code, and the possibility of a JIT compilation abort/failure, it gets much more difficult to be confident that the data is always safe. We might reasonably choose to end up with a model where access is totally lock-free as long as the code is called in COOPERATIVE mode or from the GC (possibly with a version number somewhere, so we can do in-place mutation of the data structure; see the sketch after this list), but isn't lock-free for users in PREEMPTIVE mode. This would match the locking behavior in practice of the lock-free rangemap, and cover all of the really performance-critical code in the runtime (EH and GC performance).
  3. My general preference would be to keep OS specific details to OS specific structures, but I agree that the cost to enable them for profiling is rather extreme, and disruptive to applications.
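
A minimal sketch of the version-number idea mentioned in (2) above (a textbook seqlock pattern with hypothetical names, not a design for the actual structure): writers bump the version to an odd value, mutate in place, then bump it to an even value; lock-free readers retry if a write overlapped their reads.

#include <stdatomic.h>
#include <stdint.h>

extern _Atomic uint32_t version; // even = stable, odd = write in progress

uint32_t ReadEntrySeqlock(const _Atomic uint32_t* pEntry)
{
    for (;;)
    {
        uint32_t v1 = atomic_load_explicit(&version, memory_order_acquire);
        uint32_t value = atomic_load_explicit(pEntry, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire); // order the data read before the re-check
        uint32_t v2 = atomic_load_explicit(&version, memory_order_relaxed);
        if (((v1 & 1) == 0) && v1 == v2)
            return value; // no writer overlapped our reads
    }
}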
