Apply reductions using generated code instead of using runtime function calls #522
One idea I've been toying with is to take a different approach: instead of reductions, we assign time slices. For the sake of this example, let's say the time slice is 10 milliseconds.

The scheduler maintains a global "run epoch" (RE), and each OS thread maintains a "thread epoch" (TE). Both values are atomic integers that wrap around upon overflowing. The epochs start at 1, with zero signalling a lack of a value.

When a thread is about to run a process, it grabs the current RE and stores it in its TE, then runs the process. At various places in the code (i.e. where we now reduce), we check TE. If the value is zero, we reschedule the process and pick up the next process, repeating the cycle.

In the background a separate thread, which I'll call the "process monitor" (PM), runs. The PM thread sleeps for 10 milliseconds. Upon waking up, it iterates over the OS threads and compares their TE with RE. If the values differ, it sets that TE to zero. At the end it increments RE, wrapping around if needed and making sure to skip the value 0.

The idea behind this setup is that processes run for a fixed amount of time, but without having to increment a counter many times. The cost here is atomically reading an integer; using a "relaxed" ordering, this shouldn't be a problem. Since the PM operates over threads and not processes, and the number of threads is fixed, the PM's runtime is constant.

The downside is that if the PM interval is too small, we end up wasting CPU cycles (and battery life in the case of mobile devices), but if the interval is too great the program's throughput may suffer.

If I'm not mistaken, Go takes a similar approach to this.
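A minimal Rust sketch of this RE/TE scheme, assuming a fixed pool of four scheduler threads; the names (`RUN_EPOCH`, `start_slice`, and so on) are illustrative, not part of Inko's runtime:

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::thread;
use std::time::Duration;

/// The global "run epoch" (RE), starting at 1; 0 means "no value".
static RUN_EPOCH: AtomicU32 = AtomicU32::new(1);

/// One "thread epoch" (TE) per scheduler thread. A real runtime would
/// store this in per-thread state; a static array keeps the sketch short.
static THREAD_EPOCHS: [AtomicU32; 4] = [
    AtomicU32::new(0),
    AtomicU32::new(0),
    AtomicU32::new(0),
    AtomicU32::new(0),
];

/// Called by a scheduler thread right before it runs a process.
fn start_slice(thread_id: usize) {
    let re = RUN_EPOCH.load(Ordering::Relaxed);

    THREAD_EPOCHS[thread_id].store(re, Ordering::Relaxed);
}

/// The check inserted where we currently reduce: a single relaxed load.
fn should_yield(thread_id: usize) -> bool {
    THREAD_EPOCHS[thread_id].load(Ordering::Relaxed) == 0
}

/// The process monitor (PM): wakes up every 10 ms, flags threads whose
/// TE lags behind RE, then advances RE, skipping 0 on wrap-around.
fn process_monitor() -> ! {
    loop {
        thread::sleep(Duration::from_millis(10));

        let re = RUN_EPOCH.load(Ordering::Relaxed);

        for te in &THREAD_EPOCHS {
            // The thread started its slice in an older epoch, so tell it
            // to reschedule by zeroing its TE.
            if te.load(Ordering::Relaxed) != re {
                te.store(0, Ordering::Relaxed);
            }
        }

        let next = if re == u32::MAX { 1 } else { re + 1 };

        RUN_EPOCH.store(next, Ordering::Relaxed);
    }
}
```

With relaxed ordering, `should_yield` compiles down to a single uncontended atomic load, which is the whole point of the scheme.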
Per jinyus/related_post_gen#440 (comment), it seems we can easily spend a significant amount of time just in `inko_reduce`.
Another option is to do something similar to Pony (see here and here): when picking up a message, we run from start to finish, yielding at the end. This approach doesn't introduce any scheduling overhead (unlike fuel-based mechanisms), but it also means code (e.g. a message handler that loops forever) can hold on to a scheduler thread indefinitely.
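For illustration, a rough Rust sketch of what run-to-completion scheduling amounts to; `Process` and `Message` here are hypothetical stand-ins, and real Pony processes messages in batches rather than always draining the mailbox:

```rust
use std::collections::VecDeque;

struct Message;

struct Process {
    mailbox: VecDeque<Message>,
}

impl Process {
    fn handle(&mut self, _message: Message) {
        // Runs from start to finish: no fuel checks, no epoch checks.
        // A handler that loops forever would pin its scheduler thread.
    }
}

/// A scheduler thread runs a process's pending messages to completion,
/// then moves on; yielding only happens between messages.
fn run(process: &mut Process) {
    while let Some(message) = process.mailbox.pop_front() {
        process.handle(message);
    }
}
```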
Per this document, Go's approach is to send a signal (`SIGURG`) to its OS threads in order to preempt running goroutines.

The basic setup would involve installing a thread-local signal handler, then having a background thread periodically send such a signal to running threads. The handler (running in the process) would then simply yield back to the scheduler.

A downside is that this may affect system calls in weird ways, such that we have to handle retrying them in those cases, and I'm not sure what the best way of doing so is. Another challenge is that if a thread masks the signal for some reason (e.g. an FFI function does this), then it might miss the "yield yourself" notification. The approach would also require different code for Windows.
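A sketch of the signal half of this idea in Rust, assuming the `libc` crate and POSIX signals; unlike Go's handler (which redirects execution directly), this simplified version only sets a flag for the process to check at its next safe point:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Set from the signal handler; checked by the running process at its
/// next safe point, at which point it yields back to the scheduler.
static YIELD_REQUESTED: AtomicBool = AtomicBool::new(false);

extern "C" fn on_preempt(_signal: libc::c_int) {
    // Only async-signal-safe work is allowed in a handler, so we just
    // set a flag here instead of yielding directly.
    YIELD_REQUESTED.store(true, Ordering::Relaxed);
}

fn install_handler() {
    unsafe {
        let mut action: libc::sigaction = std::mem::zeroed();

        action.sa_sigaction = on_preempt as usize;
        libc::sigemptyset(&mut action.sa_mask);
        libc::sigaction(libc::SIGURG, &action, std::ptr::null_mut());
    }
}

// A background thread would then periodically deliver the signal to
// each scheduler thread, e.g. using libc::pthread_kill(thread_id,
// libc::SIGURG).
```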
Instead of each process reducing a counter and yielding when the counter reaches zero, each process now tracks an epoch. This epoch is a global counter incremented every 10 milliseconds by a background thread. The compiler in turn inserts a "preempt" instruction, compiled to code that checks the process' epoch against the global epoch. If the two differ, the process yields control back to the scheduler.

This new approach can improve performance by up to 40%, in part because we no longer need function calls and instead compile the instruction directly to machine code. In addition, we no longer insert the preemption check after every method call, instead only inserting it at the end of a loop iteration, or before a `next` or `break`.

This new setup is less fine-grained compared to the old approach, but that's OK, as the end goal is not to give each process the exact same amount of fuel/time, but rather to prevent one process from forever holding on to a scheduler thread.

This fixes #522.

Changelog: performance
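In ordinary Rust, the check the "preempt" instruction compiles to would look roughly like the following; `preempt` and `yield_to_scheduler` are illustrative names, not the runtime's actual API:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;
use std::time::Duration;

/// The global epoch, bumped every 10 milliseconds.
static EPOCH: AtomicU64 = AtomicU64::new(0);

fn start_epoch_thread() -> thread::JoinHandle<()> {
    thread::spawn(|| loop {
        thread::sleep(Duration::from_millis(10));
        EPOCH.fetch_add(1, Ordering::Relaxed);
    })
}

/// Roughly what the "preempt" instruction compiles down to, inserted
/// at the end of a loop iteration and before a `next` or `break`.
fn preempt(process_epoch: &mut u64) {
    let current = EPOCH.load(Ordering::Relaxed);

    if *process_epoch != current {
        *process_epoch = current;
        yield_to_scheduler();
    }
}

/// Hypothetical stand-in for the runtime call that suspends the
/// process and hands the scheduler thread to another one.
fn yield_to_scheduler() {}
```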
Description
#508 introduces a native code compiler that uses LLVM, replacing the bytecode interpreter. Reductions are applied using the runtime function `inko_reduce`. Because this function is provided by the Rust-based runtime library, it isn't inlined. A quick glance at profiling output suggests that for the test suite we spend quite a bit of time in this function. This isn't entirely surprising though, as the function is called after every method call/return.

To optimise this we should apply reductions using generated code, only calling runtime functions to yield back to the OS thread if needed. Given a reduction count of 1000, this means that instead of 1001 function calls (1000 for the reduction checks, and one for the yield), we'd only have 1 function call every 1000 reductions.
For this to work we'll need to make the `thread` field of the `Process` structure public (to signal it being used by the generated code), and add a test to check the offset and size of this field to ensure they remain consistent. If a yield is needed we then call `inko_process_yield`.
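As a sketch, the generated check would amount to something like the following Rust; the `Process` layout shown (and the refill value) are assumptions for the sake of the example, and `inko_process_yield` is stubbed out:

```rust
/// Illustrative stand-in for the runtime's `Process` structure; the
/// field names, types, and layout are assumptions, not Inko's real
/// definition.
#[repr(C)]
pub struct Process {
    pub thread: *mut u8, // made public so generated code can use it
    pub reductions: u16,
}

/// Stand-in for the real runtime function that yields the OS thread.
fn inko_process_yield(_process: &mut Process) {}

/// Roughly the check the generated code performs after each method
/// call/return: decrement an inline counter, and only cross into the
/// runtime once the budget is exhausted.
fn reduce(process: &mut Process) {
    process.reductions -= 1;

    if process.reductions == 0 {
        process.reductions = 1000; // refill the budget
        inko_process_yield(process);
    }
}
```

With a reduction count of 1000, `inko_process_yield` is reached once per 1000 decrements, matching the 1001-to-1 call reduction described above.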
Related work
No response