
Apply reductions using generated code instead of using runtime function calls #522

Closed
yorickpeterse opened this issue May 15, 2023 · 4 comments
Labels: compiler, feature, performance


Description

#508 introduces a native code compiler that uses LLVM, replacing the bytecode interpreter. Reductions are applied using the runtime function inko_reduce. Because this function is provided by the Rust-based runtime library, it isn't inlined. A quick glance at profiling output suggests that for the test suite we spend quite a bit of time in this function. That isn't entirely surprising, as the function is called after every method call/return.

To optimise this, we should apply reductions using generated code, only calling into the runtime to yield back to the OS thread when needed. Given a reduction count of 1000, instead of 1001 function calls (1000 for the reduction checks, plus one for the yield), we'd make only 1 function call every 1000 reductions.

For this to work we'll need to make the thread field of the Process structure public (signalling that it's used by generated code), and add a test checking the offset and size of this field, to ensure they remain consistent with what the generated code expects. If a yield is needed, we then call inko_process_yield.
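For illustration, here's a minimal sketch in Rust, standing in for the generated code. The layout of Thread/Process and the location of the counter are my own assumptions; only the thread field and inko_process_yield come from the description above:

```rust
const REDUCTIONS: u16 = 1000;

#[repr(C)]
pub struct Thread {
    reductions: u16, // hypothetical: the per-thread reduction budget
}

#[repr(C)]
pub struct Process {
    // Made public so generated code can reach it; a test would assert that
    // its offset and size stay consistent with what that code expects.
    pub thread: *mut Thread,
}

extern "C" {
    // Only called once the budget is exhausted, i.e. once per 1000 reductions.
    fn inko_process_yield(process: *mut Process);
}

// The equivalent of what the compiler would inline at each reduction point,
// replacing the call to inko_reduce:
unsafe fn reduce(process: *mut Process) {
    let thread = (*process).thread;

    (*thread).reductions -= 1;

    if (*thread).reductions == 0 {
        (*thread).reductions = REDUCTIONS; // refill the budget
        inko_process_yield(process);       // yield back to the OS thread
    }
}
```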

Related work

No response

yorickpeterse added the feature, performance, and accepting contributions labels on May 15, 2023
yorickpeterse modified the milestones: 0.12.0, 0.13.0 on May 19, 2023
yorickpeterse modified the milestones: 0.13.0, 0.14.0 on Jun 6, 2023
yorickpeterse added the compiler label and removed the accepting contributions label on Jul 7, 2023

yorickpeterse commented Jul 15, 2023

One idea I've been toying with is to take a different approach:

Instead of reductions, we assign time slices. For the sake of this example, let's say the time slice is 10 milliseconds.

The scheduler maintains a global "run epoch" (RE), and each OS thread maintains a "thread epoch" (TE). Both values are atomic integers that wrap around upon overflowing. The epochs start at 1, with zero signalling a lack of a value.

When a thread is about to run a process, it grabs the current RE, stores it in its TE, and then runs the process. At various places in the code (i.e. where we currently reduce), we check the TE. If the value is zero, we reschedule the process and pick up the next one, repeating the cycle.

A separate thread, which I'll call the "process monitor" (PM), runs in the background. The PM thread sleeps for 10 milliseconds. Upon waking up, it iterates over the OS threads and compares their TE with the RE. If the values differ, it sets that thread's TE to zero. At the end it increments the RE, wrapping around if needed and making sure to skip the value 0.

The idea behind this setup is that processes run for a fixed amount of time, but without having to increment a counter many times over. The cost per check is an atomic read of an integer; using a "relaxed" ordering, this shouldn't be a problem. Since the PM operates on threads rather than processes, and the number of threads is fixed, the PM's runtime is O(P + B), where P is the number of primary threads and B the number of backup threads.
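A sketch of this scheme, with names of my own choosing (the real scheduler types would differ):

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::time::Duration;

static RUN_EPOCH: AtomicU32 = AtomicU32::new(1); // RE; 0 means "no value"

pub struct ThreadEpoch(AtomicU32); // TE; one per OS thread

impl ThreadEpoch {
    // Called when a thread picks up a process to run.
    pub fn start(&self) {
        self.0.store(RUN_EPOCH.load(Ordering::Relaxed), Ordering::Relaxed);
    }

    // Called where we currently reduce: a relaxed load and a compare.
    pub fn must_yield(&self) -> bool {
        self.0.load(Ordering::Relaxed) == 0
    }
}

// The process monitor (PM) thread.
pub fn process_monitor(threads: &[ThreadEpoch]) {
    loop {
        std::thread::sleep(Duration::from_millis(10));

        let epoch = RUN_EPOCH.load(Ordering::Relaxed);

        // O(P + B): one check per OS thread, regardless of process count.
        for thread in threads {
            if thread.0.load(Ordering::Relaxed) != epoch {
                thread.0.store(0, Ordering::Relaxed); // request a yield
            }
        }

        // Bump RE, wrapping around and skipping the reserved value 0.
        RUN_EPOCH.store(epoch.wrapping_add(1).max(1), Ordering::Relaxed);
    }
}
```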

The downside is that if the PM interval is too small, we end up wasting CPU cycles (and battery life, in the case of mobile devices), while if the interval is too large, the program's throughput may suffer.

If I'm not mistaken, Go takes a similar approach to this.

yorickpeterse commented

Per jinyus/related_post_gen#440 (comment), it seems we can easily spend a significant amount of time just in inko_reduce. We really should move towards a "passive" scheduler as outlined above.

yorickpeterse commented

Another option is to do something similar to Pony (see here and here): when picking up a message, we run from start to finish, yielding at the end.

This approach doesn't introduce any scheduling overhead (unlike fuel-based mechanisms), but it also means code such as while true can hog an OS thread indefinitely. To mitigate that, we can layer the above time-based mechanism on top, applying it in far fewer places (e.g. around IO or at the end of a loop iteration), though that raises the question of why not do only that.
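A self-contained sketch of that combination, with hypothetical types standing in for Inko's actual runtime:

```rust
use std::collections::VecDeque;

struct Message(u64);

struct Process {
    mailbox: VecDeque<Message>,
}

impl Process {
    fn handle(&mut self, message: Message) {
        // A handler runs start to finish with no preemption checks inside,
        // so e.g. `while true` in here would hog the OS thread.
        let _ = message;
    }
}

// Returns true if the process should be rescheduled because its time slice
// expired, false if it simply drained its mailbox.
fn run(process: &mut Process, time_slice_expired: impl Fn() -> bool) -> bool {
    while let Some(message) = process.mailbox.pop_front() {
        process.handle(message);

        // The time-based check applied in only a few places: here, between
        // messages, instead of after every method call/return.
        if time_slice_expired() {
            return true;
        }
    }

    false
}
```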


yorickpeterse commented Nov 18, 2023

Per this document, Go's approach is to send a signal (SIGURG) to a thread that needs to yield. This way you don't need explicit checks in various places (e.g. loops), improving performance.

The basic setup would involve installing a signal handler, then having a background thread periodically send such a signal to running threads. The handler (running in the context of the interrupted process) would then simply yield back to the scheduler.

A downside is that this may affect system calls in weird ways (e.g. blocking calls being interrupted and failing with EINTR), such that we have to handle retrying them, and I'm not sure what the best way of doing so is.

Another challenge is that if a thread masks the signal for some reason (e.g. an FFI function doing so), it might miss the "yield yourself" notification. The approach would also require different code for Windows, which lacks POSIX signals.
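A rough sketch of the Unix side, assuming the libc crate. The names and overall shape here are illustrative, not Go's actual implementation, and the handler only sets a flag; yielding directly from inside a signal handler is considerably more involved:

```rust
use std::mem;
use std::sync::atomic::{AtomicBool, Ordering};

// Simplified to a single global flag; a real runtime would need one flag per
// thread, reachable from the handler in an async-signal-safe way.
static YIELD_REQUESTED: AtomicBool = AtomicBool::new(false);

extern "C" fn on_sigurg(_signal: libc::c_int) {
    // Only async-signal-safe work is allowed here; setting an atomic flag
    // that the interrupted thread polls at safe points is about the limit.
    YIELD_REQUESTED.store(true, Ordering::Relaxed);
}

fn install_handler() {
    unsafe {
        let mut action: libc::sigaction = mem::zeroed();

        action.sa_sigaction = on_sigurg as libc::sighandler_t;
        libc::sigemptyset(&mut action.sa_mask);
        // SA_RESTART transparently restarts many (but not all) system calls
        // interrupted by the signal, one way of tackling the EINTR problem.
        action.sa_flags = libc::SA_RESTART;
        libc::sigaction(libc::SIGURG, &action, std::ptr::null_mut());
    }
}

// A monitoring thread would periodically run this for every OS thread that
// exceeded its time slice:
fn request_yield(thread: libc::pthread_t) {
    unsafe {
        libc::pthread_kill(thread, libc::SIGURG);
    }
}
```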

yorickpeterse self-assigned this on Nov 19, 2023
yorickpeterse added this to the 0.13.2 milestone on Nov 19, 2023
yorickpeterse added a commit that referenced this issue Nov 20, 2023
Instead of each process reducing a counter and yielding when the counter
reaches zero, each process now tracks an epoch. This epoch is a global
counter incremented every 10 milliseconds by a background thread. The
compiler in turn inserts a "preempt" instruction, compiled to code that
checks the process' epoch against the global epoch. If the two differ,
the process yields control back to the scheduler.

This new approach can improve performance by up to 40%, in part because
we no longer need function calls and instead compile the instruction
directly to machine code. In addition, we no longer insert the
preemption check after every method call, instead only inserting it at
the end of a loop iteration, or before a next or break.

This new setup is less fine-grained than the old approach, but that's OK: the
end goal is not to give each process the exact same amount of fuel/time, but
rather to prevent one process from forever holding on to a scheduler thread.

This fixes #522.

Changelog: performance
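In Rust-like pseudocode (names illustrative), the check the "preempt" instruction compiles down to is roughly:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Incremented every 10 milliseconds by the background thread.
static GLOBAL_EPOCH: AtomicU32 = AtomicU32::new(1);

struct Process {
    epoch: u32, // set from GLOBAL_EPOCH when the process is scheduled
}

impl Process {
    // Inserted at the end of a loop iteration and before a `next` or `break`,
    // rather than after every method call as before.
    fn preempt(&mut self) {
        if self.epoch != GLOBAL_EPOCH.load(Ordering::Relaxed) {
            self.yield_to_scheduler(); // hypothetical: reschedule the process
        }
    }

    fn yield_to_scheduler(&mut self) {}
}
```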