
Apply reductions using generated code instead of using runtime function calls #522

Closed
yorickpeterse opened this issue May 15, 2023 · 4 comments
Labels: compiler, feature, performance


Description

#508 introduces a native code compiler that uses LLVM, replacing the bytecode interpreter. Reductions are applied using the runtime function inko_reduce. Because this function is provided by the Rust-based runtime library, it isn't inlined. A quick glance at profiling output suggests that for the test suite we spend quite a bit of time in this function. That isn't entirely surprising, as the function is called after every method call/return.

To optimise this, we should apply reductions using generated code, only calling into the runtime to yield back to the OS thread when needed. Given a reduction count of 1000, instead of 1001 function calls (1000 for the reduction checks, plus one for the yield), we'd make only 1 function call every 1000 reductions.

For this to work we'll need to make the thread field of the Process structure public (signalling that it's used by generated code), and add a test checking the offset and size of this field, to ensure they remain consistent with what the generated code expects. If a yield is needed, we then call inko_process_yield.
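For illustration, here's a minimal sketch in Rust, standing in for the generated code. The layout of Thread/Process and the location of the counter are my own assumptions; only the thread field and inko_process_yield come from the description above:

```rust
const REDUCTIONS: u16 = 1000;

#[repr(C)]
pub struct Thread {
    reductions: u16, // hypothetical: the per-thread reduction budget
}

#[repr(C)]
pub struct Process {
    // Made public so generated code can reach it; a test would assert that
    // its offset and size stay consistent with what that code expects.
    pub thread: *mut Thread,
}

extern "C" {
    // Only called once the budget is exhausted, i.e. once per 1000 reductions.
    fn inko_process_yield(process: *mut Process);
}

// The equivalent of what the compiler would inline at each reduction point,
// replacing the call to inko_reduce:
unsafe fn reduce(process: *mut Process) {
    let thread = (*process).thread;

    (*thread).reductions -= 1;

    if (*thread).reductions == 0 {
        (*thread).reductions = REDUCTIONS; // refill the budget
        inko_process_yield(process);       // yield back to the OS thread
    }
}
```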

Related work

No response

yorickpeterse added the feature, performance, and accepting contributions labels on May 15, 2023
yorickpeterse modified the milestones: 0.12.0, 0.13.0 on May 19, 2023
yorickpeterse modified the milestones: 0.13.0, 0.14.0 on Jun 6, 2023
yorickpeterse added the compiler label and removed the accepting contributions label on Jul 7, 2023

yorickpeterse commented Jul 15, 2023

One idea I've been toying with is to take a different approach:

Instead of reductions, we assign time slices. For the sake of this example, let's say the time slice is 10 milliseconds.

The scheduler maintains a global "run epoch" (RE), and each OS thread maintains a "thread epoch" (TE). Both values are atomic integers that wrap around upon overflowing. The epochs start at 1, with zero signalling a lack of a value.

When a thread is about to run a process, it grabs the current RE, stores it in its TE, and then runs the process. At various places in the code (i.e. where we currently reduce), we check the TE. If the value is zero, we reschedule the process and pick up the next one, repeating the cycle.

A separate thread, which I'll call the "process monitor" (PM), runs in the background. The PM thread sleeps for 10 milliseconds. Upon waking up, it iterates over the OS threads and compares their TE with the RE. If the values differ, it sets that thread's TE to zero. At the end it increments the RE, wrapping around if needed and making sure to skip the value 0.

The idea behind this setup is that processes run for a fixed amount of time, but without having to increment a counter many times over. The cost per check is an atomic read of an integer; using a "relaxed" ordering, this shouldn't be a problem. Since the PM operates on threads rather than processes, and the number of threads is fixed, the PM's runtime is O(P + B), where P is the number of primary threads and B the number of backup threads.
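A sketch of this scheme, with names of my own choosing (the real scheduler types would differ):

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::time::Duration;

static RUN_EPOCH: AtomicU32 = AtomicU32::new(1); // RE; 0 means "no value"

pub struct ThreadEpoch(AtomicU32); // TE; one per OS thread

impl ThreadEpoch {
    // Called when a thread picks up a process to run.
    pub fn start(&self) {
        self.0.store(RUN_EPOCH.load(Ordering::Relaxed), Ordering::Relaxed);
    }

    // Called where we currently reduce: a relaxed load and a compare.
    pub fn must_yield(&self) -> bool {
        self.0.load(Ordering::Relaxed) == 0
    }
}

// The process monitor (PM) thread.
pub fn process_monitor(threads: &[ThreadEpoch]) {
    loop {
        std::thread::sleep(Duration::from_millis(10));

        let epoch = RUN_EPOCH.load(Ordering::Relaxed);

        // O(P + B): one check per OS thread, regardless of process count.
        for thread in threads {
            if thread.0.load(Ordering::Relaxed) != epoch {
                thread.0.store(0, Ordering::Relaxed); // request a yield
            }
        }

        // Bump RE, wrapping around and skipping the reserved value 0.
        RUN_EPOCH.store(epoch.wrapping_add(1).max(1), Ordering::Relaxed);
    }
}
```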

The downside is that if the PM interval is too small, we end up wasting CPU cycles (and battery life, in the case of mobile devices), while if the interval is too large, the program's throughput may suffer.

If I'm not mistaken, Go takes a similar approach to this.

yorickpeterse commented

Per jinyus/related_post_gen#440 (comment), it seems we can easily spend a significant amount of time just in inko_reduce. We really should move towards a "passive" scheduler as outlined above.

yorickpeterse commented

Another option is to do something similar to Pony (see here and here): when picking up a message, we run from start to finish, yielding at the end.

This approach doesn't introduce any scheduling overhead (unlike fuel-based mechanisms), but it also means code such as while true can hog an OS thread indefinitely. To mitigate that, we can layer the above time-based mechanism on top, applying it in far fewer places (e.g. around IO or at the end of a loop iteration), though that raises the question of why not do only that.
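A self-contained sketch of that combination, with hypothetical types standing in for Inko's actual runtime:

```rust
use std::collections::VecDeque;

struct Message(u64);

struct Process {
    mailbox: VecDeque<Message>,
}

impl Process {
    fn handle(&mut self, message: Message) {
        // A handler runs start to finish with no preemption checks inside,
        // so e.g. `while true` in here would hog the OS thread.
        let _ = message;
    }
}

// Returns true if the process should be rescheduled because its time slice
// expired, false if it simply drained its mailbox.
fn run(process: &mut Process, time_slice_expired: impl Fn() -> bool) -> bool {
    while let Some(message) = process.mailbox.pop_front() {
        process.handle(message);

        // The time-based check applied in only a few places: here, between
        // messages, instead of after every method call/return.
        if time_slice_expired() {
            return true;
        }
    }

    false
}
```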


yorickpeterse commented Nov 18, 2023

Per this document, Go's approach is to send a signal (SIGURG) to a thread that needs to yield. This way you don't need explicit checks in various places (e.g. loops), improving performance.

The basic setup would involve installing a signal handler, then having a background thread periodically send such a signal to running threads. The handler (running in the context of the interrupted process) would then simply yield back to the scheduler.

A downside is that this may affect system calls in weird ways (e.g. blocking calls being interrupted and failing with EINTR), such that we have to handle retrying them, and I'm not sure what the best way of doing so is.

Another challenge is that if a thread masks the signal for some reason (e.g. an FFI function doing so), it might miss the "yield yourself" notification. The approach would also require different code for Windows, which lacks POSIX signals.
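A rough sketch of the Unix side, assuming the libc crate. The names and overall shape here are illustrative, not Go's actual implementation, and the handler only sets a flag; yielding directly from inside a signal handler is considerably more involved:

```rust
use std::mem;
use std::sync::atomic::{AtomicBool, Ordering};

// Simplified to a single global flag; a real runtime would need one flag per
// thread, reachable from the handler in an async-signal-safe way.
static YIELD_REQUESTED: AtomicBool = AtomicBool::new(false);

extern "C" fn on_sigurg(_signal: libc::c_int) {
    // Only async-signal-safe work is allowed here; setting an atomic flag
    // that the interrupted thread polls at safe points is about the limit.
    YIELD_REQUESTED.store(true, Ordering::Relaxed);
}

fn install_handler() {
    unsafe {
        let mut action: libc::sigaction = mem::zeroed();

        action.sa_sigaction = on_sigurg as libc::sighandler_t;
        libc::sigemptyset(&mut action.sa_mask);
        // SA_RESTART transparently restarts many (but not all) system calls
        // interrupted by the signal, one way of tackling the EINTR problem.
        action.sa_flags = libc::SA_RESTART;
        libc::sigaction(libc::SIGURG, &action, std::ptr::null_mut());
    }
}

// A monitoring thread would periodically run this for every OS thread that
// exceeded its time slice:
fn request_yield(thread: libc::pthread_t) {
    unsafe {
        libc::pthread_kill(thread, libc::SIGURG);
    }
}
```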

yorickpeterse self-assigned this on Nov 19, 2023
yorickpeterse added this to the 0.13.2 milestone on Nov 19, 2023
yorickpeterse added a commit that referenced this issue Nov 20, 2023
Instead of each process reducing a counter and yielding when the counter
reaches zero, each process now tracks an epoch. This epoch is a global
counter incremented every 10 milliseconds by a background thread. The
compiler in turn inserts a "preempt" instruction, compiled to code that
checks the process' epoch against the global epoch. If the two differ,
the process yields control back to the scheduler.

This new approach can improve performance by up to 40%, in part because
we no longer need function calls and instead compile the instruction
directly to machine code. In addition, we no longer insert the
preemption check after every method call, instead only inserting it at
the end of a loop iteration, or before a next or break.

This new setup is less fine-grained than the old approach, but that's OK: the
end goal is not to give each process the exact same amount of fuel/time, but
rather to prevent one process from forever holding on to a scheduler thread.

This fixes #522.

Changelog: performance
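In Rust-like pseudocode (names illustrative), the check the "preempt" instruction compiles down to is roughly:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Incremented every 10 milliseconds by the background thread.
static GLOBAL_EPOCH: AtomicU32 = AtomicU32::new(1);

struct Process {
    epoch: u32, // set from GLOBAL_EPOCH when the process is scheduled
}

impl Process {
    // Inserted at the end of a loop iteration and before a `next` or `break`,
    // rather than after every method call as before.
    fn preempt(&mut self) {
        if self.epoch != GLOBAL_EPOCH.load(Ordering::Relaxed) {
            self.yield_to_scheduler(); // hypothetical: reschedule the process
        }
    }

    fn yield_to_scheduler(&mut self) {}
}
```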