Consider compiling down to machine code instead of VM bytecode #317
Another option is to use Cranelift (https://github.com/bytecodealliance/wasmtime/tree/main/cranelift), but:
Another option is a hybrid: the runtime is written in Rust, and we compile this into a static library that exposes a set of C functions/types. Inko code in turn is compiled into C code, and the resulting C program statically links to the Rust-based runtime. Using this approach we can benefit from the faster compile times of C, while still being able to write the runtime in a sane language. This does require that the compiler can still inline calls from the statically linked runtime library, otherwise it might get expensive.
I experimented a bit with compiling the VM to a static library, then using that in C (optionally with ThinLTO enabled to inline the functions). The summary of this experiment is as follows: in theory it could work, with the caveat that cross-language LTO (using https://doc.rust-lang.org/rustc/linker-plugin-lto.html) is highly unreliable at the time of writing. I initially got it to work on my laptop, but after updating Rust to 1.62 I'm no longer able to get it to inline the Inko methods. On my desktop, compiling dependencies segfaulted until I ran

Calling a function for basically every instruction isn't great from a performance perspective, and it might even be possible that an interpreter is faster than that (as it won't need to call a function for every instruction). Obviously we'd improve this over time, so it would get better.

One issue is suspending processes: in the VM we have full control over everything, so suspending/resuming processes is pretty simple. If we compile down to C (or something else for that matter), we need a mechanism to save/restore the entire call stack. C has setjmp()/longjmp() for this, though they come with plenty of caveats.

In an ideal world threads would be much cheaper to use, and we could just rely on them instead of using our own scheduler. Sadly, I suspect it will be another ten years (if not more) before threads (and context switching) are cheap enough that you can just spawn tens of thousands of threads and not suffer as a result.
For generic methods and methods relying on dynamic dispatch (e.g.

While we know that

Relying on hidden arguments would likely prove difficult to implement, and I suspect we'd run into hard-to-debug bugs, or cases where code passes the wrong data but happens to work because the sizes match.

An alternative is to do something like this: for every type the compiler generates a type information object. This structure describes the size, alignment, etc., in addition to describing this for its generic arguments (if any). We would then stick this into the class, at the cost of having to duplicate the class for every generic type (i.e. one copy per combination of generic type arguments).

For allocating we'd have to read this from the arguments before reserving the stack space, and I'm not sure this would work out. Assuming for a moment we'd find a way, we'd still have to deal with values without object headers, such as Int and Float. For Int we could sacrifice a bit and use that, but for Float I'm not sure what options we have. We could also take the heavy-handed approach of always giving objects a header, but this means an Int takes up 16 bytes instead of 8.

The other obvious approach is to monomorphise code, at the cost of increased compile times and memory usage. We'd also still have the issue of
By default all values would be heap allocated and passed around as pointers. For Int we could use pointer tagging as we do now, but only use a single bit instead of two (meaning we can represent 63-bit integers instead of 62-bit ones). 64-bit integers would still be heap allocated/boxed. For Float we could box it, but (re)introduce a special-purpose bump allocator for values that are 24 bytes in size (16 bytes for the header, plus 8 for the value). We'd use this bump allocator for boxed Float and Int values, bypassing the cost of using the system allocator. The allocator would recycle pointers by sticking them in a free list, though care must be taken for cases such as thread A allocating a pointer P but thread B releasing it. In this setup Float is larger than necessary, but the cost of creating one is still only a few nanoseconds (5-10 should be realistic for such a special-purpose allocator).

This solves our stack space problem: since values are always 8 bytes in size, we don't need to read any type information upon creating variables; only when dropping them. We'd still have to copy generic classes though, or stick the type information into the object header somehow.

In addition, this setup somewhat falls apart if we allow stack allocation of objects. In that case arguments passed may be more than 8 bytes in size, complicating generic code. We also still have the issue of differentiating between owned values and references in generic code, as the strategy for dropping them is different. This is something we can't encode in the object itself, because it's a property of the reference, not the value. We could keep using the current pointer tagging approach, but I'm not really a fan of it.
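To make the single-bit tagging scheme concrete, here's a minimal sketch in C. The names are made up for illustration, and it relies on heap pointers being at least 8-byte aligned (so their low bit is always zero) and on signed right shifts being arithmetic, which all mainstream compilers provide:

```c
#include <stdbool.h>
#include <stdint.h>

#define TAG_BIT ((uintptr_t)1)

// A value is a tagged integer if its low bit is set; heap pointers
// always have a zero low bit due to alignment.
static inline bool is_tagged_int(uintptr_t value) {
    return (value & TAG_BIT) != 0;
}

// Tag a 63-bit integer by shifting it left and setting the tag bit.
// Values needing the full 64 bits are boxed on the heap instead.
static inline uintptr_t tag_int(int64_t value) {
    return ((uintptr_t)value << 1) | TAG_BIT;
}

// Untag by shifting right; casting to int64_t first gives us an
// arithmetic shift, restoring the sign bit.
static inline int64_t untag_int(uintptr_t value) {
    return (int64_t)value >> 1;
}
```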
Perhaps instead of focusing on stack allocating objects, we could focus on adding more special-purpose allocators.

This would essentially be the same as this:

I'm not sure yet if/when we'd support this though, and what the implications are for our move semantics/ownership model.
The whole reason for going through this trouble is basically twofold. One, I don't want compile times to blow up for large projects due to monomorphising a lot of types. Two, I don't want users to argue over whether they should use generic methods (and thus pay the monomorphisation price) or use dynamic dispatch; they should just translate to pretty much the same code. Even if we do monomorphise, for dynamic dispatch code we'd still need to know type information when dropping values (e.g. how much memory is getting dropped in case we're using a special-purpose allocator, and perhaps more).
Probably the easiest option for the time being is to stick with the current approach of tagging plus a uniform representation. For Float we could always decide to unbox it if we know for a fact it's never used in a context where boxing is necessary. But again, this is probably best left for some time in the future.
As for the backend: digging through the Cranelift documentation I can't find any information about linking static/shared libraries using it. I suspect it doesn't handle this for you in any way, instead requiring you to use some random linker provided by the system. QBE (https://c9x.me/compile/) is another option, but it's not clear to me how popular/useful it is. I'm also not sure how well maintained it is, and if that will remain the case 5-10 years from now. C seems like a more reasonable backend, though we'd end up having to deal with everything that makes C a terrible language (e.g. we'd have to be very careful to not depend on undefined behaviour).

It seems that with Cranelift you still need to manually link things together. For example, https://github.com/Y-Nak/tiger-cranelift/ uses

In terms of dependencies: Cranelift pulls in a total of 15 crates, and just building a project that imports these takes 23 seconds for a debug build; and that's without actually using any of the crates.

Using Cranelift may pose another challenge: when binding against C, globals such as

One more: switching thread stacks may require platform specific shenanigans. If we use
If we compile to C we don't actually need a different way of managing the stack in the VM, as we'd defer this to C. Instead we'd need the necessary fields to store the data of setjmp() (i.e. a jmp_buf) in each process.
In pseudo code you'd end up with something like this:
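Roughly along these lines, as a sketch with hypothetical names (inko_context_t, resume); note that longjmp() into a suspended stack only works if each process keeps its own stack alive, which is outside what standard C guarantees:

```c
#include <setjmp.h>

// Hypothetical per-process context: one jump target for the
// scheduler, one for the point where the process last yielded.
// Starting a process for the first time (on its own stack) is left
// out; this only shows suspend/resume.
typedef struct {
    jmp_buf scheduler; // where yield() jumps back to
    jmp_buf process;   // where resume() jumps back to
} inko_context_t;

void resume(inko_context_t *ctx) {
    if (setjmp(ctx->scheduler) == 0) {
        // Save the scheduler's position, then jump back into the
        // process where it previously yielded.
        longjmp(ctx->process, 1);
    }

    // setjmp() returned non-zero: the process yielded, and control
    // is back in the scheduler loop.
}
```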
A process wanting to yield would then do something like this:
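Again a hypothetical sketch, mirroring the resume() half above:

```c
void yield(inko_context_t *ctx) {
    if (setjmp(ctx->process) == 0) {
        // Remember where we are, then hand control back to the
        // scheduler.
        longjmp(ctx->scheduler, 1);
    }

    // setjmp() returned non-zero: the scheduler resumed this
    // process, so we simply continue from here.
}
```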
One requirement of this is that we never longjmp/setjmp in the middle of a Rust function, as Rust doesn't work well with goto-like constructs. This means that any runtime method that wants to reschedule a process should instead return some sort of status value, and the generated C code would then respond accordingly.

GCC and clang apparently optimise setjmp: https://github.com/bytecodealliance/wasmtime/blob/642102e69971862c5880d9d1efeec15296428e73/crates/runtime/src/helpers.c#L26
An issue with C is that AFAIK there's no way to adjust stack sizes at runtime. The default of 8 MB on Linux is way too much, but fixing it to e.g. 1 KB isn't enough for everything. Using Cranelift would probably make this part easier, because we control everything. A hybrid approach is also an option: we use Cranelift for code generation, but bind to libc for the use of
https://graphitemaster.github.io/fibers/ has interesting details on the matter.

https://brennan.io/2020/05/24/userspace-cooperative-multitasking/ is another resource worth reading.

Another issue with compiling to C: if we compile to C we won't know exactly how much stack space a function needs. There are probably hacks to achieve this, but I doubt they would be portable and/or reliable.
Another option is that we simply allocate a fixed size stack and never grow it. Assuming the OS is smart enough to only physically allocate memory when used, it should be totally fine to allocate e.g. 8 KB stacks of virtual memory while simultaneously having hundreds of thousands of processes. And when stacks do need to grow, there's no way around an increased stack size anyway; personally I'd rather have my program crash than silently consume more and more memory.

As an example: 100 000 processes using a 4 KB stack would need 390 MB of virtual memory. This is nothing considering many programs can easily use hundreds of megabytes of virtual memory without issue. So perhaps this should be the starting point: a 4 KB stack that doesn't grow, at least for now. Of course we'd have to make sure the code we generate fits within those limits (e.g. we can't allocate large arrays on the stack), but since most objects go on the heap this should be fine.
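As a sketch of this idea, allocating such a fixed-size stack with a guard page on a Unix system could look something like this (names and sizes are illustrative):

```c
#include <stddef.h>
#include <sys/mman.h>

#define STACK_SIZE (4 * 1024) // fixed 4 KB stack, never grown
#define PAGE_SIZE  (4 * 1024)

// Allocate a fixed-size stack backed by virtual memory. Pages are
// only backed by physical memory once touched, so hundreds of
// thousands of these stay cheap. A PROT_NONE guard page below the
// stack turns overflow into a crash instead of silent corruption.
void *allocate_stack(void) {
    size_t total = STACK_SIZE + PAGE_SIZE;
    void *mem = mmap(NULL, total, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (mem == MAP_FAILED) {
        return NULL;
    }

    // The lowest page is the guard page (stacks grow downwards).
    if (mprotect(mem, PAGE_SIZE, PROT_NONE) != 0) {
        munmap(mem, total);
        return NULL;
    }

    // Usable stack memory starts above the guard page.
    return (char *)mem + PAGE_SIZE;
}
```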
Not having to worry about resizing the stack does significantly simplify our setup, as "all" we'd need to worry about is:
For recursive algorithms one would have three options:
One improvement to consider is reusing stack memory: when allocating a process we first see if any stacks are available for reuse, only allocating a new one when needed. This way frequently spawning and terminating processes won't need as many allocations.
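A minimal sketch of such a stack pool, reusing the hypothetical allocate_stack() helper from the earlier sketch; a real scheduler would need locking or per-thread pools:

```c
// Intrusive free list: the list node is stored in the (now unused)
// stack memory itself, so the pool needs no extra allocations.
typedef struct stack_node {
    struct stack_node *next;
} stack_node_t;

static stack_node_t *free_stacks = NULL;

void *acquire_stack(void) {
    if (free_stacks != NULL) {
        // Reuse a stack from a previously terminated process.
        void *stack = free_stacks;
        free_stacks = free_stacks->next;
        return stack;
    }

    return allocate_stack();
}

void release_stack(void *stack) {
    stack_node_t *node = stack;
    node->next = free_stacks;
    free_stacks = node;
}
```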
Another question to answer is what C types we'd use to represent traits. C has no notion of traits/interfaces, and they don't exist at runtime in the current VM either. We'd likely need to type traits as just a wrapper struct:

```c
struct InkoTrait {
    void *pointer; // Pointer to the value that is typed as a trait
};
```

The purpose of this type would mostly be to make it harder to generate code that accidentally passes a regular pointer where a trait is expected.
Regarding stack memory, the following resources are worth reading:

It discusses an approach where the native stack isn't replaced with a task-local stack. Instead tasks use the native stack, and upon yielding the stack's data is copied into a separate buffer. When the task is resumed some time later, the buffer is used to overwrite the stack. This would probably be easier to implement, and likely plays nicer with C tooling and compilers, as the stack base pointer never changes; rather it's the contents that change. Of course you still need to maintain that buffer. Probably the best way would be to store an optional pointer to such a buffer in a process.

One important bit is to make sure the buffer is the size of the stack as it's used up until the point of yielding, not the entire stack space available. That is, if the maximum stack size is 8 MB but you're only using 1 KB at the point of yielding, the buffer should only be 1 KB. The cost is essentially a memcpy on every yield and resume.

Another challenge is making sure those buffers are reused, instead of being allocated and dropped every time. This means rounding them up to certain sizes so it's easier to reuse them.
For the above we'd probably need to do something like this: when we pick up a message we record the current stack pointer's address. When we need to copy into a stack buffer, we copy everything from that address until the current stack pointer. This way we skip whatever came before (e.g. the few function calls needed to set up the scheduler).
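A rough sketch of this bookkeeping, with hypothetical names; error handling and buffer reuse are omitted, and in practice save_stack() would have to be careful not to clobber its own frame, which is exactly the kind of platform-specific fiddling mentioned above:

```c
#include <stdlib.h>
#include <string.h>

// Hypothetical per-process state for the copy-on-yield approach.
typedef struct {
    char  *stack_base;  // stack pointer recorded at message pickup
    char  *buffer;      // heap copy of the used region while suspended
    size_t buffer_size;
} inko_process_t;

// On yield: copy only the region used since the process started
// handling its message. Stacks grow downwards, so the used region
// runs from the current stack pointer up to the recorded base.
void save_stack(inko_process_t *proc, char *current_sp) {
    size_t used = (size_t)(proc->stack_base - current_sp);

    proc->buffer = malloc(used);
    proc->buffer_size = used;
    memcpy(proc->buffer, current_sp, used);
}

// On resume: write the saved bytes back to the same addresses, then
// release the buffer (a real implementation would recycle it).
void restore_stack(inko_process_t *proc) {
    memcpy(proc->stack_base - proc->buffer_size, proc->buffer,
           proc->buffer_size);
    free(proc->buffer);
    proc->buffer = NULL;
}
```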
The trick here seems as follows:

I'm not sure how/if this is portable though. There's also this caveat from the documentation, which has me wondering about how reliable/safe this really is:
Python greenlets seem to do the same thing: https://stackoverflow.com/questions/3349048/how-do-greenlets-work. There seem to be hints that this approach indeed is risky: python-greenlet/greenlet#66

Some details about how project Loom for the JVM does something similar: https://discuss.kotlinlang.org/t/kotlin-coroutines-and-upcoming-java-loom/15819/20

https://github.com/rust-lang/stacker/tree/master/psm might be worth using for fiddling with the stack.
https://www.erlang.org/blog/the-road-to-the-jit/ contains an interesting history about Erlang's road to a JIT. In particular they mention how compiling to C turned out to be problematic, probably for the same reasons as I've outlined above.

At this point I'm starting to wonder if we should instead just use Cranelift. It's more work and we won't benefit from everything that C compilers have to offer, but we also have total control. We would still need one or two C functions to swap stacks as Cranelift doesn't allow inline assembly, but that should be doable. I think the big challenge here would be Windows support: Unix systems generally let you do whatever you want, provided you bang your head against the wall often enough. Windows on the other hand seems to be more picky about what it does and doesn't allow.
I started playing around with https://github.com/Y-Nak/tiger-cranelift a bit to see how Cranelift works. One thing this reminded me of is the following: if we provide a Rust-based runtime library, we also end up linking Rust's standard library into that. The current VM and a bunch of our dependencies aren't compatible with no_std.

In addition, tiger-cranelift uses a rather old version of Cranelift, and it seems the API has changed a fair bit since.
Some recent projects using Cranelift that may be worth looking at for inspiration:
If we compile to C, using

If we compile to C we should require a two's complement integer implementation, as this makes dealing with signed integers easier. https://stackoverflow.com/questions/64842669/how-to-test-if-a-target-has-twos-complement-integers-with-the-c-preprocessor outlines an approach for detecting this. There are basically no mainstream C compilers that don't use two's complement anyway.
Producing a compile-time error using the preprocessor is done as outlined in https://stackoverflow.com/questions/2221517/how-do-i-generate-an-error-or-warning-in-the-c-preprocessor.
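Combining the two, a sketch of what such a check could look like, following the approach from the linked answers:

```c
/* Two's complement is the only representation where the low bits of
 * -1 are all ones, so (-1 & 3) == 3 distinguishes it from ones'
 * complement (where the result is 2) and sign-magnitude (1). */
#if (-1 & 3) != 3
    #error "Inko requires a two's complement integer representation"
#endif
```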
Checking which C compiler is used can be done like this. |
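Presumably something along these lines, using the compilers' predefined macros:

```c
#if defined(__clang__)
    /* clang; check before __GNUC__, as clang defines that too. */
#elif defined(__GNUC__)
    /* GCC */
#elif defined(_MSC_VER)
    /* MSVC */
#else
    #error "unsupported C compiler"
#endif
```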
How to check for overflows in the remainder operator, which is needed to implement the checked modulo operator used by Inko: https://stackoverflow.com/questions/19285163/does-modulus-overflow
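A sketch of what that could look like, following the linked answer: for 64-bit signed integers the remainder only overflows for INT64_MIN % -1 (because the corresponding division overflows), so that case can be handled explicitly, alongside division by zero:

```c
#include <stdbool.h>
#include <stdint.h>

// Returns true on error (division by zero); *result receives the
// remainder otherwise. INT64_MIN % -1 would overflow, but its
// mathematical remainder is 0, so we special-case it.
bool int64_rem(int64_t lhs, int64_t rhs, int64_t *result) {
    if (rhs == 0) {
        return true;
    }

    if (lhs == INT64_MIN && rhs == -1) {
        *result = 0;
        return false;
    }

    *result = lhs % rhs;
    return false;
}
```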
TinyC seems to not actually work, even for simple C files:
In addition, something as simple as checked integer addition is already a bit tricky to implement, requiring compiler-specific code to get things working:

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(_MSC_VER)
#include <windows.h> // for S_OK
#include <intsafe.h> // for LongLongAdd
#endif

// Fallback provided by the Rust runtime (see below).
bool inko_int64_add(int64_t lhs, int64_t rhs, int64_t *result);

bool int64_add(int64_t lhs, int64_t rhs, int64_t *result) {
#if defined(__clang__) || defined(__GNUC__)
    return __builtin_saddll_overflow(lhs, rhs, (long long *)result);
#elif defined(_MSC_VER)
    return LongLongAdd(lhs, rhs, (long long *)result) != S_OK;
#else
    return inko_int64_add(lhs, rhs, result);
#endif
}
```

Here `inko_int64_add` is implemented in Rust as follows:

```rust
#[no_mangle]
pub unsafe extern "system" fn inko_int64_add(
    lhs: i64,
    rhs: i64,
    result: *mut i64,
) -> bool {
    if let Some(value) = lhs.checked_add(rhs) {
        *result = value;
        false
    } else {
        true
    }
}
```

As far as I know Cranelift offers no builtins for this at all, so we'd have to implement it all from scratch. This is surprisingly tricky, and the obvious solutions are often wrong. For example, checking if
Summary
Inko currently compiles down to bytecode for its own VM. This means you write
your application code once, then run it on multiple platforms, without the need
for recompiling your program. Instead of doing this we should investigate
compiling Inko source code straight to Rust, removing the need for an
interpreter.
Note that this is just an idea/shower thought, it's not a serious plan at this
stage.
Motivation
Interpreters inevitably end up being a bottleneck. The usual way around this is
to either implement a JIT, or write your code as a native extension of some
sort. Inko doesn't have (nor will it have) support for native extensions, and
writing a JIT is an enormous undertaking. Even with a JIT present, chances are
you still won't get the performance you need. Having a VM also means having to
write your own debugger and related tooling.
Compiling to native code somehow would let you work around this. And unlike a
VM+JIT combination, you only need to write one compiler (= your AOT compiler).
Using LLVM isn't an option, as its API is highly unstable, and different
platforms ship different LLVM versions (= every Linux distribution). The
result is that many users of LLVM end up vendoring it, only updating every now
and then. This in turn makes distribution of the code more difficult, as you may
end up vendoring a version not supported by a platform you wish to support.
A way around that is to compile to an existing language, such as C, Zig, or
Rust. C is the usual suspect in this regard, and it makes sense: the language is
stable, supports pretty much every platform out there, and is lightweight. But C
is also full of footguns, with even signed integer addition being able to
introduce undefined behaviour. I don't feel comfortable compiling to such a
language.
Compiling to Rust would be an interesting approach. First, I'm much more
familiar and comfortable with Rust. There are also fewer footguns, we can still
take advantage of a low/no overhead FFI (Inko's FFI would translate to Rust's
FFI), get debugger support, etc. Porting the VM to this might also be easier:
we'd basically introduce a "libinko" crate that just provides the runtime, and
the generated code would link to said library. If we compile to C, we'd first
have to write all that in C, and I'm pretty sure we'd introduce 2000 CVEs in the
process.
The obvious downside is the compile times: Rust's compiler is slooooow, and
adding Inko's compiler on top only makes that worse. There may be ways around
that though: first, we could disable the standard library, depending on how much
we need it in "libinko". Second, we can further reduce the number of
dependencies, and maybe also reduce the amount of generic code to compile
(especially in combination with disabling the standard library).
Compiling to Rust offers another benefit: we can let rustc take care of the
heavy lifting of optimisations, monomorphisation (assuming we don't continue
boxing values, which we probably will for a while at least), etc. For example,
we wouldn't need to inline code, instead rustc would do that for us. Of course
we can (and probably should) write certain optimisations ourselves, as we may be
able to do them faster than rustc can, but it's not strictly necessary.
Implementation
At this stage, I have no idea. Here are just some ideas that come to mind:
Reducing dependencies
Probably a first step is to reduce the number of
dependencies we have (fortunately we don't have that many). For example, we
probably can get rid of the crossbeam crates if we just use regular synchronised
data structures. That alone saves us three direct dependencies.
The libffi crate would also no longer be needed, as we'd use Rust's FFI instead
(requiring special syntax in Inko for this). Initially we probably should just
keep it though, as I'm not sure what the FFI syntax would look like (especially
since I don't want to introduce C pointer/int types to Inko). libloading can
also be removed along with this.
The dirs-next crate we could probably just replace with some custom code, or
keep it; I don't think it really contributes to compile times anyway.
Suspending processes and stacks
Processes currently use their own stack, which is basically just a Vec per
function call. This isn't great for performance, but it makes
suspending/resuming processes trivial as we don't need to fiddle with assembly.
Initially we should just keep this setup, but over time we'd need a way such
that variables in Inko translate directly to variables in Rust. That requires a
way of saving/restoring them.
Types
Inko classes would translate to structs with methods, as in Rust. Traits wouldn't
exist in the Rust layer, instead methods are just duplicated whenever traits are
implemented. Generics would be erased initially, but at some point we could keep
them if we decide we want to monomorphise them. This will contribute to compile
times, so I prefer investigating different strategies.
Drawbacks
It's a really big undertaking, and will slow down compile times one way or
another. While working on this, other work would probably need to wait. It's also not
clear if this would actually matter much in the coming years, as Inko is
currently not used for anything that would benefit from this change.