
Consider compiling down to machine code instead of VM bytecode #317

Closed
yorickpeterse opened this issue May 20, 2022 · 41 comments · Fixed by #508
Labels: compiler (Changes related to the compiler), feature (New things to add to Inko, such as a new standard library module)

Comments

@yorickpeterse (Collaborator) commented:

Summary

Inko currently compiles down to bytecode for its own VM. This means you write
your application code once, then run it on multiple platforms, without the need
for recompiling your program. Instead of doing this we should investigate
compiling Inko source code straight to Rust, removing the need for an
interpreter.

Note that this is just an idea/shower thought, it's not a serious plan at this
stage.

Motivation

Interpreters inevitably end up being a bottleneck. The usual way around this is
to either implement a JIT, or write your code as a native extension of some
sort. Inko doesn't have (nor will have) support for native extensions, and
writing a JIT is an enormous undertaking. Even with a JIT present, chances are
you still won't get the performance you need. Having a VM also means having to
write your own debugger and related tooling.

Compiling to native code somehow would let you work around this. And unlike a
VM+JIT combination, you only need to write one compiler (= your AOT compiler).
Using LLVM isn't an option, as its API is highly unstable and different
platforms ship different LLVM versions (virtually every Linux distribution
does, for example). The result is that many users of LLVM end up vendoring it,
only updating every now and then. This in turn makes distributing the code more
difficult, as you may end up vendoring a version not supported by a platform
you wish to support.

A way around that is to compile to an existing language, such as C, Zig, or
Rust. C is the usual suspect in this regard, and it makes sense: the language is
stable, supports pretty much every platform out there, and is lightweight. But
C is also full of footguns: even signed integer addition can introduce
undefined behaviour (on overflow). I don't feel comfortable compiling to such a
language.

Compiling to Rust would be an interesting approach. First, I'm much more
familiar and comfortable with Rust. There are also fewer footguns, we can still
take advantage of a low/no overhead FFI (Inko's FFI would translate to Rust's
FFI), get debugger support, etc. Porting the VM to this might also be easier:
we'd basically introduce a "libinko" crate that just provides the runtime, and
the generated code would link to said library. If we compile to C, we'd first
have to write all that in C, and I'm pretty sure we'd introduce 2000 CVEs in the
process.

The obvious downside is the compile times: Rust's compiler is slooooow, and
adding Inko's compiler on top only makes that worse. There may be ways around
that though: first, we could disable the standard library, depending on how much
we need it in "libinko". Second, we can further reduce the number of
dependencies, and maybe also reduce the amount of generic code to compile
(especially in combination with disabling the standard library).

Compiling to Rust offers another benefit: we can let rustc take care of the
heavy lifting of optimisations, monomorphisation (assuming we don't continue
boxing values, which we probably will for a while at least), etc. For example,
we wouldn't need to inline code, instead rustc would do that for us. Of course
we can (and probably should) write certain optimisations ourselves, as we may be
able to do them faster than rustc can, but it's not strictly necessary.

Implementation

At this stage, I have no idea. Here are just some ideas that come to mind:

Reducing dependencies

A first step is probably to reduce the number of dependencies we have
(fortunately we don't have that many). For example, we can probably get rid of
the crossbeam crates if we just use regular synchronised data structures. That
alone saves us three direct dependencies.

The libffi crate would also no longer be needed, as we'd use Rust's FFI instead
(requiring special syntax in Inko for this). Initially we probably should just
keep it though, as I'm not sure what the FFI syntax would look like (especially
since I don't want to introduce C pointer/int types to Inko). libloading can
also be removed along with this.

dirs-next we could probably replace with some custom code, or just keep; I
don't think it really contributes to compile times anyway.

Suspending processes and stacks

Processes currently use their own stack, which is basically just a Vec per
function call. This isn't great for performance, but it makes
suspending/resuming processes trivial as we don't need to fiddle with assembly.
Initially we should just keep this setup, but over time we'd need a way such
that variables in Inko translate directly to variables in Rust. That requires a
way of saving/restoring them.

Types

Inko classes would translate to Rust structs and methods. Traits wouldn't
exist in the Rust layer; instead, methods are duplicated whenever traits are
implemented. Generics would be erased initially, but at some point we could
keep them if we decide we want to monomorphise them. Monomorphisation will
contribute to compile times, so I prefer investigating different strategies
first.

Drawbacks

It's a really big undertaking, and will slow down compile times one way or
another. While working on this, probably other work needs to wait. It's also not
clear if this would actually matter much in the coming years, as Inko is
currently not used for anything that would benefit from this change.

@yorickpeterse commented:

Another option is to use Cranelift (https://github.com/bytecodealliance/wasmtime/tree/main/cranelift), but:

  1. I'm not sure how suitable it is for ahead of time compilation
  2. I'm not convinced the companies backing it would still exist 10 years from now
  3. We'd have to implement the runtime in Inko itself I think, which would require drastic changes to the language (e.g. we'd need support for raw pointers, globals, etc)

@yorickpeterse commented:

Another option is a hybrid: the runtime is written in Rust, and we compile this into a static library that exposes a set of C functions/types. Inko code in turn is compiled into C code, and the resulting C program statically links to the Rust based runtime.

Using this approach we can benefit from the faster compile times of C, while still being able to write the runtime in a sane language. This does require that the compiler can still inline calls from the statically linked runtime library, otherwise it might get expensive.

@yorickpeterse commented:

I experimented a bit with compiling the VM to a static library, then using that in C (optionally with ThinLTO enabled to inline the functions). The summary of this experiment is as follows:

In theory it could work, with the caveat that cross-language LTO (using https://doc.rust-lang.org/rustc/linker-plugin-lto.html) is highly unreliable at the time of writing. I initially got it to work on my laptop, but after updating Rust to 1.62 I'm no longer able to get it to inline the Inko methods. On my desktop compiling dependencies segfaulted, until I ran cargo update (even though the old list of versions worked fine on my laptop). After that I got the same results: no inlining.

Calling a function for basically every instruction isn't great from a performance perspective, and it's even possible an interpreter would be faster than that (as its dispatch loop doesn't need a function call per instruction). Obviously we'd improve this over time, so it would get better.

One issue is suspending processes: in the VM we have full control over everything, so suspending/resuming processes is pretty simple. If we compile down to C (or something else for that matter), we need a mechanism to save/restore the entire call stack. C has setjmp/longjmp for this, but it's (like everything in C) easy to shoot yourself in the foot with these functions. There's no Rust equivalent of this, so I'm not sure how we'd approach suspending/resuming if we were to compile down to Rust.

In an ideal world threads would be much cheaper to use, and we could just rely on them instead of using our own scheduler. Sadly, I suspect it will be another ten years (if not more) before threads (and context switching) are cheap enough that you can spawn tens of thousands of them and not suffer as a result.

@yorickpeterse commented:

For generic methods and methods relying on dynamic dispatch (e.g. foo(value: ToString) { ... }) we'll need a different approach to what we have today, assuming we don't want everything to always be heap allocated. For example, take this code:

fn foo(value: ToString) { }

While we know that value is an owned value, we don't know its size. If it's an Int it would be 8 bytes, but if it's a String it would be 24 bytes. If we at some point allow stack allocating regular objects, the size could be anything. The size is important because the method would have to know how much stack space to reserve. This is also true for generic methods.

Relying on hidden arguments would likely prove difficult to implement, and I suspect we'd run into hard to debug bugs or cases where code passes the wrong data but it happens to work because the sizes match. An alternative is to do something like this:

For every type the compiler generates a type information object. This structure describes the size, alignment, etc, in addition to describing this for its generic arguments (if any). We would then stick this in the class, at the cost of having to duplicate the class for every generic type instance (i.e. Array[String] and Array[Int] would be two different classes). We could also stick it in the object, but we'd increase the header size by 8 bytes; unless we can somehow limit the address to fit in 32 bits, though this would limit us to 4 294 967 295 generic type instances (which might honestly be enough). When dropping a value we'd read this information to determine how to drop the value, how to release its memory, etc.

For allocating we'd have to read this from the arguments before reserving the stack space, and I'm not sure this would work out. Assuming for a moment we'd find a way, we'd still have to deal with values without object headers such as Int and Float. For Int we could sacrifice a bit and use that, but for Float I'm not sure what options we have. We could also take the heavy-handed approach of always giving objects a header, but this means Int takes up 16 bytes instead of 8.


The other obvious approach is to monomorphise code, at the cost of increased compile times and memory usage. We'd also still have the issue of Array being used in the VM (or the runtime when we compile to machine code), so we'd have to move that to being a standard library type entirely somehow.

@yorickpeterse commented:

By default all values would be heap allocated and passed around as pointers. For Int we could use pointer tagging as we do now, but only use a single bit instead of two (meaning we can represent up to 63 bits instead of 62 bits). 64 bits integers would still be heap allocated/boxed.

For Float we could box it, but (re)introduce a special-purpose bump allocator for values that are 24 bytes in size (16 bytes for the header, plus 8 for the value). We'd use this bump allocator for boxed Float and Int values, bypassing the cost of using the system allocator. The allocator would recycle pointers by sticking them in a free list, though care must be taken for cases such as thread A allocating a pointer P but thread B releasing it. In this setup Float is larger than necessary, but the cost of creating them is still only a few nanoseconds (5-10 should be realistic for such a special purpose allocator).

This solves our stack space problem: since values are always 8 bytes in size, we don't need to read any type information upon creating variables; only when dropping them. We'd still have to copy generic classes though, or stick the type information into the object header somehow.

In addition, this setup somewhat falls apart if we allow stack allocation of objects. In this case arguments passed may be more than 8 bytes in size, complicating generic code.

We also still have the issue of differentiating between owned values and references in generic code, as the strategy for dropping them is different. This is something we can't encode in the object itself, because it's a property of the reference and not the value itself. We could keep using the current pointer tagging approach, but I'm not really a fan of it.
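The single-bit Int tagging described above could be sketched as follows; this is illustrative rather than the actual runtime code, and the choice of bit 0 as the tag is an assumption:

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of single-bit pointer tagging for Int. Heap pointers are at least
   8-byte aligned, so bit 0 is free to use as the tag; this leaves 63 bits
   for the integer value. */
static bool is_tagged_int(void *ptr) {
    return ((uintptr_t)ptr & 1) == 1;
}

static void *tag_int(int64_t value) {
    return (void *)(((uintptr_t)value << 1) | 1);
}

static int64_t untag_int(void *ptr) {
    /* A signed right shift restores the sign bit; this is
       implementation-defined but arithmetic on mainstream compilers. */
    return (int64_t)(intptr_t)ptr >> 1;
}
```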

@yorickpeterse commented:

Perhaps instead of focusing on stack allocating objects, we could focus on adding more special-purpose allocators, and inline types. Special-purpose allocators would be used for small objects of certain sizes (e.g. one for 24 bytes, one for 32, etc). This is basically what most allocators already do, except we'd be able to implement the allocating side without locks, and without system calls, so in theory it could outperform jemalloc & friends.

An inline type is a type allocated directly into another, e.g. something like this:

```inko
class Thing {
  let @a: Int
  let @b: inline (Int, Int, Int)
}
```

This would essentially be the same as this:

```inko
class Thing {
  let @a: Int
  let @b0: Int
  let @b1: Int
  let @b2: Int
}
```

I'm not sure yet if/when we'd support this though, and what the implications are for our move semantics/ownership model.
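Expressed as a C struct, both definitions above share the same flattened layout, with the tuple's fields stored directly in the outer type and no pointer indirection:

```c
#include <stdint.h>

/* The flattened layout of the Thing example above, assuming Int is a
   64-bit integer. */
typedef struct {
    int64_t a;
    int64_t b0;
    int64_t b1;
    int64_t b2;
} Thing;
```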

@yorickpeterse commented:

The whole reason for going through this trouble is basically twofold:

One, I don't want compile times to blow up for large projects due to monomorphising a lot of types.

Two, I don't want users to argue over whether they should use generic methods (and thus pay the monomorphise price), or use dynamic dispatch; they should just translate to pretty much the same code. Even if we do monomorphise, for dynamic dispatch code we'd still need to know type information when dropping values (e.g. how much memory is getting dropped in case we're using a special-purpose allocator, and perhaps more).

@yorickpeterse commented:

Probably easiest for the time being is to stick with the current approach of tagging plus a uniform representation. For Float we could always decide to unbox it if we know for a fact it's never used in a context where boxing is necessary. But again, this is probably best left for some time in the future.

@yorickpeterse commented:

As for the backend: digging through the Cranelift documentation I can't find any information about linking static/shared libraries using it. I suspect it doesn't handle this for you in any way, instead requiring you to use some random linker provided by the system.

QBE (https://c9x.me/compile/) is another option, but it's not clear to me how popular/useful it is. I'm also not sure how well maintained it is, and if that will remain the case 5-10 years from now.

C seems like a more reasonable backend, though we'd end up having to deal with everything that makes C a terrible language (e.g. we'd have to be very careful to not depend on undefined behaviour).

@yorickpeterse commented:

It seems that with Cranelift you still need to manually link things together. For example, https://github.com/Y-Nak/tiger-cranelift/ uses cc to link its .o file with its runtime library.

@yorickpeterse commented:

In terms of dependencies: Cranelift pulls in a total of 15 crates, and just building a project that imports these takes 23 seconds for a debug build; and that's without actually using any of the crates.

@yorickpeterse commented:

Using Cranelift may pose another challenge: when binding against C, globals such as errno may be needed. The problem is that depending on the libc used, errno is either a global variable, or a macro that calls a function. This would require that we know what libc is used and link this symbol accordingly. If we instead compile to C this isn't necessary, as the C compiler does this for us.

@yorickpeterse commented:

One more: switching thread stacks may require platform specific shenanigans. If we use setjmp/longjmp that would be taken care of for us, at the cost of it potentially being a bit more expensive to use.

@yorickpeterse commented:

If we compile to C we don't actually need a different way of managing the stack in the VM, as we'd defer this to C. Instead we'd need the necessary fields to store the data of setjmp(). Specifically I think we'd need to do something like this:

  1. When a thread picks up a process, it uses setjmp() to save its current stack. This is written into process->thread_stack, which defaults to NULL and may store the stack of the thread that's running the process.
  2. We use longjmp to jump to the process' stack, stored in e.g. process->stack
  3. When the process wants to yield, it uses longjmp(process->thread_stack)
  4. The thread "wakes up", clears process->thread_stack and puts the process back in the queue

In pseudo code you'd end up with something like this:

```
if process = pop()
  if setjmp(process.thread_stack) == 0
    longjmp(process.stack)

  # Process yielded, put it back in the queue
  schedule(process)
```

A process wanting to yield would then do something like this:

```
work()

if setjmp(process.stack) == 0
  longjmp(process.thread_stack)

# We resumed
more_work()
```

@yorickpeterse commented:

One requirement of this is that we never longjmp/setjmp in the middle of a Rust function, as Rust doesn't work well with goto-like constructs. This means that any runtime method that wants to reschedule a process should instead return some sort of status value, and the generated C code would then respond accordingly.

@yorickpeterse commented:

An issue with C is that AFAIK there's no way to adjust stack sizes at runtime. The default of 8 MB on Linux is way too much, but fixing it to e.g. 1KB isn't enough for everything. Using Cranelift would probably make this part easier, because we control everything.

A hybrid approach is also an option: we use Cranelift for code generation, but bind to libc for the use of setjmp/longjmp. Over time we could move away from that in favour of platform specific assembly.

@yorickpeterse commented:

https://graphitemaster.github.io/fibers/ has interesting details on the matter.

@yorickpeterse commented:

https://brennan.io/2020/05/24/userspace-cooperative-multitasking/ is another resource worth reading.

@yorickpeterse commented:

Another issue with compiling to C: if we compile to C we won't know exactly how much stack space a function needs. There are probably hacks to achieve this, but I doubt they would be portable and/or reliable.

@yorickpeterse commented:

Another option is that we simply allocate a fixed size stack and never grow it. Assuming the OS is smart enough to only physically allocate memory when used, it should be totally fine to allocate e.g. 8KB stacks of virtual memory while simultaneously having hundreds of thousands of processes. And when they do grow there's no way around an increased stack size anyway; personally I'd rather have my program crash than have it silently consume more and more memory.

@yorickpeterse commented:

As an example: 100 000 processes using a 4KB stack would need 390 MB of virtual memory. This is nothing considering many programs can easily use hundreds of megabytes of virtual memory without issue.

So perhaps this should be the starting point: a 4 KB stack that doesn't grow, at least for now. Of course we'd have to make sure the code we generate fits within those limits (e.g. we can't allocate large arrays on the stack), but since most objects go on the heap this should be fine.

@yorickpeterse commented:

Not having to worry about resizing the stack does significantly simplify our setup, as "all" we'd need to worry about is:

  1. Creating our own stack (probably using mmap so we can set the right flags)
  2. Swapping the stack with some assembly
  3. Jumping between processes using setjmp / longjmp

@yorickpeterse commented:

For recursive algorithms one would have three options:

  1. If recursion isn't too intense, just use recursion and you'll probably be fine
  2. Turn the recursive algorithm into an iterative one
  3. We (re)implement tail-call elimination at some point, and you use that. Since C has goto this would mostly be a matter of re-initialising the argument locals, then a goto to the start of the function. Since Inko doesn't allow reading of uninitialised values we wouldn't have to reinitialise other C generated variables
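Option 3 in C boils down to the following pattern (sum_to is just an illustrative function):

```c
#include <stdint.h>

/* Tail-call elimination via goto, as described in option 3: re-initialise
   the argument locals, then jump back to the start of the function. */
static int64_t sum_to(int64_t n, int64_t acc) {
start:
    if (n == 0) return acc;

    /* This replaces the tail call sum_to(n - 1, acc + n): */
    acc += n;
    n -= 1;
    goto start;
}
```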

@yorickpeterse commented:

One improvement to consider is reusing stack memory: when allocating a process we first see if any stacks are available for reuse, only allocating a new one when needed. This way frequently spawning and terminating processes won't need as many allocations.

@yorickpeterse commented:

Another question to answer is what C types we'd use to represent traits. C has no notion of traits/interfaces, and they don't exist at runtime in the current VM either. We'd likely need to type traits as just void* or some sort of newtype, e.g.:

```c
struct InkoTrait {
  void* pointer; // Pointer to the value that is typed as a trait
};
```

The purpose of this type would mostly be to make it harder to generate code accidentally passing a regular void*, though I'm not sure it's actually beneficial.

@yorickpeterse commented:

Regarding stack memory, the following resources are worth reading:

It discusses an approach where the native stack isn't replaced with a task-local stack. Instead tasks use the native stack, and upon yielding the stack's data is copied into a separate buffer. When the task is resumed some time later, the buffer is used to overwrite the stack.

This would probably be easier to implement, and likely plays nicer with C tooling and compilers, as the stack's base pointer never changes; rather, it's the contents that change.

Of course you still need to maintain that buffer. Probably the best way would be to store an optional pointer to such a buffer in a process. One important bit is to make sure the buffer is the size of the stack as it's used up until the point of yielding, not the entire stack space available. That is, if the maximum stack size is 8 MB but you're only using 1 KB at the point of yielding, the buffer should only be 1 KB.

The cost is essentially a memcpy into the buffer. This makes yielding more expensive, especially for large stack sizes. Swapping the stack entirely on the other hand is much faster, as it's mostly a matter of adjusting a few pointers.

Another challenge is making sure those buffers are reused, instead of being allocated and dropped every time. This means rounding them up to certain sizes so it's easier to reuse them.

@yorickpeterse commented:

For the above we'd probably need to do something like this: when we pick up a message we record the current stack pointer's address. When we need to copy into a stack buffer, we copy everything from that address until the current stack pointer. This way we skip whatever came before (e.g. the few function calls needed to set up the scheduler).

@yorickpeterse commented:

https://github.com/nikitadanilov/usched/blob/86d152e85aabc032f95583af924703f09d2daea9/usched.c#L288

The trick here seems as follows:

  • Save the current stack using setjmp
  • Use the resulting jmp_buf to copy the stack data into the buffer
  • Restore that later

I'm not sure how/if this is portable though. There's also this caveat from the documentation, which has me wondering about how reliable/safe this really is:

```
 *  - usched.c must be compiled without optimisations and with
 *    -fno-stack-protector option (gcc);
```

@yorickpeterse commented:

Python greenlets seem to do the same thing: https://stackoverflow.com/questions/3349048/how-do-greenlets-work. There seem to be hints that this approach indeed is risky: python-greenlet/greenlet#66

@yorickpeterse commented:

Some details about how project Loom for the JVM does something similar: https://discuss.kotlinlang.org/t/kotlin-coroutines-and-upcoming-java-loom/15819/20

@yorickpeterse commented:

https://github.com/rust-lang/stacker/tree/master/psm might be worth using for fiddling with the stack.

@yorickpeterse commented:

https://www.erlang.org/blog/the-road-to-the-jit/ contains an interesting history about Erlang's road to a JIT. In particular they mention how compiling to C turned out to be problematic, probably for the same reasons as I've outlined above.

At this point I'm starting to wonder if we should instead just use Cranelift. It's more work and we won't benefit from everything that C compilers have to offer, but we also have total control. We would still need one or two C functions to swap stacks as Cranelift doesn't allow inline assembly, but that should be doable.

I think the big challenge here would be Windows support: Unix systems generally let you do whatever you want, provided you bang your head against the wall often enough. Windows on the other hand seems to be more picky about what it does and doesn't allow.

@yorickpeterse commented:

I started playing around with https://github.com/Y-Nak/tiger-cranelift a bit to see how Cranelift works. One thing this reminded me of is the following: if we provide a Rust-based runtime library, we also end up linking Rust's standard library into that. The current VM and a bunch of our dependencies aren't compatible with no_std. Using LTO seems to have no impact on the size of the resulting binary.

In addition, tiger-cranelift uses a rather old version of Cranelift, and it seems the API has changed a fair bit since.

@yorickpeterse commented:

Some recent projects using Cranelift that may be worth looking at for inspiration:

@yorickpeterse commented:

If we compile to C, using intsafe.h on Windows would give us access to functions similar to __builtin_saddll_overflow and the like (e.g. LongLongAdd).

@yorickpeterse commented:

If we compile to C we should require a two's complement integer implementation, as this makes dealing with signed integers easier. https://stackoverflow.com/questions/64842669/how-to-test-if-a-target-has-twos-complement-integers-with-the-c-preprocessor outlines an approach for detecting this. There are basically no mainstream C compilers that don't use two's complement anyway.

@yorickpeterse commented:

Producing a compile-time error using the preprocessor is done as outlined in https://stackoverflow.com/questions/2221517/how-do-i-generate-an-error-or-warning-in-the-c-preprocessor.

@yorickpeterse commented:

Checking which C compiler is used can be done using the compilers' predefined macros.

@yorickpeterse commented:

How to check for overflows in the remainder operator, which is needed to implement the checked modulo operator used by Inko: https://stackoverflow.com/questions/19285163/does-modulus-overflow
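Following the linked answer, a sketch of a checked remainder: the only problematic inputs are a zero divisor and INT64_MIN % -1, whose intermediate division overflows:

```c
#include <stdbool.h>
#include <stdint.h>

/* Checked i64 remainder; returns true on error (division by zero or
   overflow). */
static bool int64_rem(int64_t lhs, int64_t rhs, int64_t *result) {
    if (rhs == 0) return true;                      /* division by zero */
    if (lhs == INT64_MIN && rhs == -1) return true; /* lhs / rhs overflows */

    *result = lhs % rhs;
    return false;
}
```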

@yorickpeterse commented:

TinyC seems to not actually work, even for simple C files:

```
tcc -o /tmp/test /tmp/test.c
/usr/lib64/crt1.o: error: Invalid relocation entry [15] '.rela.debug_info' @ 00000504
tcc: error: file 'crt1.o' not found
```

In addition, something as simple as checked integer addition is already a bit tricky to implement, requiring compiler-specific code to get things working:

```c
bool int64_add(int64_t lhs, int64_t rhs, int64_t *result) {
#if defined(__clang__) || defined(__GNUC__)
  return __builtin_saddll_overflow(lhs, rhs, (long long *)result);
#elif defined(_MSC_VER)
  return LongLongAdd(lhs, rhs, (long long *)result) != S_OK;
#else
  return inko_int64_add(lhs, rhs, result);
#endif
}
```

Here inko_int64_add is a Rust fallback defined as follows:

```rust
#[no_mangle]
pub unsafe extern "system" fn inko_int64_add(
    lhs: i64,
    rhs: i64,
    result: *mut i64,
) -> bool {
    if let Some(value) = lhs.checked_add(rhs) {
        *result = value;
        false
    } else {
        true
    }
}
```

As far as I know Cranelift offers no builtins for this at all, so we'd have to implement it all from scratch. This is surprisingly tricky, and the obvious solutions are often wrong. For example, checking if a + b overflows can't be done by checking if the result is smaller than a, as this doesn't account for e.g. 10 + -20. Division is pretty easy, but multiplication is pretty difficult if I remember correctly.
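For completeness, a portable sketch of checked addition without builtins (roughly what a Cranelift backend would have to emit): do the addition in unsigned arithmetic, where wraparound is well-defined, then derive the overflow flag from the operands' signs. This correctly handles cases like 10 + -20:

```c
#include <stdbool.h>
#include <stdint.h>

/* Portable checked addition: overflow happened iff both operands have the
   same sign and the result's sign differs from it. */
static bool int64_add_checked(int64_t lhs, int64_t rhs, int64_t *result) {
    uint64_t sum = (uint64_t)lhs + (uint64_t)rhs;

    /* Converting an out-of-range uint64_t back to int64_t is
       implementation-defined before C23, but wraps on mainstream targets. */
    *result = (int64_t)sum;
    return ((lhs ^ *result) & (rhs ^ *result)) < 0;
}
```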

@yorickpeterse yorickpeterse added feature New things to add to Inko, such as a new standard library module and removed type::feature labels Mar 15, 2023
@yorickpeterse yorickpeterse removed this from the 0.2.5 milestone Mar 15, 2023