-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add experimental support for illumos #9535
Conversation
if #[cfg(target_os = "illumos")] { | ||
// On illumos, by default, mmap reserves what it calls "swap space" ahead of time, so that | ||
// memory accesses are guaranteed not to fail once mmap succeeds. NORESERVE is for cases | ||
// where that memory is never meant to be accessed -- e.g. memory that's used as guard | ||
// pages. | ||
pub(super) const MMAP_NORESERVE_FLAG: rustix::mm::MapFlags = | ||
rustix::mm::MapFlags::NORESERVE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having spent some time digging into this, I believe that we should also pass in NORESERVE on Linux here as well. On Linux this is a bit hidden with overcommit, but e.g. Chromium passes in NORESERVE: https://chromium.googlesource.com/v8/v8/+/chromium/2840/src/base/platform/platform-linux.cc#234
I'm happy to update this PR with that change if it sounds good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds reasonable to me!
re slowness, for future reference, with 997Hz sampling: user stacks: https://gist.github.com/sunshowers/b69b7bd2e671d9c23355d5e952636c5e kernel stacks: https://gist.github.com/sunshowers/fa822f161e54d57a8103f6736656fbe8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great, thank you for your time in doing this!
One high-level question for you: how easy is it to compile for illumos from Linux? For example we'd ideally couple this with a check in our CI that illumos continues to build (e.g. at least via cargo check
) but our CI only runs Linux/macOS/Windows right now so we'd have to cross-compile. If it's easy to cross-compile I think it'd be quite reasonable to have a check added to CI that this passes. For example this is our FreeBSD check (and reading that again we should probably apply the same trick you've done here with handling vtune/ittapi by default, but that's ok to leave to a future PR)
takes 0.07 seconds on Linux, but over 5 seconds on illumos
Whoa! It looks like the pooling allocator is the part that's slow here and that, by default, has a large number of virtual memory mappings associated with it. For example it'll allocate terabytes of virtual memory and then within that giant chunk it'll slice up roughly 10_000 linear memories (each with guard regions between them). These are prepared with a MemoryImageSlot
each.
My guess is that the way things are managed is tuned to "this is acceptable due to some fast path in Linux we're hidding" which we didn't really design for and just happened to run across. I think it'd be reasonable to solve this along a number of lines such as:
- Rely on the kernel to fix this (as you mentioned)
- Change default settings in Wasmtime for illumos (e.g. different defaults for the pooling allocator or something like that)
- Refactor the pooling allocator to have a different allocation strategy by default which is better-by-default on illumos.
Basically I think it'd be reasonable to handle this in Wasmtime with code changes too. I'm certainly no expert with illumos though so I'm also happy to defer to those with more knowledge of it.
if #[cfg(target_os = "illumos")] { | ||
// On illumos, by default, mmap reserves what it calls "swap space" ahead of time, so that | ||
// memory accesses are guaranteed not to fail once mmap succeeds. NORESERVE is for cases | ||
// where that memory is never meant to be accessed -- e.g. memory that's used as guard | ||
// pages. | ||
pub(super) const MMAP_NORESERVE_FLAG: rustix::mm::MapFlags = | ||
rustix::mm::MapFlags::NORESERVE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds reasonable to me!
@@ -24,7 +37,7 @@ impl Mmap { | |||
ptr::null_mut(), | |||
size, | |||
rustix::mm::ProtFlags::READ | rustix::mm::ProtFlags::WRITE, | |||
rustix::mm::MapFlags::PRIVATE, | |||
rustix::mm::MapFlags::PRIVATE | MMAP_NORESERVE_FLAG, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for this one in specific we'll want to leave off NORESERVE
since this is creating a mapping that is entirely readable/writable and aren't doing the same trick with address-space reservation like the reserve
call below.
(unless in your testing you found that this was required, in which case I'd be curious to dig in!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left this in based on the discussion below -- do you feel like that's okay or should I test without the flag?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nah yeah let's leave it here and see how it goes, it doesn't seem unreasonable given the parts below
@@ -56,7 +56,7 @@ pub unsafe fn decommit_pages(addr: *mut u8, len: usize) -> io::Result<()> { | |||
addr as _, | |||
len, | |||
ProtFlags::READ | ProtFlags::WRITE, | |||
MapFlags::PRIVATE | MapFlags::FIXED, | |||
MapFlags::PRIVATE | super::mmap::MMAP_NORESERVE_FLAG | MapFlags::FIXED, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question for you (as I've not had much experience historically with MMAP_NORESERVE
):
In above situations MMAP_NORESERVE
is paired with ProtFlags::empty()
which makes sense to me because that's just a VM reservation as opposed to actually needing readable/writable memory at any one point in time. Here, though, there are read/write flags (same as remap_as_zeros_at
below). Does that means that we should leave off MMAP_NORESERVE
here?
This function is for example used for erasing wasm memory that was already accessible to the guest itself so in some sense we in theory already have pages reserved for it, so I think there's a case to be made for leaving off the flag here and below. That being said I'm new to this flag, so I'm curious what you think too!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering about this and the other cases with non-PROT_NONE permissions! From reading the code it wasn't quite clear to me whether this is memory that guests would certainly have accessed and therefore have physical reservations for already, or memory that guests could merely theoretically access.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@iximeow actually suggested I do this, so I'm wondering if they have thoughts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i may have misremembered where decommit_pages
gets used! my memory was that this is used when cleaning up after an instance exits, where a new module that could have a smaller max heap size. so my thinking was that even though the pages had been accessible before, we don't yet know if that will still be the case on the instance's next invocation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heh this function definitely has a storied history and has gone through many iterations. I had to reread up on the callsites to refresh myself on what exactly is happening here. @iximeow you're right that this is used after an instance is deallocated, but this is separate from resizing memory. Here decommit_pages
is only used to purge backing memory for ranges that were possibly-in-use previously and reset them to their original contents. That's used for both linear memories and tables then. Later other functions like hide_existing_mapping
are use to shrink memory or expose_existing_mapping
are used to grow memory.
So put another way: this function is used on memory which was theoretically accessible to wasm but it may or may not have touched because it may not have touched every page in its linear memory. Some of it may be reused for the next instance without further syscalls. Some of it may be hidden via mprotect
and then re-exposed via mprotect
to use in wasm later on too.
I suspect Wasmtime's not built currently to work well with precise overcommit because we'd have to hook into expose_existing_mapping
as well for example which currently is just a bland mprotect
. Otherwise I don't think it's unreasonable to use NORESERVE
here.
Overall though it seems like the main use case for NORESERVE
is in the reserve
function above where the allocations are often 6G+. Everywhere else should be relatively small allocations/deallocations related to what wasm is about to use or previously using.
In any case happy to leave this in here and we can always frob it later if necessary.
Interesting -- how would I go about doing this? I was floundering around the wasmtime code haha :) |
edit: ahh spoke too soon.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but it's easier to just use cross build --target x86_64-unknown-illumos
oh nice! If you'd like to add that here as well feel free, otherwise I'm happy to try to finagle this in a follow-up too
If you want to setup the NORESERVE
flag for linux builds as well I think this might be good to go after that?
@@ -56,7 +56,7 @@ pub unsafe fn decommit_pages(addr: *mut u8, len: usize) -> io::Result<()> { | |||
addr as _, | |||
len, | |||
ProtFlags::READ | ProtFlags::WRITE, | |||
MapFlags::PRIVATE | MapFlags::FIXED, | |||
MapFlags::PRIVATE | super::mmap::MMAP_NORESERVE_FLAG | MapFlags::FIXED, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heh this function definitely has a storied history and has gone through many iterations. I had to reread up on the callsites to refresh myself on what exactly is happening here. @iximeow you're right that this is used after an instance is deallocated, but this is separate from resizing memory. Here decommit_pages
is only used to purge backing memory for ranges that were possibly-in-use previously and reset them to their original contents. That's used for both linear memories and tables then. Later other functions like hide_existing_mapping
are use to shrink memory or expose_existing_mapping
are used to grow memory.
So put another way: this function is used on memory which was theoretically accessible to wasm but it may or may not have touched because it may not have touched every page in its linear memory. Some of it may be reused for the next instance without further syscalls. Some of it may be hidden via mprotect
and then re-exposed via mprotect
to use in wasm later on too.
I suspect Wasmtime's not built currently to work well with precise overcommit because we'd have to hook into expose_existing_mapping
as well for example which currently is just a bland mprotect
. Otherwise I don't think it's unreasonable to use NORESERVE
here.
Overall though it seems like the main use case for NORESERVE
is in the reserve
function above where the allocations are often 6G+. Everywhere else should be relatively small allocations/deallocations related to what wasm is about to use or previously using.
In any case happy to leave this in here and we can always frob it later if necessary.
Ugh, I guess I need to clean the target dir for illumos. Silly glibc. |
5e82b23
to
470ad80
Compare
With this change, the basic wasm tests all pass on illumos. Note the addition of NORESERVE to mmap calls. However: While wasmtime appears to be functional on illumos, it is still quite slow, particularly in the wast tests. For example, the following test: ``` cargo +beta test --test wast -- Cranelift/pooling/tests/spec_testsuite/load.wast ``` takes 0.07 seconds on Linux, but over 5 seconds on illumos. Some profiling suggests that this is due to lock contention inside the kernel while freeing memory, so I don't think this is a wasmtime issue. I'd like to pull some illumos experts in to do some debugging here as time permits, but I don't think this PR should necessarily be held up on that. Thanks to iximeow for all the help with this! (One note is that due to a [rustc segfault on illumos](https://sunshowers.io/posts/rustc-segfault-illumos/), building wasmtime requires Rust 1.83 or higher. I did my building and testing with `+beta`.)
All right, CI updated and I think this is good to go. |
With this change, the basic wasm tests all pass on illumos. Note the addition of NORESERVE to mmap calls.
However:
While wasmtime appears to be functional on illumos, it is still quite slow, particularly in the wast tests. For example, the following test:
takes 0.07 seconds on Linux, but over 5 seconds on illumos. Some profiling suggests that this is due to lock contention inside the kernel while freeing memory, so I don't think this is a wasmtime issue. I'd like to pull some illumos experts in to do some debugging here as time permits, but I don't think this PR should necessarily be held up on that.
Thanks to iximeow for all the help with this!
(One note is that due to a rustc segfault on illumos, building wasmtime requires Rust 1.83 or higher. I did my building and testing with
+beta
.)