Fix a soundness issue with the component model and async #6509
Conversation
Currently this uses tokio's `spawn_blocking`, but that reuses threads in its thread pool. Instead, spawn a dedicated thread and perform a single poll on it, forcing a fresh thread to be used for each poll and ideally stressing TLS management further.
This commit adds a guard to Wasmtime's async support to double-check that, when a call to `poll` finishes, the currently active TLS activation pointer does not point to the stack that is being switched off of. This is intended as a bit of a defense-in-depth measure to prevent stale pointers from sticking around in TLS. That is currently happening and causing bytecodealliance#6493, which can result in unsoundness but currently manifests as a crash.
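As a rough sketch of what such a defense-in-depth check could look like (illustrative only: `FiberStack`, `contains`, and `tls_head` are hypothetical names, not Wasmtime's actual internals):

```rust
/// Hypothetical stand-in for a fiber's stack allocation; Wasmtime's real
/// representation differs.
struct FiberStack {
    base: usize, // lowest address of the stack
    len: usize,  // size of the stack in bytes
}

impl FiberStack {
    /// Whether `ptr` points into this fiber's stack memory.
    fn contains(&self, ptr: usize) -> bool {
        ptr >= self.base && ptr < self.base + self.len
    }
}

/// Guard run after a `poll` that switched off of a fiber's stack: the
/// activation currently stored in TLS (if any) must not live on the stack
/// that was just switched away from, or a stale pointer was left behind.
fn assert_no_stale_activation(stack: &FiberStack, tls_head: Option<usize>) {
    if let Some(ptr) = tls_head {
        debug_assert!(
            !stack.contains(ptr),
            "TLS activation still points into a suspended fiber's stack"
        );
    }
}
```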
This commit addresses bytecodealliance#6493 by fixing a soundness issue with the async implementation of the component model. This issue has been present since the inception of async support in the component model and doesn't represent a recent regression. The underlying problem is that one of the base assumptions of the trap-handling code is that there's only a single activation in TLS that needs to be pushed/popped when a stack is switched (e.g. a fiber is switched to or from). In the case of the component model there might be two activations: one for an invocation of a component function and a second for an invocation of a `realloc` function to return results back to wasm (e.g. in the case an imported function returns a list).

This problem is fixed by changing how TLS is managed in the presence of fibers. Previously, when a fiber was suspended, it would pop a single activation from the top of the stack and save it to be pushed when the fiber was resumed. This has the benefit of maintaining an entire linked list of activations for the current thread, but has the problem above where it doesn't handle a fiber with multiple activations on it. Now TLS management is done when a fiber is resumed rather than when it is suspended. Instead of pushing/popping a single activation, the entire linked list of activations is tracked for a particular fiber and stored within the fiber itself. In this manner, resuming a fiber pushes all of its activations onto the current thread, and suspending a fiber pops all of its activations (storing them as a new linked list in the fiber's state itself).

The end result is that all activations on a fiber should now be managed correctly, regardless of how many there are. The main downside of this commit is that fiber suspension and resumption are more complicated, but the hope is that fiber suspension typically corresponds with I/O not being ready or similar, so the TLS operations aren't significant compared to the I/O overhead.

Closes bytecodealliance#6493
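To make the new scheme concrete, here is a minimal sketch of the suspend/resume bookkeeping described above. Everything in it (`Activation`, `TLS_HEAD`, `splice`) is a simplified stand-in for Wasmtime's real structures, and for brevity the suspend path takes the thread's whole list rather than just the fiber's own activations:

```rust
use std::cell::RefCell;

/// Simplified stand-in for the per-invocation trap-handling state; each
/// wasm invocation pushes one of these onto a thread-local linked list.
struct Activation {
    prev: Option<Box<Activation>>,
}

thread_local! {
    /// Head of the current thread's linked list of activations.
    static TLS_HEAD: RefCell<Option<Box<Activation>>> = RefCell::new(None);
}

/// State stored in the fiber itself while it's suspended: the complete
/// chain of activations that were live on the fiber's stack, not just one.
struct SuspendedActivations {
    chain: Option<Box<Activation>>,
}

/// On suspend: pop every activation belonging to the fiber off the thread
/// and store the whole chain in the fiber. (The real code only takes the
/// fiber's own activations; taking the entire list simplifies the sketch.)
fn suspend(fiber: &mut SuspendedActivations) {
    fiber.chain = TLS_HEAD.with(|head| head.borrow_mut().take());
}

/// On resume: push the fiber's entire saved chain back onto the thread,
/// keeping one complete list so that e.g. GC can still walk every live
/// activation on the thread.
fn resume(fiber: &mut SuspendedActivations) {
    let chain = fiber.chain.take();
    TLS_HEAD.with(|head| {
        let mut slot = head.borrow_mut();
        let rest = slot.take();
        *slot = splice(chain, rest);
    });
}

/// Attach `rest` beneath the tail of `chain`, returning the combined list.
fn splice(
    chain: Option<Box<Activation>>,
    rest: Option<Box<Activation>>,
) -> Option<Box<Activation>> {
    match chain {
        None => rest,
        Some(mut node) => {
            node.prev = splice(node.prev.take(), rest);
            Some(node)
        }
    }
}
```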
I'll note that I had a number of false starts in attempting to solve this issue. First I thought that this only needed a push/pop around calls to …. Next I thought that it would be possible to not actually maintain the entire linked list of activations and instead have a fiber keep its own linked list which replaces the current thread's linked list when it resumes and is swapped back out when it suspends. This, however, does not work because a complete list of activations is required to correctly perform GC, so I wrote a test to expose that bug. Finally I settled on this solution, which is at least sufficient for all of the above scenarios. I'm not 100% sure it's the end-all-be-all, so extra care in review would be appreciated!
Mostly LGTM, but I have an idea around making this slightly more strongly typed, plus some naming nitpicks (take these with a grain of salt; I'm not 100% sure why I hate "TLS" here so much, maybe you can think of something better though, since I am woefully lacking in actual alternatives).
```diff
 /// This will execute the `future` provided to completion and each invocation of
 /// `poll` for the future will be executed on a separate thread.
 pub async fn execute_across_threads<F>(future: F) -> F::Output
 where
     F: Future + Send + 'static,
     F::Output: Send,
 {
-    let future = PollOnce::new(Box::pin(future)).await;
-
-    tokio::task::spawn_blocking(move || tokio::runtime::Handle::current().block_on(future))
-        .await
-        .expect("shouldn't panic")
+    let mut future = Box::pin(future);
+    loop {
+        let once = PollOnce::new(future);
+        let handle = tokio::runtime::Handle::current();
+        let result = std::thread::spawn(move || handle.block_on(once))
+            .join()
+            .unwrap();
+        match result {
+            Ok(val) => break val,
+            Err(f) => future = f,
+        }
+    }
 }
```
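For context, here is a sketch of the `PollOnce` helper this diff relies on, reconstructed from how the new code uses it (the real definition lives elsewhere in Wasmtime's test support code and may differ): awaiting it polls the inner future exactly once, yielding either the finished value or the still-pending future itself so the next poll can be shipped to a brand-new thread.

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

/// Polls the wrapped future exactly once when awaited. Resolves to
/// `Ok(output)` if the future finished, or `Err(future)` handing the
/// still-pending future back to the caller.
struct PollOnce<F>(Option<Pin<Box<F>>>);

impl<F: Future> PollOnce<F> {
    fn new(future: Pin<Box<F>>) -> PollOnce<F> {
        PollOnce(Some(future))
    }
}

impl<F: Future> Future for PollOnce<F> {
    type Output = Result<F::Output, Pin<Box<F>>>;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        // `PollOnce` itself is `Unpin` (it only holds a `Pin<Box<F>>`), so
        // plain field access through the pin is fine here.
        let mut future = self.0.take().expect("polled after completion");
        match future.as_mut().poll(cx) {
            Poll::Ready(val) => Poll::Ready(Ok(val)),
            Poll::Pending => Poll::Ready(Err(future)),
        }
    }
}
```

With a helper of this shape, the loop in the new code moves the future to a fresh OS thread for every single poll, which is exactly the stress on TLS management that the commit message above asks for.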
Very nice
I definitely agree with the separation of types and re-wording of things, done now 👍
Thanks! LGTM!