Fix a soundness issue with the component model and async #6509
Conversation
Currently this uses tokio's `spawn_blocking`, but that reuses threads in its thread pool. Instead, spawn a dedicated thread and perform a single poll on it, forcing a fresh thread to be used for each poll and ideally stressing TLS management further.
This commit adds a guard to Wasmtime's async support to double-check that, when a call to `poll` finishes, the currently active TLS activation pointer does not point to the stack that is being switched off of. This is intended as a bit of a defense-in-depth measure to prevent stale pointers from sticking around in TLS. That is currently happening and causing bytecodealliance#6493, which can result in unsoundness but currently manifests as a crash.
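As a rough sketch of what such a defense-in-depth check could look like (illustrative only: `FiberStack`, `contains`, and `tls_head` are hypothetical names, not Wasmtime's actual internals):

```rust
/// Hypothetical stand-in for a fiber's stack allocation; Wasmtime's real
/// representation differs.
struct FiberStack {
    base: usize, // lowest address of the stack
    len: usize,  // size of the stack in bytes
}

impl FiberStack {
    /// Whether `ptr` points into this fiber's stack memory.
    fn contains(&self, ptr: usize) -> bool {
        ptr >= self.base && ptr < self.base + self.len
    }
}

/// Guard run after a `poll` that switched off of a fiber's stack: the
/// activation currently stored in TLS (if any) must not live on the stack
/// that was just switched away from, or a stale pointer was left behind.
fn assert_no_stale_activation(stack: &FiberStack, tls_head: Option<usize>) {
    if let Some(ptr) = tls_head {
        debug_assert!(
            !stack.contains(ptr),
            "TLS activation still points into a suspended fiber's stack"
        );
    }
}
```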
This commit addresses bytecodealliance#6493 by fixing a soundness issue with the async implementation of the component model. This issue has been present since the inception of async support in the component model and doesn't represent a recent regression. The underlying problem is that one of the base assumptions of the trap-handling code is that there's only a single activation in TLS that needs to be pushed/popped when a stack is switched (e.g. a fiber is switched to or from). In the case of the component model there might be two activations: one for an invocation of a component function and a second for an invocation of a `realloc` function to return results back to wasm (e.g. in the case an imported function returns a list).

This problem is fixed by changing how TLS is managed in the presence of fibers. Previously, when a fiber was suspended, it would pop a single activation from the top of the stack and save it to be pushed when the fiber was resumed. This has the benefit of maintaining an entire linked list of activations for the current thread, but has the problem above where it doesn't handle a fiber with multiple activations on it. Now TLS management is done when a fiber is resumed rather than when it is suspended. Instead of pushing/popping a single activation, the entire linked list of activations is tracked for a particular fiber and stored within the fiber itself. In this manner, resuming a fiber pushes all of its activations onto the current thread, and suspending a fiber pops all of its activations (storing them as a new linked list in the fiber's state itself).

The end result is that all activations on a fiber should now be managed correctly, regardless of how many there are. The main downside of this commit is that fiber suspension and resumption are more complicated, but the hope is that fiber suspension typically corresponds with I/O not being ready or similar, so the TLS operations aren't significant compared to the I/O overhead.

Closes bytecodealliance#6493
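To make the new scheme concrete, here is a minimal sketch of the suspend/resume bookkeeping described above. Everything in it (`Activation`, `TLS_HEAD`, `splice`) is a simplified stand-in for Wasmtime's real structures, and for brevity the suspend path takes the thread's whole list rather than just the fiber's own activations:

```rust
use std::cell::RefCell;

/// Simplified stand-in for the per-invocation trap-handling state; each
/// wasm invocation pushes one of these onto a thread-local linked list.
struct Activation {
    prev: Option<Box<Activation>>,
}

thread_local! {
    /// Head of the current thread's linked list of activations.
    static TLS_HEAD: RefCell<Option<Box<Activation>>> = RefCell::new(None);
}

/// State stored in the fiber itself while it's suspended: the complete
/// chain of activations that were live on the fiber's stack, not just one.
struct SuspendedActivations {
    chain: Option<Box<Activation>>,
}

/// On suspend: pop every activation belonging to the fiber off the thread
/// and store the whole chain in the fiber. (The real code only takes the
/// fiber's own activations; taking the entire list simplifies the sketch.)
fn suspend(fiber: &mut SuspendedActivations) {
    fiber.chain = TLS_HEAD.with(|head| head.borrow_mut().take());
}

/// On resume: push the fiber's entire saved chain back onto the thread,
/// keeping one complete list so that e.g. GC can still walk every live
/// activation on the thread.
fn resume(fiber: &mut SuspendedActivations) {
    let chain = fiber.chain.take();
    TLS_HEAD.with(|head| {
        let mut slot = head.borrow_mut();
        let rest = slot.take();
        *slot = splice(chain, rest);
    });
}

/// Attach `rest` beneath the tail of `chain`, returning the combined list.
fn splice(
    chain: Option<Box<Activation>>,
    rest: Option<Box<Activation>>,
) -> Option<Box<Activation>> {
    match chain {
        None => rest,
        Some(mut node) => {
            node.prev = splice(node.prev.take(), rest);
            Some(node)
        }
    }
}
```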
I'll note that I had a number of false starts in attempting to solve this issue. First I thought that this only needed a push/pop around calls to …. Next I thought that it would be possible to not actually maintain the entire linked list of activations and instead have a fiber keep its own linked list which replaces the current thread's linked list when it resumes and is swapped back out when it suspends. This, however, does not work because a complete list of activations is required to correctly perform GC, so I wrote a test to expose that bug. Finally I settled on this solution, which is at least sufficient for all of the above scenarios. I'm not 100% sure it's the end-all-be-all, so extra care in review would be appreciated!
Mostly LGTM, but I have an idea around making this slightly more strongly typed, plus some naming nitpicks (take these with a grain of salt; I'm not 100% sure why I hate "TLS" here so much, maybe you can think of something better though, since I am woefully lacking in actual alternatives).
```diff
 /// This will execute the `future` provided to completion and each invocation of
 /// `poll` for the future will be executed on a separate thread.
 pub async fn execute_across_threads<F>(future: F) -> F::Output
 where
     F: Future + Send + 'static,
     F::Output: Send,
 {
-    let future = PollOnce::new(Box::pin(future)).await;
-
-    tokio::task::spawn_blocking(move || tokio::runtime::Handle::current().block_on(future))
-        .await
-        .expect("shouldn't panic")
+    let mut future = Box::pin(future);
+    loop {
+        let once = PollOnce::new(future);
+        let handle = tokio::runtime::Handle::current();
+        let result = std::thread::spawn(move || handle.block_on(once))
+            .join()
+            .unwrap();
+        match result {
+            Ok(val) => break val,
+            Err(f) => future = f,
+        }
+    }
 }
```
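For context, here is a sketch of the `PollOnce` helper this diff relies on, reconstructed from how the new code uses it (the real definition lives elsewhere in Wasmtime's test support code and may differ): awaiting it polls the inner future exactly once, yielding either the finished value or the still-pending future itself so the next poll can be shipped to a brand-new thread.

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

/// Polls the wrapped future exactly once when awaited. Resolves to
/// `Ok(output)` if the future finished, or `Err(future)` handing the
/// still-pending future back to the caller.
struct PollOnce<F>(Option<Pin<Box<F>>>);

impl<F: Future> PollOnce<F> {
    fn new(future: Pin<Box<F>>) -> PollOnce<F> {
        PollOnce(Some(future))
    }
}

impl<F: Future> Future for PollOnce<F> {
    type Output = Result<F::Output, Pin<Box<F>>>;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        // `PollOnce` itself is `Unpin` (it only holds a `Pin<Box<F>>`), so
        // plain field access through the pin is fine here.
        let mut future = self.0.take().expect("polled after completion");
        match future.as_mut().poll(cx) {
            Poll::Ready(val) => Poll::Ready(Ok(val)),
            Poll::Pending => Poll::Ready(Err(future)),
        }
    }
}
```

With a helper of this shape, the loop in the new code moves the future to a fresh OS thread for every single poll, which is exactly the stress on TLS management that the commit message above asks for.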
Very nice
I definitely agree with the separation of types and re-wording of things, done now 👍
Thanks! LGTM!