Cranelift: de-duplicate bounds checks in legalizations #5190
Conversation
LGTM, thanks!
```diff
@@ -46,6 +47,8 @@ fn imm_const(pos: &mut FuncCursor, arg: Value, imm: Imm64, is_signed: bool) -> V
 /// Perform a simple legalization by expansion of the function, without
 /// platform-specific transforms.
 pub fn simple_legalize(func: &mut ir::Function, cfg: &mut ControlFlowGraph, isa: &dyn TargetIsa) {
+    trace!("Pre-legalization function:\n{}", func.display());
```
These `trace!`s seem a bit verbose to leave in -- big function bodies will be replicated two more times (in addition to the pre-opt, pre-lowering, and lowered vcode forms). I don't feel too strongly about that, just a suggestion.
I'll remove the post-legalization logging (since you can see that as the input to the next pass), but I think that in general every pass should `trace!`-log its input.
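For readers outside this thread, here is a minimal sketch of the pass-entry logging pattern under discussion, assuming the `log` crate's `trace!` macro and an `env_logger`-style subscriber; the toy `Function` type is a stand-in for `ir::Function`, whose `display()` the real code logs:

```rust
use log::trace;

/// Toy stand-in for `ir::Function`; the real pass logs `func.display()`.
struct Function {
    name: String,
}

impl Function {
    fn display(&self) -> String {
        format!("function %{}() {{ ... }}", self.name)
    }
}

/// Each pass logs its *input* at trace level; the pass output is visible as
/// the input of the next pass, so post-pass logging is redundant.
fn simple_legalize(func: &mut Function) {
    trace!("Pre-legalization function:\n{}", func.display());
    // ... legalization work would happen here ...
}

fn main() {
    // With env_logger, `RUST_LOG=trace` makes these messages visible;
    // at the default level they cost almost nothing.
    env_logger::init();
    let mut f = Function { name: "f0".into() };
    simple_legalize(&mut f);
}
```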
Trevor and I investigated the failing test on aarch64, and this ultimately led us to a deeper issue: fundamentally, this PR changes the semantics of `heap_addr`. This leads us into a bit of a mess.
I'm not sure how to resolve these issues and still deduplicate the bounds checks. Interested in folks' thoughts on this. Separately, Trevor and I discovered the following issues (not necessarily caused by this PR).
I think it changes the semantics for dynamic heaps, but it doesn't change the semantics for static heaps? In that sense all our libcalls should already be guarded against "this address is in bounds and should not segfault", I think?
Thanks for digging into this; a few thoughts:
Overall, I think I'm not as pessimistic on this: it basically changes the semantics slightly, in a way that I believe is still safe.
Yeah, and I agree it is safe; it just makes me a little nervous to produce CLIF that bakes in these assumptions.
I guess the biggest unknown in my mind is how large of a NULL guard page region we can rely on? Because this PR bakes in assumptions that the spectre guard already makes about the guard region at address 0.

For example, given

```wat
(module
  (memory (export "memory") 1)
  (func (export "f") (param i32) (result i32)
    local.get 0
    i32.load offset=0xfffffff0))
```

We lower that Wasm to this clif:

```clif
v4 = heap_addr.i64 heap0, v2, 0x8000_0000
v5 = iadd_imm v4, 0xffff_fff0
v6 = load.i32 little heap v5
```

And then (with static memories (but smaller than 4GiB guard region) + spectre guards) legalize that clif to this clif:

```clif
;; check if the pointer is larger than the static memory size (including both
;; wasm memory and guard pages, I think?)
v7 = uextend.i64 v2
v8 = icmp_imm ugt v7, 0x8000_0000
trapnz v8, heap_oob
v9 = iconst.i64 0x8000_0000
;; gv4 is the memory base
v10 = global_value.i64 gv4
v11 = iadd v10, v7
v12 = iconst.i64 0
v13 = icmp ugt v7, v9 ; v9 = 0x8000_0000
v4 = select_spectre_guard v13, v12, v11 ; v12 = 0
v5 = iadd_imm v4, 0xffff_fff0
v6 = load.i32 little heap v5
```

Do we guarantee a 4GiB-sized NULL guard region? Because doing the spectre guard on the base (the `v4` above) rather than on the final address (`v5`) means that when the select produces 0, the subsequent `iadd_imm` still yields an address in the first 4GiB (here, `0xffff_fff0`), and we are relying on that whole region being inaccessible.

This becomes even more problematic when we look at the legalization of that same Wasm's initial clif, this time configured with dynamic memories and spectre guards, and using this patch which removes the dynamic memory's bounds check in favor of the spectre guard's bounds check:

```clif
;; extend wasm i32 pointer to native i64 pointer
v7 = uextend.i64 v2
;; gv4 is the base of memory
v8 = global_value.i64 gv4
;; overflow check for pointer + offset
v9 = iconst.i64 0xffff_0000
v10 = uadd_overflow_trap v7, v9, heap_oob ; v9 = 0xffff_0000
;; gv5 is the length of memory
v11 = global_value.i64 gv5
v12 = iadd v11, v7
v13 = iconst.i64 0
v14 = icmp ugt v10, v8
v4 = select_spectre_guard v14, v13, v12 ; v13 = 0
v5 = iadd_imm v4, 0xffff_fff0
v6 = load.i32 little heap v5
```

We're baking in that same assumption. Now we aren't just letting speculative execution access the hopefully-null-guard-page-protected region of memory in 0..4GiB, but regular Wasm execution as well. I think we can fix all this ambiguity by effectively doing `load(spectre_guard(base + index + offset))` rather than `load(spectre_guard(base + index) + offset)`.
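To make the speculative-reach difference concrete, here is a small Rust sketch using hypothetical helpers (this is not Cranelift code, and the bounds comparison is simplified to `index > bound`): when the guard is applied before the offset is added, a failed check still yields an address just under 4GiB; guarding the full sum clamps it to exactly 0.

```rust
/// Branchless stand-in for `select_spectre_guard`: returns 0 if `oob`,
/// otherwise `addr`. (Real codegen uses a conditional move.)
fn spectre_guard(oob: bool, addr: u64) -> u64 {
    if oob { 0 } else { addr }
}

/// What the current legalization effectively computes: the guard only covers
/// `base + index`, and `offset` is added afterwards, so an out-of-bounds
/// `index` still produces an address in `0..offset`.
fn addr_guard_then_offset(base: u64, index: u64, offset: u64, bound: u64) -> u64 {
    let oob = index > bound;
    spectre_guard(oob, base.wrapping_add(index)).wrapping_add(offset)
}

/// What we want instead: guard the *full* address, so an out-of-bounds
/// access always resolves to address 0 (the NULL page).
fn addr_guard_full(base: u64, index: u64, offset: u64, bound: u64) -> u64 {
    let oob = index > bound;
    spectre_guard(oob, base.wrapping_add(index).wrapping_add(offset))
}

fn main() {
    let (base, bound, offset) = (0x1000_0000_0000, 0x1_0000, 0xffff_fff0);
    let index = 0xdead_beef; // out of bounds for `bound`
    // Just under 4GiB past NULL: reachable by (speculative) execution.
    println!("{:#x}", addr_guard_then_offset(base, index, offset, bound));
    // Exactly 0: only the NULL page is reachable.
    println!("{:#x}", addr_guard_full(base, index, offset, bound));
}
```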
This requires changes to `heap_addr` itself.
I don't think we can guarantee a 4GiB null guard. Memory may be mapped within this range, whether explicitly using mmap or because the executable has a fixed position within this range. Only 4096 bytes are guaranteed on Linux, as this is the default value for `vm.mmap_min_addr`.
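As an aside, the guaranteed low-address guard on a given Linux machine can be inspected directly; a tiny sketch (Linux-only, assumes `/proc` is mounted):

```rust
use std::fs;

fn main() {
    // vm.mmap_min_addr is the lowest address userspace may mmap; anything
    // below it acts as a guaranteed "NULL guard" region. The kernel default
    // is small (e.g. 4096 or 65536 bytes depending on the distro), nowhere
    // near 4GiB.
    match fs::read_to_string("/proc/sys/vm/mmap_min_addr") {
        Ok(s) => println!("vm.mmap_min_addr = {} bytes", s.trim()),
        Err(e) => eprintln!("could not read mmap_min_addr: {e}"),
    }
}
```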
If I understand the code correctly, I think we handle all of this already. Or I guess all of this is a long way of saying: I don't think we actually lower into that CLIF if dynamic memory is configured. This is subtly different from your methodology, I think -- if I understand it correctly, you took the CLIF generated by translation with static memories enabled, then compiled it with dynamic memories enabled? I think doing the translation with dynamic memories enabled should result in code that does not assume a 4GiB guard at address 0.
(And to answer the immediate question, no, we definitely can't assume a 4GiB guard region at address 0! That would be an un-portable mess, as you observe.)
No, each different legalization example is generated from the original input Wasm with different flags passed to Wasmtime. I was not working with just the CLIF; I was always using the original Wasm as input.
This still doesn't prevent speculated execution from reading 0..4GiB though, because the generated code ultimately does `load(spectre_guard(base + index) + offset)` rather than `load(spectre_guard(base + index + offset))`.
Ah, OK, I was misreading your concern, sorry. I had thought you were saying that the actual bounds-check (the non-speculative behavior that should result in a trap) was wrong, and was very confused. But indeed, the select is picking the base prior to the offset addition. We should fix that by reordering the ops, as you suggest, so that the guard covers the fully offset address. Once we do that, are there any other issues with this approach?
Today, actual execution is not affected by this. After this PR as currently written, actual execution is affected by this when dynamic memories and spectre guards are both enabled.
Not that I am aware of. Currently working on the `heap_addr` changes now.
Right, sorry, makes sense now; not sure why I missed that!
No worries, this stuff is pretty subtle; it has basically taken me all day just to write these comments and read through the code enough to understand what is happening here :-p
* Implement support for dynamic memories in the pooling allocator

  This is a continuation of the thrust in #5207 for reducing page faults and lock contention when using the pooling allocator. To that end this commit implements support for efficient memory management in the pooling allocator when using wasm that is instrumented with bounds checks.

  The `MemoryImageSlot` type now avoids unconditionally shrinking memory back to its initial size during the `clear_and_remain_ready` operation, instead deferring optional resizing of memory to the subsequent call to `instantiate` when the slot is reused. The instantiation portion then takes the "memory style" as an argument which dictates whether the accessible memory must be precisely fit or whether it's allowed to exceed the maximum. This in effect enables skipping a call to `mprotect` to shrink the heap when dynamic memory checks are enabled.

  In terms of page faults and contention this should improve the situation by:

  * Fewer calls to `mprotect`, since once a heap grows it stays grown and never shrinks. This means that a write lock is taken within the kernel much more rarely than before (only asymptotically now, not N-times-per-instance).
  * Accessed memory after a heap-growth operation will not fault if it was previously paged in by a prior instance and set to zero with `memset`.

  Unlike #5207, which requires a 6.0 kernel to see this optimization, this commit enables the optimization for any kernel.

  The major cost of choosing this strategy is naturally the performance hit of the wasm itself. This is being looked at in PRs such as #5190 to improve Wasmtime's story here.

  This commit does not implement any new configuration options for Wasmtime but instead reinterprets existing configuration options. The pooling allocator no longer unconditionally sets `static_memory_bound_is_maximum` and then implements the support necessary for this memory type. The other change in this commit is that the `Tunables::static_memory_bound` configuration option no longer gates the creation of a `MemoryPool`, which will now appropriately size itself to `instance_limits.memory_pages` if `static_memory_bound` is too small. This is done to accommodate fuzzing more easily, where `static_memory_bound` becomes small during fuzzing and the configuration would otherwise be rejected and require manual handling. The spirit of the `MemoryPool` is one of large virtual address space reservations anyway, so it seemed reasonable to interpret the configuration this way.

* Skip zero memory_size cases

  These are causing errors to happen when fuzzing and otherwise in theory shouldn't be too interesting to optimize for anyway since they likely aren't used in practice.
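To illustrate the slot-reuse strategy that commit message describes, here is a rough Rust sketch with hypothetical types and method names (the real `MemoryImageSlot` logic is more involved): deactivation keeps the mapping as-is, and the next instantiation only shrinks the accessible region when the memory style demands a precise fit.

```rust
/// Whether the accessible region must exactly match the memory size
/// (static-style memories) or may exceed it (dynamic, bounds-checked
/// memories).
#[derive(Clone, Copy, PartialEq)]
enum MemoryStyle {
    PreciseFit,
    MayExceed,
}

/// Hypothetical stand-in for a pooling-allocator slot.
struct Slot {
    accessible: usize, // bytes currently mapped read/write
}

impl Slot {
    /// On deactivation, zero the contents but keep the mapping as-is,
    /// avoiding an `mprotect` call to shrink it.
    fn clear_and_remain_ready(&mut self) {
        // memset-to-zero of `self.accessible` bytes would happen here.
    }

    /// On reuse, only shrink (or grow) the accessible region if the new
    /// instance actually requires it.
    fn instantiate(&mut self, initial_size: usize, style: MemoryStyle) {
        match style {
            MemoryStyle::PreciseFit if self.accessible != initial_size => {
                // mprotect to exactly `initial_size` would happen here.
                self.accessible = initial_size;
            }
            _ if self.accessible < initial_size => {
                // Grow if a previous instance left the slot smaller.
                self.accessible = initial_size;
            }
            _ => {} // Dynamic-style memory: leave the larger mapping in place.
        }
    }
}

fn main() {
    let mut slot = Slot { accessible: 8 << 20 }; // left large by a prior instance
    slot.clear_and_remain_ready();
    slot.instantiate(1 << 20, MemoryStyle::MayExceed);
    assert_eq!(slot.accessible, 8 << 20); // no shrink, no mprotect
}
```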
* Cranelift: Make `heap_addr` return calculated `base + index + offset` (#5231)

  Rather than return just the `base + index`. (Note: I've chosen to use the nomenclature "index" for the dynamic operand and "offset" for the static immediate.)

  This moves the addition of the `offset` into `heap_addr`, instead of leaving it for the subsequent memory operation, so that we can Spectre-guard the full address and not allow speculative execution to read the first 4GiB of memory. Before this commit, we were effectively doing

      load(spectre_guard(base + index) + offset)

  and now we are effectively doing

      load(spectre_guard(base + index + offset))

  Finally, this also corrects `heap_addr`'s documented semantics to say that it returns an address that will trap on access if `index + offset + access_size` is out of bounds for the given heap, rather than saying that the `heap_addr` itself will trap. This matches the implemented behavior for static memories and, after #5190 lands (which is blocked on this commit), will also match the implemented behavior for dynamic memories.

* Update heap_addr docs

* Factor out `offset + size` to a helper
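A small Rust sketch of the bounds condition implied by the updated `heap_addr` semantics (a hypothetical helper, not the Cranelift implementation): an access is out of bounds when `index + offset + access_size` exceeds the heap bound, with the additions checked so that wraparound also counts as out of bounds, in the same spirit as the `uadd_overflow_trap` seen in the legalized CLIF earlier in this thread.

```rust
/// Returns true if an access of `access_size` bytes at `index + offset`
/// would be out of bounds for a heap of `bound` bytes.
fn heap_access_oob(index: u64, offset: u64, access_size: u64, bound: u64) -> bool {
    // `offset + access_size` is a compile-time constant in the real
    // legalization, which is why factoring it out as a helper is natural.
    let end = match offset.checked_add(access_size) {
        Some(end) => end,
        None => return true,
    };
    match index.checked_add(end) {
        Some(last) => last > bound,
        None => true,
    }
}

fn main() {
    let bound = 0x1_0000; // one 64 KiB wasm page
    assert!(!heap_access_oob(0, 0, 4, bound));
    assert!(heap_access_oob(0xfffe, 0, 4, bound)); // crosses the end of the heap
    assert!(heap_access_oob(1, u64::MAX, 4, bound)); // offset + size overflows
}
```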
Force-pushed from d322bd6 to 8403248.
@cfallin I've rebased this on top of the `heap_addr` changes in #5231.
When both (1) "dynamic" memories that need explicit bounds checks and (2) spectre mitigations that perform bounds checks are enabled, reuse the same bounds checks between the two legalizations.

This reduces the overhead of explicit bounds checks and spectre mitigations over using virtual memory guard pages with spectre mitigations from ~1.9-2.1x overhead to ~1.6-1.8x overhead. That is about a 14-19% speed up for when dynamic memories and spectre mitigations are enabled.

virtual-memory-guards.so vs bounds-checks.so:

```
execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm
  Δ = 3422455129.47 ± 120159.49 (confidence = 99%)
  virtual-memory-guards.so is 2.09x to 2.09x faster than bounds-checks.so!
  [6563931659 6564063496.07 6564301535] bounds-checks.so
  [3141492675 3141608366.60 3141895249] virtual-memory-guards.so

execution :: instructions-retired :: benchmarks/bz2/benchmark.wasm
  Δ = 338716136.87 ± 1.38 (confidence = 99%)
  virtual-memory-guards.so is 2.08x to 2.08x faster than bounds-checks.so!
  [651961494 651961495.47 651961497] bounds-checks.so
  [313245357 313245358.60 313245362] virtual-memory-guards.so

execution :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm
  Δ = 22742944.07 ± 331.73 (confidence = 99%)
  virtual-memory-guards.so is 1.87x to 1.87x faster than bounds-checks.so!
  [48841295 48841567.33 48842139] bounds-checks.so
  [26098439 26098623.27 26099479] virtual-memory-guards.so
```

virtual-memory-guards.so vs de-duped-bounds-checks.so:

```
execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm
  Δ = 2465900207.27 ± 146476.61 (confidence = 99%)
  virtual-memory-guards.so is 1.78x to 1.78x faster than de-duped-bounds-checks.so!
  [5607275431 5607442989.13 5607838342] de-duped-bounds-checks.so
  [3141445345 3141542781.87 3141711213] virtual-memory-guards.so

execution :: instructions-retired :: benchmarks/bz2/benchmark.wasm
  Δ = 234253620.20 ± 2.33 (confidence = 99%)
  virtual-memory-guards.so is 1.75x to 1.75x faster than de-duped-bounds-checks.so!
  [547498977 547498980.93 547498985] de-duped-bounds-checks.so
  [313245357 313245360.73 313245363] virtual-memory-guards.so

execution :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm
  Δ = 16605659.13 ± 315.78 (confidence = 99%)
  virtual-memory-guards.so is 1.64x to 1.64x faster than de-duped-bounds-checks.so!
  [42703971 42704284.40 42704787] de-duped-bounds-checks.so
  [26098432 26098625.27 26099234] virtual-memory-guards.so
```

bounds-checks.so vs de-duped-bounds-checks.so:

```
execution :: instructions-retired :: benchmarks/bz2/benchmark.wasm
  Δ = 104462517.13 ± 7.32 (confidence = 99%)
  de-duped-bounds-checks.so is 1.19x to 1.19x faster than bounds-checks.so!
  [651961493 651961500.80 651961532] bounds-checks.so
  [547498981 547498983.67 547498989] de-duped-bounds-checks.so

execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm
  Δ = 956556982.80 ± 103034.59 (confidence = 99%)
  de-duped-bounds-checks.so is 1.17x to 1.17x faster than bounds-checks.so!
  [6563930590 6564019842.40 6564243651] bounds-checks.so
  [5607307146 5607462859.60 5607677763] de-duped-bounds-checks.so

execution :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm
  Δ = 6137307.87 ± 247.75 (confidence = 99%)
  de-duped-bounds-checks.so is 1.14x to 1.14x faster than bounds-checks.so!
  [48841303 48841472.93 48842000] bounds-checks.so
  [42703965 42704165.07 42704718] de-duped-bounds-checks.so
```
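For illustration only, a hedged Rust sketch of the de-duplication this PR performs (hypothetical helpers; the real transformation rewrites CLIF during legalization): before, the explicit dynamic-memory bounds check and the Spectre guard each performed their own comparison; after, a single comparison drives the Spectre select, and an out-of-bounds address clamps to 0 so the access itself faults.

```rust
/// Stand-in for `select_spectre_guard`: clamps the address to 0 when the
/// bounds check fails, so a (speculative or real) out-of-bounds access can
/// only touch the protected page at address 0.
fn select_spectre_guard(oob: bool, addr: u64) -> u64 {
    if oob { 0 } else { addr }
}

/// Before this PR (dynamic memories + Spectre mitigations): an explicit
/// trapping bounds check *and* a second, redundant comparison for the
/// Spectre guard.
fn addr_before(base: u64, index: u64, offset_and_size: u64, bound: u64) -> Option<u64> {
    if index + offset_and_size > bound {
        return None; // models the explicit trapping bounds check
    }
    let oob = index + offset_and_size > bound; // duplicate comparison
    Some(select_spectre_guard(oob, base + index + offset_and_size))
}

/// After this PR: a single comparison drives the Spectre select, and an
/// out-of-bounds access resolves to address 0, faulting when dereferenced.
fn addr_after(base: u64, index: u64, offset_and_size: u64, bound: u64) -> u64 {
    let oob = index + offset_and_size > bound;
    select_spectre_guard(oob, base + index + offset_and_size)
}

fn main() {
    let (base, bound) = (0x1000u64, 0x1_0000u64);
    assert_eq!(addr_before(base, 16, 4, bound), Some(addr_after(base, 16, 4, bound)));
    assert_eq!(addr_after(base, 0xffff_0000, 4, bound), 0); // OOB clamps to NULL
}
```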
Force-pushed from 8403248 to 7aa3cff.
Just realized @cfallin is going to be OoO for a few days. @jameysharp, would you mind taking a look?
Since #5231 is merged, sure, this makes enough sense to me to sign off on it.
It's unfortunate that the extensive comments in `cranelift/filetests/filetests/isa/x64/heap.clif` haven't been maintained since @abrown wrote them in #4841. If you have a chance to fix them up before merging this I think that'd be great, but I wouldn't block merging on that.
I plan on fixing up a lot of these comments in follow-ups.