Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wasmparser: Performance regressions since v0.100.2 #1701

Closed
Robbepop opened this issue Jul 26, 2024 · 23 comments · Fixed by #1906
Closed

wasmparser: Performance regressions since v0.100.2 #1701

Robbepop opened this issue Jul 26, 2024 · 23 comments · Fixed by #1906

Comments

@Robbepop
Copy link
Collaborator

Robbepop commented Jul 26, 2024

Due to technical reasons I had to use the (by now) ancient wasmparser 0.100.2 in Wasmi and was using a fork called wasmparser-nostd. The fork had no significant impact on the performance.

Recently I updated Wasmi to use the newest wasmparser v0.214.0 and unfortunately saw massive performance regressions, especially for Wasm validation. Many Wasmi benchmarks regressed by 15-35% and some even by 45% compared to the Wasmi that uses the ancient wasmparser v0.100.2.

The Wasmi upgrade PR is here: wasmi-labs/wasmi#1141

Maybe I did some mistakes when upgrading which explain the performance regressions.

Since Wasmi is very focused on speedy translation times those performance regressions unfortunately are a show stopper for using the new wasmparser versions.

@alexcrichton
Copy link
Member

Do you have a link to the module with the largest regression? That'd probably be the easiest to start with in terms of evaluating validation performance on that module and seeing what sticks out in perf

@Robbepop
Copy link
Collaborator Author

Robbepop commented Jul 26, 2024

Do you mean Wasm module?

From what I saw tiny_keccak is a good contender. But there were others.
To benchmark tiny_keccak translation test cases in Wasmi, use:

cargo bench --bench benches translate/tiny_keccak/

The tiny_keccak benchmark can be found here:
https://github.com/wasmi-labs/rust-benchmarks/tree/main/cases/tiny_keccak

Update

I just found one reason why it regressed: I benchmarked without setting the no-hash-maps feature. Generally BTreeMap is faster than HashMap for Wasmi uses and the old wasmparser-nostd fork alsways uses BTreeMaps. With this feature set local benchmarks still indicate regressions but they are less severe, ranging between 10-25% instead of 15-35%. The unchecked (no validation) case now even has slightly improved performance which indicates very strongly to me that the regressions that we see come from the Wasm validation.

image

@alexcrichton
Copy link
Member

Ok I've been poking at this a bit and so far what I've found is that #701, the first commit after 0.100.0, led to a precipitous drop in performance. That was known at the time however. There are other smaller regressions here and there but they're looking to be more difficult to pin down.

Profiling the code itself does not show any unexpected hot spots. That means that it's just sort of spread out a lot.

Doing various tweaks seems to help here and there but nothing is a major improvement. For example outlining string-based errors to their own #[cold] function sometimes helps and sometimes doesn't.

One improvement that has shown promise is skipping initialization checking for locals entirely. This was first added in the function-references proposal and later proposals like gc rely on it as well. This yields a ~5% gain (ish) but requires completely removing the check. That means that only way to achieve this would be to add compile-time features to wasmparser.

Overall my guess is that the wasm language has just gotten more complicated over time which has slowed down the validator. I don't know of a great way to handle this other than to add a bunch of Cargo compile-time features for each wasm proposal. Doing that is not something I think would be easy to design but I think would provide the wins necessary here.

@Robbepop
Copy link
Collaborator Author

Robbepop commented Jul 26, 2024

Thank you for the investigation @alexcrichton !

If your assessment is correct I have to admit that it is a bit worrysome to me that the Wasm spec seems to develop into a direction that may not be suitable for all of its original users and stake holders.

I remember that you said that function-references are going to incur a performance overhead and in hindsight I was kinda optimistic that this performance penalty will surely be dealt with over time.

My assumption of how Wasm features are intended to work was that they should (ideally) not significantly influence the performance if they are unused. So only if I actually were to use the function-references proposal in a significant way in my Wasm blob, I'd "suffer" from the performance penalties that are associated to its additional complexity. Though, I totally understand that it is probably already hard enough to keep up with all the Wasm developments.

Conditional compilation wouldn't solve the problem from my point of view since eventually all Wasm runtimes (including Wasmi) want to adopt all stabilized features anyways. The Wasm standard is not modularised at the moment and thus does not allow to be picky.

@Robbepop
Copy link
Collaborator Author

Robbepop commented Jul 27, 2024

I just ran flamegraph on the translate/tiny_keccak/checked/lazy-translation test case:

main:
image

PR:
image

From what it looks I can confirm your assessment that multiple parts of the wasmparser validation pipeline seem to have regressed performance. However, especially notable parts (from the looks) are:

  • Validator::type_section
  • ModuleParser::process_end
  • VisitOperator::visit_local_get
  • VisitOperator::visit_end

One improvement that has shown promise is skipping initialization checking for locals entirely.

I forgot to ask last time: What exactly does this mean? What is initialization checking of locals and where can I find the code?

Looking just at translate_operators we can also see some significant shifts:
main:
image
PR:
image

@Robbepop
Copy link
Collaborator Author

Robbepop commented Jul 29, 2024

In my last performance assessment I made a mistake by not also enabling no-hash-maps for the main branch when conducting benchmarks between main and the PR which made the results look better for the PR than they actually are. The corrected results look even worse:

We see a performance regression between 15-45%. Still, unchecked workloads without Wasm validation are not regressed.

image

@alexcrichton
Copy link
Member

I would be hesitant to generalize wasm-tools's implementation to the entire spec development and other engines in the sense that we could just perhaps be bitten by our own decisions. This could be wasmparser-specific and not affect other implementations, I'm not sure. When I've benchmarked historically, for example, I believe that v8's validator (via node) has consistently been faster than wasmparser's. Basically I don't think wasmparser was ever at the peak of validation performance.

I also saw similar hot spots as you locally, but I was unable to figure out how best to optimize them and local attempts weren't bearing much fruit.

The initialization checking looks like this. That's an array lookup plus a boolean test of what's in the array. My guess is that everything is a well-predicted branch and in-cache since locals typically go to the same local. The 5% perf win I got (ish) was removing those checks entirely. That's not spec-compliant so we can't land it but it was the best I could do.

@Robbepop
Copy link
Collaborator Author

Robbepop commented Jul 29, 2024

You are right that this might very well be just an implementation detail that can be fixed.

Thanks a lot for the link to the local.get checks, I will have a look at the codebase to see whether it is possible to further apply some optimizations without removing those checks. E.g. I could imagine that we could use in-place bitfields for up to 32 (or 64) locals (common case) instead of Vec<bool> etc.. Maybe there are more things that could be improved. I will experiment with this a bit and come back.

@eqrion
Copy link
Collaborator

eqrion commented Jul 29, 2024

If it helps, here is the implementation in SpiderMonkey [1]. We do an upfront check for if the local index is less than where the first non-nullable local is and skip any boolean checking after that. Then we consult a bitmap. For setting a local, we also skip any consultation of the bitmap in the case where the local has been set [2].

So for applications that are not using any GC types, the only cost on local.get/set is the upfront check against the first non-nullable local index (which will always fail).

[1] https://searchfox.org/mozilla-central/rev/dd421ae14997e3bebb9c08634633a4a3e3edeffc/js/src/wasm/WasmOpIter.h#347-352
[2] https://searchfox.org/mozilla-central/rev/dd421ae14997e3bebb9c08634633a4a3e3edeffc/js/src/wasm/WasmOpIter.h#2283-2285

@Robbepop
Copy link
Collaborator Author

If it helps, here is the implementation in SpiderMonkey [1]. We do an upfront check for if the local index is less than where the first non-nullable local is and skip any boolean checking after that. Then we consult a bitmap. For setting a local, we also skip any consultation of the bitmap in the case where the local has been set [2].

So for applications that are not using any GC types, the only cost on local.get/set is the upfront check against the first non-nullable local index (which will always fail).

[1] https://searchfox.org/mozilla-central/rev/dd421ae14997e3bebb9c08634633a4a3e3edeffc/js/src/wasm/WasmOpIter.h#347-352 [2] https://searchfox.org/mozilla-central/rev/dd421ae14997e3bebb9c08634633a4a3e3edeffc/js/src/wasm/WasmOpIter.h#2283-2285

Thanks so much for sharing this information! That's invaluable. :)

@alexcrichton
Copy link
Member

Indeed thanks @eqrion! I suspect that both SpiderMonkey (and possibly v8 too) would be good sources of inspiration for algorithmic changes to the validator in wasmparser.

Around locals specifically another thing that wasmparser does is it doesn't maintain Vec<ValType> for locals and instead has a scheme where the first N locals are stored in Vec<ValType> but afterwards they're stored in Vec<(u32, ValType)> for a more compressed representation like wasm has. This representation is probably too-clever-by-half and probably isn't really benefitting all that much. Simplifying that to just Vec<ValType> would probably help clean up the code a bit and remove a branch here and there.

Another possibility is that there is currently zero unsafe code in the validator which while is a great property I don't personally consider set in stone forever necessarily. For example using the idea from SpiderMonkey checking for local initialization the first index of a non-nullable local could be the number of locals in a function if they're all nullable. That way after that "bounds check" the access into the actual types array could be done in an unchecked fashion to have just a single condition in the validation of local.get. This could also very well be "too clever by half" and introducing unsafe code is definitely something we'd want to do carefully.

@Robbepop
Copy link
Collaborator Author

Robbepop commented Oct 6, 2024

I updated the Wasmi PR to update wasmparser to v0.218.0 and improved/extended translation benchmarks and reran them to get a better understanding of where performance regressions are located and could be fixed.

Despite the performance regressions I consider merging the PR to finally make use of the newer wasmparser versions but I am keen on fixing the performance regressions, especially for the important lazy modes.

Link: wasmi-labs/wasmi#1141 (comment)

@alexcrichton
Copy link
Member

FWIW I'm poking around in #1845 about slimming down wasmparser optionally at compile-time and one thing I was thinking of was adding a GC-like feature which might help recover some performance here. I'm not 100% sure it'll work though nor if it's worth it, but figured it's worth poking around at least.

@Robbepop
Copy link
Collaborator Author

Robbepop commented Oct 7, 2024

Oh thanks a lot for the information! I like #1845 a lot even if it won't improve performance, reducing compile times is also a huge gain already. :)

@Robbepop
Copy link
Collaborator Author

Robbepop commented Oct 17, 2024

I opened #1870 for minor gains in the code section validation.

However, as the Wasmi benchmarks suggest, most of the performance regressions are in the Wasm validation parts outside the code section. I have to dig deeper to see which parts of the Wasm validator are the main offenders. @alexcrichton any clues?

@Robbepop
Copy link
Collaborator Author

Robbepop commented Oct 19, 2024

@alexcrichton I reran flamegraph on current Wasmi main compared to the PR to update wasmparser to find out the most regressing part of wasmparser since v0.100.0.

It is the wasmparser::Validator::type_section call that regressed by over 130%, thus requiring more than double the time compared to v0.100.0 (wasmparser-nostd` fork).

From what I can see, the following sub-parts take the most time in the benchmarks:

  • TypeList::intern_canonical_rec_group (2.48%)
  • <RecGroup as FromReader>::from_reader (2.15%)
  • <RecGroup as Ord>::cmp (1.06%)
  • InternRecGroup::check_subtype (0.88%)

All of which seem to be related to the gc Wasm proposal.

In comparison the sub-parts that take the most time for wasmparser v0.100.0 are:

  • <FuncType as FromReader>::from_reader (1.24%)
  • Module::add_type (0.82%)

It seems that the (computationally expensive) recgroup canonicalization and internment is always performed, even for type sections that do not have recursive definitions - which is the overwhelmingly common case, especially for non-GC wasm. If this is true, I think we could optimize this by only doing this canonicalization when necessary.

I.e. use an efficient way to register types that also detects (potentially) recursive references within them, and only use the expensive recgroup canonicalization and internment when needed as a fallback. If a detection of potentially recursive references cannot be implemented easily we could also simply use the WasmFeatures::GC (or WasmFeatures::GC_TYPES) flag.

This way users only pay for what they use.

@alexcrichton
Copy link
Member

Thanks for measuring this @Robbepop! That intuitively makes sense to me and we historically have spent most of our time looking at optimizing the operator validator rather than the general module validator.

Would you be able to extract an example module or two that showcases this slowdown? For example a module that only has a type section should be sufficient to showcase the slowdown and additionally track improvements.

@alexcrichton
Copy link
Member

(sorry still recovering from jetlag so responding to the rest of your comment now too)

I think we could optimize this by only doing this canonicalization when necessary

Makes sense to me! I'd prefer to lean on "detect a recursive type" if possible and only use WasmFeatures as a last resort as well, but I agree that the ultimate decision maker there will be benchmarks one way or another.

@Robbepop
Copy link
Collaborator Author

Robbepop commented Oct 22, 2024

Would you be able to extract an example module or two that showcases this slowdown? For example a module that only has a type section should be sufficient to showcase the slowdown and additionally track improvements.

There already is a lots-of-types.wasm benchmark test case in the wasmparser crate. I took a look at it and it consists of tons of empty (func) types, which probably won't take a burden on the type section validation. I would rename this test case to lots-of-empty-fn-types.wasm and add another with lots of unique function types, named lots-of-unique-fn-types.wasm.

Makes sense to me! I'd prefer to lean on "detect a recursive type" if possible and only use WasmFeatures as a last resort as well, but I agree that the ultimate decision maker there will be benchmarks one way or another.

I too think it is better not to use the WasmFeatures if possible. Another idea I had, was to make internment and canonicalization a lazy process and only ever do whatever is computationally expensive just when it is actually needed for a particular type. In contrast to the rec-type detection idea, this kinda turns the eager validation into a lazy one. Not sure if possible, but I think this would be an even cleaner design.

@alexcrichton
Copy link
Member

All that sounds reasonable to me yeah. I haven't dug into why this is so much slower than before myself so I can't speak too much as to whether it seems like the best route forward, but I'm happy to defer to you as well.

@Robbepop
Copy link
Collaborator Author

Robbepop commented Oct 22, 2024

So far I have only digged into <RecGroup as FromReader>::from_reader regressions. I tried to lightly modify the code, e.g. with some annotations, inline, #[cold], extracting functions, putting common cases in front but without any major performance wins while the code became much uglier, thus I decided not to open a PR.

@Robbepop
Copy link
Collaborator Author

Robbepop commented Nov 13, 2024

@alexcrichton I have been looking into the type section validation of the wasmparser crate recently and was trying to wrap my head around there. I have to admit, that I am swimming a bit in rough waters to coming up with a general implementation with lazy behavior to optimize the common case. I think it makes sense to start with a simple, optimization for common case and then elaborate further.

I implemented specialized type section validation here: #1906

@Robbepop
Copy link
Collaborator Author

Robbepop commented Nov 16, 2024

I just tested this PR on Wasmi's wasmparser update PR with the following results:

tiny_keccak.wasm

Benchmark v0.100 v0.220 PR
eager/checked 76.3µs 83.9µs (+10.0%) 82.6µs (+8.2%)
eager/checked+metered 78.3µs 87.3µs (+11.3%) 85.0µs (+8.3%)
eager/unchecked 58.5µs 58.5µs 59.3µs
lazy-translation/checked 32.6µs 40.8µs (+21.8%) 38.3µs (+14.3%)
lazy/checked 6.3µs 8.8µs (+35.5%) 7.2µs (+11.3%
lazy/unchecked 3.6µs 3.9µs (+6.1%) 3.8µs (+5.9%)

spidermonkey.wasm

Benchmark v0.100 v0.220 PR
eager/checked 35.7ms 38ms (+5.6%) 38.5ms (+5.9%)
eager/checked+metered 37.0ms 40.1ms (+6.6%) 40.1ms (+6.6%
eager/unchecked 28.5ms 28.9ms 29.1ms
lazy-translation/checked 15.2ms 16.9ms (+10.0%) 16.8ms (+9.8%)
lazy/checked 1.0ms 1.1ms (+6.0%) 1.0ms
lazy/unchecked 725µs 740µs 745µs

Summary: Still not ideal for small Wasm files, but much better compared to v0.220.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants