
Use NaN-boxing on value::InnerValue #4091

Open · wants to merge 29 commits into base: main

Conversation

@hansl (Contributor) commented Dec 20, 2024

This adds a new InnerValue that uses NaN-boxing so that JsValue ultimately takes only 64 bits (8 bytes). This allows JsValue to fit into one x86_64 register, which should greatly improve performance.

For more details on NaN-boxing (and its alternative, pointer tagging), see https://piotrduperas.com/posts/nan-boxing which is a great tutorial (in C) on how to get there. A proper explanation of each tag/range of values is described in the header of the inner.rs module.

Fixes #1373.
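As a rough illustration of the technique for reviewers (hypothetical names and tag values; the PR's actual layout is documented in inner.rs): non-float values are stored inside the payload bits of quiet-NaN patterns that the FPU never produces, so plain doubles cost nothing to unbox.

```rust
// Minimal NaN-boxing sketch (illustration only; not the PR's real layout).
// Plain doubles are stored as their own bits; other values live inside
// quiet-NaN bit patterns that ordinary float arithmetic never produces.
const QNAN: u64 = 0x7FF8_0000_0000_0000; // canonical quiet-NaN bits
const TAG_INT: u64 = QNAN | (1u64 << 48); // hypothetical tag for 32-bit ints
const PAYLOAD: u64 = 0xFFFF_FFFF; // low 32 bits carry the integer

#[derive(Clone, Copy)]
struct Boxed(u64);

impl Boxed {
    fn from_f64(x: f64) -> Self {
        // Canonicalize NaNs so arbitrary NaN payloads can't collide with tags.
        if x.is_nan() { Boxed(QNAN) } else { Boxed(x.to_bits()) }
    }
    fn from_i32(i: i32) -> Self {
        Boxed(TAG_INT | (i as u32 as u64))
    }
    fn as_f64(self) -> Option<f64> {
        // Anything outside the tagged quiet-NaN space is a plain double;
        // the canonical NaN itself is also a plain double.
        if (self.0 & QNAN) != QNAN || self.0 == QNAN {
            Some(f64::from_bits(self.0))
        } else {
            None
        }
    }
    fn as_i32(self) -> Option<i32> {
        if (self.0 & !PAYLOAD) == TAG_INT {
            Some(self.0 as u32 as i32)
        } else {
            None
        }
    }
}
```

The whole value stays 8 bytes, which is the point of the PR: it fits in one register.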

@hansl (Contributor, author) commented Dec 20, 2024

All tests in inner.rs are passing. Unfortunately, running cargo test -p boa_engine currently results in segfaults (invalid memory reference). This will remain a draft until at least all tests pass.

core/engine/src/value/inner.rs (resolved review thread, outdated)
@hansl hansl marked this pull request as ready for review December 20, 2024 23:38
@hansl hansl requested a review from a team December 20, 2024 23:38
@raskad (Member) commented Dec 21, 2024

Still failing some tests:

| Test result | main count | PR count | difference |
|---|---|---|---|
| Total | 48,625 | 48,625 | 0 |
| Passed | 43,616 | 43,606 | -10 |
| Ignored | 1,471 | 1,471 | 0 |
| Failed | 3,538 | 3,548 | +10 |
| Panics | 0 | 0 | 0 |
| Conformance | 89.70% | 89.68% | -0.02% |
Broken tests (10):
test/built-ins/TypedArray/prototype/map/return-new-typedarray-conversion-operation.js (previously Passed)
test/built-ins/TypedArray/prototype/set/typedarray-arg-set-values-diff-buffer-other-type-conversions.js (previously Passed)
test/built-ins/TypedArray/prototype/set/array-arg-src-tonumber-value-conversions.js (previously Passed)
test/built-ins/TypedArray/prototype/fill/fill-values-conversion-operations.js (previously Passed)
test/built-ins/Map/valid-keys.js (previously Passed)
test/built-ins/DataView/prototype/setFloat64/set-values-return-undefined.js (previously Passed)
test/built-ins/DataView/prototype/setFloat32/set-values-return-undefined.js (previously Passed)
test/built-ins/TypedArrayConstructors/ctors/object-arg/conversion-operation.js (previously Passed)
test/built-ins/TypedArrayConstructors/internals/DefineOwnProperty/conversion-operation.js (previously Passed)
test/built-ins/TypedArrayConstructors/internals/Set/conversion-operation.js (previously Passed)

But the bigger issue that I see is performance. I ran the benchmarks locally and the performance was worse in all benchmarks, in some by almost 40%. I thought a performance tradeoff vs lower memory usage seemed reasonable and expected for this change, but this seems way too extreme.

Benchmarks before:

RESULT Richards 78.5
RESULT DeltaBlue 71.8
RESULT Crypto 80.3
RESULT RayTrace 182
RESULT EarleyBoyer 227
RESULT RegExp 43.7
RESULT Splay 326
RESULT NavierStokes 187
SCORE 122

After:

RESULT Richards 49.9
RESULT DeltaBlue 43.7
RESULT Crypto 77.7
RESULT RayTrace 138
RESULT EarleyBoyer 161
RESULT RegExp 38.9
RESULT Splay 214
RESULT NavierStokes 163
SCORE 92.0

@jedel1043 (Member) commented Dec 21, 2024

@raskad @hansl As you probably already know, not all memory optimizations will yield (or even preserve) performance gains. It may be that way sometimes, but there's this balance between memory and speed that big memory optimizations (such as this one) will have to deal with.

The first step is to make every test work, then we'd have to optimize repeatedly to recover all the lost performance, and hopefully we'll have an engine that's as fast as before but using a fraction of the memory.

Having said that, we should measure what the overall gains in memory are, because this should save a considerable amount of memory to be worth putting more work into.

@hansl (Contributor, author) commented Dec 21, 2024

There's no rush for this, since the actual breaking change already happened :)

I'll run benchmarks on my own, and I'll try to profile memory, in the next 2 weeks or so. It's also possible that I'm missing a few easy inlines, and adding some bit magic might make things faster down the line.

First things first, I'll make sure test262 has no regressions.

@hansl (Contributor, author) commented Dec 21, 2024

@raskad BTW this should cover both: it should take less memory AND be more performant. But I guess we pass JsValue more by reference than by value. Now that we can move JsValue for free, maybe we should reconsider some other parts of the code as well. Later work.

This article shows a 28% increase in performance for a tagged union of pointers in JavaScriptCore (WebKit's engine). Though to be fair, we weren't using pointers.

codecov bot commented Dec 21, 2024

Codecov Report

Attention: Patch coverage is 76.39155% with 123 lines in your changes missing coverage. Please review.

Project coverage is 53.85%. Comparing base (6ddc2b4) to head (5d406c7).
Report is 335 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| core/engine/src/value/operations.rs | 51.56% | 62 Missing ⚠️ |
| core/engine/src/value/conversions/try_from_js.rs | 20.00% | 20 Missing ⚠️ |
| core/engine/src/value/inner/nan_boxed.rs | 91.32% | 17 Missing ⚠️ |
| core/engine/src/value/mod.rs | 86.45% | 13 Missing ⚠️ |
| core/engine/src/value/equality.rs | 89.47% | 4 Missing ⚠️ |
| core/engine/src/value/hash.rs | 60.00% | 4 Missing ⚠️ |
| core/engine/src/value/conversions/serde_json.rs | 66.66% | 3 Missing ⚠️ |
Additional details and impacted files
```
@@            Coverage Diff             @@
##             main    #4091      +/-   ##
==========================================
+ Coverage   47.24%   53.85%   +6.60%
==========================================
  Files         476      486      +10
  Lines       46892    48534    +1642
==========================================
+ Hits        22154    26136    +3982
+ Misses      24738    22398    -2340
```


@raskad (Member) commented Dec 21, 2024

@hansl I did a quick perf run on the release-dbg profile and noticed that there are a lot of operations relating to the Box that the JsObject is now wrapped in. For example, in the function JsObject::try_get we call self.clone().into() on the JsObject to turn it into a JsValue. In my baseline the into call only makes up 0.5% of the function runtime, but now, with the alloc for the Box, it takes 6.2%. Similarly, dropping that JsValue later went from 2.3% to 8.4% of the time.
I'm not sure whether that explains everything, since I can't really pin it down in the inverted call stack, but it might be a starting point.

@jedel1043 (Member) replied:

> @hansl I did a quick perf run on the release-dbg profile and noticed that there are a lot of operations relating to the Box that the JsObject is now wrapped in. For example, in the function JsObject::try_get we call self.clone().into() on the JsObject to turn it into a JsValue. In my baseline the into call only makes up 0.5% of the function runtime, but now, with the alloc for the Box, it takes 6.2%. Similarly, dropping that JsValue later went from 2.3% to 8.4% of the time.
> I'm not sure whether that explains everything, since I can't really pin it down in the inverted call stack, but it might be a starting point.

That makes a lot of sense actually, because clones of JsValues containing JsObjects are now not as cheap as just incrementing a ref count.

The "proper" solution would be to make JsObject 8 bytes again by using manual vtables instead of dyn objects (something very similar to what we do on GcBox).
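For readers following along, the thin-pointer trick can be sketched generically (hypothetical types; boa's actual GcBox machinery differs): a `Box<dyn Trait>` is a 16-byte fat pointer (data pointer plus vtable pointer), but if the vtable lives inside the allocation itself, the handle shrinks back to a single 8-byte thin pointer.

```rust
// Manual-vtable sketch: one function pointer stands in for a full vtable.
#[repr(C)]
struct Header {
    // In a real engine this would be a pointer to a static vtable with
    // drop/trace/etc.; a single fn pointer keeps the sketch small.
    type_name: fn() -> &'static str,
}

#[repr(C)]
struct Node<T> {
    header: Header, // must come first so `*const Node<T>` casts to `*const Header`
    value: T,
}

// Type-erase to a thin pointer: behavior is still reachable via the header.
fn erase<T>(node: &Node<T>) -> *const Header {
    node as *const Node<T> as *const Header
}
```

The erased handle is one pointer wide, so a `JsValue` wrapping it can stay NaN-boxable without an extra allocation per clone.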

@hansl (Contributor, author) commented Dec 22, 2024

@raskad If we NaN-boxed Gc and GcBox instead of Box-ing, this might go away, similar to what @jedel1043 is saying, since cloning would be essentially free. I need to put the value on the heap though, so just receiving and taking a pointer to a JsObject is not an option.

Let me make sure we don't regress on test262 then I'll look into the performance.

@hansl (Contributor, author) commented Dec 23, 2024

Okay this one is fascinating...

If I only create one value in an expression, e.g. -0 or [-0] or [[[[[-0]]]]], it keeps the sign just fine.

If I create multiple negative zeroes in an expression, only the first one keeps its sign: [-0, -0, -0, -0] becomes [-0, 0, 0, 0]... I'm looking for a place where we would cache these, but I can't seem to find one. I'm adding tests everywhere, but I'm flailing a bit.

Cloning the negative zero keeps the sign, copying is not supported, ... It seems the bug is not in my code, but that's the only thing I changed... Any ideas?
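For context on why this is subtle: -0.0 == 0.0 under IEEE 754, so the sign of a zero can only be observed at the bit level. A minimal check (not the PR's code):

```rust
// `-0.0 == 0.0` is true, so an equality-based round-trip silently drops the
// sign bit; only the raw bit pattern distinguishes the two zeros.
fn is_negative_zero(x: f64) -> bool {
    x.to_bits() == (-0.0f64).to_bits() // sign bit set, all other bits zero
}
```

Any cache or dedup step that compares values with `==` would therefore happily merge -0 into +0.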

@hansl (Contributor, author) commented Dec 26, 2024

There's something funny with the parser and/or optimizer; in any case, with [-0, -0] the two values don't take the same code path. Found a bug anyway, so that's great. Fixed the bug; ready for reviews.

I'm still planning on running the benchmark before the end of the year.

@hansl (Contributor, author) commented Dec 29, 2024

Latest on my laptop:

Before:

RESULT Richards 107
RESULT DeltaBlue 103
RESULT Crypto 126
RESULT RayTrace 283
RESULT EarleyBoyer 343
RESULT RegExp 47.8
RESULT Splay 454
RESULT NavierStokes 295
SCORE 175

After:

RESULT Richards 80.9
RESULT DeltaBlue 79.6
RESULT Crypto 128
RESULT RayTrace 231
RESULT EarleyBoyer 284
RESULT RegExp 45.8
RESULT Splay 369
RESULT NavierStokes 247
SCORE 148

So not as dramatic as @raskad's result (though I didn't run the benchmark at that commit). This is with the latest changes of using ranges and const bit masks.

@hansl (Contributor, author) commented Dec 29, 2024

After the last fix which should fix everything, this is the latest benchmark:

RESULT Richards 85.5
RESULT DeltaBlue 84.0
RESULT Crypto 131
RESULT RayTrace 243
RESULT EarleyBoyer 303
RESULT RegExp 47.2
RESULT Splay 368
RESULT NavierStokes 251
SCORE 153

So a bit better. The next step would be to remove the extra deref of pointers (bigint, object, symbol and string), which I'll argue is out of scope for this PR.

I'll clean up some more (some magic numbers are still in), but this should be ready to review soon.

@hansl (Contributor, author) commented Dec 29, 2024

Test262 conformance changes

| Test result | main count | PR count | difference |
|---|---|---|---|
| Total | 48,625 | 48,625 | 0 |
| Passed | 43,947 | 43,947 | 0 |
| Ignored | 1,471 | 1,471 | 0 |
| Failed | 3,207 | 3,207 | 0 |
| Panics | 0 | 0 | 0 |
| Conformance | 90.38% | 90.38% | 0.00% |

@raskad (Member) left a review:

I already have some comments, but I still have to look at the actual inner value code; I've only skimmed it so far.
All in all, really nice work!

I ran the benchmarks again and noticed that while there is some significant variance between the individual benchmarks, the overall score is now almost on par with the main branch. Specifically, the Crypto and NavierStokes benchmarks got faster. @hansl Is this a trend you can also confirm when you run them? Otherwise I'll have to take a look at why this might be the case on my setup.

One general question regarding the feature flag: do we plan to remove it once certain criteria are met, or do we want to keep the flag? If we want to keep it, I think we need to think about how we can test the code with and without it to be sure that everything works with both features. We would also have to choose a default feature for our benchmarks etc.
Based on just the additional effort with all the tooling, I would tend to remove it at some point. Any other thoughts?

core/engine/src/builtins/array/tests.rs (resolved review thread, outdated)
Comment on lines +300 to 306

```rust
} else if let Some(rational) = self.0.as_float64() {
    // Use this poor-man's check as `[f64::fract]` isn't const.
    if rational == ((rational as i32) as f64) {
        Some(rational as i32)
    } else {
        None
    }
```
Member commented:

I think we could remove this branch now, right? I ran the 262 tests locally with the f64 part removed and all tests passed.

@hansl (author) replied:

You'll have to explain to me what code you removed. I removed that let Some(rational) = self.0.as_float64() branch and got a lot of tests to fail. Is that the branch you meant?
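Standalone, the check under discussion looks like this (a sketch with a hypothetical free-function signature; the real method also consults the NaN-box tag first):

```rust
// Round-trip check from the diff above: an f64 represents an exact i32 iff
// casting to i32 and back reproduces the same value. Note that it maps -0.0
// to Some(0), which is one reason the float branch needs care around signed
// zeros; Rust's `as` cast also saturates out-of-range and NaN inputs, so
// those fail the round-trip and return None.
fn as_exact_i32(rational: f64) -> Option<i32> {
    if rational == ((rational as i32) as f64) {
        Some(rational as i32)
    } else {
        None
    }
}
```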

core/engine/src/value/inner/nan_boxed.rs (resolved review thread, outdated)
@hansl (Contributor, author) commented Jan 1, 2025

> One general question regarding the feature flag: do we plan to remove it once certain criteria are met, or do we want to keep the flag? If we want to keep it, I think we need to think about how we can test the code with and without it to be sure that everything works with both features. We would also have to choose a default feature for our benchmarks etc.
> Based on just the additional effort with all the tooling, I would tend to remove it at some point. Any other thoughts?

My plan when adding the feature flag was to be quickly able to switch between the two implementations to run benchmarks and tests for this PR.

We might decide to merge the flag into the main branch, e.g. if we want to cut a release and test client projects with NaN boxing before actually making it the default. I did not intend it to live through a full release, but I also didn't want to commit to NaN boxing before we were 100% sure about it, and the flag gives us an easy revert.

@hansl (Contributor, author) commented Jan 1, 2025

Benchmarks (using taskpolicy -b on macOS restricts the process to the efficiency cores, which explains why it's so slow, but it makes for a fairer comparison):

Enum-based:

```shell
$ cargo build --release --bin boa
...
$ taskpolicy -b ./target/release/boa -O ~/Sources/boa-dev/data/bench/bench-v8/combined.js
PROGRESS Richards
RESULT Richards 19.7
PROGRESS DeltaBlue
RESULT DeltaBlue 18.5
PROGRESS Encrypt
PROGRESS Decrypt
RESULT Crypto 21.2
PROGRESS RayTrace
RESULT RayTrace 48.8
PROGRESS Earley
PROGRESS Boyer
RESULT EarleyBoyer 58.8
PROGRESS RegExp
RESULT RegExp 10.9
PROGRESS Splay
RESULT Splay 82.3
PROGRESS NavierStokes
RESULT NavierStokes 50.4
SCORE 31.6
```

NaN-boxed:

```shell
$ cargo build --release --bin boa --features nan-box-jsvalue
...
$ taskpolicy -b ./target/release/boa -O ~/Sources/boa-dev/data/bench/bench-v8/combined.js
PROGRESS Richards
RESULT Richards 12.8
PROGRESS DeltaBlue
RESULT DeltaBlue 12.2
PROGRESS Encrypt
PROGRESS Decrypt
RESULT Crypto 22.0
PROGRESS RayTrace
RESULT RayTrace 37.5
PROGRESS Earley
PROGRESS Boyer
RESULT EarleyBoyer 45.2
PROGRESS RegExp
RESULT RegExp 9.97
PROGRESS Splay
RESULT Splay 56.7
PROGRESS NavierStokes
RESULT NavierStokes 44.5
SCORE 24.9
```

@hansl (Contributor, author) commented Jan 5, 2025

On the latest commit, I changed the pointer tagging to use the top bits of the pointer. This means that we lose bits of the pointer value itself (after doing some research, 48 bits should be enough), but we gain performance by having non-overlapping ranges for pointers. It seems to speed things up a bit overall:

RESULT Richards 87.9
RESULT DeltaBlue 85.6
RESULT Crypto 142
RESULT RayTrace 250
RESULT EarleyBoyer 312
RESULT RegExp 47.4
RESULT Splay 437
RESULT NavierStokes 256
SCORE 161

The final score is getting really close to where it was before this PR on my machine.
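The top-bit scheme described here can be sketched as follows (hypothetical tag values and function names; the PR's actual constants are documented in nan_boxed.rs):

```rust
// Top-bit pointer tagging: on mainstream 64-bit platforms, user-space
// pointers fit in the low 48 bits, leaving the high 16 bits free for a
// type tag. Because tags occupy disjoint top-bit ranges, a type check is
// a single mask-and-compare with no overlap between pointer kinds.
const TAG_SHIFT: u32 = 48;
const PTR_MASK: u64 = (1u64 << TAG_SHIFT) - 1; // low 48 bits

fn tag_ptr(ptr: u64, tag: u16) -> u64 {
    // A pointer wider than 48 bits would corrupt the tag.
    debug_assert_eq!(ptr & !PTR_MASK, 0);
    ((tag as u64) << TAG_SHIFT) | ptr
}

fn untag_ptr(tagged: u64) -> (u16, u64) {
    ((tagged >> TAG_SHIFT) as u16, tagged & PTR_MASK)
}
```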

Comment on lines 22 to 25

```rust
//! | `BigInt` | `7FF8:PPPP:PPPP:PPPP` | 49-bits pointer. Assumes non-null pointer. |
//! | `Object` | `7FFA:PPPP:PPPP:PPPP` | 49-bits pointer. |
//! | `Symbol` | `7FFC:PPPP:PPPP:PPPP` | 49-bits pointer. |
//! | `String` | `7FFE:PPPP:PPPP:PPPP` | 49-bits pointer. |
```
Member commented:

So in the pointer space we only have one unused slot left, right? Which is 7FF8?

@hansl (author) replied:

7FF8 (or 0b0111_1111_1111_1000) is used by BigInt if the pointer is non-null. The 16th MSB isn't used (yet?), so pointers are actually 48 bits (I'll correct the comment).

@hansl (author) replied:

Let me know if the commit helps clarify this.

Member replied:

Yes, sorry, I meant to ask if 7FFF is unused and reserved.

@hansl (author) replied:

It is unused and reserved for now.

@hansl hansl requested review from jasonwilliams and raskad January 7, 2025 16:43
Development

Successfully merging this pull request may close these issues.

NaN boxed JavaScript Value
4 participants