
Use NaN-boxing on value::InnerValue #4091

Open · wants to merge 29 commits into base: main

Conversation

@hansl (Contributor) commented Dec 20, 2024

This adds a new InnerValue that uses NaN-boxing so that JsValue ultimately takes only 64 bits (8 bytes). This allows JsValue to fit into one x86_64 register, which should greatly improve performance.

For more details on NaN-boxing (and its alternative, pointer tagging), see https://piotrduperas.com/posts/nan-boxing which is a great tutorial (in C) on how to get there. A proper explanation of each tag/range of values is described in the header of the inner.rs module.

Fixes #1373.
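As a rough illustration of the technique for reviewers (hypothetical names and tag values; the PR's actual layout is documented in inner.rs): non-float values are stored inside the payload bits of quiet-NaN patterns that the FPU never produces, so plain doubles cost nothing to unbox.

```rust
// Minimal NaN-boxing sketch (illustration only; not the PR's real layout).
// Plain doubles are stored as their own bits; other values live inside
// quiet-NaN bit patterns that ordinary float arithmetic never produces.
const QNAN: u64 = 0x7FF8_0000_0000_0000; // canonical quiet-NaN bits
const TAG_INT: u64 = QNAN | (1u64 << 48); // hypothetical tag for 32-bit ints
const PAYLOAD: u64 = 0xFFFF_FFFF; // low 32 bits carry the integer

#[derive(Clone, Copy)]
struct Boxed(u64);

impl Boxed {
    fn from_f64(x: f64) -> Self {
        // Canonicalize NaNs so arbitrary NaN payloads can't collide with tags.
        if x.is_nan() { Boxed(QNAN) } else { Boxed(x.to_bits()) }
    }
    fn from_i32(i: i32) -> Self {
        Boxed(TAG_INT | (i as u32 as u64))
    }
    fn as_f64(self) -> Option<f64> {
        // Anything outside the tagged quiet-NaN space is a plain double;
        // the canonical NaN itself is also a plain double.
        if (self.0 & QNAN) != QNAN || self.0 == QNAN {
            Some(f64::from_bits(self.0))
        } else {
            None
        }
    }
    fn as_i32(self) -> Option<i32> {
        if (self.0 & !PAYLOAD) == TAG_INT {
            Some(self.0 as u32 as i32)
        } else {
            None
        }
    }
}
```

The whole value stays 8 bytes, which is the point of the PR: it fits in one register.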

@hansl (Contributor, author) commented Dec 20, 2024

All tests in inner.rs are passing. Unfortunately, running cargo test -p boa_engine currently results in segfaults (invalid memory reference). This will remain a draft until at least all tests pass.

core/engine/src/value/inner.rs (resolved review thread, outdated)
@hansl hansl marked this pull request as ready for review December 20, 2024 23:38
@hansl hansl requested a review from a team December 20, 2024 23:38
@raskad (Member) commented Dec 21, 2024

Still failing some tests:

| Test result | main count | PR count | difference |
|---|---|---|---|
| Total | 48,625 | 48,625 | 0 |
| Passed | 43,616 | 43,606 | -10 |
| Ignored | 1,471 | 1,471 | 0 |
| Failed | 3,538 | 3,548 | +10 |
| Panics | 0 | 0 | 0 |
| Conformance | 89.70% | 89.68% | -0.02% |
Broken tests (10):
test/built-ins/TypedArray/prototype/map/return-new-typedarray-conversion-operation.js (previously Passed)
test/built-ins/TypedArray/prototype/set/typedarray-arg-set-values-diff-buffer-other-type-conversions.js (previously Passed)
test/built-ins/TypedArray/prototype/set/array-arg-src-tonumber-value-conversions.js (previously Passed)
test/built-ins/TypedArray/prototype/fill/fill-values-conversion-operations.js (previously Passed)
test/built-ins/Map/valid-keys.js (previously Passed)
test/built-ins/DataView/prototype/setFloat64/set-values-return-undefined.js (previously Passed)
test/built-ins/DataView/prototype/setFloat32/set-values-return-undefined.js (previously Passed)
test/built-ins/TypedArrayConstructors/ctors/object-arg/conversion-operation.js (previously Passed)
test/built-ins/TypedArrayConstructors/internals/DefineOwnProperty/conversion-operation.js (previously Passed)
test/built-ins/TypedArrayConstructors/internals/Set/conversion-operation.js (previously Passed)

But the bigger issue that I see is performance. I ran the benchmarks locally and the performance was worse in all benchmarks, in some by almost 40%. I thought a performance tradeoff vs lower memory usage seemed reasonable and expected for this change, but this seems way too extreme.

Benchmarks before:

RESULT Richards 78.5
RESULT DeltaBlue 71.8
RESULT Crypto 80.3
RESULT RayTrace 182
RESULT EarleyBoyer 227
RESULT RegExp 43.7
RESULT Splay 326
RESULT NavierStokes 187
SCORE 122

After:

RESULT Richards 49.9
RESULT DeltaBlue 43.7
RESULT Crypto 77.7
RESULT RayTrace 138
RESULT EarleyBoyer 161
RESULT RegExp 38.9
RESULT Splay 214
RESULT NavierStokes 163
SCORE 92.0

@jedel1043 (Member) commented Dec 21, 2024

@raskad @hansl As you probably already know, not all memory optimizations will yield (or even preserve) performance gains. It may be that way sometimes, but there's this balance between memory and speed that big memory optimizations (such as this one) will have to deal with.

The first step is to make every test work, then we'd have to optimize repeatedly to recover all the lost performance, and hopefully we'll have an engine that's as fast as before but using a fraction of the memory.

Having said that, we should measure what the overall gains in memory are, because this should save a considerable amount of memory to be worth putting more work into.

@hansl (Contributor, author) commented Dec 21, 2024

There's no rush for this, since the actual breaking change already happened :)

I'll run benchmarks on my own, and I'll try to profile memory, in the next 2 weeks or so. It's also possible that I'm missing a few easy inlines, and adding some bit magic might make things faster down the line.

First things first, I'll make sure test262 has no regressions.

@hansl (Contributor, author) commented Dec 21, 2024

@raskad BTW this should cover both: it should take less memory AND be more performant. But I guess we pass JsValue more by reference than by value. Now that we can move JsValue for free, maybe we should reconsider some other parts of the code as well. Later work.

This article shows a 28% increase in performance for a tagged union of pointers in JavaScriptCore (WebKit's engine). Though to be fair, we weren't using pointers.

codecov bot commented Dec 21, 2024

Codecov Report

Attention: Patch coverage is 76.39155% with 123 lines in your changes missing coverage. Please review.

Project coverage is 53.85%. Comparing base (6ddc2b4) to head (5d406c7).
Report is 335 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| core/engine/src/value/operations.rs | 51.56% | 62 Missing ⚠️ |
| core/engine/src/value/conversions/try_from_js.rs | 20.00% | 20 Missing ⚠️ |
| core/engine/src/value/inner/nan_boxed.rs | 91.32% | 17 Missing ⚠️ |
| core/engine/src/value/mod.rs | 86.45% | 13 Missing ⚠️ |
| core/engine/src/value/equality.rs | 89.47% | 4 Missing ⚠️ |
| core/engine/src/value/hash.rs | 60.00% | 4 Missing ⚠️ |
| core/engine/src/value/conversions/serde_json.rs | 66.66% | 3 Missing ⚠️ |
Additional details and impacted files
```
@@            Coverage Diff             @@
##             main    #4091      +/-   ##
==========================================
+ Coverage   47.24%   53.85%   +6.60%
==========================================
  Files         476      486      +10
  Lines       46892    48534    +1642
==========================================
+ Hits        22154    26136    +3982
+ Misses      24738    22398    -2340
```


@raskad (Member) commented Dec 21, 2024

@hansl I did a quick perf run on the release-dbg profile and noticed that there are a lot of operations relating to the Box that the JsObject is now wrapped in. For example, in the function JsObject::try_get we call self.clone().into() on the JsObject to turn it into a JsValue. In my baseline the into call only makes up 0.5% of the function runtime, but now, with the alloc for the Box, it takes 6.2%. Similarly, dropping that JsValue later went from 2.3% to 8.4% of the time.
I'm not sure whether that explains everything, since I can't really pin it down in the inverted call stack, but it might be a starting point.

@jedel1043 (Member) replied:

> @hansl I did a quick perf run on the release-dbg profile and noticed that there are a lot of operations relating to the Box that the JsObject is now wrapped in. For example, in the function JsObject::try_get we call self.clone().into() on the JsObject to turn it into a JsValue. In my baseline the into call only makes up 0.5% of the function runtime, but now, with the alloc for the Box, it takes 6.2%. Similarly, dropping that JsValue later went from 2.3% to 8.4% of the time.
> I'm not sure whether that explains everything, since I can't really pin it down in the inverted call stack, but it might be a starting point.

That makes a lot of sense actually, because clones of JsValues containing JsObjects are now not as cheap as just incrementing a ref count.

The "proper" solution would be to make JsObject 8 bytes again by using manual vtables instead of dyn objects (something very similar to what we do on GcBox).
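For readers following along, the thin-pointer trick can be sketched generically (hypothetical types; boa's actual GcBox machinery differs): a `Box<dyn Trait>` is a 16-byte fat pointer (data pointer plus vtable pointer), but if the vtable lives inside the allocation itself, the handle shrinks back to a single 8-byte thin pointer.

```rust
// Manual-vtable sketch: one function pointer stands in for a full vtable.
#[repr(C)]
struct Header {
    // In a real engine this would be a pointer to a static vtable with
    // drop/trace/etc.; a single fn pointer keeps the sketch small.
    type_name: fn() -> &'static str,
}

#[repr(C)]
struct Node<T> {
    header: Header, // must come first so `*const Node<T>` casts to `*const Header`
    value: T,
}

// Type-erase to a thin pointer: behavior is still reachable via the header.
fn erase<T>(node: &Node<T>) -> *const Header {
    node as *const Node<T> as *const Header
}
```

The erased handle is one pointer wide, so a `JsValue` wrapping it can stay NaN-boxable without an extra allocation per clone.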

@hansl (Contributor, author) commented Dec 22, 2024

@raskad If we NaN-boxed Gc and GcBox instead of Box-ing, this might go away, similar to what @jedel1043 is saying, since cloning would be essentially free. I need to put the value on the heap though, so just receiving and taking a pointer to a JsObject is not an option.

Let me make sure we don't regress on test262 then I'll look into the performance.

@hansl (Contributor, author) commented Dec 23, 2024

Okay this one is fascinating...

If I only create one value in an expression, e.g. -0 or [-0] or [[[[[-0]]]]], it keeps the sign just fine.

If I create multiple negative zeroes in an expression, only the first one keeps its sign: [-0, -0, -0, -0] becomes [-0, 0, 0, 0]... I'm looking for a place where we would cache these, but I can't seem to find one. I'm adding tests everywhere, but I'm flailing a bit.

Cloning the negative zero keeps the sign, copying is not supported, ... It seems the bug is not in my code, but that's the only thing I changed... Any ideas?
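For context on why this is subtle: -0.0 == 0.0 under IEEE 754, so the sign of a zero can only be observed at the bit level. A minimal check (not the PR's code):

```rust
// `-0.0 == 0.0` is true, so an equality-based round-trip silently drops the
// sign bit; only the raw bit pattern distinguishes the two zeros.
fn is_negative_zero(x: f64) -> bool {
    x.to_bits() == (-0.0f64).to_bits() // sign bit set, all other bits zero
}
```

Any cache or dedup step that compares values with `==` would therefore happily merge -0 into +0.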

@hansl (Contributor, author) commented Dec 26, 2024

There's something funny with the parser and/or optimizer; in any case, with [-0, -0] the two values don't take the same code path. Found a bug anyway, so that's great. Fixed the bug; ready for reviews.

I'm still planning on running the benchmark before the end of the year.

@hansl (Contributor, author) commented Dec 29, 2024

Latest on my laptop:

Before:

RESULT Richards 107
RESULT DeltaBlue 103
RESULT Crypto 126
RESULT RayTrace 283
RESULT EarleyBoyer 343
RESULT RegExp 47.8
RESULT Splay 454
RESULT NavierStokes 295
SCORE 175

After:

RESULT Richards 80.9
RESULT DeltaBlue 79.6
RESULT Crypto 128
RESULT RayTrace 231
RESULT EarleyBoyer 284
RESULT RegExp 45.8
RESULT Splay 369
RESULT NavierStokes 247
SCORE 148

So not as dramatic as @raskad's result (though I didn't run the benchmark at that commit). This is with the latest changes of using ranges and const bit masks.

@hansl (Contributor, author) commented Dec 29, 2024

After the last fix which should fix everything, this is the latest benchmark:

RESULT Richards 85.5
RESULT DeltaBlue 84.0
RESULT Crypto 131
RESULT RayTrace 243
RESULT EarleyBoyer 303
RESULT RegExp 47.2
RESULT Splay 368
RESULT NavierStokes 251
SCORE 153

So a bit better. The next step would be to remove the extra deref of pointers (bigint, object, symbol and string), which I'll argue is out of scope for this PR.

I'll clean up some more (some magic numbers are still in), but this should be ready to review soon.

@hansl (Contributor, author) commented Dec 29, 2024

Test262 conformance changes

| Test result | main count | PR count | difference |
|---|---|---|---|
| Total | 48,625 | 48,625 | 0 |
| Passed | 43,947 | 43,947 | 0 |
| Ignored | 1,471 | 1,471 | 0 |
| Failed | 3,207 | 3,207 | 0 |
| Panics | 0 | 0 | 0 |
| Conformance | 90.38% | 90.38% | 0.00% |

@raskad (Member) left a review:

I already have some comments, but I still have to look at the actual inner value code; I've only skimmed it so far.
All in all, really nice work!

I ran the benchmarks again and noticed that while there is some significant variance between the individual benchmarks, the overall score is now almost on par with the main branch. Specifically, the Crypto and NavierStokes benchmarks got faster. @hansl Is this a trend you can also confirm when you run them? Otherwise I'll have to take a look at why this might be the case on my setup.

One general question regarding the feature flag: do we plan to remove it once certain criteria are met, or do we want to keep the flag? If we want to keep it, I think we need to think about how we can test the code with and without it to be sure that everything works with both features. We would also have to choose a default feature for our benchmarks etc.
Based on just the additional effort with all the tooling, I would tend to remove it at some point. Any other thoughts?

core/engine/src/builtins/array/tests.rs (resolved review thread, outdated)
Comment on lines +300 to 306

```rust
} else if let Some(rational) = self.0.as_float64() {
    // Use this poor-man's check as `[f64::fract]` isn't const.
    if rational == ((rational as i32) as f64) {
        Some(rational as i32)
    } else {
        None
    }
```
Member commented:

I think we could remove this branch now, right? I ran the 262 tests locally with the f64 part removed and all tests passed.

@hansl (author) replied:

You'll have to explain to me what code you removed. I removed that let Some(rational) = self.0.as_float64() branch and got a lot of tests to fail. Is that the branch you meant?
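Standalone, the check under discussion looks like this (a sketch with a hypothetical free-function signature; the real method also consults the NaN-box tag first):

```rust
// Round-trip check from the diff above: an f64 represents an exact i32 iff
// casting to i32 and back reproduces the same value. Note that it maps -0.0
// to Some(0), which is one reason the float branch needs care around signed
// zeros; Rust's `as` cast also saturates out-of-range and NaN inputs, so
// those fail the round-trip and return None.
fn as_exact_i32(rational: f64) -> Option<i32> {
    if rational == ((rational as i32) as f64) {
        Some(rational as i32)
    } else {
        None
    }
}
```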

core/engine/src/value/inner/nan_boxed.rs (resolved review thread, outdated)
@hansl (Contributor, author) commented Jan 1, 2025

> One general question regarding the feature flag: do we plan to remove it once certain criteria are met, or do we want to keep the flag? If we want to keep it, I think we need to think about how we can test the code with and without it to be sure that everything works with both features. We would also have to choose a default feature for our benchmarks etc.
> Based on just the additional effort with all the tooling, I would tend to remove it at some point. Any other thoughts?

My plan when adding the feature flag was to be quickly able to switch between the two implementations to run benchmarks and tests for this PR.

We might decide to merge the flag into the main branch, e.g. if we want to cut a release and test client projects with NaN boxing before actually making it the default. I did not intend it to live through a full release, but I also didn't want to commit to NaN boxing before we were 100% sure about it, and the flag gives us an easy revert.

@hansl (Contributor, author) commented Jan 1, 2025

Benchmarks (using taskpolicy -b on macOS restricts the process to the efficiency cores, which explains why it's so slow, but it makes for a fairer comparison):

Enum-based:

```shell
$ cargo build --release --bin boa
...
$ taskpolicy -b ./target/release/boa -O ~/Sources/boa-dev/data/bench/bench-v8/combined.js
PROGRESS Richards
RESULT Richards 19.7
PROGRESS DeltaBlue
RESULT DeltaBlue 18.5
PROGRESS Encrypt
PROGRESS Decrypt
RESULT Crypto 21.2
PROGRESS RayTrace
RESULT RayTrace 48.8
PROGRESS Earley
PROGRESS Boyer
RESULT EarleyBoyer 58.8
PROGRESS RegExp
RESULT RegExp 10.9
PROGRESS Splay
RESULT Splay 82.3
PROGRESS NavierStokes
RESULT NavierStokes 50.4
SCORE 31.6
```

NaN-boxed:

```shell
$ cargo build --release --bin boa --features nan-box-jsvalue
...
$ taskpolicy -b ./target/release/boa -O ~/Sources/boa-dev/data/bench/bench-v8/combined.js
PROGRESS Richards
RESULT Richards 12.8
PROGRESS DeltaBlue
RESULT DeltaBlue 12.2
PROGRESS Encrypt
PROGRESS Decrypt
RESULT Crypto 22.0
PROGRESS RayTrace
RESULT RayTrace 37.5
PROGRESS Earley
PROGRESS Boyer
RESULT EarleyBoyer 45.2
PROGRESS RegExp
RESULT RegExp 9.97
PROGRESS Splay
RESULT Splay 56.7
PROGRESS NavierStokes
RESULT NavierStokes 44.5
SCORE 24.9
```

@hansl (Contributor, author) commented Jan 5, 2025

On the latest commit, I changed the pointer tagging to use the top bits of the pointer. This means that we lose bits of the pointer value itself (after doing some research, 48 bits should be enough), but we gain performance by having non-overlapping ranges for pointers. It seems to speed things up a bit overall:

RESULT Richards 87.9
RESULT DeltaBlue 85.6
RESULT Crypto 142
RESULT RayTrace 250
RESULT EarleyBoyer 312
RESULT RegExp 47.4
RESULT Splay 437
RESULT NavierStokes 256
SCORE 161

The final score is getting really close to where it was before this PR on my machine.
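The top-bit scheme described here can be sketched as follows (hypothetical tag values and function names; the PR's actual constants are documented in nan_boxed.rs):

```rust
// Top-bit pointer tagging: on mainstream 64-bit platforms, user-space
// pointers fit in the low 48 bits, leaving the high 16 bits free for a
// type tag. Because tags occupy disjoint top-bit ranges, a type check is
// a single mask-and-compare with no overlap between pointer kinds.
const TAG_SHIFT: u32 = 48;
const PTR_MASK: u64 = (1u64 << TAG_SHIFT) - 1; // low 48 bits

fn tag_ptr(ptr: u64, tag: u16) -> u64 {
    // A pointer wider than 48 bits would corrupt the tag.
    debug_assert_eq!(ptr & !PTR_MASK, 0);
    ((tag as u64) << TAG_SHIFT) | ptr
}

fn untag_ptr(tagged: u64) -> (u16, u64) {
    ((tagged >> TAG_SHIFT) as u16, tagged & PTR_MASK)
}
```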

Comment on lines 22 to 25

```rust
//! | `BigInt` | `7FF8:PPPP:PPPP:PPPP` | 49-bits pointer. Assumes non-null pointer. |
//! | `Object` | `7FFA:PPPP:PPPP:PPPP` | 49-bits pointer. |
//! | `Symbol` | `7FFC:PPPP:PPPP:PPPP` | 49-bits pointer. |
//! | `String` | `7FFE:PPPP:PPPP:PPPP` | 49-bits pointer. |
```
Member commented:

So in the pointer space we only have one unused slot left, right? Which is 7FF8?

@hansl (author) replied:

7FF8 (or 0b0111_1111_1111_1000) is used by BigInt if the pointer is non-null. The 16th MSB isn't used (yet?), so pointers are actually 48 bits (I'll correct the comment).

@hansl (author) replied:

Let me know if the commit helps clarify this.

Member replied:

Yes, sorry, I meant to ask if 7FFF is unused and reserved.

@hansl (author) replied:

It is unused and reserved for now.

@hansl hansl requested review from jasonwilliams and raskad January 7, 2025 16:43
Development

Successfully merging this pull request may close these issues.

NaN boxed JavaScript Value
4 participants