Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Notes: 32-bit implementation and perf optimization #5

Closed
mgold opened this issue Jul 10, 2016 · 5 comments
Closed

Notes: 32-bit implementation and perf optimization #5

mgold opened this issue Jul 10, 2016 · 5 comments
Labels

Comments

@mgold
Copy link
Owner

mgold commented Jul 10, 2016

Tests run on mgold/elm-nonempty-list test suite commit 9fe70d9 unless otherwise noted. Unreleased PCG versions were hand-copied into elm-stuff.

RNG time to run nonemptylist suite (seconds) reason for improvement
PCG 2.1.1 release 3.2+ (up to 4) (baseline)
PCG bit32 branch 2.3 - 2.5 removal of 64-bit arithmetic and complicated peel implementation
PCG bit32 with hand-optimized JS 1.7 bitwise ops and % no longer buried in function calls
PCG stub branch 0.2 Complete removal of all RNG code; represents overhead of test suite
core Random, elm-check, c8c83c2 0.03 generator is statistically unsound, uses no bitwise ops. Theories below.
core Random, elm-check, c8c83c2 with property-based tests removed 0.02 There is no way all those property tests ran in 10ms. Delta likely represents parsing their results.
core Random, project fuzzball, 85228d5 1.8 - 2.2 core Random is not ridiculously fast when actually measured properly

Observations: With project-fuzzball, there's about 200ms of overhead and either RNG adds another 1.5 to 2 seconds. With elm-check, it takes 30ms.

I was really confused as to what Hassan was doing until I realized that the 30ms figure doesn't include the time to run the property-based tests! It's a completely invalid comparison!

The currently shipping PCG is about twice as slow as core/Random, but PCG-32 is only slightly slower, probably good enough to use. Plus, it will get even faster if bitwise op inlining is ever implemented by the compiler.

One possible flaw is that PCG-32 was measured against project-fuzzball/test version 1.x while core was against 5.x. Were there any major performance changes in the interceding versions?

Hand optimized JS:

$ diff index.html index.html.patched
10507,10522c10507,10509
<   var _p2 = _p1;
<   var _p3 = _p2._0;
<   var word = A2(
<       _elm_lang$core$Bitwise$xor,
<       A2(
<           _elm_lang$core$Bitwise$shiftRightLogical,
<           _p3,
<           A2(_elm_lang$core$Bitwise$shiftRightLogical, _p3, 28) + 4),
<       _p3) * 277803737;
<   return A2(
<       _elm_lang$core$Bitwise$shiftRightLogical,
<       A2(
<           _elm_lang$core$Bitwise$or,
<           A2(_elm_lang$core$Bitwise$shiftRightLogical, word, 22),
<           word),
<       0);

---
>     var _p2 = _p1._0;
>     var _word = ((_p2 >>> ((_p2 >>> 28) + 4)) ^ _p2) * 277803737;
>     return ((_word >>> 22) ^ _word) >>> 0;
10722,10724c10709
<   var _p58 = _p57;
<   return _mgold$elm_random_pcg$Random_Pcg$Seed(
<       A2(_elm_lang$core$Bitwise$shiftRightLogical, (_p58._0 * 1664525) + 1013904223, 0));

---
>   return _mgold$elm_random_pcg$Random_Pcg$Seed( ((_p57._0 * 1664525) + 1013904223) >>> 0 );
10747,10749c10732
<               if (_elm_lang$core$Native_Utils.eq(
<                   A2(_elm_lang$core$Bitwise$and, range, range - 1),
<                   0)) {

---
>               if (_elm_lang$core$Native_Utils.eq(range & (range - 1), 0)) {
10752,10758c10735
<                       _0: A2(
<                           _elm_lang$core$Bitwise$shiftRightLogical,
<                           A2(
<                               _elm_lang$core$Bitwise$and,
<                               _mgold$elm_random_pcg$Random_Pcg$peel(seed0),
<                               range - 1),
<                           0),

---
>                       _0: (_mgold$elm_random_pcg$Random_Pcg$peel(seed0) & (range - 1)) >>> 0,
10762,10768c10739
<                   var threshhold = A2(
<                       _elm_lang$core$Bitwise$shiftRightLogical,
<                       A2(
<                           _elm_lang$core$Basics$rem,
<                           A2(_elm_lang$core$Bitwise$shiftRightLogical, 0 - range, 0),
<                           range),
<                       0);

---
>                   var threshhold = ((-range >>> 0) % range) >>> 0;
10781c10752
<                                   _0: A2(_elm_lang$core$Basics$rem, x, range) + lo,

---
>                                   _0: (x % range) + lo,
13875c13846
@mgold
Copy link
Owner Author

mgold commented Jul 31, 2016

Raw benchmark numbers:

$ head -n 50 *.log
==> core.log <==
flip a coin x 826,532 ops/sec ±3.23% (52 runs sampled)
flip 1000 coins x 833 ops/sec ±2.69% (55 runs sampled)
generate an integer 0-4094 x 790,682 ops/sec ±1.63% (55 runs sampled)
generate an integer 0-4095 x 788,581 ops/sec ±1.49% (58 runs sampled)
generate an integer 0-4096 x 780,343 ops/sec ±1.57% (53 runs sampled)
generate a massive integer x 500,936 ops/sec ±1.30% (56 runs sampled)
generate a percentage x 359,647 ops/sec ±1.09% (58 runs sampled)
generate 1000 percentages x 466 ops/sec ±1.13% (54 runs sampled)
generate a float 0-4094 x 413,948 ops/sec ±1.15% (57 runs sampled)
generate a float 0-4095 x 415,087 ops/sec ±1.23% (56 runs sampled)
generate a float 0-4096 x 417,024 ops/sec ±1.38% (56 runs sampled)
generate a massive float x 358,135 ops/sec ±1.01% (58 runs sampled)

==> pcg32-js.log <==
flip a coin x 6,849,402 ops/sec ±1.17% (56 runs sampled)
flip 1000 coins x 6,699 ops/sec ±1.80% (37 runs sampled)
generate an integer 0-4094 x 1,199,519 ops/sec ±1.35% (56 runs sampled)
generate an integer 0-4095 x 1,580,108 ops/sec ±1.23% (55 runs sampled)
generate an integer 0-4096 x 1,265,778 ops/sec ±1.31% (56 runs sampled)
generate a massive integer x 1,495,435 ops/sec ±1.10% (55 runs sampled)
generate a percentage x 1,952,226 ops/sec ±1.05% (56 runs sampled)
generate 1000 percentages x 5,983 ops/sec ±0.94% (33 runs sampled)
generate a float 0-4094 x 3,224,260 ops/sec ±0.95% (56 runs sampled)
generate a float 0-4095 x 3,235,686 ops/sec ±1.14% (54 runs sampled)
generate a float 0-4096 x 3,023,070 ops/sec ±1.29% (57 runs sampled)
generate a massive float x 1,732,447 ops/sec ±0.93% (58 runs sampled)

==> pcg32.log <==
flip a coin x 3,130,551 ops/sec ±1.05% (55 runs sampled)
flip 1000 coins x 3,233 ops/sec ±1.12% (21 runs sampled)
generate an integer 0-4094 x 839,203 ops/sec ±1.03% (56 runs sampled)
generate an integer 0-4095 x 1,105,071 ops/sec ±1.68% (57 runs sampled)
generate an integer 0-4096 x 843,090 ops/sec ±0.78% (56 runs sampled)
generate a massive integer x 1,129,568 ops/sec ±1.44% (56 runs sampled)
generate a percentage x 1,363,954 ops/sec ±1.50% (56 runs sampled)
generate 1000 percentages x 2,056 ops/sec ±1.10% (57 runs sampled)
generate a float 0-4094 x 1,519,190 ops/sec ±1.11% (57 runs sampled)
generate a float 0-4095 x 1,452,386 ops/sec ±0.99% (57 runs sampled)
generate a float 0-4096 x 1,511,441 ops/sec ±1.24% (55 runs sampled)
generate a massive float x 1,285,961 ops/sec ±1.38% (57 runs sampled)

==> pcg64.log <==
flip a coin x 596,134 ops/sec ±1.19% (53 runs sampled)
flip 1000 coins x 457 ops/sec ±1.23% (54 runs sampled)
generate an integer 0-4094 x 272,902 ops/sec ±2.40% (54 runs sampled)
generate an integer 0-4095 x 332,912 ops/sec ±1.66% (55 runs sampled)
generate an integer 0-4096 x 289,518 ops/sec ±1.46% (55 runs sampled)
generate a massive integer x 339,060 ops/sec ±1.37% (55 runs sampled)
generate a percentage x 216,582 ops/sec ±1.36% (56 runs sampled)
generate 1000 percentages x 283 ops/sec ±1.27% (57 runs sampled)
generate a float 0-4094 x 255,547 ops/sec ±1.58% (54 runs sampled)
generate a float 0-4095 x 257,823 ops/sec ±1.42% (58 runs sampled)
generate a float 0-4096 x 239,220 ops/sec ±2.25% (58 runs sampled)
generate a massive float x 203,079 ops/sec ±1.45% (57 runs sampled)

@mgold
Copy link
Owner Author

mgold commented Jul 31, 2016

@evancz @rtfeldman

This table shows operations per second normalized to core's implementation as 1. >1 means faster than core by that factor. The columns represent the old version of PCG, the new algorithm, and the new algorithm with hand-optimized JS.

test name PCG 64 PCG 32 PCG 32-JS
flip a coin 0.72 3.79 8.29
flip 1000 coins 0.55 3.88 8.04
generate an integer 0-4094 0.35 1.06 1.52
generate an integer 0-4095 0.42 1.40 2.00
generate an integer 0-4096 0.37 1.08 1.62
generate a massive integer 0.68 2.25 2.99
generate a percentage 0.60 3.79 5.43
generate 1000 percentages 0.61 4.41 12.84
generate a float 0-4094 0.62 3.67 7.79
generate a float 0-4095 0.62 3.50 7.80
generate a float 0-4096 0.57 3.62 7.25
generate a massive float 0.57 3.59 4.84

Takeaway: The new version of PCG will always be faster than core, even without the compiler inlining bitwise ops and rem.

Tiny caveat: I haven't implemented independent seeds (aka splitting) yet, which I expect to slow things down slightly. Unless we want to add that feature to core or want to take the (small) perf hit for being able to do so later easily, it won't be included in a PR to core.

But, that doesn't need to happen right away. If 0.18 is going to be about breaking things now instead of later, this can be delayed because it doesn't change the interface. At all. One of the side benefits of only 32 bits of state is that initialSeed2 is no longer necessary. It can literally be added in a patch release.

@rtfeldman
Copy link

This is really great! 🚀 Awesome work @mgold!

I'd say including independentSeed now makes sense, since we already have an important concrete use case for it: elm-test.

@mgold
Copy link
Owner Author

mgold commented Jul 31, 2016

Okay, I've implemented independentSeed but haven't completely tested it yet. Preliminarily, the perf hit is basically within the noise, as expected.

@mgold
Copy link
Owner Author

mgold commented Aug 3, 2016

Implemented in 3.0.0.

@mgold mgold closed this as completed Aug 3, 2016
@mgold mgold added the notes label Aug 17, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants