WIP: Investigate Canon's uniform-integer sampling method #1196
Which algorithm to use for repeated sampling from a distribution? (non-SIMD) Canon and Canon-Lemire are basically the same, and clear winners for i32 and For i8 and i32, Canon appears slightly better than Canon-Lemire, though I'm not |
How much bias is acceptable for samples?
Results (see Canon vs Canon64)
uniform_dist_int_i8/Old/high_reject time: [1.0550 ns 1.0564 ns 1.0579 ns] change: [-4.8870% -4.7632% -4.6061%] (p = 0.00 < 0.05) Performance has improved.
Found 5 outliers among 100 measurements (5.00%) 5 (5.00%) high mild
uniform_dist_int_i8/Old/low_reject time: [1.0613 ns 1.0631 ns 1.0655 ns] change: [-4.2396% -4.0519% -3.7987%] (p = 0.00 < 0.05) Performance has improved.
Found 8 outliers among 100 measurements (8.00%) 2 (2.00%) low mild 5 (5.00%) high mild 1 (1.00%) high severe
uniform_dist_int_i8/Lemire/high_reject time: [1.2753 ns 1.2757 ns 1.2762 ns] change: [+0.0946% +0.1332% +0.1760%] (p = 0.00 < 0.05) Change within noise threshold.
Found 17 outliers among 100 measurements (17.00%) 1 (1.00%) low mild 7 (7.00%) high mild 9 (9.00%) high severe
uniform_dist_int_i8/Lemire/low_reject time: [1.2861 ns 1.2864 ns 1.2869 ns] change: [-0.4160% -0.3649% -0.3147%] (p = 0.00 < 0.05) Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%) 1 (1.00%) low mild 1 (1.00%) high mild 2 (2.00%) high severe
uniform_dist_int_i8/Canon/high_reject time: [990.75 ps 990.89 ps 991.05 ps] change: [-5.7790% -5.7415% -5.7063%] (p = 0.00 < 0.05) Performance has improved.
Found 5 outliers among 100 measurements (5.00%) 2 (2.00%) high mild 3 (3.00%) high severe
uniform_dist_int_i8/Canon/low_reject time: [982.39 ps 982.87 ps 983.55 ps] change: [-6.6996% -6.6568% -6.6131%] (p = 0.00 < 0.05) Performance has improved.
Found 11 outliers among 100 measurements (11.00%) 3 (3.00%) high mild 8 (8.00%) high severe
uniform_dist_int_i8/Canon64/high_reject time: [940.71 ps 941.12 ps 941.56 ps]
Found 12 outliers among 100 measurements (12.00%) 1 (1.00%) low severe 2 (2.00%) low mild 8 (8.00%) high mild 1 (1.00%) high severe
uniform_dist_int_i8/Canon64/low_reject time: [941.74 ps 942.07 ps 942.43 ps]
Found 8 outliers among 100 measurements (8.00%) 1 (1.00%) low severe 4 (4.00%) low mild 3 (3.00%) high mild
uniform_dist_int_i8/Canon-Lemire/high_reject time: [992.74 ps 994.00 ps 996.01 ps] change: [-5.4800% -5.4102% -5.3175%] (p = 0.00 < 0.05) Performance has improved.
Found 6 outliers among 100 measurements (6.00%) 2 (2.00%) high mild 4 (4.00%) high severe
uniform_dist_int_i8/Canon-Lemire/low_reject time: [998.76 ps 999.10 ps 999.47 ps] change: [-4.3281% -4.2604% -4.1969%] (p = 0.00 < 0.05) Performance has improved.
Found 3 outliers among 100 measurements (3.00%) 2 (2.00%) high mild 1 (1.00%) high severe
uniform_dist_int_i8/Bitmask/high_reject time: [1.0390 ns 1.0397 ns 1.0409 ns] change: [-3.4370% -3.1355% -2.6074%] (p = 0.00 < 0.05) Performance has improved.
Found 7 outliers among 100 measurements (7.00%) 3 (3.00%) high mild 4 (4.00%) high severe
uniform_dist_int_i8/Bitmask/low_reject time: [835.01 ps 835.25 ps 835.54 ps] change: [-0.3614% -0.2779% -0.2112%] (p = 0.00 < 0.05) Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%) 1 (1.00%) low mild 2 (2.00%) high mild 2 (2.00%) high severe
uniform_dist_int_i16/Old/high_reject time: [1.1080 ns 1.1084 ns 1.1088 ns] change: [+5.1971% +5.3899% +5.6302%] (p = 0.00 < 0.05) Performance has regressed.
Found 5 outliers among 100 measurements (5.00%) 3 (3.00%) high mild 2 (2.00%) high severe
uniform_dist_int_i16/Old/low_reject time: [1.0954 ns 1.0957 ns 1.0960 ns] change: [+2.0607% +2.2329% +2.3994%] (p = 0.00 < 0.05) Performance has regressed.
Found 16 outliers among 100 measurements (16.00%) 4 (4.00%) low severe 7 (7.00%) high mild 5 (5.00%) high severe
uniform_dist_int_i16/Lemire/high_reject time: [1.0487 ns 1.0490 ns 1.0492 ns] change: [+0.3726% +0.6616% +0.9021%] (p = 0.00 < 0.05) Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%) 1 (1.00%) low mild 1 (1.00%) high mild
uniform_dist_int_i16/Lemire/low_reject time: [1.0490 ns 1.0494 ns 1.0498 ns] change: [-0.4838% -0.4226% -0.3644%] (p = 0.00 < 0.05) Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%) 4 (4.00%) high mild
uniform_dist_int_i16/Canon/high_reject time: [998.31 ps 998.53 ps 998.78 ps] change: [+0.0661% +0.1725% +0.2543%] (p = 0.00 < 0.05) Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high mild
uniform_dist_int_i16/Canon/low_reject time: [991.89 ps 992.23 ps 992.65 ps] change: [-0.2821% -0.2012% -0.1351%] (p = 0.00 < 0.05) Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%) 5 (5.00%) high mild 3 (3.00%) high severe
uniform_dist_int_i16/Canon64/high_reject time: [938.18 ps 938.39 ps 938.62 ps]
Found 10 outliers among 100 measurements (10.00%) 8 (8.00%) high mild 2 (2.00%) high severe
uniform_dist_int_i16/Canon64/low_reject time: [937.62 ps 937.98 ps 938.37 ps]
Found 3 outliers among 100 measurements (3.00%) 3 (3.00%) high mild
uniform_dist_int_i16/Canon-Lemire/high_reject time: [983.96 ps 985.47 ps 987.44 ps] change: [-5.7130% -5.5611% -5.3530%] (p = 0.00 < 0.05) Performance has improved.
Found 7 outliers among 100 measurements (7.00%) 2 (2.00%) high mild 5 (5.00%) high severe
uniform_dist_int_i16/Canon-Lemire/low_reject time: [999.00 ps 999.33 ps 999.73 ps] change: [-5.0901% -5.0338% -4.9807%] (p = 0.00 < 0.05) Performance has improved.
Found 10 outliers among 100 measurements (10.00%) 5 (5.00%) high mild 5 (5.00%) high severe
uniform_dist_int_i16/Bitmask/high_reject time: [864.94 ps 865.76 ps 866.93 ps] change: [+0.4297% +0.7857% +1.2749%] (p = 0.00 < 0.05) Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%) 2 (2.00%) high mild 5 (5.00%) high severe
uniform_dist_int_i16/Bitmask/low_reject time: [845.80 ps 851.63 ps 862.39 ps] change: [+1.6534% +2.0227% +2.5567%] (p = 0.00 < 0.05) Performance has regressed.
Found 5 outliers among 100 measurements (5.00%) 2 (2.00%) high mild 3 (3.00%) high severe |
This is based on @TheIronBorn's work (#1154, #1172), with some changes.
This is from @TheIronBorn (see #1172)
This beats other uniform_int_i128 results
So, the first round of analysis (non-SIMD integer types only) is complete.
There may be a second round after a few tweaks to these algorithms. However, most important for now is to answer a few questions regarding bias and use of multiple different algorithms (see Final Thoughts). |
This is a very comprehensive report, thank you |
I didn't finish this (changing the uniform sampler) yet because from the report it's not really clear which sampler to use.
It would also be nice if we had some method of determining at compile time the type of RNG used, thus allowing algorithm selection based on the RNG. Unfortunately this isn't really feasible with Rust's current feature set (probably we need specialization and/or generic_const_exprs). |
Is it worth picking some good-enough algorithm which solves the #1145 issue? The over-engineered solution is to add the algorithm selection to the |
I uploaded the Intel 1145 benchmarks here. Benches should be run from this branch. If you would like to benchmark on a different CPU:
|
Notes on review of
Best for i8: Canon32. Decent: Canon-reduced. Both have max bias 1-in-2^56. These are all variations of the Canon algorithm (usually with increased max bias):
Additional variants may be worth exploring:
|
Note:
fn gen_index<R: Rng + ?Sized>(rng: &mut R, ubound: usize) -> usize {
    // Sample as u32 when the bound fits; otherwise sample the full usize range.
    if ubound <= (core::u32::MAX as usize) {
        rng.gen_range(0..ubound as u32) as usize
    } else {
        rng.gen_range(0..ubound)
    }
}
Maybe uniform samplers for 64- and 128-bit sizes should test the range and potentially use the algorithm for a smaller size? There are a few issues however: (1) the number of RNG bits used varies by inputs (though this is already non-constant with most algorithms), (2) more complex code with more |
I added a new algorithm:
i8, i16: Canon32-2 is slightly slower than Canon32 (but with less bias). Mostly well ahead of Canon (same bias), but loses in some tests. Therefore Canon32-2 may be the best choice here if we want less bias than 1-in-2^48.
i32: high variance, slower than most, easily beaten by Unbiased. These benches give all 64-bit samplers very low variance and all 32-bit samplers very high variance. At a guess, this is because 64-bit samplers (almost) never need a bias-reduction step, therefore the branch predictor is highly effective? Best option remains Canon.
i64: Canon32-2 is 10-20% behind Canon32, sometimes ahead of Unbiased but behind Canon and Canon-Lemire. Looks like Canon-Lemire is the best option here (so the extra cost of the improved bias check is worth it here, but not for i32 output with 64-bit RNG output).
i128: Canon32-2 is well behind Canon-reduced, which remains the best choice. |
I previously wrote (in the report):
Looking again at the results:
Bias: according to my benchmarks, one uniform distribution sample takes (very vaguely) 100ns, allowing about 3e14 samples per CPU-core per year. Bias of 1-in-2^48 samples implies approximately one biased sample per CPU-year. This only affects (Alternatively, Canon32-2 reduces the bias with 8-16-bit output at little cost.)
Therefore I propose we use the algorithms mentioned above: Canon32 for 8-16 bit output, Canon for 32-bit, Canon-Lemire for 64-bit and Canon-reduced for 128-bit. Yes, having so many variants is a little weird, but the differences are mostly small, and not everything reduces well to a single generic implementation anyway.
Regarding SIMD types, I propose to leave these for later. |
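The arithmetic behind the "one biased sample per CPU-year" estimate can be checked with a short sketch. The function names are hypothetical, and the 100 ns per sample figure is the very rough assumption stated above, not a measured constant.

```rust
/// Rough number of samples one core can draw in a year, for an assumed
/// cost per sample in nanoseconds (hypothetical helper, not part of rand).
fn samples_per_core_year(ns_per_sample: f64) -> f64 {
    const SECONDS_PER_YEAR: f64 = 365.25 * 24.0 * 3600.0; // about 3.16e7
    SECONDS_PER_YEAR * 1e9 / ns_per_sample
}

/// Expected number of biased samples per core-year when the maximum bias
/// is 1-in-2^bias_bits.
fn biased_samples_per_core_year(ns_per_sample: f64, bias_bits: u32) -> f64 {
    samples_per_core_year(ns_per_sample) / 2.0_f64.powi(bias_bits as i32)
}
```

With 100 ns per sample this gives roughly 3e14 samples per core-year, and 2^48 is about 2.8e14, so the expected count of biased samples at 1-in-2^48 is close to one per core-year.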
How would this affect code size compared to the status quo? |
LOC? Probably a bit worse. I may be able to make impls partially share via macros. Conceptually the differences between the algorithms are small. Compiled? Generic code is monomorphised at compile time anyway. |
Note: this branch is behind canon-uniform-benches (which also includes benchmark results). Neither is appropriate for merging. My analysis above misses something: the Canon-Lemire optimisation is essentially an extra cost up-front ( |
Originally I assumed the Canon-Lemire variant was a mistake, but it was mentioned on Twitter.
The Canon method is based on multiplying a theoretically infinite-precision random value in the range
The Lemire method is effectively the familiar
It just so happens that the code for the two, up to checking the first threshold, is identical, but with Lemire's method using a tighter threshold. Presumably this is where the idea for Canon-Lemire came from.
Unfortunately, it doesn't work. Lemire rejects some
(A waste of time; it shows why I should never have accepted a hand-wavy argument without proof in the first place!) |
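For comparison, here is a minimal sketch of the two base methods as they are usually described, written for 32-bit words to keep the arithmetic readable. The function names and the closure-based RNG parameter are assumptions made for illustration; this is not the code from the branch, and the Canon variant shown performs only a single bias-reduction step.

```rust
/// Lemire's method: widening multiply, then reject draws whose low word
/// falls below 2^32 mod range. Unbiased, but may consume several words.
fn lemire_u32(range: u32, mut next: impl FnMut() -> u32) -> u32 {
    assert!(range > 0, "range must be non-zero");
    let mut m = u64::from(next()) * u64::from(range);
    let mut lo = m as u32;
    if lo < range {
        // 2^32 mod range, computed without leaving 32-bit arithmetic.
        let threshold = range.wrapping_neg() % range;
        while lo < threshold {
            m = u64::from(next()) * u64::from(range);
            lo = m as u32;
        }
    }
    (m >> 32) as u32
}

/// Canon's method with one bias-reduction step: treat the random word as
/// the leading bits of a fraction in [0, 1), multiply by range, and extend
/// the fraction by one more word only when that could change the result.
fn canon_u32(range: u32, mut next: impl FnMut() -> u32) -> u32 {
    assert!(range > 0, "range must be non-zero");
    let m = u64::from(next()) * u64::from(range);
    let mut hi = (m >> 32) as u32;
    let lo = m as u32;
    // A carry into `hi` is only possible when lo + range > 2^32.
    if lo > range.wrapping_neg() {
        let m2 = u64::from(next()) * u64::from(range);
        let hi2 = (m2 >> 32) as u32;
        // Propagate the carry (if any) from the extended fraction.
        let (_, carry) = lo.overflowing_add(hi2);
        hi += u32::from(carry);
    }
    hi
}
```

Both functions return a value in `0..range`. In practice the closure would wrap an RNG, for example `|| rng.next_u32()` with a `rand_core::RngCore` generator. Both start with the same widening multiply; they differ in what they test on the low word and in what they do afterwards (reject and redraw versus extend and carry).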
Sorry you found out about the bias so late, that's really frustrating. Seems like Canon-Lemire was not sufficiently tested? |
Ah sorry, I should have been more rigorous before including it |
Testing for bias is only really practical with small samplers, and all of these samplers are at least 32-bit. Yes I could build an artificially smaller version, but I didn't get around to it. |
Some more benchmark results, reduced to five algorithms at each size, chosen from:
Canon32-unbiased and Canon-reduced-unbiased are new to this round of testing.
Conclusions
Best for i8, i16: Canon32, or if less bias is wanted, Canon32-2 or Canon32-Un.
Best for i32: Canon or Canon-Un (Biased64 is fastest, but too biased).
Best for i64: Canon, Lemire, ONeill or Canon-Un.
Best for i128: Probably Lemire (distributions only). Otherwise, hard to choose.
Thoughts?
I can't think of any more variants; we should probably pick from the above.
We should consider bias: Canon32 samples up to 64 bits and Canon32-2 up to 96, Canon up to 128 (double for i128 output); subtract the output size to get worst-case bias. Probably any of these (except Biased64 for i32) is fine. |
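The "subtract the output size" rule can be written out as a small sketch. The enum and function names are hypothetical, and the bit counts are only those stated in the comment above; other samplers and other output sizes are not modelled.

```rust
/// Samplers whose maximum sampled bit counts are stated in the text
/// (hypothetical enum, for illustration only).
#[derive(Clone, Copy)]
enum Sampler {
    Canon32,
    Canon32Two, // "Canon32-2"
    Canon,
}

/// Maximum number of random bits consumed per sample, as quoted:
/// 64 for Canon32, 96 for Canon32-2, 128 for Canon (256 for i128 output).
fn max_bits_sampled(sampler: Sampler, output_bits: u32) -> u32 {
    match sampler {
        Sampler::Canon32 => 64,
        Sampler::Canon32Two => 96,
        Sampler::Canon => {
            if output_bits == 128 {
                256
            } else {
                128
            }
        }
    }
}

/// Worst-case bias exponent k, meaning a bias of at most 1-in-2^k.
/// Saturates at zero for combinations the text does not cover.
fn worst_case_bias_bits(sampler: Sampler, output_bits: u32) -> u32 {
    max_bits_sampled(sampler, output_bits).saturating_sub(output_bits)
}
```

For example, Canon32 with i8 output gives 64 - 8 = 56, and with i16 output 64 - 16 = 48, which matches the 1-in-2^56 and 1-in-2^48 figures quoted earlier in the thread.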
See #1172. Also includes #1154 as a comparator.
Some of this work is @TheIronBorn's; I have cleaned up and tweaked his impls a bit, and written new benchmarks using Criterion.
Algorithms to compare:
sample_single we use a bit-mask method for i32 and larger
sample_single; for repeated sampling this is just Lemire's method again (I may not have optimised properly however)
Of these, Canon's method definitely appears the best all-round method. I'm not done testing yet and may use some tweaks for i128 and some SIMD cases.
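As a reference point for the bit-mask method mentioned in the list, here is a minimal sketch of bitmask rejection sampling for a 32-bit range. The function name and the closure-based RNG parameter are assumptions for illustration; it is not the benchmarked implementation.

```rust
/// Bitmask rejection sampling: mask each random word down to the smallest
/// power-of-two span covering `range`, and retry until the value fits.
fn bitmask_u32(range: u32, mut next: impl FnMut() -> u32) -> u32 {
    assert!(range > 0, "range must be non-zero");
    // All-ones mask just wide enough to hold range - 1 (at least one bit).
    let mask = u32::MAX >> ((range - 1) | 1).leading_zeros();
    loop {
        let x = next() & mask;
        if x < range {
            return x;
        }
    }
}
```

Each attempt succeeds with probability above one half, so the expected number of random words per sample is below two, but the loop has no fixed upper bound.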
Cases:
Research questions
Is sample faster than sample_single?
Yes, definitely (sample_single is approx 50% slower). It's clear so I won't bother with evidence here.
Is it better to use u32 over u64 random numbers when sampling small integers?
For Canon-Lemire, maybe for i8/i16. For Canon, no. For i32, no. Evidence.
(Note: several benches vary by 2.5%, one by 9%, so take with a pinch of salt.)
Note further: the answer may be different on 32-bit CPUs. Tested on 5800X.
Is it better to use u64 instead of u128 samples for 128-bit integers?
Yes (~15% faster for non-SIMD distribution sampling). Evidence.
Which is the best algorithm for XX?
Each case will get a new post below.
CC @ctsrc @stephentyrone