-
-
Notifications
You must be signed in to change notification settings - Fork 435
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rng::gen_range() is slowest with power-of-2 input sizes? #1145
Comments
That's a keen observation, thanks! Definitely something we need to investigate. The results you are showing are from the benchmarks in the repository you linked? I would like to run them on my machine as well. It almost looks like (About your experiments with generating only as many bits as necessary: This approach is indeed popular in some academic publications. We also have some discussion in #1014 and #1055. The problem is that it is tricky to make fast without compromising performance when generating |
yeah, you should be able to just pull and run Reading those issues now, I bet they'll be interesting! |
Thanks! The results on my machine are consistent with yours. |
This is an artifact of the leading_zeros threshold approximation. I believe this should fix it: It passes a 16-bit bias test: (1..=u16::max_value()).into_par_iter().for_each(|range| {
let zone = (range << ((range - 1).leading_zeros())).wrapping_sub(1);
let mut map = vec![0; range as usize];
for r in 0..=u16::max_value() {
let m = r as u32 * range as u32;
let h = (m >> 16) as u16;
let l = m as u16;
if l <= zone {
map[h as usize] += 1;
}
}
// must be uniform for every value
assert!(map.iter().all(|&x| x == map[0]), "{:?}", range);
}); |
Hang on, it makes others worse. Like |
Note also that O'neill found that shortcutting the modulo-wide-mul method is actually fastest. See "All-Ranges Benchmark Results" for "Debiased Int Mult (t-opt/m-opt)" vs "Bitmask" (which should have similar speed to our method, perhaps even faster thanks to no multiplication) |
So, would @TheIronBorn like to make a PR? |
Sure I’ll look into it |
I ran @elidupree's benchmark for the new RNGs evaluated in #1196. Summary: all Canon variants do not display poor behaviour except for |
I was just fiddling around with some RNG stuff when I noticed something that looks pretty odd:
These are time-per-call for calls to
rng.gen_range(0..n)
(you can see the value ofn
at the end of each name written on the vertical axis), sampled using Criterion. You'd think that power-of-2 range sizes would be the fastest to generate, since they can theoretically be done as just a single bitmask, with no rejections. But instead, they are the slowest!I'm not deeply familiar with RNG algorithms, but I took a quick look through the code, and I wonder if this line from sample_single_inclusive is the culprit. Maybe it was intended to use
(range-1).leading_zeros()
rather thanrange.leading_zeros()
? I haven't grokked what the line is doing, but that would be something that affects exactly power-of-2 sizes differently than their neighbors.To test, this, I confirmed that
Uniform::new(0, n).sample(&mut rng)
does not have this pattern, and in fact, is faster thanrng.gen_range(0..n)
on my computer in almost every case I tested (but especially much faster on power-of-2 sizes):Is it possible that the
sample_single_inclusive
code is out of date compared to optimizations that have been made more recently insample_inclusive
?The text was updated successfully, but these errors were encountered: