Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rng::gen_range() is slowest with power-of-2 input sizes? #1145

Closed
elidupree opened this issue Jul 15, 2021 · 12 comments
Closed

Rng::gen_range() is slowest with power-of-2 input sizes? #1145

elidupree opened this issue Jul 15, 2021 · 12 comments

Comments

@elidupree
Copy link

elidupree commented Jul 15, 2021

I was just fiddling around with some RNG stuff when I noticed something that looks pretty odd:

gen_range_plot

These are time-per-call for calls to rng.gen_range(0..n) (you can see the value of n at the end of each name written on the vertical axis), sampled using Criterion. You'd think that power-of-2 range sizes would be the fastest to generate, since they can theoretically be done as just a single bitmask, with no rejections. But instead, they are the slowest!

I'm not deeply familiar with RNG algorithms, but I took a quick look through the code, and I wonder if this line from sample_single_inclusive is the culprit. Maybe it was intended to use (range-1).leading_zeros() rather than range.leading_zeros()? I haven't grokked what the line is doing, but that would be something that affects exactly power-of-2 sizes differently than their neighbors.

To test, this, I confirmed that Uniform::new(0, n).sample(&mut rng) does not have this pattern, and in fact, is faster than rng.gen_range(0..n) on my computer in almost every case I tested (but especially much faster on power-of-2 sizes):

gen_range_report_2

Is it possible that the sample_single_inclusive code is out of date compared to optimizations that have been made more recently in sample_inclusive?

@vks
Copy link
Collaborator

vks commented Jul 15, 2021

That's a keen observation, thanks! Definitely something we need to investigate. The results you are showing are from the benchmarks in the repository you linked? I would like to run them on my machine as well.

It almost looks like gen_range is never faster? Maybe we should consider replacing it with Uniform internally.

(About your experiments with generating only as many bits as necessary: This approach is indeed popular in some academic publications. We also have some discussion in #1014 and #1055. The problem is that it is tricky to make fast without compromising performance when generating u64, see #1014.)

@elidupree
Copy link
Author

yeah, you should be able to just pull and run cargo bench. I've also just pushed an alternate branch with my code to run the benchmarks using Uniform.

Reading those issues now, I bet they'll be interesting!

@vks
Copy link
Collaborator

vks commented Jul 15, 2021

Thanks! The results on my machine are consistent with yours.

@TheIronBorn
Copy link
Contributor

This is an artifact of the leading_zeros threshold approximation. I believe this should fix it:
(range << (range - 1).leading_zeros()).wrapping_sub(1)

It passes a 16-bit bias test:

(1..=u16::max_value()).into_par_iter().for_each(|range| {
        let zone = (range << ((range - 1).leading_zeros())).wrapping_sub(1);
        let mut map = vec![0; range as usize];
        for r in 0..=u16::max_value() {
            let m = r as u32 * range as u32;
            let h = (m >> 16) as u16;
            let l = m as u16;
            if l <= zone {
                map[h as usize] += 1;
            }
        }
        // must be uniform for every value
        assert!(map.iter().all(|&x| x == map[0]), "{:?}", range);
    });

@TheIronBorn
Copy link
Contributor

Hang on, it makes others worse. Like range = 1

@TheIronBorn
Copy link
Contributor

TheIronBorn commented Jul 15, 2021

Note that the bitmask method has nearly overlapping rejection probabilities but is better on edge cases like this
image
image

@dhardy
Copy link
Member

dhardy commented Jul 16, 2021

See also #951 and #985 by @jongiddy.

@TheIronBorn
Copy link
Contributor

Note also that O'neill found that shortcutting the modulo-wide-mul method is actually fastest. See "All-Ranges Benchmark Results" for "Debiased Int Mult (t-opt/m-opt)" vs "Bitmask" (which should have similar speed to our method, perhaps even faster thanks to no multiplication)

@dhardy
Copy link
Member

dhardy commented Jul 18, 2021

So, would @TheIronBorn like to make a PR?

@TheIronBorn
Copy link
Contributor

Sure I’ll look into it

@dhardy
Copy link
Member

dhardy commented Feb 3, 2023

I ran @elidupree's benchmark for the new RNGs evaluated in #1196. Summary: all Canon variants do not display poor behaviour except for (1 << 62) + 1.
image
Results are similar for other RNGs.

@dhardy
Copy link
Member

dhardy commented Feb 3, 2023

That isn't to say that these algorithms handle all specially picked values well:
image
Values are:

  • (1 << 31) + 1 (good perf)
  • (1 << 55) + 1 (good)
  • 1 << 62 (middle)
  • (1 << 62) + 1 (middle)
  • (1 << 63) - 1 (good)
  • 1 << 63 (poor)
  • (1 << 63) + 1 (poor)

So special cases still lead to poor performance (no surprise), but we're at least a lot better than the "Old" (current) algorithm.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants