
Improve doc on distributions and period #23

Merged Nov 19, 2019 · 3 commits
Changes from 1 commit
62 changes: 31 additions & 31 deletions src/guide-rngs.md
@@ -175,40 +175,40 @@ not work well with all algorithms.

The *period* or *cycle length* of a PRNG is the number of values that can be
generated after which it starts repeating the same random number stream.
Many PRNGs have a fixed-size period, but for some only an expected average
cycle length can be given, where the exact length depends on the seed.

Many PRNGs have a fixed-size period, while for others ("chaotic RNGs") the
cycle length may depend on the seed and short cycles may exist.

Note that a long period does not imply high quality (e.g. a counter through
`u128` values provides a decently long period). Conversely, a short period may
be a problem, especially when multiple RNGs are used simultaneously.
In general, we recommend a period of at least 2<sup>128</sup>.
(Alternatively, a PRNG with shorter period of at least 2<sup>64</sup> and
support for multiple streams may be sufficient. Note however that in the case
of PCG, its streams are closely correlated.)

*Avoid reusing values!*
On today's hardware, a fast RNG with a cycle length of *only*
2<sup>64</sup> can be used sequentially for centuries before cycling. However,
this is not the case for parallel applications. We recommend a period of
2<sup>128</sup> or more, which most modern PRNGs satisfy. Alternatively a PRNG
with shorter period but support for multiple streams may be chosen. There are
two reasons for this, as follows.

If we see the entire period of an RNG as one long random number stream,
every independently seeded RNG returns a slice of that stream. When multiple
RNG are seeded randomly, there is an increasingly large chance to end up
with a partially overlapping slice of the stream.

If the period of the RNG is 2<sup>128</sup>, and an application consumes
2<sup>48</sup> values, it then takes about 2<sup>32</sup> random
initializations to have a chance of 1 in a million to repeat part of an
already used stream. This seems good enough for common usage of
non-cryptographic generators, hence the recommendation of at least
2<sup>128</sup>. As an estimate, the chance of any overlap in a period of
size `p` with `n` independent seeds and `u` values used per seed is
approximately `1 - e^(-u * n^2 / (2 * p))`.
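The two numerical claims above can be checked directly. The sketch below assumes a hypothetical consumption rate of one value per nanosecond for the "centuries" claim, and plugs the text's figures (`p = 2^128`, `u = 2^48`, `n = 2^32`) into the overlap estimate `1 - e^(-u * n^2 / (2 * p))`:

```rust
fn main() {
    // Time to exhaust a 2^64 period at one value per nanosecond
    // (an assumed rate, faster than typical single-core generation).
    let seconds = 2f64.powi(64) * 1e-9;
    let years = seconds / (365.25 * 24.0 * 3600.0);
    println!("2^64 values at 1 ns each: {:.0} years", years); // ~585 years

    // Chance of any stream overlap: 1 - e^(-u*n^2 / (2p)) with
    // p = 2^128 (period), u = 2^48 (values used), n = 2^32 (seeds).
    let (p, u, n) = (2f64.powi(128), 2f64.powi(48), 2f64.powi(32));
    let exponent = u * n * n / (2.0 * p);
    let chance = -(-exponent).exp_m1(); // 1 - e^(-x), accurate for small x
    println!("overlap chance: {:.1e}", chance); // ≈ 7.6e-6
}
```

Note the use of `exp_m1` rather than `1.0 - (-x).exp()`: for exponents this small, the naive form loses all precision to rounding.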

Further, it is not recommended to use anywhere near the full period of an RNG.
When multiple RNGs are used in parallel (each with a unique seed), there is a
significant chance of overlap between the sequences generated.
For a generator with a *large* period `P`, `n` independent generators, and
a sequence of length `L` generated by each generator, the chance of any overlap
between sequences can be approximated by `Ln² / P` when `nL / P` is close to
zero.

*Bias and the birthday paradox!* Many PRNGs have a property called
*k-dimensional equidistribution*, meaning that for values of some size
(potentially larger than the output size), all possible values are produced the
same number of times over the generator's period. This is not a property of
true randomness: it suppresses the duplicate values a truly random source would
produce (a deviation quantified by the generalized birthday problem; see the
[PCG paper] for a good explanation). This results in a noticeable bias on
output after generating more values than the square root of the period (after
2<sup>64</sup> values for a period of 2<sup>128</sup>).
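This effect can be demonstrated at a toy scale (my own miniature construction, not from the guide): a full-period 16-bit generator, here a simple Weyl sequence with a hypothetical odd step constant, can produce no duplicates within its period, while a truly random 16-bit source sampled `k` times expects roughly `k(k-1)/2^17` duplicates:

```rust
fn main() {
    // Weyl sequence: adding an odd constant mod 2^16 visits every
    // 16-bit value exactly once per period (it is equidistributed),
    // so it cannot produce any duplicate before the period is exhausted.
    let mut state: u16 = 12345;
    let step: u16 = 0x9E37; // odd constant (hypothetical choice)
    let k = 4096; // well above sqrt(period) = 256
    let mut seen = [false; 1 << 16];
    let mut duplicates = 0;
    for _ in 0..k {
        state = state.wrapping_add(step);
        if seen[state as usize] {
            duplicates += 1;
        }
        seen[state as usize] = true;
    }
    // A truly random source would expect about k*(k-1)/2^17 duplicates.
    let expected = (k as f64) * (k as f64 - 1.0) / (1u32 << 17) as f64;
    println!("full-period generator: {duplicates} duplicates"); // 0
    println!("true randomness expects about {expected:.0}"); // ~128
}
```

The missing duplicates are exactly the bias the paragraph above describes: past the square root of the period, the generator's output is detectably "too spread out".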

period. For some uses this may be a nice property to have, but it may suppress
duplicates expected in a truly random generator;
@vigna vigna Nov 14, 2019

I'm sorry to insist, and this will be long, but this is really typical O'Neill nonsense (not surprisingly, the paper was rejected). Non-equidistributed generators fail a strong collision test exactly like an equidistributed generator. It is very different to state that you cannot prove that something does not happen, and to prove that something happens.

Suppose you have a 32-bit generator with 16-bit outputs (so you can actually run the tests) that is 2-dimensionally equidistributed (the best), e.g., a Marsaglia xorshift generator, and one that is not, like a PCG generator.

What is true is that if you look at the sequences of pairs of outputs (that is, 32 bits at a time) you will have to wait 2^32 iterations before seeing a collision in the xorshift case, and less in the PCG case.

First of all, "less" will not be the right number. I invite you to try—at this size it is very easy to run the test. In theory, if you take 2^20 samples (pairs of 16-bit outputs) you should get 128 collisions. You won't. True, the result will be 0 for the xorshift case and 126 or something for the PCG case, but the statistic won't be exactly right.

But the real problem is that this is not how you run a collision test. The maximum power of the test is when you enumerate 1.25 times the possible values—in our case, 1.25 * 2^32 values. Both generators will fail the test—there's just not enough state. At that length, the collisions should be about 0.55 * 2^32, but you can easily check that neither generator will give that.

Additionally, in exchange for a not-so-bad collision test in the short run, the PCG generator cannot produce all possible values. If you run a test based on Coupon's collector (after drawing O(n log n) elements uniformly out of n, the probability that you've seen all elements is high), the PCG generator will disastrously fail the test: there will always be missing elements. The xorshift generator will have a statistic a bit off, but not so horribly bad.

So: either you generate all possible outputs for your state, or not. In the first case, you might approximately win Coupon's collector and do horribly on collisions. In the second case, you might approximately win collisions, and do horribly on Coupon's collector. There's just not enough space to have both.

Any statement that either choice is "better" is just bogus.
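For readers who want to reproduce this kind of experiment, here is a minimal sketch (my construction at the sizes discussed above, not the exact test described in the comment): a full-period 32-bit Marsaglia xorshift emits every nonzero 32-bit value exactly once per cycle, so 2^20 draws produce zero collisions, whereas 2^20 truly random 32-bit values would collide about `k^2 / 2^33 = 128` times:

```rust
fn main() {
    // Marsaglia 32-bit xorshift (shift triple 13, 17, 5): period 2^32 - 1,
    // so no 32-bit output can repeat before the whole cycle is exhausted.
    let mut x: u32 = 0x2545_F491; // any nonzero seed
    let mut outputs: Vec<u32> = (0..1u32 << 20)
        .map(|_| {
            x ^= x << 13;
            x ^= x >> 17;
            x ^= x << 5;
            x
        })
        .collect();
    // Count colliding values by sorting and comparing neighbours.
    outputs.sort_unstable();
    let collisions = outputs.windows(2).filter(|w| w[0] == w[1]).count();
    // Expected for 2^20 truly random 32-bit values: ~ 2^40 / 2^33 = 128.
    println!("collisions: {collisions} (true randomness expects ~128)"); // 0
}
```

The "missing collisions" are only visible because the sample size exceeds the square root of the state space; at realistic state sizes neither this lack nor its presence is observable.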

Member Author

enumerate 1.25 times the possible values

I don't understand what the point is: so you know at least 1/5th of values are duplicates? (And, as below, it doesn't seem relevant to real usage.)

If you run a test based on Coupon's collector

Is this relevant, given that this situation is only applicable after cycling the generator many times?

For the most part the (non-)equidistribution property doesn't appear important to typical usage either since one should have (at the very least) L^2 < P.

Do you have a better argument for why this bound (L^2 < P) is recommended? That was the main point of this paragraph.

---

The point is that the collision test is stronger in that way. See https://dl.acm.org/citation.cfm?id=979926

Note that the "missing collisions" can happen only if your state has kw bits, your output is w bits, you are k-dimensionally equidistributed, and you consider collisions on blocks of kw bits. That's a lot of ifs.

"Equidistribution" in general does not imply anything about collisions. It just implies that your source can generate all possible values of a certain size, and generates the same amount of them. This is why the statement is mathematically wrong. You can keep it there—it's your book—but it's wrong.

The argument for L^2 < P is that if you have w bits of state and w bits of output, and you output all possible values (nobody would use a generator of this kind without that property), then if you use more than √P elements you will notice a lack of collisions. So you stay below that.

The argument extends to larger sizes in the sense that if you have kw bits of state and w bits of output, you can potentially generate in sequence all possible blocks of kw bits. If you do so you're in the same game for collisions of blocks of kw bits.

Generating all kw-bit blocks when kw is large might not be so relevant, but then the lack of collisions after √(2^kw) blocks is also irrelevant.

Member Author

Unfortunately, even with a university affiliation, I cannot access that article.

I get your point about equidistribution not implying correct distribution of repeats and have updated this paragraph. Please take a look.

---

Seriously? It takes 5s to get the public version 😂.

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.113.6616&rep=rep1&type=pdf

---

OK, my final three remarks:

  • When you say 2^64 values, I don't think everybody realizes this is after centuries of output generation; maybe it would be useful to point this out. That is, at that size the "bias" is entirely theoretical. It might not be if you have 32-bit outputs and 64 bits of state. Maybe that's a better example.
  • I realize you cannot be completely precise in discussing equidistribution, but I would replace "k-dimensional equidistribution" with "equidistribution in the maximum possible dimension". Otherwise, with an unspecified k, every generator emitting each output the same number of times (which happens for almost all generators, including PCG) would fall under your claim (and this is quite obviously not true).
  • When you say "nice property", I would explain why: "For some uses this may be a nice property to have, because it means that there are no missing values in the output of the generator, ...". Otherwise, you're showing just one side of the coin.

@dhardy dhardy (Member Author) Nov 19, 2019

When you say 2^64 values, I don't think everybody realizes this is after centuries of output generation

There were only two mentions of 2^64; one of them I forgot to delete (see @vks's comment above; I will fix), and the other does mention this in the same sentence.

"equidistribution in the maximum possible dimension"

I thought it was clear, but will make this change.

there are no missing values in the output of the generator

I see no reason why this property is relevant, given that we're talking about generators which take (at least) centuries to cycle. Because certain values may be impossible to generate with any seed.

Contributor

Because certain values may be impossible to generate with any seed.

Shuffling is the main application where this applies, I think. (Still not sure how relevant it is, because it wouldn't be feasible to generate all possible shuffles anyway.)

Member Author

Why do you care about shuffling? Most practical examples will have a relatively small sequence (e.g. 52), thus it is not the values output by the generator which matter but the values output by the Uniform distribution (which uses widening multiply + rejection, thus should be fairly resistant to bias in the low bits also).
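The widening multiply + rejection technique mentioned here can be sketched as follows (a simplified form of Lemire's bounded-sampling method, not rand's actual `Uniform` implementation; the toy LCG is an assumption purely for demonstration):

```rust
/// Map a 32-bit random output into [0, range) without modulo bias:
/// widen to 64 bits, keep the high word, and reject the few low words
/// that would make some results over-represented.
fn bounded(mut next_u32: impl FnMut() -> u32, range: u32) -> u32 {
    loop {
        let m = (next_u32() as u64) * (range as u64);
        let low = m as u32;
        if low >= range {
            return (m >> 32) as u32; // fast path: cannot be biased
        }
        // threshold = 2^32 mod range; low words below it are biased
        let threshold = range.wrapping_neg() % range;
        if low >= threshold {
            return (m >> 32) as u32;
        }
        // otherwise reject this draw and try again
    }
}

fn main() {
    // Toy 32-bit LCG as the underlying generator (demonstration only).
    let mut state: u32 = 1;
    let mut rng = || {
        state = state.wrapping_mul(1664525).wrapping_add(1013904223);
        state
    };
    let mut counts = [0u32; 52];
    for _ in 0..52_000 {
        counts[bounded(&mut rng, 52) as usize] += 1;
    }
    // Each of the 52 outcomes appears roughly 1000 times.
    let (min, max) = (counts.iter().min().unwrap(), counts.iter().max().unwrap());
    println!("min count {min}, max count {max}");
}
```

Because the result is taken from the widening multiply's high word, it depends mostly on the generator's high bits, which is why weak low bits matter little here.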

this is known as the generalized birthday problem
(see the [PCG paper] for a good explanation). In short, this can result in
noticeable bias in output after sampling more values than the square root of the
period (after 2<sup>64</sup> values for a period of 2<sup>128</sup>).

For more on this topic, please see these
[remarks by the Xoshiro authors](http://prng.di.unimi.it/#remarks).

## Security
