Improve doc on distributions and period #23
period of 2<sup>128</sup>).
period. For some uses this may be a nice property to have, but it may suppress
duplicates expected in a truely random generator;
I'm sorry to insist, and this will be long, but this is really typical O'Neill nonsense (not surprisingly, the paper was rejected). Non-equidistributed generators fail a strong collision test exactly like an equidistributed generator. Stating that you cannot prove that something does not happen is very different from proving that it happens.
Suppose you have a 32-bit generator with 16-bit outputs (so you can actually run the tests) that is 2-dimensionally equidistributed (the best), e.g., a Marsaglia xorshift generator, and one that is not, like a PCG generator.
What is true is that if you look at the sequence of pairs of outputs (that is, 32 bits at a time) you will have to wait 2^32 iterations before seeing a collision in the xorshift case, and fewer in the PCG case.
First of all, "less" will not be the right number. I invite you to try—at this size it is very easy to run the test. In theory, if you take 2^20 samples (pairs of 16-bit outputs) you should get 128 collisions. You won't. True, the result will be 0 for the xorshift case and 126 or something for the PCG case, but the statistic won't be exactly right.
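The figure of 128 follows from the birthday approximation n(n-1)/(2m) with n = 2^20 samples and m = 2^32 possible values. A minimal Python sketch of that arithmetic, together with an empirical count. It uses Python's built-in Mersenne Twister as a stand-in for an ideal source, which is an assumption made for illustration: it is neither of the generators discussed in this comment, and the helper names are hypothetical.

```python
import random


def expected_collisions(n, m):
    """Birthday approximation: expected colliding pairs among n uniform
    draws from m equally likely values."""
    return n * (n - 1) / (2 * m)


def count_collisions(samples):
    """Count draws that repeat a value already seen earlier."""
    seen = set()
    collisions = 0
    for s in samples:
        if s in seen:
            collisions += 1
        else:
            seen.add(s)
    return collisions


n, m = 2**20, 2**32
rng = random.Random(12345)  # arbitrary seed, stand-in for an ideal source
observed = count_collisions(rng.getrandbits(32) for _ in range(n))
print(expected_collisions(n, m), observed)
```

Swapping the `rng.getrandbits(32)` call for pairs of 16-bit outputs from the generator under test would give the comparison described above.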
But the real problem is that this is not how you run a collision test. The test reaches its maximum power when you enumerate 1.25 times the possible values, in our case 1.25 * 2^32 values. Both generators will fail the test: there's just not enough state. At that length there should be about 0.55 * 2^32 collisions, but you can easily check that neither generator will give that.
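The expected count at that length can be estimated from the occupancy formula: n draws from m values yield on average m(1 - (1 - 1/m)^n) distinct values, and every other draw is a repeat. The sketch below assumes an ideal uniform source and counts a collision as any draw that repeats an earlier value. Under that definition the ratio comes out near 0.54, in line with the rough figure quoted above.

```python
import math


def expected_repeats(n, m):
    """Expected number of draws that repeat an earlier value when drawing
    n times uniformly from m values."""
    # expm1/log1p keep precision when 1/m is tiny.
    distinct = -m * math.expm1(n * math.log1p(-1.0 / m))
    return n - distinct


m = 2**32
n = int(1.25 * m)
print(expected_repeats(n, m) / m)  # roughly 0.54 for an ideal source
```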
Additionally, in exchange for a not-so-bad collision test in the short run, the PCG generator cannot produce all possible values. If you run a test based on the coupon collector's problem (after drawing O(n log n) elements uniformly out of n, the probability that you've seen all elements is high), the PCG generator will fail disastrously: there will always be missing elements. The xorshift generator will have a statistic that is a bit off, but not so horribly bad.
So: either you generate all possible outputs for your state, or you don't. In the first case, you might approximately win the coupon collector's test and do horribly on collisions. In the second case, you might approximately win on collisions and do horribly on the coupon collector's test. There's just not enough space to have both.
Any statement that either choice is "better" is just bogus.
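To put numbers on the coupon collector's argument: for an ideal uniform source over n values, the expected number of draws needed to see every value is n * H_n (about n ln n), and the expected number of values still unseen after k draws is n(1 - 1/n)^k. A small sketch under those assumptions, with hypothetical helper names:

```python
def expected_draws_to_collect(n):
    """Expected number of uniform draws needed to see all n values: n * H_n."""
    return n * sum(1.0 / k for k in range(1, n + 1))


def expected_missing(n, draws):
    """Expected number of values still unseen after `draws` uniform draws."""
    return n * (1.0 - 1.0 / n) ** draws


n = 2**16
print(expected_draws_to_collect(n))  # on the order of n * ln(n)
print(expected_missing(n, 20 * n))   # far below one value for an ideal source
```

A generator that can never emit some values keeps at least those values missing however many draws are made, which is what such a test detects.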
> enumerate 1.25 times the possible values
I don't understand what the point is: so you know at least 1/5th of values are duplicates? (And, as below, it doesn't seem relevant to real usage.)
> If you run a test based on the coupon collector's problem
Is this relevant, given that this situation is only applicable after cycling the generator many times?
For the most part the (non-)equidistribution property doesn't appear important to typical usage either since one should have (at the very least) L^2 < P.
Do you have a better argument for why this bound (L^2 < P) is recommended? That was the main point of this paragraph.
The point is that the collision test is stronger in that way. See https://dl.acm.org/citation.cfm?id=979926
Note that the "missing collisions" can happen only if your state has kw bits, your output is w bits, you are k-dimensionally equidistributed, and you consider collisions on blocks of kw bits. That's a lot of ifs.
"Equidistribution" in general does not imply anything about collisions. It just implies that your source can generate all possible values of a certain size, and generates each of them the same number of times. This is why the statement is mathematically wrong. You can keep it there (it's your book), but it's wrong.
The argument for L^2 < P is that if you have w bits of state and w bits of outputs, and you output all possible values (nobody would use a generator of this kind without that property), if you use more than √P elements you will notice a lack of collisions. So you stay below that.
The argument extends to larger sizes in the sense that if you have kw bits of state and w bits of output, you can potentially generate in sequence all possible blocks of kw bits. If you do so you're in the same game for collisions of blocks of kw bits.
Generating all kw-bit blocks when kw is large might not be so relevant, but then collisions not happening after √(2^kw) blocks is also irrelevant.
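The w-bit state, w-bit output case can be tried at toy scale. The sketch below uses a 16-bit full-period LCG that outputs its whole state. The constants are chosen only to satisfy the Hull-Dobell full-period conditions and are not taken from this discussion. Within one period the generator never repeats a value, while an ideal source would show about L^2 / (2P) repeats, which becomes noticeable once L exceeds √P.

```python
def lcg16(seed, count):
    """Toy 16-bit LCG that outputs its entire state (period P = 2^16)."""
    state = seed & 0xFFFF
    for _ in range(count):
        state = (25173 * state + 13849) & 0xFFFF
        yield state


def repeats(values):
    """Number of draws that duplicate an earlier draw."""
    values = list(values)
    return len(values) - len(set(values))


P = 2**16
for L in (2**8, 2**12):  # sqrt(P) and well above sqrt(P)
    ideal = L * (L - 1) / (2 * P)  # birthday approximation for an ideal source
    print(L, repeats(lcg16(1, L)), round(ideal, 1))
```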
Unfortunately, even with a university affiliation, I cannot access that article.
I get your point about equidistribution not implying correct distribution of repeats and have updated this paragraph. Please take a look.
Seriously? It takes 5s to get the public version 😂.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.113.6616&rep=rep1&type=pdf
OK, my final three remarks:
- When you say 2^64 values, I don't think everybody realizes this is after centuries of output generation; maybe it would be useful to point this out. That is, at that size the "bias" is entirely theoretical. It might not be if you have 32-bit outputs and 64 bits of state. Maybe that's a better example.
- I realize you cannot be completely precise in discussing equidistribution, but I would replace "k-dimensional equidistribution" with "equidistribution in the maximum possible dimension". Otherwise, with an unspecified k, every generator emitting each output the same number of times (which happens for almost all generators, including PCG) would fall under your claim (and this is quite obviously not true).
- When you say "nice property", I would explain why: "For some uses this may be a nice property to have, because it means that there are no missing values in the output of the generator, ...". Otherwise, you're showing just one side of the coin.
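As a rough check on the "centuries" remark in the first point, the arithmetic can be sketched as follows. The throughput of one output per nanosecond is an assumption chosen for illustration, not a measurement of any generator.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
NS_PER_OUTPUT = 1.0  # assumed throughput: one output per nanosecond


def seconds_to_generate(count, ns_per_output=NS_PER_OUTPUT):
    """Wall-clock seconds needed to produce `count` outputs."""
    return count * ns_per_output * 1e-9


print(seconds_to_generate(2**64) / SECONDS_PER_YEAR)  # several centuries
print(seconds_to_generate(2**32))                     # a few seconds
```

At the same assumed rate, 2^32 outputs (the square root of a 2^64 period) take only seconds, which is why a smaller example may be easier for readers to relate to.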
> When you say 2^64 values, I don't think everybody realizes this is after centuries of output generation
There were only two mentions of 2^64; one of them I forgot to delete (see @vks's comment above; I will fix), and the other does mention this in the same sentence.
> "equidistribution in the maximum possible dimension"
I thought it was clear, but will make this change.
> there are no missing values in the output of the generator
I see no reason why this property is relevant, given that we're talking about generators which take (at least) centuries to cycle. Because certain values may be impossible to generate with any seed.
> Because certain values may be impossible to generate with any seed.
Shuffling is the main application where this applies, I think. (Still not sure how relevant it is, because it wouldn't be feasible to generate all possible shuffles anyway.)
Why do you care about shuffling? Most practical examples will have a relatively small sequence (e.g. 52), thus it is not the values output by the generator which matter but the values output by the Uniform distribution (which uses widening multiply + rejection, thus should be fairly resistant to bias in the low bits also).
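For readers unfamiliar with the technique, widening multiply plus rejection can be sketched as below. This is a generic illustration in Python with hypothetical function names, not the library's actual implementation. The random word is multiplied by the range size into a double-width product, the high half becomes the result, and the low half decides whether the draw is rejected to remove bias.

```python
import random


def map_word(x, n, bits=32):
    """Map one `bits`-wide random word x to [0, n), or return None to reject."""
    product = x * n                    # widening multiply: up to 2*bits wide
    hi = product >> bits               # candidate result in [0, n)
    lo = product & ((1 << bits) - 1)   # low half decides acceptance
    threshold = (1 << bits) % n        # size of the biased zone
    return None if lo < threshold else hi


def sample_below(n, next_word, bits=32):
    """Draw words from next_word() until one is accepted."""
    while True:
        result = map_word(next_word(), n, bits)
        if result is not None:
            return result


rng = random.Random(1)  # stand-in word source for the example
print(sample_below(52, lambda: rng.getrandbits(32)))
```

Because the result comes from the high half of the product, it depends mostly on the high bits of the input word, which is consistent with the remark above about resistance to bias in the low bits.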
Whoops, it appears I pushed upstream instead of to this PR accidentally (I should enable branch protections)! I think in any case we're done with this PR, so long as @vigna approves my last commit.
Addresses @vigna's comments. Closes #18, #19, #20, #21.
@vigna would you please review? Note that the changes on distributions are more extensive; @vks may be a more appropriate reviewer here.