Improve doc on distributions and period #23
I'm sorry to insist, and this will be long, but this is really typical O'Neill nonsense (not surprisingly, the paper was rejected). Non-equidistributed generators fail a strong collision test exactly like an equidistributed generator. It is very different to state that you cannot prove that something does not happen, and to prove that something happens.
Suppose you have a 32-bit generator with 16-bits outputs (so you can actually run the tests) that is 2-dimensionally equidistributed (the best), e.g., a Marsaglia xorshift generator, and one that is not, like a PCG generator.
What is true is that if you look at the sequences of pairs of outputs (that is, 32 bits at a time) you will have to wait 2^32 iterations before seeing a collision in the xorshift case, and less in the PCG case.
First of all, "less" will not be the right number. I invite you to try—at this size it is very easy to run the test. In theory, if you take 2^20 samples (pairs of 16-bit outputs) you should get 128 collisions. You won't. True, the result will be 0 for the xorshift case and 126 or something for the PCG case, but the statistic won't be exactly right.
But the real problem is that this is not how you run a collision test. The maximum power of the test is when you enumerate 1.25 times the possible values (in our case, 1.25 * 2^32 values). Both generators will fail the test: there's just not enough state. At that length, the collisions should be about 0.55 * 2^32, but you can easily check that neither generator will give that.
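For reference, the expected collision counts quoted above can be checked with a few lines of Python. This is only a sketch under stated assumptions: draws are modelled as independent and uniform, a "collision" is counted as a draw that repeats an earlier value, and the function name `expected_collisions` is made up for illustration. Under this model the short run gives roughly 128, and the run of 1.25 * 2^32 draws gives about 0.54 * 2^32, in the same range as the figure quoted above.

```python
import math

def expected_collisions(samples: float, cells: float) -> float:
    """Expected number of draws that repeat an earlier value.

    Assumes independent uniform draws and uses the Poisson approximation
    E[distinct values] = cells * (1 - exp(-samples / cells)).
    """
    distinct = cells * -math.expm1(-samples / cells)
    return samples - distinct

cells = 2.0 ** 32  # pairs of 16-bit outputs

# Short run: 2^20 pairs out of 2^32 possible values.
print(expected_collisions(2.0 ** 20, cells))

# Full-strength run: 1.25 * 2^32 pairs, reported as a fraction of 2^32.
print(expected_collisions(1.25 * cells, cells) / cells)
```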
Additionally, in exchange for a not-so-bad collision test in the short run, the PCG generator cannot produce all possible values. If you run a test based on the coupon collector's problem (after drawing O(n log n) elements uniformly out of n, the probability that you've seen all elements is high), the PCG generator will disastrously fail the test: there will always be missing elements. The xorshift generator will have a statistic a bit off, but not so horribly bad.
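A toy version of the coupon collector check might look like the sketch below. It does not use PCG or xorshift: the two sources are stand-ins built on Python's `random` module, and the names `expected_draws` and `draws_until_complete` are hypothetical. The point is only that a source which can never emit some value never completes the collection, however long it runs.

```python
import random
from typing import Callable, Optional

def expected_draws(n: int) -> float:
    """Expected number of uniform draws needed to see all n values: n * H_n."""
    return n * sum(1.0 / k for k in range(1, n + 1))

def draws_until_complete(n: int, draw: Callable[[], int], limit: int) -> Optional[int]:
    """Number of draws until all n values were seen, or None if limit is reached."""
    seen = set()
    for t in range(1, limit + 1):
        seen.add(draw())
        if len(seen) == n:
            return t
    return None

n = 64
rng = random.Random(0)
full = lambda: rng.randrange(n)         # can produce every value
missing = lambda: rng.randrange(n - 1)  # stand-in for a source that never emits n - 1

print(expected_draws(n))                         # n * H_n, roughly n ln n
print(draws_until_complete(n, full, 100_000))    # finishes after a few hundred draws
print(draws_until_complete(n, missing, 100_000)) # None: one value never shows up
```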
So: either you generate all possible outputs for your state, or not. In the first case, you might approximately win the coupon collector test and do horribly on collisions. In the second case, you might approximately win on collisions and do horribly on the coupon collector test. There's just not enough space to have both.
Any statement that either choice is "better" is just bogus.
I don't understand what the point is: so you know at least 1/5th of values are duplicates? (And, as below, it doesn't seem relevant to real usage.)
Is this relevant, given that this situation is only applicable after cycling the generator many times?
For the most part the (non-)equidistribution property doesn't appear important to typical usage either since one should have (at the very least) L^2 < P.
Do you have a better argument for why this bound (L^2 < P) is recommended? That was the main point of this paragraph.
The point is that the collision test is stronger in that way. See https://dl.acm.org/citation.cfm?id=979926
Note that the "missing collisions" can happen only if your state has kw bits, your output is w bits, you are k-dimensionally equidistributed, and you consider collisions on blocks of kw bits. That's a lot of ifs.
"Equidistribution" in general does not imply anything about collisions. It just implies that your source can generate all possible values of a certain size, and generates the same amount of them. This is why the statement is mathematically wrong. You can keep it there—it's your book—but it's wrong.
The argument for L^2 < P is this: if you have w bits of state and w bits of output, and you output all possible values (nobody would use a generator of this kind without that property), then once you use more than √P elements you will notice a lack of collisions. So you stay below that.
The argument extends to larger sizes in the sense that if you have kw bits of state and w bits of output, you can potentially generate in sequence all possible blocks of kw bits. If you do so you're in the same game for collisions of blocks of kw bits.
Generating all kw-bit blocks when kw is large might not be so relevant, but then collisions not happening after √(2^kw) blocks is irrelevant too.
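The L^2 < P argument can be seen on a toy scale. The sketch below assumes a 16-bit full-period LCG that outputs its whole state (so P = 2^16); the multiplier and increment are illustrative constants chosen to satisfy the Hull-Dobell full-period conditions, not anything taken from the book or from rand. Such a generator never repeats a value within a period, whereas a truly random source would show about L^2 / (2P) repeats in L draws.

```python
def lcg16(state: int) -> int:
    """One step of a full-period 16-bit LCG (multiplier = 1 mod 4, odd increment)."""
    return (25173 * state + 13849) & 0xFFFF

def count_repeats(length: int, seed: int = 1) -> int:
    """Count outputs that repeat an earlier output among the first `length` outputs."""
    seen = set()
    repeats = 0
    state = seed
    for _ in range(length):
        state = lcg16(state)
        if state in seen:
            repeats += 1
        seen.add(state)
    return repeats

P = 1 << 16
for L in (1 << 8, 1 << 12):
    expected_if_random = L * L / (2 * P)  # birthday approximation
    print(L, count_repeats(L), expected_if_random)
```

At L = 2^8 (that is, √P) a random source would be expected to give about 0.5 repeats, so seeing none is unremarkable. At L = 2^12 it would be expected to give about 128, so seeing none is an obvious failure.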
Unfortunately, even with a university affiliation, I cannot access that article.
I get your point about equidistribution not implying correct distribution of repeats and have updated this paragraph. Please take a look.
Seriously? It takes 5s to get the public version 😂.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.113.6616&rep=rep1&type=pdf
OK, my final three remarks:
There were only two mentions of 2^64; one of them I forgot to delete (see @vks's comment above; I will fix), and the other does mention this in the same sentence.
I thought it was clear, but will make this change.
I see no reason why this property is relevant, given that we're talking about generators which take (at least) centuries to cycle.
Because certain values may be impossible to generate with any seed.
Shuffling is the main application where this applies, I think. (Still not sure how relevant it is, because it wouldn't be feasible to generate all possible shuffles anyway.)
Why do you care about shuffling? Most practical examples will have a relatively small sequence (e.g. 52), thus it is not the values output by the generator which matter but the values output by the Uniform distribution (which uses widening multiply + rejection, thus should be fairly resistant to bias in the low bits also).
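For readers unfamiliar with the technique, here is a rough sketch of widening multiply plus rejection for drawing an integer in [0, n). It is not the actual code of the Uniform distribution: the names `map_word` and `uniform_below` are hypothetical, the word size defaults to 32 bits as an assumption, and the rejection rule is the standard one that discards a word when the low half of the product falls below 2^bits mod n. The result comes from the high half of the product, which is why the low bits of the generator carry little weight.

```python
import random
from typing import Callable, Optional

def map_word(x: int, n: int, bits: int = 32) -> Optional[int]:
    """Map one random word to [0, n), or return None if the word must be rejected."""
    product = x * n                        # widening multiply: up to 2 * bits wide
    low = product & ((1 << bits) - 1)      # low half decides acceptance
    if low < (1 << bits) % n:              # biased zone: reject and redraw
        return None
    return product >> bits                 # high half is the result

def uniform_below(n: int, next_word: Callable[[], int], bits: int = 32) -> int:
    """Draw words until one is accepted; the result is uniform on [0, n)."""
    while True:
        value = map_word(next_word(), n, bits)
        if value is not None:
            return value

# Usage: pick a card index for a 52-card shuffle step.
rng = random.Random(0)
print(uniform_below(52, lambda: rng.getrandbits(32)))
```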