Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[discussion] Improve usage of Random numbers in PCL #3996

Open
Morwenn opened this issue Apr 29, 2020 · 13 comments
Open

[discussion] Improve usage of Random numbers in PCL #3996

Morwenn opened this issue Apr 29, 2020 · 13 comments
Labels
kind: proposal Type of issue needs: feedback Specify why not closed/merged yet

Comments

@Morwenn
Copy link
Contributor

Morwenn commented Apr 29, 2020

This is a meta-issue concerning the uses of random number generation facilities, as discussed in the Discord. Since several Boost.Random components are being replaced with their standard library counterparts (ex: #2803, #3994), I thought that it was a good time to open a general library-wide discussion about random handling in PCL.

General information: https://codingnest.com/generating-random-numbers-using-c-standard-library-the-problems/

Random number facilities used in PCL

PCL uses a variety of random number facilities. We will start by reviewing those.

std::rand()/std::srand()

The legacy C functions std::rand() and std::srand() are still used quite a lot in PCL: in examples, in tests, in tutorials, in apps, in embedded 3rd-party code but also in general library code itself.

There is plenty of literature on the internet (TODO: links) showing why those are not good enough, but here are a few issues to sum it up:

  • rand has a global state which doesn't allow finer control over the different places that use it.
  • rand is not thread-safe.
  • rand has historically had a bunch of poor implementations and from what I understood the state of things is not that great.
  • The common practices using rand often have subtle numerical issues.

See also issue #3781.

Fortunately there has been Boost.Random for a long time and <random> in the standard library since C++11.

random_device

random_device is the only class from <random> that can produce real random numbers, and is used in PCL here and there to initialize PRNGs. However, the std::random_device implementation in libstdc++ is notably known to produce a fixed sequence of numbers when targeting Windows via MinGW-w64, and its entropy function can't really be used to check whether the underlying implementation will actually return random numbers or not-random-at-all ones - there is even a standard proposal to remove entropy. This will be fixed in GCC10, but that means that the problem will still exist for years.

If the closest thing from an actual random number is needed, boost::random_device is still a better tool today.

Pseudo-random number generators

There are plenty of PRNGs in both <random> and Boost.Random. They are pretty good despite their shortcomings (TODO: links), and are guaranteed to produce the same sequence of pseudo-random numbers on every implementation for a given seed. I don't know whether the Boost.Random ones have any advantages over the <random> ones, so either can be used.

Whether using a Mersenne Twister everywhere is debatable since it seems to be a rather heavy PRNG, and users might want to replace it depending on their needs (see the first answer in that thread). PCL facilities that use random number generators could very allow users to optionally provide their own PRNG when one is needed, just like some standard library algorithms (ex: std::shuffle). Issue #3724 already mentions the will to let users choose their PRNG.

Distributions

Unlike PRNGs, the standard does not specify a specific algorithm to be used for every distribution in <random> and only describes the properties of the distribution. In the real world that means that the distributions in the different standard libraries use different algorithms and that they won't return the same results when given the same PRNG seeded with the same value. As we will see in another section this might be a problem for reproducible tests on different platforms.

The simplest way to achieve reproducible results for distributions is to use the Boost.Random ones: the algorithm for a given distribution will be the same on all platforms. It is worth nothing that the standard committee recently rejected a proposal to make distribution results portable between platforms.

Eigen random number functions

Apparently Eigen random functions use rand(), with all its problems. I have to check whether it is still the case and what can be done about it.

Reproducible issues in code using random numbers

One thing that might be of value is the ability to reproduce errors in algorithms that use random-number generators even when those errors happen during the continuous integration or if someone opens an issue. The need for reproducible results in the presence of RNGs was already mentioned in #3715. This probably requires a bunch of changes, but it's not infeasible.

The first thing it requires is the ability to log the seed used to initialize the PRNG in the algorithm where the issue appeared. We can't really log from the algorithms themselves, so the most flexible solution is to do it like the standard library does: allow users to pass an already seeded PRNG to algorithms that need one, and use that to pass manually seeded PRNGs in the test suite.

The test suite can either use a single seed for the whole project - for example GTest's UnitTest::random_seed() -, or a seed per test case. What matters is that this seed should at the very least be logged when the corresponding test fails.

For truly reproducible failures in the test suite, the distributions from Boost.Random have to be used. If the standard ones are not used an error in libstdc++ might not be reproducible on the tester's computer due to the use of different algorithms for distributions.

Thread safety

I don't know how much PCL cares about thread-safety: currently it uses std::rand() (not thread-safe) in some places, but instantiates PRNGs in some functions despite their cost, which is thread-safe. A good middle ground it to have per-function thread_local PRNGs to have both thread safety without paying the cost of initialization every time.

Easy-to-use interface

The random number facilities in <random> are flexible enough but far from easy to use, maybe PCL wants to have its own facilities built on top to avoid the burden of going manually through the standard interface every time - this kind of burden is what pushes people to use std::rand() even today. Having a simplified interface and making it available would also allow to write tutorials that don't use std::rand().

Currently the only "simple" function that were tentatively standardized are std::experimental::randint and std::experimental::reseed in the Library Fundamentals TS v2, along with overloads of std::sample and std::shuffle that don't explicitly take a PRNG but require the presence of a "global" per-thread engine (which reseed() seeds). PCL could have something along these lines too: each function needing a PRNG would then either use a user-provided one or use the library's global per-thread engine.

@Morwenn Morwenn added the status: triage Labels incomplete label Apr 29, 2020
@peteroupc
Copy link

peteroupc commented Apr 30, 2020

As to the issue of pseudo random generators:

The random number engines currently available in C++ leave a lot to be desired. For example, default_random_engine has unspecified quality; mt19937, ranlux24, and ranlux48 have a bigger-than-necessary state and are nontrivial to seed; and minstd_rand0 and minstd_rand admit very few seeds in comparison to modern PRNGs. Although discard_block_engine and shuffle_block_engine (Bays–Durham shuffle) can be used to improve a random engine's quality, it's hard to tell whether the quality is improved this way without testing. There is a proposal to add counter-based PRNGs to a future version of C++, but for the time being, mt19937 is perhaps the best engine the C++ standard library has available. See also my criteria for high-quality random generators.

As for reproducibility:

If reproducible randomness is a goal, there are several things that can make that goal harder to achieve (besides using different PRNGs or distribution algorithms). They include:

  • Floating-point numbers.
  • Multithreading.
  • Nondeterministic sources (such as file system and system clock).
  • Undefined, undocumented, and implementation-specific behavior.

For further information, see my section "Ensuring Reproducibility", and the references cited there.

If feasible, in unit tests (as opposed to other kinds of tests, such as fuzz tests), a manually-seeded PRNG with a fixed seed could be used in place of other kinds of PRNGs for the purpose of the test, for reproducibility purposes. Notably, it could be argued that "unit tests" that rely on randomness are not really unit tests.

As for ease of use:

The PCL might also want to have an easy way to initialize multiple PRNGs at once while reducing or eliminating the risk of correlated random sequences. This can be helpful, for example, if multiple processes in the same simulation need to produce random numbers at the same time.


See also another discussion on random generators at nest/nest-simulator#1440 . A notable topic mentioned there but not here is the maximum number of random numbers a simulation needs; that discussion mentions 10^11 random numbers per stream per simulation.

@Morwenn
Copy link
Contributor Author

Morwenn commented Apr 30, 2020

Are randomized units tests still really unit tests? I don't know, I use them in my own projects and have a list of failing seeds that I had to the tests when I discover a new one, so as far as regression testing goes it does the job quite nicely.

@kunaltyagi kunaltyagi added kind: proposal Type of issue and removed status: triage Labels incomplete labels Apr 30, 2020
@kunaltyagi kunaltyagi changed the title [discussion] Random number generation in PCL [discussion] Improve usage of Random numbers in PCL Apr 30, 2020
@kunaltyagi kunaltyagi mentioned this issue May 7, 2020
2 tasks
@kunaltyagi kunaltyagi added this to the pcl-1.12.0 milestone May 7, 2020
@aPonza
Copy link
Contributor

aPonza commented May 11, 2020

While definitely out of my field, I remember looking for a proper way to seed the Mersenne Twister and at that time I settled on this SO answer.

@kunaltyagi
Copy link
Member

kunaltyagi commented May 11, 2020

Synopsis of the OP

Things I think I understood, and don't need clarifications on:

  • Remove std::srand() and std::rand()
  • Don't use Eigen's random because it relies on rand and srand
  • Prefer boost::random_device over std::random_device because implementations might be lacking, but boost is a known quantity
  • Prefer boost::{distribution} over std::{distribution} for reproducible results with same seed across different implementations (needed for next point)
  • Better random-ness in testing:
    • Explicitly log the random seed used in test suites (to log, find issue and resolve)
    • Allow testing test-cases with multiple explicit seed (previous failure points)
    • Allow testing with random seed
  • Properly init the PRNG of choice (one implementation per PRNG in PCL, plenty of links above)

Things I think I might have understood and need some clarifications:

  • Don't instantiate new PRNG inside threads (Should we have a global pool? What if we're creating PRNG inside our function and that's being called from multiple threads?)
  • Easy to use interface (Does boost have one that PCL can build upon? Should we rely on another third-party or recreate our own on top of boost/std?)

Things I might not have understood:
* How to achieve reproducibility in main PCL code (I think I understood what to do for tests, but not for PCL's library code. Is it just an extension of what to do for the tests? Does something more need to be done?)

@peteroupc
Copy link

peteroupc commented May 11, 2020

Don't instantiate new PRNG inside threads (Should we have a global pool? What if we're creating PRNG inside our function and that's being called from multiple threads?)

The issue of thread safety occurs when a single (global) instance of a PRNG is shared by multiple threads. If global state is accessed by multiple threads at a time and the program does not synchronize access to that state, problems such as deadlocks and race conditions can occur because each thread could have an inconsistent view of the state this way. In contrast, if each thread or task uses its own PRNG instance, these issues disappear because no other thread or task "cares" about that PRNG's state at the same time.

However, there is another issue with global PRNGs: they can hinder reproducibility, even across runs, since the order in which multiple threads access the PRNG may change from run to run.

See also oscar-system/Oscar.jl#100 (comment).

@kunaltyagi
Copy link
Member

Thanks @peteroupc (and @Morwenn) for taking lead in the discussion and clarifications 😄

However, there is another issue with global PRNGs: reproducibility

A global pool of thread_local PRNG is a bad idea for use in PCL due to reproducibility, if they are used.

  • What about copying a const global (not thread local) PRNG and using a mutable copy for local use? That'd allow a faster path (simple copy, no expensive instantiation) for the most common use-case (default seed, which can be chosen at PCL's compile time) while allowing a custom seed to be used as usual.
  • Is it a bad idea for providing global mutable thread_local PRNGs for ease-of-use of users? Most users who care about reproducibility will know not to use them.

@peteroupc
Copy link

peteroupc commented May 12, 2020

I wasn't necessarily thinking of a pool of "thread local" PRNGs (each of which can be accessed by a single thread) so much as PRNGs that multiple threads can access at the same time.

The following issues on other repositories also tackle "reproducibility" in the face of multithreading:

As you can see with these two issues, random number reproducibility is anything but trivial, especially in the face of multithreading and especially where random number generators are involved.

@kunaltyagi
Copy link
Member

PRNGs that multiple threads can access in indeterminate order

Those are out of question for use in the library. 👍

@kunaltyagi kunaltyagi added the needs: feedback Specify why not closed/merged yet label May 12, 2020
@Morwenn
Copy link
Contributor Author

Morwenn commented May 12, 2020

@kunaltyagi Your points accurately sum up what I was thinking - I'm not sure what to do with Eigen random functions but we can probably replace them somehow.

How to achieve reproducibility in main PCL code [...]?

I don't propose to achieve that because various algorithms rely on randomness and randomness needs stay random. All I was proposing was to add the possibility to make the main PCL code behaviour reproducible when needed by allowing users to feed their own PRNG to parts of the library that use random components. That's basically what we need to write reproducible tests in the first place.

@Morwenn
Copy link
Contributor Author

Morwenn commented May 12, 2020

For the easy interface I think that randutils was somewhat popular. I believe that there are several projects along these lines all over the web but I don't know which ones are good.

@kunaltyagi
Copy link
Member

kunaltyagi commented May 12, 2020

I was worried I might have missed something.

Those sound like reasonable things to do in presence of the data. Could you please create issues for small, easily digestible chucks of actionable items (which can be tracked by the new project I've created)?

There are some more questions that need to be discussed though:

  • interface: candidates:
    • copy-paste (from where)
    • integrate as dependency (from where)
  • source of good design for random:
  • Find and reduce/replace usage of Eigen random functions (replace how?)
  • Providing a static global pool of instantiated copy-only PRNGs for the default seed for faster happy-path?
  • Providing a static global pool of mutable thread_local PRNGs for ease-of-use of users?

@SergioRAgostinho
Copy link
Member

  • Prefer boost::{distribution} over std::{distribution} for reproducible results with same seed across different implementations (needed for next point)

This is not an absolute requirement for me. Adopting std will enforce users to not rely on anything other than the statistical properties because it will never allow expecting the same sequence of samples between implementations. On the other hand, it would make cross platform debugging impossible. So yeah... go boost!

  • Allow testing test-cases with multiple explicit seed (previous failure points)
  • Allow testing with random seed

I hope this was enforced early on. More than half of the unit tests in this library would not have been accepted if so.

@kunaltyagi
Copy link
Member

kunaltyagi commented May 12, 2020

I hope this was enforced early on. More than half of the unit tests in this library would not have been accepted if so.

Did you mean I wish this was enforced early on? If so, I remember someone saying: parts of PCL were written by interns high on amphetamines 🤣

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind: proposal Type of issue needs: feedback Specify why not closed/merged yet
Projects
None yet
Development

No branches or pull requests

6 participants