A new RNG for Julia? #8786
Adding new RNG implementations would be much less disruptive if we got rid of the notion of global RNG state that can be initialized with |
@ivarne, retrofitting existing code to get repeatable random numbers would then be a big chore. A common use case is to use non-repeatable random numbers (with global RNG state), and only use |
If we can roughly get the same performance and randomness guarantees, then we should definitely do this. I agree @ivarne that passing around RNG objects makes the most sense, particularly if we want this to be useful for parallel computing. |
Of course, at the very least you should be able to pass around an RNG object and have everything work. |
I think a reasonable model to copy is the IO interface. We can still have a global RNG object, but if you are writing a library function that uses random numbers, then you should allow it to accept an RNG argument (with the default being the global RNG) |
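To make the IO-style convention concrete, here is a hedged C sketch of the "optional RNG argument" pattern; the generator is a xorshift64*-style placeholder rather than Julia's actual RNG, and all names (`rng_t`, `rng_next`, `random_unit`) are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* A single-stream RNG state; must be seeded nonzero. */
typedef struct { uint64_t s; } rng_t;

static rng_t global_rng = { 0x9E3779B97F4A7C15ULL };

/* xorshift64*-style step (placeholder generator). */
static uint64_t rng_next(rng_t *r) {
    uint64_t x = r->s;
    x ^= x >> 12;
    x ^= x << 25;
    x ^= x >> 27;
    r->s = x;
    return x * 2685821657736338717ULL;
}

/* A "library" function: callers may pass their own stream for
   reproducibility, or NULL to default to the global one. */
static double random_unit(rng_t *r) {
    if (r == NULL) r = &global_rng;
    /* take the top 53 bits and scale by 2^-53 to get [0, 1) */
    return (double)(rng_next(r) >> 11) * (1.0 / 9007199254740992.0);
}
```

Calling `random_unit(NULL)` mirrors today's implicit global stream, while passing an explicit `rng_t` gives the repeatability discussed above.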
We can certainly replace dSFMT, and it would be good to shed yet another binary dependency. dSFMT could still be useful if someone uses the array interface directly, warts and all, but for that purpose it can easily live in a package. |
No one is arguing that you should not be able to pass around RNG objects; the question is whether you should be required to do so. The latter would be disruptive and awkward to use, in my opinion. I agree that the model of optional |
But on the present topic, what is the downside to replacing dSFMT as the default RNG (assuming the xorshift* algorithm is as fast and good as the authors claim)? It doesn't seem likely that there are many users who are heavily reliant on a particular sequence of pseudorandom numbers (as opposed to just the ability to have a reproducible sequence). The only disruption I can imagine is in regression tests where someone hard-coded the expected result from a particular random sequence. |
Marsaglia certainly has a good reputation in the random-number field, and |
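For context, the scalar member of the family, xorshift64*, takes only a few lines; this is a C sketch using the shift triple (12, 25, 27) and the multiplier given in Vigna's xorshift* paper (the 1024-bit variant enlarges the state, not the structure):

```c
#include <stdint.h>

/* xorshift64*: a Marsaglia xorshift step followed by multiplication by
   an odd constant to scramble the low-order bits. The state must be
   seeded to a nonzero value and never becomes zero afterwards. */
static uint64_t xorshift64star(uint64_t *state) {
    uint64_t x = *state;
    x ^= x >> 12;
    x ^= x << 25;
    x ^= x >> 27;
    *state = x;
    return x * 2685821657736338717ULL;
}
```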
A quick (untested) transcription of the C 1024-bit xorshift* algorithm:

```julia
import Base: rand, rand!

type Xorshift1024 <: AbstractRNG
    s::Vector{Uint64}
    p::Uint64
    Xorshift1024() = new(rand(Uint64, 16), 1)
end

function rand(rng::Xorshift1024)
    p = rng.p
    s = rng.s
    @inbounds s0 = s[p]
    p = (p & 15) + 1
    s1 = s[p]
    s1 $= s1 << 31  # a
    s1 $= s1 >> 11  # b
    s0 $= s0 >> 30  # c
    rng.p = p
    @inbounds val = s[p] = s0 $ s1
    return val * 1181783497276652981
end

function rand!(rng::Xorshift1024, a::AbstractArray{Uint64})
    for i = 1:length(a)
        @inbounds a[i] = rand(rng)
    end
    return a
end

rand(rng::Xorshift1024, n::Int) = rand!(rng, Array(Uint64, n))
```

The result is about 30% slower than our current
Am I missing something? Ideally we should use a fixed-length vector rather than a |
In Andreas' gist above @simonster has a thoroughly hacked version that's faster. Can probably be faster still if we special-case filling an array. |
Ah, I didn't see that one. Okay, the trick is to use |
@stevengj I don't think there is any downside to replacing the current RNG. In fact we should do so, and in case someone needs dSFMT, we can provide the current functionality in a package. |
Here's my version, based on @simonster's, that unrolls the loop when filling arrays:

```julia
module XORShift

import Base: rand, rand!

type XORShiftStar1024 <: AbstractRNG
    x00::Uint64
    x01::Uint64
    x02::Uint64
    x03::Uint64
    x04::Uint64
    x05::Uint64
    x06::Uint64
    x07::Uint64
    x08::Uint64
    x09::Uint64
    x10::Uint64
    x11::Uint64
    x12::Uint64
    x13::Uint64
    x14::Uint64
    x15::Uint64
    p::Uint64
end

XORShiftStar() = XORShiftStar1024(rand(Uint64, 16)..., 0)

const globalXORShiftStar1024 = XORShiftStar()

fromUint64(::Type{Uint64}, u::Uint64) = u
fromUint64(::Type{Float64}, u::Uint64) = 5.421010862427522e-20 * u  # 2^-64

function rand(rng::XORShiftStar1024, ::Type{Uint64})
    @inbounds begin
        p = rng.p
        s0 = getfield(rng, reinterpret(Int, p) + 1)
        p = (p + 1) % 16
        s1 = getfield(rng, reinterpret(Int, p) + 1)
        s1 $= s1 << 31
        s1 $= s1 >> 11
        s0 $= s0 >> 30
        s0 $= s1
        unsafe_store!(convert(Ptr{Uint64}, pointer_from_objref(rng)) + 8,
                      s0, reinterpret(Int, p) + 1)
        rng.p = p
    end
    return s0 * 1181783497276652981
end

rand{T<:Number}(rng::XORShiftStar1024, ::Type{T}) = fromUint64(T, rand(rng, Uint64))

@eval function rand!{T}(rng::XORShiftStar1024, a::AbstractArray{T})
    # load the 16 state words into locals
    $(Expr(:block, [
        let x = symbol(@sprintf("x%02d", p % 16))
            :($x = rng.$x)
        end
        for p = 0:15 ]...))
    @inbounds for i = 1:16:length(a)
        # one fully unrolled pass over the 16 lanes
        $(Expr(:block, [
            let x0 = symbol(@sprintf("x%02d", p % 16)),
                x1 = symbol(@sprintf("x%02d", (p + 1) % 16))
                quote
                    s0 = $x0
                    s1 = $x1
                    s1 $= s1 << 31
                    s1 $= s1 >> 11
                    s0 $= s0 >> 30
                    s0 $= s1
                    $x1 = s0
                    a[i + $p] = fromUint64(T, s0 * 1181783497276652981)
                end
            end
            for p = 0:15 ]...))
    end
    # write the state back
    $(Expr(:block, [
        let x = symbol(@sprintf("x%02d", p % 16))
            :(rng.$x = $x)
        end
        for p = 0:15 ]...))
    return a
end

end # module
```

This version seems to be significantly faster than dSFMT at array filling for both Float64 (28% faster) and Uint64 (74% faster). Note that this version is not correctly generalized yet: it only handles arrays with lengths that are a multiple of 16, and assumes that |

Also, this may require a fix to |
Out of curiosity, it can be noted that a 32-bit xorshift generator is included in https://github.com/JuliaLang/julia/blob/master/test/perf/kernel/go_benchmark.jl

Somewhat ironically, in light of this discussion, it was greeted with: "Julia also ships with a pretty high-end RNG which you could use instead of this xor-based one." |
@GunnarFarneback – hah, that is rather funny. |
Given the beginning of the thread, allow me to remind that PR #8380 implements more |
I think that Julia's dSFMT based RNG can be made much faster than it currently is (I observed up to x3), see #8808. |
Fully general version of the above:

```julia
@eval function rand!{T}(rng::XORShiftStar1024, a::AbstractArray{T})
    p = rng.p
    n = length(a)
    q = min((16 - p) % 16, n)
    # scalar head: step until the stream position wraps to lane 0
    @inbounds for i = 1:q
        a[i] = rand(rng, T)
    end
    # load the 16 state words into locals
    $(Expr(:block, [
        let x = symbol(@sprintf("x%02d", k % 16))
            :($x = rng.$x)
        end
        for k = 0:15 ]...))
    r = q + (n - q) >> 4 << 4  # end of the unrolled region
    @inbounds for i = q+1:16:r
        $(Expr(:block, [
            let x0 = symbol(@sprintf("x%02d", k % 16)),
                x1 = symbol(@sprintf("x%02d", (k + 1) % 16))
                quote
                    s0 = $x0
                    s1 = $x1
                    s1 $= s1 << 31
                    s1 $= s1 >> 11
                    s0 $= s0 >> 30
                    s0 $= s1
                    $x1 = s0
                    a[i + $k] = fromUint64(T, s0 * 1181783497276652981)
                end
            end
            for k = 0:15 ]...))
    end
    # write the state back
    $(Expr(:block, [
        let x = symbol(@sprintf("x%02d", k % 16))
            :(rng.$x = $x)
        end
        for k = 0:15 ]...))
    rng.p = 0
    # scalar tail for the remaining n - r elements
    @inbounds for i = r+1:n
        a[i] = rand(rng, T)
    end
    return a
end
```

We should also strongly consider @rfourquet's patches to dSFMT, which might be even faster. |
We should see if we can use the tricks used by dSFMT to have a SIMD version of this algorithm. That may be non-trivial work, but if possible, doing so would put us on par or even faster than dSFMT, since it does not go beyond SSE2, IIRC. |
I think this is a good opportunity to realize the SIMD potential within Julia that was first promised in #4042. Here's a small start with the easy part, the xor operation.

```julia
julia> typealias Uint64x2 NTuple{2, Uint64}
(Uint64,Uint64)

julia> typealias Uint64x4 NTuple{4, Uint64}
(Uint64,Uint64,Uint64,Uint64)

julia> function ($)(x::Uint64x2, y::Uint64x2)
           Base.llvmcall("""%3 = xor <2 x i64> %1, %0
                            ret <2 x i64> %3""",
                         Uint64x2, (Uint64x2, Uint64x2), x, y)
       end
$ (generic function with 34 methods)

julia> function ($)(x::Uint64x4, y::Uint64x4)
           Base.llvmcall("""%3 = xor <4 x i64> %1, %0
                            ret <4 x i64> %3""",
                         Uint64x4, (Uint64x4, Uint64x4), x, y)
       end
$ (generic function with 35 methods)

julia> (0x0123456789abcdef,0x0123456789abcdef)$(0x0011223344556677,0x8899aabbccddeeff)
(0x01326754cdfeab98,0x89baefdc45762310)

julia> code_native($, (Uint64x2, Uint64x2))
        .text
Filename: none
Source line: 2
        push    RBP
        mov     RBP, RSP
        vxorps  XMM0, XMM1, XMM0
Source line: 2
        pop     RBP
        ret

julia> code_native($, (Uint64x2, Uint64x2, Uint64x2))
        .text
Filename: operators.jl
Source line: 82
        push    RBP
        mov     RBP, RSP
        vxorps  XMM0, XMM1, XMM0
        vxorps  XMM0, XMM0, XMM2
Source line: 82
        pop     RBP
        ret

julia> code_native($, (Uint64x4, Uint64x4))
        .text
Filename: none
Source line: 2
        push    RBP
        mov     RBP, RSP
        vxorps  YMM0, YMM1, YMM0
Source line: 2
        pop     RBP
        ret
```

The Uint64x2 xor only requires SSE2. The Uint64x4 version needs something later, possibly AVX; in any case it is available on my Sandy Bridge processor. Getting sensible shift instructions out of this is beyond my LLVM skills, however. But load, store, xor, multiplication, left shift, and right shift are all the components needed for a xorshift* generator. This should be within reach from Julia. |
With SIMD vectorization, we could produce multiple streams from different xorshift seeds in parallel. I guess that's OK, even though it increases the size of the RNG state. What I find more troublesome is that the optimal number of parallel streams depends on the width of the SIMD instructions available (and probably also on their latency). But if we interleave multiple streams, it should always be the same number, so that we produce the same random sequence independent of the machine where the random number generation is done. I guess that we would want the smallest number of parallel streams that seems future-proof, in that it will remain possible to do the evaluation in parallel on future processors, which might have wider SIMD instructions. But I'm not sure how easy it is to find a figure that is both sensible now and future-proof. Looking at the xorshift1024* algorithm, it seems to have an inherent parallelization bottleneck in that the previous 64-bit result |
Now, if there were a xorshift1024* algorithm where the next output depends on the output from e.g. 15 and 16 steps ago, there would have been a lot more room for parallelization... |
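One way to pin the stream layout down is to fix the number of interleaved lanes once and for all, independent of hardware width. A hedged C sketch of the idea, where the names `ixs_rng`/`ixs_fill` and the lane count of 4 are arbitrary illustrative choices and each lane runs an independent xorshift64* step:

```c
#include <stddef.h>
#include <stdint.h>

#define NSTREAMS 4  /* fixed once, so the sequence is machine-independent */

/* Each lane must be seeded with a distinct nonzero value. */
typedef struct { uint64_t s[NSTREAMS]; } ixs_rng;

/* Fill `out` by round-robining over NSTREAMS independent xorshift64*
   lanes. The lanes carry no cross-dependencies, so a vectorizer (or an
   explicit SIMD version) can step them in parallel, yet the output
   order is the same on any machine. */
static void ixs_fill(ixs_rng *r, uint64_t *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        uint64_t x = r->s[i % NSTREAMS];
        x ^= x >> 12;
        x ^= x << 25;
        x ^= x >> 27;
        r->s[i % NSTREAMS] = x;
        out[i] = x * 2685821657736338717ULL;
    }
}
```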
Looking at the papers, it seems that the validation that the proposed generators have received was to verify the maximum period property and to evaluate a hundred seeds with BigCrush. This is bordering on original research, but I wrote some code to check the maximum period property, see this gist. It takes around half an hour to verify the property on my somewhat dated laptop. If someone were to come up with a variation on xorshift1024 with better parallelization properties that passes the test, and BigCrush likes it, we could consider using it for an RNG. Edit: Changed the order of loops in the matrix multiply, and some other optimizations. Now it takes a little more than a minute to verify the maximum period property of xorshift1024 on my machine. |
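The same check is cheap for small members of the family. As an illustration of the technique (GF(2) matrix exponentiation), here is a C sketch that verifies the maximal period of Marsaglia's classic 64-bit triple (13, 7, 17) rather than xorshift1024; it tests that the transition matrix M satisfies M^(2^64-1) = I but M^((2^64-1)/p) != I for every prime p dividing 2^64-1:

```c
#include <stdint.h>
#include <string.h>

/* 64x64 GF(2) matrix: row i is the image of basis vector e_i. */
typedef struct { uint64_t row[64]; } mat64;

/* One step of Marsaglia's 64-bit xorshift with triple (13, 7, 17). */
static uint64_t step(uint64_t x) {
    x ^= x << 13; x ^= x >> 7; x ^= x << 17;
    return x;
}

static mat64 mat_id(void) {
    mat64 m;
    for (int i = 0; i < 64; i++) m.row[i] = 1ULL << i;
    return m;
}

/* Build the transition matrix by applying step() to each basis vector. */
static mat64 mat_from_step(void) {
    mat64 m;
    for (int i = 0; i < 64; i++) m.row[i] = step(1ULL << i);
    return m;
}

/* v * M, treating v as a row vector of bits. */
static uint64_t apply(const mat64 *m, uint64_t v) {
    uint64_t r = 0;
    for (int i = 0; i < 64; i++)
        if (v >> i & 1) r ^= m->row[i];
    return r;
}

static mat64 mat_mul(const mat64 *a, const mat64 *b) {
    mat64 c;
    for (int i = 0; i < 64; i++) c.row[i] = apply(b, a->row[i]);
    return c;
}

static mat64 mat_pow(mat64 m, uint64_t e) {   /* square-and-multiply */
    mat64 r = mat_id();
    while (e) {
        if (e & 1) r = mat_mul(&r, &m);
        m = mat_mul(&m, &m);
        e >>= 1;
    }
    return r;
}

/* 2^64 - 1 = 3 * 5 * 17 * 257 * 641 * 65537 * 6700417 */
static int has_full_period(void) {
    static const uint64_t primes[] = {3, 5, 17, 257, 641, 65537, 6700417};
    mat64 m = mat_from_step(), id = mat_id();
    mat64 full = mat_pow(m, UINT64_MAX);
    if (memcmp(&full, &id, sizeof id) != 0) return 0;
    for (int k = 0; k < 7; k++) {
        mat64 part = mat_pow(m, UINT64_MAX / primes[k]);
        if (memcmp(&part, &id, sizeof id) == 0) return 0;
    }
    return 1;
}
```

For the 1024-bit state the matrix is 1024x1024 and the exponents are big integers, which is what makes that check take minutes rather than microseconds.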
Note that with @rfourquet's commit 06fb0a3, we are now back to using the fast array generators from dSFMT. |
And now with 84b6aec the |
I think we should have XORShift as another RNG in Julia, with dSFMT continuing to be the default for now. |
Michael Stumpf gave a great talk at the Julia London meetup last night, and as someone who has a lot of experience with large-scale Monte Carlo simulations, he does not hold a very high opinion of Mersenne twister RNGs. I'm sure he would be keen to provide some input on this, particularly for working with parallel streams. He did have good things to say about this work (free link available here), based on cryptographic functions of simple transitions. |
Was his major concern lack of independence between streams, i.e. in our situation the MT with different seed values? @StefanKarpinski has mentioned the Random123 library before, and I have tried to wrap it, but I had difficulties in getting high speed when calling the library from Julia. It's a headers-only library, so you'd have to write a little wrapper to make a dynamic library, and somehow this step made the speed drop. There are several RNGs in the library, and some of them are quite simple to write in Julia, which I did at some point. Maybe it would be competitive in terms of speed if @toivoh's Julia SIMD tricks could be applied to the Julia translation. |
That was certainly a major problem, though I think he also viewed it as relatively slow given its unreliability (i.e. failing BigCrush). I've sent him an email to ask for input. |
The parallelization issue is another big problem. But it's not really just parallelism – it's about the fact that the determinacy of sequential RNGs is inherently dynamically scoped and brittle. Consider e.g. if some function you call decides to use a randomized algorithm in the future – if both codes are using a default global RNG, then this changes the random stream your code sees. This is ameliorated by explicitly passing RNG state around, but that does not address parallelism. The indexed RNG approach of Random123 is totally immune to this – the stream depends only on the indexes you use. Fast-forwarding and parallelism are completely trivial. You may need to use some label at each location to ensure that different streams are used in each lexical place; I'd be interested in hearing how that's done. One approach is to generate random values at code-writing time, like we've done for the hashing stuff. |
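The indexed idea can be shown with a toy counter-based generator in C: the i-th value is a pure function of (key, i), so random access and fast-forwarding are free. This uses a splitmix64-style finalizer as the mixing function, a stand-in for illustration only, not Random123's Philox or Threefry, and `ctr_rand` is a hypothetical name:

```c
#include <stdint.h>

/* Counter-based RNG sketch: output depends only on (key, i), never on
   call order. The constants are the splitmix64 golden-ratio increment
   and mixing multipliers; the whole pipeline is a bijection of z. */
static uint64_t ctr_rand(uint64_t key, uint64_t i) {
    uint64_t z = key + i * 0x9E3779B97F4A7C15ULL;
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}
```

With this shape, "fast-forwarding" is just picking a larger index, and two tasks that use different keys can never perturb each other's streams.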
The Random123 algorithms look very nice. If there were some pure Julia implementations, I might very well be able to speed them up by making good use of SIMD instructions and such. It also seems that @ArchRobison has come a long way in #6271 to make Julia/LLVM do a lot of such vectorization automatically. |
I just found a Julia implementation I did some time ago and made a gist
We could also potentially use llvmcall in case we can't get LLVM to do sufficiently clever optimizations. |
Yes, that is one of the tools that I have been using :) |
@andreasnoack: I looked a bit at your code. It already generates very straightforward machine code (basically plus, rotate, xor, then repeat), especially when I parameterized on

The options for parallelization are basically to process more input points at the same time, or use a wider version of the generator (i.e. 64x8). For normal RNG purposes each might be similarly useful, but in general I think that the former will be more interesting. I didn't get it to produce any SIMD code yet for evaluating multiple points in parallel, but I might try some more. |
Another resource that looks like it could be interesting is Agner Fog's page (also known for his tables of instruction latencies). |
Anyone familiar with this work? Looks interesting http://www.pcg-random.org/pdf/toms-oneill-pcg-family-v1.02.pdf |
Should we try and include XORShift in Julia by default, or should we have a package? For now, I am thinking it is a good idea to have it in a package, which will also help clean up the API to make it easy to add other RNGs. |
+1 for starting in a package.
I have put the code from here into https://github.com/JuliaLang/XORShiftRNG.jl

Ideally one would just provide |
I was also hoping the new pcg-random RNGs would have been relicensed to MIT by now, so that a more direct port of the source code, instead of going from the paper, could avoid Apache, but no update there. |
I rather think |
Yes, I recollect that too, but it is worth revisiting. @rfourquet may have a PR for this as well. Int64 is certainly a better basis, but that won't work for dSFMT, although it will work for SFMT. Perhaps we can have either one. |
I wouldn't say that "a fully generic pluggable system [necessarily] creates ambiguity problems" (I may have expressed my concerns unclearly in this comment and above), only that it must be general enough and allow the library writer to define the specializations she wants without getting ambiguity warnings. I wish |
@tkelman I've recently stumbled upon the PCG family of random number generators again. The algorithm seems simple enough that it might be straightforward to create an original implementation from the paper. I'll put it on my TODO list. |
It's just a first pass, but this is an attempt to reproduce

```julia
# bitwise rotate right
function rotate{T <: Unsigned}(x::T, n::T)
    mask = 8*sizeof(x) - 1
    return (x >> n) | (x << ((-n) & mask))
end

# multiplier chosen from table 4 in:
# P. L'Ecuyer, ``Tables of Linear Congruential Generators of Different Sizes and
# Good Lattice Structure'', Mathematics of Computation, 68, 225 (1999), 249--260.
const multiplier = UInt64(3935559000370003845)

# L'Ecuyer only says that you need an odd increment
const increment = UInt64(204209821)

let state::UInt64 = 0x23cc2170ddd0af6e
    global pcg_rand
    function pcg_rand()
        state = state * multiplier + increment
        rotate(((state $ (state >> 18)) >> 27) % UInt32,
               (state >> 59) % UInt32)
    end
end
```
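For comparison, here is a minimal 32-bit PCG (the XSH-RR variant) sketched in C from the algorithmic description in O'Neill's paper, as an original transcription rather than a port of the Apache-licensed sources; the `pcg32_t` type and function names are made up, and the constants are the ones the paper gives for the 64/32 generator:

```c
#include <stdint.h>

typedef struct { uint64_t state; uint64_t inc; } pcg32_t;

static uint32_t pcg32_next(pcg32_t *r) {
    uint64_t old = r->state;
    /* LCG state transition with a per-stream odd increment */
    r->state = old * 6364136223846793005ULL + r->inc;
    /* XSH-RR output function: xorshift the high bits, then rotate
       by the top 5 bits of the old state */
    uint32_t xorshifted = (uint32_t)(((old >> 18) ^ old) >> 27);
    uint32_t rot = (uint32_t)(old >> 59);
    return (xorshifted >> rot) | (xorshifted << ((-rot) & 31));
}

static void pcg32_seed(pcg32_t *r, uint64_t seed, uint64_t seq) {
    r->state = 0;
    r->inc = (seq << 1) | 1;   /* the increment must be odd */
    pcg32_next(r);
    r->state += seed;
    pcg32_next(r);
}
```

Distinct `seq` values select distinct streams, which is one answer to the parallel-streams concern raised earlier in the thread.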
Forwarding this on, as promised, from art4711/random-double#1:
The conversation started quite a while ago, but seeing as there was a follow-up and benchmarks (though in Go), I figured I should pass it along. I don't agree that "just 2.5x slower" is acceptable for random sampling applications. |
I wonder if this is worth revisiting given the interesting developments in libm.jl. Having a good implementation in Julia is probably desirable compared to dSFMT, which does not leverage the modern wider vector units. I am not sure if this should be xorshift or MT, or perhaps both. Reducing the dependency on dSFMT will also give us faster RNGs on non-Intel platforms. |
I should mention the GSoC project by @sunoru. Some benchmarks are here (note, however, that these were for generating |
It has been discussed if we should consider this RNG alongside or instead of dSFMT.
The RNG we are using now is a Mersenne Twister implementation adapted to use SIMD, but we don't really get anything from SIMD right now because of #6573.
In addition, dSFMT actually consistently fails one of the tests in BigCrush in contrast to the proposed XORShift* algorithm.
Finally, the XORShift* is very simple and can be written in Julia. See some implementations with various performance tweaks here:
https://gist.github.com/andreasnoack/9f3f63d7f1d00a07359b