Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

task-local xoshiro rebased #40546

Merged
merged 2 commits into from
Jun 2, 2021
Merged

task-local xoshiro rebased #40546

merged 2 commits into from
Jun 2, 2021

Conversation

JeffBezanson
Copy link
Member

This is a cleaned-up and rebased version of #34852.

@rfourquet Please make sure I addressed your comments correctly. There is also a stack overflow in the tests due (I think) to missing overloads for the new RNG types; please advise on what should be added.

@chriselrod I haven't touched the SIMD stuff; let me know if there are specific changes I should make, or feel free to add commits here if needed.

I included a separate commit to change the default RNG to the task-local one, to see what that would look like. The main issues are

  1. We don't promise the same random streams between releases, so that part is ok, but there are also APIs for e.g. copying between the global state and MersenneTwister, which could no longer be supported. Is that ok?
  2. The copy! and == methods for TaskLocal are a bit sketchy, since TaskLocal is a singleton but the operations actually affect the task-local state. Do we care?
  3. Single generation (rand()) of all types, and bulk generation of integers seem to be faster, but bulk generation of floats seems to be slower. Can this be optimized further?

@JeffBezanson JeffBezanson added randomness Random number generation and the Random stdlib multithreading Base.Threads and related functionality needs news A NEWS entry is required for this change needs docs Documentation for this change is required labels Apr 20, 2021
@vtjnash vtjnash mentioned this pull request Apr 21, 2021
stdlib/Random/src/RNGs.jl Outdated Show resolved Hide resolved
stdlib/Random/src/RNGs.jl Outdated Show resolved Hide resolved
stdlib/Random/src/RNGs.jl Outdated Show resolved Hide resolved
stdlib/Random/src/RNGs.jl Outdated Show resolved Hide resolved
stdlib/Random/src/RNGs.jl Outdated Show resolved Hide resolved

function seed!(rng::Union{TaskLocal, Xoshiro}, seed::AbstractVector{UInt64})
if length(seed) > 4
throw(ArgumentError("seed should have no more than 256 bits"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might break code which does seed!(rand(UInt32, 100)). So we might have to run PkgEval, and I could also easily add a version which accepts arbitrary lenght vectors by cryptographically hashing it (I already did that in another unmerged PR, so will be very easy to adapt the code).

end
@noinline _rng_length_assert() = @assert false "0 < tid <= length(THREAD_RNGs)"
@inline default_rng() = TaskLocal()
@inline default_rng(tid::Int) = TaskLocal()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still used? otherwise:

Suggested change
@inline default_rng(tid::Int) = TaskLocal()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not used here, and not exported. Just wondering if it's used in packages.


module XoshiroSimd
# Getting the xoroshiro RNG to reliably vectorize is somewhat of a hassle without Simd.jl.
import ..Random: TaskLocal, rand, rand!, UnsafeView, Xoshiro, SamplerType, CloseOpen12, SamplerTrivial
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
import ..Random: TaskLocal, rand, rand!, UnsafeView, Xoshiro, SamplerType, CloseOpen12, SamplerTrivial
import ..Random: TaskLocal, rand, rand!, Xoshiro, SamplerType, CloseOpen01, SamplerTrivial

AFAICT:

  • UnsafeView is unused by Xoshiro, so methods for it are useless. (UnsafeView is used only internally for MersenneTwister, at no point a rand! method will be called on UnsafeView for another RNG type.)
  • Sorry to have mistakenly written CloseOpen12 in a comment from the original PR, here we generate in [0, 1) so should be CloseOpen01 (might be worth adding a version for [1, 2) but not urgent and I can take care of that). AFAICT this solves the efficiency issue for bulk generation (the methods below were not called because of that).

I can push a commit fixing those two issues in this file, or let you do it, as you like.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. That fixes it for rand; is there anything we can do for randn?

Copy link
Member

@rfourquet rfourquet Apr 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there anything we can do for randn?

I could do the same trick as was done for MersenneTwister, IIRC it lead to a (very roughly) 2x speedup.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I implemented it in the branch rf/xoshiro/randn, you can incorporate the commit here, or I will open a PR once this one is merged. I actually had to add back the rand! method for UnsafeView ;-p

stdlib/Random/test/runtests.jl Outdated Show resolved Hide resolved
@@ -627,7 +627,7 @@ guardseed() do
m = MersenneTwister(0)
@test Random.seed!() === g
@test Random.seed!(rand(UInt)) === g
@test Random.seed!(rand(UInt32, rand(1:10))) === g
@test Random.seed!(rand(UInt32, rand(1:8))) === g
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, why this change?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of the error for using too-long seeds. If we remove the error then this is ok.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right of course, for some reason I thought 1:10 was the sampler from which values were sampled.

@rfourquet
Copy link
Member

but there are also APIs for e.g. copying between the global state and MersenneTwister, which could no longer be supported. Is that ok?

copy etc. was supported, but if I'm not mistaken, neither Random.default_rng() nor Random.GLOBAL_RNG were documented or exported. Of course this was used in packages (I'm not sure to which extent), so I would say it's OK to not support that anymore (at least we are not tied by semver for this).

The copy! and == methods for TaskLocal are a bit sketchy, since TaskLocal is a singleton but the operations actually affect the task-local state. Do we care?

I'm not sure I understand, but if you refer to copy(::TaskLocal) returning a Xoshiro, this seems fine to me, and is useful. As Kristoffer noticed very recently (probably on slack's #gripes), there is precedent of that, with copy(ENV) returning a Dict, or copy on views.

Single generation (rand()) of all types, and bulk generation of integers seem to be faster, but bulk generation of floats seems to be slower. Can this be optimized further?

Should be trivially fixed, see inline comment.

@rfourquet
Copy link
Member

I don't think I saw a discussion on Xoshiro256++ vs. Xoshiro256**, why was Xoshiro256++ preferred? (@chethega).

using Base: BitInteger_types

# Vector-width. Influences random stream. We may want to tune this before merging.
xoshiroWidth() = Val(8)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also from https://github.com/JuliaLang/julia/pull/34852/files#r469740269 :

The capitalization in this file seems a bit off (camelCase). Is that intentional?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤷

s3::UInt64
end

Xoshiro(::Nothing) = Xoshiro()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this used somewhere without initializing seeds?
Junk is likely to be 0-heavy, and Xoshiro doesn't like 0-heavy inits.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will end up using RandomDevice, but I also wonder why this exists. It's only here to copy other RNGs that also have such methods. Similarly, why do we have a seed!(rng, nothing) method?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Xoshiro() randomizes all the fields with RandomDevice(); having nothing equivalent to no explicitly passed seed can be useful to forward seeds, like in e.g. myalgo(...; seed=nothing). I don't think this is documented though...

ret <$N x i64> %res
"""
@eval @inline _rotl45(x::NTuple{$N, VecElement{UInt64}}) = Core.Intrinsics.llvmcall($code,
NTuple{$N, VecElement{UInt64}},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://llvm.org/docs/LangRef.html#llvm-fshl-intrinsic
LLVM now has the "funnel shift left/right" intrinsics for rotates.

For example

using Base: llvmcall, VecElement
for N in [1,2,4,8,16,32]
    s = ntuple(_ -> Core.VecElement(45%UInt64), Val(N))
    op = "llvm.fshl.v$(N)i64"
    @eval @inline function _rotl45(x::NTuple{$N,VecElement{UInt64}})
        ccall($op, llvmcall, NTuple{$N,VecElement{UInt64}}, (NTuple{$N,VecElement{UInt64}},NTuple{$N,VecElement{UInt64}},NTuple{$N,VecElement{UInt64}}), x, x, $s)
    end
end

A little shorter, and as LLVM's documentation specifically says that rotate can be implemented by passing the same vector as the first two arguments, we can be confident that this'll reliably produce rotate instructions. E.g.:

julia> vxu = ntuple(_ -> VecElement(rand(UInt64)), Val(8));

julia> @code_native debuginfo=:none syntax=:intel _rotl45(vxu)
        .text
        vprolq  zmm0, zmm0, 45
        ret
        nop     dword ptr [rax + rax]

I didn't try the current definition, but I believe (last time I tried) that it was also optimized into a rotate.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using too long a vector naturally does exactly what we'd want it to:

julia> vxu = ntuple(_ -> VecElement(rand(UInt64)), Val(16));

julia> @cn _rotl45(vxu)
        .text
        vprolq  zmm0, zmm0, 45
        vprolq  zmm1, zmm1, 45
        ret
        nop

So xoshiroWidth() = 8 should be fine. 512-bit SIMD architectures won't get as much out of order execution as others, but that's not the end of the world. That's what SMT is for...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just confirmed that the current code in the PR produces the same assembly on my machine.

ret <$N x i64> %res
"""
@eval @inline _lshr(x::NTuple{$N, VecElement{UInt64}}, y::Int64) = Core.Intrinsics.llvmcall($code,
NTuple{$N, VecElement{UInt64}},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could just use ccall for all of these and interpolate the VecElements, like in my rotate example.

As an aside, I'd love it if Julia's codegen lowered NTuple{N,VecElement{T}}s with the same operations as Ts. I assume this was already discussed somewhere (and rejected?).

s0, s1, s2, s3 = task.rngState0, task.rngState1, task.rngState2, task.rngState3
else
rng::Xoshiro
s0, s1, s2, s3 = rng.s0, rng.s1, rng.s2, rng.s3
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use the fancy new syntax:

(; s0, s1, s2, s3) = rng::Xoshiro


function rand!(rng::Union{TaskLocal, Xoshiro}, dst::Union{Array{Float64}, UnsafeView{Float64}}, ::SamplerTrivial{CloseOpen12{Float64}})
GC.@preserve dst xoshiro_bulk(rng, convert(Ptr{UInt8}, pointer(dst)), length(dst)*8, Float64, xoshiroWidth())
@inbounds @simd for i = 1:length(dst)
Copy link
Contributor

@chriselrod chriselrod Apr 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a single pass method?
We could make xoshiro_bulk_simd except a function argument to apply to toWrite before storing it, which can do the masking to Float(64/32) and subtract 1.0.

For big arrays, passing over the data just once will likely be a big performance win.

test/threads_exec.jl Outdated Show resolved Hide resolved
@JeffBezanson
Copy link
Member Author

Plenty of random-stream-dependent tests to fix here ... 😅

@fredrikekre
Copy link
Member

For doctests there is make -C doc html doctest=fix to automatically fix them.

@JeffBezanson JeffBezanson added the triage This should be discussed on a triage call label Apr 22, 2021
@JeffBezanson
Copy link
Member Author

Ok there is something wrong with the simd code causing memory corruption; that's the cause of the bitarray failures.

@JeffBezanson
Copy link
Member Author

Made some fixes to the simd code. Would be good if somebody can review.

@JeffBezanson
Copy link
Member Author

Another unanticipated change: MersenneTwister saves its original seed. The test system uses that, and I can fix it by saving the seed of the global default generator, but I don't know if there are other uses we should worry about.

@rfourquet
Copy link
Member

MersenneTwister saves its original seed. The test system uses that, and I can fix it by saving the seed of the global default generator, but I don't know if there are other uses we should worry about.

I noticed that too, and I think that's OK without needing to save the seed of the global RNG. It's kinda breaking as this was documented, but I think that most of the benefit for the test system is kept by restoring at the end of a @testset the state of the global RNG as it was before the @testset started (this behavior is maintained in this PR). I don't think this was relied upon in packages, it's more of an interactive feature useful for debugging failing tests, and it's not too difficult to work around that (e.g. by avoiding calls to rand() outside of a @testset).

@JeffBezanson
Copy link
Member Author

most of the benefit for the test system is kept by restoring at the end of a @testset the state of the global RNG as it was before the @testset started (this behavior is maintained in this PR)

I agree, but then there is this evil test:

    a = rand()
    @testset begin
        # global RNG must re-seeded at the beginning of @testset
        @test a == rand()

I can just keep it working for now.

@rfourquet
Copy link
Member

I agree, but then there is this evil test:

Sure, this was testing the feature in question, so if we drop the feature we can drop the test...

I can just keep it working for now.

You mean by saving the seed for TaskLocalRNG ? I though this would be out of question, as this would double the size taken for storing the RNG in each task... or did I misunderstand something?

@JeffBezanson
Copy link
Member Author

Right. It's only tested for the case of seeding the global default RNG (seed!(seed)), which only requires saving a single global seed.

@rfourquet
Copy link
Member

Yes this is great if this is made to work. To be explicit, the feature was meant to be useful in this context: one of your @testset fails, and you can reproduce the failure by doing Random.seed!(GLOBAL_RNG.seed) followed by including only the failed testset. An example of this in for Base tests, where the seed is shown in CI and you can re-run a given failed testset by copy pasting this seed. I didn't think enough about the task-local RNG setting to understand to which extent this can be replicated by storing a single seed globally.

@JeffBezanson
Copy link
Member Author

We could even make GLOBAL_RNG.seed work via getproperty overloading if we want.

@JeffBezanson JeffBezanson force-pushed the jb/chethega/tlrng branch 2 times, most recently from f1a6382 to c2bcc8f Compare April 23, 2021 22:24
@JeffBezanson
Copy link
Member Author

Linear algebra tests (ノಠ益ಠ)ノ彡┻━┻

@JeffBezanson JeffBezanson removed the needs news A NEWS entry is required for this change label May 25, 2021
@JeffBezanson
Copy link
Member Author

Ok so we have 153 package test failures due to this; not TOO terrible really. I think this is ready to merge. Any objections?

@oscardssmith
Copy link
Member

I vote merge and see what happens.

@vtjnash
Copy link
Member

vtjnash commented May 25, 2021

That sounds sad, but I’m assuming it is bad tests (and not numeric issues, say)?

@JeffBezanson JeffBezanson added this to the 1.7 milestone May 27, 2021
@JeffBezanson JeffBezanson removed the needs pkgeval Tests for all registered packages should be run with this change label May 28, 2021
Co-Authored-By: Jeff Bezanson <[email protected]>

Co-Authored-by: Rafael Fourquet <[email protected]>

- try to use high bits instead of low when we take a subset of them
- use shift and multiply instead of mask and subtract for generating floats
@oscardssmith
Copy link
Member

Is this ready to merge? Feature freeze is today.

@JeffBezanson
Copy link
Member Author

Yes, but I adjusted the simd thresholds (they were way too high) and so now there are new test failures to fix...

@chriselrod
Copy link
Contributor

chriselrod commented Aug 28, 2021

Note that the approach of generating Float64 numbers here is only SIMD with AVX512, so SIMD performance is bad without it.

I don't recall where this was discussed, but seems worth it's own issue.

Two possibilities:

  1. use feature detection to (given absence of AVX512) decompose the vector of 64 bit integers into two vectors of 32 bit integers, convert, and recombine. This has full accuracy and is much faster than what LLVM does (decompose the vector of 64 bit integers into scalars and convert these one at a time, and then recombine).
  2. change algorithms to cast to [1,2) instead. This has the added bonus of once again changing streams and encouraging unit tests to use StableRNG. ;)

Alternatively, someone who knows LLVM could make a PR to do "1)" at the LLVM level.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
multithreading Base.Threads and related functionality needs docs Documentation for this change is required randomness Random number generation and the Random stdlib
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants