-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
task-local xoshiro rebased #40546
task-local xoshiro rebased #40546
Conversation
db62af9
to
6d15eb7
Compare
stdlib/Random/src/RNGs.jl
Outdated
|
||
function seed!(rng::Union{TaskLocal, Xoshiro}, seed::AbstractVector{UInt64}) | ||
if length(seed) > 4 | ||
throw(ArgumentError("seed should have no more than 256 bits")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might break code which does seed!(rand(UInt32, 100))
. So we might have to run PkgEval, and I could also easily add a version which accepts arbitrary lenght vectors by cryptographically hashing it (I already did that in another unmerged PR, so will be very easy to adapt the code).
stdlib/Random/src/RNGs.jl
Outdated
end | ||
@noinline _rng_length_assert() = @assert false "0 < tid <= length(THREAD_RNGs)" | ||
@inline default_rng() = TaskLocal() | ||
@inline default_rng(tid::Int) = TaskLocal() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this still used? otherwise:
@inline default_rng(tid::Int) = TaskLocal() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not used here, and not exported. Just wondering if it's used in packages.
stdlib/Random/src/XoshiroSimd.jl
Outdated
|
||
module XoshiroSimd | ||
# Getting the xoroshiro RNG to reliably vectorize is somewhat of a hassle without Simd.jl. | ||
import ..Random: TaskLocal, rand, rand!, UnsafeView, Xoshiro, SamplerType, CloseOpen12, SamplerTrivial |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
import ..Random: TaskLocal, rand, rand!, UnsafeView, Xoshiro, SamplerType, CloseOpen12, SamplerTrivial | |
import ..Random: TaskLocal, rand, rand!, Xoshiro, SamplerType, CloseOpen01, SamplerTrivial |
AFAICT:
UnsafeView
is unused byXoshiro
, so methods for it are useless. (UnsafeView
is used only internally forMersenneTwister
, at no point arand!
method will be called onUnsafeView
for another RNG type.)- Sorry to have mistakenly written
CloseOpen12
in a comment from the original PR, here we generate in[0, 1)
so should beCloseOpen01
(might be worth adding a version for[1, 2)
but not urgent and I can take care of that). AFAICT this solves the efficiency issue for bulk generation (the methods below were not called because of that).
I can push a commit fixing those two issues in this file, or let you do it, as you like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. That fixes it for rand
; is there anything we can do for randn
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there anything we can do for randn?
I could do the same trick as was done for MersenneTwister
, IIRC it lead to a (very roughly) 2x speedup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I implemented it in the branch rf/xoshiro/randn, you can incorporate the commit here, or I will open a PR once this one is merged. I actually had to add back the rand!
method for UnsafeView
;-p
@@ -627,7 +627,7 @@ guardseed() do | |||
m = MersenneTwister(0) | |||
@test Random.seed!() === g | |||
@test Random.seed!(rand(UInt)) === g | |||
@test Random.seed!(rand(UInt32, rand(1:10))) === g | |||
@test Random.seed!(rand(UInt32, rand(1:8))) === g |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity, why this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because of the error for using too-long seeds. If we remove the error then this is ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah right of course, for some reason I thought 1:10
was the sampler from which values were sampled.
I'm not sure I understand, but if you refer to
Should be trivially fixed, see inline comment. |
I don't think I saw a discussion on Xoshiro256++ vs. Xoshiro256**, why was Xoshiro256++ preferred? (@chethega). |
using Base: BitInteger_types | ||
|
||
# Vector-width. Influences random stream. We may want to tune this before merging. | ||
xoshiroWidth() = Val(8) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also from https://github.com/JuliaLang/julia/pull/34852/files#r469740269 :
The capitalization in this file seems a bit off (camelCase). Is that intentional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤷
stdlib/Random/src/RNGs.jl
Outdated
s3::UInt64 | ||
end | ||
|
||
Xoshiro(::Nothing) = Xoshiro() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this used somewhere without initializing seeds?
Junk is likely to be 0-heavy, and Xoshiro doesn't like 0-heavy inits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will end up using RandomDevice, but I also wonder why this exists. It's only here to copy other RNGs that also have such methods. Similarly, why do we have a seed!(rng, nothing)
method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Xoshiro()
randomizes all the fields with RandomDevice()
; having nothing
equivalent to no explicitly passed seed can be useful to forward seeds, like in e.g. myalgo(...; seed=nothing)
. I don't think this is documented though...
stdlib/Random/src/XoshiroSimd.jl
Outdated
ret <$N x i64> %res | ||
""" | ||
@eval @inline _rotl45(x::NTuple{$N, VecElement{UInt64}}) = Core.Intrinsics.llvmcall($code, | ||
NTuple{$N, VecElement{UInt64}}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://llvm.org/docs/LangRef.html#llvm-fshl-intrinsic
LLVM now has the "funnel shift left/right" intrinsics for rotates.
For example
using Base: llvmcall, VecElement
for N in [1,2,4,8,16,32]
s = ntuple(_ -> Core.VecElement(45%UInt64), Val(N))
op = "llvm.fshl.v$(N)i64"
@eval @inline function _rotl45(x::NTuple{$N,VecElement{UInt64}})
ccall($op, llvmcall, NTuple{$N,VecElement{UInt64}}, (NTuple{$N,VecElement{UInt64}},NTuple{$N,VecElement{UInt64}},NTuple{$N,VecElement{UInt64}}), x, x, $s)
end
end
A little shorter, and as LLVM's documentation specifically says that rotate can be implemented by passing the same vector as the first two arguments, we can be confident that this'll reliably produce rotate instructions. E.g.:
julia> vxu = ntuple(_ -> VecElement(rand(UInt64)), Val(8));
julia> @code_native debuginfo=:none syntax=:intel _rotl45(vxu)
.text
vprolq zmm0, zmm0, 45
ret
nop dword ptr [rax + rax]
I didn't try the current definition, but I believe (last time I tried) that it was also optimized into a rotate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using too long a vector naturally does exactly what we'd want it to:
julia> vxu = ntuple(_ -> VecElement(rand(UInt64)), Val(16));
julia> @cn _rotl45(vxu)
.text
vprolq zmm0, zmm0, 45
vprolq zmm1, zmm1, 45
ret
nop
So xoshiroWidth() = 8
should be fine. 512-bit SIMD architectures won't get as much out of order execution as others, but that's not the end of the world. That's what SMT is for...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just confirmed that the current code in the PR produces the same assembly on my machine.
stdlib/Random/src/XoshiroSimd.jl
Outdated
ret <$N x i64> %res | ||
""" | ||
@eval @inline _lshr(x::NTuple{$N, VecElement{UInt64}}, y::Int64) = Core.Intrinsics.llvmcall($code, | ||
NTuple{$N, VecElement{UInt64}}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could just use ccall
for all of these and interpolate the VecElement
s, like in my rotate example.
As an aside, I'd love it if Julia's codegen lowered NTuple{N,VecElement{T}}
s with the same operations as T
s. I assume this was already discussed somewhere (and rejected?).
stdlib/Random/src/XoshiroSimd.jl
Outdated
s0, s1, s2, s3 = task.rngState0, task.rngState1, task.rngState2, task.rngState3 | ||
else | ||
rng::Xoshiro | ||
s0, s1, s2, s3 = rng.s0, rng.s1, rng.s2, rng.s3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could use the fancy new syntax:
(; s0, s1, s2, s3) = rng::Xoshiro
stdlib/Random/src/XoshiroSimd.jl
Outdated
|
||
function rand!(rng::Union{TaskLocal, Xoshiro}, dst::Union{Array{Float64}, UnsafeView{Float64}}, ::SamplerTrivial{CloseOpen12{Float64}}) | ||
GC.@preserve dst xoshiro_bulk(rng, convert(Ptr{UInt8}, pointer(dst)), length(dst)*8, Float64, xoshiroWidth()) | ||
@inbounds @simd for i = 1:length(dst) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a single pass method?
We could make xoshiro_bulk_simd
except a function argument to apply to toWrite
before storing it, which can do the masking to Float(64/32)
and subtract 1.0.
For big arrays, passing over the data just once will likely be a big performance win.
6d15eb7
to
00cf679
Compare
00cf679
to
74a4d01
Compare
Plenty of random-stream-dependent tests to fix here ... 😅 |
For doctests there is |
Ok there is something wrong with the simd code causing memory corruption; that's the cause of the bitarray failures. |
Made some fixes to the simd code. Would be good if somebody can review. |
Another unanticipated change: |
I noticed that too, and I think that's OK without needing to save the seed of the global RNG. It's kinda breaking as this was documented, but I think that most of the benefit for the test system is kept by restoring at the end of a |
I agree, but then there is this evil test:
I can just keep it working for now. |
Sure, this was testing the feature in question, so if we drop the feature we can drop the test...
You mean by saving the seed for |
Right. It's only tested for the case of seeding the global default RNG ( |
Yes this is great if this is made to work. To be explicit, the feature was meant to be useful in this context: one of your |
We could even make |
f1a6382
to
c2bcc8f
Compare
Linear algebra tests (ノಠ益ಠ)ノ彡┻━┻ |
501493b
to
d638c88
Compare
Ok so we have 153 package test failures due to this; not TOO terrible really. I think this is ready to merge. Any objections? |
I vote merge and see what happens. |
That sounds sad, but I’m assuming it is bad tests (and not numeric issues, say)? |
Co-Authored-By: Jeff Bezanson <[email protected]> Co-Authored-by: Rafael Fourquet <[email protected]> - try to use high bits instead of low when we take a subset of them - use shift and multiply instead of mask and subtract for generating floats
d638c88
to
0e3ea6e
Compare
Is this ready to merge? Feature freeze is today. |
Yes, but I adjusted the simd thresholds (they were way too high) and so now there are new test failures to fix... |
0e3ea6e
to
cfd940e
Compare
Note that the approach of generating I don't recall where this was discussed, but seems worth it's own issue. Two possibilities:
Alternatively, someone who knows LLVM could make a PR to do "1)" at the LLVM level. |
This is a cleaned-up and rebased version of #34852.
@rfourquet Please make sure I addressed your comments correctly. There is also a stack overflow in the tests due (I think) to missing overloads for the new RNG types; please advise on what should be added.
@chriselrod I haven't touched the SIMD stuff; let me know if there are specific changes I should make, or feel free to add commits here if needed.
I included a separate commit to change the default RNG to the task-local one, to see what that would look like. The main issues are
copy!
and==
methods forTaskLocal
are a bit sketchy, sinceTaskLocal
is a singleton but the operations actually affect the task-local state. Do we care?rand()
) of all types, and bulk generation of integers seem to be faster, but bulk generation of floats seems to be slower. Can this be optimized further?