WIP: Tasklocal xoshiro #34852
Conversation
Thanks for doing this! It'd be awesome to have reproducible multi-threaded programs out of the box.
I'm no RNG expert, but isn't a "counter-based PRNG" a more rigorous approach? Repeating what I already wrote in #34386 (comment), quoting what I mentioned in the discourse thread:
There is a Julia implementation of counter-based PRNG: https://github.com/sunoru/Random123.jl
No, not really. Mathematically, the construction I use here works with any RNG that accepts arbitrary seeds, plus a random function. From that you get a splittable RNG: use the old state to generate K pseudo-random bits, and use them to seed the child RNG; put a random function in the middle because almost no RNGs make any guarantees about correlations when they get seeded with the output of the same RNG. We could afford something like sha2-256 that is indistinguishable from a random oracle, and the resulting construction (i.e. a state-splittable RNG) would be just as good as the original non-splittable RNG (i.e. xoshiro), by a reduction proof. The completely unfounded, non-rigorous, no-good temporary thing I did here was to conjecture that murmur3-internals and xoshiro-internals don't correlate. That is obviously (a) kinda incorrect, because both use similar constructions, (b) a leap of faith, and (c) something @vigna might comment on more. I have seen no suitably fast (< 4 cpb) crypto function lying around in the julia repo (it needs to be written in C, since it is used during task creation). Counter-based is super cool because it vectorizes trivially / is embarrassingly parallel. However, I had no fast variant lying around that doesn't use AES hardware acceleration. Related: #34386 #34216
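A minimal sketch (my own wording, not code from this PR) of the split-by-output construction described above, assuming the SHA.jl package and the four-word Xoshiro constructor from the later stdlib for concreteness:

using Random, SHA

# Split a child RNG off `parent` using only its output stream:
# 1. harvest 256 output bits (this advances the parent),
# 2. pass them through a function treated as a random oracle (here SHA-256),
# 3. use the digest to seed the child.
function split_rng(parent::AbstractRNG)
    raw = rand(parent, UInt8, 32)
    digest = sha2_256(raw)
    s0, s1, s2, s3 = reinterpret(UInt64, digest)
    return Xoshiro(s0, s1, s2, s3)
end

If both the parent stream and the child stream pass as "random", a distinguisher against the pair yields a distinguisher against the single stream, which is the reduction argument sketched above.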
Thanks for the explanation. I guess I can see that you can split off an independent child RNG by putting a CSPRNG in the middle. I just didn't have a good intuition for whether you can make it lightweight enough to do this for every task.
Really impressive work!
That's amazing!! I was secretly hoping you would work on that since the last discussion, and you did :D
I guess one immediate (and possibly naive) question I have is whether the task overhead could be reduced to just a pointer to the RNG state which could be lazily initialized.
Lazy initialization is fundamentally impossible, and pointer indirection is too slow. A pointer is 8 bytes; I currently eat 32 bytes. If 32 are too much, then it is probably possible to cook up a counter-based RNG: all tasks share some global key object, and each task just has a counter. Now 4 bytes are not enough (it would overflow), but there are already 7 juicy bytes of padding after
The overhead from this patch is definitely <= 32 bytes and < 60 ns. Imo the correct thing to compare to is the time of task creation + scheduling, because tasks that never get scheduled are not very interesting. I'm not sure how to measure the additional task creation overhead, with whatever gcc emits. The equivalent julia code is
But if we're lucky the overhead is smaller, because the superscalar CPU can fit the arithmetic in somewhere while waiting on memory / other long dependency chains. This part will probably get slower as this PR progresses: I don't know whether murmur3 is good enough for the job; waiting on @vigna to comment or on myself to write something cryptographic.
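For reference, a rough sketch (my own, not the snippet elided above) of the kind of spawn-overhead measurement being discussed, assuming BenchmarkTools:

using BenchmarkTools

# Time to create + schedule + wait on a trivial task; the 32-byte RNG fork
# added by this patch is a small slice of this number.
@btime wait(Threads.@spawn nothing)

# Task creation alone, without scheduling:
@btime Task(() -> nothing)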
Would be great if @JeffBezanson or @vtjnash could chime in with some advice on how to measure the task overhead this introduces.
Maybe I'm too naive, but can't you just take a snapshot of the parent RNG state and store it in the child task with a flag somewhere indicating that it's not initialized yet? Or is it fundamental that the parent task's RNG state is advanced when the RNG is split? Or is the initialization (including calling a "practically random function") very cheap compared to the entire task creation?
Both of those. So, I locally reverted to master, to compare:
The random state initialization with murmur3 adds a whopping 7 nanoseconds to task creation time, and a whopping 16 bytes to the counted allocated memory. With immediate scheduling that presumably has some memory barriers I get the more realistic 40 nanoseconds (probably something with superscalar execution). PS, for comparison, on my machine:
A single dynamic dispatch with tiny allocation appears to cost ~30 ns on my machine.
This is really cool, impressive work! These are my quick initial thoughts:
We don't really need unsafe shifts. It's just that we're in stdlib, so no ... The main reason that bulk generation is in julia instead of C (dsfmt style) is that either I suck at C, or statically compiled-for-distribution C for tight kernel loops cannot compete, performance- and convenience-wise, with handwritten llvm-IR or julia that is "late AOT / early JIT" compiled and knows CPU capabilities beyond "generic". (Rant: please clang, give me an analogue to ...) The single-number case must be in julia, for performance reasons.
Yay, #32812. I should have searched better, it would have saved me some time ;) I'll rebase once that merges. I'm taking the general feedback as "go ahead and clean up the mess"? Cleaning up the mess means fixing code formatting, typos, bugs, writing tests, etc.? Or would that be too early?
If you'd like to clean it up and add tests that's fine, but I think I need to do some more background reading here and maybe find some others to help review this. This will take a bunch of thought, but it does seem very interesting so far.
Then let's wait with the cleanup. In the meantime, I opened a discourse thread to solicit some user input, at https://discourse.julialang.org/t/reproducible-multithreaded-monte-carlo-task-local-random/35269. I'll continue to work on this as my schedule permits, but feel free to push on this branch or fork from it if I become too unresponsive.
It's wonderful to see someone attacking this problem, and very impressive progress so far!
This is my main concern, though I'd point out that logging also gets special attention because it copies the current logger. In that case it's simply copying a pointer rather than running code, but nevertheless we carry along some special state during task spawning which other libraries couldn't emulate. I feel like certain things like loggers and RNG state are naturally dynamically scoped for composability. But in general that involves passing user-defined state from parent to child task, and in some cases running code to update that state during spawning. It's really irksome that carrying the state and running such code might be for nothing if callees don't use the facilities it provides. It also seems unsustainable to be hard-coding this kind of stuff into the runtime if there are other dynamically scoped facilities people would like to behave in a similar way. That doesn't mean I'm against this change. In fact I'm cautiously positive if we can pay such a small cost for a decent benefit which is very relevant to the community. I do wish there was a generalization in sight though which would allow a normal user package to act in a similar way. What would you need for such a package? I would guess:
Currently I guess the task local state access in this patch is one load relative to one of the segment registers to get the task pointer (LICM'd as mentioned?) + another to get the RNG state? In the past I've wondered whether it would be possible to use types as unique keys into a
I had a quick browse of the code changes. It all looks fairly reasonable at surface level — here are some superficial comments of a few minor things I happened to notice.
stdlib/Random/src/RNGs.jl
# needed because julia semantics and x86 semantics differ on shl
This triggered a memory; I wondered whether the following would help with this:
https://discourse.julialang.org/t/how-to-shift-bits-faster/19405/2
Ok, I did a big rebase / refactor / clean-up of the code. The batch of llvmcall is now restricted to the submodule responsible for bulk generation. Reliable good compilation results still elude me if I try to go without llvmcall. We now also get a free-standing xoshiro -- i.e. one that uses conventional state management instead of being baked into the runtime. I am somewhat hyped to get this feature into 1.6, if possible. The way I see it, we should proceed along:
Extra long benchmark of vector widths on broadwell (avx2)
Results on a system with AVX512:
============== Vector width: 1 ==================
T = UInt64
(bulksize, throughput_default, throughput_tasklocal) = (512, 4.2757499063473645, 8.092668090338446)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 6.740279144346051, 8.738858784163927)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 6.637414186135379, 8.529749168751069)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 8.674097191261703, 9.02351710084265)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 9.523505049771126, 9.136802481614444)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 9.424131577764797, 9.154351166364018)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 5.377260254346873, 9.141597065131236)
T = Float32
(bulksize, throughput_default, throughput_tasklocal) = (512, 5.271612882807803, 5.082979147645584)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 6.0474539024139, 5.3814252558688125)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 6.471307362466929, 5.121870606516539)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 7.702143663031216, 5.394442249440274)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 7.872867684915052, 5.3051268213707505)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 6.913763887515248, 4.829009721793674)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 4.3880977552123195, 4.268776609837788)
T = Float64
(bulksize, throughput_default, throughput_tasklocal) = (512, 8.734398178493809, 4.999401333581058)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 11.029538861457446, 5.602426104046543)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 11.535898892559318, 6.63076576914423)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 13.711104842501177, 7.199859377746529)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 16.791664294254968, 6.998077920493029)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 17.54792067609405, 6.219385757838172)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 13.58361271616484, 5.155057751489517)
T = Bool
(bulksize, throughput_default, throughput_tasklocal) = (512, 6.032262780837273, 9.182015345641439)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 7.631386861313869, 7.9850927293977545)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 7.667888092726898, 7.848128221102113)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 8.895816033216224, 23.462726518375764)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 9.164912771387618, 24.124272988294194)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 9.167717285817954, 24.322701862633668)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 6.452277194366627, 15.73074306516174)
============== Vector width: 2 ==================
T = UInt64
(bulksize, throughput_default, throughput_tasklocal) = (512, 4.346622175928913, 8.025181473630115)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 6.739678988579901, 8.770834560339242)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 6.671546617364073, 12.905818496558085)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 8.58100558659218, 14.272864024982388)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 9.528351264902588, 14.550621669626997)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 9.43252433297951, 14.568006890994473)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 5.372379314652598, 14.057143748877058)
T = Float32
(bulksize, throughput_default, throughput_tasklocal) = (512, 5.250021340162186, 5.084366438176417)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 6.066490443269622, 5.385796788728819)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 6.4777499252581485, 11.312610006581266)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 7.724658180103725, 13.124479314480324)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 7.855073456388482, 13.513815699033131)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 6.765923125068558, 10.711669101347418)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 4.374775663462648, 7.352087499319734)
T = Float64
(bulksize, throughput_default, throughput_tasklocal) = (512, 8.655567331031419, 5.081539300851904)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 11.004049807592128, 5.604249667994687)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 11.57331426608624, 11.329777123762225)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 13.832355438166209, 13.12327789016944)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 16.05312721136574, 13.500824024903864)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 17.477431828788585, 11.029979172364463)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 13.516939488314454, 7.335982738121835)
T = Bool
(bulksize, throughput_default, throughput_tasklocal) = (512, 6.004337133372909, 9.155484480628045)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 7.656194236901787, 7.978994617915586)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 7.667665682673459, 7.850236392017702)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 8.87948760059123, 51.55054700731216)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 9.162350145049107, 56.113305013387325)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 9.136404429767621, 41.78092999163247)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 6.468780640716049, 16.174476167430235)
============== Vector width: 4 ==================
T = UInt64
(bulksize, throughput_default, throughput_tasklocal) = (512, 4.3484053810338805, 7.921876981610653)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 6.751810606650572, 8.783163440839076)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 6.719912841597386, 19.529354087510725)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 8.652398688087166, 23.003851091142494)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 9.526273711752308, 23.983019834589772)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 9.4493547689424, 23.87957459406527)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 5.389122469218704, 15.81244534264236)
T = Float32
(bulksize, throughput_default, throughput_tasklocal) = (512, 5.276654096674666, 5.096778945817681)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 6.04977664326739, 5.2161548631726)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 6.4870976822745625, 17.295658115869767)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 7.713747645951036, 23.219460265076915)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 7.853190906600932, 24.96419320432729)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 6.934430240786176, 17.878838513870654)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 4.3963456911488015, 7.57705647430128)
T = Float64
(bulksize, throughput_default, throughput_tasklocal) = (512, 8.095575618595552, 4.939951093240023)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 10.994517731711495, 5.608338762214984)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 11.584927373517651, 17.1107618225594)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 13.7782872606318, 23.264039251239936)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 16.06012089527855, 24.994660564454616)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 17.587360158333475, 17.18441796817385)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 13.46564994707003, 7.557070899103401)
T = Bool
(bulksize, throughput_default, throughput_tasklocal) = (512, 6.0239691442893815, 9.146010710719796)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 7.652161794934328, 7.9597426310755575)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 7.6335474465855775, 7.896615607067011)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 8.88134944676509, 83.01245446437207)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 9.165553651970212, 87.47577872695072)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 9.1508360386777, 52.21471964943731)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 6.419267118873478, 16.120388745515385)
============== Vector width: 8 ==================
T = UInt64
(bulksize, throughput_default, throughput_tasklocal) = (512, 4.302717729251253, 7.727496022121638)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 6.74069946743032, 8.77066237386627)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 6.630145712238236, 22.1360595048059)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 8.601528320597328, 40.367821608233996)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 9.531469294258809, 50.071838213099916)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 9.489032071237242, 46.05683664953661)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 5.388097109830302, 15.113085786961788)
T = Float32
(bulksize, throughput_default, throughput_tasklocal) = (512, 5.276201902476648, 5.1053746666526125)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 6.029760621091222, 5.214919007893457)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 6.514445953256275, 16.589052672781136)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 7.6855239703536915, 29.33284283069075)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 7.822658754518792, 35.480657545033964)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 6.834765151416392, 21.444149044950716)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 4.397779860130736, 7.263325298281252)
T = Float64
(bulksize, throughput_default, throughput_tasklocal) = (512, 8.791416368623912, 4.99074703760948)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 11.031460448286259, 5.302311496921369)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 11.489921285927252, 17.37287127240277)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 13.770523824728135, 29.155937410242643)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 16.844413982179578, 35.301754995655955)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 17.464041837380503, 23.715927081919755)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 13.598782026182912, 7.441702103877209)
T = Bool
(bulksize, throughput_default, throughput_tasklocal) = (512, 5.979805787719213, 9.159812370459257)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 7.233953172298758, 7.972393372667697)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 7.582404583251187, 7.866126267270644)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 8.89042101567238, 84.5365215114757)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 9.161709712368506, 109.16038064741049)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 9.093933480768397, 48.599184278828325)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 6.480986839902164, 15.265156457422892)
============== Vector width: 16 ==================
T = UInt64
(bulksize, throughput_default, throughput_tasklocal) = (512, 4.343714766279135, 7.740410058027079)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 6.734773453949933, 8.781868574685982)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 6.718601538098113, 19.064076659931587)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 8.601237842617152, 39.11714770797963)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 9.537017499181433, 53.09068432340298)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 9.383734249713632, 41.860992454788615)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 5.271051712398194, 14.998236710755695)
T = Float32
(bulksize, throughput_default, throughput_tasklocal) = (512, 5.2225489051404805, 4.990541348115483)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 5.98783982109162, 5.156197472452219)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 6.507907797507445, 15.37113984756359)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 7.632535171899749, 28.23175136069994)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 7.827196724108514, 35.5389964592172)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 6.800280162909544, 20.781164532878833)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 4.438002776328215, 7.432791437004642)
T = Float64
(bulksize, throughput_default, throughput_tasklocal) = (512, 8.737412430548366, 5.0068949833052505)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 11.01719268691473, 5.604032498412657)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 11.542699038933092, 15.350222275314216)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 13.817366313309776, 28.092347309511368)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 15.88794310957871, 35.71660409349643)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 17.563204529085642, 22.384902760284355)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 13.453872394145193, 7.363418360529808)
T = Bool
(bulksize, throughput_default, throughput_tasklocal) = (512, 6.01423677753213, 9.200124175964648)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 7.596463573435679, 7.989319092122829)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 7.620724404341331, 7.841606371641708)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 8.878518376844857, 70.24313843605286)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 9.164912771387618, 88.07452832650505)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 9.155630064263761, 44.3278799408159)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 6.372652454236797, 15.299578849792344)
============== Vector width: 32 ==================
T = UInt64
(bulksize, throughput_default, throughput_tasklocal) = (512, 4.360980582974188, 8.161761343220201)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 6.727334516482272, 8.783077104549207)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 6.727231782113951, 10.925464969934277)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 8.64807200800089, 28.35036459064963)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 9.526619907693426, 46.913112164297004)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 9.4096701245558, 42.7658550511848)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 5.418603840911265, 15.285752633340612)
T = Float32
(bulksize, throughput_default, throughput_tasklocal) = (512, 5.2559562515323694, 5.003136866435529)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 6.050730434568569, 5.210435822680721)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 6.452562182210946, 9.437308920013812)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 7.751703255109766, 21.83713557370049)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 7.829868578255676, 32.673247582012166)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 6.851378017066764, 21.441956526184487)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 4.3849451783547515, 7.32115340500625)
T = Float64
(bulksize, throughput_default, throughput_tasklocal) = (512, 8.938736641625225, 5.082392444910808)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 11.001286675525744, 5.607535076405987)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 11.579621476356568, 9.457378475819487)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 13.803059255424294, 21.877206466657114)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 16.040902909980964, 32.443564356435644)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 17.53794176186255, 21.865832551350227)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 13.58626854642585, 7.348995046790955)
T = Bool
(bulksize, throughput_default, throughput_tasklocal) = (512, 5.919039689480432, 9.089249309928016)
(bulksize, throughput_default, throughput_tasklocal) = (2040, 7.639563599262617, 7.9882198952879575)
(bulksize, throughput_default, throughput_tasklocal) = (2056, 7.6475235210797265, 7.858708727276817)
(bulksize, throughput_default, throughput_tasklocal) = (8192, 8.884465709728868, 41.48898455305141)
(bulksize, throughput_default, throughput_tasklocal) = (32768, 9.164271980423003, 59.74024771385577)
(bulksize, throughput_default, throughput_tasklocal) = (1048576, 9.041396852769994, 43.853289281084024)
(bulksize, throughput_default, throughput_tasklocal) = (268435456, 6.453375269282844, 15.045824345939197)

A vector width of 8 performs best. FWIW, this is what I get with VectorizedRNG.jl:

julia> for N in [2VectorizedRNG.W64]
           println("============== Vector width: $(N) ==================")
           @eval Random.XoshiroSimd xoshiroWidth() = Val($N)
           for T in [UInt64, Float32, Float64]
               println("\n")
               @show T
               for bulksize in [1<<9, (1<<11) - 8, (1<<11) + 8, 1<<13, 1<<15, 1<<20, 1<<28]
                   arr = zeros(T, Int(bulksize/sizeof(T)))
                   throughput_default = 1e-9 * bulksize / (@belapsed rand!($arr))
                   throughput_tasklocal = 1e-9 * bulksize / (@belapsed rand!(Random.TaskLocal(), $arr))
                   throughput_vrng = 1e-9 * bulksize / (@belapsed rand!(local_rng(), $arr))
                   @show bulksize, throughput_default, throughput_tasklocal, throughput_vrng
               end
           end
       end
============== Vector width: 16 ==================
T = UInt64
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (512, 4.317000627597937, 8.158099675784877, 31.84444721425901)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (2040, 6.731527190352106, 8.782055652411051, 45.01070854899161)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (2056, 6.637851514113275, 19.02396800286542, 46.393909693889185)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (8192, 8.58050625545534, 39.28219040789349, 51.66561021835035)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (32768, 9.53354911444885, 52.93798384460943, 53.7593280083999)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (1048576, 9.432863748403232, 42.98147237252009, 42.106412881982095)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (268435456, 5.438526861203929, 15.268634801450354, 15.285453211337668)
T = Float32
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (512, 5.1034855894287805, 5.068721523505575, 31.081265206812652)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (2040, 6.0572077507275175, 5.252449133383572, 42.308192867188644)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (2056, 6.509435920858681, 15.706879955195411, 41.65001436722631)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (8192, 7.594326504125336, 28.297063903281522, 44.78717885276912)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (32768, 7.860995921724528, 35.79616179509891, 44.99023183666625)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (1048576, 6.933192277175351, 23.920976388730466, 43.4174982402385)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (268435456, 4.403767046491716, 7.357364196926021, 15.373862891045325)
T = Float64
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (512, 8.725876140182429, 5.13528618026116, 31.80282566751728)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (2040, 11.019134943891538, 5.633072906442621, 41.93670543684068)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (2056, 11.421664212384934, 15.460772265144941, 41.58770365816171)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (8192, 13.821794497176954, 28.223673762135302, 44.90401465987021)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (32768, 16.047885944387005, 35.65614798694233, 44.99966454209997)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (1048576, 17.519188678930046, 23.172950276243093, 41.199795685827674)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (268435456, 13.703102286017746, 7.409493045077719, 15.307004960776135)
It doesn't guarantee reproducibility across architectures, so it defaults to 2x or 4x the SIMD vector width, depending on the sampler (here, 2 × W64 = 16). I would set the SIMD threshold to be much smaller. That seems like it would be highly beneficial on Broadwell as well as Skylake-X/Cascadelake-X.
%lshifted = shl <$N x i64> %0, %lshiftOp
%rshifted = lshr <$N x i64> %0, %rshiftOp
%res = or <$N x i64> %lshifted, %rshifted
ret <$N x i64> %res
You could use the funnel shift intrinsic.
Although LLVM compiles either approach into
vprolq $45, %zmm0, %zmm0
given AVX512.
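For illustration, a small sketch (not from this PR) of two spellings of a 64-bit rotate in julia; both lower to a single rotate instruction, e.g. vprolq when vectorized with AVX512 (bitrotate exists since julia 1.5):

# Explicit shift-or pattern; LLVM recognizes this as a rotate.
rotl45a(x::UInt64) = (x << 45) | (x >>> 19)

# Library helper; compiles to the same rotate.
rotl45b(x::UInt64) = bitrotate(x, 45)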
Afaiu fshr was introduced in llvm 7.0.0 (released late '18) -- do we still support older llvm versions? Regardless, either compiles ok at the moment.
How are you initializing the parallel instances? This is something that should be done with some care (e.g., using long jumps).
oneBits(::Type{Float32}) = Base.exponent_one(Float32) + UInt64(Base.exponent_one(Float32))<<32
function forkRand(rng::Union{TaskLocal, Xoshiro}, ::Val{N}) where N |
How are you initializing the parallel instances? This is something that should be done with some care (e.g., using long jumps).
Good point. It seems to be using forkRand.
FWIW, VectorizedRNG.jl does initialize them with jumps.
# Vector-width. Influences random stream. We may want to tune this before merging.
xoshiroWidth() = Val(8) |
The capitalization in this file seems a bit off (camelCase). Is that intentional?
I don't oppose throwing old architectures under the performance bus, but it's still probably worth pointing out that it may also regress CPUs lacking AVX2, such as sandybridge; SIMD bitshifts were introduced there. Maybe I shouldn't set
The fundamental model is not that of a random stream, but instead of a random tree: i.e. we don't have a map from an index to the n-th output. Given an RNG parent-state, we need to (1) advance the parent at least once, and (2) initialize the child RNG, such that future output of the child will be uncorrelated (in the sense of e.g. bigcrush) with future and past output of the parent. They will obviously not be independent (we are writing a PRNG after all) and their dependence will be discoverable by clever tricks (we are not writing a CSPRNG after all).

These two operations fit together like bread and butter! I mean, we must harvest pseudo-random numbers from the parent (in order to avoid collisions if the parent forks twice without generating randomness in between), and we are in need of pseudo-random numbers to seed the child. Gee, what are we to do, I wonder? The code is:
and
This is a very good point: we don't want to distribute fully compiled code for generic arch in the sysimage for this function, due to all the llvmcall.

I must admit that I don't understand https://github.com/chriselrod/VectorizedRNG.jl/blob/865757a23905686abff1202e83d5fafccb3ae295/src/api.jl#L129

The way I generate bulk Float32 / Float64 is that I first use xoshiro and masking to generate random floats in [1,2), and then subtract 1.0 in a second loop. For some reason (you probably know why?) I got pretty bad perf by running the floating-point subtract on the register that holds the xoshiro result -- so instead, I round-trip through memory. A more advanced design would chunk that: fill 8kb of data, spill the vector-rng state to the stack and subtract 1, go on. That way, we could avoid the very annoying doubling of memory traffic. An even better design would figure out how to solve this very annoying problem, and then try to use non-temporal stores (no need to trash the entire L1d). Both of these optimizations could be done without changing the random stream -- i.e. this PR doesn't lock us into a suboptimal design.
Presumably. Unfortunately we can't afford to keep the vectorized rng alive :(
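For readers following along, a minimal scalar sketch (not the PR's SIMD code) of the [1, 2) masking trick described above:

# Keep 52 random mantissa bits, force the exponent of 1.0 so the value lies in
# [1, 2), then subtract 1.0 to land in [0, 1).
function masked_rand_float64(u::UInt64)
    bits = (u >> 12) | 0x3ff0000000000000
    return reinterpret(Float64, bits) - 1.0
end

The bulk code performs the masking pass and the subtract-1 pass as two separate loops over the array, which is the doubling of memory traffic discussed here.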
Okay, testing on Haswell (after a change on master to make ...):
============== Vector width: 8 ==================
T = UInt64
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (512, 1.380471877908609, 2.8312320283123205, 7.50946615382313)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (2040, 2.363368885619713, 3.228238710938591, 9.692474583112592)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (2056, 2.3086086248982918, 6.589561517420497, 9.670169491525424)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (8192, 3.12406779661017, 9.355157974876287, 10.399911256825213)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (32768, 3.4140445926234633, 10.365110118223875, 10.667968908965126)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (1048576, 3.2505820243597734, 10.829040586595063, 10.83956334766788)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (268435456, 2.003070157690122, 4.785000072104206, 4.7845748285930565)
T = Float32
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (512, 2.0273678237932264, 1.9461093378148229, 6.6193994153600855)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (2040, 2.5065701901088775, 2.2042883673813947, 8.113502565770833)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (2056, 2.5714078989529394, 5.485592315901815, 8.047094762298734)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (8192, 3.002565668906536, 7.4209620436633745, 8.650826530197799)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (32768, 2.994699323706818, 8.15559111111111, 8.871801813997564)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (1048576, 2.59226406791561, 5.656787113062806, 8.998798530774776)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (268435456, 1.5789669106483593, 2.382474101715775, 4.803865090209645)
T = Float64
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (512, 2.771631519756662, 1.818756448263703, 6.484977870346265)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (2040, 4.4685190180739145, 2.2339163498098857, 8.150624670068625)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (2056, 4.470342668290138, 5.523998330550919, 8.102293212593805)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (8192, 5.293699515347335, 7.417602318000724, 8.648367711547833)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (32768, 4.752567152056622, 8.131593874078277, 8.880517632711136)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (1048576, 3.9372488941957484, 5.727200729707735, 9.000884142939304)
(bulksize, throughput_default, throughput_tasklocal, throughput_vrng) = (268435456, 3.3682848702750454, 2.364016911760853, 4.7888480960911375)
As before, we really see both the cost of passing over the arrays twice when they are large, and the overhead of forking when they are small. But
No, I don't know why, and I also don't think I'm seeing this. I tried this on the Haswell laptop because Haswell (and Broadwell) only have a single execution unit for floating point vector addition/subtraction (Skylake and beyond have 2). Broadwell/Haswell do have 2 execution units & ports available for fma, so you could try an fma instead. As an aside, given the large dip in performance for large arrays, I think I'll probably add a version that aligns the destination pointer and then uses nontemporal stores. Then a size check (e.g., vs L2 cache size) can pick which version.
I think the former approach would likely be an improvement over what I'm currently doing for the random normal generator. It has to unroll 4x (instead of 2x) because of the high (compared to bit ops) latency of all the arithmetic operations. That leads to a lot more spilling (extra unrolling, plus needing registers to hold polynomial coefficients) than from the batched double-pass approach. However, I'm wondering if the problem you observed is Broadwell specific, or if you can provide an example to reproduce it. I don't seem to see it on either Cascadelake-X or the very-similar-to-Broadwell Haswell.
Why not? That probably accounts for almost all the performance difference. For avx512, I'm also using
The function
Unfortunately not: each task needs to hold an entire private (task-local) instance with all its state. Whether that memory is inside the task-struct or malloced somewhere on the heap doesn't make a difference for that. A thread-local RNG can afford a large internal state (you won't see more than 128 threads, typically) but affords no reproducibility. Task-local gives reproducibility for our composable multi-threading, but there can be thousands or millions of tasks. This PoC has similar or better perf than the current design for sequential generation, and for large bulks has the potential to reach almost parity with your (super impressive!) implementation (i.e. I fully intend to steal your tricks, or talk you into pushing onto this branch, until we get close). For small bulks, well, it's awkward, because we need to pay the cost for bootstrapping a xoshiro-simd instance from our tiny task-local xoshiro-sequential instance.

edit: Unless the core people are willing to pay an extra 8 x 8 x 4 = 256 bytes per task.... I don't think that would be a good tradeoff for the language. An extra 32 bytes is probably ok, though.
On 13 Aug 2020, at 10:51, chethega ***@***.***> wrote:
@vigna
How are you initializing the parallel instances? This is something that should be done with some care (e.g., using long jumps).
to -> rngState0 = 0x02011ce34bce797f * jl_tasklocal_genrandom(from);
to -> rngState1 = 0x5a94851fb48a6e05 * jl_tasklocal_genrandom(from);
to -> rngState2 = 0x3688cf5d48899fa7 * jl_tasklocal_genrandom(from);
to -> rngState3 = 0x867b4bb4c42e5661 * jl_tasklocal_genrandom(from);
I would pass over the PRNG state with a good mixing function (e.g., SpookyHash, which has a 4-word state). See the code for split() (the operation you're describing) here:
http://dsiutils.di.unimi.it/docs/it/unimi/dsi/util/XoShiRo256PlusPlusRandom.html
Ciao,
seba
So, after triage call today, it looks like this can go forward -- modulo implementation details reaching sufficient quality. Yay!

@vigna The reason why I am so reluctant about your proposed scheme (that you use in dsiutils) is the following: I come more from the crypto-engineering world, where we think about how to compose primitives into modes of operation / protocols (not the symmetric cryptography world -- these people would be qualified to comment on the internals). There, people think in a pretty modular fashion; accessing internal state of an RNG is a big fat no-no. If you want to seed a new RNG from an old one, then you only access its output stream, and never its internal state. The reason is that this is a composable action with provable security: If both stream-output RNGs are "secure", then the resulting pair of RNGs is "secure" (I am deliberately vague here). The point is not so much whether it is insecure / leads to bad randomness to access the internal state of the parent, but rather that it makes security proofs more difficult.

The security proof is the following: If it is possible to distinguish the parent/child pair from true random without knowing the parent seed/key, then one can use this distinguisher to break the single stream-random, simply by performing the split, i.e. a proof-by-reduction. You see how this breaks once we start accessing internal state of the parent -- the attacker can't access that, and our proof breaks down.

Now xoshiro is a very specific RNG, and it is not intended to be cryptographically secure. Nevertheless, it hurts me to access internal bits -- it deprives people of a lot of tools to reason about the resulting construction. Can we find something that receives your blessing and only uses the output stream of the parent? Your blessing would be super nice, because I am not really qualified to analyse the internals of xoshiro. What we need to do here is only to break self-interactions -- i.e. bad effects that come from the fact that the 4 seed values are verbatim subsequent output-values of a xoshiro instance. The reason I chose multiplication with effectively random constants is that "integer multiply" is not used in the xoshiro iteration, and hence has a lower chance of bad interaction. We have a handful of cycles to burn here. If you tell me that it is better to apply one round of spooky to 32 output bytes to get the 32 seed bytes, I'll happily write that (but spooky uses the same kind of primitives as xoshiro...). But I just don't want to read out the internal state :(
For me the main problem is that you're almost using bits of state to seed the new state (multiplication is a weak scrambler). I'd use at least a MurmurHash mixing instead of a simple multiplication. For example, the lowest bit you're using for the state is exactly the lowest output bit, which is not good IMHO.
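For concreteness, the kind of mixing step suggested here is a MurmurHash3-style 64-bit finalizer; a sketch using the published fmix64 constants:

function fmix64(h::UInt64)
    h ⊻= h >> 33
    h *= 0xff51afd7ed558ccd
    h ⊻= h >> 33
    h *= 0xc4ceb9fe1a85ec53
    h ⊻= h >> 33
    return h
end

Unlike a bare multiplication by an odd constant, which leaves the lowest input bit as the lowest output bit, every output bit of fmix64 depends on every input bit.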
Apart from the high speed, Xoshiro has a small memory footprint, making it suitable for applications where many different random states need to be held for a long time.
Julia's Xoshiro implementation has a bulk-generation mode; this seeds new virtual PRNGs from the parent, and uses SIMD to generate in parallel (i.e. the bulk stream consists of multiple interleaved xoshiro instances). The virtual PRNGs are discarded once the bulk request has been serviced (and should cause no heap allocations). |
Please insert newlines (here and elsewhere); this one doesn't even fit on a big screen! This will make reviewing on GitHub easier.
end
@inline function rand(rng::Xoshiro, ::Type{UInt64}) |
Suggested change:
-@inline function rand(rng::Xoshiro, ::Type{UInt64})
+@inline function rand(rng::Xoshiro, ::SamplerType{UInt64})
end
@inline function rand(::TaskLocal, ::Type{UInt64}) |
Suggested change:
-@inline function rand(::TaskLocal, ::Type{UInt64})
+@inline function rand(::TaskLocal, ::SamplerType{UInt64})
end
# Shared implementation between Xoshiro and TaskLocal -- seeding
function seed!(rng::Union{TaskLocal, Xoshiro}, seed::Union{Int128, UInt128, BigInt}) |
For now, I would prefer being more conservative on valid seeds here (like with MersenneTwister): with a big BigInt, high bits will simply be discarded, e.g. two seeds which differ by 2^130 lead to the same initialization state. Similarly, assuming s::Int128 < 0, then s and s % UInt128 lead to the same state, while they represent distinct numbers. So I suggest defining this method only for UInt128 (like MT does for Vector{UInt32}), and defining other integer methods with something like seed!(rng, seed::Integer) = seed!(rng, UInt128(seed)) (two equal integers should lead to the same state).
rand(RandomDevice(), UInt64),
rand(RandomDevice(), UInt64))
function seed!(rng::Union{TaskLocal, Xoshiro}, seed::AbstractVector{UInt64}) |
An error should be thrown if length(seed) > 4.
@inline rand(rng::Union{TaskLocal, Xoshiro}, ::Type{Int128}) = rand(rng, UInt128) % Int128
@inline rand(rng::Union{TaskLocal, Xoshiro}, ::SamplerType{Int128}) = rand(rng, UInt128) % Int128
@inline rand(rng::Union{TaskLocal, Xoshiro}, ::Type{T}) where {T<:Union{Bool, UInt8, Int8, UInt16, Int16, UInt32, Int32, Int64}} = rand(rng, UInt64) % T |
@inline rand(rng::Union{TaskLocal, Xoshiro}, ::Type{T}) where {T<:Union{Bool, UInt8, Int8, UInt16, Int16, UInt32, Int32, Int64}} = rand(rng, UInt64) % T
@inline rand(rng::Union{TaskLocal, Xoshiro}, ::SamplerType{Int128}) = rand(rng, UInt128) % Int128
@inline rand(rng::Union{TaskLocal, Xoshiro}, ::Type{T}) where {T<:Union{Bool, UInt8, Int8, UInt16, Int16, UInt32, Int32, Int64}} = rand(rng, UInt64) % T
@inline rand(rng::Union{TaskLocal, Xoshiro}, ::SamplerType{T}) where {T<:Union{Bool, UInt8, Int8, UInt16, Int16, UInt32, Int32, Int64, UInt64}} = rand(rng, UInt64) % T |
Remove the UInt64 from the Union, as rand(rng, ::SamplerType{UInt64}) should be defined elsewhere as the "native" method.
@inline rand(rng::Union{TaskLocal, Xoshiro}, ::Type{T}) where {T<:Union{Bool, UInt8, Int8, UInt16, Int16, UInt32, Int32, Int64}} = rand(rng, UInt64) % T
@inline rand(rng::Union{TaskLocal, Xoshiro}, ::SamplerType{T}) where {T<:Union{Bool, UInt8, Int8, UInt16, Int16, UInt32, Int32, Int64, UInt64}} = rand(rng, UInt64) % T
Sampler(rng::Union{TaskLocal, Xoshiro}, ::Type{T}) where T<:Union{Float64, Float32} = SamplerType{T}() |
This shouldn't be necessary, this definition is already the default.
function rand!(rng::Union{TaskLocal, Xoshiro}, dst::Union{Array{Float32}, UnsafeView{Float32}}, ::SamplerType{Float32} = SamplerType{Float32}()) |
Use ::SamplerTrivial{CloseOpen12{Float32}} instead of SamplerType, and the default value shouldn't be necessary; it should already be handled by the Random API.
end
function seed!(rng::Union{TaskLocal, Xoshiro}, seed::AbstractVector{UInt32})
# can only process an even length
Couldn't the last entry of an odd-length vector be cast to a UInt64? I guess that could be implemented later, but no bits of the input should be discarded (i.e. better to error out on odd length).
We discussed this on triage last month and decided to go ahead with it. The reasoning was the following:
@chethega, are you willing to update this? If not, perhaps we can have someone else fix it up.
I think there's a way to integrate something like this with context variables (#35833) to support this. The Cilk devs investigated a splitter "hyperobject", which is quite similar.
Updated in #40546 |
This introduces a tasklocal random number generator. In view of #27614, I have chosen xoshiro256 as the initial proposal.
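For reference, a minimal scalar sketch of the xoshiro256 step over a four-word (32-byte) state, written from the published reference algorithm with the ++ output scrambler (the exact scrambler used in this PR may differ):

rotl(x::UInt64, k::Int) = (x << k) | (x >>> (64 - k))

function xoshiro256pp_step(s::NTuple{4,UInt64})
    s0, s1, s2, s3 = s
    out = rotl(s0 + s3, 23) + s0       # ++ output scrambler
    t = s1 << 17
    s2 ⊻= s0; s3 ⊻= s1; s1 ⊻= s2; s0 ⊻= s3
    s2 ⊻= t
    s3 = rotl(s3, 45)
    return out, (s0, s1, s2, s3)       # output and the advanced state
end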
Task Locality
Task locality of the RNG state has considerable upside: First, it is much easier to get single requests fast (less pointer chasing, no synchronization / thread-safety issues). Second, we obtain extremely nice reproducibility: when using a task-local RNG, the results of multi-threaded computations are independent of scheduler decisions, as long as the user code has no race conditions. In fact, as long as task creation is independent of the number of threads (e.g. recursive divide-and-conquer), randomized simulations are also independent of the number of threads.
There are two downsides of that: First, tasks become 32 bytes heavier, and task creation becomes a handful of cycles slower (should be <100 cycles). Second, it is not possible to separate a task's RNG instance from the owning task.
The way this is implemented is that task creation consumes 32 random bytes from the parent's task-local stream, in order to seed the new task's RNG.
Internals
As you see, the speed of this relies on some codegen trickery to accelerate current_task(). I'd appreciate any comments on that; if desired, we can split that change off into a separate commit/PR. It is, unfortunately, fundamentally impossible for packages to provide fast task-local random: there can be only 0-1 task-local RNGs that are handled by the julia runtime.
Performance: Cache Pressure
The tasklocal RNG has 32 contiguous bytes of state. No cache of generated random numbers, no giant DSFMT array -- hence, no eviction of precious cache-lines filled with valuable user data.
Speed: Single numbers
Thanks to the newly accelerated current_task(), single random generation can be inlined into the callsite, and the access to the random state can be LICMd. The smaller function-call overhead more than compensates for the simplified approach (always consume 64-128 random bits, even for rand(Bool)).
Speed: Bulk generation
For bulk generation of large arrays, we use the task-local RNG to seed an ephemeral RNG instance (4 x xoshiro), and use that to fill the array. The ephemeral RNG has 128 bytes of state, which is imho too expensive to maintain permanently. By choosing an ephemeral instance, we can just keep the state in registers until the array is full and discard it afterwards; no need to save state to either stack or heap.
I had some trouble getting the large bulk generation reliably vectorized, and resorted to effectively writing it in LLVM-IR. The tests in #27614 suggest that 256 or 512 bytes of ephemeral state would be even faster on powerful hardware; but this variant is conservative (changing the size is a simple search-and-replace), and I don't want to break Raspberry Pi perf.
Making this the default
I think something like this should become the default. It appears like a strict (pareto) improvement over DSFMT, and has much simpler reproducibility / thread-safety semantics.
Warts
The names need quite a bit of bikeshedding.
We probably should use some better seeding function. I just used the murmur-hash that was lying around, but we really want a fast pseudo-random function on 32 byte (e.g. sha2-256) that has no chance in hell of interacting with xoshiro. This is only called on seeding / task creation, and at 4 cycles/byte we could get real crypto (and 128 extra cycles for task creation is not a lot).
@rfourquet , please help me clean up my misuse of the Random / Sampler API. It is too complicated for me.
The bulk generation of floats currently traverses the array twice: first to do all the bitwise ops, then to do all the floating point ops. This is due to x86 issues: there is, afaiu, some split between how registers are used. Hence we must commit the intermediate values to "memory", and cannot fully interleave the bitwise and floating point (x -> x - 1) operations. Main memory is slow, so the correct way would be to write ~25% of L2d with the RNG, then run over the data again for the floating point ops, etc. Still faster than DSFMT on my machine, but might be bad for some users (if they are bandwidth-bound instead of CPU-bound, and we perturb other threads/cores by hogging whatever shared resources, e.g. shared L3).
Reproducibility
With task-local RNG, the following becomes reproducible:
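Since the concrete snippet is elided here, a minimal sketch (assumed, not necessarily the original example) of the kind of program that becomes reproducible: a recursive divide-and-conquer sum whose task tree, and therefore whose result, does not depend on the number of threads or on scheduler decisions:

using Random
Random.seed!(1234)   # seeds the root task's RNG in the task-local design

function mc_sum(n)
    n <= 10_000 && return sum(rand() for _ in 1:n)
    t = Threads.@spawn mc_sum(n ÷ 2)   # child task forks its RNG state from this task
    return mc_sum(n - n ÷ 2) + fetch(t)
end

mc_sum(10^7)   # same value for any number of threads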
Benchmark: Single generation speed
gives
Benchmark: Bulk generation speed
gives