Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add docs on task-specific buffering using multithreading #48542

Merged

Conversation

IanButterworth
Copy link
Member

@IanButterworth IanButterworth commented Feb 5, 2023

It's common to see people using threadid()-based buffers, for instance in a toy sum example

function sum_multi(a)
     buffers = zeros(eltype(a), Threads.nthreads())
     Threads.@threads for i in eachindex(a)
         buffers[Threads.threadid()] += a[i]
     end
     s = 0
     for b in buffers
         s += b
     end
     return s
 end

but after task migration #40715 (which is undocumented AFAICT) this practice is not race-safe.

This is an attempt to document some best practice.

I expected the task_local_storage api to be the right answer here, but I couldn't figure out how to reduce it after the @threads loop returned.

Help appreciated in making this correct

@IanButterworth IanButterworth added docs This change adds or pertains to documentation multithreading Base.Threads and related functionality labels Feb 5, 2023
@IanButterworth IanButterworth force-pushed the ib/threads_data_race_docs branch 3 times, most recently from cd1070f to d6d0234 Compare February 5, 2023 18:33
@vchuravy vchuravy requested a review from tkf February 5, 2023 18:46
Copy link
Member

@vtjnash vtjnash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for tackling this

doc/src/manual/multi-threading.md Outdated Show resolved Hide resolved
doc/src/manual/multi-threading.md Outdated Show resolved Hide resolved
500000500000
```

Note that we do not use buffers based on the `threadid()` i.e. `buffers = zeros(Threads.nthreads())` because tasks can
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also because nthreads is soft-deprecated (no warning, but may return incorrect answers now)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't add a note on this. I don't understand whether the docs should or shouldn't recommend nthreads()

To fix this, buffers that are specific to the task may be used to segment the sum into chunks that are race-free.

```julia-repl
julia> function sum_multi_good(a)
Copy link

@torfjelde torfjelde Feb 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not an expert on multi-threaded parallelism, so I might just be wrong.

But this doesn't seem like it'll be particularly performant as the values of the WeakKeyDict won't be contiguous in memory, and so the reduction at the needs to gather these from likely different parts of the heap => slow reduction. In fact, when I tried running something like this on an example of mine it was incredibly slow; probably partially due to what I just mentioned, and partially due to the slowness of WeakKeyDict (I believe you can also use an IdDict which at least should be faster that WeakKeyDict).

An alternative is Atomic as mentioned below, or, even better, you can do a Vector{Atomic{T}} of some buffer_length. In the latter scenario you can then just randomly pick an index for each Task, and act on the corresponding Atomic{T} atomically, e.g. atomic_add!(buffer[rand(1:buffer_size)], i). If the work within each of the tasks is fairly uniform, then this random picking of index to add to should result in fairly even congestion for the different buffers.

I'm sure there are much, much better ways of doing this, but the above is a quick-and-easy way to implement a much faster version of this kind of map-reduce approach using a buffer.

@Moelf
Copy link
Contributor

Moelf commented May 30, 2023

just to move some of the comments here to provide some context as to why this PR is very much needed.

Problem with task local storage

  • it's not type stable out of the box
  • if what you're buffering involves large RAM or slow I/O and the result is used across task boundary, tasks may over spawn and overlap with each other (see this for example)

Our ecosystem contains many silently incorrect code now without this

The easiest way for these to be fixed is probably to provide a replacement for what used to work, thus this PR is pretty important.

@vtjnash
Copy link
Member

vtjnash commented May 31, 2023

At least most of those had the decency to avoid having inbounds annotations, which can quickly (and often) turns mild wrong code into drastic wrong answers

@vtjnash
Copy link
Member

vtjnash commented May 31, 2023

Those seem more impetus to continue looking into #48543 perhaps?

@KristofferC
Copy link
Member

I think many of those mimiced the implementation of the old default rng.

@IanButterworth IanButterworth force-pushed the ib/threads_data_race_docs branch from 1d330f7 to 1c0ebe6 Compare June 16, 2023 18:23
@IanButterworth IanButterworth marked this pull request as ready for review June 16, 2023 18:23
@IanButterworth
Copy link
Member Author

Perhaps this is ready for final review? @MasonProtter perhaps?

@IanButterworth IanButterworth changed the title RFC: add docs on task-specific buffering using @threads Add docs on task-specific buffering using multithreading Jun 16, 2023
Copy link
Contributor

@MasonProtter MasonProtter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for suggested that code and now I'm suggesting changes, but I guess it would be nice and concise here to just re-use the single threaded version of the function and run it on each of the spawned chunks, and also use it for reducing the returned data?

doc/src/manual/multi-threading.md Outdated Show resolved Hide resolved
doc/src/manual/multi-threading.md Outdated Show resolved Hide resolved
doc/src/manual/multi-threading.md Outdated Show resolved Hide resolved
@IanButterworth IanButterworth force-pushed the ib/threads_data_race_docs branch from 1c0ebe6 to 0c21503 Compare June 29, 2023 18:29
@IanButterworth IanButterworth force-pushed the ib/threads_data_race_docs branch from 0c21503 to 7f8d474 Compare June 29, 2023 18:34
@IanButterworth
Copy link
Member Author

@MasonProtter I absorbed your suggestions. Can you give this a last review please. Thanks

Copy link
Contributor

@MasonProtter MasonProtter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it!

@IanButterworth IanButterworth added backport 1.9 Change should be backported to release-1.9 backport 1.10 Change should be backported to the 1.10 release merge me PR is reviewed. Merge when all tests are passing labels Jun 29, 2023
@IanButterworth IanButterworth merged commit 02f80c6 into JuliaLang:master Jun 30, 2023
@IanButterworth IanButterworth deleted the ib/threads_data_race_docs branch June 30, 2023 00:12
@IanButterworth IanButterworth removed the merge me PR is reviewed. Merge when all tests are passing label Jun 30, 2023
IanButterworth added a commit that referenced this pull request Jun 30, 2023
@KristofferC KristofferC mentioned this pull request Jul 11, 2023
35 tasks
KristofferC added a commit that referenced this pull request Aug 18, 2023
Backported PRs:
- [x] #47782 <!-- Generalize Bool parse method to AbstractString -->
- [x] #48634 <!-- Remove unused "deps" mechanism in internal sorting
keywords [NFC] -->
- [x] #49931 <!-- Lock finalizers' lists at exit -->
- [x] #50064 <!-- Fix numbered prompt with input only with comment -->
- [x] #50474 <!-- docs: Fix a `!!! note` which was miscapitalized -->
- [x] #50516 <!-- Fix visibility of assert on GCC12/13 -->
- [x] #50635 <!-- `versioninfo()`: include build info and unofficial
warning -->
- [x] #49915 <!-- Revert "Remove number / vector (#44358)" -->
- [x] #50781 <!-- fix `bit_map!` with aliasing -->
- [x] #50845 <!-- fix #50438, use default pool for at-threads -->
- [x] #49031 <!-- Update inference.md -->
- [x] #50289 <!-- Initialize prev_nold and nold in gc_reset_page -->
- [x] #50559 <!-- Expand kwcall lowering positional default check to
vararg -->
- [x] #49582 <!-- Update HISTORY.md for `DelimitedFiles` -->
- [x] #50341 <!-- invokelatest docs should say not exported before 1.9
-->
- [x] #50525 <!-- only check that values are finite in `generic_lufact`
when `check=true` -->
- [x] #50444 <!-- Optimize getfield lowering to avoid boxing in some
cases -->
- [x] #50523 <!-- Avoid generic call in most cases for getproperty -->
- [x] #50860 <!-- Add `Base.get_extension` to docs/API -->
- [x] #50164 <!-- codegen: handle dead code with unsafe_store of FCA
pointers -->
- [x] #50568 <!-- `Array(::AbstractRange)` should return an `Array` -->
- [x] #50871 <!-- macOS: Don't inspect dead threadtls during exception
handling. -->

Need manual backport:
- [ ] #48542 <!-- Add docs on task-specific buffering using
multithreading -->
- [ ] #50591 <!-- build: fix various makefile bugs -->


Non-merged PRs with backport label:
- [ ] #50842 <!-- Avoid race conditions with recursive rm -->
- [ ] #50823 <!-- Make ranges more robust with unsigned indexes. -->
- [ ] #50663 <!-- Fix Expr(:loopinfo) codegen -->
- [ ] #49716 <!-- Update varinfo() docstring signature -->
- [ ] #49713 <!-- prevent REPL from erroring in numbered mode in some
situations -->
- [ ] #49573 <!-- Implement jl_cpu_pause on PPC64 -->
- [ ] #48726 <!-- fix macro expansion of property destructuring -->
- [ ] #48642 <!-- Use gc alloc instead of alloc typed in lowering -->
- [ ] #48183 <!-- Don't use pkgimage for package if any includes fall in
tracked path for coverage or alloc tracking -->
- [ ] #48050 <!-- improve `--heap-size-hint` arg handling -->
- [ ] #47615 <!-- Allow threadsafe access to buffer of type inference
profiling trees -->
IanButterworth added a commit that referenced this pull request Aug 19, 2023
Co-authored-by: Mason Protter <[email protected]>
(cherry picked from commit 02f80c6)
IanButterworth added a commit that referenced this pull request Aug 19, 2023
Co-authored-by: Mason Protter <[email protected]>
(cherry picked from commit 02f80c6)
KristofferC added a commit that referenced this pull request Oct 2, 2023
Backported PRs:
- [x] #48625 <!-- add replace(io, str, patterns...) -->
- [x] #48387 <!-- Improve documentation of sort-related functions -->
- [x] #48363 <!-- Revise sort.md and docstrings in sort.jl -->
- [x] #48977 <!-- Update SparseArrays.jl stdlib for SuiteSparse 7 -->
- [x] #50719 <!-- fix `CyclePadding(::DataType)` -->
- [x] #50694 <!-- inference: permit recursive type traits -->
- [x] #50860 <!-- Add `Base.get_extension` to docs/API -->
- [x] #50594 <!-- Disallow non-index Integer types in isassigned -->
- [x] #50802 <!-- Makes IntrusiveLinkedListSynchronized mutable to avoid
allocation on poptask -->
- [x] #50858 <!-- Add a `threadpool` parameter to `Channel` constructor
-->
- [x] #50874 <!-- Restrict COFF to a single thread when symbol count is
high -->
- [x] #50822 <!-- Add default method for setmodifiers! -->
- [x] #50730 <!-- Fix integer overflow in `isapprox` -->
- [x] #50850 <!-- Remove weird Rational dispatch and add pi functions to
list -->
- [x] #50809 <!-- Limit type-printing in MethodError -->
- [x] #50915 <!-- Add note the `Task` about sticky bit -->
- [x] #50929 <!-- when widening tuple types in tmerge, only widen the
complex parts -->
- [x] #50928 <!-- Bump JuliaSyntax to 0.4.6 -->
- [x] #50959 <!-- Update libssh2 patches -->
- [x] #50823 <!-- Make ranges more robust with unsigned indexes. -->
- [x] #48542 <!-- Add docs on task-specific buffering using
multithreading -->
- [x] #50912 <!-- Separate foreign threads into a :foreign threadpool
-->
- [x] #51010 <!-- Add ORIGIN to SuiteSparse rpath on Linux/FreeBSD -->
- [x] #50753 <!-- cat: remove unused promote_eltype methods that confuse
inference -->
- [x] #51027 <!-- Implement realloc accounting correctly -->
- [x] #51019 <!-- fix a case of potentially use of undefined variable
when handling error in distributed message processing -->
- [x] #51039 <!-- Revert "Optimize findall(f, ::AbstractArray{Bool})
(#42202)" -->
- [x] #51036 <!-- add missing invoke edge for nospecialize targets -->
- [x] #51042 <!-- inference: fix return_type_tfunc modeling of concrete
functions -->
- [x] #51114 <!-- Workaround upstream FreeBSD issue #272992 -->
- [x] #50892 <!-- Add `JL_DLLIMPORT` to `small_typeof` declaration -->
- [x] #51154 <!-- broadcast: use recursion rather than ntuple to map
over a tuple -->
- [x] #51153 <!-- fix debug typo in "add missing invoke edge for
nospecialize targets (#51036)" -->
- [x] #51222 <!-- Check again if the tty is open inside the IO lock -->
- [x] #51236 <!-- Add lock around uv_unref during init -->
- [x] #51243 <!-- GMP: Gracefully handle more overflows. -->
- [x] #51254 <!-- Ryu: make sure adding zeros does not overwrite
trailing dot -->
- [x] #51175 <!-- shorten stale_age for cachefile lock -->
- [x] #51300 <!-- fix method definition error for bad vararg -->
- [x] #51307 <!-- fix force-throw ctrl-C on Windows -->
- [x] #51303 <!-- ensure revising structs is safe -->
- [x] #51393 
- [x] #51403 

Need manual backport:
- [x] #51009 <!-- fix #50562, regression in `in` of tuple of Symbols -->
- [x] #51053 <!-- Bump Statistics stdlib -->
- [x] #51013 <!-- fix #50709, type bound error due to free typevar in
sparam env -->
- [x] #51305 <!-- reduce test time for rounding and floatfuncs -->

Contains multiple commits, manual intervention needed:
- [ ] #50663 <!-- Fix Expr(:loopinfo) codegen -->
- [ ] #51035 <!-- refactor GC scanning code to reflect jl_binding_t are
now first class -->
- [ ] #51092 <!-- inference: fix bad effects for recursion -->
- [x] #51247 <!-- Check if malloc has succeeded before incrementing gc
stats -->
- [x] #51294 <!-- use LibGit2_jll for LibGit2 library -->

Non-merged PRs with backport label:
- [ ] #51132 <!-- Handle `AbstractQ` in concatenation -->
- [x] #51029 <!-- testdefs: make sure that if a test set changes the
active project, they change it back when they're done -->
- [ ] #50919 <!-- Code loading: do the "skipping mtime check for stdlib"
check regardless of the value of `ispath(f)` -->
- [ ] #50824 <!-- Add some aliasing warnings to docstrings for mutating
functions -->
- [x] #50385 <!-- Precompile pidlocks: add to NEWS and docs -->
- [ ] #49805 <!-- Limit TimeType subtraction to AbstractDateTime -->
@KristofferC KristofferC removed the backport 1.10 Change should be backported to the 1.10 release label Oct 2, 2023
nalimilan pushed a commit that referenced this pull request Nov 5, 2023
Co-authored-by: Mason Protter <[email protected]>
(cherry picked from commit 02f80c6)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 1.9 Change should be backported to release-1.9 docs This change adds or pertains to documentation multithreading Base.Threads and related functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants