
RFC: Work-stealing scheduler #43366

Closed
wants to merge 14 commits

Conversation

tkf
Member

@tkf tkf commented Dec 8, 2021

This PR implements a work-stealing scheduler for Julia's Parallel Task Runtime. It is based on the lock-free 1 work-stealing deque by Chase and Lev (2005). The scheduler can be selected at startup via the environment variable JULIA_THREAD_SCHEDULER: starting Julia with JULIA_THREAD_SCHEDULER=workstealing julia uses the work-stealing scheduler; otherwise, it uses the default multi-queue 2.

A quick benchmark for computing fib(30) using tasks shows that this implementation can be up to 2x faster than the current scheduler, especially at relatively low thread counts. I think there are a few more optimization opportunities.

image

@vtjnash From our discussion in the threading BoF a couple of months ago, I think you were in favor of adding alternative schedulers to Julia? What do you think about adding an alternative scheduler like this?

A bit more context: I'm playing with the depth-first scheduler now, and I need a well-described scheduler, like work-stealing, as a baseline. I then thought I'd package it up in a way that's useful for everyone.

Before merge:

  • Mention JULIA_THREAD_SCHEDULER in the documentation
  • Revert back to the original scheduler by default
  • Wait for @vtjnash's review

Footnotes

  1. Due to the way "task locking" is handled ATM, I don't think the current enqueue and dequeue implementation is lock-free. I haven't found a better way to do this so far.

  2. For testing purposes, this branch currently defaults to work-stealing. This should be reverted before merging.

@tkf tkf added multithreading Base.Threads and related functionality needs compat annotation Add !!! compat "Julia x.y" to the docstring needs docs Documentation for this change is required needs news A NEWS entry is required for this change labels Dec 8, 2021
src/partr.c Outdated
jl_task_t *task = jl_atomic_load_relaxed(
(_Atomic(jl_task_t **))&wsdeques[tid].tasks[t % tasks_per_heap]);
if (!atomic_compare_exchange_strong_explicit(
&wsdeques[tid].top, &t, t + 1, memory_order_seq_cst, memory_order_relaxed))
Member

Doesn't this shrink the workqueue from the top? And thus every time we steal we lose space on top?

Member Author

It uses the .tasks buffer as a circular array so it's OK that the .top is never decremented. (In fact, that's the main cleverness of the Chase-Lev deque.)

@tkf
Member Author

tkf commented Dec 9, 2021

tester_linux64 for the second-to-last commit b1faf76 failed with a long trace of rr output(?) (see below), but the test passed for the last commit, which just merged the master branch. The merged commits do not seem to contain anything relevant: 8197c41...63f5b8a

Is it a known error? Or can it be related to this PR?


Worker 7 terminated.
UNHANDLED TASK ERROR: EOFError: read end of file
Stacktrace:
 [1] (::Base.var"#wait_locked#660")(s::TCPSocket, buf::IOBuffer, nb::Int64)
   @ Base ./stream.jl:941
 [2] unsafe_read(s::TCPSocket, p::Ptr{UInt8}, nb::UInt64)
   @ Base ./stream.jl:950
 [3] unsafe_read
   @ ./io.jl:751 [inlined]
 [4] unsafe_read(s::TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
   @ Base ./io.jl:750
 [5] read!
   @ ./io.jl:752 [inlined]
 [6] deserialize_hdr_raw
   @ /buildworker/worker/tester_linux64/build/share/julia/stdlib/v1.8/Distributed/src/messages.jl:167 [inlined]
 [7] message_handler_loop(r_stream::TCPSocket, w_stream::TCPSocket, incoming::Bool)
   @ Distributed /buildworker/worker/tester_linux64/build/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:165
 [8] process_tcp_streams(r_stream::TCPSocket, w_stream::TCPSocket, incoming::Bool)
   @ Distributed /buildworker/worker/tester_linux64/build/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:126
 [9] (::Distributed.var"#99#100"{TCPSocket, TCPSocket, Bool})()
   @ Distributed ./task.jl:466
LinearAlgebra/addmul               (7) |         failed at 2021-12-08T23:40:51.785
ProcessExitedException(7)
Stacktrace:
  [1] try_yieldto(undo::typeof(Base.ensure_rescheduled))
    @ Base ./task.jl:820
  [2] wait()
    @ Base ./task.jl:880
  [3] wait(c::Base.GenericCondition{ReentrantLock})
    @ Base ./condition.jl:124
  [4] take_buffered(c::Channel{Any})
    @ Base ./channels.jl:415
  [5] take!(c::Channel{Any})
    @ Base ./channels.jl:409
  [6] take!(::Distributed.RemoteValue)
    @ Distributed /buildworker/worker/tester_linux64/build/share/julia/stdlib/v1.8/Distributed/src/remotecall.jl:719
  [7] remotecall_fetch(::Function, ::Distributed.Worker, ::String, ::Vararg{String}; kwargs::Base.Pairs{Symbol, UInt128, Tuple{Symbol}, NamedTuple{(:seed,), Tuple{UInt128}}})
    @ Distributed /buildworker/worker/tester_linux64/build/share/julia/stdlib/v1.8/Distributed/src/remotecall.jl:462
  [8] remotecall_fetch(::Function, ::Int64, ::String, ::Vararg{String}; kwargs::Base.Pairs{Symbol, UInt128, Tuple{Symbol}, NamedTuple{(:seed,), Tuple{UInt128}}})
    @ Distributed /buildworker/worker/tester_linux64/build/share/julia/stdlib/v1.8/Distributed/src/remotecall.jl:493
  [9] macro expansion
    @ /buildworker/worker/tester_linux64/build/share/julia/test/runtests.jl:251 [inlined]
 [10] (::var"#43#55"{Vector{Task}, var"#print_testworker_errored#51"{ReentrantLock, Int64, Int64}, var"#print_testworker_stats#49"{ReentrantLock, Int64, Int64, Int64, Int64, Int64, Int64}, Vector{Any}, Dict{String, DateTime}})()
    @ Main ./task.jl:466
igaction for time 61128
[RecordSession] after cont: status=0x7057f (PTRACE_EVENT_SECCOMP)
[RecordSession] EXEC_START: status=0x7057f (PTRACE_EVENT_SECCOMP)
[Task] resuming execution of 30060 with PTRACE_SYSCALL tick_period

(... many similar lines ...)

[Task] Advancing tid 28542 to exit
[Task] Waiting for exit of 28542
[Task] Reaping 28542
[FATAL /workspace/srcdir/rr/src/Task.cc:140:reap() errno: ECHILD] Unexpected wait status for tid 28542
=== Start rr backtrace:
/home/buildworker/.julia/artifacts/0b4d25eb5d440987b9777d85f1200cd692c3fc45/bin/rr(_ZN2rr13dump_rr_stackEv+0x28)[0x5a1208]
/home/buildworker/.julia/artifacts/0b4d25eb5d440987b9777d85f1200cd692c3fc45/bin/rr(_ZN2rr15notifying_abortEv+0x47)[0x5a5367]
/home/buildworker/.julia/artifacts/0b4d25eb5d440987b9777d85f1200cd692c3fc45/bin/rr[0x5d2aa2]
/home/buildworker/.julia/artifacts/0b4d25eb5d440987b9777d85f1200cd692c3fc45/bin/rr(_ZN2rr4Task4reapEv+0x110)[0x5e67e0]
/home/buildworker/.julia/artifacts/0b4d25eb5d440987b9777d85f1200cd692c3fc45/bin/rr(_ZN2rr10RecordTaskD1Ev+0x4e8)[0x5b1ee8]
/home/buildworker/.julia/artifacts/0b4d25eb5d440987b9777d85f1200cd692c3fc45/bin/rr(_ZN2rr10RecordTaskD0Ev+0x9)[0x5b1f29]
/home/buildworker/.julia/artifacts/0b4d25eb5d440987b9777d85f1200cd692c3fc45/bin/rr[0x5dd8b0]
/home/buildworker/.julia/artifacts/0b4d25eb5d440987b9777d85f1200cd692c3fc45/bin/rr(_ZN2rr19rec_process_syscallEPNS_10RecordTaskE+0x2a5)[0x53a745]
/home/buildworker/.julia/artifacts/0b4d25eb5d440987b9777d85f1200cd692c3fc45/bin/rr(_ZN2rr13RecordSession21syscall_state_changedEPNS_10RecordTaskEPNS0_9StepStateE+0x917)[0x52de87]
/home/buildworker/.julia/artifacts/0b4d25eb5d440987b9777d85f1200cd692c3fc45/bin/rr(_ZN2rr13RecordSession11record_stepEv+0x411)[0x530e31]
/home/buildworker/.julia/artifacts/0b4d25eb5d440987b9777d85f1200cd692c3fc45/bin/rr(_ZN2rr13RecordCommand3runERSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS7_EE+0xd43)[0x514993]
/home/buildworker/.julia/artifacts/0b4d25eb5d440987b9777d85f1200cd692c3fc45/bin/rr(main+0x2c1)[0x4c16c1]
/lib64/libc.so.6(__libc_start_main+0x100)[0x7fa84de61d20]
/home/buildworker/.julia/artifacts/0b4d25eb5d440987b9777d85f1200cd692c3fc45/bin/rr[0x4c1afe]
=== End rr backtrace
`rr` returned 0, packing and uploading traces...
rr: Trace file `/tmp/jl_oprE7Q/rr_traces/julia-0/incomplete' found.
rr recording terminated abnormally and the trace is incomplete.

from https://build.julialang.org/#/builders/70/builds/273

@codecov

codecov bot commented Dec 9, 2021

Codecov Report

Merging #43366 (b6bca19) into master (63f5b8a) will decrease coverage by 0.02%.
The diff coverage is n/a.

❗ Current head b6bca19 differs from pull request most recent head 0e50d01. Consider uploading reports for the commit 0e50d01 to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master   #43366      +/-   ##
==========================================
- Coverage   89.55%   89.52%   -0.03%     
==========================================
  Files         342      342              
  Lines       80387    80387              
==========================================
- Hits        71992    71970      -22     
- Misses       8395     8417      +22     
Impacted Files Coverage Δ
stdlib/LibGit2/src/error.jl 81.81% <0.00%> (-9.10%) ⬇️
stdlib/Logging/src/ConsoleLogger.jl 90.10% <0.00%> (-3.30%) ⬇️
base/secretbuffer.jl 96.84% <0.00%> (-3.16%) ⬇️
base/pcre.jl 85.14% <0.00%> (-1.99%) ⬇️
stdlib/Dates/src/io.jl 94.88% <0.00%> (-1.87%) ⬇️
stdlib/Artifacts/src/Artifacts.jl 87.40% <0.00%> (-1.19%) ⬇️
stdlib/LibGit2/src/callbacks.jl 79.74% <0.00%> (-0.85%) ⬇️
stdlib/LibGit2/src/types.jl 90.76% <0.00%> (-0.81%) ⬇️
stdlib/LibGit2/src/LibGit2.jl 85.80% <0.00%> (-0.67%) ⬇️
base/binaryplatforms.jl 82.85% <0.00%> (-0.58%) ⬇️
... and 11 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 63f5b8a...0e50d01. Read the comment docs.

@tkf tkf removed needs docs Documentation for this change is required needs news A NEWS entry is required for this change needs compat annotation Add !!! compat "Julia x.y" to the docstring labels Dec 9, 2021
@tkf tkf requested a review from vtjnash December 10, 2021 01:46
@tkf
Member Author

tkf commented Dec 10, 2021

It's good to go on my end.

@tkf
Member Author

tkf commented Dec 10, 2021

I ran additional benchmarks to help decide if we want this PR. A short summary is that workstealing works better than multiqueue at relatively low thread counts (< 10). On the other hand, the current multiqueue-based scheduler can outperform workstealing with many cores in some cases. This relative advantage of multiqueue is a bit surprising to me, since we are essentially using it only as a concurrent bag ATM and no special properties of multiqueue are exploited. But I haven't had time to investigate this further yet.

That said, I think it is already worth adding the workstealing scheduler because:

  1. For people using Julia on machines with relatively small core counts (e.g., laptops), this workstealing scheduler may already be useful.
  2. We can always switch schedulers by default at a certain core count if we consistently see the relative advantage of one scheduler across various machines.
  3. It helps in tuning both schedulers to have a reference point that can be easily compared against.

Benchmarks

fib(30)

This benchmark compares the run-time of pfib(30) which is taken from the test suite:

julia/test/threads_exec.jl

Lines 696 to 702 in b6bca19

function pfib(n::Int)
    if n <= 1
        return n
    end
    t = Threads.@spawn pfib(n-2)
    return pfib(n-1) + fetch(t)::Int
end

image

The above plot shows the speedup; i.e., the run-time with the multiqueue scheduler relative to the run-time using workstealing 1. The left plot is with JULIA_EXCLUSIVE = 0 and the right plot is with JULIA_EXCLUSIVE = 1. This shows the workstealing scheduler is more robust for fib(30) when the thread is not affinized. Unfortunately, with JULIA_EXCLUSIVE = 1 and many cores (roughly nthreads > 32), it seems multiqueue actually performs better.

producer-consumer

This is a benchmark with the following function

function producer_consumer(;
    niters = 1000,
    nreplicas = 10 * Threads.nthreads(),
)
    @sync begin
        for _ in 1:nreplicas
            ch = Channel{Int}()
            Threads.@spawn try
                for i in 1:niters
                    put!(ch, i)
                end
            finally
                close(ch)
            end
            Threads.@spawn foreach(identity, ch)
        end
    end
end

image

The result is similar to fib(30), but the multiqueue performs better even with JULIA_EXCLUSIVE = 0. Furthermore, workstealing loses the advantage at a lower thread count than fib(30).

This is rather puzzling since I thought the above program strongly favors workstealing. Since the channel ch is unbuffered, I'd expect the producer and consumer tasks to do a "ping-pong" using a single bottom slot of the deque. I haven't checked if the program is behaving the way I expected, though.

Machine information

Machine 1
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  2
Core(s) per socket:  32
Socket(s):           2
NUMA node(s):        2
Vendor ID:           AuthenticAMD
CPU family:          23
Model:               49
Model name:          AMD EPYC 7502 32-Core Processor
Stepping:            0
CPU MHz:             1498.762
CPU max MHz:         2500.0000
CPU min MHz:         1500.0000
BogoMIPS:            5000.16
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            16384K
NUMA node0 CPU(s):   0-31,64-95
NUMA node1 CPU(s):   32-63,96-127
Machine 2
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              40
On-line CPU(s) list: 0-39
Thread(s) per core:  2
Core(s) per socket:  10
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
Stepping:            4
CPU MHz:             800.550
CPU max MHz:         3000.0000
CPU min MHz:         800.0000
BogoMIPS:            4400.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            14080K
NUMA node0 CPU(s):   0-9,20-29
NUMA node1 CPU(s):   10-19,30-39

Footnotes

  1. This explanation was initially wrong. @jonas-schulze noticed it was the other way around https://github.com/JuliaLang/julia/pull/43366#issuecomment-990932813

@tkf
Member Author

tkf commented Dec 10, 2021

Actually, I don't think the current implementation is correct wrt GC (Edit: oops, I was commenting on my local version which did some tweaks that were GC-unsafe)

@tkf tkf marked this pull request as draft December 10, 2021 10:42
@jonas-schulze
Contributor

This shows the workstealing scheduler is more robust for fib(30) when the thread is not affinized. Unfortunately, with JULIA_EXCLUSIVE = 1 and many cores (roughly nthreads > 32), it seems multiqueue actually performs better.

Shouldn't it be the other way around? A ratio < 1 means the work-stealing scheduler takes less time to complete, i.e. work-stealing outperforms multiqueue. For the fib(30) benchmark this is only the case for JULIA_EXCLUSIVE = 1 and nthreads > 32, and for the producer-consumer benchmark for nthreads > 12.

@tkf
Member Author

tkf commented Dec 10, 2021

Ah, thanks for noticing it. The equation in the figure label was the other way around. The numbers and the conclusions are correct. I fixed it.

@tkf tkf marked this pull request as ready for review December 11, 2021 01:04
@tkf
Member Author

tkf commented Dec 11, 2021

With 531e479, which simplifies the task lock handling, workstealing now outperforms multiqueue more consistently

image

image

The only case where it looks like multiqueue is working better is fib(30) with JULIA_EXCLUSIVE = 1 (right) and nthreads > 32. But, if I use the median instead of the minimum for computing the run-time, I get this picture

image

It's noisier, but it looks like the trend is a speedup with workstealing. It may reflect that multiqueue is a more RNG-driven algorithm and can get lucky sometimes.

@tkf
Member Author

tkf commented Dec 13, 2021

As a more practical example, I ran some benchmarks of quicksort and mergesort implemented in ThreadsX for sorting rand(2^24).

image

You can see that workstealing improves the performance of quicksort up to 20%. There's no clear performance improvement in mergesort.

@DilumAluthge
Member

DilumAluthge commented Dec 13, 2021

It could be interesting to see if Gaius.jl gets faster or slower with this new scheduler.

cc: @MasonProtter @chriselrod

@tkf
Member Author

tkf commented Dec 13, 2021

Haha, I just commented on this at https://julialang.zulipchat.com/#narrow/stream/137791-general/topic/Gaius.2Ejl's.20divide.20strategy/near/264786592, but I couldn't find any clear change in the performance of Gaius.mul!:

image

image

I think it makes sense as you don't need load-balancing for GEMM (which is what workstealing is good at). OTOH, algorithms like quicksort have the task DAG determined by the input data and so load-balancing is beneficial.

That said, it's conceivable that improving task scheduling performance can improve Gaius.mul! by reducing destructive interference of L3 cache utilization (by making it reasonable to use smaller tasks). To really verify this effect, I think we need to re-tune the base case size of Gaius.mul! for each scheduler.

@JeffBezanson JeffBezanson added the performance Must go faster label Dec 15, 2021
_Atomic(int64_t) top;
_Atomic(int64_t) bottom;
};
uint8_t padding[JL_CACHE_BYTE_ALIGNMENT];
Member

this has no alignment whatsoever

Comment on lines +372 to +375
wsdeques = (wsdeque_t *)(((uintptr_t)calloc(1, sizeof(wsdeque_t) * jl_n_threads +
JL_CACHE_BYTE_ALIGNMENT - 1) +
JL_CACHE_BYTE_ALIGNMENT - 1) &
(-JL_CACHE_BYTE_ALIGNMENT));
Member

Suggested change
wsdeques = (wsdeque_t *)(((uintptr_t)calloc(1, sizeof(wsdeque_t) * jl_n_threads +
JL_CACHE_BYTE_ALIGNMENT - 1) +
JL_CACHE_BYTE_ALIGNMENT - 1) &
(-JL_CACHE_BYTE_ALIGNMENT));
wsdeques = (wsdeque_t *)malloc_cache_align(sizeof(wsdeque_t) * jl_n_threads);

{
int16_t tid = jl_threadid();
int64_t b = jl_atomic_load_relaxed(&wsdeques[tid].bottom);
int64_t t = jl_atomic_load_acquire(&wsdeques[tid].top);
Member

What is being load-acquired? There are no stores in wsdeque_steal

Suggested change
int64_t t = jl_atomic_load_acquire(&wsdeques[tid].top);
int64_t t = jl_atomic_load(&wsdeques[tid].top); // ensures that `tasks` is successfully stolen before we try to reuse the slot below

Member Author

I'm trying to digest this and I realized that I still don't get it. As we discussed, there is a CAS on .top in wsdeque_steal

julia/src/partr.c

Lines 316 to 318 in 0e50d01

jl_task_t *task = jl_atomic_load_relaxed(
(_Atomic(jl_task_t *) *)&wsdeques[tid].tasks[t % tasks_per_heap]);
if (!jl_atomic_cmpswap(&wsdeques[tid].top, &t, t + 1))

The loaded task only matters in the success path, where we have a seq_cst write, which subsumes a release write. So, we have:

  • a sequenced-before edge from the load of .tasks[t % tasks_per_heap] to the store on .top via the CAS in the next line (as they both happen in the same thread)
  • a synchronizes-with edge from the store on .top to the load of .top in wsdeque_push you quoted above

So they establish happens-before, and it looks like we know that the task is loaded by the time we load .top? IIRC, your concern was that loading .tasks[t % tasks_per_heap] and storing .top can be reordered, but doesn't release forbid that?

A store operation with this memory order performs the release operation: no reads or writes in the current thread can be reordered after this store.
https://en.cppreference.com/w/cpp/atomic/memory_order

Member

I am not sure. But edges are not transitive, unless all of them are seq-cst, IIUC.

Member

But yes, looks like we have a proper release/acquire pair here on top to ensure the ops on tasks are okay.

Member Author

@tkf tkf Jan 27, 2022

Hmm.... Lahav et al.'s definition of happens-before hb is the transitive closure of the union of sequenced-before sb and synchronizes-with sw

image

R^+ is the transitive closure of R:

image

But I don't know if that's a mismatch between their definition and the C11 semantics (though they use the same notions for discussing C11 so probably not) and/or some difference to the actual definition in the standard.

(_Atomic(jl_task_t *) *)&wsdeques[tid].tasks[t % tasks_per_heap]);
if (!jl_atomic_cmpswap(&wsdeques[tid].top, &t, t + 1))
return NULL;
return task;
Member

Suggested change
return task;
if (jl_set_task_tid(task))
return task;
wsdeque_push(task, 1); // FIXME: the sticky queue would be a better place for this, but we need to handle that inside Julia instead of handling these conditions here
return NULL;

Comment on lines +275 to +278
if (jl_atomic_load_acquire(&task->tid) != -1)
// If the `task` still hasn't finished the context switch at this point, abort push
// and put it in the sticky queue.
return -1;
Member

Suggested change
if (jl_atomic_load_acquire(&task->tid) != -1)
// If the `task` still hasn't finished the context switch at this point, abort push
// and put it in the sticky queue.
return -1;

The task->tid is fairly likely to be our own tid here. I think we can handle this when we pop the value (putting it into the sticky queue at pop time rather than at push time).

// If the `task` still hasn't finished the context switch at this point, abort push
// and put it in the sticky queue.
return -1;
jl_fence_release();
Member

The only store was the relaxed to tasks, so maybe just make that a release store?

(_Atomic(jl_task_t *) *)&wsdeques[tid].tasks[b % tasks_per_heap]);
if (size > 0)
return task;
if (!jl_atomic_cmpswap(&wsdeques[tid].top, &t, t + 1))
Member

this might need a fence, since it is trying to order the relaxed load above with the relaxed store below it?

return NULL;
jl_task_t *task = jl_atomic_load_relaxed(
(_Atomic(jl_task_t *) *)&wsdeques[tid].tasks[t % tasks_per_heap]);
if (!jl_atomic_cmpswap(&wsdeques[tid].top, &t, t + 1))
Member

possibly also need an explicit fence? I am not certain if seq-cst cmpswp is sufficient to enforce an order on relaxed ops nearby.

Member

I am surprised this doesn't try to steal size*fraction items, instead of a constant (1), but that does not need to be changed for this PR.

Member Author

There's Stealing Multi-Queue that does something similar to what you said https://arxiv.org/abs/2109.00657

But stealing like this works quite well for fork-join use cases in which you typically only enqueue/"materialize" O(log n) tasks in the queue while the task itself can "reveal" O(n) child tasks upon execution (n = e.g., length of array for a parallel map).

@vtjnash
Member

vtjnash commented Mar 15, 2022

bump?

@vchuravy
Member

Since Taka is no longer active, I rejuvenated this work in #50221

@vchuravy vchuravy closed this Jun 20, 2023
Labels
multithreading Base.Threads and related functionality performance Must go faster