RFC: Work-stealing scheduler #43366
Conversation
src/partr.c (outdated)
jl_task_t *task = jl_atomic_load_relaxed(
    (_Atomic(jl_task_t **))&wsdeques[tid].tasks[t % tasks_per_heap]);
if (!atomic_compare_exchange_strong_explicit(
        &wsdeques[tid].top, &t, t + 1, memory_order_seq_cst, memory_order_relaxed))
Doesn't this shrink the workqueue from the top? And thus every time we steal we lose space on top?
It uses the `.tasks` buffer as a circular array, so it's OK that `.top` is never decremented. (In fact, that's the main cleverness of the Chase-Lev deque.)
Is this a known error, or could it be related to this PR?
(... many similar lines ...)
Codecov Report
```diff
@@            Coverage Diff             @@
##           master   #43366      +/-   ##
==========================================
- Coverage   89.55%   89.52%   -0.03%
==========================================
  Files         342      342
  Lines       80387    80387
==========================================
- Hits        71992    71970      -22
- Misses       8395     8417      +22
```

Continue to review the full report at Codecov.
This reverts commit f2b47d2.
It's good to go on my end.
I ran additional benchmarks to help decide if we want this PR. A short summary is that workstealing works better than multiqueue at relatively low thread counts (< 10). On the other hand, the current multiqueue-based scheduler can outperform workstealing with many cores in some cases. This relative advantage of the multiqueue is a bit surprising to me, since we are essentially using it only as a concurrent bag ATM and no special properties of the multiqueue are exploited. But I haven't had time to investigate this further yet. That said, I think it is already worth adding the workstealing scheduler because:
Benchmarks

fib(30)

This benchmark compares the run-time of `fib` (lines 696 to 702 in b6bca19).

The above plot shows the speedup; i.e., the run-time with the multiqueue scheduler relative to the run-time using workstealing [1]. The left plot is with ...

producer-consumer

This is a benchmark with the following function:

```julia
function producer_consumer(;
    niters = 1000,
    nreplicas = 10 * Threads.nthreads(),
)
    @sync begin
        for _ in 1:nreplicas
            ch = Channel{Int}()
            Threads.@spawn try
                for i in 1:niters
                    put!(ch, i)
                end
            finally
                close(ch)
            end
            Threads.@spawn foreach(identity, ch)
        end
    end
end
```

The result is similar to fib(30), but the multiqueue performs better even with ... This is rather puzzling, since I thought the above program strongly favors workstealing. Since the channel ...

Machine information

Machine 1

Machine 2
Footnotes
Shouldn't it be the other way around? A ratio < 1 means the workstealing scheduler takes less time to complete, i.e., workstealing outperforms multiqueue. For the fib(30) benchmark this is only the case for ...
Ah, thanks for noticing it. The equation in the figure label was the other way around. The numbers and the conclusions are correct. I fixed it.
With 531e479, which simplifies the task lock handling, workstealing now outperforms multiqueue more consistently. The only case where it looks like multiqueue is working better is fib(30) with ... It's noisier, but it looks like the trend is a speedup with workstealing. It may reflect that multiqueue is a more RNG-driven algorithm and it can get lucky sometimes.
This reverts commit f2b47d2.
As a more practical example, I ran some benchmarks of quicksort and mergesort implemented in ThreadsX for sorting ... You can see that workstealing improves the performance of quicksort by up to 20%. There's no clear performance improvement in mergesort.
It could be interesting to see if Gaius.jl gets faster or slower with this new scheduler.
Haha, I just commented on this at https://julialang.zulipchat.com/#narrow/stream/137791-general/topic/Gaius.2Ejl's.20divide.20strategy/near/264786592, but I couldn't find any clear change in the performance of Gaius.jl. I think it makes sense, since you don't need load-balancing for GEMM (and load-balancing is what workstealing is good at). OTOH, algorithms like quicksort have the task DAG determined by the input data, and so load-balancing is beneficial. That said, it's conceivable that improving task scheduling performance can improve it.
    _Atomic(int64_t) top;
    _Atomic(int64_t) bottom;
};
uint8_t padding[JL_CACHE_BYTE_ALIGNMENT];
this has no alignment whatsoever
wsdeques = (wsdeque_t *)(((uintptr_t)calloc(1, sizeof(wsdeque_t) * jl_n_threads +
                                                JL_CACHE_BYTE_ALIGNMENT - 1) +
                           JL_CACHE_BYTE_ALIGNMENT - 1) &
                          (-JL_CACHE_BYTE_ALIGNMENT));
Suggested change:

    wsdeques = (wsdeque_t *)malloc_cache_align(sizeof(wsdeque_t) * jl_n_threads);
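For context, here is a minimal sketch of the two allocation strategies being contrasted (illustrative only; `CACHE_ALIGN` stands in for `JL_CACHE_BYTE_ALIGNMENT`, and the helper names are made up rather than being whatever `malloc_cache_align` actually does):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CACHE_ALIGN 64  /* stand-in for JL_CACHE_BYTE_ALIGNMENT (a power of two) */

/* Manual trick used in the original line: over-allocate by (align - 1) bytes,
 * then round the pointer up to the next multiple of the alignment. */
static void *calloc_cache_align_manual(size_t nbytes)
{
    uintptr_t raw = (uintptr_t)calloc(1, nbytes + CACHE_ALIGN - 1);
    if (!raw)
        return NULL;
    return (void *)((raw + CACHE_ALIGN - 1) & ~(uintptr_t)(CACHE_ALIGN - 1));
}

/* C11 alternative: let the allocator do the alignment. aligned_alloc requires
 * the size to be a multiple of the alignment, hence the rounding. */
static void *calloc_cache_align_c11(size_t nbytes)
{
    size_t rounded = (nbytes + CACHE_ALIGN - 1) & ~(size_t)(CACHE_ALIGN - 1);
    void *p = aligned_alloc(CACHE_ALIGN, rounded);
    if (p)
        memset(p, 0, rounded);
    return p;
}
```

One caveat of the manual version is that the rounded pointer can no longer be passed to `free()`; that may be acceptable for a once-per-process allocation, but it is one argument for using a dedicated aligned-allocation helper.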
{
    int16_t tid = jl_threadid();
    int64_t b = jl_atomic_load_relaxed(&wsdeques[tid].bottom);
    int64_t t = jl_atomic_load_acquire(&wsdeques[tid].top);
What is being load-acquired? There are no stores in wsdeque_steal
Suggested change:

    int64_t t = jl_atomic_load(&wsdeques[tid].top); // ensures that `tasks` is successfully stolen before we try to reuse the slot below
I'm trying to digest this and I realized that I still don't get it. As we discussed, there is a CAS on `.top` in `wsdeque_steal` (lines 316 to 318 in 0e50d01):

    jl_task_t *task = jl_atomic_load_relaxed(
        (_Atomic(jl_task_t *) *)&wsdeques[tid].tasks[t % tasks_per_heap]);
    if (!jl_atomic_cmpswap(&wsdeques[tid].top, &t, t + 1))
The loaded task only matters in the success path, where we have a seq_cst write that subsumes a release write. So, we have:

- a sequenced-before edge from the load of `.tasks[t % tasks_per_heap]` to the store to `.top` via the CAS on the next line (as they both happen in the same thread)
- a synchronizes-with edge from the store to `.top` to the load of `.top` in `wsdeque_push` you quoted above

So they establish happens-before, and it looks like we know that the task is loaded by the time we load `.top`? IIRC, your concern was that loading `.tasks[t % tasks_per_heap]` and storing `.top` can be reordered, but doesn't release forbid that?

> A store operation with this memory order performs the release operation: no reads or writes in the current thread can be reordered after this store.

https://en.cppreference.com/w/cpp/atomic/memory_order
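To make the claimed edges concrete, here is a toy sketch of the same pattern (illustrative only, not the PR's code): the thief's relaxed load of the slot is sequenced-before its CAS on `top` (whose success ordering is at least release), and the owner's acquire load of `top` synchronizes-with that CAS, so once the owner observes the incremented `top`, the slot has already been read and can be reused.

```c
#include <stdatomic.h>

#define SLOTS 8

static _Atomic(void *) slots[SLOTS];   /* plays the role of wsdeques[tid].tasks */
static _Atomic(long) top = 0;          /* plays the role of wsdeques[tid].top */

/* Thief: read the slot, then publish the fact that it was taken by advancing
 * `top` with a CAS whose success ordering includes release. */
static void *thief_like(void)
{
    long t = atomic_load_explicit(&top, memory_order_acquire);
    void *task = atomic_load_explicit(&slots[t % SLOTS], memory_order_relaxed);
    if (!atomic_compare_exchange_strong_explicit(
            &top, &t, t + 1,
            memory_order_acq_rel,              /* success: at least release */
            memory_order_relaxed))
        return 0;                              /* lost the race */
    return task;
}

/* Owner: once an acquire load of `top` observes a value past `old_t`, the
 * thief's read of slots[old_t % SLOTS] happened-before this point, so the
 * owner may safely overwrite that slot. */
static int owner_can_reuse(long old_t)
{
    return atomic_load_explicit(&top, memory_order_acquire) > old_t;
}
```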
I am not sure. But edges are not transitive, unless all of them are seq-cst, IIUC.
But yes, looks like we have a proper release/acquire pair here on top to ensure the ops on tasks are okay.
Hmm.... Lahav et al.'s definition of happens-before `hb` is the transitive closure of the union of sequenced-before `sb` and synchronizes-with `sw` (`R^+` is the transitive closure of `R`):
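In symbols (my reconstruction from the description above):

$$\mathit{hb} \triangleq (\mathit{sb} \cup \mathit{sw})^{+}$$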
But I don't know if that's a mismatch between their definition and the C11 semantics (though they use the same notions for discussing C11 so probably not) and/or some difference to the actual definition in the standard.
    (_Atomic(jl_task_t *) *)&wsdeques[tid].tasks[t % tasks_per_heap]);
if (!jl_atomic_cmpswap(&wsdeques[tid].top, &t, t + 1))
    return NULL;
return task;
Suggested change:

    if (jl_set_task_tid(task))
        return task;
    wsdequeue_push(task, 1); // FIXME: the sticky queue would be a better place for this, but we need to handle that inside Julia instead of handling these conditions here
    return NULL;
if (jl_atomic_load_acquire(&task->tid) != -1)
    // If the `task` still hasn't finished the context switch at this point, abort push
    // and put it in the sticky queue.
    return -1;
Suggested change: remove this check (the four lines above).
The `task->tid` is fairly likely to be our own `tid` here. I think we can handle this when we pop the value (putting it into the sticky queue when we pop it, rather than when we pushed it).
    // If the `task` still hasn't finished the context switch at this point, abort push
    // and put it in the sticky queue.
    return -1;
jl_fence_release();
The only store was the relaxed one to `tasks`, so maybe just make that a release store?
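For illustration, a minimal sketch of the two alternatives under discussion (toy code, not the PR's; the globals and names here are made up). Which acquire operation this release needs to pair with on the thief's side is exactly the open question in this thread.

```c
#include <stdatomic.h>

#define SLOTS 8

static _Atomic(void *) tasks[SLOTS];   /* stand-in for wsdeques[tid].tasks */

/* What the comment says the PR currently does: a standalone release fence
 * followed by a relaxed store (fence + relaxed store is at least as strong
 * as a single release store). */
static void publish_with_fence(void *task, long b)
{
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&tasks[b % SLOTS], task, memory_order_relaxed);
}

/* The suggestion: fold the ordering into the store itself. This only
 * synchronizes with acquire operations that read this particular store,
 * which is sufficient if the thief acquire-loads (or seq_cst-loads) the
 * same slot. */
static void publish_with_release_store(void *task, long b)
{
    atomic_store_explicit(&tasks[b % SLOTS], task, memory_order_release);
}
```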
    (_Atomic(jl_task_t *) *)&wsdeques[tid].tasks[b % tasks_per_heap]);
if (size > 0)
    return task;
if (!jl_atomic_cmpswap(&wsdeques[tid].top, &t, t + 1))
this might need a fence, since it is trying to order the relaxed load above with the relaxed store below it?
    return NULL;
jl_task_t *task = jl_atomic_load_relaxed(
    (_Atomic(jl_task_t *) *)&wsdeques[tid].tasks[t % tasks_per_heap]);
if (!jl_atomic_cmpswap(&wsdeques[tid].top, &t, t + 1))
possibly also need an explicit fence? I am not certain if a seq-cst cmpswap is sufficient to enforce an order on relaxed ops nearby.
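For concreteness, this is roughly what adding such an explicit fence could look like (illustrative toy code only; whether the fence is actually needed is the open question raised in these two comments):

```c
#include <stdatomic.h>

#define SLOTS 8

static _Atomic(void *) tasks[SLOTS];
static _Atomic(long) top;

/* Sketch of a steal path with an explicit seq_cst fence inserted between the
 * relaxed load of the slot and the CAS on `top`, so the ordering no longer
 * relies on the seq_cst CAS alone. */
static void *steal_with_explicit_fence(long t)
{
    void *task = atomic_load_explicit(&tasks[t % SLOTS], memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);      /* explicit ordering point */
    if (!atomic_compare_exchange_strong(&top, &t, t + 1))
        return 0;                                   /* lost the race */
    return task;
}
```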
I am surprised this doesn't try to steal size*fraction items, instead of a constant (1), but that does not need to be changed for this PR.
There's the Stealing Multi-Queue, which does something similar to what you said: https://arxiv.org/abs/2109.00657

But stealing a single task like this works quite well for fork-join use cases, in which you typically only enqueue/"materialize" O(log n) tasks in the queue, while a task itself can "reveal" O(n) child tasks upon execution (n = e.g., the length of the array for a parallel map).
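For reference, here is a rough sketch of the steal-a-fraction idea mentioned a few comments above (toy code, not the PR's; real implementations such as the one in the linked paper batch this far more carefully):

```c
#include <stdatomic.h>

#define SLOTS 1024

static _Atomic(void *) tasks[SLOTS];
static _Atomic(long) top, bottom;

/* Steal roughly half of the currently visible items by repeatedly advancing
 * `top` with CAS. Stops early if another thief (or the owner) races with us. */
static long steal_half(void *out[], long max_out)
{
    long t = atomic_load_explicit(&top, memory_order_acquire);
    long b = atomic_load_explicit(&bottom, memory_order_acquire);
    long avail = b - t;
    long want = avail - avail / 2;                  /* ceil(avail / 2) */
    if (want > max_out)
        want = max_out;
    long got = 0;
    while (got < want) {
        void *task = atomic_load_explicit(&tasks[t % SLOTS], memory_order_relaxed);
        if (!atomic_compare_exchange_strong(&top, &t, t + 1))
            break;                                  /* lost a race; keep what we have */
        out[got++] = task;
        t += 1;
    }
    return got;
}
```

The trade-off noted above still applies: for fork-join workloads the queue rarely holds more than O(log n) tasks, so stealing a fraction instead of a single item buys relatively little.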
bump?
Since Taka is no longer active, I rejuvenated this work in #50221.
This PR implements a work-stealing scheduler for Julia's Parallel Task Runtime. It is based on the lock-free [1] work-stealing deque by Chase and Lev (2005). The scheduler can be specified upon startup using the environment variable `$JULIA_THREAD_SCHEDULER`. Starting Julia with `JULIA_THREAD_SCHEDULER=workstealing julia` will use the work-stealing scheduler; otherwise, it uses the default multi-queue [2].

A quick benchmark for computing `fib(30)` using tasks shows that this implementation can be up to 2x faster than the current scheduler, especially at relatively low thread counts. I think there are a few more optimization opportunities.

@vtjnash From our discussion in the threading BoF a couple of months ago, I think you were in favor of adding alternative schedulers to Julia? What do you think about adding an alternative scheduler like this?

A bit more context: I'm playing with the depth-first scheduler now and I need some well-described scheduler, like work-stealing, as a baseline. I then thought to package it up in a way useful for everyone.

Before merge:

- `JULIA_THREAD_SCHEDULER` in the documentation

Footnotes

1. Due to the way "task locking" is handled ATM, I don't think the current enqueue and dequeue implementation is lock-free. I haven't found a better way to do this so far. ↩
2. For testing purposes, this branch currently defaults to work-stealing. This should be reverted before merging. ↩