Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add threadpool support to runtime #42302

Merged
merged 2 commits into from
Apr 26, 2022
Merged

Conversation

jpsamaroo
Copy link
Member

@jpsamaroo jpsamaroo commented Sep 18, 2021

NOTE: New PR description is below at #42302 (comment)

@tkf
Copy link
Member

tkf commented Sep 18, 2021

If we go this way, I think we need to remove Threads.resize_nthreads! (added as a todo). Luckily, it looks like it's only used in LinearAlgebra now (it's great that RNG is now task local).

@jpsamaroo
Copy link
Member Author

nthreads() never changes size for a given threadpool, so it's probably fine as-is?

@jpsamaroo
Copy link
Member Author

jpsamaroo commented Sep 18, 2021

What's up with these Floating point exceptions on BK? I'm not getting them locally. Whoops, I didn't initialize the barrier large enough to include the main thread.

@tkf
Copy link
Member

tkf commented Sep 18, 2021

LinearAlgebra.__init__ allocates nthreads buffers

function __init__()
try

...

Threads.resize_nthreads!(Abuf)
Threads.resize_nthreads!(Bbuf)
Threads.resize_nthreads!(Cbuf)

so, I think the PR as-is breaks * called on non-primary threadpool (provided that __init__ is called on the primary threadpool).

This is also the "intended" usage pattern of resize_nthreads!

julia/base/threads.jl

Lines 16 to 23 in e50603f

resize_nthreads!(A, copyvalue=A[1])
Resize the array `A` to length [`nthreads()`](@ref). Any new
elements that are allocated are initialized to `deepcopy(copyvalue)`,
where `copyvalue` defaults to `A[1]`.
This is typically used to allocate per-thread variables, and
should be called in `__init__` if `A` is a global constant.

That said, the use of Abuf etc.is questionable anyway

# FIXME: This code is completely invalid!!!
Atile = unsafe_wrap(Array, convert(Ptr{T}, pointer(Abuf[Threads.threadid()])), sz)
Btile = unsafe_wrap(Array, convert(Ptr{S}, pointer(Bbuf[Threads.threadid()])), sz)

Maybe it's better to just get rid of the buffer pooling until we have a proper solution (e.g., "singleton" initialization API using the double checked locking pattern).

@tkf tkf added the multithreading Base.Threads and related functionality label Sep 18, 2021
Comment on lines 39 to 41
function spawn_threadpool(n::Int)
wq = [Base.StickyWorkqueue() for _ in 1:n]
push!(Base.AllWorkqueues, wq)
Copy link
Member

@tkf tkf Sep 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This push! is racy since it could resize the internal buffer of AllWorkqueues. We'd need a lock on AllWorkqueues or a proper concurrent collection or just a CAS loop on a ref.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, other things in jl_start_threads_ (called from here) like jl_threadpools++ are also racy. I guess we also need to update jl_all_tls_states here which is tricky (since GC can start anytime).

Actually, since the nthreads for this new pool needs to match in the Julia and C side, protecting AllWorkqueues is not enough. Maybe it's easier to just use a lock here and making sure to acquire it before every time jl_start_threads_dedicated is called. But synchronization of jl_all_tls_states still needs something else...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what we can do on the runtime side is to only (atomically) increment jl_threadpools once all the runtime and Julia pieces are fully setup, and have all arrays be fully sized (to the max number of threads/threadpools we'll allow, which can be a build-time parameter) at init time. I also agree that jl_start_threads_dedicated can take a lock to ensure serialization.

Copy link
Member

@tkf tkf Sep 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the fun(?) part is jl_all_tls_states. I'd imagine something like this (in C)

new = copy(jl_all_tls_states)
new_ptls = [create_ptls() for _ in 1:n]
append!(new, new_ptls)

# No need for CAS since `jl_start_threads_dedicated` (or its caller)
# acquires the lock
@atomic jl_all_tls_states = new

# *Then* make the new ptls available atomically by setting `jl_n_threads`
@atomic jl_n_threads = length(new)

# Here, we still can't free `old` because there can be
# `jl_gc_wait_for_the_world` etc.

GC.gc(false)
# Once we run GC and stooped all existing thread, we can be sure there
# are no stale accesses to `jl_all_tls_states`
free(old)

...or maybe just leak old jl_all_tls_states to simply things.

Alternatively, we can use a linked list (of blocks) for jl_all_tls_states (but it maybe slows things down.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tkf now that threadpool spawning is done only at startup, would you mind re-reviewing? I'm not sure what kinds of atomics t->threadpool = xyz needs (in the runtime).

base/threadingconstructs.jl Outdated Show resolved Hide resolved
Copy link
Member

@tkf tkf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I already raised concern in the "social" aspect in #41586 (comment) but from the technical standing point, the direction of this PR looks good to me.

I think the tricky parts that are not implemented yet are how to handle concurrent calls to jl_start_threads_dedicated and also how to make it correct with respect to concurrent calls to other internals like jl_gc_wait_for_the_world.

@tkf
Copy link
Member

tkf commented Sep 18, 2021

Also, maybe we need to make sure to call __init__ always on the primary pool? A package Foo can spawn things in Foo.__init__. But it's probably not appropriate to spawn them in a non-primary pool even when @eval using Foo is invoked in the non-primary pool?

@DilumAluthge
Copy link
Member

cc: @chriselrod

@chriselrod
Copy link
Contributor

cc: @chriselrod

If I understand correctly, this is what ThreadingUtilities/PolyesterWeave want.
ThreadingUtilities will be able to create its own independent thread pool.
Other threaded code will not start tasks on any threads in this pool, removing the concern for potential conflicts.

But I imagine over prescribing threads will still be a concern?
You can of course also start Julia with julia -t1000 and then Threads.@threads will already over prescribe, but if a bunch of libraries start their own pools, it may be easier to accidentally end up with more threads than Sys.CPU_THREADS.

@DilumAluthge
Copy link
Member

but if a bunch of libraries start their own pools, it may be easier to accidentally end up with more threads than Sys.CPU_THREADS

This is a major concern for me as well.

@jpsamaroo
Copy link
Member Author

My reason for implementing this feature in the runtime is focused on I/O, not on computation. Only libraries which need these isolated threadpools to prevent deadlocks (like CUDA.jl/AMDGPU.jl for hostcalls, or Gtk.jl for its event loop) should use this feature by default; other libraries could allow this as an opt-in feature, but I would recommend against it.

Another feature that we will need that should better suit alternative task scheduler implementations would be the ability to forward task scheduling decisions to a library instead of the built-in PARTR algorithm.

@tkf
Copy link
Member

tkf commented Sep 19, 2021

Yes, I want to echo Julian and this feature should only be used for latency-critical (I/O) parts of the programs, and not throughput-oriented ones.

I cannot emphasize enough that this PR relies heavily on the cooperation from the community to not use this feature as much as possible, especially compute-heavy ones. If every package uses its own thread pool, we'll lose composability.

If a package uses its own thread pool, sure, it makes its micro-benchmark look awesome. But it does not mean it performs well when you combine multiple packages together. (But please don't take it as a direct criticism to Polyester.jl et al. I think it's great that we are working on multiple fronts to improve parallelism in Julia. But I think it'd be better to try converging to single resource management eventually so that multiple components in the ecosystem work well together.)

@chriselrod
Copy link
Contributor

chriselrod commented Sep 20, 2021

Yes, I want to echo Julian and this feature should only be used for latency-critical (I/O) parts of the programs, and not throughput-oriented ones.

I cannot emphasize enough that this PR relies heavily on the cooperation from the community to not use this feature as much as possible, especially compute-heavy ones. If every package uses its own thread pool, we'll lose composability.

If a package uses its own thread pool, sure, it makes its micro-benchmark look awesome. But it does not mean it performs well when you combine multiple packages together. (But please don't take it as a direct criticism to Polyester.jl et al. I think it's great that we are working on multiple fronts to improve parallelism in Julia. But I think it'd be better to try converging to single resource management eventually so that multiple components in the ecosystem work well together.)

I agree that "working well together" is preferable, but an independent thread pool (like BLAS currently uses) is preferable to potentially causing problems, and will be a much simpler adjustment to make.

On this end:

Another feature that we will need that should better suit alternative task scheduler implementations would be the ability to forward task scheduling decisions to a library instead of the built-in PARTR algorithm.

I'd be happy to have some API for communicating with the PARTR scheduler. A fairly good and perhaps simple soltuion would be to have PolyesterWeave simply avoid launching threads at all if there are multiple tasks scheduled by PARTR, with the assumption here being that there's a courser grained level of parallelism that will benefit more.
It's not perfect, as it could lead to under utilization if the parallelism is limited at that scope.

@tkf
Copy link
Member

tkf commented Sep 20, 2021

I think the philosophy behind Julia's parallelism design is that the BLAS model of managing computing resources at the library level doesn't work well when embedded in a large parallel program. I think beating BLAS using a composable way would be much more exciting than beating BLAS using the BLAS way. (Even though the latter is an excellent achievement on its own.)

Julia's parallelism is meant to go beyond per-library static scheduling.

@vchuravy
Copy link
Member

There also has been the goal from the beginning to be able to replace BLAS thread pool and make it use the Julia tasks (as FFTW does) to enable composability and to avoid oversubscription.

@KristofferC
Copy link
Member

KristofferC commented Sep 20, 2021

W.r.t BLAS: I thought the goal was to allow OpenBLAS to use our scheduler. Basically, we should push for libraries to have an API that allows passing them a scheduler, not passing over scheduling decisions to them.

Edit: Jinx.

@chriselrod
Copy link
Contributor

chriselrod commented Sep 20, 2021

I think the philosophy behind Julia's parallelism design is that the BLAS model of managing computing resources at the library level doesn't work well when embedded in a large parallel program. I think beating BLAS using a composable way would be much more exciting than beating BLAS using the BLAS way. (Even though the latter is an excellent achievement on its own.)

Julia's parallelism is meant to go beyond per-library static scheduling.

There also has been the goal from the beginning to be able to replace BLAS thread pool and make it use the Julia tasks (as FFTW does) to enable composability and to avoid oversubscription.

W.r.t BLAS: I thought the goal was to allow OpenBLAS to use our scheduler. Basically, we should push for libraries to have an API that allows passing them a scheduler, not passing over scheduling decisions to them.

OpenMathLib/OpenBLAS#2255

Anyway, OpenBLAS's single and especially multithreaded performance are pretty bad, with a few exceptions.
So libraries interested in maximizing performance, like DifferentialEquations.jl, will want to continue to use faster Julia implementations like https://github.com/YingboMa/RecursiveFactorization.jl
Also, I would generally prefer LAPACK routines to be single threaded when nested in multithreaded programs / follow a breadth-first scheduling algorithm, as these have complicated dependencies such that the returns on number of threads is bad -- much worse than can easily be achieved from courser parallelism -- except at asymptotic sizes.

OpenBLAS's LU factorization is actively slower multithreaded than single threaded unless the matrices are very large on recent CPUs with lots of threads. And that is the case with OpenBLAS's current threading, which is lower overhead than Julia's.
RecursiveFactorization.jl is able to actually achieve some speedup from multiple threads instead of slowing down (on top of being faster single threaded than OpenBLAS), but without an algorithmic change that speed up is going to be small enough to not be worthwhile if those threads can be doing anything else.
Even if we achieve MKL level benefits from multithreading via an algorithmic change, the threads would probably be better used on other "outer" tasks, so that disabling threads is probably a good heuristic solution for algorithms like this in general.

Thus, IMO, having BLAS compose with Julia's threading is a worse solution than BLAS.set_num_threads(1).
A better solution is to automatically use a single thread then other mutltihreading is engaged.

@jpsamaroo
Copy link
Member Author

Regardless of whether a given BLAS routine is maximally efficient in a multithreaded setup, trying to execute other code concurrently on the main threadpool with an external threadpool will harm the efficiency of both pools by cache thrashing. This is something that PARTR has the ability to avoid by co-scheduling related tasks together (such as the tasks comprising a BLAS call) and excluding unrelated tasks temporarily. If we're seeing evidence that PARTR isn't doing that job effectively for Julia code, we should fix PARTR, instead of giving up and using threadpools to make just one or two libraries work most efficiently in isolation.

@chriselrod
Copy link
Contributor

chriselrod commented Sep 20, 2021

Regardless of whether a given BLAS routine is maximally efficient in a multithreaded setup, trying to execute other code concurrently on the main threadpool with an external threadpool will harm the efficiency of both pools by cache thrashing. This is something that PARTR has the ability to avoid by co-scheduling related tasks together (such as the tasks comprising a BLAS call) and excluding unrelated tasks temporarily.

As I said before, because of the lack of efficiency of many BLAS/LAPACK routines, BLAS.set_num_threads(1) to avoid thrashing will generally be faster than "co-scheduling": the non-BLAS threading will be more efficient, and thus the parallelism at that level should be maximized and the threading at the BLAS level minimized to optimize performance.

This is why my nearly ideal solution is to be able to query the PARTR scheduler if it's currently scheduling multiple tasks. If so, PolyesterWeave could avoid allocating any threads at all, allowing it to be multithreaded in otherwise serial execution, but single threaded in otherwise parallel execution -- following the assumption that the threading on the other level will be more efficient.

If we're seeing evidence that PARTR isn't doing that job effectively for Julia code, we should fix PARTR, instead of giving up and using threadpools to make just one or two libraries work most efficiently in isolation.

How many years will that take?

However, I'm also not a fan of depth-first scheduling anyway and have low expectations.

Between expecting bad results in a matter of years and satisfactory results that also give me control over performance in a time-frame of weeks at most (or immediately, given that I've already written ThreadingUtilties, etc), my choices are obvious.

To be clear:
"In production", I'm use ThreadsX and Threads.@threads more often than I use Polyester or @tturbo.
Because I do have control of the full code stack, I "manually" implement what I want here by simply not using threading at all.
That code also always has only 1 layer of threading.

But it would help libraries like RecursiveFactorization.jl, important for solving stiff ODEs, to have this autoswitching ability.

@jpsamaroo
Copy link
Member Author

This is why my nearly ideal solution is to be able to query the PARTR scheduler if it's currently scheduling multiple tasks.

Ok, we can probably make that happen, by exposing a C function that returns the size of the multiq heaps, or you might just be able to check Base.Workqueues. If you had that available, would you still plan to use threadpools?

How many years will that take?

Probably on the order of a month or two, if we have a good MWE.

However, I'm also not a fan of depth-first scheduling anyway and have low expectations.

Care to explain why?

@chriselrod
Copy link
Contributor

Ok, we can probably make that happen, by exposing a C function that returns the size of the multiq heaps, or you might just be able to check Base.Workqueues.

That'd be great.
How can I check Base.Workques? length doesn't seem right:

julia> function workqueuelen(iter)
           res = Matrix{Int}(undef, Threads.nthreads()+1, iter)
           Threads.@threads for i  1:iter
               for j  1:Threads.nthreads()
                   res[j,i] = length(Base.Workqueues[j])
               end
               res[Threads.nthreads()+1,i] = length(Base.Workqueue)
           end
           res
       end
workqueuelen (generic function with 2 methods)

julia> wqls2 = workqueuelen(1_000_000);

julia> count(!iszero, wqls2)
575323

julia> sum(wqls2)
575323

julia> sum(wqls2) / length(wqls2)
0.019838724137931033

I.e., it is mostly 0 here. I'd have to look more at the implementation details to find out what to look for/what it's doing. I.e., whether a task pushed into the queue is popped off as soon as it is scheduled or once it is finished. My guess is the former.

If you had that available, would you still plan to use threadpools?

Yeah, if I had that available I could use it to avoid conflicts instead of threadpools and make everyone in this thread happy. ;)

@jpsamaroo
Copy link
Member Author

I think that's somewhat expected, since tasks can also be placed in the PARTR multiq for DFS.

Yeah, if I had that available I could use it to avoid conflicts instead of threadpools and make everyone in this thread happy. ;)

Cool, then I'll put together a PR in the next week or so to expose multiq sizes.

@tkf
Copy link
Member

tkf commented Sep 20, 2021

Querying the scheduler state is not a great API. It's racy. Also, in theory, it can increase cache traffic and move otherwise thread-local states out of exclusive mode (although the work queue we currently use probably isn't designed to get beneficial effects from this). More importantly, the user is now responsible for making sure the result is identical no matter what the scheduler state is. Using such an API is as hard as using return_type. It's doable but it needs a big warning saying that the user needs to make sure the result is identical bit-by-bit. (Or a library function using it can declare that the behavior is non-deterministic. But it's really hard to program against such a library.)

Querying the global state is not the best mindset in concurrent programming, IMHO. Each component should act with mostly local knowledge with minimal synchronizations. The point is designing the system that self-organizes into efficient machinery when things are put together.

@tknopp
Copy link
Contributor

tknopp commented Sep 22, 2021

will this PR also allow for changing the amount of threads in the first thread pool? This would be a great feature and also would prevent that people dynamically create new thread pools just because the size of the first one cannot be changed.

You mentioned Gtk.jl as a use case. Would be very interesting if this could help resolving this issue: JuliaGraphics/Gtk.jl#503

@jpsamaroo
Copy link
Member Author

will this PR also allow for changing the amount of threads in the first thread pool? This would be a great feature and also would prevent that people dynamically create new thread pools just because the size of the first one cannot be changed.

No, that's intentionally a non-goal because it has the potential to break existing code that relies on Threads.nthreads() remaining constant (which this PR guarantees for code running within only a single threadpool). I would hope that users never directly use this feature, and only a few libraries use it very sparingly. The main threadpool is what users should be using, and they're still expected to size it appropriately ahead of time.

You mentioned Gtk.jl as a use case. Would be very interesting if this could help resolving this issue: JuliaGraphics/Gtk.jl#503

Yeah, this should be solvable by placing the Gtk event loop into its own threadpool, where it can execute for 100% of its thread's time slice if desired.

@kpamnany kpamnany changed the title [RFC] Add threadpool support to runtime Add threadpool support to runtime Apr 25, 2022
@kpamnany kpamnany force-pushed the jps/moar-threads branch 2 times, most recently from 11637f1 to f58cf21 Compare April 25, 2022 16:11
Adds support for Julia to be started with `--threads=auto|N[,M]` where
`N` specifies the number of threads in the default threadpool and `M`,
if provided, specifies the number of threads in the new interactive
threadpool.

Adds an optional first parameter to `Threads.@spawn`:
`[:default|:interactive]`. If `:interactive` is specified, the task will
be run by thread(s) in the interactive threadpool only (if there is
one).

Co-authored-by: K Pamnany <[email protected]>
Comment on lines +77 to 82
nt = UInt32(Threads._nthreads_in_pool(tpid))
if heap_c * nt <= heap_p
return heap_p
end

heap_p += heap_c * nt
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the wait-free property of multiqueue, we need nt = nthreads() since all worker threads from all worker pools are possible enqueuers. We probably don't need to worry too much about this for now since most of enqueues happen inside each pool and a theoretical guarantee does not imply practical performance in general. But I think it's something to watch for when using it in practice especially when there are intensive inter-pool synchronizations. For example, a rather very conservative but wasteful approach is heap_p = max(heap_c * _nthreads_in_pool(tpid), nthreads() + 1).

@vtjnash Why is += used in heap_p += heap_c * nt instead of = by the way? We only need heap_p / nt > 1 here, right? The heap vector is either empy or fully initialized so I suppose it doesn't matter ATM though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good point. In particular, to retain interactivity, it should be a common pattern for an interactive task to enqueue work to the default threadpool. However, as you say, it may not be a problem in practice, and in any case we have many problems with the use of multi-queues in general; IMO the way forward is an evolution of your #43366.

@kpamnany kpamnany merged commit f410c62 into JuliaLang:master Apr 26, 2022
@kpamnany kpamnany deleted the jps/moar-threads branch April 26, 2022 20:47
@ViralBShah ViralBShah added the needs news A NEWS entry is required for this change label Apr 26, 2022
@tkf
Copy link
Member

tkf commented Apr 27, 2022

FWIW I find it unclear that #42302 (comment) is addressed before merge

@kpamnany
Copy link
Contributor

kpamnany commented Apr 27, 2022

FWIW I find it unclear that #42302 (comment) is addressed before merge

Ah. @jpsamaroo and @JeffBezanson and I agreed that :interactive is the best of the options.

@tknopp
Copy link
Contributor

tknopp commented Apr 30, 2022

Is there somewhere a high level description what this PR offers? I originally thought that this PR allows the user to create new thread pools during runtime (which also the title indicates) but browsing through the documentation it seems we now have two thread pools both of which need to be specified before initialization. We implement a low-latency high-throughput data acquisition platform and it would be highly desirable to spawn a new thread pool during runtime. In particular we do not want to share a pool between the UI (Gtk.jl) thread and the data acquisition threads (we have multiple).

A different question is what the implication of this PR for Gtk.jl are. Should the UI thread be run in the interactive pool?

@kpamnany
Copy link
Contributor

kpamnany commented May 1, 2022

There's a new manual entry.

This is intended to be used for low-latency, high interactivity tasks. It is also opt-in by the user.

@tknopp
Copy link
Contributor

tknopp commented May 1, 2022

I read that manual entry but still do not see how to create new threadpools at runtime. The PR started with:

This PR implements threadpool support in the runtime, allowing extra threadpools of arbitrary sizes to be created at runtime.

and I just wonder how the new threadpools are created.

@carstenbauer
Copy link
Member

carstenbauer commented May 1, 2022

There might be a confusion here: This PR adds threadpool support "to the runtime", the Julia runtime that is. AFAIU, it does not add features to dynamically add threads (or threadpools) "at runtime" (i.e. similar to addprocs for processes).

(FWIW, I'd also like the ability to dynamically add and remove Julia threads at runtime. But I understand that this might be tricky.)

@tknopp
Copy link
Contributor

tknopp commented May 1, 2022

The PR was originally targeting:
"allowing extra threadpools of arbitrary sizes to be created at runtime"
Not sure what I am confusing here.

@tkf
Copy link
Member

tkf commented May 1, 2022

There have been a lot of iterations over the design and implementation. What's ended up in the documentation reflects what's implemented.

As @carstenbauer said,

"allowing extra threadpools of arbitrary sizes to be created at runtime"

is not implemented. The size of each worker pool is fixed at the startup time. That said, given various cleanup patches Jameson has been adding, I suspect the size of the pools becomes dynamic at some point.

Should the UI thread be run in the interactive pool?

Yes, GUI packages should use @spawn :interactive. This does not mean UI libraries have exclusive access to OS threads. If the Julia users (i.e., whoever starts the julia process) use appropriate options, these tasks are scheduled in a worker pool that is separated from the one used by normal @spawn. Furthermore, since the :interactive pool is shared across all packages, UI libraries still have to work cooperatively and programmed carefully to always delegate computation to the main pool.

@tknopp
Copy link
Contributor

tknopp commented May 1, 2022

Thanks for the clarification. Did not want to criticize this PR but wanted to find out where this feature has been gone.

I suspect the size of the pools becomes dynamic at some point.

I really hope that this will be the case soon. Right now the workaround is to write __init__ functions looking like this

function __init__()
  if Threads.nthreads() < 4
    error("MyPackage requires four threads. Start Julia with `julia -t 4`.")
  end
end

which is not very user friendly. In my data acquisition application the isolation property of a dedicated thread pool is even more important. But anyway, the @spawn :interactive sounds like an important step for Gtk.jl. Looking forward to that.

@davidanthoff
Copy link
Contributor

davidanthoff commented Jul 21, 2022

Hm, so there are only two pools now? I kind of had hoped that we could use this for the communication part in the REPL process in the VS Code extension, so that we can create our own thread pool for that which is completely separate from any thread pool that user code would be using, but that doesn't seem feasible with this design, right? If we run our communication tasks on the interactive thread pool, and then user code blocks that, then our communication is also blocked.

Just curious: was there a specific reason to now limit this to two pools, instead of just making it a flexible system that can support n pools?

@jpsamaroo
Copy link
Member Author

@davidanthoff I'm going to try to recall the exact discussion around this, but suffice to say, we spent quite a lot of time discussing that particular feature possibility.

So one of the issues we were concerned about when discussing the possibility of user-defined (or even dynamically-added) threadpools is that, by allowing arbitrary threadpools to exist, we put a large burden on the ecosystem to ensure that threadpool usage retains performance composability.

What is performance composability, in this context? Performance composability is the ability to combine multiple pieces of code - each of which is performant on their own - and still retain similar magnitudes of performance in the combined code (or, if you're lucky, get better performance with the two combined).

And how can it be that extra threadpools hurt this? Say library A uses the default threadpool, and library B uses its own dedicated threadpool, both doing performance-sensitive operations. By default, the default threadpool uses all system threads, and so at full-tilt it can fully and near-optimally utilize the system's CPU resources. However, this other threadpool is also likely using multiple threads, probably the same number as we have in the default threadpool, and so when code is also executing on it at the same time as code executes on the default threadpool, suddenly the OS starts de-prioritizing some of our threads because it can't execute all of them in parallel. What's the result? Performance of both codes suffers. Because you were doing an unrelated computation in one threadpool, computations in another threadpool lose performance, and vice versa.

So, is this state of affairs permanent? Not at all! We discussed the possibility of exposing more built-in threadpools in the future, as necessity dictates, and also the possibility to revisit adding user-defined/dynamic threadpool support if we find that it's really necessary, and with the assumption that we have some plan to retain performance composability. For now, let's try out the interactive threadpool and see how things go! If you run into problems with meeting your interactivity deadlines, then we should have a discussion with the relevant maintainers to see if we can fix that.

@davidanthoff
Copy link
Contributor

I understand all of that, but I think that analysis just misses the scenario we would like to enable in the VS Code extension.

The scenario we have is that we want to run a tiny bit of background code in a process that is primarily running user code, and we want our code to run as robustly not matter what kind of silly things the user is doing in the REPL process (or any of the other scenarios like notebooks, testing etc that we support). If the user during development makes a mistake and runs some code on the interactive thread pool that completely blocks things, then that is exactly the situation where we still want our code to be preemptively scheduled.

I think the worries about overscheduling really don't apply to our scenario. We would only want to add one extra thread and it would do very little work. Plus, the scenario we are in is interactive editing in VS Code. In any typical interactive editing session, there are probably hundreds (if not thousands) of other threads competing for time from the OS (in VS Code itself, web browsers, Spotify, who knows what else). It is hard to see how adding one more thread would make any difference. Even if it did, I think it would be the right trade-off in this scenario.

I had a brief chat with @vtjnash at Juliacon, and he mentioned that he was working on a PR that adds the ability to dynamically add threads, and from what he described that would probably allow us to do what we would like to do, so this might already be in the works.

@jpsamaroo
Copy link
Member Author

I completely understand that; the concern is not that most libraries will do bad things, but that some libraries would do what I warned against, for the purpose of winning in benchmarks or because of some ideological disagreement with Julia's runtime system.

While that is harmless in and of itself, if users blindly enable a large additional threadpool (because they saw a blog post showing how to get max speed for some benchmark code, or even because the package docs mention it), then we'll end up getting issues filed when performance drops for complicated codes which incidentally use that library.

I think what's missing here is that we don't have a form of preemptive multitasking which would allow an interactive or semi-real-time task to forcibly steal execution time from some other actively-executing task.

we want our code to run as robustly not matter what kind of silly things the user is doing in the REPL process

Can you give a brief explanation of what that code will be doing? It might help inform the direction of this discussion if I can make more concrete recommendations.

In any typical interactive editing session, there are probably hundreds (if not thousands) of other threads competing for time from the OS (in VS Code itself, web browsers, Spotify, who knows what else). It is hard to see how adding one more thread would make any difference. Even if it did, I think it would be the right trade-off in this scenario.

These threads aren't all necessarily utilizing 100% of each CPU on the system; if they were, you'd probably get very poor performance for every thread in the system. Instead, many are doing I/O or infrequently doing small bits of work, which is exactly the idea for the kinds of tasks that should use the interactive threadpool.

I had a brief chat with @vtjnash at Juliacon, and he mentioned that he was working on a PR that adds the ability to dynamically add threads, and from what he described that would probably allow us to do what we would like to do, so this might already be in the works.

The work Jameson mentioned is probably #45447 , which allows non-Julia threads to call into Julia code, but not necessarily to be able to create more standard Julia threads from Julia code. However, that might benefit your use case by allowing you to execute some non-yielding Julia code from a manually-allocated thread (which will require some ccalls to allocate and configure a thread from the OS).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
multithreading Base.Threads and related functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.