Structured Concurrency

Title: Exploring Structured Concurrency
Editor: Chris Foster chris42f@gmail.com
Created: 2019-09-12
Status: Work in progress
Discussion: JuliaLange/julia#33248

Here are some notes surveying structured concurrency as it can be applied to Julia.

Julia has supported non-parallel concurrency since very early on and a restricted form of parallel programming with the @threads macro since version 0.5.

In julia 1.3 a threadsafe runtime for truly parallel tasks has arrived which will greatly increase their appeal in Julia's numerical and technical computing community. It's time to think about APIs where users can express concurrent computation in a safe and composable way.

Background terminology

For clarity, here's a few items of terminology:

A Julia task stores the computational state needed to continue execution of a nested set of function calls. In the standard runtime this includes any native stack frames, CPU registers and julia runtime state needed to suspend and resume execution.
A program is concurrent when there are multiple tasks which have started but not yet completed at a given time.
A program is parallel when two or more tasks are executing at a given time.

With these definitions, parallelism implies concurrency but a concurrent program can be non-parallel if the runtime serially interleaves task execution. See, for example, section 2.1.2 of "Introduction to Concurrency in Programming Languages".

What is structured concurrency?

To quote the libdill documentation,

Structured concurrency means that lifetimes of concurrent functions are cleanly nested. If coroutine foo launches coroutine bar, then bar must finish before foo finishes.

This is not structured concurrency:

This is structured concurrency:

It's all about composability. Structured concurrency is good because it reasserts the function call as the natural unit of program composition, where the lifetime of a computation is delimited in the structure of the source code. This is sometimes called the black box rule. Without this,

Task failures can go unhandled because there's nowhere to propagate the error.
Task lifetime is not defined by the source code. When a task starts and whether it runs to completion is an implementation detail of the runtime.
Computation cannot be cancelled systematically because there's no natural tree of child tasks.
Scope-based resource cleanup (eg, with open(...) do io) is broken because task local context can leak from parents into long running children.

For a colourful view on the downsides of unstructured concurrency, @njsmith has expressed it this way in his blog post Notes on structured concurrency or, "go statement considered harmful":

The popular concurrency primitives — go statements, thread spawning functions, callbacks, futures, promises, ... they're all variants on goto, in theory and in practice. And not even the modern domesticated goto, but the old-testament fire-and-brimstone goto, that could leap across function boundaries. These primitives are dangerous even if we don't use them directly, because they undermine our ability to reason about control flow and compose complex systems out of abstract modular parts, and they interfere with useful language features like automatic resource cleanup and error propagation. Therefore, like goto, they have no place in a modern high-level language.

Structured concurrency in Julia 1.0?

Julia 1.0 supports a limited kind of structured concurrency via the @sync block which waits for lexically contained child tasks (scheduled using @async) to complete. However, like Go, there's no requirement that concurrent work is actually scoped this way; that's completely up to the user and they may use @async anywhere. At first sight, it may seem just as natural to choose an unstructured communicating sequential processes (CSP) style in current Julia.

Even if the user chooses structured concurrency with @sync, they are still faced with implementing robust cancellation machinery by hand using Channels. This is the big missing piece required for the natural use of structured concurrency in Julia.

Cancellation and preemption

A robust task cancellation system is required to express structured concurrency. Without it, child tasks cannot be systematically managed in response to events such as a timeout from the parent or the failure of a sibling. For a great discussion of cancellation and a survey of cancellation APIs see the blog post "Timeouts and cancellation for humans".

Big challenge: how do we handle cancellation safely but in a timely way? What are the valid cancellation points and can we have cancellation which is both timely, safe and efficient in a wide variety of situations? Ideally we'd like tight numerical loops to be cancellable as well as IO. And we want all this without the performance penalty of inserting extra checks or safe points into loop code.

The challenge of preemptive cancellation

At first sight one might hope to treat preemptive cancellation somewhat like InterruptException: wake the task, deliver a signal to its thread to generate a CanceledException which then unwinds the stack, running regular user cleanup code.

The key difficulty here is that arbitrary preemptive cancellation can occur in any location with no syntactic hint in the source. Others have claimed that this makes arbitrary cancellation an impossible problem for user code. The standard compromise is to make only a core set of operations (including IO) cancellable. This is the solution offered in Python Trio checkpoints, libdill's family of IO functions and in pthreads (see pthread_cancel and pthreads cancellation points). In contrast, consider the failed preemptive cancellation APIs Java Thread.stop Windows API TerminateThread, both of which were found to be fundamentally non-robust.

Let's consider the ways in which current julia code can be non-robust in the face of InterruptException. A particular difficulty occurs in resource acquisition. Consider this snippet from task.jl:

lock(t.donenotify)
# < What if InterruptException is thrown here?
try
    while !istaskdone(t)
        wait(t.donenotify)
    end
finally
    unlock(t.donenotify)
end

In Julia we have the escape hatch disable_sigint (jl_sigatomic_begin in the runtime) for deferring InterruptException, but most code doesn't consider or use this which makes user resource handling broken by default.

So it's fairly clear that arbitrary cancellation without cleanup is a non-starter and that arbitrary cancellation with cleanup is difficult. But that leaves us in a difficult situation: how do we allow for cancellation of expensive numerical operations? Are there options for cancellation of numerical loops with a semantic which can be understood by users? The Go people seem to consider that arbitrary preemption is workable, but can arbitrary cancellation be made to work with the right language and library features?

Runtime technicalities for preemption

On a technical level, our runtime situation in julia-1.3 is very similar to Go where preemption is cooperative and a rouge goroutine can sometimes wedge the entire system. There has been a large amount of work in the Go community to address this, leading to the proposal "Non-cooperative goroutine preemption". In the process, several interesting alternatives were assessed including cooperative preemption of loops (by the insertion of safe points) and more complex mechanisms such as returning from a signal to out-of-line code which leads quickly to a safe point.

Syntax

When comparing to solutions in other languages it's important to mention that many have introduced special syntax to mark concurrent code.

C# introduced async/await; many followed (Python, Rust, ...). This makes potential suspension points syntactic.
await in Python marks preemption points. async is required to go with it, forming a chain of custody around "potentially suspending" functions.
Kotlin has suspend to introduce a special calling convention which passes along the coroutine context.
Go doesn't have async or await but is deeply concurrent and is the best analogy to Julia.

The problem with async/suspend is that it splits the world of functions in two, as nicely expressed in Bob Nystrom's blog post "What color is your function?". This is a barrier to composability because higher order functions have to know about the color of the function they're being passed. Bob argues that Go handles this in the nicest way by having first class support for continuations in the language. The Julia runtime does this in the same way.

On the other hand, a syntax such as async/await is arguably a useful visual marker for possible cancellation points (await) and for which functions are cancellable (async). Note that this doesn't have to be implemented at the language level; for example, Go's context and errgroup also allow the reader to recognize where the cancellation can happen (listening to the Done channel) and which functions can be cancelled (those that accept Context as an argument).

Prototypical use cases

The "happy eyeballs" algorithm is becoming a standard example of structured concurrency thanks to @njsmith's tutorial at PyCon 2018.
The Go concurrency tutorial — in his talk @elizarov suggested that implementing all the examples there was a great inspiration.

Related julia issues and prototypes

Tapir parallel IR
API Request : Interrupt and terminate a task
Error handling in tasks
Uncaught exceptions from tasks
silent errors in Tasks
asyncmap: Include original backtrace in rethrown exception

TODO: We should organize these, and more, with a tag.

Awaits.jl

Resources

A lot has been written on structured concurrency quite recently. Relevant implementations are available in C, Kotlin and Python, with Go also having to deal with many of the same issues. The Trio forum has a section dedicated to the language-independent discussion of structured concurrency.

Links

Structured concurrency resources - Structured concurrency - Trio forum
Reading list · Python-Trio/Trio Wiki

People in the wider community

Bob Nystrom (@munificent) works on the Dart language at google. Regarding async/await, he wrote a very on-topic post
- What color is your function?.
Martin Sústrik (@sustrik) is the author of the C library libdill, and has a interesting blog in which the term "structured concurrency" appears to have (perhaps) first appeared:
Nathanial Smith (@njsmith) is the author of the Python Trio library and a key advocate of structured concurrency. His blog has several very interesting posts on the topic.
- Timeouts and cancellation for humans
- Notes on structured concurrency, or: go statement considered harmful.
See also his PyCon 2018 talk:
- Nathaniel J. Smith - Trio: Async concurrency for mere mortals - PyCon 2018 - YouTube
Roman Elizarov (@elizarov) is the team lead for Kotlin libraries at JetBrains. Here's his blog.
- Structured concurrency for Kotlin talk at Hydraconf (talk abstract)
- Kotlin structured concurrency blog post

Structured concurrency libraries

libdill (C)
Trio (Python)
Kotlin coroutines

Cancellation

Python
- Timeouts and cancellation for humans
Go

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

StructuredConcurrency.md

StructuredConcurrency.md

Structured Concurrency

Background terminology

What is structured concurrency?

Structured concurrency in Julia 1.0?

Cancellation and preemption

The challenge of preemptive cancellation

Runtime technicalities for preemption

Syntax

Prototypical use cases

Related julia issues and prototypes

Resources

Links

People in the wider community

Structured concurrency libraries

Cancellation

Files

StructuredConcurrency.md

Latest commit

History

StructuredConcurrency.md

File metadata and controls

Structured Concurrency

Background terminology

What is structured concurrency?

Structured concurrency in Julia 1.0?

Cancellation and preemption

The challenge of preemptive cancellation

Runtime technicalities for preemption

Syntax

Prototypical use cases

Related julia issues and prototypes

Resources

Links

People in the wider community

Structured concurrency libraries

Cancellation