`Context` oriented error relay and `MsgStream` overruns #261

goodboy · 2021-11-05T18:43:04Z

Stricter semantics around tractor.Context.open_stream(): and friends:

Previously we were ignoring a race where the callee an opened task
context could enter Context.open_stream() before calling .started().
Disallow this as well as calling .started() more then once.
Support a new StreamOverrun error that is raised in 2 main cases:
- Context.open_stream() is entered on only one side and that side sends more messages then the Actor.msg_buffer_size in which cases the error is send back to the sending side with a special warning if the receive side never entered its own .open_stream()
- Context.open_stream(backpressure=False) is explicitly set by the user in which case the error may show up at any trio checkpoint (via embedding in a tractor.RemoteActorError) on the send side if an overrun takes place. In this case user code expected to handle the error if they don't want a send side task crash - hence why this is not the default setting.

Tests todo:

context started > once
.open_stream() entered before .started() called
test(s) that verify Feeder channel lookup race condition during context setup #265 is fixed
one side of stream not opened but is overrun by other side:
- caller overruns callee
- callee overruns caller
- overrun managed via backpressure

Races that inspired this discovered as part of pikers/piker#241

Fixes #265

goodboy · 2021-11-05T18:43:56Z

Obvs this needs some unit tests.

goodboy · 2021-11-29T16:36:09Z

Added a test that we need a fix for re: #265.

Previously we were ignoring a race where the callee an opened task context could enter `Context.open_stream()` before calling `.started(). Disallow this as well as calling `.started()` more then once.

This always triggered the mentioned race condition. We need to figure out the best approach to avoid this case.

Instead of tracking feeder mem chans per RPC dialog, store `Context` instances which (now) hold refs to the underlying RPC-task feeder chans and track them inside a `Actor._contexts` map. This begins a transition to making the "context" idea the primitive abstraction for representing messaging dialogs between tasks in different memory domains (i.e. usually separate processes). A slew of changes made this possible: - change `Actor.get_memchans()` -> `.get_context()`. - Add new `Context._send_chan` and `._recv_chan` vars. - implicitly create a new context on every `Actor.send_cmd()` call. - use the context created by `.send_cmd()` in `Portal.open_context()` instead of manually creating one. - call `Actor.get_context()` inside tasks run from `._invoke()` such that feeder chans are implicitly created for callee tasks thus fixing the bug #265. NB: We might change some of the internal semantics to do with *when* the feeder chans are actually created to denote whether or not a far end task is actually *read to receive* messages. For example, in the cases where it **never** will be ready to receive messages (one-way streaming, a context that never opens a stream, etc.) we will likely want some kind of error or at least warning to the caller that messages can't be sent (yet).

This more formally declares the runtime's remote task startingn API and uses it throughout all the dependent `Portal` API methods. Allows dropping `Portal._submit()` and simplifying `.run_in_actor()` style result waiting to be delegated to the context APIs at remote task `return` response time. We now also track the remote entrypoint "type` as `Context._remote_func_type`.

In preparation for supporting both backpressure detection (through an optional error) as well as control over the msg channel buffer size, add internal configuration flags for both to contexts. Also adjust `Context._err_on_from_remote_msg()` -> `._maybe..` such that it can be called and will only raise if a scope nursery has been set. Add a `Context._error` for stashing the remote task's error that may be delivered in an `'error'` message.

Half of portal API usage requires a 1 message response (`.run()`, `.run_in_actor()`) and the streaming APIs should probably be explicitly enabled for backpressure if desired by the user. This makes more sense in (psuedo) realtime systems where it's better to notify on a block then freeze without notice. Make this default behaviour with a new error to be raised: `tractor._exceptions.StreamOverrun` when a sender overruns a stream by the default size (2**6 for now). The old behavior can be enabled with `Context.open_stream(backpressure=True)` but now with warning log messages when there are overruns. Add task-linked-context error propagation using a "nursery raising" technique such that if either end of context linked pair of tasks errors, that error can be relayed to other side and raised as a form of interrupt at the receiving task's next `trio` checkpoint. This enables reliable error relay without expecting the (error) receiving task to call an API which would raise the remote exception (which it might never currently if using `tractor.MsgStream` APIs). Further internal implementation details: - define the default msg buffer size as `Actor.msg_buffer_size` - expose a `msg_buffer_size: int` kwarg from `Actor.get_context()` - maybe raise aforementioned context errors using `Context._maybe_error_from_remote_msg()` inside `Actor._push_result()` - support optional backpressure on a stream when pushing messages in `Actor._push_result()` - in `_invote()` handle multierrors raised from a `@tractor.context` entrypoint as being potentially caused by a relayed error from the remote caller task, if `Context._error` has been set then raise that error inside the `RemoteActorError` that will be relayed back to that caller more or less proxying through the source side error back to its origin.

Keeping it disabled on context open will help with detecting any stream connection which was never opened on one side of the task pair. In that case we can report that there was an overrun **and** a stream wasn't opened versus if the stream is explicitly configured not to use bp then we throw the standard overflow. Use `trio.Nursery._closed` to detect "closure" XD since it seems to be the most reliable way to determine if a spawn call will trigger a runtime error.

A context stream overrun should normally never take place since if a stream is opened (via ``Context.open_stream()``) backpressure is applied on the message buffer (unless explicitly disabled by the ``backpressure=False`` flag) such that an overrun on the receiving task should result in blocking the (remote) sender task (eventually depending on the underlying ``MsgStream`` transport). Here we add a special error message that reports if one side never opened a stream and let's the user know in the overrun error message that they may be trying to push messages to a task that isn't ready to receive them. Further fixes / details: - pop any `Context` at the end of any `_invoke()` task that creates one and registers with the runtime. - ignore but warn about messages received for a context that either no longer exists or is unknown (guarding against crashes by malicious packets in the latter case)

A context method handling all this logic makes the most sense since it contains all the state related to whether the error should be raised in a nursery scope or is expected to be raised by a consumer task which reads and processes the msg directly (via a `Portal` API call). This also makes it easy to always process remote errors even when there is no (stream) overrun condition.

goodboy · 2021-12-07T13:17:27Z

@guilledk @overclockworked64 I think this one is ready to land.

There might be one last little test I forgot - gotta check.

Oh and we need nooz.. might be a long one 😂

goodboy added the IPC and transport label Nov 5, 2021

goodboy requested review from guilledk and overclockworked64 November 5, 2021 18:43

goodboy force-pushed the stricter_context_starting branch from 00f97b1 to 3f26914 Compare November 9, 2021 02:39

This was referenced Nov 28, 2021

Feeder channel lookup race condition during context setup #265

Closed

Infect asyncio #121

Merged

goodboy force-pushed the stricter_context_starting branch from a955b71 to fb2b759 Compare November 29, 2021 14:28

goodboy force-pushed the stricter_context_starting branch 2 times, most recently from af8f54b to 8a4671a Compare December 2, 2021 23:12

goodboy added 3 commits December 3, 2021 10:08

Error on mal-use of Context.started()

f4793af

Previously we were ignoring a race where the callee an opened task context could enter `Context.open_stream()` before calling `.started(). Disallow this as well as calling `.started()` more then once.

Add test for #265: "msg sent before stream opened"

568902a

This always triggered the mentioned race condition. We need to figure out the best approach to avoid this case.

Add a double started error checking test

3f6099f

goodboy force-pushed the stricter_context_starting branch from 8a4671a to 3f6099f Compare December 3, 2021 15:09

goodboy added 9 commits December 3, 2021 14:49

Prove we've fixed #265

872b24a

Add a stream overrun exception

6751349

Always set Context._portal on the caller task side

2680a94

Add context stream overrun tests

2b05ffc

Adjust remaining examples and tests for non-backpressure default

7b9d410

goodboy changed the title ~~Error on mal-use of Context.started()~~ Context oriented error relay and MsgStream overruns Dec 6, 2021

goodboy added 5 commits December 5, 2021 20:00

Type check fixes

41a3e6a

Enable bp on clustering test

f3432bd

Pop context on .open_context() exit

4ea5c9b

goodboy added 2 commits December 6, 2021 11:54

Don't cancel the context on overrun cases

142083d

Slight delay to avoid flaky bcast race

58805a0

goodboy added bug Something isn't working cancellation SC teardown semantics and anti-zombie semantics labels Dec 6, 2021

Only pop contexts on decorated entrypoints

1f8e1cc

goodboy mentioned this pull request Dec 6, 2021

Embedded RemoteActorErrors and other exceptions #268

Open

goodboy added 9 commits December 6, 2021 16:32

Make cancel case expect multi-error

a79cdc7

Fix ignore warning log message

4b40599

Add a basic no-errors-when-backpressure stream test

63ecae7

Bump windows timeout again

52a2b7a

Rename test mod

fd6f457

Move context-streaming operational tests into one mod

efba522

Add back broken send chan ignore block

4856285

Bleh cast to list for msgpack

df59071

Add stream open before started, detailed semantics comment

703dee8

guilledk approved these changes Dec 7, 2021

View reviewed changes

Add nooz

faaecbf

goodboy merged commit f7c9056 into master Dec 7, 2021

goodboy deleted the stricter_context_starting branch December 7, 2021 16:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`Context` oriented error relay and `MsgStream` overruns #261

`Context` oriented error relay and `MsgStream` overruns #261

goodboy commented Nov 5, 2021 •

edited

Loading

goodboy commented Nov 5, 2021

goodboy commented Nov 29, 2021

goodboy commented Dec 7, 2021

Context oriented error relay and MsgStream overruns #261

Context oriented error relay and MsgStream overruns #261

Conversation

goodboy commented Nov 5, 2021 • edited Loading

goodboy commented Nov 5, 2021

goodboy commented Nov 29, 2021

goodboy commented Dec 7, 2021

`Context` oriented error relay and `MsgStream` overruns #261

`Context` oriented error relay and `MsgStream` overruns #261

goodboy commented Nov 5, 2021 •

edited

Loading