Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Context oriented error relay and MsgStream overruns #261

Merged
merged 31 commits into from
Dec 7, 2021

Conversation

goodboy
Copy link
Owner

@goodboy goodboy commented Nov 5, 2021

Stricter semantics around tractor.Context.open_stream(): and friends:

  • Previously we were ignoring a race where the callee an opened task
    context could enter Context.open_stream() before calling .started().
  • Disallow this as well as calling .started() more then once.
  • Support a new StreamOverrun error that is raised in 2 main cases:
    • Context.open_stream() is entered on only one side and that side sends more messages then the Actor.msg_buffer_size in which cases the error is send back to the sending side with a special warning if the receive side never entered its own .open_stream()
    • Context.open_stream(backpressure=False) is explicitly set by the user in which case the error may show up at any trio checkpoint (via embedding in a tractor.RemoteActorError) on the send side if an overrun takes place. In this case user code expected to handle the error if they don't want a send side task crash - hence why this is not the default setting.

Tests todo:

Races that inspired this discovered as part of pikers/piker#241

Fixes #265

@goodboy
Copy link
Owner Author

goodboy commented Nov 5, 2021

Obvs this needs some unit tests.

@goodboy goodboy force-pushed the stricter_context_starting branch from 00f97b1 to 3f26914 Compare November 9, 2021 02:39
@goodboy goodboy force-pushed the stricter_context_starting branch from a955b71 to fb2b759 Compare November 29, 2021 14:28
@goodboy
Copy link
Owner Author

goodboy commented Nov 29, 2021

Added a test that we need a fix for re: #265.

@goodboy goodboy force-pushed the stricter_context_starting branch 2 times, most recently from af8f54b to 8a4671a Compare December 2, 2021 23:12
Previously we were ignoring a race where the callee an opened task
context could enter `Context.open_stream()` before calling `.started().
Disallow this as well as calling `.started()` more then once.
This always triggered the mentioned race condition.
We need to figure out the best approach to avoid this case.
@goodboy goodboy force-pushed the stricter_context_starting branch from 8a4671a to 3f6099f Compare December 3, 2021 15:09
Instead of tracking feeder mem chans per RPC dialog, store `Context`
instances which (now) hold refs to the underlying RPC-task feeder chans
and track them inside a `Actor._contexts` map. This begins a transition
to making the "context" idea the primitive abstraction for representing
messaging dialogs between tasks in different memory domains (i.e.
usually separate processes).

A slew of changes made this possible:
- change `Actor.get_memchans()` -> `.get_context()`.
- Add new `Context._send_chan` and `._recv_chan` vars.
- implicitly create a new context on every `Actor.send_cmd()` call.
- use the context created by `.send_cmd()` in `Portal.open_context()`
  instead of manually creating one.
- call `Actor.get_context()` inside tasks run from `._invoke()`
  such that feeder chans are implicitly created for callee tasks
  thus fixing the bug #265.

NB: We might change some of the internal semantics to do with *when* the
feeder chans are actually created to denote whether or not a far end
task is actually *read to receive* messages. For example, in the cases
where it **never** will be ready to receive messages (one-way streaming,
a context that never opens a stream, etc.) we will likely want some kind
of error or at least warning to the caller that messages can't be sent
(yet).
This more formally declares the runtime's remote task startingn API
and uses it throughout all the dependent `Portal` API methods.
Allows dropping `Portal._submit()` and simplifying `.run_in_actor()`
style result waiting to be delegated to the context APIs at remote
task `return` response time. We now also track the remote entrypoint
"type` as `Context._remote_func_type`.
In preparation for supporting both backpressure detection (through an
optional error) as well as control over the msg channel buffer size, add
internal configuration flags for both to contexts. Also adjust
`Context._err_on_from_remote_msg()` -> `._maybe..` such that it can be
called and will only raise if a scope nursery has been set. Add
a `Context._error` for stashing the remote task's error that may be
delivered in an `'error'` message.
Half of portal API usage requires a 1 message response (`.run()`,
`.run_in_actor()`) and the streaming APIs should probably be explicitly
enabled for backpressure if desired by the user. This makes more sense
in (psuedo) realtime systems where it's better to notify on a block then
freeze without notice. Make this default behaviour with a new error to
be raised: `tractor._exceptions.StreamOverrun` when a sender overruns
a stream by the default size (2**6 for now). The old behavior can be
enabled with `Context.open_stream(backpressure=True)` but now with
warning log messages when there are overruns.

Add task-linked-context error propagation using a "nursery raising"
technique such that if either end of context linked pair of tasks
errors, that error can be relayed to other side and raised as a form of
interrupt at the receiving task's next `trio` checkpoint. This enables
reliable error relay without expecting the (error) receiving task to
call an API which would raise the remote exception (which it might never
currently if using `tractor.MsgStream` APIs).

Further internal implementation details:
- define the default msg buffer size as `Actor.msg_buffer_size`
- expose a `msg_buffer_size: int` kwarg from `Actor.get_context()`
- maybe raise aforementioned context errors using
  `Context._maybe_error_from_remote_msg()` inside `Actor._push_result()`
- support optional backpressure on a stream when pushing messages
  in `Actor._push_result()`
- in `_invote()` handle multierrors raised from a `@tractor.context`
  entrypoint as being potentially caused by a relayed error from the
  remote caller task, if `Context._error` has been set then raise that
  error inside the `RemoteActorError` that will be relayed back to that
  caller more or less proxying through the source side error back to its
  origin.
@goodboy goodboy changed the title Error on mal-use of Context.started() Context oriented error relay and MsgStream overruns Dec 6, 2021
Keeping it disabled on context open will help with detecting any stream
connection which was never opened on one side of the task pair.  In that
case we can report that there was an overrun **and** a stream wasn't
opened versus if the stream is explicitly configured not to use bp then
we throw the standard overflow.

Use `trio.Nursery._closed` to detect "closure" XD since it seems to be
the most reliable way to determine if a spawn call will trigger
a runtime error.
A context stream overrun should normally never take place since if
a stream is opened (via ``Context.open_stream()``) backpressure is
applied on the message buffer (unless explicitly disabled by the
``backpressure=False`` flag) such that an overrun on the receiving task
should result in blocking the (remote) sender task (eventually depending
on the underlying ``MsgStream`` transport).

Here we add a special error message that reports if one side never
opened a stream and let's the user know in the overrun error message
that they may be trying to push messages to a task that isn't ready to
receive them.

Further fixes / details:
- pop any `Context` at the end of any `_invoke()` task that creates
  one and registers with the runtime.
- ignore but warn about messages received for a context that either
  no longer exists or is unknown (guarding against crashes by malicious
  packets in the latter case)
@goodboy goodboy added bug Something isn't working cancellation SC teardown semantics and anti-zombie semantics labels Dec 6, 2021
A context method handling all this logic makes the most sense since it
contains all the state related to whether the error should be raised in
a nursery scope or is expected to be raised by a consumer task which
reads and processes the msg directly (via a `Portal` API call). This
also makes it easy to always process remote errors even when there is no
(stream) overrun condition.
@goodboy
Copy link
Owner Author

goodboy commented Dec 7, 2021

@guilledk @overclockworked64 I think this one is ready to land.

There might be one last little test I forgot - gotta check.

Oh and we need nooz.. might be a long one 😂

@goodboy goodboy merged commit f7c9056 into master Dec 7, 2021
@goodboy goodboy deleted the stricter_context_starting branch December 7, 2021 16:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cancellation SC teardown semantics and anti-zombie semantics IPC and transport
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feeder channel lookup race condition during context setup
2 participants