How to assert invariants? #6977
benesch
started this conversation in
Technical musings
Replies: 1 comment
-
The tl;dr, since that turned into a massive wall of text, is that I think we should sprinkle in a few calls to |
Beta Was this translation helpful? Give feedback.
0 replies
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
The pattern for asserting invariants that I like best is to use program-ending control flow to check invariants. In Rust, this amounts to using macros like
assert!
,unreachable!
, orpanic!
. This is a well-established pattern; after all, it's the reason that theassert!
andunreachable!
macros result in a panic.The big upside to this approach is that the signatures of operations that are infallible indicate that fact:
The big downside of this approach is that bugs that violate invariants will result in crashing the process. In my opinion this is the safest, most conservative choice for a general invariant. Given no more information than "this condition must be true for the program to be correct", if the condition does not hold, then the only way to guarantee that the program does not produce an incorrect result is to crash the program.
Of course, invariants are not typically so vague. Usually invariants are scoped to a single component:
And then if one of those invariants is violated, you need only throw away the Avro decoder, or the bad SQL query, and not the whole process.
A trick that works really well in a lot of domains is to wrap each unit of computation in a panic handler. That means you can continue using panics to assert invariants while also preventing a bug from crashing the process. A lot of HTTP servers do this nicely. Each HTTP request spawns a new thread (or coroutine). If that thread crashes, that HTTP request fails (usually another thread makes a best effort attempt to return "internal server error" to the client), but the rest of the server can run with no trouble.
Unfortunately that trick is not going to work as easily with Materialize. You might hope to wrap each SQL connection in its own thread (or Tokio task, in this case) with a panic handler, but a number of components have been designed such that this is impossible. Specifically:
So instead, we'll need to identify the bug-prone code within each thread, and isolate it manually.
So, I think there are two paths forward:
panic::catch_unwind
.Let's take the optimizer as an example, since a) there have been a few optimizer-caused crashes lately and b) it is a component that is totally within our control, so we can easily change its design. We want to insulate the coordinator from bugs in the optimizer, since a bug in the optimizer doesn't mean that the global state of the process is at risk—just that we've encountered a query that for whatever reason has tickled a bug in the optimizer—and we can simply throw away that query and report "internal error" to the user.
Option (1) essentially amounts to wrapping the call to Optimizer::optimize in
panic::catch_unwind
and converting those panics to errors.Option (2) requires adjusting the optimizer's code to be a lot more defensive, returning errors wherever we call
assert!
,unreachable!
, orpanic!
. This is hard to do in general, since there are a lot of sleeper operations that can cause panics, like an out-of-bounds slice (foo[7]
iffoo.len() < 7
). It's hard to know where to draw the line, in my experience—do we need to avoid using slicing everywhere? IsOption::unwrap
ok in cases where there is "obviously" a check that the option is present in the same function?So I think option (1) is generally the best choice, since it requires no cooperation from the wrapped code. This is particularly important when dealing with external code, which we don't have the same degree of control over. E.g., it's going to be much harder to convince the protobuf library not to call
unreachable!
.Life is going to be a bit harder in the dataflow layer, where performance is at a premium.
panic::catch_unwind
might be too slow to call on the critical path of datum evaluation; there have been some issues about its performance in the past: rust-lang/rust#34727. On the other hand, exceptions are generally a bit cheaper than returning errors; one of my former coworkers has a great post on the subjection: https://dr-knz.net/measuring-errors-vs-exceptions-in-go-and-cpp.html./cc @frankmcsherry
Beta Was this translation helpful? Give feedback.
All reactions