How to assert invariants? #6977

benesch · 2020-07-14T21:37:44Z

benesch
Jul 14, 2020
Maintainer

The pattern for asserting invariants that I like best is to use program-ending control flow to check invariants. In Rust, this amounts to using macros like assert!, unreachable!, or panic! . This is a well-established pattern; after all, it's the reason that the assert! and unreachable! macros result in a panic.

The big upside to this approach is that the signatures of operations that are infallible indicate that fact:

fn infallible_op(/* ... */) {
    if some_condition_is_not_met {
        unreachable!("internal invariant violated");
    }
}

The big downside of this approach is that bugs that violate invariants will result in crashing the process. In my opinion this is the safest, most conservative choice for a general invariant. Given no more information than "this condition must be true for the program to be correct", if the condition does not hold, then the only way to guarantee that the program does not produce an incorrect result is to crash the program.

Of course, invariants are not typically so vague. Usually invariants are scoped to a single component:

"If this condition doesn't hold, then the Avro decoder state is invalid."
"If this condition doesn't hold, the planning of this SQL query has failed."

And then if one of those invariants is violated, you need only throw away the Avro decoder, or the bad SQL query, and not the whole process.

A trick that works really well in a lot of domains is to wrap each unit of computation in a panic handler. That means you can continue using panics to assert invariants while also preventing a bug from crashing the process. A lot of HTTP servers do this nicely. Each HTTP request spawns a new thread (or coroutine). If that thread crashes, that HTTP request fails (usually another thread makes a best effort attempt to return "internal server error" to the client), but the rest of the server can run with no trouble.

Unfortunately that trick is not going to work as easily with Materialize. You might hope to wrap each SQL connection in its own thread (or Tokio task, in this case) with a panic handler, but a number of components have been designed such that this is impossible. Specifically:

the coordinator is single-threaded, so a panic in the coordinator (including in the catalog) will take the whole system down,
the timely/differential computation mixes all view computation together on a fixed pool of worker threads, so a panic in any dataflow will take the whole system down.

So instead, we'll need to identify the bug-prone code within each thread, and isolate it manually.

So, I think there are two paths forward:

Wrap bug-prone code with panic::catch_unwind.
Make bug-prone code more defensive; essentially teach it to return errors rather than panicking whenever an invariant is violated.

Let's take the optimizer as an example, since a) there have been a few optimizer-caused crashes lately and b) it is a component that is totally within our control, so we can easily change its design. We want to insulate the coordinator from bugs in the optimizer, since a bug in the optimizer doesn't mean that the global state of the process is at risk—just that we've encountered a query that for whatever reason has tickled a bug in the optimizer—and we can simply throw away that query and report "internal error" to the user.

Option (1) essentially amounts to wrapping the call to Optimizer::optimize in panic::catch_unwind and converting those panics to errors.

Option (2) requires adjusting the optimizer's code to be a lot more defensive, returning errors wherever we call assert!, unreachable!, or panic!. This is hard to do in general, since there are a lot of sleeper operations that can cause panics, like an out-of-bounds slice (foo[7] if foo.len() < 7). It's hard to know where to draw the line, in my experience—do we need to avoid using slicing everywhere? Is Option::unwrap ok in cases where there is "obviously" a check that the option is present in the same function?

So I think option (1) is generally the best choice, since it requires no cooperation from the wrapped code. This is particularly important when dealing with external code, which we don't have the same degree of control over. E.g., it's going to be much harder to convince the protobuf library not to call unreachable!.

Life is going to be a bit harder in the dataflow layer, where performance is at a premium. panic::catch_unwind might be too slow to call on the critical path of datum evaluation; there have been some issues about its performance in the past: rust-lang/rust#34727. On the other hand, exceptions are generally a bit cheaper than returning errors; one of my former coworkers has a great post on the subjection: https://dr-knz.net/measuring-errors-vs-exceptions-in-go-and-cpp.html.

/cc @frankmcsherry

benesch · 2020-07-14T21:42:13Z

benesch
Jul 14, 2020
Maintainer Author

The tl;dr, since that turned into a massive wall of text, is that I think we should sprinkle in a few calls to panic::catch_unwind to our codebase, rather than trying to avoid panics via defensive programming. I think catch_unwind is likely to protect against more classes of errors, plus it avoids introducing Results in code paths that are infallible modulo bugs. But we'd need to do some benchmarking to evaluate the performance cost if we want to use that strategy in the dataflow layer.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to assert invariants? #6977

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 1 comment

{{title}}

Select a reply

How to assert invariants? #6977

benesch Jul 14, 2020 Maintainer

Replies: 1 comment

benesch Jul 14, 2020 Maintainer Author

benesch
Jul 14, 2020
Maintainer

benesch
Jul 14, 2020
Maintainer Author