-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql: lazy evaluation of subqueries during comparisons #20298
Comments
Should this be tackled at query planning time? Your sample query can be transformed to:
And from there we could possibly push the
vs
The example is a bit artificial, though. Do you have a practical query to play with? Perhaps such transformations are not amenable in a real query. |
Yeah, it's possible that we could always get something to be done at planning time. I'm not sure though. In general, any subquery that returns a large number of rows will be subject to this problem. I'm sure we can think of something, right? Surely there are realistic queries with subqueries that always have to return a lot of rows. |
I'm siding with Peter on this. We shouldn't invest to optimize the infrastructure around subqueries at this time, given how much we will change logical planning in the coming few months. |
I do agree something needs to be done though, but I believe we'll reap 80% of the performance gains with a plan transformation, instead of "lazy evaluation". |
In addition I think there's some additional complexity here in the interaction with the > create table t (x int);
> select 1 in (select * from [insert into t (select * from generate_series(1,100)) returning x]); |
Seems somewhat reasonable. I believe that this optimization will be a lot simpler than the kind of transformations you're proposing. If we had a good reason to try to optimize large subqueries, I would vote for doing this as a first cut before some more fancy transformation that might not even work in all cases. However, it's not clear that there's a need for this work yet. Also I'm not sure why you're putting lazy evaluation in quotes - is there a more accurate term that I'm missing? ISTM we eagerly iterate the entire plan and put the results into a datum. I'm proposing we iterate the plan only when prompted by the above layer, volcano style, the way we do with every other kind of plan. @justinj true - that isn't evaluated as a subquery though, I don't think. I'm putting the [] syntax aside because people don't really know it exists yet. Also, FWIW, other databases (well I'm just thinking of pg to be fair) use CTEs in lieu of the [] syntax. CTEs allow all preparable statements as named expressions. |
It sounds like lazy evaluation except it wouldn't be lazy in all cases - when planning for distributed execution, we'd need to evaluate them all upfront so the evaluation would then be strict. What I'd like to recommend is to start by applying the "leaf data" pattern to subqueries as a first step, then perhaps consider indeed delaying the evaluation until either distsql physical planning, or actual evaluation in local execution, whichever occurs. |
I actually didn't know that in case of IN/ANY/ALL we buffer up the right side. I had assumed we'd only buffer the result (a single boolean value), since we know we only need to evaluate the subquery once to obtain that result. Unless we plan to fix this in the execution engine, we should change the optimizer to hoist the subquery out of the projection in this case. Currently, it only hoists correlated subqueries, but we could easily change it to hoist uncorrelated subqueries like this as well. |
We can't evaluate the boolean / comparison when the subquery is executed because the LHS of the comparison may be row-dependent, i.e. |
The optimizer already hoists the case where the |
We have marked this issue as stale because it has been inactive for |
An additional motivator for lazy evaluation of subqueries is to improve the performance of queries in the form For example, in CRDB, notice that the latency of both of the queries below are about the same:
Postgres lazy evaluates the subqueries, so the latency of the second query is much lower. The
Of course, this example could in theory be simplified at optimization time to remove the slow subquery entirely because the fast subquery returns a constant value. But it's not hard to imagine fast subqueries with results that cannot be known at optimization time, like |
Furthermore, Postgres docs explicitly mention:
I believe lazy evaluation of subqueries is required in order to match this behavior. We even make the same claim in our documentation, which we need to correct:
|
See cockroachdb#20298 and cockroachdb#82498 for additional context. Release note: None
82703: builtins: remove certain overloads for to_timestamp r=mgartner a=otan Paving the way for this to be backported - remove some overloads that were added to remove ambiguity for the purpose of better backwards compat. Release note: None 83196: opt: clarify exception to conditional evaluation guarantee r=mgartner a=mgartner See #20298 and #82498 for additional context. Release note: None Co-authored-by: Oliver Tan <[email protected]> Co-authored-by: Marcus Gartner <[email protected]>
Another argument is correctness: if there's an invalid expression (that would trigger a run-time error) in the non-evaluated branch, the overall query should succeed without error. |
Good point. That's the behavior I would expect too. Although I believe either way could be considered correct because Postgres is not consistent—the optimizer traverses all branches searching for expressions to fold, which can cause runtime errors in branches that aren't actually followed during execution. For example:
If we prevent the optimizer from folding the division expression by adding a volatile numerator, the runtime error does not occur:
|
This commit inlines strict UDFs by wrapping the generated subquery expression with a `CASE` expression. The `CASE` expression results in `NULL` if any of the arguments of the UDF are `NULL`. Note that even though the subquery is wrapped in a `CASE` expression, it may still be evaluated due to the fact that non-correlated subqueries are eagerly evaluated, see cockroachdb#20298. This should be fine because these inlined UDFs shouldn't have side-effects - we don't inline volatile UDFs. Release note: None
101867: opt: fix strict UDFs with subquery arguments r=mgartner a=mgartner #### opt: fix strict UDFs with subquery arguments Previously, we wrapped all strict UDFs with a `CASE` expression to prevent the UDF from being invoked when any of the arguments were `NULL`. This was required in order to inline strict UDFs (see #94797). Unfortunately, wrapping the arguments in a `CASE` expression caused problems when the arguments were subqueries. The subqueries would be duplicated in the optimizer expression, one copy used in the `CASE` expression and another as an argument to the UDF`, with each copy containing the same table and column IDs. Normalization rules could then transform these subqueries into apply-joins that would have intersecting columns in their left and right children. This caused optimizer assertion failures in debug builds and could cause incorrect results in release builds. This commit implements a temporary fix which removes the synthesized `CASE` expression entirely. The logic for returning `NULL` immediately when a strict UDF is invoked with a `NULL` argument has been added back into `planner.EvalRoutineExpr`, and strict UDFs are not longer inlined. We should be able to inline strict UDFs by adding the `CASE` expression during the `InlineUDF` normalization rule, as long as the arguments are not complex expressions (e.g., only variables, constants, or placeholders). This is left for a future commit. Fixes #101516 Release note (bug fix): A bug has been fixed that could cause incorrect results for queries invoking `STRICT` user-defined functions. This bug was only present in pre-release versions of 23.1. #### opt: add strict field to UDF expression format Release note: None #### opt: simplify description of ExprFmtFlags The description of ExprFmtFlags has been simplified. It previously contained a very detailed description of bitmasks and iota, which probably isn't warranted because it is such a common pattern in the codebase. #### opt: inline strict UDFs This commit inlines strict UDFs by wrapping the generated subquery expression with a `CASE` expression. The `CASE` expression results in `NULL` if any of the arguments of the UDF are `NULL`. Note that even though the subquery is wrapped in a `CASE` expression, it may still be evaluated due to the fact that non-correlated subqueries are eagerly evaluated, see #20298. This should be fine because these inlined UDFs shouldn't have side-effects - we don't inline volatile UDFs. Release note: None 101925: c2c: add 30 second timeout to producer ingestion clean up during cutover r=lidorcarmel a=msbutler Epic: none Release note: None Co-authored-by: Marcus Gartner <[email protected]> Co-authored-by: Michael Butler <[email protected]>
I think we're close to solving this with some of the infrastructure we added for UDFs. Subqueries in UDFs have to be lazily evaluated because they are evaluated for each invocation of the UDF (can be multiple times in one query if the UDF is in a projection or filter) and they might depend on values of the UDF arguments that are only known at runtime. I think the remaining hurdle to overcome is building execution infrastructure that can distribute lazily-evaluated subqueries so that we don't have to disable DistSQL for all queries with subqueries. |
Subqueries on the right side of IN/ANY/SOME/ALL comparisons are fully executed and their results buffered into memory before the comparisons begin.
This is undesirable in the case of large subqueries. Take a look at the following contrived scenario:
This query takes 700ms to return on my machine, since it has to buffer up the result of the inner subquery into a tuple
Datum
before starting the comparison. In this case, it would be ideal to perform the execution of the subquery plan lazily, because the comparison would quickly be able to short-circuit. I think all of the tuple comparison operators could be converted to use this lazy plan evaluation for subqueries.This could be exposed with a special Tuple datum that hides an internal subquery and has a
Next
method. There are probably other ways of doing it too.Jira issue: CRDB-5935
The text was updated successfully, but these errors were encountered: