-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding complex expressions projections for Subquery #9719
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking this @Lordworms
One situation I came up with that makes eliminating common expr across plans tricky is how to keep track of column references. Consider a plan like:
Projection: a+1
--Projection: a+1 as a
----TableScan
Where two a+1
are different things actually, since two a
are not the same thing. The result is wrong if we replace them with a pre-computed common expr.
datafusion/sqllogictest/test_files/project_complex_sub_query.slt
Outdated
Show resolved
Hide resolved
Thanks, I'll record this and fix in the following PR |
Projection: t.c3 + t.c4, SUM(t.c3 + t.c4) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING | ||
--WindowAggr: windowExpr=[[SUM(CAST(t.c3 + t.c4 AS Int64)) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING]] | ||
----Projection: t.c3 + t.c4 AS t.c3 + t.c4, t.c3, t.c4 | ||
------TableScan: t projection=[c3, c4] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I double checked this plan. It seems that expression t.c3+t.c4
in the top projection still is a binary expression (Unfortunately from display it is not visible whether it is column or binary expression). We should have convert it to its pre-evaluated column version: Col(t.c3+t.c4)
. If that was the case, I would expect to see following logical plan
Projection: t.c3 + t.c4, SUM(t.c3 + t.c4) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
--WindowAggr: windowExpr=[[SUM(CAST(t.c3 + t.c4 AS Int64)) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING]]
----Projection: t.c3 + t.c4 AS t.c3 + t.c4
------TableScan: t projection=[c3, c4]
where columns t.c3
and t.c4
are pruned at the first projection (from bottom). Since they are no longer needed by subsequent stages. I will try to solve this problem. If I can come up with a solution I will let you know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, in this case, I think the easiest way to do this is to delete the t.c3 and t.c4 columns in the upper projection schema. (which could be done when traversing from top to down). Will you do it or let me try it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feel free to go ahead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think same problem still occurs in the LogicalPlan::Window
where inner expression of the Cast
is binary expression. I would expect Projection
immediately top of the TableScan
to only include expression: t.c3 + t.c4 AS t.c3 + t.c4
because. Subsequent stages doesn't refer t.c3
or t.c4
. If you rewrite the expression for window also (as in projection). We should get that plan.
In short we have to rewrite expressions that contribute to common expression count during f_down
pass.
) | ||
STORED AS CSV | ||
WITH HEADER ROW | ||
LOCATION '../core/tests/data/demo.csv'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since demo.csv
file is not added to this PR. Following tests fails to produce PhysicalPlan
. I think, it is better to add this file also. We can still test LogicalPlan
s in the test by using the flag set datafusion.explain.logical_plan_only = true;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I'll add that
@mustafasrepo Hi If you have a moment, I would be honored to have your review on my updated one. |
Sure, I will look into this PR. |
Thanks @mustafasrepo , I double-checked the results and it meet the requirements. I actually have one question, why we could still get the proper answer with eliminating the upper ProjectionExec?(I think it is optimized since another logical plan optimization would eliminate the same Projection) I think it is ready to be merged now. Also, I think we could use similar way to track Aggregate, I could make an extra ticket here and solve it in the coming PR. |
Exactly, the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!. Thanks @Lordworms for this work. It is really nice to collaborate with you. Another look is appreciated though. If any reviewer has time.
Thank you! |
@@ -423,6 +417,34 @@ impl OptimizerRule for CommonSubexprEliminate { | |||
plan => Ok(plan), | |||
} | |||
} | |||
/// currently the implemention is not optimal, Basically I just do a top-down iteration over all the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment seems to be incomplete?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for that, I'll complete it
} | ||
|
||
impl ProjectionAdder { | ||
// TODO: adding more expressions for sub query, currently only support for Simple Binary Expressions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a tracking issue for this that we could refer to in the comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add that
I am trying to clean up our old PRs -- it looks to me like @mustafasrepo approved and asked for another reviewer #9719 (review). However it isn't clear to me if this PR needs another review before merge or if that is a nice to have. I don't think we can merge this PR without some work to resolve the conflicts, but I don't want to waste people's time resolving conflicts if it isn't ready to merge. Please let us know your thoughts |
While asking for another review, What I had in mind was "nice to have". I think, once conflicts are resolved we can merge this PR as is. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you all for working on this ❤️
I've reviewed it again, it's mostly LGTM. +1 for merging this once the conflict and remaining comments are solved.
} | ||
pub fn is_not_complex(op: &Operator) -> bool { | ||
matches!( | ||
op, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sugg:
Operator::LtEq
andOperator::GtEq
can also be sonsidered not complexOperator
isCopy
, so changing the function parameter toop: Operator
seems better
let f = | ||
DFField::new_unqualified(&expr.to_string(), dtype.clone(), true); | ||
field_set.insert(f.name().to_owned()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since only field name is used, we can skip constructing the DFField
?
let f = | |
DFField::new_unqualified(&expr.to_string(), dtype.clone(), true); | |
field_set.insert(f.name().to_owned()); | |
field_set.insert(expr.to_string()); |
Marking this one as a draft as I think it needs to have conflicts resolved before we can merge it |
1 similar comment
Marking this one as a draft as I think it needs to have conflicts resolved before we can merge it |
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. |
Which issue does this PR close?
Closes #9576
Rationale for this change
What changes are included in this PR?
Added a top-down-top tree-node traverse in all plan, find complex_expressions in all LogicalPlan, try to add an extra Projection if this expression is needed in upper layer.
Are these changes tested?
Are there any user-facing changes?