Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding complex expressions projections for Subquery #9719

Closed
wants to merge 29 commits into from

Conversation

Lordworms
Copy link
Contributor

@Lordworms Lordworms commented Mar 21, 2024

Which issue does this PR close?

Closes #9576

Rationale for this change

What changes are included in this PR?

Added a top-down-top tree-node traverse in all plan, find complex_expressions in all LogicalPlan, try to add an extra Projection if this expression is needed in upper layer.

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules sql SQL Planner sqllogictest SQL Logic Tests (.slt) and removed logical-expr Logical plan and expressions labels Mar 21, 2024
@github-actions github-actions bot added the core Core DataFusion crate label Mar 22, 2024
@github-actions github-actions bot removed the core Core DataFusion crate label Mar 22, 2024
@github-actions github-actions bot removed the sql SQL Planner label Mar 22, 2024
@Lordworms Lordworms marked this pull request as ready for review March 22, 2024 15:31
@alamb alamb requested review from waynexia and mustafasrepo March 22, 2024 18:46
Copy link
Member

@waynexia waynexia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking this @Lordworms

One situation I came up with that makes eliminating common expr across plans tricky is how to keep track of column references. Consider a plan like:

Projection: a+1
--Projection: a+1 as a
----TableScan

Where two a+1 are different things actually, since two a are not the same thing. The result is wrong if we replace them with a pre-computed common expr.

datafusion/optimizer/src/optimize_projections.rs Outdated Show resolved Hide resolved
@Lordworms
Copy link
Contributor Author

Thanks for taking this @Lordworms

One situation I came up with that makes eliminating common expr across plans tricky is how to keep track of column references. Consider a plan like:

Projection: a+1
--Projection: a+1 as a
----TableScan

Where two a+1 are different things actually, since two a are not the same thing. The result is wrong if we replace them with a pre-computed common expr.

Thanks, I'll record this and fix in the following PR

Comment on lines 39 to 42
Projection: t.c3 + t.c4, SUM(t.c3 + t.c4) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
--WindowAggr: windowExpr=[[SUM(CAST(t.c3 + t.c4 AS Int64)) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING]]
----Projection: t.c3 + t.c4 AS t.c3 + t.c4, t.c3, t.c4
------TableScan: t projection=[c3, c4]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I double checked this plan. It seems that expression t.c3+t.c4 in the top projection still is a binary expression (Unfortunately from display it is not visible whether it is column or binary expression). We should have convert it to its pre-evaluated column version: Col(t.c3+t.c4). If that was the case, I would expect to see following logical plan

Projection: t.c3 + t.c4, SUM(t.c3 + t.c4) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
--WindowAggr: windowExpr=[[SUM(CAST(t.c3 + t.c4 AS Int64)) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING]]
----Projection: t.c3 + t.c4 AS t.c3 + t.c4
------TableScan: t projection=[c3, c4]

where columns t.c3 and t.c4 are pruned at the first projection (from bottom). Since they are no longer needed by subsequent stages. I will try to solve this problem. If I can come up with a solution I will let you know.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, in this case, I think the easiest way to do this is to delete the t.c3 and t.c4 columns in the upper projection schema. (which could be done when traversing from top to down). Will you do it or let me try it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel free to go ahead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think same problem still occurs in the LogicalPlan::Window where inner expression of the Cast is binary expression. I would expect Projection immediately top of the TableScan to only include expression: t.c3 + t.c4 AS t.c3 + t.c4 because. Subsequent stages doesn't refer t.c3 or t.c4. If you rewrite the expression for window also (as in projection). We should get that plan.
In short we have to rewrite expressions that contribute to common expression count during f_down pass.

)
STORED AS CSV
WITH HEADER ROW
LOCATION '../core/tests/data/demo.csv';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since demo.csv file is not added to this PR. Following tests fails to produce PhysicalPlan. I think, it is better to add this file also. We can still test LogicalPlans in the test by using the flag set datafusion.explain.logical_plan_only = true;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I'll add that

@Lordworms
Copy link
Contributor Author

I think right now it is fine, I'll run the tests
image

@Lordworms
Copy link
Contributor Author

@mustafasrepo Hi If you have a moment, I would be honored to have your review on my updated one.

@mustafasrepo
Copy link
Contributor

@mustafasrepo Hi If you have a moment, I would be honored to have your review on my updated one.

Sure, I will look into this PR.

@Lordworms
Copy link
Contributor Author

Lordworms commented Mar 29, 2024

Thanks @mustafasrepo , I double-checked the results and it meet the requirements. I actually have one question, why we could still get the proper answer with eliminating the upper ProjectionExec?(I think it is optimized since another logical plan optimization would eliminate the same Projection) I think it is ready to be merged now. Also, I think we could use similar way to track Aggregate, I could make an extra ticket here and solve it in the coming PR.

@mustafasrepo
Copy link
Contributor

Thanks @mustafasrepo , I double-checked the results and it meet the requirements. I actually have one question, why we could still get the proper answer with eliminating the upper ProjectionExec?(I think it is optimized since another logical plan optimization would eliminate the same Projection) I think it is ready to be merged now. Also, I think we could use similar way to track Aggregate, I could make an extra ticket here and solve it in the coming PR.

Exactly, the optimize_projections rule (for LogicalPlan) removes the unnecessary LogicalPlan::Projections from the plan if those projections have same input and same output schema. In this example, this is the case also. LogicalPlan::window has already desired schema for the user input.

Copy link
Contributor

@mustafasrepo mustafasrepo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!. Thanks @Lordworms for this work. It is really nice to collaborate with you. Another look is appreciated though. If any reviewer has time.

@Lordworms
Copy link
Contributor Author

LGTM!. Thanks @Lordworms for this work. It is really nice to collaborate with you. Another look is appreciated though. If any reviewer has time.

Thank you!

@@ -423,6 +417,34 @@ impl OptimizerRule for CommonSubexprEliminate {
plan => Ok(plan),
}
}
/// currently the implemention is not optimal, Basically I just do a top-down iteration over all the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment seems to be incomplete?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for that, I'll complete it

}

impl ProjectionAdder {
// TODO: adding more expressions for sub query, currently only support for Simple Binary Expressions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a tracking issue for this that we could refer to in the comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add that

@alamb
Copy link
Contributor

alamb commented Apr 29, 2024

I am trying to clean up our old PRs -- it looks to me like @mustafasrepo approved and asked for another reviewer #9719 (review). However it isn't clear to me if this PR needs another review before merge or if that is a nice to have.

I don't think we can merge this PR without some work to resolve the conflicts, but I don't want to waste people's time resolving conflicts if it isn't ready to merge.

Please let us know your thoughts

@mustafasrepo
Copy link
Contributor

mustafasrepo commented Apr 30, 2024

However it isn't clear to me if this PR needs another review before merge or if that is a nice to have.

While asking for another review, What I had in mind was "nice to have". I think, once conflicts are resolved we can merge this PR as is.

Copy link
Member

@waynexia waynexia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you all for working on this ❤️

I've reviewed it again, it's mostly LGTM. +1 for merging this once the conflict and remaining comments are solved.

}
pub fn is_not_complex(op: &Operator) -> bool {
matches!(
op,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sugg:

  • Operator::LtEq and Operator::GtEq can also be sonsidered not complex
  • Operator is Copy, so changing the function parameter to op: Operator seems better

Comment on lines +1023 to +1025
let f =
DFField::new_unqualified(&expr.to_string(), dtype.clone(), true);
field_set.insert(f.name().to_owned());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since only field name is used, we can skip constructing the DFField?

Suggested change
let f =
DFField::new_unqualified(&expr.to_string(), dtype.clone(), true);
field_set.insert(f.name().to_owned());
field_set.insert(expr.to_string());

@alamb
Copy link
Contributor

alamb commented May 17, 2024

Marking this one as a draft as I think it needs to have conflicts resolved before we can merge it

1 similar comment
@alamb
Copy link
Contributor

alamb commented May 17, 2024

Marking this one as a draft as I think it needs to have conflicts resolved before we can merge it

@alamb alamb marked this pull request as draft May 17, 2024 00:34
Copy link

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Jul 16, 2024
@github-actions github-actions bot closed this Jul 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) Stale PR has not had any activity for some time
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Keep track of common sub-expression across logical plan nodes
5 participants