Keep track of common sub-expression across logical plan nodes #9576

mustafasrepo · 2024-03-12T10:14:30Z

Is your feature request related to a problem or challenge?

No response

Describe the solution you'd like

Currently, common CommonSubexprEliminate LogicalPlan optimizer rule analyzes common sub-expressions in a query. Then caches, common sub-expression by adding a LogicalPlan::Projection if it thinks this is beneficial.
As an example, following query

SELECT c3+c4, SUM(c3+c4) OVER(order by c3+c4)
FROM t

generates following LogicalPlan:

Projection: t.c3 + t.c4, SUM(t.c3 + t.c4) ORDER BY [t.c3 + t.c4 ASC NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
--WindowAggr: windowExpr=[[SUM(CAST(t.c3 + t.c4t.c4t.c3 AS t.c3 + t.c4 AS Int64)) ORDER BY [t.c3 + t.c4t.c4t.c3 AS t.c3 + t.c4 ASC NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW AS SUM(t.c3 + t.c4) ORDER BY [t.c3 + t.c4 ASC NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW]]
----Projection: t.c3 + t.c4 AS t.c3 + t.c4t.c4t.c3, t.c3, t.c4
------TableScan: t projection=[c3, c4]

where t.c3+t.c4 is calculated once in the Projection then referred by subsequent WindowAggr as a column.

However, following query:

SELECT c3+c4, SUM(c3+c4) OVER()
FROM t

generates following LogicalPlan:

Projection: t.c3 + t.c4, SUM(t.c3 + t.c4) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
--WindowAggr: windowExpr=[[SUM(CAST(t.c3 + t.c4 AS Int64)) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING]]
----TableScan: t projection=[c3, c4]

instead we could generate following plan:

Projection: col(t.c3 + t.c4), SUM(t.c3 + t.c4) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
--WindowAggr: windowExpr=[[SUM(CAST(col(t.c3 + t.c4) AS t.c3 + t.c4 AS Int64)) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING]]
----Projection: t.c3 + t.c4 AS col(t.c3 + t.c4)
------TableScan: t projection=[c3, c4]

If were to keep track of common sub expression counts globally across different nodes in the LogicalPlan. This will enable us to generate better LogicalPlans.

Describe alternatives you've considered

No response

Additional context

No response

The text was updated successfully, but these errors were encountered:

Lordworms · 2024-03-17T03:15:17Z

I can do this one

mustafasrepo · 2024-03-18T06:25:00Z

One can use following query to generate table t

CREATE EXTERNAL TABLE t (
  c1  VARCHAR NOT NULL,
  c2  TINYINT NOT NULL,
  c3  SMALLINT NOT NULL,
  c4  SMALLINT,
  c5  INT,
  c6  BIGINT NOT NULL,
  c7  SMALLINT NOT NULL,
  c8  INT NOT NULL,
  c9  BIGINT UNSIGNED NOT NULL,
  c10 VARCHAR NOT NULL,
  c11 FLOAT NOT NULL,
  c12 DOUBLE NOT NULL,
  c13 VARCHAR NOT NULL
)
STORED AS CSV
WITH HEADER ROW
LOCATION '../../testing/data/csv/aggregate_test_100.csv'

Lordworms · 2024-03-19T23:26:44Z

Hey @mustafasrepo, the reason that the first plan added a new Projection is that in the Rewriter it would mark the c3+c4 twice so that it judges the expressions needed to add an extra Projection layer. However, here I got two Problem and wish you could give me an answer. Currently, it seems like I have two ways to implement this feature

I can directly go through all the expressions of a single plan and if I find a BinaryOp then I just add another Projection upon the current plan with sub query (easy solution)
The second one is that we should track a DAG over the expression and if it is referenced in another plan, we add an extra projection. (which I have no idea how to properly trace them in treenode recursion), I don't want to do another recursion.
Which one do you think is better?

mustafasrepo · 2024-03-20T07:08:28Z

Hey @mustafasrepo, the reason that the first plan added a new Projection is that in the Rewriter it would mark the c3+c4 twice so that it judges the expressions needed to add an extra Projection layer. However, here I got two Problem and wish you could give me an answer. Currently, it seems like I have two ways to implement this feature

I can directly go through all the expressions of a single plan and if I find a BinaryOp then I just add another Projection upon the current plan with sub query (easy solution)

The second one is that we should track a DAG over the expression and if it is referenced in another plan, we add an extra projection. (which I have no idea how to properly trace them in treenode recursion), I don't want to do another recursion.
Which one do you think is better?

The current approach is we use projection to calculate a complex expression if it is used at least twice (Otherwise projection deemed unnecessary). Hence, first option wouldn't work in this case. The other approach may work, however, it may place the projection in a sub-optimal spot. As an example, consider following plan,

Projection(a+b)
--Filter (a+b=0),
----Sort(a+b ASC),
------TableScan(a,b)

with second approach you might produce plan below (still better than current behaviour. However, sub-optimal)

Projection(`a+b`)
--Filter (`a+b`=0),
----Projection (a+b as `a+b`)
------Sort(a+b ASC),
--------TableScan(a,b)

where I used `a+b` to distinguish it from binary expression a+b.
However, instead we could have generated following plan

Projection(`a+b`)
--Filter (`a+b`=0),
----Sort(`a+b` ASC),
------Projection (a+b as `a+b`)
--------TableScan(a,b)

Hence, I think best approach is to traverse plan from top to bottom and keeping the cumulative complex expression counts in the plan. For plan below

Projection(a+b)
--Filter (a+b=0),
----Sort(a+b ASC),
------TableScan(a,b)

This would produce

Projection(a+b), ("a+b", count: 1)
--Filter (a+b=0), ("a+b", count: 2)
----Sort(a+b ASC),  ("a+b", count: 2)
------TableScan(a,b)

Then after constructing, above tree. With a bottom-up traversal we can generate following plan

Projection(`a+b`)
--Filter (`a+b`=0),
----Sort(`a+b` ASC),
------Projection (a+b as `a+b`)
--------TableScan(a,b)

by implacing projections to calculate common expression that are used more than once by subsequent stages. However, I presume this would involve a lot of work.

Lordworms · 2024-03-20T10:53:09Z

Hey @mustafasrepo, the reason that the first plan added a new Projection is that in the Rewriter it would mark the c3+c4 twice so that it judges the expressions needed to add an extra Projection layer. However, here I got two Problem and wish you could give me an answer. Currently, it seems like I have two ways to implement this feature

I can directly go through all the expressions of a single plan and if I find a BinaryOp then I just add another Projection upon the current plan with sub query (easy solution)

The second one is that we should track a DAG over the expression and if it is referenced in another plan, we add an extra projection. (which I have no idea how to properly trace them in treenode recursion), I don't want to do another recursion.
Which one do you think is better?

The current approach is we use projection to calculate a complex expression if it is used at least twice (Otherwise projection deemed unnecessary). Hence, first option wouldn't work in this case. The other approach may work, however, it may place the projection in a sub-optimal spot. As an example, consider following plan,
Projection(a+b)
--Filter (a+b=0),
----Sort(a+b ASC),
------TableScan(a,b)
with second approach you might produce plan below (still better than current behaviour. However, sub-optimal)
Projection(`a+b`)
--Filter (`a+b`=0),
----Projection (a+b as `a+b`)
------Sort(a+b ASC),
--------TableScan(a,b)
where I used a+b to distinguish it from binary expression a+b. However, instead we could have generated following plan
Projection(`a+b`)
--Filter (`a+b`=0),
----Sort(`a+b` ASC),
------Projection (a+b as `a+b`)
--------TableScan(a,b)
Hence, I think best approach is to traverse plan from top to bottom and keeping the cumulative complex expression counts in the plan. For plan below
Projection(a+b)
--Filter (a+b=0),
----Sort(a+b ASC),
------TableScan(a,b)
This would produce
Projection(a+b), ("a+b", count: 1)
--Filter (a+b=0), ("a+b", count: 2)
----Sort(a+b ASC),  ("a+b", count: 2)
------TableScan(a,b)
Then after constructing, above tree. With a bottom-up traversal we can generate following plan
Projection(`a+b`)
--Filter (`a+b`=0),
----Sort(`a+b` ASC),
------Projection (a+b as `a+b`)
--------TableScan(a,b)
by implacing projections to calculate common expression that are used more than once by subsequent stages. However, I presume this would involve a lot of work.

I got it, Thanks for your solutions. I plan to implement this today.

mustafasrepo added the enhancement New feature or request label Mar 12, 2024

Lordworms mentioned this issue Mar 21, 2024

Adding complex expressions projections for Subquery #9719

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Keep track of common sub-expression across logical plan nodes #9576

Keep track of common sub-expression across logical plan nodes #9576

mustafasrepo commented Mar 12, 2024

Lordworms commented Mar 17, 2024

mustafasrepo commented Mar 18, 2024

Lordworms commented Mar 19, 2024 •

edited

Loading

mustafasrepo commented Mar 20, 2024

Lordworms commented Mar 20, 2024

Keep track of common sub-expression across logical plan nodes #9576

Keep track of common sub-expression across logical plan nodes #9576

Comments

mustafasrepo commented Mar 12, 2024

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

Lordworms commented Mar 17, 2024

mustafasrepo commented Mar 18, 2024

Lordworms commented Mar 19, 2024 • edited Loading

mustafasrepo commented Mar 20, 2024

Lordworms commented Mar 20, 2024

Lordworms commented Mar 19, 2024 •

edited

Loading