Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Keep track of common sub-expression across logical plan nodes #9576

Open
mustafasrepo opened this issue Mar 12, 2024 · 5 comments
Open

Keep track of common sub-expression across logical plan nodes #9576

mustafasrepo opened this issue Mar 12, 2024 · 5 comments
Labels
enhancement New feature or request

Comments

@mustafasrepo
Copy link
Contributor

Is your feature request related to a problem or challenge?

No response

Describe the solution you'd like

Currently, common CommonSubexprEliminate LogicalPlan optimizer rule analyzes common sub-expressions in a query. Then caches, common sub-expression by adding a LogicalPlan::Projection if it thinks this is beneficial.
As an example, following query

SELECT c3+c4, SUM(c3+c4) OVER(order by c3+c4)
FROM t

generates following LogicalPlan:

Projection: t.c3 + t.c4, SUM(t.c3 + t.c4) ORDER BY [t.c3 + t.c4 ASC NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
--WindowAggr: windowExpr=[[SUM(CAST(t.c3 + t.c4t.c4t.c3 AS t.c3 + t.c4 AS Int64)) ORDER BY [t.c3 + t.c4t.c4t.c3 AS t.c3 + t.c4 ASC NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW AS SUM(t.c3 + t.c4) ORDER BY [t.c3 + t.c4 ASC NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW]]
----Projection: t.c3 + t.c4 AS t.c3 + t.c4t.c4t.c3, t.c3, t.c4
------TableScan: t projection=[c3, c4]

where t.c3+t.c4 is calculated once in the Projection then referred by subsequent WindowAggr as a column.

However, following query:

SELECT c3+c4, SUM(c3+c4) OVER()
FROM t

generates following LogicalPlan:

Projection: t.c3 + t.c4, SUM(t.c3 + t.c4) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
--WindowAggr: windowExpr=[[SUM(CAST(t.c3 + t.c4 AS Int64)) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING]]
----TableScan: t projection=[c3, c4]

instead we could generate following plan:

Projection: col(t.c3 + t.c4), SUM(t.c3 + t.c4) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
--WindowAggr: windowExpr=[[SUM(CAST(col(t.c3 + t.c4) AS t.c3 + t.c4 AS Int64)) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING]]
----Projection: t.c3 + t.c4 AS col(t.c3 + t.c4)
------TableScan: t projection=[c3, c4]

If were to keep track of common sub expression counts globally across different nodes in the LogicalPlan. This will enable us to generate better LogicalPlans.

Describe alternatives you've considered

No response

Additional context

No response

@mustafasrepo mustafasrepo added the enhancement New feature or request label Mar 12, 2024
@Lordworms
Copy link
Contributor

I can do this one

@mustafasrepo
Copy link
Contributor Author

One can use following query to generate table t

CREATE EXTERNAL TABLE t (
  c1  VARCHAR NOT NULL,
  c2  TINYINT NOT NULL,
  c3  SMALLINT NOT NULL,
  c4  SMALLINT,
  c5  INT,
  c6  BIGINT NOT NULL,
  c7  SMALLINT NOT NULL,
  c8  INT NOT NULL,
  c9  BIGINT UNSIGNED NOT NULL,
  c10 VARCHAR NOT NULL,
  c11 FLOAT NOT NULL,
  c12 DOUBLE NOT NULL,
  c13 VARCHAR NOT NULL
)
STORED AS CSV
WITH HEADER ROW
LOCATION '../../testing/data/csv/aggregate_test_100.csv'

@Lordworms
Copy link
Contributor

Lordworms commented Mar 19, 2024

Hey @mustafasrepo, the reason that the first plan added a new Projection is that in the Rewriter it would mark the c3+c4 twice so that it judges the expressions needed to add an extra Projection layer. However, here I got two Problem and wish you could give me an answer. Currently, it seems like I have two ways to implement this feature

  1. I can directly go through all the expressions of a single plan and if I find a BinaryOp then I just add another Projection upon the current plan with sub query (easy solution)
  2. The second one is that we should track a DAG over the expression and if it is referenced in another plan, we add an extra projection. (which I have no idea how to properly trace them in treenode recursion), I don't want to do another recursion.
    Which one do you think is better?

@mustafasrepo
Copy link
Contributor Author

Hey @mustafasrepo, the reason that the first plan added a new Projection is that in the Rewriter it would mark the c3+c4 twice so that it judges the expressions needed to add an extra Projection layer. However, here I got two Problem and wish you could give me an answer. Currently, it seems like I have two ways to implement this feature

  1. I can directly go through all the expressions of a single plan and if I find a BinaryOp then I just add another Projection upon the current plan with sub query (easy solution)
  2. The second one is that we should track a DAG over the expression and if it is referenced in another plan, we add an extra projection. (which I have no idea how to properly trace them in treenode recursion), I don't want to do another recursion.
    Which one do you think is better?

The current approach is we use projection to calculate a complex expression if it is used at least twice (Otherwise projection deemed unnecessary). Hence, first option wouldn't work in this case. The other approach may work, however, it may place the projection in a sub-optimal spot. As an example, consider following plan,

Projection(a+b)
--Filter (a+b=0),
----Sort(a+b ASC),
------TableScan(a,b)

with second approach you might produce plan below (still better than current behaviour. However, sub-optimal)

Projection(`a+b`)
--Filter (`a+b`=0),
----Projection (a+b as `a+b`)
------Sort(a+b ASC),
--------TableScan(a,b)

where I used `a+b` to distinguish it from binary expression a+b.
However, instead we could have generated following plan

Projection(`a+b`)
--Filter (`a+b`=0),
----Sort(`a+b` ASC),
------Projection (a+b as `a+b`)
--------TableScan(a,b)

Hence, I think best approach is to traverse plan from top to bottom and keeping the cumulative complex expression counts in the plan. For plan below

Projection(a+b)
--Filter (a+b=0),
----Sort(a+b ASC),
------TableScan(a,b)

This would produce

Projection(a+b), ("a+b", count: 1)
--Filter (a+b=0), ("a+b", count: 2)
----Sort(a+b ASC),  ("a+b", count: 2)
------TableScan(a,b)

Then after constructing, above tree. With a bottom-up traversal we can generate following plan

Projection(`a+b`)
--Filter (`a+b`=0),
----Sort(`a+b` ASC),
------Projection (a+b as `a+b`)
--------TableScan(a,b)

by implacing projections to calculate common expression that are used more than once by subsequent stages. However, I presume this would involve a lot of work.

@Lordworms
Copy link
Contributor

Hey @mustafasrepo, the reason that the first plan added a new Projection is that in the Rewriter it would mark the c3+c4 twice so that it judges the expressions needed to add an extra Projection layer. However, here I got two Problem and wish you could give me an answer. Currently, it seems like I have two ways to implement this feature

  1. I can directly go through all the expressions of a single plan and if I find a BinaryOp then I just add another Projection upon the current plan with sub query (easy solution)
  2. The second one is that we should track a DAG over the expression and if it is referenced in another plan, we add an extra projection. (which I have no idea how to properly trace them in treenode recursion), I don't want to do another recursion.
    Which one do you think is better?

The current approach is we use projection to calculate a complex expression if it is used at least twice (Otherwise projection deemed unnecessary). Hence, first option wouldn't work in this case. The other approach may work, however, it may place the projection in a sub-optimal spot. As an example, consider following plan,

Projection(a+b)
--Filter (a+b=0),
----Sort(a+b ASC),
------TableScan(a,b)

with second approach you might produce plan below (still better than current behaviour. However, sub-optimal)

Projection(`a+b`)
--Filter (`a+b`=0),
----Projection (a+b as `a+b`)
------Sort(a+b ASC),
--------TableScan(a,b)

where I used a+b to distinguish it from binary expression a+b. However, instead we could have generated following plan

Projection(`a+b`)
--Filter (`a+b`=0),
----Sort(`a+b` ASC),
------Projection (a+b as `a+b`)
--------TableScan(a,b)

Hence, I think best approach is to traverse plan from top to bottom and keeping the cumulative complex expression counts in the plan. For plan below

Projection(a+b)
--Filter (a+b=0),
----Sort(a+b ASC),
------TableScan(a,b)

This would produce

Projection(a+b), ("a+b", count: 1)
--Filter (a+b=0), ("a+b", count: 2)
----Sort(a+b ASC),  ("a+b", count: 2)
------TableScan(a,b)

Then after constructing, above tree. With a bottom-up traversal we can generate following plan

Projection(`a+b`)
--Filter (`a+b`=0),
----Sort(`a+b` ASC),
------Projection (a+b as `a+b`)
--------TableScan(a,b)

by implacing projections to calculate common expression that are used more than once by subsequent stages. However, I presume this would involve a lot of work.

I got it, Thanks for your solutions. I plan to implement this today.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants