Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSE shorthand alias #10868

Closed

Conversation

MohamedAbdeen21
Copy link
Contributor

@MohamedAbdeen21 MohamedAbdeen21 commented Jun 11, 2024

Which issue does this PR close?

Closes #10280.

Rationale for this change

Shorten aliases generated by CSE for readability

What changes are included in this PR?

Use shorthand numeric aliases (#1, #2, #3, ...) for common subexpressions. Shouldn't cause same conflicts as #10333 as it still uses the same underlying expression identifier string.

Are these changes tested?

Using existing tests + a couple of new tests

Are there any user-facing changes?

Better and more concise logical plans

cc @peter-toth

@github-actions github-actions bot added optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Jun 11, 2024
@@ -1080,8 +1080,8 @@ query TT
explain select a/2, a/2 + 1 from t
----
logical_plan
01)Projection: {t.a / Int64(2)|{Int64(2)}|{t.a}} AS t.a / Int64(2), {t.a / Int64(2)|{Int64(2)}|{t.a}} AS t.a / Int64(2) + Int64(1)
02)--Projection: t.a / Int64(2) AS {t.a / Int64(2)|{Int64(2)}|{t.a}}
01)Projection: #1 AS t.a / Int64(2), #1 AS t.a / Int64(2) + Int64(1)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second projection is #1 + 1 , but the alias makes it hard to read. Can't figure out how to fix this, appreciate any feedback.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But don't we need to keep the AS t.a / Int64(2) alias in this case so as to keep the original top level column name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It must be kept for the first projection, yes.

Ideally, the second projection IMO should be #1 + 1 AS ....

@@ -166,6 +166,15 @@ impl CommonSubexprEliminate {
) -> Result<(Vec<Vec<Expr>>, LogicalPlan)> {
let mut common_exprs = IndexMap::new();

input.schema().iter().for_each(|(qualifier, field)| {
Copy link
Contributor

@peter-toth peter-toth Jun 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should suppose that all extracted aliases remain in the plan:

  • Let's suppose that we have a plan like this after some CSE optimization:
    ...
      Projection: #1 as c1, #1 as c2, #2 as c3, #2 as c4, (a + 2) as c5, (a + 1 + 1) as c6
        Projection: (a + b) as "#1", (a + c) as "#2", a
    ...
    
  • Then some other optimization rule prunes "c1" and "c2" from the plan because they turn out to be unnecessary:
    ...
      Projection: #2 as c3, #2 as c4, (a + 2) as c5, (a + 1 + 1) as c6
        Projection: (a + c) as "#2", a
    ...
    
  • And then some other rule creates new CSE possibilities:
    ...
      Projection: #2 as c3, #2 as c4, (a + 2) as c5, (a + 2) as c6
        Projection: (a + c) as "#2", a
    ...
    
  • CSE rule runs again but indexes here and at build_common_expr_project_plan() are out of sync...

IMO the best thing we can do is to choose a unique aliases for a common expressions in CommonSubexprRewriter when we found the expression and store the alias in common_exprs together with the expression. In that case we don't need to deal with index sync issues and don't get plans with unnecessary aliases like here: https://github.com/apache/datafusion/pull/10868/files#diff-351499880963d6a383c92e156e75019cd9ce33107724a9635853d7d4cd1898d0R1403

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both issues don't affect correctness.

One thing I'd like to point out is that adding unused columns (all input's columns) in intermediate projection is the behavior of current CSE, it's not introduced in this PR. You can try copying the new test and running it against main. You'll get this output.

        let plan = LogicalPlanBuilder::from(table_scan.clone())
            .project(vec![(col("a") + col("b")).alias("#1"), col("c")])?
            .project(vec![
                (col("c") + lit(2)).alias("c3"),
                (col("c") + lit(2)).alias("c4"),
            ])?
            .build()?;
Projection: {test.c + Int32(2)|{Int32(2)}|{test.c}} AS test.c + Int32(2) AS c3, {test.c + Int32(2)|{Int32(2)}|{test.c}} AS test.c + Int32(2) AS c4
  Projection: test.c + Int32(2) AS {test.c + Int32(2)|{Int32(2)}|{test.c}}, #1, test.c
    Projection: test.a + test.b AS #1, test.c
      TableScan: test

Extra projections are removed by other rules, so the final plan doesn't contain these projections.

Also, you may have noticed that extra projections make the aliases "out-of-sync" and to be honest I don't mind the #2 instead of #1 (as long as it's not something ridiculous like #1023 for example), and I don't see a way to fix that without patching some hacky global state/counter or asking other rules to reuse aliases when removing the extra projections.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, what I meant by "idexes go out of sync" is that if your modified CSE rule runs on a plan that we got in the 3rd step (i.e. there is no #1 in the plan) e.g.:

let plan = LogicalPlanBuilder::from(table_scan.clone())
    .project(vec![(col("a") + col("b")).alias("#2"), col("c")])?
    .project(vec![
        col("#2").alias("c1"),
        col("#2").alias("c2"),
        (col("c") + lit(2)).alias("c3"),
        (col("c") + lit(2)).alias("c4"),
    ])?
    .build()?;

then it produces an incorrect plan:

Projection: #2 AS c1, #2 AS c2, #2 AS c3, #2 AS c4 // The issue here is that `#2` gets aliased to `#1` below, but `#2` doesn't change here.
  Projection: #2 AS #1, test.c + Int32(2) AS #2, test.c
    Projection: test.a + test.b AS #2, test.c
      TableScan: test

This is because you inject #2 into common_exprs, but you don't inject it to expr_stats (and others).

IMO modifying common_exprs is hacky if you don't do it in CommonSubexprRewriter, that's why I suggested the solution in my previous comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed index usage; now we keep the original alias inside the common_exprs.

Copy link
Contributor

@peter-toth peter-toth Jun 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now this starts to look as the suggested because we assign the unique aliases in CommonSubexprRewriter and store it in common_exprs together with the common expression.

But why do you still inject previous # aliases to common_exprs? I think you just need to find the biggest one here and pass that number to CommonSubexprRewriter and simply start assigning new # aliases in f_down() from that number + 1.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a solution that can produce a unique alias fast. There is no problem with having gaps if we can do it constant time (vs. no gaps with linear time to the number of common expressions).

Copy link
Contributor Author

@MohamedAbdeen21 MohamedAbdeen21 Jun 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

n is usually really small. I don't think this is a big performance hit, and so filling the gaps is a good tradeoff IMO

Copy link
Contributor

@peter-toth peter-toth Jun 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But why do you want to fill the gaps? These are artifical aliases so having consecutive numbers has no use, all that matter is they are short, unique and easy to read. Also, if you don't inject anything into common_exprs then the pontless #1 AS #1 aliases won't get added.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH I don't think I'll be able to do that anytime soon.

If that's the only remaining issue, I can mark the PR as ready and a maintainer can push that change.

Copy link
Contributor

@peter-toth peter-toth Jun 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do it and let me try to open a PR with the fix to your PR tomorrow or during the weekend.

@@ -342,7 +346,7 @@ impl CommonSubexprEliminate {
Aggregate::try_new(Arc::new(new_input), new_group_expr, new_aggr_expr)
.map(LogicalPlan::Aggregate)
} else {
let mut expr_number = common_exprs.len();
let mut expr_number = common_exprs.values().map(|t| t.1).max().unwrap_or(0);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs a test case

@MohamedAbdeen21 MohamedAbdeen21 marked this pull request as ready for review June 13, 2024 18:31
@MohamedAbdeen21
Copy link
Contributor Author

The failing CI is a simple clippy warning, I'd appreciate if it can be fixed before merging.

The last thread between me and @peter-toth mentions some possible improvements, but these should be a separate follow-up PR.

I'm extremely sorry for not being able to push any further changes. And thanks @peter-toth for partaking in this long, boring review process and helping me understand this rule better.

@peter-toth
Copy link
Contributor

peter-toth commented Jun 15, 2024

The failing CI is a simple clippy warning, I'd appreciate if it can be fixed before merging.

The last thread between me and @peter-toth mentions some possible improvements, but these should be a separate follow-up PR.

I'm extremely sorry for not being able to push any further changes. And thanks @peter-toth for partaking in this long, boring review process and helping me understand this rule better.

Thanks @MohamedAbdeen21 for the PR! I fixed the clippy issue and made some changes to alias generation as it turned out that there is an AliasGenerator available for optimizer rules. I also reverted some unnecessary changes.
https://github.com/peter-toth/arrow-datafusion/commits/cse-numeric-aliases/ contains your original commits, a merge commit from main and my suggestions in the last commit.
As far as I see we have 2 options:

  • I can open a PR targeting this PR and then you can review my suggestions and merge them into this PR.
  • Or I can open a new PR targeting main which will contain your commits too.

cc @alamb, as this PR might conflict with your #10835

@alamb
Copy link
Contributor

alamb commented Jun 15, 2024

cc @alamb, as this PR might conflict with your #10835

Thanks for the heads up @peter-toth -- I can handle any conflicts if they arise. I am still trying to get improved performance with that PR

@MohamedAbdeen21
Copy link
Contributor Author

MohamedAbdeen21 commented Jun 16, 2024

Hey @peter-toth sorry for the late reply.

Looks good, and I like the Alias generator. But I have a couple of comments:

  1. Thought we agreed on the # prefix. Is there a reason for choosing the __cse_ prefix? DuckDB uses #XXX, SQLServer uses ExprXXX (can't verify this atm), Spark uses _expr#XXX, all of which look cleaner than __cse_XXX
  2. Tiny nit, but can you use into_values instead of using into_iter then ignoring the key?

I don't mind either options, a PR against this one keeps the entire review history and can make reviews easier, but a new PR can be easier to maintain as you'll be able to directly push commits.

@peter-toth
Copy link
Contributor

peter-toth commented Jun 16, 2024

Hey @peter-toth sorry for the late reply.

Looks good, and I like the Alias generator. But I have a couple of comments:

  1. Thought we agreed on the # prefix. Is there a reason for choosing the __cse_ prefix? DuckDB uses #XXX, SQLServer uses ExprXXX (can't verify this atm), Spark uses _expr#XXX, all of which look cleaner than __cse_XXX
  2. Tiny nit, but can you use into_values instead of using into_iter then ignoring the key?

I don't mind either options, a PR against this one keeps the entire review history and can make reviews easier, but a new PR can be easier to maintain as you'll be able to directly push commits.

No problem, I've opened #10939.

  1. Yes and actually I'm fine with that as well, but I found a few AliasGenerator usecases in the existing code and wanted to follow the conventions of https://github.com/apache/datafusion/blob/main/datafusion/optimizer/src/decorrelate_predicate_subquery.rs#L249 and https://github.com/apache/datafusion/blob/main/datafusion/optimizer/src/scalar_subquery_to_join.rs#L245 with the __cse prefix.
  2. I've also changed into_iter() to into_values () in 6b1d1e3. Thanks for the suggestion!

@MohamedAbdeen21
Copy link
Contributor Author

Became part of #10939.

@MohamedAbdeen21 MohamedAbdeen21 deleted the cse-numeric-aliases branch June 17, 2024 15:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make alias_symbol more human-readable
3 participants