
[SPARK-43838][SQL] Fix subquery on single table with having clause can't be optimized #41347

Closed
wants to merge 9 commits

Conversation

Hisoka-X
Member

@Hisoka-X Hisoka-X commented May 28, 2023

What changes were proposed in this pull request?

E.g.:

```scala
sql("create view t(c1, c2) as values (0, 1), (0, 2), (1, 2)")

sql("select c1, c2, (select count(*) cnt from t t2 where t1.c1 = t2.c1 " +
  "having cnt = 0) from t t1").show()
```

The following error is thrown:

```
[PLAN_VALIDATION_FAILED_RULE_IN_BATCH] Rule org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery in batch Operator Optimization before Inferring Filters generated an invalid plan: The plan becomes unresolved: 'Project [toprettystring(c1#224, Some(America/Los_Angeles)) AS toprettystring(c1)#238, toprettystring(c2#225, Some(America/Los_Angeles)) AS toprettystring(c2)#239, toprettystring(cnt#246L, Some(America/Los_Angeles)) AS toprettystring(scalarsubquery(c1))#240]
+- 'Project [c1#224, c2#225, CASE WHEN isnull(alwaysTrue#245) THEN 0 WHEN NOT (cnt#222L = 0) THEN null ELSE cnt#222L END AS cnt#246L]
   +- 'Join LeftOuter, (c1#224 = c1#224#244)
      :- Project [col1#226 AS c1#224, col2#227 AS c2#225]
      :  +- LocalRelation [col1#226, col2#227]
      +- Project [cnt#222L, c1#224#244, cnt#222L, c1#224, true AS alwaysTrue#245]
         +- Project [cnt#222L, c1#224 AS c1#224#244, cnt#222L, c1#224]
            +- Aggregate [c1#224], [count(1) AS cnt#222L, c1#224]
               +- Project [col1#228 AS c1#224]
                  +- LocalRelation [col1#228, col2#229]

The previous plan: Project [toprettystring(c1#224, Some(America/Los_Angeles)) AS toprettystring(c1)#238, toprettystring(c2#225, Some(America/Los_Angeles)) AS toprettystring(c2)#239, toprettystring(scalar-subquery#223 [c1#224 && (c1#224 = c1#224#244)], Some(America/Los_Angeles)) AS toprettystring(scalarsubquery(c1))#240]
:  +- Project [cnt#222L, c1#224 AS c1#224#244]
:     +- Filter (cnt#222L = 0)
:        +- Aggregate [c1#224], [count(1) AS cnt#222L, c1#224]
:           +- Project [col1#228 AS c1#224]
:              +- LocalRelation [col1#228, col2#229]
+- Project [col1#226 AS c1#224, col2#227 AS c2#225]
   +- LocalRelation [col1#226, col2#227]
```

The cause of the error is an unresolved expression in the `Join` node generated by subquery decorrelation: `duplicateResolved` on the `Join` node is false, meaning the left and right sides of the `Join` share the same `Attribute`, in this example `c1#224`. The right-side `c1#224` attribute comes from the HAVING inputs, because the HAVING inputs are wrong.

This problem only occurs when the subquery contains a HAVING clause.

This PR also includes some code formatting fixes.
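To make the `duplicateResolved` condition concrete, here is a toy sketch of why a shared expression ID across join sides keeps the plan unresolved. The `Attr` and `ToyJoin` types below are hypothetical stand-ins, not Catalyst's real classes; Catalyst's `Join` applies the same disjoint-ID rule to its children's outputs.

```scala
// Toy model: an attribute is identified by name plus a unique expression ID.
case class Attr(name: String, exprId: Long)

case class ToyJoin(left: Seq[Attr], right: Seq[Attr]) {
  // Mirrors the idea behind Catalyst's Join.duplicateResolved:
  // no expression ID may appear in both children's outputs.
  def duplicateResolved: Boolean =
    left.map(_.exprId).toSet.intersect(right.map(_.exprId).toSet).isEmpty
}

object DuplicateDemo {
  // c1#224 from the outer query, and the same ID leaked into the subquery side
  // by the wrong HAVING inputs -> duplicateResolved is false, plan stays unresolved.
  val bad = ToyJoin(Seq(Attr("c1", 224L)), Seq(Attr("c1", 224L), Attr("cnt", 222L)))

  // After re-aliasing the subquery side with a fresh ID (like c1#224#244),
  // the sides are disjoint and the join resolves.
  val good = ToyJoin(Seq(Attr("c1", 224L)), Seq(Attr("c1", 244L), Attr("cnt", 222L)))
}
```

In the failing plan above, `c1#224` appears on both sides of the `LeftOuter` join, which is exactly the `bad` case in this sketch.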

Why are the changes needed?

Fix a subquery bug on a single table when a HAVING clause is used.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Add new test

@github-actions github-actions bot added the SQL label May 28, 2023
@Hisoka-X
Member Author

cc @cloud-fan @jchen5

@cloud-fan
Contributor

I think a better fix is to let DeduplicateRelations handle Project with alias as well, which is also a source of conflicting attribute ids.
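A rough sketch of that direction, under the assumption that conflicting alias IDs are simply re-issued with fresh ones, the same way `DeduplicateRelations` re-keys conflicting leaf relations. The `AliasLike` type and `Dedup` helper are illustrative, not Spark's API:

```scala
import java.util.concurrent.atomic.AtomicLong

// Toy stand-in for an Alias in a Project list: a name plus an expression ID.
case class AliasLike(name: String, exprId: Long)

object Dedup {
  private val counter = new AtomicLong(1000L)

  // Hypothetical fresh-ID generator, analogous to Catalyst's NamedExpression.newExprId.
  def freshId(): Long = counter.incrementAndGet()

  // Re-issue any alias whose ID conflicts with one already seen in the plan,
  // so a Project can no longer introduce a duplicate attribute ID.
  def renewConflicting(projectList: Seq[AliasLike], seen: Set[Long]): Seq[AliasLike] =
    projectList.map { a =>
      if (seen.contains(a.exprId)) a.copy(exprId = freshId()) else a
    }
}
```

For example, an alias carrying the already-seen ID `224` would come back with a fresh ID, while non-conflicting aliases pass through unchanged.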

@Hisoka-X
Member Author

Hisoka-X commented Jun 7, 2023

I think a better fix is to let DeduplicateRelations handle Project with alias as well, which is also a source of conflicting attribute ids.

@cloud-fan Hi, I have now implemented `DeduplicateRelations` handling for `Project`. Should I revert the change in `RewriteCorrelatedScalarSubquery`? I think it's a hidden danger.

@allisonwang-db
Contributor

cc @jchen5

```scala
@@ -105,6 +105,21 @@ object DeduplicateRelations extends Rule[LogicalPlan] {
        (m, false)
      }

      case p @ Project(_, child) if p.resolved && p.projectList.forall(_.isInstanceOf[Alias]) =>
```
Contributor
I'm reading the doc of the `collectConflictPlans` function in this class. I think the problem we have now is that there are more plan nodes than just leaf nodes that can produce new attributes, and we need to handle all of them. `Project` is not the only one; let's follow the set of plan nodes `collectConflictPlans` handles.

Member Author

I added all the plan nodes that `collectConflictPlans` handles. Please check again. There is another problem, though: how do we test it? Producing a negative case does not seem easy.

@Hisoka-X Hisoka-X force-pushed the SPARK-43838_subquery_having branch from 3a29635 to 075ef46 Compare July 10, 2023 14:02
@Hisoka-X Hisoka-X force-pushed the SPARK-43838_subquery_having branch from 075ef46 to 0c12ef9 Compare July 10, 2023 14:05
```scala
@@ -437,25 +437,6 @@ class LeftSemiAntiJoinPushDownSuite extends PlanTest {
    }
  }

  Seq(LeftSemi, LeftAnti).foreach { case jt =>
```
Member Author
This test is unnecessary, because we can deduplicate those attributes when an anti-join / semi-join is a self-join. Please refer to #39131.

```scala
@@ -271,7 +271,10 @@ case class ScalarSubquery(
    mayHaveCountBug: Option[Boolean] = None)
  extends SubqueryExpression(plan, outerAttrs, exprId, joinCond, hint) with Unevaluable {
  override def dataType: DataType = {
    assert(plan.schema.fields.nonEmpty, "Scalar subquery should have only one column")
    if (!plan.schema.fields.nonEmpty) {
```
Member Author
Make sure an `AnalysisException` is thrown, not an `AssertionError`.

Contributor

Can we really reach this code branch?

Member Author

@Hisoka-X Hisoka-X Jul 13, 2023

Yes. Usually this error is thrown by `checkAnalysis`, but we may call `dataType` from `DeduplicateRelations`, which causes the exception to be thrown there instead. This change ensures the thrown exception is consistent.

Before the change:

```
Caused by: sbt.ForkMain$ForkError: java.lang.AssertionError: assertion failed: Scalar subquery should have only one column
	at scala.Predef$.assert(Predef.scala:223)
	at org.apache.spark.sql.catalyst.expressions.ScalarSubquery.dataType(subquery.scala:274)
	at org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:194)
	at org.apache.spark.sql.catalyst.analysis.DeduplicateRelations$$anonfun$findAliases$1.applyOrElse(DeduplicateRelations.scala:530)
	at org.apache.spark.sql.catalyst.analysis.DeduplicateRelations$$anonfun$findAliases$1.applyOrElse(DeduplicateRelations.scala:530)
	at scala.PartialFunction.$anonfun$runWith$1$adapted(PartialFunction.scala:145)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at scala.collection.TraversableLike.collect(TraversableLike.scala:407)
	at scala.collection.TraversableLike.collect$(TraversableLike.scala:405)
	at scala.collection.AbstractTraversable.collect(Traversable.scala:108)
	at org.apache.spark.sql.catalyst.analysis.DeduplicateRelations$.findAliases(DeduplicateRelations.scala:530)
	at org.apache.spark.sql.catalyst.analysis.DeduplicateRelations$.org$apache$spark$sql$catalyst$analysis$DeduplicateRelations$$renewDuplicatedRelations(DeduplicateRelations.scala:120)
	at org.apache.spark.sql.catalyst.analysis.DeduplicateRelations$.apply(DeduplicateRelations.scala:40)
	at org.apache.spark.sql.catalyst.analysis.DeduplicateRelations$.apply(DeduplicateRelations.scala:38)
```
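The pattern in this change can be sketched as follows. This is a minimal, self-contained illustration, not Spark's actual code: `AnalysisLikeException` and `Schema` are hypothetical stand-ins for `AnalysisException` and the plan's schema, and the point is only that an explicit check throws the user-facing exception type where an `assert` would raise an `AssertionError`.

```scala
// Illustrative stand-in for AnalysisException.
case class AnalysisLikeException(message: String) extends Exception(message)

// Illustrative stand-in for the subquery plan's schema.
case class Schema(fields: Seq[String])

object ScalarSubquerySketch {
  def dataTypeOf(schema: Schema): String = {
    // Before the change this was an assert, so early callers
    // (e.g. DeduplicateRelations via Alias.toAttribute) saw an AssertionError.
    if (schema.fields.isEmpty) {
      throw AnalysisLikeException("Scalar subquery must return exactly one column")
    }
    schema.fields.head
  }
}
```

With the explicit check, every caller that reaches this method with an empty schema gets the same, consistent exception type.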

@cloud-fan
Contributor

The k8s failure is unrelated, I'm merging it to master, thanks!

@cloud-fan cloud-fan closed this in e0c79c6 Jul 20, 2023
@Hisoka-X
Member Author

Thanks @cloud-fan for your help and @allisonwang-db

@Hisoka-X Hisoka-X deleted the SPARK-43838_subquery_having branch July 20, 2023 02:22
cloud-fan pushed a commit that referenced this pull request Jul 29, 2023
…dRelations`

### What changes were proposed in this pull request?
This is a follow up PR for #41347 , add missing aggregate case in `renewDuplicatedRelations`

### Why are the changes needed?
add missing case

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
exist test.

Closes #42160 from Hisoka-X/SPARK-43838_subquery_aggregate_follow_up.

Authored-by: Jia Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
ragnarok56 pushed a commit to ragnarok56/spark that referenced this pull request Mar 2, 2024
…n't be optimized

(Commit message body is a verbatim duplicate of the PR description above.)
Closes apache#41347 from Hisoka-X/SPARK-43838_subquery_having.

Lead-authored-by: Jia Fan <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
ragnarok56 pushed a commit to ragnarok56/spark that referenced this pull request Mar 2, 2024
…dRelations`

(Commit message body is a verbatim duplicate of the follow-up commit above.)
Closes apache#42160 from Hisoka-X/SPARK-43838_subquery_aggregate_follow_up.

Authored-by: Jia Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>