[SPARK-12705] [SPARK-10777] [SQL] Analyzer Rule ResolveSortReferences #10678

gatorsmile · 2016-01-10T03:29:01Z

JIRA: https://issues.apache.org/jira/browse/SPARK-12705

Scope:
This PR is a general fix for sort reference resolution when the child's outputSet does not contain the order-by attributes (called missing attributes):

UnaryNode support is limited to Project, Window, Aggregate, Distinct, Filter, RepartitionByExpression.
We will not try to resolve the missing references inside a subquery, unless the outputSet of this subquery contains it.

General Reference Resolution Rules:

Skip over nodes of the following types: Distinct, Filter, RepartitionByExpression. No missing attributes need to be added to them, because their outputSet is determined by their inputSet, which is the outputSet of their children.
Group-by expressions in Aggregate: missing order-by attributes must not be added to the group-by expressions, since that would change the query result. RDBMSs do not allow this either.
Aggregate expressions in Aggregate: if the group-by expressions in Aggregate contain the missing attributes but the aggregate expressions do not, just add them to the aggregate expressions. This resolves the AnalysisExceptions thrown by the three TPCDS queries.
Project and Window are special. We just need to add the missing attributes to their projectList.

Implementation:

Traverse the whole tree in a pre-order manner to find all the resolvable missing order-by attributes.
Traverse the whole tree in a post-order manner to add the found missing order-by attributes to each node whose inputSet contains them.
If the origins of the missing order-by attributes are different nodes, each pass only resolves the missing attributes that are from the same node.
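The two traversals described above can be sketched on a toy plan tree. This is only an illustration under stated assumptions, not the code in Analyzer.scala: `Plan`, `Relation`, `Project`, `Filter`, `findResolvable` and `addMissing` are hypothetical stand-ins, attributes are plain strings, and only Project and Filter are modeled.

```scala
object SortRefSketch {
  // Hypothetical toy operators; these are NOT the Catalyst classes.
  sealed trait Plan { def output: Set[String] }
  case class Relation(cols: Set[String]) extends Plan {
    def output: Set[String] = cols
  }
  case class Project(projectList: Seq[String], child: Plan) extends Plan {
    def output: Set[String] = projectList.toSet
  }
  case class Filter(condition: String, child: Plan) extends Plan {
    // A Filter passes its child's output through unchanged.
    def output: Set[String] = child.output
  }

  // Pass 1 (pre-order): walk down from the top and return the missing
  // attributes supplied by the first Project whose input contains any of them.
  def findResolvable(plan: Plan, missing: Set[String]): Set[String] = plan match {
    case Relation(_) => Set.empty
    case Project(_, child) =>
      val here = missing.intersect(child.output)
      if (here.nonEmpty) here else findResolvable(child, missing)
    case Filter(_, child) => findResolvable(child, missing)
  }

  // Pass 2 (post-order): rebuild children first, then extend each Project
  // whose (rebuilt) input now contains the attributes.
  def addMissing(plan: Plan, attrs: Set[String]): Plan = plan match {
    case r: Relation => r
    case Filter(cond, child) => Filter(cond, addMissing(child, attrs))
    case Project(list, child) =>
      val newChild = addMissing(child, attrs)
      val extra = attrs.filter(a => newChild.output.contains(a) && !list.contains(a))
      Project(list ++ extra.toSeq.sorted, newChild)
  }
}
```

For a plan such as Project(a) over Filter over Project(a) over Relation(a, b, c) with a sort on c, the first pass finds c at the inner Project, and the second pass adds c to both Projects while leaving the Filter untouched.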

Risk:
Low. This rule is triggered iff !s.resolved && child.resolved is true. Thus, very few cases are affected.

rxin · 2016-01-10T04:12:01Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

@@ -523,14 +523,37 @@ class Analyzer(
    def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
      case s @ Sort(ordering, global, p @ Project(projectList, child))
          if !s.resolved && p.resolved =>
-        val (newOrdering, missing) = resolveAndFindMissing(ordering, p, child)
+        val (newOrdering, missing, newChild): (Seq[SortOrder], Seq[Attribute], LogicalPlan) =


can you add some comment here about why we need two separate cases.

Sure, will do

SparkQA · 2016-01-10T05:15:56Z

Test build #49052 has finished for PR 10678 at commit 5ca4630.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-01-10T07:24:08Z

Test build #49054 has finished for PR 10678 at commit da6baf2.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2016-01-10T16:33:24Z

retest this please.

SparkQA · 2016-01-10T17:09:05Z

Test build #49058 has finished for PR 10678 at commit da6baf2.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2016-01-10T17:15:51Z

retest this please.

SparkQA · 2016-01-10T17:39:23Z

Test build #49060 has finished for PR 10678 at commit da6baf2.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-01-10T19:07:58Z

Test build #49061 has finished for PR 10678 at commit b5de079.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2016-01-11T03:06:27Z

@davies Could you take a look at whether the fix covers the analysis resolution issue in TPCDS? Thank you!

davies · 2016-01-11T06:14:56Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

+          p.child match {
+            // Case 1: when WINDOW functions are used in the SELECT clause.
+            //   Example: SELECT sum(col1) OVER() FROM table1 ORDER BY col2
+            case p1 @ Project(_, w @ Window(_, _, _, _, p2: Project)) =>


What if the child of Window is not a Project (an Aggregate or something else)? Or what if there are multiple projections?

davies · 2016-01-11T06:34:28Z

@gatorsmile Thanks for working on this, but after this patch, TPCDS Q12/Q20/Q98 still can't be resolved: " cannot resolve 'i_item_id' given input columns: [i_class, itemrevenue, i_category, i_current_price".

You can see all these queries here: https://github.com/databricks/spark-sql-perf/blob/master/src/main/scala/com/databricks/spark/sql/perf/tpcds/TPCDS_1_4_Queries.scala#L612

gatorsmile · 2016-01-11T07:10:59Z

Let me do more investigation tomorrow. @davies : )

gatorsmile · 2016-01-11T22:11:37Z

I can reproduce the problem using a simple query now:

select area, rank() over (partition by area order by month) as c1
from windowData group by product, area, month order by product

The logical plan is like

'Sort ['product ASC], true
+- Project [area#1,c1#3]
   +- Project [area#1,month#0,c1#3,c1#3]
      +- Window [area#1,month#0], [rank() windowspecdefinition(area#1,month#0 ASC,ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS c1#3], [area#1], [month#0 ASC]
         +- Aggregate [product#2,area#1,month#0], [area#1,month#0]
            +- Subquery windowdata
               +- LogicalRDD [month#0,area#1,product#2], MapPartitionsRDD[1] at apply at Transformer.scala:22

This does not match the existing two patterns. Thus, it is unable to resolve the sorting columns. I will try to find and write a general rule to handle the cases, as you suggested.

gatorsmile · 2016-01-12T07:55:40Z

I found the sorting columns could be partially resolved by different nodes. For example,
select a.c1, b.c2 from t1 a, t1 b order by b.c3, a.c3

I need extra time to write the fix. It becomes more complex than what I thought.

gatorsmile · 2016-01-12T15:13:15Z

The example I posted above works in an RDBMS. However, supporting it in Spark SQL is not trivial. In this example, a Join operator is involved, but it does not have a projectList attribute. Thus, we are unable to directly add missing attributes to it. For now, only two operators, Project and Window, are doable. That means we might see more JIRAs complaining about it.

In the long term, maybe we can add projectList to all the logical operators? That indicates we integrate Project into all the logical operators.

I will check if the three TPCDS queries can be resolved without major code changes. Thanks!

davies · 2016-01-12T17:39:31Z

@gatorsmile There is a JIRA to have projectList in Join; we can do that later. For this PR, we may just push the missing attributes down until we reach a Join.

gatorsmile · 2016-01-12T18:07:55Z

Thank you! @davies Will upload the new version tonight.

gatorsmile · 2016-01-13T06:27:45Z

In this update, the code changes enhance the existing support for the following scenario: sort-by attributes that are not present in the SELECT clause.

Now, between the top Sort operator and the operator that can resolve the sorting columns, there can be one or more operators of type Aggregate, Project, Subquery and Window. We support only these four types because they are the only ones whose project list we can add sorting attributes to. In the future, we can easily extend the solution to support more types.

I am not sure whether we should change the collection used in this tail-recursive function. It is currently a Seq. If the plan tree is huge, performance could be poor. Thanks!

davies · 2016-01-13T06:44:32Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

@@ -73,6 +74,7 @@ class Analyzer(
      ResolveGroupingAnalytics ::
      ResolvePivot ::
      ResolveUpCast ::
+      ResolveAggregateFunctions ::


Why do we need to change the order?

The rule ResolveAggregateFunctions has a case for pushing down the expressions in the Sort operator to the underlying Aggregate. This needs to be done earlier than the rule ResolveSortReferences; otherwise, we might add unnecessary missing sorting attributes.

Depending on the order of rules within one batch is not a good idea. We should either add more if conditions to capture only the needed cases, or separate them into multiple batches.

Yeah, you are right. Will do the change to exclude the cases that can be resolved by ResolveAggregateFunctions. Thanks!

davies · 2016-01-19T18:56:06Z

DISTINCT sounds good, can we use DISTRIBUTE BY, CLUSTER BY together with ORDER BY? they all change the final order.

gatorsmile · 2016-01-19T18:58:49Z

I saw a test case in LogicalPlanToSQLSuite.scala.

  test("distribute by with sort by") {
    checkHiveQl("SELECT id FROM t0 DISTRIBUTE BY id SORT BY id")
  }

Do you think users could use it in this way?

davies · 2016-01-19T19:15:32Z

@gatorsmile Let's keep that (DISTRIBUTE BY, CLUSTER BY) for now (even though the query does not make sense to me).

gatorsmile · 2016-01-19T19:18:43Z

Thank you for your reviews! @davies

gatorsmile · 2016-01-19T19:48:27Z

Since we do not plan to support sort reference resolution inside subquery, should we just close the following JIRA? https://issues.apache.org/jira/browse/SPARK-10777

davies · 2016-01-19T20:09:30Z

@gatorsmile I think that JIRA is still valid: because the subquery already outputs the required attribute, we don't need to add the missing attributes into the project list inside the subquery. I believe this PR could fix that.

davies · 2016-01-19T20:09:57Z

Could you also include that in the title?

gatorsmile · 2016-01-19T21:04:08Z

Sure, let me add it.

SparkQA · 2016-01-20T08:22:27Z

Test build #49765 has finished for PR 10678 at commit 1964884.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2016-01-20T16:08:34Z

@davies The code is ready to review, after addressing all the comments.

@davies @cloud-fan The current implementation of ResolveAggregateFunctions also tries to resolve the missing sort references, but it has a limitation: the related logic is triggered iff Aggregate is the child of Sort. I am wondering whether we should move the related part of ResolveAggregateFunctions into the rule ResolveSortReferences. Thanks!

Update: I tried to make it more general, but it is complex. Thus, I did not change the original algorithm in ResolveAggregateFunctions and just call it from ResolveSortReferences.

marmbrus · 2016-01-25T23:11:09Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

+            case w: Window =>
+              w.copy(projectList = w.projectList ++
+                missingResolvableAttrs.filter((w.inputSet -- w.outputSet).contains))
+            case a: Aggregate =>


This case can never happen right?

The following query triggers this case. It is based on the failed TPCDS queries, so we added it as a test case. The column product is in the group-by clause but does not appear in aggregateExpressions. Thus, we hit this error if we want to sort the results by product.
select area, rank() over (partition by area order by month) as c1 from windowData group by product, area, month order by product, area

If we remove this case, we will get this error:

Failed to analyze query: org.apache.spark.sql.AnalysisException: resolved attribute(s) product#2 missing from area#1,c1#39 in operator !Sort [product#2 ASC,area#1 ASC], true;
Project [area#1,c1#48]
+- !Sort [product#2 ASC,area#1 ASC], true
   +- Project [area#1,c1#48]
      +- Project [area#1,month#0,c1#48,c1#48]
         +- Window [area#1,month#0], [rank(month#0) windowspecdefinition(area#1,month#0 ASC,ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS c1#48], [area#1], [month#0 ASC]
            +- Aggregate [product#2,area#1,month#0], [area#1,month#0]
               +- Subquery windowdata
                  +- LogicalRDD [month#0,area#1,product#2], MapPartitionsRDD[1] at apply at Transformer.scala:22
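The Aggregate case discussed in this review comment can be sketched as follows. This is a simplified illustration, not the patch itself: `Aggregate` here is a hypothetical toy class with string attributes, and `addMissing` is an assumed helper name. It shows the rule from the PR description: a missing sort attribute is appended to the aggregate expressions only when it already appears in the group-by expressions, and the group-by expressions are never changed.

```scala
object AggregateSortSketch {
  // Hypothetical toy stand-in for Catalyst's Aggregate; attributes are strings.
  case class Aggregate(
      groupingExpressions: Seq[String],
      aggregateExpressions: Seq[String])

  // Append a missing sort attribute to aggregateExpressions only if it is
  // already a grouping expression. groupingExpressions stay untouched, since
  // changing them would change the query result.
  def addMissing(agg: Aggregate, missing: Seq[String]): Aggregate = {
    val addable = missing.distinct.filter { m =>
      agg.groupingExpressions.contains(m) && !agg.aggregateExpressions.contains(m)
    }
    agg.copy(aggregateExpressions = agg.aggregateExpressions ++ addable)
  }
}
```

With grouping on product, area, month and aggregate expressions area, month, sorting by product extends the aggregate expressions with product, while sorting by a column outside the group-by clause leaves the node unchanged.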

SparkQA · 2016-01-31T01:25:00Z

Test build #50449 has finished for PR 10678 at commit ba02f46.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

davies · 2016-01-31T05:01:47Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

-      case s @ Sort(ordering, global, p @ Project(projectList, child))
-          if !s.resolved && p.resolved =>
-        val (newOrdering, missing) = resolveAndFindMissing(ordering, p, child)
+      case s @ Sort(_, _, a: Aggregate) if a.resolved =>


Do we need to check that !s.resolved? We have that in the next case.

To be more clear, ResolveSortReferences tries to resolve attributes in SortOrders, and ResolveAggregateFunctions tries to resolve aggregate functions in unexpected places (filter or sort), right?

So I think we should skip sort with aggregate functions here, i.e.
case s: Sort if s.order.exists(ResolveAggregateFunctions.containsAggregate) => s, and add comment to say this case should be handled in ResolveAggregateFunctions

There could be missing attributes together with aggregate functions, will that work?

@davies The missing attributes are also handled in ResolveAggregateFunctions. Thus it works. To answer your first question regarding !s.resolved, this is part of the algorithm design in the rule ResolveAggregateFunctions, as shown below: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L706-L708

@cloud-fan Sure, let me change it. Thanks!

So it seems that the rule in ResolveAggregateFunctions does not really resolve the missing attributes; we could keep that rule unchanged in this PR.

If it's not trivial to fix this, we could create another JIRA for that.

Unfortunately, it is not trivial. : ( So far, the current rule ResolveSortReferences can only handle the missing attributes. In this case, we have to push the aggregate function down to the underlying Aggregate. Thus, it does not work.

@cloud-fan I will let ResolveAggregateFunctions handle the missing attribute resolution as long as the child of Sort is Aggregate.

  // Skip sort with aggregate. This will be handled in ResolveAggregateFunctions
  case sa @ Sort(_, _, child: Aggregate) => sa

When rewriting ResolveSortReferences in another PR, I will try to make the behaviors of both rules identical for resolving the missing attributes.

Can we put the problematic query in an ignored test, just so we don't forget it?

Sure, I will add it.

davies · 2016-01-31T05:13:34Z

Left two comments, otherwise LGTM.

@marmbrus Could you take another pass on it?

SparkQA · 2016-01-31T09:48:06Z

Test build #50459 has finished for PR 10678 at commit ddfebbf.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-02-01T08:43:38Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

          s // Nothing we can do here. Return original plan.
+        } else {
+          // Add the missing attributes into projectList of Project/Window or
+          //   aggregateExpressions of Aggregate, if they are in the inputSet


nit: too many spaces

cloud-fan · 2016-02-01T08:45:39Z

LGTM, pending tests

rxin · 2016-02-01T08:49:28Z

Let's get @marmbrus to take a look at this one too.

SparkQA · 2016-02-01T09:24:31Z

Test build #50479 has finished for PR 10678 at commit c2964da.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

marmbrus · 2016-02-01T19:56:53Z

LGTM, thanks for adding a bunch of tests!

Merging to master.

gatorsmile · 2016-02-02T03:36:38Z

Thank you everyone! : )

gatorsmile added 2 commits January 9, 2016 19:21

window function: Sorting columns are not in Project

c2fcaa8

style fix.

5ca4630

rxin reviewed Jan 10, 2016

code cleaning and address comments.

da6baf2

Merge remote-tracking branch 'upstream/master' into sortWindows

b5de079

This was referenced Jan 10, 2016

[SPARK-12692][BUILD][STREAMING] Scala style: Fix the style violation (Space before "," or ":") #10685

Closed

[SPARK-4628][BUILD] Add a resolver to MiMaBuild.scala for mqttv3(1.0.1). #10688

Closed

davies reviewed Jan 11, 2016

address comments.

d164342

gatorsmile changed the title ~~[SPARK-12705] [SQL] AnalysisException: Sorting columns are not in Project of Window Function~~ [SPARK-12705] [SQL] AnalysisException: Sorting columns are not in the child operators Jan 13, 2016

davies reviewed Jan 13, 2016

gatorsmile changed the title ~~[SPARK-12705] [SQL] Analyzer Rule ResolveSortReferences~~ [SPARK-12705] [SPARK-10777] [SQL] Analyzer Rule ResolveSortReferences Jan 19, 2016

address comments.

1964884

marmbrus reviewed Jan 25, 2016

addressed comments.

ba02f46

davies reviewed Jan 31, 2016

gatorsmile added 2 commits January 30, 2016 22:56

address comments.

5bfda35

address comments.

ddfebbf

Added a test case that we need to fix in the next PR.

c2964da

cloud-fan reviewed Feb 1, 2016

asfgit closed this in 8f26eb5 Feb 1, 2016

gatorsmile deleted the sortWindows branch February 2, 2016 03:36
