[SPARK-23500][SQL] Fix complex type simplification rules to apply to entire plan #20687
Conversation
Test build #87739 has finished for PR 20687 at commit
retest this please
@@ -25,8 +25,8 @@ import org.apache.spark.sql.catalyst.rules.Rule
 * push down operations into [[CreateNamedStructLike]].
 */
object SimplifyCreateStructOps extends Rule[LogicalPlan] {
nit: can we merge these 3 rules? then we only need to transform the plan once.
+1 for @cloud-fan's advice.
cc @dongjoon-hyun Do you want to review this PR?
Test build #87763 has finished for PR 20687 at commit
Could you fix the JIRA number in PR title?
test("SPARK-23500: Simplify complex ops that aren't at the plan root") {
  val structRel = relation
    .select(GetStructField(CreateNamedStruct(Seq("att1", 'id)), 0, None) as "foo")
    .select('foo).analyze
@henryr Could you update the test cases properly? Actually, this will not provide the test coverage of your PR properly because of CollapseProject at line 40.
Thanks for the pointer. I replaced the projection with an aggregation.
Test build #87971 has finished for PR 20687 at commit
retest this please
Test build #88024 has finished for PR 20687 at commit
This is failing because of SPARK-23606, which seems unrelated (I haven't been able to trigger it in local builds, at least).
Retest this please.
Test build #88035 has finished for PR 20687 at commit
@@ -22,32 +22,24 @@ import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule

 /**
- * push down operations into [[CreateNamedStructLike]].
+ * Simplify redundant [[CreateNamedStructLike]], [[CreateArray]] and [[CreateMap]] expressions.
  */
 object SimplifyCreateStructOps extends Rule[LogicalPlan] {
SimplifyExtractValueOps?
Good point, done.
test("SPARK-23500: Simplify complex ops that aren't at the plan root") {
  // If nullable attributes aren't used, the array and map test cases fail because array
  // and map indexing can return null so the output is marked nullable.
why? I think the optimization is still valid, we should show this in the test, instead of hiding it with a nullable attribute.
The optimization works either way, but in (for example) the map case, m1 is marked as nullable in the original plan because presumably GetMapValue(CreateMap(...)) can return null if the key is not in the map. So for the expected plan to compare the same as the original, it has to be reading a nullable attribute - otherwise the plans don't pass comparePlans. I moved and reworded the comment to hopefully clarify this a bit.
There's an opportunity to fix this up again after the rule completes (since some attributes could be marked too conservatively as nullable). Do you think that's something we should pursue for this PR?
@dongjoon-hyun Have you finished the review?
comparePlans(Optimizer execute structRel, structExpected)
// If nullable attributes aren't used in the 'expected' plans, the array and map test
// cases fail because array and map indexing can return null so the output attribute
This explains why the original plan(before optimize) marks its output as nullable, but I'm confused why the optimized plan still marks its output as nullable.
It's a good question! I'm not too familiar with how nullability is marked and unmarked during planning. My understanding is roughly that the analyzer resolves all the plan's expressions and in doing so marks attributes as nullable or not. After that it's not clear that the optimizer revisits any of those nullability decisions. Is there an optimizer pass which should make nullability marking more precise?
nullable is mostly calculated on demand, so we don't have rules to change the nullable property. For this case, the expression is Alias(GetArrayItem(CreateArray(Attribute...))), which is nullable. After optimization, it becomes Alias(Attribute...) and is not nullable (if that attribute is not nullable). So the nullable flag is updated automatically.
I don't know why you hit this issue, please ping us if you can't figure it out, we can help to debug.
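The on-demand nullability computation described above can be sketched with a toy expression tree (plain Scala, not Catalyst's actual classes — all names here are illustrative):

```scala
// Toy model: nullability is computed from children, not stored state.
sealed trait Expr { def nullable: Boolean }

case class Attribute(name: String, nullable: Boolean) extends Expr

case class CreateArray(elems: Seq[Expr]) extends Expr {
  def nullable: Boolean = false // the array value itself is always produced
}

case class GetArrayItem(child: Expr, index: Int) extends Expr {
  def nullable: Boolean = true // out-of-bounds indexing can yield null
}

case class Alias(child: Expr, name: String) extends Expr {
  def nullable: Boolean = child.nullable // delegates to the child
}

// The simplification: indexing a freshly built array with a literal index
// just selects the element, so the array construction can be dropped.
def simplify(e: Expr): Expr = e match {
  case Alias(GetArrayItem(CreateArray(elems), i), name)
      if i >= 0 && i < elems.size =>
    Alias(elems(i), name)
  case other => other
}

val before = Alias(GetArrayItem(CreateArray(Seq(Attribute("id", nullable = false))), 0), "x")
val after  = simplify(before)

assert(before.nullable)  // Alias(GetArrayItem(...)) is nullable
assert(!after.nullable)  // Alias(Attribute...) recomputes nullable = false
```

Because `nullable` is a method that recurses into children, the optimized tree reports the tighter value with no extra bookkeeping — which mirrors why Catalyst usually needs no rule to "update" nullability after simplification.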
Thanks, that's plenty of information to get started - I'll dig into it.
@cloud-fan I looked again at this briefly this morning. The issue is that it's the AttributeReference in the top-level Aggregate's groupingExpressions that has inconsistent nullability. The AttributeReference in the original plan was originally created with nullable=true, before optimization. So at that point it's kind of fixed unless the optimizer dereferences the attribute reference and realises that the target is no longer nullable.
good catch! Let's explain this in the test and fix it in a follow-up. We can just add a new rule to transform the plan and update the nullability.
Done, thanks. I filed SPARK-23634 to fix this. Out of interest, why does AttributeReference cache the nullability of its referent? Is it because comparison is too expensive to do if you have to follow a level of indirection to get to the original attribute?
Because AttributeReference is not only used as a reference to an attribute from children, but also for the new attributes produced by leaf nodes, which have to carry the nullable info. It's not ideal but it's too late to change now.
Test build #88056 has finished for PR 20687 at commit
@@ -22,32 +22,24 @@ import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.rules.Rule

 /**
- * push down operations into [[CreateNamedStructLike]].
+ * Simplify redundant [[CreateNamedStructLike]], [[CreateArray]] and [[CreateMap]] expressions.
  */
nit. Could you fix the indentation?
Done.
 override def apply(plan: LogicalPlan): LogicalPlan = {
   plan.transformExpressionsUp {
-    // push down field selection (array of structs)
+    // Remove redundant array indexing.
     case GetArrayStructFields(CreateArray(elems), field, ordinal, numFields, containsNull) =>
nit.
case GetArrayStructFields(CreateArray(elems), field, ordinal, _, _) =>
Done
 case GetArrayStructFields(CreateArray(elems), field, ordinal, numFields, containsNull) =>
   // instead f selecting the field on the entire array,
   // select it from each member of the array.
   // pushing down the operation this way open other optimizations opportunities
   // (i.e. struct(...,x,...).x)
   CreateArray(elems.map(GetStructField(_, ordinal, Some(field.name))))
-  // push down item selection.
+  // Remove redundant map lookup.
 case ga @ GetArrayItem(CreateArray(elems), IntegerLiteral(idx)) =>
   // instead of creating the array and then selecting one row,
   // remove array creation altgether.
altgether -> altogether?
Done.
I didn't retrigger Jenkins due to the existing comment.
…entire plan

## What changes were proposed in this pull request?

Complex type simplification optimizer rules were not applied to the entire plan, just the expressions reachable from the root node. This patch fixes the rules to transform the entire plan.

## How was this patch tested?

New unit test + sql / core tests.
 */
object SimplifyExtractValueOps extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform { case p =>
    p.transformExpressionsUp {
@dongjoon-hyun, is it safe to simplify it for Aggregate?
Sorry for the late response, @gatorsmile. These are expression-level optimization rules. If the original expressions exist in SELECT, GROUP BY, and HAVING, those are simplified in the same way together. Do you have any concerning cases?
aggregateExpressions are resolved from groupingExpressions using semanticEquals, while referring to names from input.
The expression-level optimizer simplifies both aggregateExpressions and groupingExpressions together. If the target expression exists somewhere on both sides, the simplified expression also exists at the same locations on both sides. Given that, semanticEquals will work for the updated expressions.
how about select struct(a, b).a from t group by struct(a, b)? We may optimize it to select a from t group by struct(a, b), which is invalid.
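This hazard can be sketched with a toy validity check (plain Scala; expressions are stood in for by strings, and the resolution rule is deliberately simplified — this is illustrative, not Catalyst's actual check):

```scala
// An aggregate is valid only if every (non-aggregate) select expression
// matches a grouping expression, or extracts a field from one.
case class ToyAggregate(groupings: Set[String], selects: Seq[String]) {
  def resolved: Boolean = selects.forall { s =>
    groupings.contains(s) || groupings.exists(g => s.startsWith(g + "."))
  }
}

// select struct(a, b).a from t group by struct(a, b)   -- valid
val before = ToyAggregate(Set("struct(a, b)"), Seq("struct(a, b).a"))

// Simplifying only the select list to `a` breaks the correspondence:
// select a from t group by struct(a, b)                -- invalid
val after = ToyAggregate(Set("struct(a, b)"), Seq("a"))

assert(before.resolved)
assert(!after.resolved)
```

The point is that the select side and the grouping side must stay in sync; a rewrite that touches one without the other produces a plan that no longer resolves.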
Oh, right. I missed considering that kind of case.
Since map is not orderable, that happens for struct and array types.
(Ignore my previous comment here, which was mistaken).
Test build #88135 has finished for PR 20687 at commit
val mapExpected = relation
  .select('nullable_id as "m1")
  .groupBy($"m1")("1").analyze
comparePlans(Optimizer execute mapRel, mapExpected)
@henryr Could you add more test cases mentioned today, for example, like the following? We need a test case for array, too.

val structRel = relation.groupBy(
  CreateNamedStruct(Seq("att1", 'nullable_id)))(
  GetStructField(CreateNamedStruct(Seq("att1", 'nullable_id)), 0, None)).analyze
comparePlans(Optimizer execute structRel, structRel)
 * Simplify redundant [[CreateNamedStructLike]], [[CreateArray]] and [[CreateMap]] expressions.
 */
object SimplifyExtractValueOps extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform { case p =>
@henryr You can change like the following in order to avoid Aggregate.

override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
  case a: Aggregate => a
  case p => p.transformExpressionsUp {
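The effect of this guard can be sketched on a toy plan tree (plain Scala stand-ins, not Catalyst's Aggregate/Project; the string-based simplifyExpr is a placeholder for the real expression rules):

```scala
sealed trait Plan { def exprs: Seq[String]; def children: Seq[Plan] }
case class Aggregate(exprs: Seq[String], children: Seq[Plan]) extends Plan
case class Project(exprs: Seq[String], children: Seq[Plan]) extends Plan

// Placeholder for SimplifyExtractValueOps on a single expression:
// struct(a, b).a  =>  a
def simplifyExpr(e: String): String =
  if (e == "struct(a, b).a") "a" else e

// The guard: leave Aggregate's own expressions alone, but keep
// recursing so the rule still applies everywhere below it.
def applyRule(plan: Plan): Plan = plan match {
  case a: Aggregate => a.copy(children = a.children.map(applyRule))
  case p: Project   => Project(p.exprs.map(simplifyExpr), p.children.map(applyRule))
}

val plan = Aggregate(Seq("struct(a, b).a"), Seq(Project(Seq("struct(a, b).a"), Nil)))
val out  = applyRule(plan)

assert(out.exprs == Seq("struct(a, b).a"))   // untouched under Aggregate
assert(out.children.head.exprs == Seq("a"))  // simplified inside the Project
```

Note that, like Catalyst's transform, the guard only skips the Aggregate node's own expressions; its children are still visited, so the optimization is not lost for the rest of the plan.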
Test build #88280 has finished for PR 20687 at commit
@@ -19,57 +19,47 @@ package org.apache.spark.sql.catalyst.optimizer

 import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.catalyst.plans.logical.Aggregate
This should be before line 21 in alphabetical order. You can check this locally with dev/scalastyle.
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan}
  .select('nullable_id as "m1")
  .groupBy($"m1")("1").analyze
comparePlans(Optimizer execute mapRel, mapExpected)
It seems that the current test case has become too long. For the following negative cases, let's split them into a separate test case, maybe with the following title?
test("SPARK-23500: Aggregation expressions should not be simplified.")
The fix looks good to me, but the test coverage is not enough.
@henryr Thanks for your great work!
@gatorsmile thank you for the reviews! Are there specific test cases you'd like to see? I've checked correlated and uncorrelated subqueries, various flavours of join, aggregates with HAVING clauses, nested compound types, and so on.
@henryr Please try to add the test cases that matter in your opinion. I will also submit a follow-up PR to add more test cases after this PR is merged.
@gatorsmile ok, I think the coverage right now is a reasonable start - the other test cases I can think of would act more like they're exercising the expression-walking code, not the actual simplification. Looking forward to collaborating on the follow-up PR.
Test build #88391 has finished for PR 20687 at commit
Thanks! Merged to master.
Will submit a separate PR for tests only.
… apply to entire plan

## What changes were proposed in this pull request?

This PR is to improve the test coverage of the original PR #20687

## How was this patch tested?

N/A

Author: gatorsmile <[email protected]>

Closes #20911 from gatorsmile/addTests.
What changes were proposed in this pull request?
Complex type simplification optimizer rules were not applied to the
entire plan, just the expressions reachable from the root node. This
patch fixes the rules to transform the entire plan.
How was this patch tested?
New unit test + ran sql / core tests.