
[SPARK-31670][SQL] Trim unnecessary Struct field alias in Aggregate/GroupingSets #28490

Closed
wants to merge 28 commits into from

Conversation

AngersZhuuuu
Contributor

@AngersZhuuuu AngersZhuuuu commented May 10, 2020

What changes were proposed in this pull request?

A struct field that appears both in GROUP BY and in an aggregate expression fails analysis when combined with CUBE/ROLLUP/GROUPING SETS.

test("SPARK-31670") {
  withTable("t1") {
      sql(
        """
          |CREATE TEMPORARY VIEW t(a, b, c) AS
          |SELECT * FROM VALUES
          |('A', 1, NAMED_STRUCT('row_id', 1, 'json_string', '{"i": 1}')),
          |('A', 2, NAMED_STRUCT('row_id', 2, 'json_string', '{"i": 1}')),
          |('A', 2, NAMED_STRUCT('row_id', 2, 'json_string', '{"i": 2}')),
          |('B', 1, NAMED_STRUCT('row_id', 3, 'json_string', '{"i": 1}')),
          |('C', 3, NAMED_STRUCT('row_id', 4, 'json_string', '{"i": 1}'))
        """.stripMargin)

      checkAnswer(
        sql(
          """
            |SELECT a, c.json_string, SUM(b)
            |FROM t
            |GROUP BY a, c.json_string
            |WITH CUBE
            |""".stripMargin),
        Row("A", "{\"i\": 1}", 3) :: Row("A", "{\"i\": 2}", 2) :: Row("A", null, 5) ::
          Row("B", "{\"i\": 1}", 1) :: Row("B", null, 1) ::
          Row("C", "{\"i\": 1}", 3) :: Row("C", null, 3) ::
          Row(null, "{\"i\": 1}", 7) :: Row(null, "{\"i\": 2}", 2) :: Row(null, null, 9) :: Nil)

  }
}

Error 

[info] - SPARK-31670 *** FAILED *** (2 seconds, 857 milliseconds)
[info]   Failed to analyze query: org.apache.spark.sql.AnalysisException: expression 't.`c`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;
[info]   Aggregate [a#247, json_string#248, spark_grouping_id#246L], [a#247, c#223.json_string AS json_string#241, sum(cast(b#222 as bigint)) AS sum(b)#243L]
[info]   +- Expand [List(a#221, b#222, c#223, a#244, json_string#245, 0), List(a#221, b#222, c#223, a#244, null, 1), List(a#221, b#222, c#223, null, json_string#245, 2), List(a#221, b#222, c#223, null, null, 3)], [a#221, b#222, c#223, a#247, json_string#248, spark_grouping_id#246L]
[info]      +- Project [a#221, b#222, c#223, a#221 AS a#244, c#223.json_string AS json_string#245]
[info]         +- SubqueryAlias t
[info]            +- Project [col1#218 AS a#221, col2#219 AS b#222, col3#220 AS c#223]
[info]               +- Project [col1#218, col2#219, col3#220]
[info]                  +- LocalRelation [col1#218, col2#219, col3#220]
[info]

When a struct field is resolved, it is wrapped in an Alias. When a struct field appears in GROUP BY with CUBE/ROLLUP etc., the occurrences in groupByExpressions and aggregateExpressions are resolved into aliases with different exprIds, as below:

'Aggregate [cube(a#221, c#223.json_string AS json_string#240)], [a#221, c#223.json_string AS json_string#241, sum(cast(b#222 as bigint)) AS sum(b)#243L]
+- SubqueryAlias t
   +- Project [col1#218 AS a#221, col2#219 AS b#222, col3#220 AS c#223]
      +- Project [col1#218, col2#219, col3#220]
         +- LocalRelation [col1#218, col2#219, col3#220]

This makes ResolveGroupingAnalytics.constructAggregateExprs() fail to replace the aggregate expression with the Expand's groupByExpression attribute, since their exprIds differ, and the error above is raised.
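A minimal, self-contained sketch of the mismatch, using hypothetical toy case classes (not Spark's actual Catalyst types): each resolution of the same struct-field reference mints a fresh exprId, so exprId-based substitution fails even though the alias-trimmed expressions are identical.

```scala
// Toy stand-ins for Catalyst's GetStructField/Alias; names are illustrative only.
object ExprIdMismatchSketch {
  private var counter = 0L
  private def newExprId(): Long = { counter += 1; counter }

  sealed trait Expr
  case class GetStructField(struct: String, field: String) extends Expr
  // Resolution wraps the extraction in an Alias carrying a *fresh* exprId.
  case class Alias(child: Expr, name: String, exprId: Long) extends Expr

  def resolve(struct: String, field: String): Alias =
    Alias(GetStructField(struct, field), field, newExprId())

  def main(args: Array[String]): Unit = {
    // The GROUP BY side and the aggregate side each resolve `c.json_string` independently.
    val groupBySide = resolve("c", "json_string")
    val aggSide     = resolve("c", "json_string")

    // Matching by exprId fails, which is what breaks constructAggregateExprs()...
    assert(groupBySide.exprId != aggSide.exprId)
    // ...while the alias-trimmed expressions are structurally equal, which is
    // why trimming the unnecessary Alias makes the substitution succeed.
    assert(groupBySide.child == aggSide.child)
  }
}
```

Comparing the alias-trimmed `child` instead of the aliased node is, in essence, what trimming the unnecessary Alias buys the analyzer.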

Why are the changes needed?

Fixes the analysis bug described above.

Does this PR introduce any user-facing change?

NO

How was this patch tested?

Added UT

@SparkQA

SparkQA commented May 10, 2020

Test build #122477 has finished for PR 28490 at commit 27c495b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AngersZhuuuu AngersZhuuuu changed the title [SPARK-31670][WIP]Struct Field in groupByExpr with CUBE [SPARK-31670][SQL]Struct Field in groupByExpr with CUBE May 21, 2020
@AngersZhuuuu
Contributor Author

cc @maropu Could you take a look?

def isPartOfAggregation(e: Expression): Boolean = {
aggsBuffer.exists(a => a.find(_ eq e).isDefined)
gid: Attribute): Seq[NamedExpression] = {
val resolvedGroupByAliases = groupByAliases.map(_.transformDown {
maropu (Member) commented Jun 3, 2020

Probably, we should not fix this issue in ResolveGroupingAnalytics, but in ResolveReferences, just like this:

object ResolveReferences extends Rule[LogicalPlan] {
    ...
    def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUp {
      case p: LogicalPlan if !p.childrenResolved => p
      ....

      case a @ Aggregate(groupingExprs, aggExprs, _) if both `groupingExprs` and `aggExprs` have the same struct field =>
       val newAgg = resolve expressions so that they have the same exprIds
       newAgg

A root cause seems to be that ResolveReferences assigns different exprIds to each#30.json_string AS json_strings (#31 vs #32);

20/06/03 16:22:47 WARN HiveSessionStateBuilder$$anon$1: 
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences ===
!'Aggregate [cube('a, 'get_json_object('each.json_string, $.iType))], ['a, 'coalesce('get_json_object('each.json_string, $.iType), -127) AS iType#29, unresolvedalias('sum('b), None)]
   +- Generate explode(c#4), false, x, [each#30]
      +- SubqueryAlias t
         +- Project [x AS a#2, 1 AS b#3, array(named_struct(row_id, 1, json_string, y)) AS c#4]
            +- Range (0, 1, step=1, splits=Some(4))

'Aggregate [cube(a#2, 'get_json_object(each#30.json_string AS json_string#31, $.iType))], [a#2, 'coalesce('get_json_object(each#30.json_string AS json_string#32, $.iType), -127) AS iType#29, unresolvedalias('sum(b#3), None)]
   +- Generate explode(c#4), false, x, [each#30]
      +- SubqueryAlias t
         +- Project [x AS a#2, 1 AS b#3, array(named_struct(row_id, 1, json_string, y)) AS c#4]
            +- Range (0, 1, step=1, splits=Some(4))

Member

Based on the above analysis, it seems this is caused by ResolveReferences instead of ResolveGroupingAnalytics? Can we possibly have a reproducible case without CUBE?

Contributor Author

Based on the above analysis, it seems this is caused by ResolveReferences instead of ResolveGroupingAnalytics? Can we possibly have a reproducible case without CUBE?

This happens with CUBE, when the CUBE LogicalPlan is constructed.

Member

The query plan shown above seems to be from before ResolveGroupingAnalytics? So is it possible to hit a similar issue without CUBE? @maropu

Contributor Author

A root cause seems to be that ResolveReferences assigns different exprIds to each#30.json_string AS json_strings (#31 vs #32);

Yea, GetStructField gets wrapped in an Alias in ResolveReferences and receives a different exprId each time.

Contributor Author

Probably, we should not fix this issue in ResolveGroupingAnalytics, but in ResolveReferences, just like this:

object ResolveReferences extends Rule[LogicalPlan] {
    ...
    def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUp {
      case p: LogicalPlan if !p.childrenResolved => p
      ....

      case a @ Aggregate(groupingExprs, aggExprs, _) if both `groupingExprs` and `aggExprs` have the same struct field =>
       val newAgg = resolve expressions so that they have the same exprIds
       newAgg

The HAVING clause may also have this problem. I hadn't thought through enough cases.

Member

The query plan shown above seems to be from before ResolveGroupingAnalytics? So is it possible to hit a similar issue without CUBE? @maropu

Yea, all the cases have the same issue;

scala> spark.range(1).selectExpr("'x' AS a", "1 AS b", "array(named_struct('row_id', 1, 'json_string', 'y')) AS c").createOrReplaceTempView("t")

// ROLLUP
scala> sql("""
     |   select a, coalesce(get_json_object(each.json_string,'$.iType'),'-127') as iType, sum(b)
     |   from t
     |   LATERAL VIEW explode(c) x AS each
     |   group by a, get_json_object(each.json_string,'$.iType')
     |   with rollup
     | """).show()
org.apache.spark.sql.AnalysisException: expression 'x.`each`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;
Aggregate [a#17, get_json_object(each#9.json_string AS json_string#10, $.iType)#18, spark_grouping_id#16L], [a#17, coalesce(get_json_object(each#9.json_string, $.iType), -127) AS iType#8, sum(cast(b#3 as bigint)) AS sum(b)#13L]
+- Expand [ArrayBuffer(a#2, b#3, c#4, each#9, a#14, get_json_object(each#9.json_string AS json_string#10, $.iType)#15, 0), ArrayBuffer(a#2, b#3, c#4, each#9, a#14, null, 1), ArrayBuffer(a#2, b#3, c#4, each#9, null, null, 3)], [a#2, b#3, c#4, each#9, a#17, get_json_object(each#9.json_string AS json_string#10, $.iType)#18, spark_grouping_id#16L]
   +- Project [a#2, b#3, c#4, each#9, a#2 AS a#14, get_json_object(each#9.json_string, $.iType) AS get_json_object(each#9.json_string AS json_string#10, $.iType)#15]
      +- Generate explode(c#4), false, x, [each#9]
         +- SubqueryAlias t
            +- Project [x AS a#2, 1 AS b#3, array(named_struct(row_id, 1, json_string, y)) AS c#4]
               +- Range (0, 1, step=1, splits=Some(4))

// GROUPING SETS
scala> sql("""
     |   select a, coalesce(get_json_object(each.json_string,'$.iType'),'-127') as iType, sum(b)
     |   from t
     |   LATERAL VIEW explode(c) x AS each
     |   group by grouping sets((a, get_json_object(each.json_string,'$.iType')))
     | """).show()
org.apache.spark.sql.AnalysisException: expression 'x.`each`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;
Aggregate [a#28, get_json_object(each#20.json_string AS json_string#21, $.iType)#29, spark_grouping_id#27L], [a#28, coalesce(get_json_object(each#20.json_string, $.iType), -127) AS iType#19, sum(cast(b#3 as bigint)) AS sum(b)#24L]
+- Expand [ArrayBuffer(a#2, b#3, c#4, each#20, a#25, get_json_object(each#20.json_string AS json_string#21, $.iType)#26, 0)], [a#2, b#3, c#4, each#20, a#28, get_json_object(each#20.json_string AS json_string#21, $.iType)#29, spark_grouping_id#27L]
   +- Project [a#2, b#3, c#4, each#20, a#2 AS a#25, get_json_object(each#20.json_string, $.iType) AS get_json_object(each#20.json_string AS json_string#21, $.iType)#26]
      +- Generate explode(c#4), false, x, [each#20]
         +- SubqueryAlias t
            +- Project [x AS a#2, 1 AS b#3, array(named_struct(row_id, 1, json_string, y)) AS c#4]
               +- Range (0, 1, step=1, splits=Some(4))

@@ -3495,6 +3495,59 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark
assert(df4.schema.head.name === "randn(1)")
checkIfSeedExistsInExplain(df2)
}

test("SPARK-31670: Struct Field in groupByExpr with CUBE") {
withTable("t1") {
Member

nit: t1 -> t

|c array<struct<row_id:int,json_string:string>>,
|d array<array<string>>,
|e array<map<string, int>>)
|using orc""".stripMargin)
maropu (Member) commented Jun 3, 2020

Please use a temp view for test performance. Also, it's better to add some rows to this test table for answer checks, instead of the current empty table.

checkAnswer(
sql(
"""
|select a, each.json_string, sum(b)
Member

nit: Could you use uppercase for the SQL keywords where possible?

|from t1
|LATERAL VIEW explode(c) x AS each
|group by a, each.json_string
|with cube
Member

Could you check the other analytics grouping, too, e.g., GROUPING SETS and ROLLUP?

Contributor Author

Could you check the other analytics grouping, too, e.g., GROUPING SETS and ROLLUP?

Hmm, GROUPING SETS still has the problem.

@SparkQA

SparkQA commented Jun 4, 2020

Test build #123516 has finished for PR 28490 at commit 4c0b04c.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 6, 2020

Test build #123589 has finished for PR 28490 at commit c4ff823.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -510,10 +510,12 @@ class Analyzer(
// collect all the found AggregateExpression, so we can check an expression is part of
// any AggregateExpression or not.
val aggsBuffer = ArrayBuffer[Expression]()

maropu (Member) commented Jun 7, 2020

nit: please revert the unnecessary changes. (Unrelated changes can sometimes lead to revert/backport failures...)

Contributor Author

nit: please revert the unnecessary changes. (Unrelated changes can sometimes lead to revert/backport failures...)

Sorry..., I forgot to check the diff.

|c ARRAY<STRUCT<row_id:INT,json_string:STRING>>,
|d ARRAY<ARRAY<STRING>>,
|e ARRAY<MAP<STRING, INT>>)
|USING ORC""".stripMargin)
Contributor Author

See: #28490 (comment)

Test case changed to use a temp view with actual data.

@@ -3496,6 +3496,88 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark
checkIfSeedExistsInExplain(df2)
}

test("SPARK-31670: Struct Field in groupByExpr with CUBE") {
Member

Could you please make the title more correct? I think we don't need the word with CUBE.

Contributor Author

Could you please make the title more correct? I think we don't need the word with CUBE.

Is the current PR title OK?

Member

Yea, looks fine.

|LATERAL VIEW EXPLODE(c) X AS each
|GROUP BY a, each.json_string
|GROUPING sets((a),(a, each.json_string))
|""".stripMargin), Nil)
Member

Could you add tests having queries with HAVING clauses?

Contributor Author

Could you add tests having queries with HAVING clauses?

It seems I made a mistake: HAVING won't hit the duplicate-field issue, since the HAVING condition doesn't contain grouping keys.

"""
|SELECT a, each.json_string AS json_string, SUM(b)
|FROM t
|LATERAL VIEW EXPLODE(c) x AS each
Member

btw, do we need LATERAL VIEW to reproduce this issue? I mean, can this issue happen without LATERAL VIEW?

Contributor Author

btw, do we need LATERAL VIEW to reproduce this issue? I mean, can this issue happen without LATERAL VIEW?

No, it isn't required; I've changed the test.

case q: LogicalPlan =>
logTrace(s"Attempting to resolve ${q.simpleString(SQLConf.get.maxToStringFields)}")
q.mapExpressions(resolveExpressionTopDown(_, q))
}

def needResolveStructField(plan: LogicalPlan): Boolean = {
Member

private

}
}

def containSameStructFields(
Member

private

Member

Is it not enough to just check whether both sides (grpExprs and aggExprs) have struct fields here? Do we need to confirm the identity by using unresolved attributes?

Contributor Author

Is it not enough to just check whether both sides (grpExprs and aggExprs) have struct fields here? Do we need to confirm the identity by using unresolved attributes?

Yea, the attributes here are still unresolved attributes.

Contributor Author

Is it not enough to just check whether both sides (grpExprs and aggExprs) have struct fields here?

GroupingSets needs to check the selected columns too, so all of them need to be checked.

Member

I think it's better to avoid comparing unresolved attributes... could we resolve them and then detect the mismatched exprIds?

Contributor Author

I think it's better to avoid comparing unresolved attributes... could we resolve them and then detect the mismatched exprIds?

How about the current version? No check, just resolve.

}

def containSameStructFields(
grpExprs: Seq[Attribute],
Member

nit: grpExprs -> groupExprs for consistency.

Contributor Author

nit: grpExprs -> groupExprs for consistency.

Done

case p: LogicalPlan if needResolveStructField(p) =>
logTrace(s"Attempting to resolve ${p.simpleString(SQLConf.get.maxToStringFields)}")
val resolved = p.mapExpressions(resolveExpressionTopDown(_, p))
val structFieldMap = new mutable.HashMap[String, Alias]
Member

mutable.HashMap -> mutable.Map

Contributor Author

mutable.HashMap -> mutable.Map

Done

@AngersZhuuuu AngersZhuuuu changed the title [SPARK-31670][SQL]Struct Field in groupByExpr with CUBE [SPARK-31670][SQL]Resolve Struct Field in Aggregate with same ExprId Jun 7, 2020
@AngersZhuuuu AngersZhuuuu changed the title [SPARK-31670][SQL]Resolve Struct Field in Aggregate with same ExprId [SPARK-31670][SQL]Resolve Struct Field in Grouping Aggregate with same ExprId Jun 7, 2020
@SparkQA

SparkQA commented Jun 7, 2020

Test build #123597 has finished for PR 28490 at commit 1ee0542.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 7, 2020

Test build #123595 has finished for PR 28490 at commit e28b084.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Jun 7, 2020

retest this please

@@ -1259,6 +1259,11 @@ class Analyzer(
attr.withExprId(exprId)
}

private def dedupStructField(attr: Alias, structFieldMap: Map[String, Attribute]) = {
Member

Not used now?

Contributor Author

Not used now?

Oh, yea

@@ -1481,7 +1486,35 @@ class Analyzer(

case q: LogicalPlan =>
logTrace(s"Attempting to resolve ${q.simpleString(SQLConf.get.maxToStringFields)}")
Member

Could you write this handling as an independent pattern, like this?

      case agg @ (_: Aggregate | _: GroupingSets) =>
        val resolved = agg.mapExpressions(resolveExpressionTopDown(_, agg))
        val structFieldMap = mutable.Map[String, Alias]()
        resolved.transformExpressionsDown {
          // Please add a comment describing why we need this handling...
          case a @ Alias(struct: GetStructField, _) =>
            if (structFieldMap.contains(struct.sql)) {
              val exprId = structFieldMap.getOrElse(struct.sql, a).exprId
              Alias(a.child, a.name)(exprId, a.qualifier, a.explicitMetadata)
            } else {
              structFieldMap += (struct.sql -> a)
              a
            }
          case e => e
        }

      case q: LogicalPlan =>
        logTrace(s"Attempting to resolve ${q.simpleString(SQLConf.get.maxToStringFields)}")
        q.mapExpressions(resolveExpressionTopDown(_, q))

Member

w/ some code cleanup;

      case agg @ (_: Aggregate | _: GroupingSets) =>
        val resolved = agg.mapExpressions(resolveExpressionTopDown(_, agg))
        val hasStructField = resolved.expressions.exists {
          _.collectFirst { case gsf: GetStructField => gsf }.isDefined
        }
        if (hasStructField) {
          // Please add a comment describing why we need this handling...
          val structFieldMap = mutable.Map[String, Alias]()
          resolved.transformExpressionsDown {
            case a @ Alias(struct: GetStructField, _) =>
              if (structFieldMap.contains(struct.sql)) {
                val exprId = structFieldMap.getOrElse(struct.sql, a).exprId
                Alias(a.child, a.name)(exprId, a.qualifier, a.explicitMetadata)
              } else {
                structFieldMap += (struct.sql -> a)
                a
              }
          }
        } else {
          resolved
        }
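The dedup idea in the suggestion above can be shown runnable and self-contained with toy case classes (hypothetical stand-ins, not Spark's Catalyst classes): keep the first Alias seen per struct-field SQL string, and rewrite later duplicates to reuse its exprId.

```scala
import scala.collection.mutable

object DedupStructFieldAliasSketch {
  // Toy stand-ins for Catalyst nodes; `sql` plays the role of struct.sql above.
  case class GetStructField(sql: String)
  case class Alias(child: GetStructField, name: String, exprId: Long)

  // Rewrite duplicate struct-field aliases to share the first-seen exprId.
  def dedup(aliases: Seq[Alias]): Seq[Alias] = {
    val structFieldMap = mutable.Map[String, Alias]()
    aliases.map { a =>
      structFieldMap.get(a.child.sql) match {
        case Some(first) => a.copy(exprId = first.exprId)
        case None =>
          structFieldMap(a.child.sql) = a
          a
      }
    }
  }

  def main(args: Array[String]): Unit = {
    // Mirrors the #31 vs #32 situation from the analysis earlier in the thread.
    val in = Seq(
      Alias(GetStructField("c.json_string"), "json_string", exprId = 31L),
      Alias(GetStructField("c.json_string"), "json_string", exprId = 32L))
    assert(dedup(in).map(_.exprId) == Seq(31L, 31L))
  }
}
```

Keying the map on the expression's SQL text, as the suggestion does with `struct.sql`, makes two occurrences of the same struct field compare equal regardless of their exprIds.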

Contributor Author

w/ some code cleanup;

Done

@maropu
Member

maropu commented Jun 7, 2020

Could you check this? @cloud-fan @viirya

@SparkQA

SparkQA commented Jun 7, 2020

Test build #123603 has finished for PR 28490 at commit 1ee0542.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 7, 2020

Test build #123605 has finished for PR 28490 at commit 0af3166.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 10, 2020

Test build #127288 has finished for PR 28490 at commit 0af3166.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Comment on lines 1460 to 1463
// trim Alias over top-level GetStructField
case Alias(s: GetStructField, _) => s
case other => other
}
Member

I think we can add a comment explaining that, because these expressions are not named expressions originally, we can safely trim the top-level Alias.

Contributor Author

I think we can add a comment explaining that, because these expressions are not named expressions originally, we can safely trim the top-level Alias.

Extracted this part as a function and added a comment. cc @cloud-fan

Comment on lines 1354 to 1357
val result = resolved match {
case Alias(s: GetStructField, _) if trimAlias && !isTopLevel => s
case others => others
}
Member

Actually I'm wondering if there are any cases where we don't want to trim a nested (i.e., non-top-level) Alias? Such an Alias is useless and could possibly cause unexpected issues.

Contributor Author

Actually I'm wondering if there are any cases where we don't want to trim a nested (i.e., non-top-level) Alias? Such an Alias is useless and could possibly cause unexpected issues.

Since we call CleanupAliases later in the Analyzer, this alias will be trimmed anyway, but if we don't handle it here, we can't pass ResolveGroupingAnalytics.constructAggregateExprs().
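The `trimAlias && !isTopLevel` rule under discussion can be sketched with hypothetical toy types (not the real Catalyst classes): a top-level Alias names an output column and is kept, while a nested Alias over a struct-field extraction carries no useful name and is dropped, so attribute matching downstream can succeed.

```scala
object TrimNestedAliasSketch {
  sealed trait Expr
  case class GetStructField(struct: String, field: String) extends Expr
  case class Alias(child: Expr, name: String) extends Expr
  case class Sum(child: Expr) extends Expr

  // Trim Alias over GetStructField everywhere except at the top level,
  // mirroring the `trimAlias && !isTopLevel` condition quoted above.
  def trim(e: Expr, isTopLevel: Boolean = true): Expr = e match {
    case Alias(s: GetStructField, _) if !isTopLevel => s
    case Alias(c, n)                                => Alias(trim(c, isTopLevel = false), n)
    case Sum(c)                                     => Sum(trim(c, isTopLevel = false))
    case other                                      => other
  }

  def main(args: Array[String]): Unit = {
    // Top-level alias is preserved: it names the output column.
    val top = Alias(GetStructField("c", "json_string"), "json_string")
    assert(trim(top) == top)
    // Nested alias is trimmed; a later cleanup rule would remove it anyway.
    val nested = Sum(Alias(GetStructField("c", "json_string"), "json_string"))
    assert(trim(nested) == Sum(GetStructField("c", "json_string")))
  }
}
```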

Member

As we don't merge this yet, can you also add a comment here?

Contributor Author

As we don't merge this yet, can you also add a comment here?

Yea

Contributor Author

As we don't merge this yet, can you also add a comment here?

See the comment I added just now. cc @cloud-fan

viirya (Member) left a comment

Just minor comments, otherwise LGTM.

@viirya
Member

viirya commented Sep 1, 2020

Thanks for updating the PR description. It looks better now.

@SparkQA

SparkQA commented Sep 1, 2020

Test build #128151 has finished for PR 28490 at commit 51cea07.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 2, 2020

Test build #128170 has finished for PR 28490 at commit 84e65af.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// names leading to ambiguous references exception.
case a @ Aggregate(groupingExprs, aggExprs, appendColumns: AppendColumns) =>
a.mapExpressions(resolveExpressionTopDown(_, appendColumns))
case a: Aggregate =>
Contributor

We need a high-level comment to explain why we trim alias here, e.g.

// SPARK-31670: ...

Contributor Author

We need a high-level comment to explain why we trim alias here, e.g.

// [SPARK-31670](https://issues.apache.org/jira/browse/SPARK-31670): ...

Done

@SparkQA

SparkQA commented Sep 2, 2020

Test build #128184 has finished for PR 28490 at commit e6fb91f.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AngersZhuuuu
Contributor Author

retest this please

@SparkQA

SparkQA commented Sep 2, 2020

Test build #128181 has finished for PR 28490 at commit 9411887.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 2, 2020

Test build #128188 has finished for PR 28490 at commit e6fb91f.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AngersZhuuuu
Contributor Author

retest this please

@SparkQA

SparkQA commented Sep 2, 2020

Test build #128195 has finished for PR 28490 at commit e6fb91f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 5e6173e Sep 2, 2020
// names leading to ambiguous references exception.
case a @ Aggregate(groupingExprs, aggExprs, appendColumns: AppendColumns) =>
a.mapExpressions(resolveExpressionTopDown(_, appendColumns))
// SPARK-31607: Resolve Struct field in groupByExpressions and aggregateExpressions
Member

In the comment, SPARK-31607 seems to be a typo of SPARK-31670 because SPARK-31607 is Improve the perf of CTESubstitution.

Contributor Author

In the comment, SPARK-31607 seems to be a typo of SPARK-31670 because SPARK-31607 is Improve the perf of CTESubstitution.

Yea, sorry for my mistake.

Member

Never mind~, @AngersZhuuuu . :)

HyukjinKwon pushed a commit that referenced this pull request Nov 6, 2020
### What changes were proposed in this pull request?

This PR fixes incorrect JIRA ids in `Analyzer.scala` introduced by  SPARK-31670 (#28490)
```scala
- // SPARK-31607: Resolve Struct field in selectedGroupByExprs/groupByExprs and aggregations
+ // SPARK-31670: Resolve Struct field in selectedGroupByExprs/groupByExprs and aggregations
```

### Why are the changes needed?

Fix the wrong information.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This is a comment change. Manually review.

Closes #30269 from dongjoon-hyun/SPARK-31670-MINOR.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
maropu pushed a commit that referenced this pull request Apr 21, 2021
…ate UnresolvedAlias

### What changes were proposed in this pull request?

This PR partially backports #31758 to 3.1, to fix a backward compatibility issue caused by #28490

The query below has different output schemas in 3.0 and 3.1
```
sql("select struct(1, 2) as s").groupBy(col("s.col1")).agg(first("s"))
```

In 3.0 the output column name is `col1`; in 3.1 it's `s.col1`. This breaks existing queries.

In #28490, we changed the logic of resolving aggregate expressions. What happens is that the input nested column `s.col1` becomes `UnresolvedAlias(s.col1, None)`. In `ResolveReferences`, the logic used to directly resolve `s.col1` to `s.col1 AS col1`, but after #28490 we enter the code path with `trimAlias = true` and `!isTopLevel`, so the alias is removed, resulting in `s.col1`, which is then resolved in `ResolveAliases` as `s.col1 AS s.col1`.

#31758 happens to fix this issue because we no longer wrap `UnresolvedAttribute` with `UnresolvedAlias` in `RelationalGroupedDataset`.
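The naming difference above can be sketched as a toy model of the two analysis paths. This is a minimal sketch in Python for brevity, not Spark's actual API: `NestedField`, `Alias`, and `resolve` are hypothetical stand-ins for what `ResolveReferences`/`ResolveAliases` do to a nested grouping column.

```python
# Toy model (NOT Spark's API) of how an unresolved nested column such as
# s.col1 gets its output name during analysis.
from dataclasses import dataclass


@dataclass
class NestedField:
    struct: str
    field: str

    @property
    def output_name(self) -> str:
        # A bare nested field prints with its full dotted name.
        return f"{self.struct}.{self.field}"


@dataclass
class Alias:
    child: NestedField
    name: str

    @property
    def output_name(self) -> str:
        return self.name


def resolve(struct: str, field: str, trim_alias: bool):
    # The old path resolved "s.col1" directly to "s.col1 AS col1".
    resolved = Alias(NestedField(struct, field), field)
    if trim_alias:
        # After the change, the alias is trimmed, and the later re-aliasing
        # step wraps the bare expression with its full dotted name,
        # i.e. "s.col1 AS s.col1".
        bare = resolved.child
        return Alias(bare, bare.output_name)
    return resolved


print(resolve("s", "col1", trim_alias=False).output_name)  # col1   (3.0-style name)
print(resolve("s", "col1", trim_alias=True).output_name)   # s.col1 (3.1-style name)
```

The model only illustrates why trimming the implicit alias before the final re-aliasing step changes the observable output schema; the actual fix in #31758 avoids wrapping the grouping column in `UnresolvedAlias` in the first place.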

### Why are the changes needed?

Fix an unexpected query output schema change

### Does this PR introduce _any_ user-facing change?

Yes as explained above.

### How was this patch tested?

updated test

Closes #32239 from cloud-fan/bug.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
flyrain pushed a commit to flyrain/spark that referenced this pull request Sep 21, 2021
…ate UnresolvedAlias
fishcus pushed a commit to fishcus/spark that referenced this pull request Jan 12, 2022
…ate UnresolvedAlias
6 participants