[SPARK-34639][SQL] Always remove unnecessary Alias in Analyzer.resolveExpression #31758

cloud-fan · 2021-03-05T09:26:42Z

What changes were proposed in this pull request?

In Analyzer.resolveExpression, we have a parameter to decide whether we should remove unnecessary Alias or not. This is overcomplicated, and we can always remove unnecessary Alias.

This PR simplifies this part and removes the parameter.
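To make "always remove unnecessary Alias" concrete, here is a minimal sketch on a toy expression tree. The classes below are simplified stand-ins defined for illustration, not Spark's Catalyst classes, and `trimNonTopLevelAliases` is a hypothetical name; the rule it applies (drop an Alias around a struct-field extraction unless it is the top-level expression) follows the pattern shown in the Analyzer.scala diff in this PR.

```scala
// Toy model only: these are NOT Spark's Catalyst expression classes.
object AliasTrimSketch {
  sealed trait Expr
  case class Attr(name: String) extends Expr
  case class GetStructField(child: Expr, field: String) extends Expr
  case class Alias(child: Expr, name: String) extends Expr

  // Drop an Alias that wraps a field extraction unless it is the top-level expression.
  def trimNonTopLevelAliases(e: Expr, isTopLevel: Boolean = true): Expr = e match {
    case Alias(g: GetStructField, _) if !isTopLevel =>
      trimNonTopLevelAliases(g, isTopLevel = false)
    case Alias(child, name) =>
      // Top-level alias (or an alias around something else): keep it, clean below it.
      Alias(trimNonTopLevelAliases(child, isTopLevel = false), name)
    case GetStructField(child, field) =>
      GetStructField(trimNonTopLevelAliases(child, isTopLevel = false), field)
    case a: Attr => a
  }
}
```

For a nested reference such as `a.b.c`, the intermediate alias on `a.b` is dropped while the outermost alias, which decides the output column name, is kept.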

Why are the changes needed?

code cleanup

Does this PR introduce any user-facing change?

no

How was this patch tested?

existing tests

cloud-fan · 2021-03-05T09:27:42Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

-        val candidates = q.children.flatMap(_.output)
-        assert(ordinal >= 0 && ordinal < candidates.length)
-        candidates.apply(ordinal)
+        assert(q.children.length == 1)


This change is not related to this PR; it is a small follow-up from #31728 (comment).

cloud-fan · 2021-03-05T09:27:53Z

cc @maropu @viirya

SparkQA · 2021-03-05T14:41:21Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40395/

SparkQA · 2021-03-05T14:49:44Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40395/

SparkQA · 2021-03-05T15:16:08Z

Test build #135813 has finished for PR 31758 at commit 0bae6b4.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun

Could you check the UT failures?

SparkQA · 2021-03-08T09:59:53Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40446/

SparkQA · 2021-03-08T10:04:16Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40446/

SparkQA · 2021-03-08T11:33:17Z

Test build #135863 has finished for PR 31758 at commit 7511a7b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-03-08T13:20:00Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40453/

SparkQA · 2021-03-08T13:55:36Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40453/

maropu · 2021-03-08T14:02:34Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

+            // top-level `GetStructField` if it's safe to do so. Since we will call CleanupAliases
+            // later in Analyzer, trim non top-level unnecessary alias here is safe.
+            case Alias(s: GetStructField, _) if !isTopLevel => s
+            case Alias(s: GetArrayStructFields, _) if !isTopLevel => s


It looks like we also need to add entries for the other ExtractValue classes, e.g., GetMapValue?
They seem to have the same issue as SPARK-31670:

scala> spark.table("t").printSchema()
root
 |-- c0: integer (nullable = false)
 |-- c1: map (nullable = false)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

scala> sql("select c0, c1.key, COUNT(1) from t group by c0, c1.key with cube").show()
org.apache.spark.sql.AnalysisException: expression 't.`c1`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;
Aggregate [c0#34, key#35, spark_grouping_id#33L], [c0#34, c1#24[key] AS key#28, count(1) AS count(1)#30L]
+- Expand [List(c0#23, c1#24, c0#31, key#32, 0), List(c0#23, c1#24, c0#31, null, 1), List(c0#23, c1#24, null, key#32, 2), List(c0#23, c1#24, null, null, 3)], [c0#23, c1#24, c0#34, key#35, spark_grouping_id#33L]
   +- Project [c0#23, c1#24, c0#23 AS c0#31, c1#24[key] AS key#32]
      +- SubqueryAlias t
         +- View (`t`, [c0#23,c1#24])
            +- Project [cast(col1#25 as int) AS c0#23, cast(col2#26 as map<string,string>) AS c1#24]
               +- Project [col1#25, col2#26]
                  +- LocalRelation [col1#25, col2#26]

good catch!
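One possible way to cover every ExtractValue class at once is to match on the shared trait instead of listing each case class. The sketch below uses toy classes defined here for illustration (not Spark's), and `unwrap` is a hypothetical helper; it is not necessarily how the PR addressed this comment.

```scala
// Toy model only: these are NOT Spark's Catalyst expression classes.
object ExtractValueSketch {
  sealed trait Expr
  case class Attr(name: String) extends Expr
  case class Alias(child: Expr, name: String) extends Expr

  // Shared parent for all "extract a value from a complex type" expressions.
  sealed trait ExtractValue extends Expr
  case class GetStructField(child: Expr, field: String) extends ExtractValue
  case class GetArrayStructFields(child: Expr, field: String) extends ExtractValue
  case class GetMapValue(child: Expr, key: String) extends ExtractValue

  // One pattern on the trait covers struct, array-of-struct and map extraction.
  def unwrap(e: Expr, isTopLevel: Boolean): Expr = e match {
    case Alias(ev: ExtractValue, _) if !isTopLevel => ev
    case other => other
  }
}
```

With this shape, a map access such as `c1.key` is treated the same way as a struct field access, so a newly added extraction class would not need its own entry.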

SparkQA · 2021-03-08T16:38:58Z

Test build #135870 has finished for PR 31758 at commit c43dc08.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-03-09T07:22:24Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40479/

SparkQA · 2021-03-09T07:57:41Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40479/

cloud-fan · 2021-03-09T08:00:24Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

-          Seq(Literal(e.name), e)
+        case Seq(NamePlaceholder, ne: NamedExpression) if ne.resolved =>
+          Seq(Literal(ne.name), ne)
+        case Seq(NamePlaceholder, g @ GetStructField(child, _, Some(name))) if child.resolved =>


This fix is necessary even without the refactor. In some places, we remove non-top-level alias when resolving expressions, which breaks this place.

The refactor always removes non-top-level alias and exposes this bug. I can backport this fix later if people ask, but there is no bug report for this yet.

Could you add a test case for this code path?

Are GetStructField, GetArrayStructFields and GetMapValue the only possible expressions here?

Can it be case Seq(NamePlaceholder, other expression)?

Then let's have a separate PR for it: #31808

Thank you for splitting.
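The name-placeholder handling discussed in this thread can be sketched on a toy model: each (placeholder, value) pair gets a literal name taken either from a named expression or from the field name of a struct-field extraction whose alias was already trimmed. The classes and the `fillNames` helper below are assumptions made for illustration, not Spark's actual definitions.

```scala
// Toy model only: these are NOT Spark's Catalyst expression classes.
object NamePlaceholderSketch {
  sealed trait Expr
  case object NamePlaceholder extends Expr
  case class Literal(value: String) extends Expr
  case class Attr(name: String) extends Expr
  case class GetStructField(child: Expr, ordinal: Int, name: Option[String]) extends Expr

  // Children arrive as (name, value) pairs; replace each placeholder with a real name.
  def fillNames(children: Seq[Expr]): Seq[Expr] =
    children.grouped(2).flatMap {
      case Seq(NamePlaceholder, a: Attr) =>
        Seq[Expr](Literal(a.name), a)
      // Without a surrounding Alias, fall back to the extracted field's own name.
      case Seq(NamePlaceholder, g @ GetStructField(_, _, Some(name))) =>
        Seq[Expr](Literal(name), g)
      case other => other
    }.toList
}
```

A pair whose value carries no usable name is left untouched here, which mirrors the question in the thread about whether other expressions can show up in this position.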

cloud-fan · 2021-03-09T08:09:47Z

sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala

@@ -80,11 +80,7 @@ class RelationalGroupedDataset protected[sql](
    }
  }

-  // Wrap UnresolvedAttribute with UnresolvedAlias, as when we resolve UnresolvedAttribute, we
-  // will remove intermediate Alias for ExtractValue chain, and we need to alias it again to
-  // make it a NamedExpression.


The comment is wrong as we don't remove top-level aliases for aggregate expressions. It causes problems as it wraps UnresolvedAttribute with UnresolvedAlias, making it not top-level anymore. Then the alias will be removed after this patch and UnresolvedAlias generates a different name.

For nested field a.b, the resolved expression was previously Alias(GetStructField(...), "b") and the Alias was not removed. UnresolvedAlias was useless and was simply removed, so the final output column name was b. Now we remove the Alias, and UnresolvedAlias kicks in and generates a new Alias with the name a.b, which is a behavior change.

Here I simply remove this UnresolvedAlias, to make the behavior the same as before.
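The naming difference described above can be reproduced on a toy model. The sketch assumes that an auto-generated alias is named after the dotted path of the expression and that an already-named child keeps its name; `prettyName` and `resolveUnresolvedAlias` are hypothetical helpers, and the classes are simplified stand-ins rather than Spark's.

```scala
// Toy model only: these are NOT Spark's Catalyst expression classes.
object OutputNameSketch {
  sealed trait Expr
  case class Attr(name: String) extends Expr
  case class GetStructField(child: Expr, field: String) extends Expr
  case class Alias(child: Expr, name: String) extends Expr

  // Stand-in for the name an auto-generated alias would get: the dotted path.
  def prettyName(e: Expr): String = e match {
    case Attr(n) => n
    case GetStructField(c, f) => s"${prettyName(c)}.$f"
    case Alias(_, n) => n
  }

  // Stand-in for resolving UnresolvedAlias(child): a child that already has a
  // name keeps it; anything else gets a generated alias.
  def resolveUnresolvedAlias(child: Expr): Alias = child match {
    case a: Alias => a
    case other => Alias(other, prettyName(other))
  }
}
```

If the inner Alias survives, the wrapper is a no-op and the column is named `b`; once the inner Alias is trimmed, the wrapper generates `a.b`. Dropping the wrapper in RelationalGroupedDataset keeps the Alias top-level, so it is not trimmed.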

Hmm, this code is pretty old, 2015.

I see that alias is also used to add an alias around grouping expressions, not just aggregate expressions. It seems the comment is more about that case?

This is only for Aggregate.aggregateExpressions not Aggregate.groupingExpressions. Aggregate.aggregateExpressions can include grouping expressions, but it doesn't matter. It needs to be Seq[NamedExpression] and Spark won't remove the top-level alias in it.

cloud-fan · 2021-03-09T08:11:27Z

sql/core/src/test/resources/sql-tests/results/struct.sql.out

@@ -83,7 +83,7 @@ struct<ID:int,NST:string>
 -- !query
 SELECT ID, STRUCT(ST.C as STC, ST.D as STD).STD FROM tbl_x
 -- !query schema
-struct<ID:int,struct(ST.C AS `C` AS `STC`, ST.D AS `D` AS `STD`).STD:string>
+struct<ID:int,struct(ST.C AS `STC`, ST.D AS `STD`).STD:string>


I think the new schema field names make more sense

cloud-fan · 2021-03-10T13:56:06Z

retest this please

cloud-fan · 2021-03-10T13:56:26Z

any more comments? @viirya @maropu

dongjoon-hyun · 2021-03-11T19:54:16Z

Could you resolve the conflicts?

maropu · 2021-03-12T00:50:39Z

I've checked that code and the change itself looks fine.

SparkQA · 2021-03-12T07:01:48Z

Test build #135991 has finished for PR 31758 at commit 2e5942d.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2021-03-12T15:45:27Z

Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40595/

cloud-fan · 2021-03-15T09:22:17Z

thanks for the review, merging to master!

…ate UnresolvedAlias

### What changes were proposed in this pull request?

This PR partially backports #31758 to 3.1, to fix a backward compatibility issue caused by #28490.

The query below has different output schemas in 3.0 and 3.1:
```
sql("select struct(1, 2) as s").groupBy(col("s.col1")).agg(first("s"))
```
In 3.0 the output column name is `col1`, in 3.1 it's `s.col1`. This breaks existing queries.

In #28490, we changed the logic of resolving aggregate expressions. What happened is that the input nested column `s.col1` will become `UnresolvedAlias(s.col1, None)`. In `ResolveReference`, the logic used to directly resolve `s.col1` to `s.col1 AS col1`, but after #28490 we enter the code path with `trimAlias = true and !isTopLevel`, so the alias is removed, resulting in `s.col1`, which will then be resolved in `ResolveAliases` as `s.col1 AS s.col1`.

#31758 happens to fix this issue because we no longer wrap `UnresolvedAttribute` with `UnresolvedAlias` in `RelationalGroupedDataset`.

### Why are the changes needed?

Fix an unexpected query output schema change

### Does this PR introduce _any_ user-facing change?

Yes as explained above.

### How was this patch tested?

updated test

Closes #32239 from cloud-fan/bug.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>


github-actions bot added the SQL label Mar 5, 2021

cloud-fan force-pushed the resolve branch from f286379 to 0bae6b4 Compare March 5, 2021 13:59

dongjoon-hyun reviewed Mar 6, 2021

cloud-fan force-pushed the resolve branch from 0bae6b4 to 7511a7b Compare March 8, 2021 08:31

cloud-fan force-pushed the resolve branch from 7511a7b to c43dc08 Compare March 8, 2021 11:50

maropu reviewed Mar 8, 2021

cloud-fan force-pushed the resolve branch from c43dc08 to 4242154 Compare March 9, 2021 06:01

always remove unnecessary Alias in Analyzer.resolveExpression

2e5942d

cloud-fan force-pushed the resolve branch from 4242154 to 2e5942d Compare March 12, 2021 03:46

maropu approved these changes Mar 12, 2021

viirya approved these changes Mar 12, 2021

fix a corner case

5054765

cloud-fan closed this in be888b2 Mar 15, 2021

cloud-fan mentioned this pull request Apr 19, 2021

[SPARK-34639][SQL][3.1] RelationalGroupedDataset.alias should not create UnresolvedAlias #32239

Closed
