Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-34639][SQL] Always remove unnecessary Alias in Analyzer.resolveExpression #31758

Closed
wants to merge 2 commits into from

Conversation

cloud-fan
Copy link
Contributor

What changes were proposed in this pull request?

In Analyzer.resolveExpression, we have a parameter to decide if we should remove unnecessary Alias or not. This is over complicated and we can always remove unnecessary Alias.

This PR simplifies this part and removes the parameter.

Why are the changes needed?

code cleanup

Does this PR introduce any user-facing change?

no

How was this patch tested?

existing tests

val candidates = q.children.flatMap(_.output)
assert(ordinal >= 0 && ordinal < candidates.length)
candidates.apply(ordinal)
assert(q.children.length == 1)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is not related to this PR, but a small followup from #31728 (comment)

@cloud-fan
Copy link
Contributor Author

cc @maropu @viirya

@SparkQA
Copy link

SparkQA commented Mar 5, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40395/

@SparkQA
Copy link

SparkQA commented Mar 5, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40395/

@SparkQA
Copy link

SparkQA commented Mar 5, 2021

Test build #135813 has finished for PR 31758 at commit 0bae6b4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you check the UT failures?

@SparkQA
Copy link

SparkQA commented Mar 8, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40446/

@SparkQA
Copy link

SparkQA commented Mar 8, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40446/

@SparkQA
Copy link

SparkQA commented Mar 8, 2021

Test build #135863 has finished for PR 31758 at commit 7511a7b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 8, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40453/

@SparkQA
Copy link

SparkQA commented Mar 8, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40453/

// top-level `GetStructField` if it's safe to do so. Since we will call CleanupAliases
// later in Analyzer, trim non top-level unnecessary alias here is safe.
case Alias(s: GetStructField, _) if !isTopLevel => s
case Alias(s: GetArrayStructFields, _) if !isTopLevel => s
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks we also need to add entries for the other ExtractValue classes, e.g., GetMapValue?
They seems to have the same issue with SPARK-31670;

scala> spark.table("t").printSchema()
root
 |-- c0: integer (nullable = false)
 |-- c1: map (nullable = false)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)


scala> sql("select c0, c1.key, COUNT(1) from t group by c0, c1.key with cube").show()
org.apache.spark.sql.AnalysisException: expression 't.`c1`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;
Aggregate [c0#34, key#35, spark_grouping_id#33L], [c0#34, c1#24[key] AS key#28, count(1) AS count(1)#30L]
+- Expand [List(c0#23, c1#24, c0#31, key#32, 0), List(c0#23, c1#24, c0#31, null, 1), List(c0#23, c1#24, null, key#32, 2), List(c0#23, c1#24, null, null, 3)], [c0#23, c1#24, c0#34, key#35, spark_grouping_id#33L]
   +- Project [c0#23, c1#24, c0#23 AS c0#31, c1#24[key] AS key#32]
      +- SubqueryAlias t
         +- View (`t`, [c0#23,c1#24])
            +- Project [cast(col1#25 as int) AS c0#23, cast(col2#26 as map<string,string>) AS c1#24]
               +- Project [col1#25, col2#26]
                  +- LocalRelation [col1#25, col2#26]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch!

@SparkQA
Copy link

SparkQA commented Mar 8, 2021

Test build #135870 has finished for PR 31758 at commit c43dc08.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 9, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40479/

@SparkQA
Copy link

SparkQA commented Mar 9, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40479/

Seq(Literal(e.name), e)
case Seq(NamePlaceholder, ne: NamedExpression) if ne.resolved =>
Seq(Literal(ne.name), ne)
case Seq(NamePlaceholder, g @ GetStructField(child, _, Some(name))) if child.resolved =>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fix is necessary even without the refactor. In some places, we remove non-top-level alias when resolving expressions, which breaks this place.

The refactor always removes non-top-level alias and exposes this bug. I can backport this fix later if people ask, but there is no bug report for this yet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a test case for this code path?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only GetStructField, GetArrayStructFields and GetMapValue are the possible expressions here?

Can it be case Seq(NamePlaceholder, other expression)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then let's have a separated PR for it: #31808

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for splitting.

@@ -80,11 +80,7 @@ class RelationalGroupedDataset protected[sql](
}
}

// Wrap UnresolvedAttribute with UnresolvedAlias, as when we resolve UnresolvedAttribute, we
// will remove intermediate Alias for ExtractValue chain, and we need to alias it again to
// make it a NamedExpression.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment is wrong as we don't remove top-level aliases for aggregate expressions. It causes problems as it wraps UnresolvedAttribute with UnresolvedAlias, making it not top-level anymore. Then the alias will be removed after this patch and UnresolvedAlias generates a different name.

For nested field a.b, previously the resolved expression is Alias(GetStructField(...), "b") and the Alias is not removed. UnresolvedAlias is useless and will be simply removed. So the final output column name is b. Now we remove the Alias, and UnresolvedAlias kicks in and generates a new Alias with the name a.b, which is a behavior change.

Here I simply remove this UnresolvedAlias, to make the behavior the same as before.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this code is pretty old, 2015.

I saw alias is also used to add alias around grouping expressions, not just aggregate expressions. Seems the comment is more for the case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only for Aggregate.aggregateExpressions not Aggregate.groupingExpressions. Aggregate.aggregateExpressions can include grouping expressions, but it doesn't matter. It needs to be Seq[NamedExpression] and Spark won't remove the top-level alias in it.

@@ -83,7 +83,7 @@ struct<ID:int,NST:string>
-- !query
SELECT ID, STRUCT(ST.C as STC, ST.D as STD).STD FROM tbl_x
-- !query schema
struct<ID:int,struct(ST.C AS `C` AS `STC`, ST.D AS `D` AS `STD`).STD:string>
struct<ID:int,struct(ST.C AS `STC`, ST.D AS `STD`).STD:string>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the new schema field names make more sense

@cloud-fan
Copy link
Contributor Author

retest this please

@cloud-fan
Copy link
Contributor Author

any more comments? @viirya @maropu

@dongjoon-hyun
Copy link
Member

Could you resolve the conflicts?

@maropu
Copy link
Member

maropu commented Mar 12, 2021

I've checked that code and the change itself looks fine.

@SparkQA
Copy link

SparkQA commented Mar 12, 2021

Test build #135991 has finished for PR 31758 at commit 2e5942d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40595/

@cloud-fan
Copy link
Contributor Author

thanks for the review, merging to master!

@cloud-fan cloud-fan closed this in be888b2 Mar 15, 2021
maropu pushed a commit that referenced this pull request Apr 21, 2021
…ate UnresolvedAlias

### What changes were proposed in this pull request?

This PR partially backports #31758 to 3.1, to fix a backward compatibility issue caused by #28490

The query below has different output schemas in 3.0 and 3.1
```
sql("select struct(1, 2) as s").groupBy(col("s.col1")).agg(first("s"))
```

In 3.0 the output column name is `col1`, in 3.1  it's `s.col1`. This breaks existing queries.

In #28490 , we changed the logic of resolving aggregate expressions. What happened is that the input nested column `s.col1` will become `UnresolvedAlias(s.col1, None)`. In `ResolveReference`, the logic used to directly resolve `s.col` to `s.col1 AS col1` but after #28490 we enter the code path with `trimAlias = true and !isTopLevel`, so the alias is removed and resulting in `s.col1`, which will then be resolved in `ResolveAliases` as `s.col1 AS s.col1`

#31758 happens to fix this issue because we no longer wrap `UnresolvedAttribute` with `UnresolvedAlias` in `RelationalGroupedDataset`.

### Why are the changes needed?

Fix an unexpected query output schema change

### Does this PR introduce _any_ user-facing change?

Yes as explained above.

### How was this patch tested?

updated test

Closes #32239 from cloud-fan/bug.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
flyrain pushed a commit to flyrain/spark that referenced this pull request Sep 21, 2021
…ate UnresolvedAlias

### What changes were proposed in this pull request?

This PR partially backports apache#31758 to 3.1, to fix a backward compatibility issue caused by apache#28490

The query below has different output schemas in 3.0 and 3.1
```
sql("select struct(1, 2) as s").groupBy(col("s.col1")).agg(first("s"))
```

In 3.0 the output column name is `col1`, in 3.1  it's `s.col1`. This breaks existing queries.

In apache#28490 , we changed the logic of resolving aggregate expressions. What happened is that the input nested column `s.col1` will become `UnresolvedAlias(s.col1, None)`. In `ResolveReference`, the logic used to directly resolve `s.col` to `s.col1 AS col1` but after apache#28490 we enter the code path with `trimAlias = true and !isTopLevel`, so the alias is removed and resulting in `s.col1`, which will then be resolved in `ResolveAliases` as `s.col1 AS s.col1`

apache#31758 happens to fix this issue because we no longer wrap `UnresolvedAttribute` with `UnresolvedAlias` in `RelationalGroupedDataset`.

### Why are the changes needed?

Fix an unexpected query output schema change

### Does this PR introduce _any_ user-facing change?

Yes as explained above.

### How was this patch tested?

updated test

Closes apache#32239 from cloud-fan/bug.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
fishcus pushed a commit to fishcus/spark that referenced this pull request Jan 12, 2022
…ate UnresolvedAlias

### What changes were proposed in this pull request?

This PR partially backports apache#31758 to 3.1, to fix a backward compatibility issue caused by apache#28490

The query below has different output schemas in 3.0 and 3.1
```
sql("select struct(1, 2) as s").groupBy(col("s.col1")).agg(first("s"))
```

In 3.0 the output column name is `col1`, in 3.1  it's `s.col1`. This breaks existing queries.

In apache#28490 , we changed the logic of resolving aggregate expressions. What happened is that the input nested column `s.col1` will become `UnresolvedAlias(s.col1, None)`. In `ResolveReference`, the logic used to directly resolve `s.col` to `s.col1 AS col1` but after apache#28490 we enter the code path with `trimAlias = true and !isTopLevel`, so the alias is removed and resulting in `s.col1`, which will then be resolved in `ResolveAliases` as `s.col1 AS s.col1`

apache#31758 happens to fix this issue because we no longer wrap `UnresolvedAttribute` with `UnresolvedAlias` in `RelationalGroupedDataset`.

### Why are the changes needed?

Fix an unexpected query output schema change

### Does this PR introduce _any_ user-facing change?

Yes as explained above.

### How was this patch tested?

updated test

Closes apache#32239 from cloud-fan/bug.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants