-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-34639][SQL] Always remove unnecessary Alias in Analyzer.resolveExpression #31758
Conversation
val candidates = q.children.flatMap(_.output) | ||
assert(ordinal >= 0 && ordinal < candidates.length) | ||
candidates.apply(ordinal) | ||
assert(q.children.length == 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is not related to this PR, but a small followup from #31728 (comment)
Kubernetes integration test starting |
Kubernetes integration test status failure |
Test build #135813 has finished for PR 31758 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you check the UT failures?
Kubernetes integration test starting |
Kubernetes integration test status failure |
Test build #135863 has finished for PR 31758 at commit
|
Kubernetes integration test starting |
Kubernetes integration test status success |
// top-level `GetStructField` if it's safe to do so. Since we will call CleanupAliases | ||
// later in Analyzer, trim non top-level unnecessary alias here is safe. | ||
case Alias(s: GetStructField, _) if !isTopLevel => s | ||
case Alias(s: GetArrayStructFields, _) if !isTopLevel => s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks we also need to add entries for the other ExtractValue
classes, e.g., GetMapValue
?
They seems to have the same issue with SPARK-31670;
scala> spark.table("t").printSchema()
root
|-- c0: integer (nullable = false)
|-- c1: map (nullable = false)
| |-- key: string
| |-- value: string (valueContainsNull = true)
scala> sql("select c0, c1.key, COUNT(1) from t group by c0, c1.key with cube").show()
org.apache.spark.sql.AnalysisException: expression 't.`c1`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;
Aggregate [c0#34, key#35, spark_grouping_id#33L], [c0#34, c1#24[key] AS key#28, count(1) AS count(1)#30L]
+- Expand [List(c0#23, c1#24, c0#31, key#32, 0), List(c0#23, c1#24, c0#31, null, 1), List(c0#23, c1#24, null, key#32, 2), List(c0#23, c1#24, null, null, 3)], [c0#23, c1#24, c0#34, key#35, spark_grouping_id#33L]
+- Project [c0#23, c1#24, c0#23 AS c0#31, c1#24[key] AS key#32]
+- SubqueryAlias t
+- View (`t`, [c0#23,c1#24])
+- Project [cast(col1#25 as int) AS c0#23, cast(col2#26 as map<string,string>) AS c1#24]
+- Project [col1#25, col2#26]
+- LocalRelation [col1#25, col2#26]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch!
Test build #135870 has finished for PR 31758 at commit
|
Kubernetes integration test starting |
Kubernetes integration test status success |
Seq(Literal(e.name), e) | ||
case Seq(NamePlaceholder, ne: NamedExpression) if ne.resolved => | ||
Seq(Literal(ne.name), ne) | ||
case Seq(NamePlaceholder, g @ GetStructField(child, _, Some(name))) if child.resolved => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This fix is necessary even without the refactor. In some places, we remove non-top-level alias when resolving expressions, which breaks this place.
The refactor always removes non-top-level alias and exposes this bug. I can backport this fix later if people ask, but there is no bug report for this yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a test case for this code path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only GetStructField
, GetArrayStructFields
and GetMapValue
are the possible expressions here?
Can it be case Seq(NamePlaceholder, other expression)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then let's have a separated PR for it: #31808
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for splitting.
@@ -80,11 +80,7 @@ class RelationalGroupedDataset protected[sql]( | |||
} | |||
} | |||
|
|||
// Wrap UnresolvedAttribute with UnresolvedAlias, as when we resolve UnresolvedAttribute, we | |||
// will remove intermediate Alias for ExtractValue chain, and we need to alias it again to | |||
// make it a NamedExpression. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment is wrong as we don't remove top-level aliases for aggregate expressions. It causes problems as it wraps UnresolvedAttribute
with UnresolvedAlias
, making it not top-level anymore. Then the alias will be removed after this patch and UnresolvedAlias
generates a different name.
For nested field a.b
, previously the resolved expression is Alias(GetStructField(...), "b")
and the Alias
is not removed. UnresolvedAlias
is useless and will be simply removed. So the final output column name is b
. Now we remove the Alias
, and UnresolvedAlias
kicks in and generates a new Alias
with the name a.b
, which is a behavior change.
Here I simply remove this UnresolvedAlias
, to make the behavior the same as before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, this code is pretty old, 2015.
I saw alias
is also used to add alias around grouping expressions, not just aggregate expressions. Seems the comment is more for the case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only for Aggregate.aggregateExpressions
not Aggregate.groupingExpressions
. Aggregate.aggregateExpressions
can include grouping expressions, but it doesn't matter. It needs to be Seq[NamedExpression]
and Spark won't remove the top-level alias in it.
@@ -83,7 +83,7 @@ struct<ID:int,NST:string> | |||
-- !query | |||
SELECT ID, STRUCT(ST.C as STC, ST.D as STD).STD FROM tbl_x | |||
-- !query schema | |||
struct<ID:int,struct(ST.C AS `C` AS `STC`, ST.D AS `D` AS `STD`).STD:string> | |||
struct<ID:int,struct(ST.C AS `STC`, ST.D AS `STD`).STD:string> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the new schema field names make more sense
retest this please |
Could you resolve the conflicts? |
I've checked that code and the change itself looks fine. |
Test build #135991 has finished for PR 31758 at commit
|
Refer to this link for build results (access rights to CI server needed): |
thanks for the review, merging to master! |
…ate UnresolvedAlias ### What changes were proposed in this pull request? This PR partially backports #31758 to 3.1, to fix a backward compatibility issue caused by #28490 The query below has different output schemas in 3.0 and 3.1 ``` sql("select struct(1, 2) as s").groupBy(col("s.col1")).agg(first("s")) ``` In 3.0 the output column name is `col1`, in 3.1 it's `s.col1`. This breaks existing queries. In #28490 , we changed the logic of resolving aggregate expressions. What happened is that the input nested column `s.col1` will become `UnresolvedAlias(s.col1, None)`. In `ResolveReference`, the logic used to directly resolve `s.col` to `s.col1 AS col1` but after #28490 we enter the code path with `trimAlias = true and !isTopLevel`, so the alias is removed and resulting in `s.col1`, which will then be resolved in `ResolveAliases` as `s.col1 AS s.col1` #31758 happens to fix this issue because we no longer wrap `UnresolvedAttribute` with `UnresolvedAlias` in `RelationalGroupedDataset`. ### Why are the changes needed? Fix an unexpected query output schema change ### Does this PR introduce _any_ user-facing change? Yes as explained above. ### How was this patch tested? updated test Closes #32239 from cloud-fan/bug. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Takeshi Yamamuro <[email protected]>
…ate UnresolvedAlias ### What changes were proposed in this pull request? This PR partially backports apache#31758 to 3.1, to fix a backward compatibility issue caused by apache#28490 The query below has different output schemas in 3.0 and 3.1 ``` sql("select struct(1, 2) as s").groupBy(col("s.col1")).agg(first("s")) ``` In 3.0 the output column name is `col1`, in 3.1 it's `s.col1`. This breaks existing queries. In apache#28490 , we changed the logic of resolving aggregate expressions. What happened is that the input nested column `s.col1` will become `UnresolvedAlias(s.col1, None)`. In `ResolveReference`, the logic used to directly resolve `s.col` to `s.col1 AS col1` but after apache#28490 we enter the code path with `trimAlias = true and !isTopLevel`, so the alias is removed and resulting in `s.col1`, which will then be resolved in `ResolveAliases` as `s.col1 AS s.col1` apache#31758 happens to fix this issue because we no longer wrap `UnresolvedAttribute` with `UnresolvedAlias` in `RelationalGroupedDataset`. ### Why are the changes needed? Fix an unexpected query output schema change ### Does this PR introduce _any_ user-facing change? Yes as explained above. ### How was this patch tested? updated test Closes apache#32239 from cloud-fan/bug. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Takeshi Yamamuro <[email protected]>
…ate UnresolvedAlias ### What changes were proposed in this pull request? This PR partially backports apache#31758 to 3.1, to fix a backward compatibility issue caused by apache#28490 The query below has different output schemas in 3.0 and 3.1 ``` sql("select struct(1, 2) as s").groupBy(col("s.col1")).agg(first("s")) ``` In 3.0 the output column name is `col1`, in 3.1 it's `s.col1`. This breaks existing queries. In apache#28490 , we changed the logic of resolving aggregate expressions. What happened is that the input nested column `s.col1` will become `UnresolvedAlias(s.col1, None)`. In `ResolveReference`, the logic used to directly resolve `s.col` to `s.col1 AS col1` but after apache#28490 we enter the code path with `trimAlias = true and !isTopLevel`, so the alias is removed and resulting in `s.col1`, which will then be resolved in `ResolveAliases` as `s.col1 AS s.col1` apache#31758 happens to fix this issue because we no longer wrap `UnresolvedAttribute` with `UnresolvedAlias` in `RelationalGroupedDataset`. ### Why are the changes needed? Fix an unexpected query output schema change ### Does this PR introduce _any_ user-facing change? Yes as explained above. ### How was this patch tested? updated test Closes apache#32239 from cloud-fan/bug. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Takeshi Yamamuro <[email protected]>
What changes were proposed in this pull request?
In
Analyzer.resolveExpression
, we have a parameter to decide if we should remove unnecessaryAlias
or not. This is over complicated and we can always remove unnecessaryAlias
.This PR simplifies this part and removes the parameter.
Why are the changes needed?
code cleanup
Does this PR introduce any user-facing change?
no
How was this patch tested?
existing tests