-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-34037][SQL] Remove unnecessary upcasting for Avg & Sum which handle by themself internally #31079
Conversation
Kubernetes integration test starting |
Kubernetes integration test status success |
Kubernetes integration test starting |
Kubernetes integration test status success |
Test build #133773 has finished for PR 31079 at commit
|
Kubernetes integration test starting |
cc @maropu and @cloud-fan |
Kubernetes integration test status success |
Kubernetes integration test starting |
Kubernetes integration test starting |
Kubernetes integration test status failure |
Kubernetes integration test status success |
Test build #133763 has finished for PR 31079 at commit
|
Test build #133774 has finished for PR 31079 at commit
|
Test build #133769 has finished for PR 31079 at commit
|
Hmm? Can you elaborate the issue more clearly? I don't get what going to be wrong with the current PR description. What the bug is, and how it affects query result? |
// Promote SUM, SUM DISTINCT and AVERAGE to largest types to prevent overflows. | ||
case s @ Sum(e @ DecimalType()) => s // Decimal is already the biggest. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a reason about promoting these aggregation functions, why removing them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the type-coercion for numeric types of average and sum is not necessary at all, as the resultType and sumType can prevent the overflow. and it causes the issue here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, Sum
/Average
already casts the inputs internally.
Let's add a UT to show the bug directly. TPCDS query plans are hard to read. |
Kubernetes integration test starting |
retest this please |
Test build #133802 has started for PR 31079 at commit |
Kubernetes integration test starting |
Kubernetes integration test status success |
retest this please |
Kubernetes integration test starting |
Kubernetes integration test status success |
Test build #133827 has finished for PR 31079 at commit
|
@yaooqinn can we update the PR title and description? Now it's a simple improvement to avoid unnecessary casts for sum/avg. |
Kubernetes integration test starting |
updated thanks |
Kubernetes integration test status success |
Test build #134089 has finished for PR 31079 at commit
|
Kubernetes integration test starting |
Kubernetes integration test status failure |
Test build #134105 has finished for PR 31079 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks making sense.
@yaooqinn Can you also update the JIRA accordingly as the current PR title and description? |
OK |
Thanks! Merging to master. |
…& Sum which handle by themself internally (#940) * Fix Number of partitions (0) must be positive * [SPARK-34037][SQL] Remove unnecessary upcasting for Avg & Sum which handle by themself internally ### What changes were proposed in this pull request? The type-coercion for numeric types of average and sum is not necessary at all, as the resultType and sumType can prevent the overflow. ### Why are the changes needed? rm unnecessary logic which may cause potential performance regressions ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? tpcds tests for plan Closes #31079 from yaooqinn/SPARK-34037. Authored-by: Kent Yao <[email protected]> Signed-off-by: Liang-Chi Hsieh <[email protected]> (cherry picked from commit a235c3b) * [SPARK-34037][SQL] Remove unnecessary upcasting for Avg & Sum which handle by themself internally ### What changes were proposed in this pull request? The type-coercion for numeric types of average and sum is not necessary at all, as the resultType and sumType can prevent the overflow. ### Why are the changes needed? rm unnecessary logic which may cause potential performance regressions ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? tpcds tests for plan Closes #31079 from yaooqinn/SPARK-34037. Authored-by: Kent Yao <[email protected]> Signed-off-by: Liang-Chi Hsieh <[email protected]> (cherry picked from commit a235c3b) * fix Co-authored-by: Kent Yao <[email protected]>
… expressions from aggregate expressions without aggregate function (#941) * [SPARK-34581][SQL] Don't optimize out grouping expressions from aggregate expressions without aggregate function ### What changes were proposed in this pull request? This PR adds a new rule `PullOutGroupingExpressions` to pull out complex grouping expressions to a `Project` node under an `Aggregate`. These expressions are then referenced in both grouping expressions and aggregate expressions without aggregate functions to ensure that optimization rules don't change the aggregate expressions to invalid ones that no longer refer to any grouping expressions. ### Why are the changes needed? If aggregate expressions (without aggregate functions) in an `Aggregate` node are complex then the `Optimizer` can optimize out grouping expressions from them and so making aggregate expressions invalid. Here is a simple example: ``` SELECT not(t.id IS NULL) , count(*) FROM t GROUP BY t.id IS NULL ``` In this case the `BooleanSimplification` rule does this: ``` === Applying Rule org.apache.spark.sql.catalyst.optimizer.BooleanSimplification === !Aggregate [isnull(id#222)], [NOT isnull(id#222) AS (NOT (id IS NULL))#226, count(1) AS c#224L] Aggregate [isnull(id#222)], [isnotnull(id#222) AS (NOT (id IS NULL))#226, count(1) AS c#224L] +- Project [value#219 AS id#222] +- Project [value#219 AS id#222] +- LocalRelation [value#219] +- LocalRelation [value#219] ``` where `NOT isnull(id#222)` is optimized to `isnotnull(id#222)` and so it no longer refers to any grouping expression. Before this PR: ``` == Optimized Logical Plan == Aggregate [isnull(id#222)], [isnotnull(id#222) AS (NOT (id IS NULL))#234, count(1) AS c#232L] +- Project [value#219 AS id#222] +- LocalRelation [value#219] ``` and running the query throws an error: ``` Couldn't find id#222 in [isnull(id#222)#230,count(1)#226L] java.lang.IllegalStateException: Couldn't find id#222 in [isnull(id#222)#230,count(1)#226L] ``` After this PR: ``` == Optimized Logical Plan == Aggregate [_groupingexpression#233], [NOT _groupingexpression#233 AS (NOT (id IS NULL))#230, count(1) AS c#228L] +- Project [isnull(value#219) AS _groupingexpression#233] +- LocalRelation [value#219] ``` and the query works. ### Does this PR introduce _any_ user-facing change? Yes, the query works. ### How was this patch tested? Added new UT. Closes #32396 from peter-toth/SPARK-34581-keep-grouping-expressions-2. Authored-by: Peter Toth <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit cfc0495) * [SPARK-34037][SQL] Remove unnecessary upcasting for Avg & Sum which handle by themself internally ### What changes were proposed in this pull request? The type-coercion for numeric types of average and sum is not necessary at all, as the resultType and sumType can prevent the overflow. ### Why are the changes needed? rm unnecessary logic which may cause potential performance regressions ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? tpcds tests for plan Closes #31079 from yaooqinn/SPARK-34037. Authored-by: Kent Yao <[email protected]> Signed-off-by: Liang-Chi Hsieh <[email protected]> (cherry picked from commit a235c3b) Co-authored-by: Peter Toth <[email protected]>
What changes were proposed in this pull request?
The type-coercion for numeric types of average and sum is not necessary at all, as the resultType and sumType can prevent the overflow.
Why are the changes needed?
rm unnecessary logic which may cause potential performance regressions
Does this PR introduce any user-facing change?
no
How was this patch tested?
tpcds tests for plan