[SPARK-34037][SQL] Remove unnecessary upcasting for Avg & Sum which handle by themself internally #31079

yaooqinn · 2021-01-07T03:11:12Z

What changes were proposed in this pull request?

The type coercion that upcasts the numeric inputs of average and sum is not necessary at all, because their resultType and sumType already prevent overflow.
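To make the overflow argument concrete, here is a minimal sketch in plain Scala of how an aggregate can choose a wider type by itself. The `NumType` hierarchy and the `AggTypes` helpers are hypothetical stand-ins, not Spark's classes, and the widening amounts (integral inputs to long, other non-decimal inputs to double, decimal precision +10 for sum and +4/+4 for average, capped at 38) reflect my understanding of the behavior and should be read as assumptions.

```scala
// Simplified model of numeric types; these names are illustrative only.
sealed trait NumType
case object ByteT extends NumType
case object ShortT extends NumType
case object IntT extends NumType
case object LongT extends NumType
case object FloatT extends NumType
case object DoubleT extends NumType
final case class DecimalT(precision: Int, scale: Int) extends NumType

object AggTypes {
  // Assumed upper bound for decimal precision and scale.
  val MaxPrecision = 38

  // Clamp a requested decimal type to the assumed maximum.
  private def bounded(precision: Int, scale: Int): DecimalT =
    DecimalT(math.min(precision, MaxPrecision), math.min(scale, MaxPrecision))

  // Result type of a sum: wide enough that the inputs need no prior upcast.
  def sumResultType(child: NumType): NumType = child match {
    case DecimalT(p, s)                => bounded(p + 10, s)
    case ByteT | ShortT | IntT | LongT => LongT
    case FloatT | DoubleT              => DoubleT
  }

  // Result type of an average: double unless the input is decimal.
  def avgResultType(child: NumType): NumType = child match {
    case DecimalT(p, s) => bounded(p + 4, s + 4)
    case _              => DoubleT
  }
}
```

With rules like these inside the aggregate, an analyzer rule that upcasts the input beforehand only adds a cast node; the type of the result stays the same.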

Why are the changes needed?

Remove unnecessary logic that may cause performance regressions.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

TPC-DS tests for the query plans.

SparkQA · 2021-01-07T04:05:12Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38355/

SparkQA · 2021-01-07T04:35:18Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38355/

SparkQA · 2021-01-07T04:54:33Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38357/

SparkQA · 2021-01-07T05:22:51Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38357/

SparkQA · 2021-01-07T05:36:12Z

Test build #133773 has finished for PR 31079 at commit 238cd1c.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
new AnalysisException(s"Can not load class '$className' when registering " +
sealed abstract class LikeAllBase extends MultiLikeBase
sealed abstract class LikeAnyBase extends MultiLikeBase

SparkQA · 2021-01-07T06:05:22Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38361/

dongjoon-hyun · 2021-01-07T06:38:37Z

cc @maropu and @cloud-fan

SparkQA · 2021-01-07T06:46:27Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38361/

SparkQA · 2021-01-07T06:53:52Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38363/

SparkQA · 2021-01-07T07:05:35Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38367/

SparkQA · 2021-01-07T07:10:11Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38367/

SparkQA · 2021-01-07T07:22:10Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38363/

SparkQA · 2021-01-07T07:30:35Z

Test build #133763 has finished for PR 31079 at commit b4dc838.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-01-07T08:09:56Z

Test build #133774 has finished for PR 31079 at commit ca9f82f.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-01-07T08:34:33Z

Test build #133769 has finished for PR 31079 at commit 858a545.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2021-01-07T09:05:36Z

Hmm? Can you elaborate on the issue more clearly? From the current PR description, I don't get what is going wrong. What is the bug, and how does it affect the query result?

viirya · 2021-01-07T09:07:26Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala

-      // Promote SUM, SUM DISTINCT and AVERAGE to largest types to prevent overflows.
-      case s @ Sum(e @ DecimalType()) => s // Decimal is already the biggest.


There is a reason for promoting these aggregation functions. Why remove them?

yaooqinn · 2021-01-08

The type coercion for numeric types of average and sum is not necessary at all, as the resultType and sumType can prevent the overflow, and it causes the issue here.

cloud-fan · 2021-01-08

Yea, Sum/Average already cast the inputs internally.
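A small plain-Scala illustration of that point, assuming a sum over `Int` inputs kept in a `Long` buffer. `WidenedSum` and its methods are hypothetical names, not Spark code; the sketch only shows that widening each value as it is accumulated gives the same answer as upcasting all inputs first.

```scala
object WidenedSum {
  // Accumulate Int inputs in a Long buffer. Each value is widened at the
  // moment it is added, so the input column needs no separate upcast.
  def sumInts(values: Seq[Int]): Long =
    values.foldLeft(0L)((acc, v) => acc + v.toLong)

  // Toy equivalent of the removed coercion: upcast every input first,
  // then sum. Same result, with one extra conversion step over the inputs.
  def sumIntsWithUpcast(values: Seq[Int]): Long =
    values.map(_.toLong).foldLeft(0L)(_ + _)
}
```

For `Seq(Int.MaxValue, 1)` both methods return a value that does not fit in an `Int`, whereas summing in `Int` arithmetic would wrap around.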

cloud-fan · 2021-01-07T09:08:31Z

Let's add a UT to show the bug directly. TPCDS query plans are hard to read.

SparkQA · 2021-01-07T10:09:37Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38374/

yaooqinn · 2021-01-07T17:09:48Z

retest this please

SparkQA · 2021-01-07T17:20:43Z

Test build #133802 has started for PR 31079 at commit 43aa93d.

SparkQA · 2021-01-07T17:52:42Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38391/

SparkQA · 2021-01-07T18:20:59Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38391/

yaooqinn · 2021-01-08T08:00:18Z

retest this please

SparkQA · 2021-01-08T09:04:49Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38416/

SparkQA · 2021-01-08T09:33:36Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38416/

SparkQA · 2021-01-08T12:53:24Z

Test build #133827 has finished for PR 31079 at commit 43aa93d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2021-01-15T07:22:34Z

@yaooqinn can we update the PR title and description? Now it's a simple improvement to avoid unnecessary casts for sum/avg.

SparkQA · 2021-01-15T07:35:47Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38675/

yaooqinn · 2021-01-15T07:36:20Z

Updated, thanks.

SparkQA · 2021-01-15T08:03:41Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38675/

SparkQA · 2021-01-15T10:23:03Z

Test build #134089 has finished for PR 31079 at commit e891f5e.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-01-15T13:46:23Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38689/

SparkQA · 2021-01-15T14:18:51Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38689/

SparkQA · 2021-01-15T17:32:46Z

Test build #134105 has finished for PR 31079 at commit 494a592.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya

Looks like it makes sense.

viirya · 2021-01-15T17:37:29Z

@yaooqinn Can you also update the JIRA to match the current PR title and description?

yaooqinn · 2021-01-15T17:39:09Z

@yaooqinn Can you also update the JIRA to match the current PR title and description?

OK

viirya · 2021-01-15T18:18:08Z

Thanks! Merging to master.

…& Sum which handle by themself internally (#940)

* Fix Number of partitions (0) must be positive

* [SPARK-34037][SQL] Remove unnecessary upcasting for Avg & Sum which handle by themself internally

### What changes were proposed in this pull request?

The type-coercion for numeric types of average and sum is not necessary at all, as the resultType and sumType can prevent the overflow.

### Why are the changes needed?

rm unnecessary logic which may cause potential performance regressions

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

tpcds tests for plan

Closes #31079 from yaooqinn/SPARK-34037.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Liang-Chi Hsieh <[email protected]>
(cherry picked from commit a235c3b)

* [SPARK-34037][SQL] Remove unnecessary upcasting for Avg & Sum which handle by themself internally

### What changes were proposed in this pull request?

The type-coercion for numeric types of average and sum is not necessary at all, as the resultType and sumType can prevent the overflow.

### Why are the changes needed?

rm unnecessary logic which may cause potential performance regressions

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

tpcds tests for plan

Closes #31079 from yaooqinn/SPARK-34037.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Liang-Chi Hsieh <[email protected]>
(cherry picked from commit a235c3b)

* fix

Co-authored-by: Kent Yao <[email protected]>

… expressions from aggregate expressions without aggregate function (#941)

* [SPARK-34581][SQL] Don't optimize out grouping expressions from aggregate expressions without aggregate function

### What changes were proposed in this pull request?

This PR adds a new rule `PullOutGroupingExpressions` to pull out complex grouping expressions to a `Project` node under an `Aggregate`. These expressions are then referenced in both grouping expressions and aggregate expressions without aggregate functions to ensure that optimization rules don't change the aggregate expressions to invalid ones that no longer refer to any grouping expressions.

### Why are the changes needed?

If aggregate expressions (without aggregate functions) in an `Aggregate` node are complex then the `Optimizer` can optimize out grouping expressions from them and so making aggregate expressions invalid.

Here is a simple example:
```
SELECT not(t.id IS NULL) , count(*)
FROM t
GROUP BY t.id IS NULL
```
In this case the `BooleanSimplification` rule does this:
```
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.BooleanSimplification ===
!Aggregate [isnull(id#222)], [NOT isnull(id#222) AS (NOT (id IS NULL))#226, count(1) AS c#224L]   Aggregate [isnull(id#222)], [isnotnull(id#222) AS (NOT (id IS NULL))#226, count(1) AS c#224L]
 +- Project [value#219 AS id#222]                                                                 +- Project [value#219 AS id#222]
    +- LocalRelation [value#219]                                                                     +- LocalRelation [value#219]
```
where `NOT isnull(id#222)` is optimized to `isnotnull(id#222)` and so it no longer refers to any grouping expression.

Before this PR:
```
== Optimized Logical Plan ==
Aggregate [isnull(id#222)], [isnotnull(id#222) AS (NOT (id IS NULL))#234, count(1) AS c#232L]
+- Project [value#219 AS id#222]
   +- LocalRelation [value#219]
```
and running the query throws an error:
```
Couldn't find id#222 in [isnull(id#222)#230,count(1)#226L]
java.lang.IllegalStateException: Couldn't find id#222 in [isnull(id#222)#230,count(1)#226L]
```
After this PR:
```
== Optimized Logical Plan ==
Aggregate [_groupingexpression#233], [NOT _groupingexpression#233 AS (NOT (id IS NULL))#230, count(1) AS c#228L]
+- Project [isnull(value#219) AS _groupingexpression#233]
   +- LocalRelation [value#219]
```
and the query works.

### Does this PR introduce _any_ user-facing change?

Yes, the query works.

### How was this patch tested?

Added new UT.

Closes #32396 from peter-toth/SPARK-34581-keep-grouping-expressions-2.

Authored-by: Peter Toth <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit cfc0495)

* [SPARK-34037][SQL] Remove unnecessary upcasting for Avg & Sum which handle by themself internally

### What changes were proposed in this pull request?

The type-coercion for numeric types of average and sum is not necessary at all, as the resultType and sumType can prevent the overflow.

### Why are the changes needed?

rm unnecessary logic which may cause potential performance regressions

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

tpcds tests for plan

Closes #31079 from yaooqinn/SPARK-34037.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Liang-Chi Hsieh <[email protected]>
(cherry picked from commit a235c3b)

Co-authored-by: Peter Toth <[email protected]>
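The idea behind `PullOutGroupingExpressions` in the commit message above can be sketched with a toy expression tree in plain Scala. Everything here (`Expr`, `PullOutSketch`, the `simplify` step standing in for `BooleanSimplification`, and the alias naming) is a hypothetical model for illustration, not the Catalyst implementation.

```scala
// Toy expression tree; a stand-in for Catalyst expressions.
sealed trait Expr
final case class Attr(name: String) extends Expr
final case class IsNull(child: Expr) extends Expr
final case class IsNotNull(child: Expr) extends Expr
final case class Not(child: Expr) extends Expr

object PullOutSketch {
  // Give every non-attribute grouping expression an alias that a
  // projection below the aggregate would compute.
  def pullOut(grouping: Seq[Expr]): Map[Expr, Attr] =
    grouping
      .collect { case e if !e.isInstanceOf[Attr] => e }
      .zipWithIndex
      .map { case (e, i) => e -> Attr(s"_groupingexpression$i") }
      .toMap

  // Replace occurrences of aliased grouping expressions inside `e`.
  def replace(e: Expr, aliases: Map[Expr, Attr]): Expr =
    aliases.get(e) match {
      case Some(attr) => attr
      case None =>
        e match {
          case Not(c)       => Not(replace(c, aliases))
          case IsNull(c)    => IsNull(replace(c, aliases))
          case IsNotNull(c) => IsNotNull(replace(c, aliases))
          case a: Attr      => a
        }
    }

  // Toy simplification: NOT isnull(x) becomes isnotnull(x).
  def simplify(e: Expr): Expr = e match {
    case Not(IsNull(c)) => IsNotNull(simplify(c))
    case Not(c)         => Not(simplify(c))
    case IsNull(c)      => IsNull(simplify(c))
    case IsNotNull(c)   => IsNotNull(simplify(c))
    case a: Attr        => a
  }
}
```

In this model, simplifying `Not(IsNull(Attr("id")))` directly yields `IsNotNull(Attr("id"))`, which no longer contains the grouping expression `IsNull(Attr("id"))`. After `pullOut` and `replace`, the aggregate expression refers to the alias attribute, and the simplification step leaves that reference intact.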

yaooqinn added 2 commits January 7, 2021 11:06

[SPARK-34037][SQL] aggOrder should not be output as a auxiliary internal attribute

b4dc838

nit

858a545

github-actions bot added the SQL label Jan 7, 2021

yaooqinn added 2 commits January 7, 2021 12:21

sum and averge

4288380

Merge branch 'master' into SPARK-34037

238cd1c

yaooqinn added 4 commits January 7, 2021 13:51

remove type coarecion for avg and sum

066c104

restore typecheck

0a3ae02

Merge branch 'master' into SPARK-34037

b154509

nit

41977fd

yaooqinn added 2 commits January 7, 2021 14:05

regen

4477423

sf

ca9f82f

viirya reviewed Jan 7, 2021

yaooqinn added 3 commits January 15, 2021 14:14

Merge branch 'master' into SPARK-34037

0f754ff

result file

a8ec532

result file

e891f5e

yaooqinn changed the title ~~[SPARK-34037][SQL] ResolveAggregateFunctions pushes duplicated sort order into aggregate because of unnecessary casting~~ [SPARK-34037][SQL] Remove unnecessary upcasting for Avg & Sum which handle by themself internally Jan 15, 2021

update results

494a592

cloud-fan approved these changes Jan 15, 2021

viirya approved these changes Jan 15, 2021

viirya closed this in a235c3b Jan 15, 2021

baibaichen mentioned this pull request Mar 22, 2021

[SPARK-22390][SPARK-32833][SQL] JDBC V2 Datasource aggregate push down #29695

Closed
