[SPARK-8160][SQL]Support using external sorting to run aggregate #6875
Conversation
Test build #35110 has finished for PR 6875 at commit
Test build #35122 has finished for PR 6875 at commit
@JoshRosen this may be of interest to you
@andrewor14, yes, definitely 😄 I have a patch (#6444) to implement an optimized binary processing sort for use in Spark SQL, and the change here will amplify the benefits of the new sort, so I'm super excited about this.
@JoshRosen Yes, I have looked at patch #6444. This PR has no conflict with #6444 because it simply implements sort-based aggregation after sorting by the group key, so it can run on top of both the external sort and the binary processing sort.
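To make that idea concrete, here is a minimal, self-contained sketch (not this PR's code) of sort-based aggregation over input that is already sorted by its grouping key: consecutive rows with the same key are folded into one buffer, and the group is emitted as soon as the key changes. The names `sortBasedAggregate`, `newBuffer`, and `update` are illustrative assumptions.

```scala
// Sketch of sort-based aggregation over rows pre-sorted by grouping key K.
// Because the input is sorted, a group never reappears later, so each group
// can be emitted as soon as its key stops matching the current one.
def sortBasedAggregate[K, V, B](
    sortedRows: Iterator[(K, V)],   // rows already sorted by grouping key
    newBuffer: () => B,             // create an empty aggregation buffer
    update: (B, V) => B             // fold one value into the buffer
): Iterator[(K, B)] = new Iterator[(K, B)] {
  private val input = sortedRows.buffered
  override def hasNext: Boolean = input.hasNext
  override def next(): (K, B) = {
    val (key, firstValue) = input.next()
    var buffer = update(newBuffer(), firstValue)
    // Consume all consecutive rows that share the current key.
    while (input.hasNext && input.head._1 == key) {
      buffer = update(buffer, input.next()._2)
    }
    (key, buffer)
  }
}

// Example: summing values per key over an input already sorted by key.
val sorted = Iterator(("a", 1), ("a", 2), ("b", 5), ("c", 3), ("c", 4))
sortBasedAggregate[String, Int, Int](sorted, () => 0, _ + _).foreach(println)
// prints (a,3), (b,5), (c,7)
```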
@lianhuiwang Can we do hash-based aggregation first, then switch to sort-based if we cannot hold all of it in memory? (We can still have a flag to disable it.)
@davies If we cannot hold all of it in memory and then switch to sort-based, we would have to re-shuffle the data to do the sort, so the computation cost is very expensive. I think the choice should be determined by statistics before physical plan execution. The problem is similar to choosing between hash join and sort-merge join: sort-merge join is currently determined by spark.sql.planner.sortMergeJoin (default false). Like sort-merge join, the sort-based aggregation in this PR is determined by spark.sql.planner.sortMergeAggregate (default false).
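For reference, a hedged example of how such a flag would be toggled, by analogy with the existing sort-merge-join switch. Note that `spark.sql.planner.sortMergeAggregate` is the key proposed by this PR and does not exist in released Spark; the app name and master URL below are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(
  new SparkConf().setAppName("agg-config-demo").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

// Existing switch: prefer sort-merge join over shuffled hash join (default false).
sqlContext.setConf("spark.sql.planner.sortMergeJoin", "true")
// Switch proposed by this PR (default false); not present in released Spark.
sqlContext.setConf("spark.sql.planner.sortMergeAggregate", "true")
```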
@lianhuiwang Aggregation is different from join, because aggregation can reduce the data size but join cannot. The optimizer can figure out whether to use broadcast join or sort-merge join based on the size of the table, but it is very hard to guess what the memory consumption of an aggregation will be (it is determined by the number of unique groups and the aggregation algorithm). All the aggregation happens within a partition, so no extra shuffling is needed. Usually there are two aggregations, one before and one after the shuffle.
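The "two aggregations around the shuffle" pattern described above can be illustrated with plain RDD operations; this is only an analogy, not the Spark SQL physical operators discussed in this PR.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("two-phase-agg").setMaster("local[*]"))
val pairs = sc.parallelize(Seq("a", "b", "a", "c", "b", "a")).map(w => (w, 1L))
// reduceByKey combines within each partition first (partial aggregation),
// shuffles the compacted partial results, then merges them on the reduce
// side (final aggregation) -- so the data crossing the shuffle is already
// much smaller than the raw input.
val counts = pairs.reduceByKey(_ + _)
counts.collect().foreach(println)
```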
@davies Thanks, yes, I get it. I think it is similar to the aggregation done in ExternalSorter.
I think we should close this PR for now while we review the other one; let's re-open if necessary.
OK @JoshRosen, thanks, I will close this PR.
Add the config 'spark.sql.planner.sortMergeAggregate' to turn on sort-merge aggregation (default is false), and add two classes to support it:
SortMergeAggregate is for Aggregate plans that cannot be codegened.
SortMergeGeneratedAggregate is for GeneratedAggregate plans that can be codegened.
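As a rough illustration of the planning decision described above, here is a minimal, self-contained sketch; the operator case classes, the `planAggregate` helper, and the config lookup below are simplified stand-ins, not the PR's actual Catalyst strategy.

```scala
// Hypothetical physical operators mirroring the classes named in the PR
// description; these are stand-ins, not the real Spark SQL plan nodes.
sealed trait PhysicalOp
case class HashAggregateOp(groupingKeys: Seq[String]) extends PhysicalOp
case class SortMergeAggregateOp(groupingKeys: Seq[String]) extends PhysicalOp          // non-codegen path
case class SortMergeGeneratedAggregateOp(groupingKeys: Seq[String]) extends PhysicalOp // codegen path

// Choose an operator based on the proposed flag and whether codegen is enabled.
def planAggregate(
    groupingKeys: Seq[String],
    codegenEnabled: Boolean,
    conf: Map[String, String]): PhysicalOp = {
  val useSortMerge =
    conf.getOrElse("spark.sql.planner.sortMergeAggregate", "false").toBoolean
  if (useSortMerge && codegenEnabled) SortMergeGeneratedAggregateOp(groupingKeys)
  else if (useSortMerge) SortMergeAggregateOp(groupingKeys)
  else HashAggregateOp(groupingKeys)
}

// Example: with the flag on and codegen enabled, the codegen sort-merge
// operator is selected.
println(planAggregate(Seq("groupKey"), codegenEnabled = true,
  Map("spark.sql.planner.sortMergeAggregate" -> "true")))
```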