
[SPARK-2926][Shuffle]Add MR style sort-merge shuffle read for Spark sort-based shuffle #3438

Closed
wants to merge 27 commits into from

Conversation

jerryshao
Contributor

This is joint work with @sryza. Details and a performance test report can be found in SPARK-2926.

@jerryshao jerryshao changed the title [SPARK-2926][Shuffle]Add MR style sort-merge shuffle read for Spark sort-based shuffle [WIP][SPARK-2926][Shuffle]Add MR style sort-merge shuffle read for Spark sort-based shuffle Nov 25, 2014
@SparkQA

SparkQA commented Nov 25, 2014

Test build #23807 has finished for PR 3438 at commit bfc2614.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 25, 2014

Test build #23809 has finished for PR 3438 at commit 7d839cd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class MemoryShuffleBlock(blockId: BlockId, blockData: ManagedBuffer)
    • final class ShuffleBlockFetcherIterator(
    • final class ShuffleRawBlockFetcherIterator(
    • case class DiskShuffleBlock(blockId: BlockId, file: File, len: Long)

@@ -17,6 +17,7 @@

package org.apache.spark.rdd

import scala.collection.mutable
Contributor

Redundant imports

@sryza
Contributor

sryza commented Nov 25, 2014

The main changes we implemented here are:

  • When a shuffle operation has a key ordering, sort records by key on the map side in addition to sorting by partition.
  • On the reduce side, keep blocks in serialized form, and deserialize and merge them when passing to the operation's output iterator. This means that only (# of blocks being merged) records need to be deserialized at any point in time. This part can be found in SortShuffleReader.
  • If the fetched blocks overflow memory, merge them to an on-disk file.
  • Add a TieredDiskMerger that avoids random I/O by merging up to 100 on-disk blocks at once. This should also be usable by ExternalAppendOnlyMap and ExternalSorter.
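The merge described above boils down to a k-way merge over already-sorted iterators. The following is an illustrative sketch of that idea using a priority queue keyed on each iterator's head record; `mergeSortedIterators` is a hypothetical helper for illustration, not the PR's actual MergeUtil or SortShuffleReader code:

```scala
import scala.collection.mutable

// Illustrative k-way merge of already-sorted (K, V) iterators, in the spirit
// of the sort-merge read described above. Only the head record of each input
// needs to be deserialized at any point in time.
def mergeSortedIterators[K, V](iters: Seq[Iterator[(K, V)]])
                              (implicit ord: Ordering[K]): Iterator[(K, V)] = {
  // Order buffered iterators by their head key; reversed because Scala's
  // PriorityQueue is a max-heap and we want the smallest head key first.
  val heap = mutable.PriorityQueue.empty[BufferedIterator[(K, V)]](
    Ordering.by[BufferedIterator[(K, V)], K](_.head._1).reverse)
  iters.map(_.buffered).filter(_.hasNext).foreach(heap.enqueue(_))

  new Iterator[(K, V)] {
    override def hasNext: Boolean = heap.nonEmpty
    override def next(): (K, V) = {
      val it = heap.dequeue()          // input with the smallest head key
      val kv = it.next()
      if (it.hasNext) heap.enqueue(it) // re-insert if records remain
      kv
    }
  }
}
```

For example, merging `Seq(Iterator((1, "a"), (3, "c")), Iterator((2, "b")))` yields the records in key order 1, 2, 3.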

@andrewor14
Contributor

Hi @jerryshao @sryza, what is the status of this? Is it still WIP?

@sryza
Contributor

sryza commented Feb 19, 2015

Hi @andrewor14, this is no longer a WIP. It requires a rebase, but I was hoping to get some feedback on the approach before working on that.

@andrewor14
Contributor

I see. By the way, in general it's good to remove the WIP from the title once it no longer applies, to encourage reviewers to look at this closely.

@sryza
Contributor

sryza commented Feb 19, 2015

Ah yeah, great point. @jerryshao mind updating the title? I don't have access.

@jerryshao
Contributor Author

Yeah, will do, thanks a lot :).

@jerryshao jerryshao changed the title [WIP][SPARK-2926][Shuffle]Add MR style sort-merge shuffle read for Spark sort-based shuffle [SPARK-2926][Shuffle]Add MR style sort-merge shuffle read for Spark sort-based shuffle Feb 20, 2015
@jerryshao jerryshao force-pushed the sort-shuffle-read-new-netty branch from 7d839cd to c3275ff Compare February 22, 2015 06:09
@SparkQA

SparkQA commented Feb 22, 2015

Test build #27831 has finished for PR 3438 at commit c3275ff.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class MemoryShuffleBlock(blockId: BlockId, blockData: ManagedBuffer)
    • final class ShuffleBlockFetcherIterator(
    • final class ShuffleRawBlockFetcherIterator(
    • case class DiskShuffleBlock(blockId: BlockId, file: File, len: Long)

@jerryshao
Contributor Author

Hi @andrewor14, I've rebased this code against the latest master. Would you please help review it? Thanks a lot.

/** Comparator for mergeSort; if keyOrdering is not available,
 * compare keys by their hash codes. */
private val keyComparator: Comparator[K] = dep.keyOrdering.getOrElse(new Comparator[K] {
  override def compare(a: K, b: K): Int = {
    val (h1, h2) = (a.hashCode(), b.hashCode())
    if (h1 < h2) -1 else if (h1 == h2) 0 else 1
  }
})
Contributor

We never get here, right? Since dep.keyOrdering should always be defined for SortShuffleReader, should this simply throw an exception instead?

@chenghao-intel
Contributor

Thanks @jerryshao @sryza, I have some minor comments on this.
This PR is quite critical for the Spark SQL performance improvement (sort-merge join) in #5208; I'd like to see this PR merged ASAP.
We also benchmarked this PR together with #5208: memory utilization dropped dramatically, and performance was even better than without the two PRs. More benchmark details will be published soon.

cc / @andrewor14 @rxin

/**
 * Notify the merger that no more on-disk blocks will be registered.
 */
def doneRegisteringOnDiskBlocks(): Unit = {
  doneRegistering = true
}
Contributor

There is a deadlock with line 176, so I think we need to move `doneRegistering = true` into `mergeReadyMonitor.synchronized {}`.

Contributor Author

Thanks a lot for your comments @lianhuiwang; we have also hit this issue while running queries. I will fix it ASAP.
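The fix under discussion follows the standard monitor pattern: the flag update and the notify must happen under the same lock that the waiter holds while checking the flag, otherwise the notification can fire between the waiter's check and its wait() and be lost. A minimal sketch of that pattern, under the assumption that a `mergeReadyMonitor` object guards `doneRegistering` as in the snippet above (the surrounding class is hypothetical):

```scala
// Minimal sketch of the missed-notification fix discussed above.
// Both the producer (setting the flag) and the consumer (checking it and
// waiting) synchronize on the same monitor, so no wakeup can be lost.
class TieredMergerSketch {
  private val mergeReadyMonitor = new Object
  private var doneRegistering = false

  def doneRegisteringOnDiskBlocks(): Unit = mergeReadyMonitor.synchronized {
    doneRegistering = true
    mergeReadyMonitor.notifyAll()
  }

  def awaitDoneRegistering(): Unit = mergeReadyMonitor.synchronized {
    // Loop rather than `if`, to tolerate spurious wakeups.
    while (!doneRegistering) {
      mergeReadyMonitor.wait()
    }
  }
}
```

A thread blocked in `awaitDoneRegistering()` is reliably released once another thread calls `doneRegisteringOnDiskBlocks()`.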

@jerryshao jerryshao force-pushed the sort-shuffle-read-new-netty branch from c3275ff to d6c94da Compare April 13, 2015 05:11
jerryshao and others added 9 commits April 13, 2015 13:15
Conflicts:
	core/src/main/scala/org/apache/spark/storage/BlockFetcherIterator.scala
	core/src/main/scala/org/apache/spark/storage/BlockManager.scala
	core/src/test/scala/org/apache/spark/storage/BlockFetcherIteratorSuite.scala

Conflicts:
	core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala
@SparkQA

SparkQA commented Apr 13, 2015

Test build #30143 has finished for PR 3438 at commit d6c94da.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class MemoryShuffleBlock(blockId: BlockId, blockData: ManagedBuffer)
    • final class ShuffleBlockFetcherIterator(
    • final class ShuffleRawBlockFetcherIterator(
    • case class DiskShuffleBlock(blockId: BlockId, file: File, len: Long)
  • This patch does not change any dependencies.

- assert(partitions(0) === Seq((0, 5), (0, 8), (2, 6)))
- assert(partitions(1) === Seq((1, 3), (3, 8), (3, 8)))
+ assert(partitions(0).toSet === Set((0, 5), (0, 8), (2, 6)))
+ assert(partitions(1).toSet === Set((1, 3), (3, 8), (3, 8)))
Contributor

Why change this? It would no longer be able to check the correct order after sorting.

Contributor Author

Hi @adrian-wang, the ordering is correct. Since we only guarantee by-key ordering, the order of tuples with the same key is not guaranteed by this PR; that would require a secondary sort.

Contributor Author

The original code will return (0, 5), (0, 8), (2, 6) and this PR will return (0, 8), (0, 5), (2, 6). I think both follow the criteria of sort-based shuffle (keeping keys in order). Maybe the change is not so straightforward, but it is correct from my understanding.

Contributor

From the test itself, we can't tell whether the tuples in partitions(0) or partitions(1) are sorted, right?

Contributor Author

Yeah, this is a problem. I will figure out a better way to test it.
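One way to test exactly the guarantee this PR makes is to assert that keys are non-decreasing without constraining the relative order of tuples that share a key. This is a hypothetical assertion helper for illustration, not code from the PR:

```scala
// Assert that records are sorted by key, without constraining the relative
// order of tuples that share a key (which this PR does not guarantee).
def assertSortedByKey[K, V](records: Seq[(K, V)])(implicit ord: Ordering[K]): Unit = {
  val keys = records.map(_._1)
  assert(keys == keys.sorted, s"keys not in order: $keys")
}
```

Both `Seq((0, 5), (0, 8), (2, 6))` and `Seq((0, 8), (0, 5), (2, 6))` pass this check, while a plain `=== Seq(...)` comparison would reject one of them.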

asfgit pushed a commit that referenced this pull request Apr 15, 2015
Thanks for the initial work from Ishiihara in #3173

This PR introduces a new join method, sort merge join, which first ensures that keys of the same value are in the same partition and that inside each partition the rows are sorted by key. Then we can run down both sides together and find matched rows using [sort merge join](http://en.wikipedia.org/wiki/Sort-merge_join). In this way, we don't have to store the whole hash table of one side as in hash join, so we use less memory. Also, this PR would benefit from #3438 , making the sorting phase much more efficient.

We introduced a new configuration, "spark.sql.planner.sortMergeJoin", to switch between this (`true`) and ShuffledHashJoin (`false`); probably we want its default value to be `false` at first.

Author: Daoyuan Wang <[email protected]>
Author: Michael Armbrust <[email protected]>

This patch had conflicts when merged, resolved by
Committer: Michael Armbrust <[email protected]>

Closes #5208 from adrian-wang/smj and squashes the following commits:

2493b9f [Daoyuan Wang] fix style
5049d88 [Daoyuan Wang] propagate rowOrdering for RangePartitioning
f91a2ae [Daoyuan Wang] yin's comment: use external sort if option is enabled, add comments
f515cd2 [Daoyuan Wang] yin's comment: outputOrdering, join suite refine
ec8061b [Daoyuan Wang] minor change
413fd24 [Daoyuan Wang] Merge pull request #3 from marmbrus/pr/5208
952168a [Michael Armbrust] add type
5492884 [Michael Armbrust] copy when ordering
7ddd656 [Michael Armbrust] Cleanup addition of ordering requirements
b198278 [Daoyuan Wang] inherit ordering in project
c8e82a3 [Daoyuan Wang] fix style
6e897dd [Daoyuan Wang] hide boundReference from manually construct RowOrdering for key compare in smj
8681d73 [Daoyuan Wang] refactor Exchange and fix copy for sorting
2875ef2 [Daoyuan Wang] fix changed configuration
61d7f49 [Daoyuan Wang] add omitted comment
00a4430 [Daoyuan Wang] fix bug
078d69b [Daoyuan Wang] address comments: add comments, do sort in shuffle, and others
3af6ba5 [Daoyuan Wang] use buffer for only one side
171001f [Daoyuan Wang] change default outputordering
47455c9 [Daoyuan Wang] add apache license ...
a28277f [Daoyuan Wang] fix style
645c70b [Daoyuan Wang] address comments using sort
068c35d [Daoyuan Wang] fix new style and add some tests
925203b [Daoyuan Wang] address comments
07ce92f [Daoyuan Wang] fix ArrayIndexOutOfBound
42fca0e [Daoyuan Wang] code clean
e3ec096 [Daoyuan Wang] fix comment style..
2edd235 [Daoyuan Wang] fix outputpartitioning
57baa40 [Daoyuan Wang] fix sort eval bug
303b6da [Daoyuan Wang] fix several errors
95db7ad [Daoyuan Wang] fix brackets for if-statement
4464f16 [Daoyuan Wang] fix error
880d8e9 [Daoyuan Wang] sort merge join for spark sql
val partialMergedItr =
  MergeUtil.mergeSort(itrGroup, keyComparator, dep.keyOrdering, dep.aggregator)
val curWriteMetrics = new ShuffleWriteMetrics()
var writer = blockManager.getDiskWriter(tmpBlockId, file, ser, fileBufferSize, curWriteMetrics)
Contributor

getDiskWriter was changed in #5606.

@andrewor14
Contributor

@jerryshao unfortunately the shuffle code has changed significantly since this patch was last updated, and it is unlikely to be merged. Would you mind closing this patch for now? If there's interest we can always reopen it against the latest master branch.

@jerryshao jerryshao closed this Sep 2, 2015
@jerryshao
Contributor Author

OK, thanks a lot.

@yaooqinn
Member

Is there any progress on studying this mechanism?

9 participants