Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-4084] Reuse sort key in Sorter #2937

Closed
wants to merge 13 commits into from
Closed

Conversation

mengxr
Copy link
Contributor

@mengxr mengxr commented Oct 24, 2014

Sorter uses generic-typed key for sorting. When data is large, it creates lots of key objects, which is not efficient. We should reuse the key in Sorter for memory efficiency. This change is part of the petabyte sort implementation from @rxin .

The Sorter class was written in Java and marked package private. So it is only available to org.apache.spark.util.collection. I renamed it to TimSort and add a simple wrapper of it, still called Sorter, in Scala, which is private[spark].

The benchmark code is updated, which now resets the array before each run. Here is the result on sorting primitive Int arrays of size 25 million using Sorter:

[info] - Sorter benchmark for key-value pairs !!! IGNORED !!!
Java Arrays.sort() on non-primitive int array: Took 13237 ms
Java Arrays.sort() on non-primitive int array: Took 13320 ms
Java Arrays.sort() on non-primitive int array: Took 15718 ms
Java Arrays.sort() on non-primitive int array: Took 13283 ms
Java Arrays.sort() on non-primitive int array: Took 13267 ms
Java Arrays.sort() on non-primitive int array: Took 15122 ms
Java Arrays.sort() on non-primitive int array: Took 15495 ms
Java Arrays.sort() on non-primitive int array: Took 14877 ms
Java Arrays.sort() on non-primitive int array: Took 16429 ms
Java Arrays.sort() on non-primitive int array: Took 14250 ms
Java Arrays.sort() on non-primitive int array: (13878 ms first try, 14499 ms average)
Java Arrays.sort() on primitive int array: Took 2683 ms
Java Arrays.sort() on primitive int array: Took 2683 ms
Java Arrays.sort() on primitive int array: Took 2701 ms
Java Arrays.sort() on primitive int array: Took 2746 ms
Java Arrays.sort() on primitive int array: Took 2685 ms
Java Arrays.sort() on primitive int array: Took 2735 ms
Java Arrays.sort() on primitive int array: Took 2669 ms
Java Arrays.sort() on primitive int array: Took 2693 ms
Java Arrays.sort() on primitive int array: Took 2680 ms
Java Arrays.sort() on primitive int array: Took 2642 ms
Java Arrays.sort() on primitive int array: (2948 ms first try, 2691 ms average)
Sorter without key reuse on primitive int array: Took 10732 ms
Sorter without key reuse on primitive int array: Took 12482 ms
Sorter without key reuse on primitive int array: Took 10718 ms
Sorter without key reuse on primitive int array: Took 12650 ms
Sorter without key reuse on primitive int array: Took 10747 ms
Sorter without key reuse on primitive int array: Took 10783 ms
Sorter without key reuse on primitive int array: Took 12721 ms
Sorter without key reuse on primitive int array: Took 10604 ms
Sorter without key reuse on primitive int array: Took 10622 ms
Sorter without key reuse on primitive int array: Took 11843 ms
Sorter without key reuse on primitive int array: (11089 ms first try, 11390 ms average)
Sorter with key reuse on primitive int array: Took 5141 ms
Sorter with key reuse on primitive int array: Took 5298 ms
Sorter with key reuse on primitive int array: Took 5066 ms
Sorter with key reuse on primitive int array: Took 5164 ms
Sorter with key reuse on primitive int array: Took 5203 ms
Sorter with key reuse on primitive int array: Took 5274 ms
Sorter with key reuse on primitive int array: Took 5186 ms
Sorter with key reuse on primitive int array: Took 5159 ms
Sorter with key reuse on primitive int array: Took 5164 ms
Sorter with key reuse on primitive int array: Took 5078 ms
Sorter with key reuse on primitive int array: (5311 ms first try, 5173 ms average)

So with key reuse, it is faster and less likely to trigger GC.

@SparkQA
Copy link

SparkQA commented Oct 24, 2014

Test build #22175 has started for PR 2937 at commit b00db4d.

  • This patch merges cleanly.

/**
* Returns the sort key for the element at the given index and reuse the input key if possible.
*/
protected def getKey(data: Buffer, pos: Int, reuse: K): K = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's somewhat surprising that a universal trait can have a default implementation, but maybe we can convert this to an abstract class to ensure it's still compiled into simple Java bytecode.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's also funny that Java ignores the protectedness, maybe we can upgrade all actually-public methods to public (that's everything but getKey(data: Buffer, pos: Int), which is only used internally)

@SparkQA
Copy link

SparkQA commented Oct 24, 2014

Test build #22175 has finished for PR 2937 at commit b00db4d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • raise TypeError("Cannot convert a Row class into dict")
    • class ShimFileSinkDesc(var dir: String, var tableInfo: TableDesc, var compressed: Boolean)
    • class ShimFileSinkDesc(var dir: String, var tableInfo: TableDesc, var compressed: Boolean)
    • case class LogInfo(startTime: Long, endTime: Long, path: String)

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22175/
Test FAILed.

@mengxr
Copy link
Contributor Author

mengxr commented Oct 26, 2014

@aarondav I updated the PR based on your comment. See the description for renaming Sorter.java to TimSort.Java.

@SparkQA
Copy link

SparkQA commented Oct 26, 2014

Test build #22230 has started for PR 2937 at commit 5f0d530.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 26, 2014

Test build #22230 has finished for PR 2937 at commit 5f0d530.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22230/
Test FAILed.

@mengxr
Copy link
Contributor Author

mengxr commented Oct 26, 2014

test this please

@SparkQA
Copy link

SparkQA commented Oct 26, 2014

Test build #22231 has started for PR 2937 at commit 5f0d530.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 26, 2014

Test build #22231 has finished for PR 2937 at commit 5f0d530.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22231/
Test PASSed.

@SparkQA
Copy link

SparkQA commented Oct 27, 2014

Test build #22307 has started for PR 2937 at commit 7de2efd.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 27, 2014

Test build #22307 has finished for PR 2937 at commit 7de2efd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22307/
Test PASSed.

@SparkQA
Copy link

SparkQA commented Oct 27, 2014

Test build #22312 has started for PR 2937 at commit 78f2879.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 28, 2014

Test build #22312 has finished for PR 2937 at commit 78f2879.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 28, 2014

Test build #22322 has started for PR 2937 at commit a72f53c.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 28, 2014

Test build #22322 has finished for PR 2937 at commit a72f53c.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22322/
Test FAILed.

val start = System.currentTimeMillis
times(numIters)(f)
System.currentTimeMillis - start
def timeIt(numIters: Int)(f: => Unit, prepare: Option[() => Unit] = None): Long = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add that it returns the total time across all iterations (which is not the behavior I expected).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@SparkQA
Copy link

SparkQA commented Oct 28, 2014

Test build #22334 has started for PR 2937 at commit 0b7b682.

  • This patch merges cleanly.

@@ -27,33 +27,51 @@ import scala.reflect.ClassTag
* Example format: an array of numbers, where each element is also the key.
* See [[KVArraySortDataFormat]] for a more exciting format.
*
* This trait extends Any to ensure it is universal (and thus compiled to a Java interface).
* Declaring and instantiating multiple subclasses of this class would prevent JIT inlining
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe slight rephrasing of this to something like

Note: Declaring and instantiating multiple subclasses of this class would prevent JIT inlining overridden methods and hence may decrease the shuffle performance.

I just added "Note" since it's sort of a tangent and "may" since it may not be noticeable for non computationally-intensive workloads.

@aarondav
Copy link
Contributor

LGTM once it passes travis!

@mengxr
Copy link
Contributor Author

mengxr commented Oct 28, 2014

Well, it won't pass travis ...

@SparkQA
Copy link

SparkQA commented Oct 28, 2014

Test build #22336 has started for PR 2937 at commit c63927f.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 28, 2014

Test build #22336 has finished for PR 2937 at commit c63927f.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22336/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Oct 28, 2014

Test build #22334 has finished for PR 2937 at commit 0b7b682.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22334/
Test PASSed.

@mengxr
Copy link
Contributor Author

mengxr commented Oct 28, 2014

@aarondav Does Jenkins count?

@SparkQA
Copy link

SparkQA commented Oct 28, 2014

Test build #22365 has started for PR 2937 at commit d73c3d0.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 28, 2014

Test build #22365 has finished for PR 2937 at commit d73c3d0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22365/
Test PASSed.

@aarondav
Copy link
Contributor

LGTM! Doing the merging.

@asfgit asfgit closed this in 84e5da8 Oct 28, 2014
asfgit pushed a commit that referenced this pull request Oct 29, 2014
I forgot to disable the benchmark code in #2937, which increased the Jenkins build time by couple minutes.

aarondav

Author: Xiangrui Meng <[email protected]>

Closes #2990 from mengxr/disable-benchmark and squashes the following commits:

c58f070 [Xiangrui Meng] disable benchmark code
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants