-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-4084] Reuse sort key in Sorter #2937
Conversation
Test build #22175 has started for PR 2937 at commit
|
/** | ||
* Returns the sort key for the element at the given index and reuse the input key if possible. | ||
*/ | ||
protected def getKey(data: Buffer, pos: Int, reuse: K): K = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's somewhat surprising that a universal trait can have a default implementation, but maybe we can convert this to an abstract class to ensure it's still compiled into simple Java bytecode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's also funny that Java ignores the protectedness, maybe we can upgrade all actually-public methods to public (that's everything but getKey(data: Buffer, pos: Int)
, which is only used internally)
Test build #22175 has finished for PR 2937 at commit
|
Test FAILed. |
@aarondav I updated the PR based on your comment. See the description for renaming |
Test build #22230 has started for PR 2937 at commit
|
Test build #22230 has finished for PR 2937 at commit
|
Test FAILed. |
test this please |
Test build #22231 has started for PR 2937 at commit
|
Test build #22231 has finished for PR 2937 at commit
|
Test PASSed. |
Test build #22307 has started for PR 2937 at commit
|
Test build #22307 has finished for PR 2937 at commit
|
Test PASSed. |
Test build #22312 has started for PR 2937 at commit
|
Test build #22312 has finished for PR 2937 at commit
|
Test build #22322 has started for PR 2937 at commit
|
Test build #22322 has finished for PR 2937 at commit
|
Test FAILed. |
val start = System.currentTimeMillis | ||
times(numIters)(f) | ||
System.currentTimeMillis - start | ||
def timeIt(numIters: Int)(f: => Unit, prepare: Option[() => Unit] = None): Long = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add that it returns the total time across all iterations (which is not the behavior I expected).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
Test build #22334 has started for PR 2937 at commit
|
@@ -27,33 +27,51 @@ import scala.reflect.ClassTag | |||
* Example format: an array of numbers, where each element is also the key. | |||
* See [[KVArraySortDataFormat]] for a more exciting format. | |||
* | |||
* This trait extends Any to ensure it is universal (and thus compiled to a Java interface). | |||
* Declaring and instantiating multiple subclasses of this class would prevent JIT inlining |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe slight rephrasing of this to something like
Note: Declaring and instantiating multiple subclasses of this class would prevent JIT inlining overridden methods and hence may decrease the shuffle performance.
I just added "Note" since it's sort of a tangent and "may" since it may not be noticeable for non computationally-intensive workloads.
LGTM once it passes travis! |
Well, it won't pass |
Test build #22336 has started for PR 2937 at commit
|
Test build #22336 has finished for PR 2937 at commit
|
Test FAILed. |
Test build #22334 has finished for PR 2937 at commit
|
Test PASSed. |
@aarondav Does |
Test build #22365 has started for PR 2937 at commit
|
Test build #22365 has finished for PR 2937 at commit
|
Test PASSed. |
LGTM! Doing the merging. |
I forgot to disable the benchmark code in #2937, which increased the Jenkins build time by couple minutes. aarondav Author: Xiangrui Meng <[email protected]> Closes #2990 from mengxr/disable-benchmark and squashes the following commits: c58f070 [Xiangrui Meng] disable benchmark code
Sorter uses generic-typed key for sorting. When data is large, it creates lots of key objects, which is not efficient. We should reuse the key in Sorter for memory efficiency. This change is part of the petabyte sort implementation from @rxin .
The
Sorter
class was written in Java and marked package private. So it is only available toorg.apache.spark.util.collection
. I renamed it toTimSort
and add a simple wrapper of it, still calledSorter
, in Scala, which isprivate[spark]
.The benchmark code is updated, which now resets the array before each run. Here is the result on sorting primitive Int arrays of size 25 million using Sorter:
So with key reuse, it is faster and less likely to trigger GC.