-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-4818][Core] Add 'iterator' to reduce memory consumed by join #3671
Conversation
Test build #24348 has started for PR 3671 at commit
|
Test build #24348 has finished for PR 3671 at commit
|
Test FAILed. |
Jenkins, retest this please. |
Test build #24351 has started for PR 3671 at commit
|
Test build #24351 has finished for PR 3671 at commit
|
Test PASSed. |
@@ -493,9 +493,9 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)]) | |||
def leftOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, Option[W]))] = { | |||
this.cogroup(other, partitioner).flatMapValues { pair => | |||
if (pair._2.isEmpty) { | |||
pair._1.map(v => (v, None)) | |||
pair._1.iterator.map(v => (v, None): (V, Option[W])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, are these types required? or can it be limited to just changing None
to None: Option[W]
?
Not that it hurts to spell out the types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None to None: Option[W]
Have tried. But not work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zsxwing First of all thanks for the patch! What kind of error are you getting without these explicit types? Compile time or runtime? I don't get any compile error/warning without these types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. My IDE had some problem. After I rebuilt the project, the errors gone.
Test build #24407 has started for PR 3671 at commit
|
Very interesting, any test for the effect on memory or performance? |
Test build #24407 has finished for PR 3671 at commit
|
Test PASSed. |
No. But I expect the memory will descrease from O(m * n) to O(m + n). |
@pwendell Is it OK to put this patch into branch 1.2, or it's too late? |
This looks good to me. This is a small fix but one which could significantly improve memory usage during joins, so I'm going to pull this into |
In Scala, `map` and `flatMap` of `Iterable` will copy the contents of `Iterable` to a new `Seq`. Such as, ```Scala val iterable = Seq(1, 2, 3).map(v => { println(v) v }) println("Iterable map done") val iterator = Seq(1, 2, 3).iterator.map(v => { println(v) v }) println("Iterator map done") ``` outputed ``` 1 2 3 Iterable map done Iterator map done ``` So we should use 'iterator' to reduce memory consumed by join. Found by Johannes Simon in http://mail-archives.apache.org/mod_mbox/spark-user/201412.mbox/%3C5BE70814-9D03-4F61-AE2C-0D63F2DE4446%40mail.de%3E Author: zsxwing <[email protected]> Closes #3671 from zsxwing/SPARK-4824 and squashes the following commits: 48ee7b9 [zsxwing] Remove the explicit types 95d59d6 [zsxwing] Add 'iterator' to reduce memory consumed by join (cherry picked from commit c233ab3) Signed-off-by: Josh Rosen <[email protected]>
In Scala, `map` and `flatMap` of `Iterable` will copy the contents of `Iterable` to a new `Seq`. Such as, ```Scala val iterable = Seq(1, 2, 3).map(v => { println(v) v }) println("Iterable map done") val iterator = Seq(1, 2, 3).iterator.map(v => { println(v) v }) println("Iterator map done") ``` outputed ``` 1 2 3 Iterable map done Iterator map done ``` So we should use 'iterator' to reduce memory consumed by join. Found by Johannes Simon in http://mail-archives.apache.org/mod_mbox/spark-user/201412.mbox/%3C5BE70814-9D03-4F61-AE2C-0D63F2DE4446%40mail.de%3E Author: zsxwing <[email protected]> Closes #3671 from zsxwing/SPARK-4824 and squashes the following commits: 48ee7b9 [zsxwing] Remove the explicit types 95d59d6 [zsxwing] Add 'iterator' to reduce memory consumed by join (cherry picked from commit c233ab3) Signed-off-by: Josh Rosen <[email protected]> Conflicts: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala
In Scala,
map
andflatMap
ofIterable
will copy the contents ofIterable
to a newSeq
. Such as,outputed
So we should use 'iterator' to reduce memory consumed by join.
Found by Johannes Simon in http://mail-archives.apache.org/mod_mbox/spark-user/201412.mbox/%3C5BE70814-9D03-4F61-AE2C-0D63F2DE4446%40mail.de%3E