[SPARK-4818][Core] Add 'iterator' to reduce memory consumed by join #3671

zsxwing · 2014-12-11T03:02:54Z

In Scala, map and flatMap on an Iterable are strict: they eagerly copy all results into a new collection. For example,

  val iterable = Seq(1, 2, 3).map(v => {
    println(v)
    v
  })
  println("Iterable map done")

  val iterator = Seq(1, 2, 3).iterator.map(v => {
    println(v)
    v
  })
  println("Iterator map done")

prints

1
2
3
Iterable map done
Iterator map done

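By contrast, map on an Iterator is lazy and produces elements only as they are consumed, which is why the second block prints no numbers. So we should use 'iterator' in join to avoid materializing the intermediate collection and reduce the memory consumed.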

Found by Johannes Simon in http://mail-archives.apache.org/mod_mbox/spark-user/201412.mbox/%3C5BE70814-9D03-4F61-AE2C-0D63F2DE4446%40mail.de%3E

SparkQA · 2014-12-11T03:07:27Z

Test build #24348 has started for PR 3671 at commit 95d59d6.

This patch merges cleanly.

SparkQA · 2014-12-11T04:10:37Z

Test build #24348 has finished for PR 3671 at commit 95d59d6.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-12-11T04:10:41Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24348/
Test FAILed.

zsxwing · 2014-12-11T04:27:16Z

Jenkins, retest this please.

SparkQA · 2014-12-11T04:35:35Z

Test build #24351 has started for PR 3671 at commit 95d59d6.

This patch merges cleanly.

SparkQA · 2014-12-11T05:54:42Z

Test build #24351 has finished for PR 3671 at commit 95d59d6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-12-11T05:54:45Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24351/
Test PASSed.

srowen · 2014-12-11T10:43:46Z

core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala

@@ -493,9 +493,9 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
  def leftOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, Option[W]))] = {
    this.cogroup(other, partitioner).flatMapValues { pair =>
      if (pair._2.isEmpty) {
-        pair._1.map(v => (v, None))
+        pair._1.iterator.map(v => (v, None): (V, Option[W]))


Interesting, are these types required, or can the change be limited to just replacing None with None: Option[W]?
Not that it hurts to spell out the types.

zsxwing · 2014-12-11

> None to None: Option[W]

I tried that, but it did not work.

johannessimon · 2014-12-12

@zsxwing First of all, thanks for the patch! What kind of error are you getting without these explicit types: compile time or runtime? I don't get any compile error or warning without them.

zsxwing · 2014-12-12

You're right. My IDE had a problem. After I rebuilt the project, the errors were gone.
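
For readers following the question about explicit types, here is a minimal standalone sketch in plain Scala (not the Spark source). `LeftOuterSketch` and `leftOuterPairs` are hypothetical names, `vs` and `ws` stand in for `pair._1` and `pair._2`, and the `else` branch is an assumption, since the diff above only shows the empty case. It suggests why no ascription on `None` is needed once the result type is declared.

```scala
object LeftOuterSketch {
  // Hypothetical stand-in for the per-key function passed to flatMapValues:
  // vs and ws play the role of pair._1 and pair._2 produced by cogroup.
  def leftOuterPairs[V, W](vs: Iterable[V], ws: Iterable[W]): Iterator[(V, Option[W])] =
    if (ws.isEmpty) {
      // No ascription on None: the expression type-checks against the
      // declared result type Iterator[(V, Option[W])].
      vs.iterator.map(v => (v, None))
    } else {
      // Assumed shape of the non-empty case: pair every v with every w, lazily.
      for (v <- vs.iterator; w <- ws.iterator) yield (v, Some(w))
    }

  def main(args: Array[String]): Unit = {
    println(leftOuterPairs(Seq(1, 2), Seq.empty[String]).toList)
    println(leftOuterPairs(Seq(1), Seq("a", "b")).toList)
  }
}
```

Both branches return an Iterator, so nothing is copied until the caller consumes the result.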

SparkQA · 2014-12-12T13:25:11Z

Test build #24407 has started for PR 3671 at commit 48ee7b9.

This patch merges cleanly.

scwf · 2014-12-12T13:30:38Z

Very interesting, any test for the effect on memory or performance?

SparkQA · 2014-12-12T14:46:16Z

Test build #24407 has finished for PR 3671 at commit 48ee7b9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-12-12T14:46:20Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24407/
Test PASSed.

zsxwing · 2014-12-12T15:59:26Z

> Very interesting, any test for the effect on memory or performance?

No, but I expect memory usage to decrease from O(m * n) to O(m + n).
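
The estimate above can be illustrated with a small standalone sketch in plain Scala (not Spark code, and not a benchmark). It assumes m and n are the numbers of values a key has on each side of the join, which is one reading of the comment; `JoinCostSketch`, `strictBuilt`, and `lazyBuilt` are hypothetical names. It counts how many pairs exist before the consumer asks for any.

```scala
object JoinCostSketch {
  // Strict variant: a for-comprehension over two Ranges builds every pair
  // up front. Returns how many pairs were built before any were consumed.
  def strictBuilt(m: Int, n: Int): Int = {
    var built = 0
    val pairs = for (v <- 1 to m; w <- 1 to n) yield { built += 1; (v, w) }
    assert(pairs.size == m * n) // all m * n pairs are held in memory at once
    built
  }

  // Lazy variant: the same comprehension over iterators builds pairs on demand.
  // Returns (pairs built before consumption, pairs built after taking k).
  def lazyBuilt(m: Int, n: Int, k: Int): (Int, Int) = {
    var built = 0
    val pairs =
      for (v <- (1 to m).iterator; w <- (1 to n).iterator) yield { built += 1; (v, w) }
    val before = built
    pairs.take(k).foreach(_ => ()) // consume only the first k pairs
    (before, built)
  }

  def main(args: Array[String]): Unit = {
    println(strictBuilt(1000, 1000))
    println(lazyBuilt(1000, 1000, 5))
  }
}
```

In the strict variant all m * n pairs are alive together; in the lazy variant only the inputs (m + n values) plus the pair currently being consumed need to be held.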

zsxwing · 2014-12-16T07:27:44Z

@pwendell Is it OK to put this patch into branch 1.2, or is it too late?

JoshRosen · 2014-12-22T22:26:06Z

This looks good to me. This is a small fix but one which could significantly improve memory usage during joins, so I'm going to pull this into master (1.3.0), branch-1.2 (1.2.1), and branch-1.1 (1.1.2).

In Scala, `map` and `flatMap` of `Iterable` will copy the contents of `Iterable` to a new `Seq`. Such as,

```Scala
val iterable = Seq(1, 2, 3).map(v => {
  println(v)
  v
})
println("Iterable map done")

val iterator = Seq(1, 2, 3).iterator.map(v => {
  println(v)
  v
})
println("Iterator map done")
```

outputed

```
1
2
3
Iterable map done
Iterator map done
```

So we should use 'iterator' to reduce memory consumed by join.

Found by Johannes Simon in http://mail-archives.apache.org/mod_mbox/spark-user/201412.mbox/%3C5BE70814-9D03-4F61-AE2C-0D63F2DE4446%40mail.de%3E

Author: zsxwing <[email protected]>

Closes #3671 from zsxwing/SPARK-4824 and squashes the following commits:

48ee7b9 [zsxwing] Remove the explicit types
95d59d6 [zsxwing] Add 'iterator' to reduce memory consumed by join

(cherry picked from commit c233ab3)
Signed-off-by: Josh Rosen <[email protected]>

In Scala, `map` and `flatMap` of `Iterable` will copy the contents of `Iterable` to a new `Seq`. Such as,

```Scala
val iterable = Seq(1, 2, 3).map(v => {
  println(v)
  v
})
println("Iterable map done")

val iterator = Seq(1, 2, 3).iterator.map(v => {
  println(v)
  v
})
println("Iterator map done")
```

outputed

```
1
2
3
Iterable map done
Iterator map done
```

So we should use 'iterator' to reduce memory consumed by join.

Found by Johannes Simon in http://mail-archives.apache.org/mod_mbox/spark-user/201412.mbox/%3C5BE70814-9D03-4F61-AE2C-0D63F2DE4446%40mail.de%3E

Author: zsxwing <[email protected]>

Closes #3671 from zsxwing/SPARK-4824 and squashes the following commits:

48ee7b9 [zsxwing] Remove the explicit types
95d59d6 [zsxwing] Add 'iterator' to reduce memory consumed by join

(cherry picked from commit c233ab3)
Signed-off-by: Josh Rosen <[email protected]>

Conflicts:
core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala

Add 'iterator' to reduce memory consumed by join

95d59d6

zsxwing changed the title ~~[SPARK-4824][Core] Add 'iterator' to reduce memory consumed by join~~ [SPARK-4818][Core] Add 'iterator' to reduce memory consumed by join Dec 11, 2014

srowen reviewed Dec 11, 2014

Remove the explicit types

48ee7b9

asfgit closed this in c233ab3 Dec 22, 2014

zsxwing deleted the SPARK-4824 branch December 23, 2014 02:19
