[SPARK-22465][Core] Add a safety-check to RDD defaultPartitioner #20002
Conversation
It adds a check that ignores an existing Partitioner if it has more than a single order of magnitude fewer partitions than the maximum number of partitions across the upstream RDDs.
The SparkR test failure seems unrelated to this PR. Any ideas what's wrong?
Hi @HyukjinKwon, can you please help me with these SparkR test failures? They seem unrelated to me.
cc: @tgravescs @codlife Could you please review this PR?
Yup, the AppVeyor test failure seems unrelated. I took a quick look, and it seems related to the latest
It was fixed in #20003. Rebasing should make the tests pass.
Thank you, @HyukjinKwon. The tests passed after rebasing.
@tgravescs, could you please take a look when you have some time?
Jenkins, test this please |
Test build #85192 has finished for PR 20002 at commit
ok to test |
@sujithjay, thanks for working on this. I will review, but I'm not sure I will get to it for a bit; I'm out for the holidays and not sure I can give this the time it needs for a full review today.
@tgravescs, thank you for keeping me informed. I look forward to receiving your review. Happy holidays!
@@ -57,7 +60,8 @@ object Partitioner {
   def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
     val rdds = (Seq(rdd) ++ others)
     val hasPartitioner = rdds.filter(_.partitioner.exists(_.numPartitions > 0))
-    if (hasPartitioner.nonEmpty) {
+    if (hasPartitioner.nonEmpty
+        && isEligiblePartitioner(hasPartitioner.maxBy(_.partitions.length), rdds)) {
`hasPartitioner.maxBy(_.partitions.length)` is used repeatedly; pull that into a variable?
   */
  private def isEligiblePartitioner(hasMaxPartitioner: RDD[_], rdds: Seq[RDD[_]]): Boolean = {
    val maxPartitions = rdds.map(_.partitions.length).max
    log10(maxPartitions).floor - log10(hasMaxPartitioner.getNumPartitions).floor < 1
Why `.floor`? It causes unnecessary discontinuity, imo; for example, (9, 11) will not satisfy the check, but it should.
Hi @mridulm, I suppose I was trying to ensure a strict order-of-magnitude check, but I agree it leads to a discontinuity. I will change this and the corresponding test cases.
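For illustration, here is a minimal, Spark-free sketch of the check with and without `.floor` (the function names are invented for this example, not the PR's code):

```scala
import scala.math.log10

// Eligibility with .floor: a strict order-of-magnitude comparison.
def eligibleWithFloor(partitionerParts: Int, maxUpstreamParts: Int): Boolean =
  log10(maxUpstreamParts).floor - log10(partitionerParts).floor < 1

// Eligibility without .floor: the continuous comparison suggested above.
def eligibleWithoutFloor(partitionerParts: Int, maxUpstreamParts: Int): Boolean =
  log10(maxUpstreamParts) - log10(partitionerParts) < 1

// (9, 11): floor(log10(11)) = 1 and floor(log10(9)) = 0, so the floored
// check rejects the pair, while log10(11) - log10(9) ≈ 0.09 accepts it.
```

The floored version only accepts pairs whose decimal magnitudes match exactly, which is what makes (9, 11) fail despite the counts being nearly equal.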
Test build #85200 has finished for PR 20002 at commit
Test build #85342 has finished for PR 20002 at commit
Test build #85344 has finished for PR 20002 at commit
Test build #85346 has finished for PR 20002 at commit
Scala style tests are failing on a file 'SparkHiveExample.scala', which is unrelated to this PR. I will rebase to master and try again.
@sujithjay, I opened a hotfix. It should be fine soon (maybe after a few hours).
Thank you, @HyukjinKwon. I will try again after the hotfix is merged to master.
Test build #85348 has finished for PR 20002 at commit
@@ -21,6 +21,8 @@ import java.io.{IOException, ObjectInputStream, ObjectOutputStream}

 import scala.collection.mutable
+import scala.collection.mutable.ArrayBuffer
+import scala.language.existentials
Curious, why was this required?
Without this import, there was a compiler warning:

    Warning:(66, 29) inferred existential type Option[org.apache.spark.rdd.RDD[_$2]]( forSome { type _$2 }), which cannot be expressed by wildcards, should be enabled
    by making the implicit value scala.language.existentials visible.
    This can be achieved by adding the import clause 'import scala.language.existentials'
    or by setting the compiler option -language:existentials.
    See the Scaladoc for value scala.language.existentials for a discussion
    why the feature should be explicitly enabled.

The build on Jenkins failed because of this warning.
If we explicitly set the type, is it still required? For example, with `val hasMaxPartitioner: Option[RDD[_]] = ...`?
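A self-contained illustration of that suggestion (the `Box` class stands in for `RDD` and is invented for this example):

```scala
// With an invariant type parameter, the inferred least upper bound of
// Box[Int] and Box[String] is an existential type, which triggers a feature
// warning unless scala.language.existentials is imported.
class Box[T](val value: T)

val flag = true

// Ascribing the wildcard type explicitly sidesteps the existential
// inference, so the feature import is no longer needed:
val boxed: Option[Box[_]] = if (flag) Some(new Box(1)) else Some(new Box("a"))
```

The explicit `Option[Box[_]]` annotation is the analogue of the reviewer's `Option[RDD[_]]` suggestion.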
Test build #85349 has finished for PR 20002 at commit
assert(partitioner1.numPartitions == rdd1.getNumPartitions)
assert(partitioner2.numPartitions == rdd3.getNumPartitions)
assert(partitioner3.numPartitions == rdd3.getNumPartitions)
assert(partitioner4.numPartitions == rdd3.getNumPartitions)
Can you add a test case such that numPartitions 9 vs 11 is not treated as an order-of-magnitude jump (to prevent future changes from breaking this)?
I left a couple of comments, @sujithjay. Overall it is looking good; thanks for working on it!
…xplicitly mention type of existential type
Thank you, @mridulm, for reviewing this PR. I have addressed the latest review comments.
Test build #85354 has finished for PR 20002 at commit
retest this please
Looks good @sujithjay ... once we have a successful build, I will merge it in.
Test build #85357 has finished for PR 20002 at commit
The failed unit test (at HistoryServerSuite.scala:350) seems unrelated to this PR.
retest this please
Test build #85360 has finished for PR 20002 at commit
Merged, thanks for fixing this @sujithjay!
…rtitioner when defaultParallelism is set

## What changes were proposed in this pull request?

#20002 proposed a way to safety-check the default partitioner; however, if `spark.default.parallelism` is set, the defaultParallelism could still be smaller than the proper number of partitions for the upstream RDDs. This PR extends the approach to address the condition when `spark.default.parallelism` is set. The requirements where the PR helps are:

- The max partitioner is not eligible, since it is at least an order of magnitude smaller, and
- the user has explicitly set `spark.default.parallelism`, and
- the value of `spark.default.parallelism` is lower than the max partitioner's partition count. Since the max partitioner was discarded for being at least an order of magnitude smaller, the default parallelism is even worse, even though user-specified.

In all other cases, the changes are a no-op.

## How was this patch tested?

Corresponding test cases were added in `PairRDDFunctionsSuite` and `PartitioningSuite`.

Author: Xingbo Jiang <[email protected]>
Closes #20091 from jiangxb1987/partitioner.
(cherry picked from commit 96cb60b)
Signed-off-by: Mridul Muralidharan <[email protected]>
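A minimal, Spark-free sketch of the reuse decision this follow-up describes (the function and parameter names are invented here; the actual Spark code differs in detail):

```scala
import scala.math.log10

// Reuse an existing partitioner if it is within one order of magnitude of
// the maximum upstream partition count, or if it already provides at least
// as many partitions as the (possibly user-set) default parallelism.
def reuseExistingPartitioner(existingParts: Int,
                             maxUpstreamParts: Int,
                             defaultParallelism: Int): Boolean =
  log10(maxUpstreamParts) - log10(existingParts) < 1 ||
    existingParts >= defaultParallelism
```

Under this reading, a small user-set `spark.default.parallelism` cannot force a shuffle into fewer partitions than an existing partitioner already provides, while an ineligible partitioner is still discarded when the default parallelism is larger.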
What changes were proposed in this pull request?
In choosing a Partitioner for a cogroup-like operation between a number of RDDs, the default behaviour was: if some of the RDDs already have a partitioner, choose the one amongst them with the maximum number of partitions.
This behaviour could, in some cases, hit the 2G limit (SPARK-6235). To illustrate one such scenario, consider two RDDs:
rdd1: with smaller data and a smaller number of partitions, along with a Partitioner.
rdd2: with much larger data and a larger number of partitions, without a Partitioner.
The cogroup of these two RDDs could hit the 2G limit, as a large amount of data is shuffled into a small number of partitions.
This PR introduces a safety-check wherein the Partitioner is chosen only if either of the following conditions is met:
How was this patch tested?
Unit tests in PartitioningSuite and PairRDDFunctionsSuite
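Putting the pieces together, here is a Spark-free sketch of the safety-check described above (`FakeRDD` and `choosePartitions` are invented for illustration, not Spark classes):

```scala
import scala.math.log10

// Stand-in for an RDD in this sketch: a partition count plus whether the
// RDD already carries a partitioner.
case class FakeRDD(numPartitions: Int, hasPartitioner: Boolean)

// Choose the partition count for a cogroup-like operation: reuse the largest
// existing partitioner only when it is within one order of magnitude of the
// maximum upstream partition count; otherwise fall back to that maximum.
def choosePartitions(rdds: Seq[FakeRDD]): Int = {
  val maxUpstream = rdds.map(_.numPartitions).max
  val largestExisting = rdds.filter(_.hasPartitioner)
                            .sortBy(-_.numPartitions).headOption
  largestExisting match {
    case Some(p) if log10(maxUpstream) - log10(p.numPartitions) < 1 =>
      p.numPartitions
    case _ => maxUpstream
  }
}
```

In the 2G-limit scenario above, a small partitioned RDD (say 10 partitions) cogrouped with a large unpartitioned one (say 10000 partitions) no longer shuffles all the data into 10 partitions, while near-equal counts such as 9 vs 11 still reuse the existing partitioner.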