
[SPARK-18788][SPARKR] Add API for getNumPartitions #16668

Closed
wants to merge 6 commits

Conversation

felixcheung
Member

What changes were proposed in this pull request?

With a doc note saying this would convert the DF into an RDD

How was this patch tested?

unit tests, manual tests

@SparkQA

SparkQA commented Jan 21, 2017

Test build #71761 has started for PR 16668 at commit 34f9aa5.

@felixcheung
Member Author

Jenkins, retest this please

@SparkQA

SparkQA commented Jan 21, 2017

Test build #71764 has finished for PR 16668 at commit 34f9aa5.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

#' \dontrun{
#' df <- createDataFrame(cars, numPartitions = 2)
#' getNumPartitions(df)
#' }
#' @note getNumPartitions since 2.1.1
Member

@felixcheung, should this be since 2.2.0? Just curious.

Member Author

I debated about this quite a bit - generally it should be, but we merged createDataFrame(..., numPartitions) into 2.1 and it felt important to have getNumPartitions in the same release too.

@SparkQA

SparkQA commented Jan 21, 2017

Test build #71772 has finished for PR 16668 at commit 7c057fc.

  • This patch fails some tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

setMethod("getNumPartitions",
          signature(x = "SparkDataFrame"),
          function(x) {
            getNumPartitionsRDD(toRDD(x))
          })
Contributor

As discussed in the JIRA, I worry that this will be a very expensive operation for large data frames. Specifically, instead of creating an RRDD, can we do some operations on the Scala side which might be cheaper?

cc @yhuai @cloud-fan who know more about DataFrame on the SQL side

Member Author

Right, we agreed.
The conversion, especially into RRDD, is particularly concerning. From what I can see, though, df.rdd.getNumPartitions is the recommended practice, and it appears all over pyspark (granted, DataFrame-to-RDD conversion in pyspark is likely slightly more efficient).

An alternative is that we could wrap all of this on the JVM side - at least that should save us the round trip to RRDD.

But agreed - is there a more efficient way this could be exposed in DataFrame/Dataset directly instead?
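The JVM-side wrapping floated above can be sketched from R through SparkR's JVM bridge. A minimal sketch, assuming an active SparkR session (it uses the internal `callJMethod` helper, so it is illustrative rather than standalone-runnable):

```
library(SparkR)
sparkR.session()

df <- createDataFrame(cars, numPartitions = 2)

# Invoke Dataset.rdd on the JVM object, then RDD.getNumPartitions on the
# returned handle; no R-side RRDD conversion takes place.
numParts <- callJMethod(callJMethod(df@sdf, "rdd"), "getNumPartitions")
```

This is essentially the shape the implementation in this PR eventually settles on.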

Contributor

Shall we add getNumPartitions to DataFrame/Dataset on the Scala side?

Member Author

That would be great!

Contributor

are you going to do it here? Or do we need to send a new PR for the scala side changes?

Contributor

Isn't this just calling rdd.getNumPartitions? We need to materialize the RDD inside the DataFrame anyway, but it's cheap on the Scala side.

Member Author

Ah, that we could do easily. Is that OK for Spark 2.1.1? If yes, I could go ahead with changes here for Scala, Python and R.

Contributor

You said this filled a hole for Spark 2.1 - what's this hole? Is this SparkR only?

Member Author

Sorry, I should clarify. Yes, for R only - since SparkR only has DataFrame APIs and no (publicly supported) RDD APIs, users are left without a way to check the number of partitions.

Contributor

maybe we can add this slow implementation to Spark 2.1, and improve it in Spark 2.2

@SparkQA

SparkQA commented Jan 22, 2017

Test build #71779 has finished for PR 16668 at commit ad1bd14.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 22, 2017

Test build #71786 has finished for PR 16668 at commit bab7466.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member Author

@shivaram how about we merge this to master & branch-2.1? Then I can base the Scala Dataset/DataFrame API change off of this, as @cloud-fan suggests - it would be easier than porting the little fixes to get around the getNumPartitions conflicts in R. And having this in 2.1.x is not likely much worse than people calling the non-public methods...

@shivaram
Contributor

@felixcheung Why don't we do something simpler where we call the Scala function from the R side? i.e. get a handle to the Scala DF, call .rdd on it to get a handle to the Scala RDD, etc. That seems less expensive than running the conversion to RRDD, and it doesn't involve Scala-side changes.

@felixcheung
Member Author

I like it! Done.

@SparkQA

SparkQA commented Jan 26, 2017

Test build #72041 has finished for PR 16668 at commit 0353978.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

setMethod("getNumPartitions",
          signature(x = "SparkDataFrame"),
          function(x) {
            callJMethod(callJMethod(x@sdf, "rdd"), "getNumPartitions")
          })
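With the JVM-bridge implementation above, usage from R matches the roxygen example in the docs. A minimal sketch, assuming a running SparkR session:

```
library(SparkR)
sparkR.session()

# numPartitions was added to createDataFrame earlier in the 2.1 line;
# getNumPartitions reads the partition count back from the underlying RDD.
df <- createDataFrame(cars, numPartitions = 2)
getNumPartitions(df)  # 2 - the partition count requested at creation
```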
Contributor

One last thing - can we add a TODO and a pointer to a JIRA saying this needs to be fixed once getNumPartitions is added to scala API ?

Member Author

so rxin is saying on #16708 that we don't want this to be a public API on Dataset. I'm leaving this for now since this implementation seems reasonably low overhead.

perhaps @shivaram and @cloud-fan want to comment in #16708?

Contributor

@shivaram left a comment

LGTM. Thanks @felixcheung - I actually think this should be good to merge into master as well, and once the Scala change is made, we can get rid of this?

@felixcheung
Member Author

I'll merge to branch-2.1 and master once another pass of Jenkins is done

@SparkQA

SparkQA commented Jan 26, 2017

Test build #72052 has finished for PR 16668 at commit c05f786.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Jan 27, 2017
## What changes were proposed in this pull request?

With doc to say this would convert DF into RDD

## How was this patch tested?

unit tests, manual tests

Author: Felix Cheung <[email protected]>

Closes #16668 from felixcheung/rgetnumpartitions.

(cherry picked from commit 90817a6)
Signed-off-by: Felix Cheung <[email protected]>
@asfgit asfgit closed this in 90817a6 Jan 27, 2017
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017