[SPARK-18788][SPARKR] Add API for getNumPartitions #16668
Conversation
Test build #71761 has started for PR 16668 at commit.
Jenkins, retest this please.
Test build #71764 has finished for PR 16668 at commit.
```r
#' df <- createDataFrame(cars, numPartitions = 2)
#' getNumPartitions(df)
#' }
#' @note getNumPartitions since 2.1.1
```
@felixcheung, should this be `since 2.2.0`? Just curious.
I debated this quite a bit. Generally it should, but we merged `createDataFrame(..., numPartitions)` into 2.1, and it felt important to have `getNumPartitions` in the same release too.
Test build #71772 has finished for PR 16668 at commit.
```r
setMethod("getNumPartitions",
          signature(x = "SparkDataFrame"),
          function(x) {
            getNumPartitionsRDD(toRDD(x))
          })
```
As discussed in the JIRA, I worry that this will be a very expensive operation for large DataFrames. Specifically, instead of creating an RRDD, can we do some operations on the Scala side, which might be cheaper?
cc @yhuai @cloud-fan, who know more about DataFrame on the SQL side.
Right, agreed.
The conversion, especially into RRDD, is particularly concerning. From what I can see, though, `df.rdd.getNumPartitions` is the recommended practice, and it appears all over PySpark (granted, DataFrame-to-RDD conversion in PySpark is likely slightly more efficient).
An alternative is that we could wrap all of this on the JVM side; at least that should save us the round trip to RRDD.
But agreed: is there a more efficient way this could be exposed in DataFrame/Dataset directly instead?
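The PySpark pattern mentioned above simply delegates to the underlying RDD. As a rough illustration of that delegation shape (using hypothetical plain-Python stand-ins rather than a live SparkSession, which is not assumed here):

```python
# Hypothetical stand-ins for illustration only; these are not real Spark
# classes. They mimic the shape of the PySpark idiom
# df.rdd.getNumPartitions(), where the DataFrame exposes its underlying
# RDD and the RDD reports its own partition count.
class FakeRDD:
    def __init__(self, num_partitions):
        self._num_partitions = num_partitions

    def getNumPartitions(self):
        return self._num_partitions


class FakeDataFrame:
    def __init__(self, num_partitions):
        # In real Spark, .rdd is derived from the query plan (which is the
        # expensive part under discussion); here it is just a fixed field.
        self.rdd = FakeRDD(num_partitions)


df = FakeDataFrame(2)
print(df.rdd.getNumPartitions())  # → 2
```

The cost concern in the thread is exactly that, unlike this toy, materializing `.rdd` from a DataFrame (and especially converting through RRDD in SparkR) is not free.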
Shall we add `getNumPartitions` to DataFrame/Dataset on the Scala side?
That would be great!
Are you going to do it here, or do we need to send a new PR for the Scala-side changes?
Isn't it just calling `rdd.numPartitions`? We need to materialize the RDD inside the DataFrame anyway, but it's cheap on the Scala side.
Ah, that we could do easily. Is that something OK for Spark 2.1.1? If yes, I could go ahead with changes here for Scala, Python, and R.
You said this filled a hole for Spark 2.1; what's this hole? Is this SparkR only?
Sorry, I should clarify. Yes, for R only: since SparkR only has DataFrame APIs and no (publicly supported) RDD APIs, users are left without a way to check the number of partitions.
Maybe we can add this slow implementation to Spark 2.1 and improve it in Spark 2.2.
Test build #71779 has finished for PR 16668 at commit.
Test build #71786 has finished for PR 16668 at commit.
@shivaram how about we merge this to master & branch-2.1? Then I can base the Dataset/DataFrame API work in Scala off of this, as @cloud-fan suggests; it would be easier than porting the little fixes to get around the getNumPartitions conflicts in R. And having this in 2.1.x is not likely much worse than people calling the non-public methods...
@felixcheung Why don't we do something simpler where we call the Scala function from the R side, i.e. get a handle to the Scala DF, call
I like it! Done.
Test build #72041 has finished for PR 16668 at commit.
```r
setMethod("getNumPartitions",
          signature(x = "SparkDataFrame"),
          function(x) {
            callJMethod(callJMethod(x@sdf, "rdd"), "getNumPartitions")
          })
```
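The nested `callJMethod` calls above chain two JVM invocations: fetch the DataFrame's `rdd`, then ask it for its partition count. A minimal sketch of that chaining, with a toy `callJMethod` and mock objects standing in for the JVM bridge (all names here are illustrative, not SparkR internals):

```python
# Illustrative mock of the callJMethod chaining pattern. MockJavaRDD and
# MockJavaDataFrame stand in for JVM object handles; they are not real
# Spark classes.
class MockJavaRDD:
    def __init__(self, num_partitions):
        self._num_partitions = num_partitions

    def getNumPartitions(self):
        return self._num_partitions


class MockJavaDataFrame:
    def __init__(self, num_partitions):
        self._rdd = MockJavaRDD(num_partitions)

    def rdd(self):
        return self._rdd


def callJMethod(obj, method, *args):
    # SparkR's callJMethod invokes a named method on a JVM object over the
    # R-to-JVM bridge; here we simply dispatch on a local Python object.
    return getattr(obj, method)(*args)


sdf = MockJavaDataFrame(2)
# Mirrors: callJMethod(callJMethod(x@sdf, "rdd"), "getNumPartitions")
num = callJMethod(callJMethod(sdf, "rdd"), "getNumPartitions")
print(num)  # → 2
```

The point of this shape is that both calls execute on the JVM side, avoiding the RRDD conversion the earlier `toRDD(x)` implementation required.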
One last thing: can we add a TODO and a pointer to a JIRA saying this needs to be fixed once getNumPartitions is added to the Scala API?
So @rxin is saying on #16708 that we don't want this to be a public API on Dataset. I'm leaving this as is for now, since this implementation seems reasonably low overhead.
Perhaps @shivaram and @cloud-fan want to comment on #16708?
LGTM. Thanks @felixcheung. I actually think this should be good to merge into master as well, and once the Scala change is made, we can get rid of this?
I'll merge to branch-2.1 and master once another pass of Jenkins is done.
Test build #72052 has finished for PR 16668 at commit.
## What changes were proposed in this pull request?
With doc to say this would convert DF into RDD
## How was this patch tested?
Unit tests, manual tests

Author: Felix Cheung <[email protected]>
Closes #16668 from felixcheung/rgetnumpartitions.
(cherry picked from commit 90817a6)
Signed-off-by: Felix Cheung <[email protected]>
## What changes were proposed in this pull request?

With doc to say this would convert DF into RDD.

## How was this patch tested?

Unit tests, manual tests.