
[SPARK-18788][SPARKR] Add API for getNumPartitions #16668

Closed
wants to merge 6 commits

Conversation

felixcheung
Member

What changes were proposed in this pull request?

With a doc note saying this would convert the DF into an RDD

How was this patch tested?

unit tests, manual tests

@SparkQA

SparkQA commented Jan 21, 2017

Test build #71761 has started for PR 16668 at commit 34f9aa5.

@felixcheung
Member Author

Jenkins, retest this please

@SparkQA

SparkQA commented Jan 21, 2017

Test build #71764 has finished for PR 16668 at commit 34f9aa5.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

#' \dontrun{
#' df <- createDataFrame(cars, numPartitions = 2)
#' getNumPartitions(df)
#' }
#' @note getNumPartitions since 2.1.1
Member

@felixcheung, should this be since 2.2.0? Just curious.

Member Author

I debated about this quite a bit - generally it should be, but we merged createDataFrame(..., numPartitions) into 2.1 and it felt important to have getNumPartitions in the same release too.

@SparkQA

SparkQA commented Jan 21, 2017

Test build #71772 has finished for PR 16668 at commit 7c057fc.

  • This patch fails some tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

setMethod("getNumPartitions",
          signature(x = "SparkDataFrame"),
          function(x) {
            getNumPartitionsRDD(toRDD(x))
          })
Contributor

As discussed in the JIRA, I worry that this will be a very expensive operation for large data frames. Specifically, instead of creating an RRDD, can we do some operations on the Scala side which might be cheaper?

cc @yhuai @cloud-fan who know more about DataFrame on the SQL side

Member Author

Right, we agreed.
The conversion, especially into RRDD, is particularly concerning. From what I can see, though, df.rdd.getNumPartitions is the recommended practice, and it appears all over pyspark (granted, DataFrame-to-RDD conversion in pyspark is likely slightly more efficient).

An alternative is that we could wrap all of this on the JVM side - at least that should save us the round trip to RRDD.

But agreed - is there a more efficient way this could be exposed in DataFrame/Dataset directly instead?
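The JVM-side wrapping floated above can be sketched from R through SparkR's JVM bridge. A minimal sketch, assuming an active SparkR session (it uses the internal `callJMethod` helper, so it is illustrative rather than standalone-runnable):

```
library(SparkR)
sparkR.session()

df <- createDataFrame(cars, numPartitions = 2)

# Invoke Dataset.rdd on the JVM object, then RDD.getNumPartitions on the
# returned handle; no R-side RRDD conversion takes place.
numParts <- callJMethod(callJMethod(df@sdf, "rdd"), "getNumPartitions")
```

This is essentially the shape the implementation in this PR eventually settles on.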

Contributor

Shall we add getNumPartitions to DataFrame/Dataset on the Scala side?

Member Author

That would be great!

Contributor

are you going to do it here? Or do we need to send a new PR for the scala side changes?

Contributor

Isn't this just calling rdd.getNumPartitions? We need to materialize the RDD inside the DataFrame anyway, but it's cheap on the Scala side.

Member Author

Ah, that we could do easily. Is that OK for Spark 2.1.1? If yes, I could go ahead with changes here for Scala, Python and R.

Contributor

You said this filled a hole for Spark 2.1 - what's this hole? Is this SparkR only?

Member Author

Sorry, I should clarify. Yes, for R only - since SparkR only has DataFrame APIs and no (publicly supported) RDD APIs, users are left without a way to check the number of partitions.

Contributor

maybe we can add this slow implementation to Spark 2.1, and improve it in Spark 2.2

@SparkQA

SparkQA commented Jan 22, 2017

Test build #71779 has finished for PR 16668 at commit ad1bd14.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 22, 2017

Test build #71786 has finished for PR 16668 at commit bab7466.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member Author

@shivaram how about we merge this to master & branch-2.1? Then I can base the Scala Dataset/DataFrame API change off of this, as @cloud-fan suggests - it would be easier than porting the little fixes to get around the getNumPartitions conflicts in R. And having this in 2.1.x is not likely much worse than people calling the non-public methods...

@shivaram
Contributor

@felixcheung Why don't we do something simpler where we call the Scala function from the R side? i.e. get a handle to the Scala DF, call .rdd on it to get a handle to the Scala RDD, etc. That seems less expensive than running the conversion to RRDD, and it doesn't involve Scala-side changes.

@felixcheung
Member Author

I like it! Done.

@SparkQA

SparkQA commented Jan 26, 2017

Test build #72041 has finished for PR 16668 at commit 0353978.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

setMethod("getNumPartitions",
          signature(x = "SparkDataFrame"),
          function(x) {
            callJMethod(callJMethod(x@sdf, "rdd"), "getNumPartitions")
          })
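With the JVM-bridge implementation above, usage from R matches the roxygen example in the docs. A minimal sketch, assuming a running SparkR session:

```
library(SparkR)
sparkR.session()

# numPartitions was added to createDataFrame earlier in the 2.1 line;
# getNumPartitions reads the partition count back from the underlying RDD.
df <- createDataFrame(cars, numPartitions = 2)
getNumPartitions(df)  # 2 - the partition count requested at creation
```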
Contributor

One last thing - can we add a TODO and a pointer to a JIRA saying this needs to be fixed once getNumPartitions is added to scala API ?

Member Author

so rxin is saying on #16708 that we don't want this to be a public API on Dataset. I'm leaving this for now since this implementation seems reasonably low overhead.

perhaps @shivaram and @cloud-fan want to comment in #16708?

Contributor

@shivaram left a comment

LGTM. Thanks @felixcheung - I actually think this should be good to merge into master as well, and once the Scala change is made, we can get rid of this?

@felixcheung
Member Author

I'll merge to branch-2.1 and master once another pass of Jenkins is done

@SparkQA

SparkQA commented Jan 26, 2017

Test build #72052 has finished for PR 16668 at commit c05f786.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Jan 27, 2017
## What changes were proposed in this pull request?

With doc to say this would convert DF into RDD

## How was this patch tested?

unit tests, manual tests

Author: Felix Cheung <[email protected]>

Closes #16668 from felixcheung/rgetnumpartitions.

(cherry picked from commit 90817a6)
Signed-off-by: Felix Cheung <[email protected]>
@asfgit asfgit closed this in 90817a6 Jan 27, 2017
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017