[SPARK-3580] add 'partitions' property to PySpark RDD #2478
Conversation
'rdd.partitions' is available in Scala and Java, primarily used for its size() method to get the number of partitions. PySpark instead has a getNumPartitions() call and no access to 'partitions'. This change adds 'partitions' to PySpark's RDD, allowing len(rdd.partitions) to get the number of partitions in a way familiar to Python developers.
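As a rough illustration of the proposal, a minimal sketch follows. It assumes the Python RDD keeps its Java counterpart in self._jrdd (as PySpark does) and that the JVM side exposes partitions() as a java.util.List, which Py4J surfaces as a sequence supporting len(); this is illustrative, not the exact patch:

```python
# Illustrative sketch of the proposed property, not the merged patch.
class RDD(object):
    def __init__(self, jrdd):
        self._jrdd = jrdd  # the underlying Java RDD, via Py4J

    @property
    def partitions(self):
        """Partition objects of this RDD, as exposed by the JVM."""
        return self._jrdd.partitions()

    def getNumPartitions(self):
        # The existing PySpark call this PR aims to complement.
        return self._jrdd.partitions().size()
```

With such a property, len(rdd.partitions) mirrors the rdd.partitions.size idiom from Scala/Java.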
QA tests have started for PR 2478 at commit …
QA tests have finished for PR 2478 at commit …
RDD._jrdd is very heavy for PipelinedRDD, but getNumPartitions() could be optimized for PipelinedRDD to avoid the creation of _jrdd (it could be rdd.prev.getNumPartitions()). Also, the result of partitions() is a Java object; it should be an implementation detail, so it's better to keep it as an internal interface. len(rdd.partitions()) sounds more Pythonic, how about …
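A minimal sketch of that suggested optimization, assuming PipelinedRDD keeps its parent in self.prev and that the narrow transformations that get pipelined preserve the parent's partition count:

```python
# Sketch: delegate to the parent RDD so a PipelinedRDD never has to
# materialize its expensive _jrdd just to count partitions.
class PipelinedRDD(RDD):  # RDD as in the sketch above
    def __init__(self, prev):
        self.prev = prev  # the parent RDD this pipeline was built from

    def getNumPartitions(self):
        # Pipelined (narrow) transformations keep the same number of
        # partitions as their parent, so no JVM round-trip is needed.
        return self.prev.getNumPartitions()
```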
I think I also agree that we shouldn't expose Java …
very true
Also true. When testing this I noticed the list should essentially be treated as full of black boxes, but that's difficult to enforce without wrapping the Java objects in a Python version of Partition. @davies @JoshRosen, what's the purpose of exposing an array of partition objects in Scala & Java?
I agree.
In Scala / Java, I think we expose Partition objects for use in custom RDD implementations. There are a bunch of methods that take them as parameters (if you use IntelliJ, try running "Find Usages" on Partition and look at the "Method Parameter Declaration" usages).
@JoshRosen a partition itself doesn't have much in the way of a user API. It wouldn't be difficult to wrap the Java objects in a Python Partition; we should then start implementing the RDD functions that take partitions in Python.
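A hypothetical wrapper along those lines might look like the following; the Partition class and its index property are illustrative, not an existing PySpark API:

```python
# Hypothetical wrapper: treat the JVM partition objects as opaque
# handles behind a Python Partition class. Names are illustrative.
class Partition(object):
    def __init__(self, jpartition):
        self._jpartition = jpartition  # opaque JVM-side partition object

    @property
    def index(self):
        # Spark's Partition trait defines an index; Py4J exposes it as a call.
        return self._jpartition.index()
```

rdd.partitions could then return [Partition(p) for p in self._jrdd.partitions()], keeping the raw Java objects as an implementation detail, as suggested earlier in the thread.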
@JoshRosen @pwendell any further comment on this?
@mattf I'm not sure that it's worth exposing those …
Hi @mattf, if you don't have any additional comments, do you mind closing this pull request? Thanks!