[SPARK-3580] add 'partitions' property to PySpark RDD #2478
Conversation
'rdd.partitions' is available in Scala and Java, primarily used for its size() method to get the number of partitions. PySpark instead has a getNumPartitions() call and no access to 'partitions'. This change adds 'partitions' to PySpark's RDD, allowing len(rdd.partitions) to get the number of partitions in a way familiar to Python developers.
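As a rough illustration of the proposal, a minimal sketch follows. It assumes the Python RDD keeps its Java counterpart in self._jrdd (as PySpark does) and that the JVM side exposes partitions() as a java.util.List, which Py4J surfaces as a sequence supporting len(); this is illustrative, not the exact patch:

```python
# Illustrative sketch of the proposed property, not the merged patch.
class RDD(object):
    def __init__(self, jrdd):
        self._jrdd = jrdd  # the underlying Java RDD, via Py4J

    @property
    def partitions(self):
        """Partition objects of this RDD, as exposed by the JVM."""
        return self._jrdd.partitions()

    def getNumPartitions(self):
        # The existing PySpark call this PR aims to complement.
        return self._jrdd.partitions().size()
```

With such a property, len(rdd.partitions) mirrors the rdd.partitions.size idiom from Scala/Java.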
QA tests have started for PR 2478 at commit …
QA tests have finished for PR 2478 at commit …
RDD._jrdd is very heavy for PipelinedRDD, but getNumPartitions() could be optimized for PipelinedRDD to avoid the creation of _jrdd (it could be rdd.prev.getNumPartitions()). Also, the result of partitions() is a Java object; it should be an implementation detail, so it's better to keep it as an internal interface. len(rdd.partitions()) sounds more Pythonic, how about …
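A minimal sketch of that suggested optimization, assuming PipelinedRDD keeps its parent in self.prev and that the narrow transformations that get pipelined preserve the parent's partition count:

```python
# Sketch: delegate to the parent RDD so a PipelinedRDD never has to
# materialize its expensive _jrdd just to count partitions.
class PipelinedRDD(RDD):  # RDD as in the sketch above
    def __init__(self, prev):
        self.prev = prev  # the parent RDD this pipeline was built from

    def getNumPartitions(self):
        # Pipelined (narrow) transformations keep the same number of
        # partitions as their parent, so no JVM round-trip is needed.
        return self.prev.getNumPartitions()
```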
I think I also agree that we shouldn't expose Java …
very true
Also true. When testing this I noticed the list should essentially be treated as full of black boxes, but that's difficult to enforce without wrapping the Java objects in a Python version of Partition. @davies @JoshRosen, what's the purpose of exposing an array of partition objects in Scala & Java?
I agree.
In Scala / Java, I think we expose Partition objects for use in custom RDD implementations. There are a bunch of methods that take them as parameters (if you use IntelliJ, try running "Find Usages" on Partition and look at the "Method Parameter Declaration" usages).
@JoshRosen a partition itself doesn't have much in the way of a user API. It wouldn't be difficult to wrap the Java objects in a Python Partition; we should then start implementing the RDD functions that take partitions in Python.
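A hypothetical wrapper along those lines might look like the following; the Partition class and its index property are illustrative, not an existing PySpark API:

```python
# Hypothetical wrapper: treat the JVM partition objects as opaque
# handles behind a Python Partition class. Names are illustrative.
class Partition(object):
    def __init__(self, jpartition):
        self._jpartition = jpartition  # opaque JVM-side partition object

    @property
    def index(self):
        # Spark's Partition trait defines an index; Py4J exposes it as a call.
        return self._jpartition.index()
```

rdd.partitions could then return [Partition(p) for p in self._jrdd.partitions()], keeping the raw Java objects as an implementation detail, as suggested earlier in the thread.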
@JoshRosen @pwendell any further comment on this?
@mattf I'm not sure that it's worth exposing those …
Hi @mattf, if you don't have any additional comments, do you mind closing this pull request? Thanks!