Quick explanation? #1
Comments
Right now this is more like a raw brain dump to identify issues when trying to use PySpark to scale typical PyData operations on large collections. The code uses PySpark via RDD instances. Have a look at the tests, which create toy RDD instances.
In essence this is using PySpark as a backend for distribution, as opposed to, say, IPython parallel. The advantages over the alternatives may include the powerful programming model, fault tolerance, HDFS compatibility, and Spark's broadcast and accumulator variables (and hopefully PySpark Streaming at some point). Disadvantages may include performance overhead from Java/Python interoperability, but see the code for blocking RDDs of numpy arrays as an example of something that should improve performance substantially. In my view the core focus should probably be distributed versions of:
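To make the blocking idea concrete, here is a minimal sketch, assuming toy random data and helper names of my own choosing (this is not the project's actual blocked_rdd_math code): it stacks each partition's small numpy vectors into one matrix per partition, so that the Python/JVM serialization cost is paid per block rather than per record.

```python
import numpy as np
from pyspark import SparkContext

sc = SparkContext("local[2]", "blocking-sketch")

# An RDD with one small numpy vector per record; serializing each record
# separately between the JVM and the Python workers is what gets expensive.
rows = sc.parallelize([np.random.rand(10) for _ in range(1000)], 4)

# "Block" the RDD: stack every partition's rows into a single 2D array, so
# subsequent numpy/sklearn work happens on a few large blocks instead of
# thousands of tiny objects.
blocked = rows.mapPartitions(lambda it: [np.vstack(list(it))])

# Example blocked operation: a distributed column-wise sum.
col_sums = blocked.map(lambda block: block.sum(axis=0)).reduce(np.add)
```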
Great thoughts. Re: Thomas's specific question: the dependence on PySpark might seem opaque, but the key idea is that most of the operations (e.g. all the functions in blocked_rdd_math) are performed on PySpark RDDs (Spark's primary abstraction), created from a SparkContext when data are loaded (see the unit tests for example creation of RDDs). So the maps and reduces are transformations and actions, respectively, performed on an RDD. Re: the roadmap, it is also worth adding that this project overlaps a bit with the AMPLab's MLlib, but I see it as both complementary and potentially much broader and faster to develop for, because a native PySpark implementation means we can draw on many existing libraries in sklearn/numpy/scipy within RDD operations (as in the linear_model example).
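As a hedged illustration of that pattern, here is a sketch in which the RDD comes from a SparkContext, map() is a lazy transformation, reduce() is the action that triggers the job, and plain sklearn code runs inside the RDD operation. The per-block SGDClassifier fit and coefficient averaging are my own simplification, not necessarily what the project's linear_model example does.

```python
import numpy as np
from pyspark import SparkContext
from sklearn.linear_model import SGDClassifier

sc = SparkContext("local[2]", "rdd-ops-sketch")

# Each record is a (X_block, y_block) pair, e.g. the output of a blocking
# step like the one sketched earlier in this thread.
blocks = sc.parallelize(
    [(np.random.rand(200, 5), np.random.randint(0, 2, 200))
     for _ in range(8)], 4)

def fit_block(block):
    # Plain sklearn runs inside the RDD operation on each executor.
    X, y = block
    clf = SGDClassifier().fit(X, y)
    return np.hstack([clf.coef_.ravel(), clf.intercept_])

# map() is a lazy transformation; count() and reduce() are actions that
# actually trigger the distributed computation.
n_blocks = blocks.count()
averaged_params = blocks.map(fit_block).reduce(np.add) / n_blocks
```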
I also think that, where appropriate, it would be useful to supply sklearn-style frontends to existing MLlib algorithms. Not all algorithms transplant as easily from sklearn as SGD, and until we have distributed Python implementations, providing access to these algorithms on Spark through a familiar interface could benefit PyData users.
@sryza yes, absolutely; cf. the k-means example. It will be nice to do a similar wrapper for the recommendation code.
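For reference, here is a rough sketch of what such an sklearn-style frontend around MLlib's k-means might look like; the class name and constructor arguments are hypothetical, not the project's actual wrapper.

```python
import numpy as np
from pyspark.mllib.clustering import KMeans as MLlibKMeans

class SparkKMeans(object):
    """Hypothetical sklearn-style estimator that delegates training to MLlib."""

    def __init__(self, n_clusters=8, max_iter=100):
        self.n_clusters = n_clusters
        self.max_iter = max_iter

    def fit(self, rdd):
        # Unlike sklearn's KMeans.fit(X), the input is an RDD of vectors.
        model = MLlibKMeans.train(rdd, self.n_clusters,
                                  maxIterations=self.max_iter)
        self.cluster_centers_ = np.array(model.clusterCenters)
        return self

    def predict(self, rdd):
        # Assign each point to its nearest learned center, sklearn-style,
        # except that the result is again an RDD of labels.
        centers = self.cluster_centers_
        return rdd.map(
            lambda p: int(np.argmin(((centers - p) ** 2).sum(axis=1))))
```

Usage would then mirror sklearn, e.g. SparkKMeans(n_clusters=5).fit(rdd_of_vectors), except that both the training data and the predictions are RDDs rather than in-memory arrays.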
This looks super interesting. I know the basics of PySpark, but apparently that's not enough to understand what the trick is here, as I can't find any reference to PySpark in the code. I assume writing functional Python code using map + reduce etc. will allow mapping to Spark?
Any hints on the central idea here would be greatly appreciated!