
Quick explanation? #1

Open
twiecki opened this issue Feb 18, 2014 · 5 comments

Comments


twiecki commented Feb 18, 2014

This looks super interesting. I know the basics of pyspark but apparently it's not enough to understand what the trick is here as I can't find any reference to pyspark in the code. I assume writing functional python-code using map+reduce etc will allow mapping to spark?

Any hints on the central idea here would be greatly appreciated!

ogrisel (Owner) commented Feb 18, 2014

Right now this is more like a raw brain dump to identify issues when trying to use PySpark to scale typical PyData operations on large collections.

The code uses PySpark via RDD instances. Have a look at the tests that create toy RDD instances using SparkContext.parallelize.
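For readers who don't know the PySpark API: `SparkContext.parallelize` turns a local Python collection into a distributed RDD split across partitions. A pure-Python sketch of just the partitioning step (illustrative only — not how PySpark is actually implemented):

```python
# Illustrative only: how a local collection gets split into contiguous
# partitions, roughly the first thing SparkContext.parallelize does
# before shipping the chunks off to worker processes.
def partition(data, num_partitions):
    """Split a list into num_partitions contiguous chunks."""
    n = len(data)
    return [data[i * n // num_partitions:(i + 1) * n // num_partitions]
            for i in range(num_partitions)]

chunks = partition(list(range(10)), 3)
print(chunks)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]
```

In real PySpark the equivalent would be `sc.parallelize(range(10), 3)`, after which each chunk lives on a worker rather than in the driver's memory.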

MLnick (Collaborator) commented Feb 18, 2014

In essence this is using PySpark as a backend for distribution, as opposed to say IPython parallel.

The advantages vs the alternatives may include the powerful programming model, fault tolerance, HDFS compatibility, and Spark's broadcast and accumulator variables (and hopefully PySpark Streaming at some point).

(Disadvantages may include performance due to Java/Python interoperability, but see the code for blocking of RDDs of numpy arrays as an example of something that should improve performance substantially.)
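A hedged sketch of the blocking idea: instead of an RDD holding one small numpy row per record, group many rows into a few large arrays, so the per-record serialization cost between the JVM and Python is amortized and the math inside each block is vectorized. `block` here is a hypothetical helper for illustration, not this repo's API:

```python
import numpy as np

def block(records, block_size):
    """Stack small per-record arrays into larger 2-D blocks.

    Fewer, bigger numpy arrays mean fewer Python<->JVM round trips
    and vectorized numpy operations within each block.
    """
    blocks = []
    for i in range(0, len(records), block_size):
        blocks.append(np.vstack(records[i:i + block_size]))
    return blocks

rows = [np.array([float(i), float(i) + 1.0]) for i in range(6)]
blocked = block(rows, block_size=3)
print([b.shape for b in blocked])  # [(3, 2), (3, 2)]
```

On Spark the same grouping would be done per partition (e.g. via `mapPartitions`), so no data moves between machines just to form the blocks.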

In my view the core focuses should probably be distributed versions of:

  • linear models (part is there already)
  • random forests perhaps
  • clustering
  • cross validation / multiple model training in parallel
  • feature extraction
  • linear algebra (I see SVD has appeared already)
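To make one roadmap item concrete: cross validation / multiple model training in parallel maps naturally onto a map+reduce over a parameter grid. A minimal local sketch (the `score` function is a hypothetical stand-in for fit-plus-validation-error, not anything in this repo):

```python
# Local analogue of parallel model selection: map a scoring function over a
# parameter grid, then reduce to the best setting. On Spark the grid would be
# an RDD and the map would run across the cluster, one model per element.
def score(alpha):
    # Hypothetical scoring function; stands in for training a model with
    # this hyperparameter and returning its validation score.
    return -(alpha - 0.3) ** 2

grid = [0.1, 0.2, 0.3, 0.4]
best = max(grid, key=score)
print(best)  # 0.3
```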



freeman-lab (Collaborator) commented

Great thoughts.

Re: Thomas's specific question, the dependence on PySpark might seem opaque, but the key idea is that most of the operations (e.g. all the functions in blocked_rdd_math) are being performed on PySpark RDDs (Spark's primary abstraction), created from a SparkContext when data are loaded (see the unit tests for example creation of RDDs). So the maps and reduces are transformations and actions, respectively, performed on an RDD.
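A local, Spark-free analogue of that distinction (nothing beyond the standard library): transformations such as `map` only describe work lazily, while an action such as `reduce` forces evaluation and returns a value to the driver:

```python
from functools import reduce

data = range(1, 6)

# Transformation: in Spark, rdd.map(lambda x: x * x) only records lazy
# lineage; here a generator expression plays the same deferred role.
squared = (x * x for x in data)

# Action: in Spark, rdd.reduce(op) is what actually triggers computation
# and ships a result back to the driver; reduce() evaluates here too.
total = reduce(lambda a, b: a + b, squared)
print(total)  # 55
```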

Re: the roadmap, it's also worth adding that this project overlaps a bit with the AMPLab's MLlib, but I see it as both complementary and potentially much broader and faster to develop for, because a native PySpark implementation means we can draw on many existing libraries in sklearn/numpy/scipy within RDD operations (as in the linear_model example).

sryza (Collaborator) commented Feb 19, 2014

I also think that, where appropriate, it would be useful to supply sklearn-style frontends to existing MLlib algorithms. Not all algorithms are as easy to transplant from sklearn as SGD, and until we have distributed Python implementations, providing access to these algorithms on Spark through a familiar interface could benefit PyData users.
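Such a frontend might look roughly like this (a hypothetical sketch, not MLlib's or this repo's actual API — `backend_fit` stands in for whatever distributed routine gets wrapped; here it's just ordinary least squares so the example runs locally):

```python
import numpy as np

def backend_fit(X, y):
    # Hypothetical stand-in for a distributed MLlib training routine;
    # locally we just solve least squares.
    return np.linalg.lstsq(X, y, rcond=None)[0]

class DistributedLinearRegression:
    """sklearn-style frontend: familiar fit/predict, distributed backend."""

    def fit(self, X, y):
        self.coef_ = backend_fit(np.asarray(X), np.asarray(y))
        return self

    def predict(self, X):
        return np.asarray(X) @ self.coef_

model = DistributedLinearRegression().fit([[1.0], [2.0], [3.0]], [2.0, 4.0, 6.0])
print(model.predict([[4.0]]))  # ≈ [8.]
```

The point is the interface, not the solver: PyData users keep calling `fit`/`predict` while the heavy lifting happens wherever the backend runs.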

MLnick (Collaborator) commented Feb 19, 2014

@sryza yes absolutely, cf. the k-means example. It would be nice to do a similar wrapper for the recommendation code.
