Quick explanation? #1
Comments
Right now this is more like a raw brain dump to identify issues when trying to use PySpark to scale typical PyData operations on large collections. The code uses PySpark via RDD instances. Have a look at the tests, which create toy RDD instances.
In essence this is using PySpark as a backend for distribution, as opposed to, say, IPython parallel. The advantages over the alternatives may include the powerful programming model, fault tolerance, HDFS compatibility, and Spark's broadcast and accumulator variables (and hopefully PySpark Streaming at some point). Disadvantages may include performance overhead from Java/Python interoperability, but see the code for blocking RDDs of numpy arrays as an example of something that should improve performance substantially. In my view the core focus should probably be distributed versions of:
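To make the blocking idea concrete, here is a minimal sketch, assuming toy random data and helper names of my own choosing (this is not the project's actual blocked_rdd_math code): it stacks each partition's small numpy vectors into one matrix per partition, so that the Python/JVM serialization cost is paid per block rather than per record.

```python
import numpy as np
from pyspark import SparkContext

sc = SparkContext("local[2]", "blocking-sketch")

# An RDD with one small numpy vector per record; serializing each record
# separately between the JVM and the Python workers is what gets expensive.
rows = sc.parallelize([np.random.rand(10) for _ in range(1000)], 4)

# "Block" the RDD: stack every partition's rows into a single 2D array, so
# subsequent numpy/sklearn work happens on a few large blocks instead of
# thousands of tiny objects.
blocked = rows.mapPartitions(lambda it: [np.vstack(list(it))])

# Example blocked operation: a distributed column-wise sum.
col_sums = blocked.map(lambda block: block.sum(axis=0)).reduce(np.add)
```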
Great thoughts. Re: Thomas's specific question: the dependence on PySpark might seem opaque, but the key idea is that most of the operations (e.g. all the functions in blocked_rdd_math) are performed on PySpark RDDs (Spark's primary abstraction), created from a SparkContext when data are loaded (see the unit tests for example creation of RDDs). So the maps and reduces are transformations and actions, respectively, performed on an RDD. Re: the roadmap, it is also worth adding that this project overlaps a bit with the AMPLab's MLlib, but I see it as both complementary and potentially much broader and faster to develop for, because a native PySpark implementation means we can draw on many existing libraries in sklearn/numpy/scipy within RDD operations (as in the linear_model example).
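As a hedged illustration of that pattern, here is a sketch in which the RDD comes from a SparkContext, map() is a lazy transformation, reduce() is the action that triggers the job, and plain sklearn code runs inside the RDD operation. The per-block SGDClassifier fit and coefficient averaging are my own simplification, not necessarily what the project's linear_model example does.

```python
import numpy as np
from pyspark import SparkContext
from sklearn.linear_model import SGDClassifier

sc = SparkContext("local[2]", "rdd-ops-sketch")

# Each record is a (X_block, y_block) pair, e.g. the output of a blocking
# step like the one sketched earlier in this thread.
blocks = sc.parallelize(
    [(np.random.rand(200, 5), np.random.randint(0, 2, 200))
     for _ in range(8)], 4)

def fit_block(block):
    # Plain sklearn runs inside the RDD operation on each executor.
    X, y = block
    clf = SGDClassifier().fit(X, y)
    return np.hstack([clf.coef_.ravel(), clf.intercept_])

# map() is a lazy transformation; count() and reduce() are actions that
# actually trigger the distributed computation.
n_blocks = blocks.count()
averaged_params = blocks.map(fit_block).reduce(np.add) / n_blocks
```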
I also think that, where appropriate, it would be useful to supply sklearn-style frontends to existing MLlib algorithms. Not all algorithms transplant as easily from sklearn as SGD, and until we have distributed Python implementations, providing access to these algorithms on Spark through a familiar interface could benefit PyData users.
@sryza yes, absolutely; cf. the k-means example. It will be nice to do a similar wrapper for the recommendation code.
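For reference, here is a rough sketch of what such an sklearn-style frontend around MLlib's k-means might look like; the class name and constructor arguments are hypothetical, not the project's actual wrapper.

```python
import numpy as np
from pyspark.mllib.clustering import KMeans as MLlibKMeans

class SparkKMeans(object):
    """Hypothetical sklearn-style estimator that delegates training to MLlib."""

    def __init__(self, n_clusters=8, max_iter=100):
        self.n_clusters = n_clusters
        self.max_iter = max_iter

    def fit(self, rdd):
        # Unlike sklearn's KMeans.fit(X), the input is an RDD of vectors.
        model = MLlibKMeans.train(rdd, self.n_clusters,
                                  maxIterations=self.max_iter)
        self.cluster_centers_ = np.array(model.clusterCenters)
        return self

    def predict(self, rdd):
        # Assign each point to its nearest learned center, sklearn-style,
        # except that the result is again an RDD of labels.
        centers = self.cluster_centers_
        return rdd.map(
            lambda p: int(np.argmin(((centers - p) ** 2).sum(axis=1))))
```

Usage would then mirror sklearn, e.g. SparkKMeans(n_clusters=5).fit(rdd_of_vectors), except that both the training data and the predictions are RDDs rather than in-memory arrays.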
This looks super interesting. I know the basics of PySpark, but apparently that's not enough to understand what the trick is here, as I can't find any reference to PySpark in the code. I assume writing functional Python code using map + reduce etc. will allow mapping to Spark?
Any hints on the central idea here would be greatly appreciated!