To run Spark within Jupyter, we recommend the Toree kernel. We assume you already have the following installed:
- Python 2.x
- pip
- Docker (required to install Toree)
virtualenv venv
source ./venv/bin/activate
pip install jupyter
Clone the master branch of Toree's GitHub repo into your working directory.
For the next step, make sure that Docker is running.
cd incubator-toree
make release
cd dist/toree-pip
pip install .
SPARK_HOME=<path to spark> jupyter toree install
The most error-proof way to add MLeap to your project is to modify the kernel configuration directly (or create a new kernel for Toree and Spark 2.0).
The kernel config file is typically located at /usr/local/share/jupyter/kernels/apache_toree_scala/kernel.json
Add or modify __TOREE_SPARK_OPTS__ like so:
"__TOREE_SPARK_OPTS__": "--packages com.databricks:spark-avro_2.11:3.0.1,ml.combust.mleap:mleap-spark_2.12:0.21.0",
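If you would rather script this change than edit the file by hand, the following is a minimal Python sketch. The helper name `add_spark_packages` is hypothetical, and the sketch assumes that `__TOREE_SPARK_OPTS__` lives under the `"env"` key of kernel.json and that the existing options contain no quoted values with spaces. Check both against your own kernel.json before relying on it.

```python
import json


def add_spark_packages(kernel_json_text, packages):
    """Return kernel.json text with `packages` merged into __TOREE_SPARK_OPTS__.

    Hypothetical helper for illustration; it is not part of Toree or MLeap.
    """
    spec = json.loads(kernel_json_text)
    # Assumption: Toree reads its Spark options from the "env" section.
    env = spec.setdefault("env", {})
    # Assumption: options are whitespace-separated, with no quoted spaces.
    opts = env.get("__TOREE_SPARK_OPTS__", "").split()

    # Pull out any coordinates already listed after an existing --packages flag.
    existing = []
    if "--packages" in opts:
        i = opts.index("--packages")
        if i + 1 < len(opts):
            existing = [p for p in opts[i + 1].split(",") if p]
            del opts[i:i + 2]
        else:
            del opts[i]

    # Keep the existing coordinates first, then append the new ones once.
    merged = existing + [p for p in packages if p not in existing]
    if merged:
        opts += ["--packages", ",".join(merged)]
    env["__TOREE_SPARK_OPTS__"] = " ".join(opts)
    return json.dumps(spec, indent=2, sort_keys=True)
```

To use it, read the contents of kernel.json, pass them in together with a list such as `["ml.combust.mleap:mleap-spark_2.12:0.21.0"]`, and write the returned text back to the file. Keep a backup of the original, since the output is re-serialized with sorted keys.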
An alternative is to use the AddDeps magic, but we have run into dependency collisions with it, so use it at your own risk:
%AddDeps ml.combust.mleap mleap-spark_2.12 0.21.0 --transitive
First, go through the steps above for launching a notebook with MLeap for Spark, then add the following to PYTHONPATH:
"PYTHONPATH": "/usr/local/spark-2.0.0-bin-hadoop2.7/python:/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip:/<git directory>/combust/combust-mleap/python",
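The PYTHONPATH value above is three entries joined by colons: Spark's python directory, the Py4J zip bundled with Spark, and the MLeap python directory. As a rough sketch, it can be assembled in Python like this. The function name `build_pythonpath` and its parameters are hypothetical, POSIX-style paths are assumed, and the Py4J file name differs between Spark releases, so look it up under your own Spark install.

```python
import posixpath


def build_pythonpath(spark_home, py4j_zip, mleap_python_dir):
    """Join the three entries shown above into one PYTHONPATH value.

    Hypothetical helper for illustration; assumes POSIX-style paths.
    """
    spark_python = posixpath.join(spark_home, "python")
    entries = [
        spark_python,                                   # PySpark sources
        posixpath.join(spark_python, "lib", py4j_zip),  # Py4J bundled with Spark
        mleap_python_dir,                               # MLeap python directory
    ]
    return ":".join(entries)
```

For example, passing `/usr/local/spark-2.0.0-bin-hadoop2.7` and `py4j-0.10.1-src.zip`, together with the path to your MLeap python directory, produces a value with the same layout as the entry above.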
There is no need to modify kernel.json directly; just instantiate the libraries as described here.