We illustrate how to use TensorFlowOnSpark on a Spark Standalone cluster running on a single machine. Note that TensorFlowOnSpark does not work in Spark Local (single-process) mode, since it expects the executors to be running in separate processes.
git clone https://github.com/yahoo/TensorFlowOnSpark.git
cd TensorFlowOnSpark
export TFoS_HOME=$(pwd)
Download Apache Spark per instructions. Note: we use version 1.6.0 here, but you can install a later version if you like.
rm spark-1.6.0-bin-hadoop2.6.tar
export SPARK_HOME=$(pwd)/spark-1.6.0-bin-hadoop2.6
export PATH=${SPARK_HOME}/bin:${PATH}
Please build and install TensorFlow per instructions.
For example, using the pip install method, you should be able to install TensorFlow and TensorFlowOnSpark as follows:
sudo pip install tensorflow
sudo pip install tensorflowonspark
To view the installed packages:
pip list
mkdir ${TFoS_HOME}/mnist
pushd ${TFoS_HOME}/mnist
curl -O "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz"
curl -O "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz"
curl -O "http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz"
curl -O "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz"
popd
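Because `curl -O` without `--fail` saves whatever the server returns, including an error page, it can be worth confirming that the four downloads are valid gzip archives before converting them. The following is a minimal sketch; `check_mnist_archives` is a hypothetical helper name, not part of TensorFlowOnSpark.

```shell
# Hypothetical helper: verifies the four MNIST archives in the given directory.
# Returns 0 only if every file exists, is non-empty, and passes `gzip -t`.
check_mnist_archives() {
  dir="$1"
  status=0
  for name in train-images-idx3-ubyte train-labels-idx1-ubyte \
              t10k-images-idx3-ubyte t10k-labels-idx1-ubyte; do
    file="${dir}/${name}.gz"
    if [ ! -s "$file" ]; then
      echo "missing or empty: $file"
      status=1
    elif ! gzip -t "$file" 2>/dev/null; then
      echo "corrupt archive: $file"
      status=1
    fi
  done
  return "$status"
}

# Example call (assumes TFoS_HOME is exported as shown earlier):
#   check_mnist_archives "${TFoS_HOME}/mnist" && echo "all archives OK"
```

If any file is reported as missing or corrupt, re-download it before running the data setup step.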
Launch standalone Spark cluster
export MASTER=spark://$(hostname):7077
${SPARK_HOME}/sbin/start-master.sh; ${SPARK_HOME}/sbin/start-slave.sh -c $CORES_PER_WORKER -m 3G ${MASTER}
Open the Spark Master web UI (by default on port 8080 of the master host) and confirm that the expected number of workers has launched.
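The commands in this guide reference CORES_PER_WORKER, SPARK_WORKER_INSTANCES, and TOTAL_CORES, which need to be exported before the start-slave.sh call and the later spark-submit commands. A minimal sketch follows, assuming two workers with one core each; these values are illustrative assumptions, so adjust them to your machine.

```shell
# Assumed values for a small single-machine cluster; adjust to your hardware.
export SPARK_WORKER_INSTANCES=2   # number of worker processes (assumption)
export CORES_PER_WORKER=1         # cores given to each worker (assumption)

# Total cores the application may claim across the cluster.
export TOTAL_CORES=$((CORES_PER_WORKER * SPARK_WORKER_INSTANCES))

echo "workers=${SPARK_WORKER_INSTANCES} cores/worker=${CORES_PER_WORKER} total=${TOTAL_CORES}"
```

With spark.task.cpus set to CORES_PER_WORKER, each worker runs one task at a time, and spark.cores.max set to TOTAL_CORES lets the application use every worker.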
Start a pyspark shell and import tensorflow and tensorflowonspark. If everything is set up correctly, you should not see any errors.
>>> import tensorflow as tf
>>> from tensorflowonspark import TFCluster
>>> exit()
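The same check can be scripted without opening an interactive shell. The sketch below only tests whether each module can be located by the interpreter; it does not import it, so it will not catch a broken installation. `check_py_module` is a hypothetical helper, and falling back to `python3` when PYSPARK_PYTHON is unset is an assumption about your environment.

```shell
# Hypothetical helper: succeeds if the named Python module can be located.
# Uses the interpreter named by PYSPARK_PYTHON, or python3 if that is unset (assumption).
check_py_module() {
  "${PYSPARK_PYTHON:-python3}" -c '
import importlib.util, sys
sys.exit(0 if importlib.util.find_spec(sys.argv[1]) else 1)
' "$1"
}

# Report on the two packages this guide depends on.
for module in tensorflow tensorflowonspark; do
  if check_py_module "$module"; then
    echo "found: $module"
  else
    echo "MISSING: $module"
  fi
done
```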
cd ${TFoS_HOME}
# rm -rf examples/mnist/csv
${SPARK_HOME}/bin/spark-submit \
--master ${MASTER} \
${TFoS_HOME}/examples/mnist/mnist_data_setup.py \
--output examples/mnist/csv \
--format csv
ls -lR examples/mnist/csv
# rm -rf mnist_model
${SPARK_HOME}/bin/spark-submit \
--master ${MASTER} \
--py-files ${TFoS_HOME}/examples/mnist/spark/mnist_dist.py \
--conf spark.cores.max=${TOTAL_CORES} \
--conf spark.task.cpus=${CORES_PER_WORKER} \
--conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" \
${TFoS_HOME}/examples/mnist/spark/mnist_spark.py \
--cluster_size ${SPARK_WORKER_INSTANCES} \
--images examples/mnist/csv/train/images \
--labels examples/mnist/csv/train/labels \
--format csv \
--mode train \
--model mnist_model
ls -l mnist_model
# rm -rf predictions
${SPARK_HOME}/bin/spark-submit \
--master ${MASTER} \
--py-files ${TFoS_HOME}/examples/mnist/spark/mnist_dist.py \
--conf spark.cores.max=${TOTAL_CORES} \
--conf spark.task.cpus=${CORES_PER_WORKER} \
--conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" \
${TFoS_HOME}/examples/mnist/spark/mnist_spark.py \
--cluster_size ${SPARK_WORKER_INSTANCES} \
--images examples/mnist/csv/test/images \
--labels examples/mnist/csv/test/labels \
--mode inference \
--format csv \
--model mnist_model \
--output predictions
less predictions/part-00000
The prediction result should look like:
2017-02-10T23:29:17.009563 Label: 7, Prediction: 7
2017-02-10T23:29:17.009677 Label: 2, Prediction: 2
2017-02-10T23:29:17.009721 Label: 1, Prediction: 1
2017-02-10T23:29:17.009761 Label: 0, Prediction: 0
2017-02-10T23:29:17.009799 Label: 4, Prediction: 4
2017-02-10T23:29:17.009838 Label: 1, Prediction: 1
2017-02-10T23:29:17.009876 Label: 4, Prediction: 4
2017-02-10T23:29:17.009914 Label: 9, Prediction: 9
2017-02-10T23:29:17.009951 Label: 5, Prediction: 6
2017-02-10T23:29:17.009989 Label: 9, Prediction: 9
2017-02-10T23:29:17.010026 Label: 0, Prediction: 0
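To summarize the output rather than reading it line by line, the labels and predictions can be tallied with awk. This is a minimal sketch that assumes every line follows the `<timestamp> Label: N, Prediction: M` layout shown above; `prediction_accuracy` is a hypothetical helper name.

```shell
# Hypothetical helper: prints "<correct> <total> <accuracy>" for the given prediction files.
prediction_accuracy() {
  awk '
    /Label:/ && /Prediction:/ {
      # Fields: $1 timestamp, $2 "Label:", $3 "7,", $4 "Prediction:", $5 "7"
      label = $3
      sub(/,/, "", label)          # drop the trailing comma after the label
      pred = $5
      total++
      if (label == pred) correct++
    }
    END {
      if (total > 0) printf "%d %d %.4f\n", correct, total, correct / total
      else print "0 0 0.0000"
    }
  ' "$@"
}

# Example call over all output partitions:
#   prediction_accuracy predictions/part-*
```

Applied to the eleven sample lines above, which contain a single mismatch (label 5, prediction 6), this reports 10 correct out of 11.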
Install additional software required by Jupyter Notebooks.
sudo pip install jupyter jupyter[notebook]
Launch the Jupyter notebook server on the Master node.
pushd ${TFoS_HOME}/examples/mnist
PYSPARK_DRIVER_PYTHON="jupyter" \
PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --ip=`hostname`" \
pyspark --master ${MASTER} \
--conf spark.cores.max=${TOTAL_CORES} \
--conf spark.task.cpus=${CORES_PER_WORKER} \
--py-files ${TFoS_HOME}/examples/mnist/spark/mnist_dist.py \
--conf spark.executorEnv.JAVA_HOME="$JAVA_HOME"
Open a browser and go to the notebook URL printed by the command above.
${SPARK_HOME}/sbin/stop-slave.sh; ${SPARK_HOME}/sbin/stop-master.sh
For multi-host Spark Standalone clusters, you will still need a distributed filesystem that spans the hosts, which in many cases is HDFS. If your setup uses HDFS, TensorFlow requires the directory containing libhdfs.so to be on your LD_LIBRARY_PATH in order to read and write files on HDFS. This can be done on the Spark executors by adding the following config to your spark-submit commands:
--conf spark.executorEnv.LD_LIBRARY_PATH=$LIB_HDFS
where LIB_HDFS is set to the directory that contains libhdfs.so on your setup.
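If you are unsure where libhdfs.so lives, it can be located with find. The sketch below is one way to do that; `find_libhdfs_dir` is a hypothetical helper, and searching under HADOOP_HOME is an assumption, since the library location varies between Hadoop distributions.

```shell
# Hypothetical helper: prints the directory holding the first libhdfs.so found under a root.
# Returns non-zero if no match is found.
find_libhdfs_dir() {
  match=$(find "$1" -name 'libhdfs.so' 2>/dev/null | head -n 1)
  [ -n "$match" ] && dirname "$match"
}

# Example call (HADOOP_HOME is an assumed search root; use your distribution's install path):
#   export LIB_HDFS=$(find_libhdfs_dir "${HADOOP_HOME}")
#   echo "LIB_HDFS=${LIB_HDFS}"
```

The resulting directory is the value to pass in spark.executorEnv.LD_LIBRARY_PATH.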