This recipe shows how to run the Horovod distributed training framework for TensorFlow using Batch AI.
- The standard Horovod tensorflow_mnist.py example is used.
- tensorflow_mnist.py downloads the training data on its own during execution.
- The job runs in the standard TensorFlow container tensorflow/tensorflow:1.8.0-gpu. You can run the same job directly on GPU nodes by choosing Ubuntu DSVM as the image and removing the container settings from the job definition.
- The Horovod framework is installed in the container using the job preparation command line. Note that you can build your own Docker image containing TensorFlow and Horovod instead.
- The standard output of the job is stored on an Azure File Share.
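
For readers new to Horovod, the following minimal sketch illustrates the per-worker decisions a data-parallel training script typically makes: scaling the learning rate by the number of workers, splitting the step budget, pinning each process to one GPU, and writing checkpoints from rank 0 only. It is not taken from tensorflow_mnist.py. The `WorkerSettings` class, the `worker_settings` function, and the default values are hypothetical names and numbers chosen for illustration. The sketch is plain Python so that it runs without Horovod installed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkerSettings:
    learning_rate: float           # base rate multiplied by the number of workers
    max_steps: int                 # this worker's share of the total step budget
    visible_gpu: str               # index of the GPU this process should use
    checkpoint_dir: Optional[str]  # set on rank 0 only, None on other workers


def worker_settings(rank, local_rank, size, base_lr=0.001,
                    total_steps=20000, checkpoint_dir="./checkpoints"):
    """Derive per-worker settings (hypothetical helper, illustrative defaults)."""
    if size < 1 or not 0 <= rank < size:
        raise ValueError("rank must satisfy 0 <= rank < size")
    return WorkerSettings(
        learning_rate=base_lr * size,
        max_steps=total_steps // size,
        visible_gpu=str(local_rank),
        # Only one worker saves checkpoints so that workers do not overwrite each other.
        checkpoint_dir=checkpoint_dir if rank == 0 else None,
    )


# In a Horovod script the three arguments would come from hvd.rank(),
# hvd.local_rank() and hvd.size() after calling hvd.init().
print(worker_settings(rank=1, local_rank=1, size=4))
```

In an actual Horovod job, each process launched by MPI would compute these values for itself, so the same script can run unchanged on every node.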
You can find the Jupyter Notebook for this recipe in Horovod-Tensorflow.ipynb.
You can find Azure CLI 2.0 instructions for this recipe in cli-instructions.md.
Under construction...
If you have any problems or questions, you can reach the Batch AI team at [email protected] or you can create an issue on GitHub.
We also welcome your contributions of additional sample notebooks, scripts, or other examples of working with Batch AI.