Add support for Horovod as a distributed backend #1518
Labels: feature (is an improvement or enhancement), help wanted (open to be worked on), priority: 0 (high priority task)
🚀 Feature
Horovod is a framework for performing data-parallel distributed training for PyTorch (in addition to other frameworks like TensorFlow and MXNet). It uses the allreduce technique to synchronously aggregate gradients across workers, similar to PyTorch's DDP API.
The goal of this feature is to implement support for Horovod as another `distributed_backend` option for PyTorch Lightning, providing an abstraction layer over the Horovod API so users don't need to modify their training code when scaling from one GPU to many.

Motivation
At Uber, many of our researchers are interested in adopting PyTorch Lightning as a standard platform-level API. Because our infrastructure is highly integrated with Horovod, one of the prerequisites for adoption is to be able to run PyTorch Lightning using Horovod for distributed training.
We considered making this an internal layer built on top of PyTorch Lightning, but because Horovod is a popular API used by other companies in industry, we thought this would make the most sense as a contribution to PyTorch Lightning.
Pitch
With this change, all a user would need to do to add Horovod support is make the following change to their Trainer to run on GPU (single or multiple):
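A sketch of what that might look like, based on this proposal (the exact argument names are part of the proposal and could change before merging; `model` stands in for the user's existing, unmodified LightningModule):

```python
from pytorch_lightning import Trainer

# Select Horovod as the distributed backend. With Horovod, each launched
# process drives a single GPU, so `gpus=1` here; the total number of
# processes/GPUs is controlled by horovodrun at launch time, not in code.
trainer = Trainer(distributed_backend='horovod', gpus=1)
trainer.fit(model)  # `model` is the user's existing LightningModule, unchanged
```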
Or to run on CPU:
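Again a sketch of the proposed API: the same backend selection, with no GPUs requested, and Horovod still coordinating the worker processes:

```python
from pytorch_lightning import Trainer

# CPU-only: same backend flag, no `gpus` argument; Horovod handles the
# allreduce across however many CPU processes horovodrun launches.
trainer = Trainer(distributed_backend='horovod')
trainer.fit(model)
```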
Then the training script can be launched via the `horovodrun` command-line tool, where the host/GPU allocation is specified (e.g. `horovodrun -np 4 -H host1:2,host2:2 python train.py`).

Alternatives
Build Horovod support outside of PyTorch Lightning. This has been done by some users in the past, but it requires building a separate abstraction over Lightning. It'll be difficult to keep such solutions in sync as Lightning continues to add new features, or to make them fully compatible with user LightningModules (if we need to use the same methods/hooks to implement the required functionality).
Launch Horovod in-process as opposed to from a driver application. Horovod supports launching programmatically via the `horovod.run` API. However, this requires pickling code, which is prone to serialization errors for some models. Most Horovod users are accustomed to using `horovodrun` / `mpirun` to launch their jobs. Also, using `horovodrun` allows us to decouple the training code from the resource requirements (num_gpus, etc.), which is useful for our users.

Additional context
A proof of concept has been implemented here: master...tgaddair:horovod
Docs and unit tests are forthcoming. But before creating a full PR, I wanted to get the thoughts of the PyTorch Lightning devs to see if this implementation aligns with your goals for the project.
cc @alsrgv